Harnesses

Snapshot: 2026-05-13 15:38:48 UTC · 11 fiz harness profiles shown · external leaderboard for the same models below

Comparisons on this page hold the model constant (Sonnet 4.6, GPT-5.4-mini, etc.) and vary the agent loop. Native CLI profiles (claude-native-*, codex-native-*) run their own harness directly. fiz-harness-* profiles use fiz as a measurement wrapper around the same CLI. fiz-openrouter-* / fiz-openai-* profiles call the model's API directly through fiz's built-in loop. With the model fixed, a pass-rate delta between these profiles is read as harness loss rather than model loss.
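
As a rough illustration of how such a delta can be read, the sketch below takes two cells from the pass-rate table on this page and reports their difference in percentage points. The `Cell` type and `harness_delta` function are hypothetical names introduced here, not part of fiz, and matching task counts are used only as a weak stand-in for "same task set".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One pass-rate cell: tasks passed out of tasks attempted."""
    profile: str
    passed: int
    total: int

    @property
    def rate(self) -> float:
        if self.total == 0:
            raise ValueError(f"{self.profile}: no tasks attempted")
        return self.passed / self.total


def harness_delta(a: Cell, b: Cell) -> float:
    """Pass-rate difference (a - b) in percentage points.

    Assumption: both cells use the same model and the same tasks.
    Only the task count is checked here, which is a weak proxy.
    """
    if a.total != b.total:
        raise ValueError("cells cover different numbers of tasks")
    return 100.0 * (a.rate - b.rate)


# openai-cheap cells copied from the pass-rate table
wrapped_codex = Cell("fiz-harness-codex-gpt-5-4-mini", 3, 11)
builtin_loop = Cell("fiz-openrouter-gpt-5-4-mini", 2, 11)

delta = harness_delta(wrapped_codex, builtin_loop)
print(f"{delta:+.1f} points")
```

A delta computed this way is only as trustworthy as the cells behind it; a cell with no real reps or a handful of tasks gives a number but not much evidence.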

Pass-rate (with external comparators)

Profile / Submission	canary (3 tasks)	openai-cheap (35 tasks)	full (15 tasks)	all (89 tasks)	Provider
claude-native-sonnet-4-6	0.0% (0/1)	0.0% (0/3)	0.0% (0/2)	0.0% (0/3)	anthropic
claude-sonnet-4-6	33.3% (1/3)	8.6% (3/35)	13.3% (2/15)	3.4% (3/89)	openrouter
codex-native-gpt-5-4-mini	100.0% (1/1)	100.0% (3/3)	100.0% (2/2)	100.0% (3/3)	openai
fiz-harness-claude-sonnet-4-6	0.0% (0/3)	0.0% (0/11)	0.0% (0/15)	0.0% (0/15)	openrouter
fiz-harness-codex-gpt-5-4-mini	100.0% (3/3)	27.3% (3/11)	20.0% (3/15)	20.0% (3/15)	openrouter
fiz-harness-opencode-gpt-5-4-mini	0.0% (0/3)	0.0% (0/3)	0.0% (0/3)	0.0% (0/3)	openrouter
fiz-harness-pi-gpt-5-4-mini	0.0% (0/3)	0.0% (0/3)	0.0% (0/3)	0.0% (0/3)	openrouter
fiz-openrouter-claude-sonnet-4-6	100.0% (3/3)	27.3% (3/11)	20.0% (3/15)	20.0% (3/15)	openrouter
fiz-openrouter-gpt-5-4-mini	66.7% (2/3)	18.2% (2/11)	13.3% (2/15)	13.3% (2/15)	openrouter
gpt-5-4-mini-openrouter	100.0% (1/1)	100.0% (3/3)	100.0% (2/2)	100.0% (3/3)	openrouter
gpt-5-mini	0.0% (0/1)	0.0% (0/3)	0.0% (0/2)	0.0% (0/3)	openai-compat
External leaderboard (HF)
Crux__Claude-Opus-4.6	100.0% (3/3)	100.0% (35/35)	100.0% (15/15)	96.6% (86/89)	external
Forge__GPT-5.4	100.0% (3/3)	94.3% (33/35)	93.3% (14/15)	89.9% (80/89)	external
Junie_CLI__Gemini-3-Flash-Preview-Gemini-3.1-Pro-Preview-Claude-Opus-4.6-GPT-5.3-Codex	100.0% (3/3)	97.1% (34/35)	100.0% (15/15)	88.8% (79/89)	external
Mux__GPT-5.3-Codex	66.7% (2/3)	94.3% (33/35)	93.3% (14/15)	88.8% (79/89)	external
OpenSage__GPT-5.3-Codex	100.0% (3/3)	100.0% (35/35)	100.0% (15/15)	88.8% (79/89)	external
Droid__GPT-5.3-Codex	100.0% (3/3)	91.4% (32/35)	100.0% (15/15)	87.6% (78/89)	external
Judy__Claude-Opus-4.6	100.0% (3/3)	94.3% (33/35)	100.0% (15/15)	87.6% (78/89)	external
OB-1_GPT-5.4-GPT-5.3-Codex-Claude-Opus-4.5-Claude-Opus-4.6	100.0% (3/3)	94.3% (33/35)	100.0% (15/15)	86.5% (77/89)	external
Terminus-KIRA__Claude-Opus-4.6	100.0% (3/3)	97.1% (34/35)	93.3% (14/15)	86.5% (77/89)	external
Capy__Claude-Opus-4.6	100.0% (3/3)	91.4% (32/35)	100.0% (15/15)	85.4% (76/89)	external
Simple-Codex__GPT-5.3-Codex	66.7% (2/3)	94.3% (33/35)	86.7% (13/15)	85.4% (76/89)	external
Deep-Agents__GPT-5.2-Codex	100.0% (3/3)	88.6% (31/35)	100.0% (15/15)	84.3% (75/89)	external
CodeBrain-1__GPT-5.3-Codex	66.7% (2/3)	88.6% (31/35)	93.3% (14/15)	84.3% (75/89)	external
Droid__Claude-Opus-4.6	100.0% (3/3)	91.4% (32/35)	93.3% (14/15)	83.1% (74/89)	external
OB-1_GPT-5.3-Codex-Claude-Opus-4.5-Claude-Opus-4.6	66.7% (2/3)	91.4% (32/35)	86.7% (13/15)	82.0% (73/89)	external
Terminus2__GPT-5.3-Codex	100.0% (3/3)	94.3% (33/35)	93.3% (14/15)	80.9% (72/89)	external
Mux__Claude-Opus-4.6	66.7% (2/3)	82.9% (29/35)	80.0% (12/15)	78.7% (70/89)	external
Terminus2__Claude-Opus-4.6	66.7% (2/3)	88.6% (31/35)	86.7% (13/15)	77.5% (69/89)	external
IndusAGICodingAgent__gpt-5.3-codex	33.3% (1/3)	82.9% (29/35)	86.7% (13/15)	77.5% (69/89)	external
Mux__GPT-5.2	66.7% (2/3)	68.6% (24/35)	73.3% (11/15)	62.1% (54/87)	external

Detailed metrics

Profile	Harness	Attempts	Real	pass@1	pass@k	med turns	med in	med out	med wall (s)	cost ($)	p50 TTFT (s)	p50 decode (tok/s)
claude-native-sonnet-4-6	Claude Code (native CLI)	15	0	0.0%	0.0%	—	—	—	—	0.000	—	—
claude-sonnet-4-6	fiz (built-in agent loop)	102	0	14.0%	3.4%	—	—	—	—	0.000	1.95	824.5
codex-native-gpt-5-4-mini	Codex (native CLI)	15	0	91.7%	100.0%	—	—	—	—	0.000	—	—
fiz-harness-claude-sonnet-4-6	Claude Code (wrapped by fiz)	94	0	0.0%	0.0%	—	—	—	—	0.000	—	—
fiz-harness-codex-gpt-5-4-mini	Codex (wrapped by fiz)	94	0	15.3%	20.0%	—	—	—	—	0.000	—	—
fiz-harness-opencode-gpt-5-4-mini	OpenCode (wrapped by fiz)	23	0	0.0%	0.0%	—	—	—	—	0.000	—	—
fiz-harness-pi-gpt-5-4-mini	Pi (wrapped by fiz)	22	0	0.0%	0.0%	—	—	—	—	0.000	—	—
fiz-openrouter-claude-sonnet-4-6	fiz (built-in agent loop)	91	15	22.2%	20.0%	11	166,505	2,182	135	0.574	1.89	1474.7
fiz-openrouter-gpt-5-4-mini	fiz (built-in agent loop)	91	14	6.9%	13.3%	8	32,542	886	108	0.053	0.78	177.7
gpt-5-4-mini-openrouter	fiz (built-in agent loop)	15	0	46.7%	100.0%	—	—	—	—	0.000	1.04	194.4
gpt-5-mini	fiz (built-in agent loop)	17	0	—	0.0%	—	—	—	—	0.000	—	—
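
The page does not spell out how pass@1 and pass@k are computed, so the following is a minimal sketch under assumed definitions: pass@1 as the mean per-task fraction of passing reps, and pass@k as the fraction of tasks solved by at least one rep ("any-rep"). The rep log is made up for illustration. Under these definitions pass@1 can never exceed pass@k, yet some rows above show it doing so, so the table's two columns may use different denominators and this sketch is not expected to reproduce them.

```python
from collections import defaultdict
from typing import Iterable


def pass_metrics(reps: Iterable[tuple[str, bool]]) -> tuple[float, float]:
    """Return (pass@1, pass@k) from (task_id, passed) rep records.

    Assumed definitions (not confirmed by the page):
      pass@1 = mean over tasks of the fraction of that task's reps that passed
      pass@k = fraction of tasks with at least one passing rep
    """
    by_task: dict[str, list[bool]] = defaultdict(list)
    for task_id, passed in reps:
        by_task[task_id].append(passed)
    if not by_task:
        raise ValueError("no reps recorded")

    n_tasks = len(by_task)
    pass_at_1 = sum(sum(r) / len(r) for r in by_task.values()) / n_tasks
    pass_at_k = sum(any(r) for r in by_task.values()) / n_tasks
    return pass_at_1, pass_at_k


# Hypothetical rep log: three tasks, two reps each.
reps = [
    ("task-a", True), ("task-a", False),
    ("task-b", False), ("task-b", False),
    ("task-c", True), ("task-c", True),
]
p1, pk = pass_metrics(reps)
print(f"pass@1={p1:.1%} pass@k={pk:.1%}")
```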

Side-by-side coverage and gaps

Side-by-side coverage today

The harness page holds the model constant and reads the difference between rows as harness loss. Two model families currently have enough profiles wired up to read that delta:

Sonnet 4.6 (three paths, no clean comparison yet).

- claude-native-sonnet-4-6: Claude Code's own CLI, no fiz involvement. 15 attempts, 0 real reps in the latest sweep (all invalid_setup).
- fiz-harness-claude-sonnet-4-6: fiz wraps the Claude Code CLI. 202 attempts, 0 real reps (same blocker).
- fiz-openrouter-claude-sonnet-4-6: fiz's built-in agent loop talking to Sonnet through OpenRouter. 199 attempts, 15 real reps, 22.2% pass@1 on the partial openai-cheap cell (3 of 11 unique tasks solved any-rep).

The only row currently producing graded data is the OpenRouter built-in path. Until the two CLI-wrapping profiles get past invalid_setup, we cannot put a number on Claude Code's harness loss versus fiz's loop on Sonnet — the comparison we most want to make and the one most blocked.

GPT-5.4-mini (three paths, partial side-by-side).

- codex-native-gpt-5-4-mini: Codex CLI native, only 1 of 3 canary tasks attempted but 100% pass@k on what it touched (very small sample).
- fiz-harness-codex-gpt-5-4-mini: fiz wraps Codex. 202 attempts, 0 real reps. Reports pass@1 15.3% from the binary success path even though no token-level data flowed.
- fiz-openrouter-gpt-5-4-mini: fiz's built-in loop direct to OpenRouter. 199 attempts, 14 real reps, 6.7% pass@k on the partial cell.
- gpt-5-4-mini-openrouter: older fiz built-in profile, 100% pass@k on a 3-task canary only.

The reading here is more legible than for Sonnet but still preliminary: native Codex looks stronger on the canary than fiz's built-in loop does on a wider task set with the same model, but the canary is 3 tasks (of which native Codex attempted 1) and the fiz cell is 11. The comparison becomes diagnostic only once the wrapped Codex profile starts producing real reps.
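
To make "preliminary" a little more concrete, one option is to put an interval around each pass rate. The sketch below uses a 95% Wilson score interval; both the choice of interval and the confidence level are assumptions made here, not a method the page uses. The inputs are the 1/1 canary cell for native Codex and the 2/11 openai-cheap cell for fiz's built-in loop, as listed in the pass-rate table.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


native = wilson_interval(1, 11 // 11)   # codex-native-gpt-5-4-mini, canary: 1/1
builtin = wilson_interval(2, 11)        # fiz-openrouter-gpt-5-4-mini, openai-cheap: 2/11

overlap = native[0] <= builtin[1] and builtin[0] <= native[1]
print(f"native  : {native[0]:.3f} to {native[1]:.3f}")
print(f"built-in: {builtin[0]:.3f} to {builtin[1]:.3f}")
print(f"intervals overlap: {overlap}")
```

With samples this small the two intervals overlap, which is consistent with treating the gap as a hint rather than a measurement. The two cells also cover different task sets, so even non-overlapping intervals would not isolate harness loss on their own.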

What to compare next

These are the comparisons we cannot make today because one side of the side-by-side is missing. Adding the listed fiz profile would close the gap.

Model	Have	Missing fiz profile	Why it would matter
Claude Sonnet 4.6	`claude-native-sonnet-4-6` (when un-blocked) and `fiz-openrouter-claude-sonnet-4-6`	A working `fiz-harness-claude-sonnet-4-6` cell with real reps	Direct read of Claude Code's harness loss versus a vendor-direct fiz loop on the same model and provider key.
GPT-5.4-mini	`codex-native-gpt-5-4-mini` (canary only) and `fiz-openrouter-gpt-5-4-mini`	`codex-native-gpt-5-4-mini` extended to the full `openai-cheap` 35-task cell	Native Codex looks strong on the canary. Without the wider cell we cannot tell whether the canary picked easy tasks or whether native Codex outperforms the OpenRouter loop. Cheap to do (≈ $9).
GPT-5.5	`fiz-openai-gpt-5-5` (89 tasks, 24 % pass@k)	`fiz-openrouter-gpt-5-5` cell on the same `openai-cheap` subset	Lets us check whether OpenAI-native vs OpenRouter routing is responsible for any pass-rate delta on the same model, separate from harness.
Qwen3.6-27B (frontier reasoning profile)	All three Qwen provider rows on the providers page	A `fiz-harness-codex-qwen-3-6-27b` profile (Codex CLI configured against an OpenAI-compatible Qwen provider)	The only Qwen rows today use fiz's built-in loop. A second harness on the same Qwen weights would let the providers-page numbers be re-read as model-loss vs harness-loss.
Opus 4.6	External leaderboard (Crux, Judy, Capy, Droid, Mux, Terminus2 all reporting > 77% on `all`)	Any fiz profile on Opus 4.6 (built-in loop or harness wrapper)	We have zero internal coverage of the model that tops the leaderboard. Without it the gap between fiz and the best external row mixes harness loss and model loss into a single unreadable number.

Across these five rows the cheapest two (mini extended to 35 tasks, and getting the wrapped-Codex profile producing real data) are the highest-leverage additions: both run well under $20 in API spend and unblock the only model where we have profiles on three different harnesses.