Harnesses

Snapshot: 2026-05-13 15:38:48 UTC · 11 fiz harness profiles shown · external leaderboard for the same models below

Each row holds the model constant (Sonnet 4.6, GPT-5.4-mini, etc.) and varies the agent loop. Native CLI profiles (claude-native-*, codex-native-*) run their own harness directly. fiz-harness-* profiles use fiz as a measurement wrapper around the same CLI. fiz-openrouter-* / fiz-openai-* profiles call the model's API directly through fiz's built-in loop. A delta between these is harness loss, isolated from model loss.
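To make the delta concrete, here is a minimal sketch of the subtraction, assuming harness loss is read as the percentage-point gap between two profiles that share a model and a task count. The names `pass_rate` and `harness_delta` and the dictionary layout are illustrative and not part of fiz; the counts are the canary cells reported on this page for the two GPT-5.4-mini profiles routed through OpenRouter.

```python
from fractions import Fraction

# Canary cells copied from this page: (tasks passed, tasks graded).
CANARY = {
    "fiz-harness-codex-gpt-5-4-mini": (3, 3),
    "fiz-openrouter-gpt-5-4-mini": (2, 3),
}


def pass_rate(passed: int, total: int) -> Fraction:
    """Exact pass rate for one cell; rejects cells with nothing graded."""
    if total <= 0:
        raise ValueError("cell has no graded tasks")
    return Fraction(passed, total)


def harness_delta(cells: dict, profile_a: str, profile_b: str) -> float:
    """Return rate(a) - rate(b) in percentage points.

    Refuses to compare cells with different denominators, because a gap
    between a 3-task cell and an 11-task cell mixes task selection with
    harness behaviour.
    """
    passed_a, total_a = cells[profile_a]
    passed_b, total_b = cells[profile_b]
    if total_a != total_b:
        raise ValueError("profiles cover different task counts")
    return float((pass_rate(passed_a, total_a) - pass_rate(passed_b, total_b)) * 100)


delta = harness_delta(
    CANARY, "fiz-harness-codex-gpt-5-4-mini", "fiz-openrouter-gpt-5-4-mini"
)
print(f"{delta:.1f} points")
```

The denominator check is a deliberate design choice in this sketch: it turns a silently misleading subtraction into an explicit error whenever the two profiles did not grade the same number of tasks.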

Pass-rate (with external comparators)

| Profile / Submission | canary (3 tasks) | openai-cheap (35 tasks) | full (15 tasks) | all (89 tasks) | Provider |
| --- | --- | --- | --- | --- | --- |
| claude-native-sonnet-4-6 | 0.0% (0/1) | 0.0% (0/3) | 0.0% (0/2) | 0.0% (0/3) | anthropic |
| claude-sonnet-4-6 | 33.3% (1/3) | 8.6% (3/35) | 13.3% (2/15) | 3.4% (3/89) | openrouter |
| codex-native-gpt-5-4-mini | 100.0% (1/1) | 100.0% (3/3) | 100.0% (2/2) | 100.0% (3/3) | openai |
| fiz-harness-claude-sonnet-4-6 | 0.0% (0/3) | 0.0% (0/11) | 0.0% (0/15) | 0.0% (0/15) | openrouter |
| fiz-harness-codex-gpt-5-4-mini | 100.0% (3/3) | 27.3% (3/11) | 20.0% (3/15) | 20.0% (3/15) | openrouter |
| fiz-harness-opencode-gpt-5-4-mini | 0.0% (0/3) | 0.0% (0/3) | 0.0% (0/3) | 0.0% (0/3) | openrouter |
| fiz-harness-pi-gpt-5-4-mini | 0.0% (0/3) | 0.0% (0/3) | 0.0% (0/3) | 0.0% (0/3) | openrouter |
| fiz-openrouter-claude-sonnet-4-6 | 100.0% (3/3) | 27.3% (3/11) | 20.0% (3/15) | 20.0% (3/15) | openrouter |
| fiz-openrouter-gpt-5-4-mini | 66.7% (2/3) | 18.2% (2/11) | 13.3% (2/15) | 13.3% (2/15) | openrouter |
| gpt-5-4-mini-openrouter | 100.0% (1/1) | 100.0% (3/3) | 100.0% (2/2) | 100.0% (3/3) | openrouter |
| gpt-5-mini | 0.0% (0/1) | 0.0% (0/3) | 0.0% (0/2) | 0.0% (0/3) | openai-compat |
| External leaderboard (HF) | | | | | |
| Crux__Claude-Opus-4.6 | 100.0% (3/3) | 100.0% (35/35) | 100.0% (15/15) | 96.6% (86/89) | external |
| Forge__GPT-5.4 | 100.0% (3/3) | 94.3% (33/35) | 93.3% (14/15) | 89.9% (80/89) | external |
| Junie_CLI__Gemini-3-Flash-Preview-Gemini-3.1-Pro-Preview-Claude-Opus-4.6-GPT-5.3-Codex | 100.0% (3/3) | 97.1% (34/35) | 100.0% (15/15) | 88.8% (79/89) | external |
| Mux__GPT-5.3-Codex | 66.7% (2/3) | 94.3% (33/35) | 93.3% (14/15) | 88.8% (79/89) | external |
| OpenSage__GPT-5.3-Codex | 100.0% (3/3) | 100.0% (35/35) | 100.0% (15/15) | 88.8% (79/89) | external |
| Droid__GPT-5.3-Codex | 100.0% (3/3) | 91.4% (32/35) | 100.0% (15/15) | 87.6% (78/89) | external |
| Judy__Claude-Opus-4.6 | 100.0% (3/3) | 94.3% (33/35) | 100.0% (15/15) | 87.6% (78/89) | external |
| OB-1_GPT-5.4-GPT-5.3-Codex-Claude-Opus-4.5-Claude-Opus-4.6 | 100.0% (3/3) | 94.3% (33/35) | 100.0% (15/15) | 86.5% (77/89) | external |
| Terminus-KIRA__Claude-Opus-4.6 | 100.0% (3/3) | 97.1% (34/35) | 93.3% (14/15) | 86.5% (77/89) | external |
| Capy__Claude-Opus-4.6 | 100.0% (3/3) | 91.4% (32/35) | 100.0% (15/15) | 85.4% (76/89) | external |
| Simple-Codex__GPT-5.3-Codex | 66.7% (2/3) | 94.3% (33/35) | 86.7% (13/15) | 85.4% (76/89) | external |
| Deep-Agents__GPT-5.2-Codex | 100.0% (3/3) | 88.6% (31/35) | 100.0% (15/15) | 84.3% (75/89) | external |
| CodeBrain-1__GPT-5.3-Codex | 66.7% (2/3) | 88.6% (31/35) | 93.3% (14/15) | 84.3% (75/89) | external |
| Droid__Claude-Opus-4.6 | 100.0% (3/3) | 91.4% (32/35) | 93.3% (14/15) | 83.1% (74/89) | external |
| OB-1_GPT-5.3-Codex-Claude-Opus-4.5-Claude-Opus-4.6 | 66.7% (2/3) | 91.4% (32/35) | 86.7% (13/15) | 82.0% (73/89) | external |
| Terminus2__GPT-5.3-Codex | 100.0% (3/3) | 94.3% (33/35) | 93.3% (14/15) | 80.9% (72/89) | external |
| Mux__Claude-Opus-4.6 | 66.7% (2/3) | 82.9% (29/35) | 80.0% (12/15) | 78.7% (70/89) | external |
| Terminus2__Claude-Opus-4.6 | 66.7% (2/3) | 88.6% (31/35) | 86.7% (13/15) | 77.5% (69/89) | external |
| IndusAGICodingAgent__gpt-5.3-codex | 33.3% (1/3) | 82.9% (29/35) | 86.7% (13/15) | 77.5% (69/89) | external |
| Mux__GPT-5.2 | 66.7% (2/3) | 68.6% (24/35) | 73.3% (11/15) | 62.1% (54/87) | external |
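Many internal cells carry a smaller denominator than the column header advertises (for example 3/11 in the 35-task openai-cheap column), so each cell is best read as a percentage plus its own fraction. The sketch below shows one way to parse a cell, check that the printed percentage agrees with the fraction, and measure how much of the subset was actually graded. The helper names are hypothetical and the cell format is inferred from this page, not from a published fiz schema.

```python
import re

# Cell format inferred from this page, e.g. "27.3% (3/11)".
CELL = re.compile(r"^(\d+(?:\.\d+)?)% \((\d+)/(\d+)\)$")


def parse_cell(cell: str) -> tuple[float, int, int]:
    """Split a cell into (printed percent, tasks passed, tasks graded)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


def is_consistent(cell: str, tol: float = 0.05) -> bool:
    """True when the printed percent matches passed/graded to one decimal."""
    pct, passed, total = parse_cell(cell)
    if total == 0:
        return False
    return abs(100.0 * passed / total - pct) <= tol + 1e-9


def coverage(cell: str, subset_size: int) -> float:
    """Fraction of the advertised subset that the cell actually graded."""
    _, _, total = parse_cell(cell)
    return total / subset_size


cell = "27.3% (3/11)"
print(parse_cell(cell), is_consistent(cell), round(coverage(cell, 35), 3))
```

A coverage value well below 1.0 flags a partial cell, which is the case where comparing two profiles on the same column can still mean comparing different task sets.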

Detailed metrics

ProfileHarnessAttemptsRealpass@1pass@kmed turnsmed inmed outmed wall (s)cost ($)p50 TTFT (s)p50 decode (tok/s)
claude-native-sonnet-4-6Claude Code (native CLI)1500.0%0.0%0.000
claude-sonnet-4-6fiz (built-in agent loop)102014.0%3.4%0.0001.95824.5
codex-native-gpt-5-4-miniCodex (native CLI)15091.7%100.0%0.000
fiz-harness-claude-sonnet-4-6Claude Code (wrapped by fiz)9400.0%0.0%0.000
fiz-harness-codex-gpt-5-4-miniCodex (wrapped by fiz)94015.3%20.0%0.000
fiz-harness-opencode-gpt-5-4-miniOpenCode (wrapped by fiz)2300.0%0.0%0.000
fiz-harness-pi-gpt-5-4-miniPi (wrapped by fiz)2200.0%0.0%0.000
fiz-openrouter-claude-sonnet-4-6fiz (built-in agent loop)911522.2%20.0%11166,5052,1821350.5741.891474.7
fiz-openrouter-gpt-5-4-minifiz (built-in agent loop)91146.9%13.3%832,5428861080.0530.78177.7
gpt-5-4-mini-openrouterfiz (built-in agent loop)15046.7%100.0%0.0001.04194.4
gpt-5-minifiz (built-in agent loop)1700.0%0.000
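The page does not spell out how the pass@1 and pass@k columns are computed. The sketch below shows one common reading, stated here as an assumption: pass@1 is the per-task success rate averaged over tasks, and pass@k counts a task as solved if any rep passed (the "any-rep" wording used elsewhere on this page). The function name and the sample outcomes are made up for illustration and are not fiz results.

```python
from collections import defaultdict


def pass_at_1_and_k(attempts):
    """Compute (pass@1, pass@k) from (task_id, passed) pairs.

    Assumed definitions, not confirmed by the page:
      pass@1 = mean over tasks of the fraction of reps that passed
      pass@k = fraction of tasks with at least one passing rep
    """
    by_task = defaultdict(list)
    for task_id, passed in attempts:
        by_task[task_id].append(bool(passed))
    if not by_task:
        raise ValueError("no attempts to score")
    n_tasks = len(by_task)
    pass_1 = sum(sum(reps) / len(reps) for reps in by_task.values()) / n_tasks
    pass_k = sum(any(reps) for reps in by_task.values()) / n_tasks
    return pass_1, pass_k


# Hypothetical outcomes: three tasks, two reps each.
sample = [
    ("task-a", True), ("task-a", False),
    ("task-b", False), ("task-b", False),
    ("task-c", True), ("task-c", True),
]
p1, pk = pass_at_1_and_k(sample)
print(f"pass@1={p1:.3f} pass@k={pk:.3f}")
```

Under these definitions pass@k can never fall below pass@1 when both are computed over the same tasks. Rows where pass@1 is reported above pass@k (22.2% against 20.0% for fiz-openrouter-claude-sonnet-4-6) therefore suggest the two columns use different denominators; that is an inference from the numbers, not something the page states.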

Side-by-side coverage and gaps

Side-by-side coverage today

The harness page holds the model constant and reads the difference between rows as harness loss. Two model families currently have enough profiles wired up to read that delta:

Sonnet 4.6 (three paths, no clean comparison yet).

- claude-native-sonnet-4-6: Claude Code's own CLI, no fiz involvement. 15 attempts, 0 real reps in the latest sweep (all invalid_setup).
- fiz-harness-claude-sonnet-4-6: fiz wraps the Claude Code CLI. 202 attempts, 0 real reps (same blocker).
- fiz-openrouter-claude-sonnet-4-6: fiz's built-in agent loop talking to Sonnet through OpenRouter. 199 attempts, 15 real reps, 22.2% pass@1 on the partial openai-cheap cell (3 of 11 unique tasks solved any-rep).

The only row currently producing graded data is the OpenRouter built-in path. Until the two CLI-wrapping profiles get past invalid_setup, we cannot put a number on Claude Code's harness loss versus fiz's loop on Sonnet — the comparison we most want to make and the one most blocked.

GPT-5.4-mini (three paths, partial side-by-side).

- codex-native-gpt-5-4-mini: Codex CLI native. Only 1 of 3 canary tasks attempted, but 100% pass@k on what it touched (very small sample).
- fiz-harness-codex-gpt-5-4-mini: fiz wraps Codex. 202 attempts, 0 real reps. Reports pass@1 15.3% from the binary success path even though no token-level data flowed.
- fiz-openrouter-gpt-5-4-mini: fiz's built-in loop direct to OpenRouter. 199 attempts, 14 real reps, 6.7% pass@k on the partial cell.
- gpt-5-4-mini-openrouter: older fiz built-in profile, 100% pass@k on a 3-task canary only.

The reading here is more legible than Sonnet but still preliminary: native Codex looks stronger on the canary than fiz's built-in loop on the same model on a wider task set, but the canary is 3 tasks and the fiz cell is 11 — the comparison becomes diagnostic only once the wrapped Codex profile stops invalidating.
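One way to see why these sample sizes keep the reading preliminary is to put an interval around each pass rate. The sketch below uses a Wilson score interval, which treats tasks as independent pass/fail trials. That is a simplifying assumption introduced here, not a method the page uses. The counts are the native Codex canary cell (1 of 1) and the fiz-openrouter-gpt-5-4-mini openai-cheap cell (2 of 11) from the pass-rate table.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


native_low, native_high = wilson_interval(1, 1)    # codex-native canary cell
loop_low, loop_high = wilson_interval(2, 11)       # fiz-openrouter openai-cheap cell

print(f"native Codex:  {native_low:.2f} to {native_high:.2f}")
print(f"fiz built-in:  {loop_low:.2f} to {loop_high:.2f}")
print("intervals overlap:", native_low <= loop_high and loop_low <= native_high)
```

With counts this small the two intervals overlap, so the gap between the rows is compatible with sampling noise alone, independent of the task-set mismatch.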

What to compare next

These are the comparisons we cannot make today because one side of the side-by-side is missing. Adding the listed fiz profile would close the gap.

| Model | Have | Missing fiz profile | Why it would matter |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | claude-native-sonnet-4-6 (when un-blocked) and fiz-openrouter-claude-sonnet-4-6 | A working fiz-harness-claude-sonnet-4-6 cell with real reps | Direct read of Claude Code's harness loss versus a vendor-direct fiz loop on the same model and provider key. |
| GPT-5.4-mini | codex-native-gpt-5-4-mini (canary only) and fiz-openrouter-gpt-5-4-mini | codex-native-gpt-5-4-mini extended to the full openai-cheap 35-task cell | The Codex wrapper looks strong on the canary. Without the wider cell we cannot tell whether the canary picked easy tasks or whether the wrapper outperforms the OpenRouter loop. Cheap to do (≈ $9). |
| GPT-5.5 | fiz-openai-gpt-5-5 (89 tasks, 24% pass@k) | fiz-openrouter-gpt-5-5 cell on the same openai-cheap subset | Lets us check whether OpenAI-native vs OpenRouter routing is responsible for any pass-rate delta on the same model, separate from harness. |
| Qwen3.6-27B (frontier reasoning profile) | All three Qwen provider rows on the providers page | A fiz-harness-codex-qwen-3-6-27b profile (Codex CLI configured against an OpenAI-compatible Qwen provider) | The only Qwen rows today use fiz's built-in loop. A second harness on the same Qwen weights would let the providers-page numbers be re-read as model-loss vs harness-loss. |
| Opus 4.6 | External leaderboard (Crux, Judy, Capy, Droid, Mux, Terminus2 all reporting > 77% on all) | Any fiz profile on Opus 4.6 (built-in loop or harness wrapper) | We have zero internal coverage of the model that tops the leaderboard. Without it the gap between fiz and the best external row mixes harness loss and model loss into a single unreadable number. |

Across these five rows the cheapest two (mini extended to 35 tasks, and getting the wrapped-Codex profile producing real data) are the highest-leverage additions: both run well under $20 in API spend and unblock the only model where we have profiles on three different harnesses.