Models

Snapshot: 2026-05-13 15:38:48 UTC · 13 model profiles shown

Most rows are fiz running its own built-in agent loop against a different model; the two -native rows (claude-native-sonnet-4-6, codex-native-gpt-5-4-mini) run the vendor's native CLI instead. Where possible we report on the openai-cheap subset (35 tasks) so the cost gate doesn't bias the model selection: frontier hosted models are typically too expensive to run with k=5 reps across all 89 TB-2.1 tasks.

Pass-rate

| Profile / Submission | canary (3 tasks) | openai-cheap (35 tasks) | full (15 tasks) | all (89 tasks) | Provider |
| --- | --- | --- | --- | --- | --- |
| claude-native-sonnet-4-6 | 0.0% (0/1) | 0.0% (0/3) | 0.0% (0/2) | 0.0% (0/3) | anthropic |
| claude-sonnet-4-6 | 33.3% (1/3) | 8.6% (3/35) | 13.3% (2/15) | 3.4% (3/89) | openrouter |
| codex-native-gpt-5-4-mini | 100.0% (1/1) | 100.0% (3/3) | 100.0% (2/2) | 100.0% (3/3) | openai |
| fiz-openai-gpt-5-5 | 100.0% (3/3) | 42.9% (15/35) | 100.0% (15/15) | 24.7% (22/89) | openai |
| fiz-openrouter-claude-sonnet-4-6 | 100.0% (3/3) | 27.3% (3/11) | 20.0% (3/15) | 20.0% (3/15) | openrouter |
| fiz-openrouter-gpt-5-4-mini | 66.7% (2/3) | 18.2% (2/11) | 13.3% (2/15) | 13.3% (2/15) | openrouter |
| gpt-5-3-mini | 0.0% (0/1) | 0.0% (0/2) | 0.0% (0/2) | 0.0% (0/2) | |
| gpt-5-4-mini-openrouter | 100.0% (1/1) | 100.0% (3/3) | 100.0% (2/2) | 100.0% (3/3) | openrouter |
| gpt-5-mini | 0.0% (0/1) | 0.0% (0/3) | 0.0% (0/2) | 0.0% (0/3) | openai-compat |
| sindri-vllm | 100.0% (3/3) | 69.7% (23/33) | 66.7% (10/15) | 34.1% (28/82) | |
| sindri-llamacpp | 66.7% (2/3) | 70.6% (12/17) | 75.0% (6/8) | 71.0% (22/31) | |
| local-ds4-deepseek-v4-flash | 50.0% (1/2) | 66.7% (10/15) | 62.5% (5/8) | 44.4% (16/36) | ds4 |
| local-omlx-qwen3-6-27b-openai-compat | 33.3% (1/3) | 8.6% (3/35) | 13.3% (2/15) | 3.4% (3/89) | |

Detailed metrics

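| Profile | Harness | Attempts | Real | pass@1 | pass@k | med turns | med in | med out | med wall (s) | cost ($) | p50 TTFT (s) | p50 decode (tok/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude-native-sonnet-4-6 | Claude Code (native CLI) | 15 | 0 | 0.0% | 0.0% | | | | | 0.000 | | |
| claude-sonnet-4-6 | fiz (built-in agent loop) | 102 | 0 | 14.0% | 3.4% | | | | | 0.000 | 1.95 | 824.5 |
| codex-native-gpt-5-4-mini | Codex (native CLI) | 15 | 0 | 91.7% | 100.0% | | | | | 0.000 | | |
| fiz-openai-gpt-5-5 | fiz (built-in agent loop) | 521 | 98 | 24.9% | 24.7% | 12 | 42,581 | 2,398 | 179 | 0.840 | 0.07 | 292.5 |
| fiz-openrouter-claude-sonnet-4-6 | fiz (built-in agent loop) | 91 | 15 | 22.2% | 20.0% | 11 | 166,505 | 2,182 | 135 | 0.574 | 1.89 | 1474.7 |
| fiz-openrouter-gpt-5-4-mini | fiz (built-in agent loop) | 91 | 14 | 6.9% | 13.3% | 8 | 32,542 | 886 | 108 | 0.053 | 0.78 | 177.7 |
| gpt-5-3-mini | fiz (built-in agent loop) | 14 | 0 | | 0.0% | | | | | 0.000 | | |
| gpt-5-4-mini-openrouter | fiz (built-in agent loop) | 15 | 0 | 46.7% | 100.0% | | | | | 0.000 | 1.04 | 194.4 |
| gpt-5-mini | fiz (built-in agent loop) | 17 | 0 | | 0.0% | | | | | 0.000 | | |
| sindri-vllm | fiz (built-in agent loop) | 197 | 104 | 28.7% | 34.1% | 11 | 62,940 | 3,623 | 845 | 0.000 | 18.98 | 66.3 |
| sindri-llamacpp | fiz (built-in agent loop) | 53 | 39 | 49.1% | 71.0% | 23 | 222,806 | 3,319 | 444 | 0.000 | 2.40 | 20.3 |
| local-ds4-deepseek-v4-flash | fiz (built-in agent loop) | 52 | 41 | 43.5% | 44.4% | 31 | 537,487 | 10,530 | 2489 | 0.000 | 30.40 | 23.3 |
| local-omlx-qwen3-6-27b-openai-compat | fiz (built-in agent loop) | 276 | 0 | 5.4% | 3.4% | | | | | 0.000 | 9.44 | 18.1 |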
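The page does not spell out how pass@1 and pass@k are derived. The sketch below shows one plausible reading, under the assumption that pass@1 is the per-task success rate averaged over tasks and pass@k is the share of tasks solved by at least one rep (which is consistent with pass@k matching the all column, e.g. 22/89 = 24.7%). The function name and the trial format are hypothetical, not taken from the fiz codebase.

```python
from collections import defaultdict

def pass_rates(trials):
    """Compute (pass@1, pass@k) from (task_id, passed) pairs.

    ASSUMPTION: each pair is one real (non-invalid_setup) rep.
    """
    by_task = defaultdict(list)
    for task_id, passed in trials:
        by_task[task_id].append(bool(passed))
    if not by_task:
        return None, None  # no real reps: both rates are undefined
    # pass@1: per-task success rate, averaged over tasks
    pass_at_1 = sum(sum(r) / len(r) for r in by_task.values()) / len(by_task)
    # pass@k: share of tasks where at least one rep passed
    pass_at_k = sum(any(r) for r in by_task.values()) / len(by_task)
    return pass_at_1, pass_at_k

# Hypothetical trials: three tasks, two reps each
trials = [("a", True), ("a", False), ("b", False), ("b", False), ("c", True), ("c", True)]
p1, pk = pass_rates(trials)
print(f"pass@1={p1:.1%} pass@k={pk:.1%}")
```

Under this reading pass@k can never fall below pass@1, so rows where it does (for example 24.9% vs 24.7%) suggest the two columns are computed over different task sets.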

Cost to extend coverage

Estimates below use observed per-run costs from profiles that already produced real reps. Prices come from the benchmark profile registry. Where a profile has no real reps yet, the estimate is a back-of-envelope product of list pricing and the median token counts of a comparable profile on the same model.

What it would cost to close the gaps

Three of the model rows in the table above carry zero real reps because every trial in the latest sweep ended in invalid_setup (claude-native-sonnet-4-6, claude-sonnet-4-6 via the OpenRouter built-in path, gpt-5-mini). Two other rows (fiz-openai-gpt-5-5, fiz-openrouter-claude-sonnet-4-6) have partial coverage on the 35-task openai-cheap subset but have not run --reps 5 against the full subset. The estimates below are the pure model spend needed to bring each profile to a full --reps 5 × 35-task cell (175 runs) on the cheap subset.

| Profile | Source $/run | Subset cost (35 × 5 reps) | Notes |
| --- | --- | --- | --- |
| fiz-openai-gpt-5-5 | $0.84 (98 real reps observed) | ≈ $147 | Most expensive; openai-cheap was sized to keep this under ~$1/run, real number landed close. |
| fiz-openrouter-gpt-5-5 | not yet run | ≈ $147 est. | Same model + pricing as above; assume identical token use until measured. |
| fiz-openrouter-claude-sonnet-4-6 | $0.57 (15 real reps observed) | ≈ $100 | Median in-tokens 166k drives the cost; cached-input pricing ($0.30/Mtok) would cut this if prefix-cache hits land. |
| fiz-harness-claude-sonnet-4-6 | unknown (0 real reps) | ≈ $100 est. | Wrapper path; assume same token profile as the direct OpenRouter Sonnet profile. Currently blocked on invalid_setup, not on cost. |
| fiz-openrouter-gpt-5-4-mini | $0.053 (14 real reps observed) | ≈ $9 | Already cheap; the bottleneck is reliability, not budget. |
| fiz-harness-codex-gpt-5-4-mini | unknown (0 real reps) | ≈ $9 est. | Same model + pricing as the direct mini profile. Reliability gating, not cost. |
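To make the pricing × median-tokens estimate and the cached-input note concrete, here is a minimal sketch of how a per-run cost could be derived from token counts. The median token counts and the $0.30/Mtok cached price come from this page; the $3.00/Mtok input and $15.00/Mtok output prices are assumptions for illustration, as are the function name and the cache-hit fractions.

```python
def run_cost(tokens_in, tokens_out, price_in, price_out,
             price_cached=None, cache_hit=0.0):
    """Estimated dollars per run; all prices are dollars per million tokens.

    cache_hit is the fraction of input tokens billed at the cached rate.
    """
    if price_cached is None:
        price_cached = price_in  # no cache discount available
    cached = tokens_in * cache_hit
    fresh = tokens_in - cached
    total = fresh * price_in + cached * price_cached + tokens_out * price_out
    return total / 1_000_000

# Median tokens from the fiz-openrouter-claude-sonnet-4-6 row.
# ASSUMED list prices: $3.00/Mtok input, $15.00/Mtok output.
SONNET = dict(tokens_in=166_505, tokens_out=2_182,
              price_in=3.00, price_out=15.00, price_cached=0.30)

for hit in (0.0, 0.5, 0.9):  # hypothetical prefix-cache hit rates
    print(f"cache_hit={hit:.0%}: ${run_cost(**SONNET, cache_hit=hit):.3f}/run")
```

Because it multiplies medians rather than summing real per-run usage, the uncached figure will only approximate the observed $0.57/run; the point is how sharply the input term shrinks as the hit rate rises.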

To extend everything outside Qwen to a full --reps 5 × 35-task cell on openai-cheap, total spend is ≈ $510 in API costs (Sonnet ≈ $200 across two paths, GPT-5.5 ≈ $295 across two paths, mini ≈ $18 across two paths). The all (89-task) subset for the GPT-5.5 / Sonnet rows would be roughly 2.5× their cheap-subset cost (89/35 ≈ 2.5), or ≈ $1.3k for the frontier hosted models, which is why those rows stay on the cheap subset by default.
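The totals above can be reproduced with a few lines of arithmetic. This sketch uses the per-run costs from the table and, as the table does, assumes each unmeasured path costs the same per run as its measured sibling; the variable and function names are illustrative only.

```python
REPS, CHEAP_TASKS, ALL_TASKS = 5, 35, 89

# $/run from the table; unmeasured paths reuse the measured sibling's figure.
COST_PER_RUN = {
    "fiz-openai-gpt-5-5": 0.84,
    "fiz-openrouter-gpt-5-5": 0.84,            # not yet run (assumed)
    "fiz-openrouter-claude-sonnet-4-6": 0.57,
    "fiz-harness-claude-sonnet-4-6": 0.57,     # 0 real reps (assumed)
    "fiz-openrouter-gpt-5-4-mini": 0.053,
    "fiz-harness-codex-gpt-5-4-mini": 0.053,   # 0 real reps (assumed)
}

def cell_cost(cost_per_run, tasks=CHEAP_TASKS, reps=REPS):
    """Model spend for one full tasks × reps cell."""
    return cost_per_run * tasks * reps

# Every profile on the 35-task cheap subset
cheap_total = sum(cell_cost(c) for c in COST_PER_RUN.values())

# Only the frontier rows (GPT-5.5 and Sonnet) on all 89 tasks
frontier_all = sum(
    cell_cost(c, tasks=ALL_TASKS)
    for name, c in COST_PER_RUN.items()
    if "mini" not in name
)

print(f"openai-cheap, all profiles: ${cheap_total:.0f}")
print(f"all 89 tasks, frontier only: ${frontier_all:.0f}")
```

The small gap between these figures and the rounded ≈ $510 and ≈ $1.3k quoted above comes from rounding each row before summing.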

The cheap rows (mini, Qwen via OpenRouter) are not budget-gated; they are blocked on the same invalid_setup issues that account for 108 of 199 attempts in the table above. Fixing that infrastructure issue, not spending more, is what unlocks coverage.

Model power vs pass-rate

Each marker is a profile (or external submission) plotted at its model-power score (1 = weak, 10 = frontier per scripts/benchmark/terminalbench_model_power.json) against pass@k on the all subset. Marker size scales with median turns (larger = the agent worked harder before converging or giving up). Distance below the trend line at a given x-value is the harness loss for that profile: how much pass-rate the profile leaves on the table relative to what the underlying model delivers elsewhere.

[Figure: model-power-scatter.svg, model-power score (x-axis) vs pass@k on the all subset (y-axis)]
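As a rough illustration of the harness-loss reading, the sketch below fits a trend line and reports how far each profile sits below it. The page does not say how the trend line is fitted, so ordinary least squares is an assumption here, and the profile names and data points are hypothetical rather than taken from the chart.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def harness_loss(profiles):
    """Map profile -> percentage points below the trend line (0 if on or above)."""
    slope, intercept = fit_line(list(profiles.values()))
    return {
        name: max(0.0, slope * power + intercept - pass_k)
        for name, (power, pass_k) in profiles.items()
    }

# HYPOTHETICAL data: (model-power score, pass@k in percent)
profiles = {
    "profile-a": (2, 20.0),
    "profile-b": (4, 40.0),
    "profile-c": (6, 30.0),
    "profile-d": (8, 80.0),
}

loss = harness_loss(profiles)
for name, points_lost in loss.items():
    print(f"{name}: {points_lost:.1f} pp below trend")
```

With only a handful of profiles the fit is sensitive to any single outlier, so a loss computed this way is better treated as a ranking aid than as a precise measurement.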