Models
Each row is fiz running its own built-in agent loop against a different model. Where possible we report on the openai-cheap subset (35 tasks) so that the cost gate doesn't bias model selection: frontier hosted models are typically too expensive to run with k=5 reps across all 89 TB-2.1 tasks.
Pass-rate
| Profile / Submission | canary (3 tasks) | openai-cheap (35 tasks) | full (15 tasks) | all (89 tasks) | Provider |
|---|---|---|---|---|---|
| claude-native-sonnet-4-6 | 0.0% | 0.0% | 0.0% | 0.0% | |
| claude-sonnet-4-6 | 33.3% | 8.6% | 13.3% | 3.4% | |
| codex-native-gpt-5-4-mini | 100.0% | 100.0% | 100.0% | 100.0% | |
| fiz-openai-gpt-5-5 | 100.0% | 42.9% | 100.0% | 24.7% | |
| fiz-openrouter-claude-sonnet-4-6 | 100.0% | 27.3% | 20.0% | 20.0% | |
| fiz-openrouter-gpt-5-4-mini | 66.7% | 18.2% | 13.3% | 13.3% | |
| gpt-5-3-mini | 0.0% | 0.0% | 0.0% | 0.0% | |
| gpt-5-4-mini-openrouter | 100.0% | 100.0% | 100.0% | 100.0% | |
| gpt-5-mini | 0.0% | 0.0% | 0.0% | 0.0% | |
| sindri-vllm | 100.0% | 69.7% | 66.7% | 34.1% | |
| sindri-llamacpp | 66.7% | 70.6% | 75.0% | 71.0% | |
| local-ds4-deepseek-v4-flash | 50.0% | 66.7% | 62.5% | 44.4% | |
| local-omlx-qwen3-6-27b-openai-compat | 33.3% | 8.6% | 13.3% | 3.4% | |
Detailed metrics
| Profile | Harness | Attempts | Real | pass@1 | pass@k | med turns | med in | med out | med wall (s) | cost ($) | p50 TTFT (s) | p50 decode (tok/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-native-sonnet-4-6 | | 15 | 0 | 0.0% | 0.0% | — | — | — | — | 0.000 | — | — |
| claude-sonnet-4-6 | | 102 | 0 | 14.0% | 3.4% | — | — | — | — | 0.000 | 1.95 | 824.5 |
| codex-native-gpt-5-4-mini | | 15 | 0 | 91.7% | 100.0% | — | — | — | — | 0.000 | — | — |
| fiz-openai-gpt-5-5 | | 521 | 98 | 24.9% | 24.7% | 12 | 42,581 | 2,398 | 179 | 0.840 | 0.07 | 292.5 |
| fiz-openrouter-claude-sonnet-4-6 | | 91 | 15 | 22.2% | 20.0% | 11 | 166,505 | 2,182 | 135 | 0.574 | 1.89 | 1474.7 |
| fiz-openrouter-gpt-5-4-mini | | 91 | 14 | 6.9% | 13.3% | 8 | 32,542 | 886 | 108 | 0.053 | 0.78 | 177.7 |
| gpt-5-3-mini | | 14 | 0 | — | 0.0% | — | — | — | — | 0.000 | — | — |
| gpt-5-4-mini-openrouter | | 15 | 0 | 46.7% | 100.0% | — | — | — | — | 0.000 | 1.04 | 194.4 |
| gpt-5-mini | | 17 | 0 | — | 0.0% | — | — | — | — | 0.000 | — | — |
| sindri-vllm | | 197 | 104 | 28.7% | 34.1% | 11 | 62,940 | 3,623 | 845 | 0.000 | 18.98 | 66.3 |
| sindri-llamacpp | | 53 | 39 | 49.1% | 71.0% | 23 | 222,806 | 3,319 | 444 | 0.000 | 2.40 | 20.3 |
| local-ds4-deepseek-v4-flash | | 52 | 41 | 43.5% | 44.4% | 31 | 537,487 | 10,530 | 2489 | 0.000 | 30.40 | 23.3 |
| local-omlx-qwen3-6-27b-openai-compat | | 276 | 0 | 5.4% | 3.4% | — | — | — | — | 0.000 | 9.44 | 18.1 |
Cost to extend coverage
Estimates below use observed per-run costs from profiles that already produced real reps. Prices come from the benchmark profile registry. Where a profile has no real reps yet, the estimate is a back-of-envelope product (pricing × median tokens) taken from a comparable profile on the same model.
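A minimal sketch of that back-of-envelope product, assuming the "med in" and "med out" columns of the detailed metrics table are per-run token counts. The function name and both prices are hypothetical placeholders chosen for illustration; they are not values from the benchmark profile registry.

```python
def estimate_run_cost(median_in_tokens, median_out_tokens,
                      in_price_per_mtok, out_price_per_mtok):
    """Estimate dollars per run from median token counts and per-Mtok prices."""
    total = (median_in_tokens * in_price_per_mtok
             + median_out_tokens * out_price_per_mtok)
    return total / 1_000_000


# Median tokens come from the fiz-openrouter-claude-sonnet-4-6 row.
MEDIAN_IN = 166_505
MEDIAN_OUT = 2_182

# PLACEHOLDER prices in $/Mtok, assumed for illustration only.
HYPOTHETICAL_IN_PRICE = 3.00
HYPOTHETICAL_OUT_PRICE = 15.00

est = estimate_run_cost(MEDIAN_IN, MEDIAN_OUT,
                        HYPOTHETICAL_IN_PRICE, HYPOTHETICAL_OUT_PRICE)
print(f"estimated cost per run: ${est:.3f}")
```

With real registry prices substituted, the result can be compared against an observed per-run cost from a profile that has real reps, as a sanity check before trusting the estimate for a profile that has none.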
What it would cost to close the gaps
Three of the model rows in the table above carry zero real reps because their setups never produced a trial other than invalid_setup in the latest sweep (claude-native-sonnet-4-6, claude-sonnet-4-6 openrouter built-in, gpt-5-mini). Two other rows (fiz-openai-gpt-5-5, fiz-openrouter-claude-sonnet-4-6) have partial coverage on the 35-task openai-cheap subset but have not run --reps 5 against the full subset. The estimates below are what it would take, in pure model spend, to bring each profile to a full --reps 5 × 35-task cell on the cheap subset.
| Profile | Source $/run | Subset cost (35 × 5 reps) | Notes |
|---|---|---|---|
| fiz-openai-gpt-5-5 | $0.84 (98 real reps observed) | ≈ $147 | Most expensive; openai-cheap was sized to keep this under ~$1/run, real number landed close. |
| fiz-openrouter-gpt-5-5 | not yet run | ≈ $147 est. | Same model + pricing as above; assume identical token use until measured. |
| fiz-openrouter-claude-sonnet-4-6 | $0.57 (15 real reps observed) | ≈ $100 | Median in-tokens 166k drives the cost; cached-input pricing ($0.30/Mtok) would cut this if prefix-cache hits land. |
| fiz-harness-claude-sonnet-4-6 | unknown (0 real reps) | ≈ $100 est. | Wrapper path; assume same token profile as the direct OpenRouter Sonnet profile. Currently blocked on invalid_setup, not on cost. |
| fiz-openrouter-gpt-5-4-mini | $0.053 (14 real reps observed) | ≈ $9 | Already cheap; the bottleneck is reliability, not budget. |
| fiz-harness-codex-gpt-5-4-mini | unknown (0 real reps) | ≈ $9 est. | Same model + pricing as the direct mini profile. Reliability gating, not cost. |
To extend everything outside Qwen to a full --reps 5 × 35-task cell on openai-cheap, total spend is ≈ $510 in API costs (Sonnet ≈ $200 across two paths, GPT-5.5 ≈ $295 across two paths, mini ≈ $18 across two paths). The all (89-task) subset for the GPT-5.5 / Sonnet rows would be roughly 2.5× that, or ≈ $1.3k for the frontier hosted models, which is why those rows stay on the cheap subset by default.
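The arithmetic behind these totals can be reproduced with a short sketch. The per-run costs are the observed values from the table; the variable and function names are illustrative, and the assumption that each model's second path costs the same as its measured path follows the table's own notes.

```python
TASKS_CHEAP = 35   # openai-cheap subset
TASKS_ALL = 89     # all TB-2.1 tasks
REPS = 5           # --reps 5

# Observed $/run from profiles with real reps.
observed_cost_per_run = {
    "fiz-openai-gpt-5-5": 0.84,
    "fiz-openrouter-claude-sonnet-4-6": 0.57,
    "fiz-openrouter-gpt-5-4-mini": 0.053,
}


def subset_cost(cost_per_run, tasks=TASKS_CHEAP, reps=REPS):
    """Model spend for one profile: one run per task per rep."""
    return cost_per_run * tasks * reps


cheap = {name: subset_cost(c) for name, c in observed_cost_per_run.items()}

# Assumption: each model runs on two paths at the same per-run cost.
total_cheap = 2 * sum(cheap.values())

# Scaling the frontier rows (GPT-5.5 and Sonnet) to all 89 tasks.
frontier_cheap = 2 * (cheap["fiz-openai-gpt-5-5"]
                      + cheap["fiz-openrouter-claude-sonnet-4-6"])
frontier_all = frontier_cheap * TASKS_ALL / TASKS_CHEAP

for name, cost in cheap.items():
    print(f"{name}: ${cost:.2f}")
print(f"total on openai-cheap, two paths each: ${total_cheap:.2f}")
print(f"frontier rows on all 89 tasks: ${frontier_all:.2f}")
```

The 89/35 ratio is about 2.54, which is where the "roughly 2.5×" multiplier comes from.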
The cheap rows (mini, Qwen via OpenRouter) are not budget-gated; they are blocked on the same invalid_setup issues that produced 108 of 199 attempts in the table above. Fixing that infrastructure issue, not spending more, is what unlocks coverage.
Model power vs pass-rate
Each marker is a profile (or external submission) plotted at its model-power score (1 = weak, 10 = frontier per scripts/benchmark/terminalbench_model_power.json) against pass@k on the all subset. Marker size scales with median turns (larger = the agent worked harder before converging or giving up). Distance below the trend line at a given x-value is the harness loss for that profile: how much pass-rate the profile leaves on the floor relative to what the underlying model delivers elsewhere.
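As a rough illustration of the harness-loss reading, the sketch below fits a trend line and measures how far each profile sits below it. It assumes an ordinary least-squares line, which the text does not specify, and the three data points are made-up values for illustration, not scores from terminalbench_model_power.json or results from the tables above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def harness_loss(points):
    """points maps profile -> (model power, pass@k in percent).

    Returns profile -> percentage points below the trend line
    (0.0 for profiles on or above the line).
    """
    xs = [power for power, _ in points.values()]
    ys = [rate for _, rate in points.values()]
    slope, intercept = fit_line(xs, ys)
    return {
        name: max(0.0, slope * power + intercept - rate)
        for name, (power, rate) in points.items()
    }


# HYPOTHETICAL points, for illustration only.
points = {
    "profile-a": (2, 10.0),
    "profile-b": (5, 40.0),
    "profile-c": (8, 40.0),
}

for name, loss in harness_loss(points).items():
    print(f"{name}: {loss:.1f} points below trend")
```

Clamping at zero keeps the measure one-sided: a profile above the line is treated as having no harness loss rather than a negative one.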