Models

Snapshot: 2026-05-13 15:38:48 UTC · 13 model profiles shown

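Most rows are fiz running its own built-in agent loop against a different model; the two native rows (claude-native-sonnet-4-6 and codex-native-gpt-5-4-mini) run under the Claude Code and Codex CLIs instead, as the Harness column under Detailed metrics shows. Where possible we report on the openai-cheap subset (35 tasks) so that the cost gate does not bias model selection: frontier hosted models are typically too expensive to run with k=5 reps across all 89 TB-2.1 tasks.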

Pass-rate

Profile / Submission	canary (3 tasks)	openai-cheap (35 tasks)	full (15 tasks)	all (89 tasks)	Provider
claude-native-sonnet-4-6	0.0% (0/1)	0.0% (0/3)	0.0% (0/2)	0.0% (0/3)	anthropic
claude-sonnet-4-6	33.3% (1/3)	8.6% (3/35)	13.3% (2/15)	3.4% (3/89)	openrouter
codex-native-gpt-5-4-mini	100.0% (1/1)	100.0% (3/3)	100.0% (2/2)	100.0% (3/3)	openai
fiz-openai-gpt-5-5	100.0% (3/3)	42.9% (15/35)	100.0% (15/15)	24.7% (22/89)	openai
fiz-openrouter-claude-sonnet-4-6	100.0% (3/3)	27.3% (3/11)	20.0% (3/15)	20.0% (3/15)	openrouter
fiz-openrouter-gpt-5-4-mini	66.7% (2/3)	18.2% (2/11)	13.3% (2/15)	13.3% (2/15)	openrouter
gpt-5-3-mini	0.0% (0/1)	0.0% (0/2)	0.0% (0/2)	0.0% (0/2)
gpt-5-4-mini-openrouter	100.0% (1/1)	100.0% (3/3)	100.0% (2/2)	100.0% (3/3)	openrouter
gpt-5-mini	0.0% (0/1)	0.0% (0/3)	0.0% (0/2)	0.0% (0/3)	openai-compat
sindri-vllm	100.0% (3/3)	69.7% (23/33)	66.7% (10/15)	34.1% (28/82)
sindri-llamacpp	66.7% (2/3)	70.6% (12/17)	75.0% (6/8)	71.0% (22/31)
local-ds4-deepseek-v4-flash	50.0% (1/2)	66.7% (10/15)	62.5% (5/8)	44.4% (16/36)	ds4
local-omlx-qwen3-6-27b-openai-compat	33.3% (1/3)	8.6% (3/35)	13.3% (2/15)	3.4% (3/89)

Detailed metrics

Profile	Harness	Attempts	Real reps	pass@1	pass@k	med turns	med in (tok)	med out (tok)	med wall (s)	cost ($/run)	p50 TTFT (s)	p50 decode (tok/s)
claude-native-sonnet-4-6	Claude Code (native CLI)	15	0	0.0%	0.0%	—	—	—	—	0.000	—	—
claude-sonnet-4-6	fiz (built-in agent loop)	102	0	14.0%	3.4%	—	—	—	—	0.000	1.95	824.5
codex-native-gpt-5-4-mini	Codex (native CLI)	15	0	91.7%	100.0%	—	—	—	—	0.000	—	—
fiz-openai-gpt-5-5	fiz (built-in agent loop)	521	98	24.9%	24.7%	12	42,581	2,398	179	0.840	0.07	292.5
fiz-openrouter-claude-sonnet-4-6	fiz (built-in agent loop)	91	15	22.2%	20.0%	11	166,505	2,182	135	0.574	1.89	1474.7
fiz-openrouter-gpt-5-4-mini	fiz (built-in agent loop)	91	14	6.9%	13.3%	8	32,542	886	108	0.053	0.78	177.7
gpt-5-3-mini	fiz (built-in agent loop)	14	0	—	0.0%	—	—	—	—	0.000	—	—
gpt-5-4-mini-openrouter	fiz (built-in agent loop)	15	0	46.7%	100.0%	—	—	—	—	0.000	1.04	194.4
gpt-5-mini	fiz (built-in agent loop)	17	0	—	0.0%	—	—	—	—	0.000	—	—
sindri-vllm	fiz (built-in agent loop)	197	104	28.7%	34.1%	11	62,940	3,623	845	0.000	18.98	66.3
sindri-llamacpp	fiz (built-in agent loop)	53	39	49.1%	71.0%	23	222,806	3,319	444	0.000	2.40	20.3
local-ds4-deepseek-v4-flash	fiz (built-in agent loop)	52	41	43.5%	44.4%	31	537,487	10,530	2489	0.000	30.40	23.3
local-omlx-qwen3-6-27b-openai-compat	fiz (built-in agent loop)	276	0	5.4%	3.4%	—	—	—	—	0.000	9.44	18.1
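
The tables do not define pass@1 and pass@k. One reading consistent with the numbers is that pass@k counts a task as solved when any rep passed (the pass@k column matches the all column, e.g. 24.7% = 22/89 for fiz-openai-gpt-5-5), while pass@1 averages the per-task success rate over reps. The sketch below works under that assumption; the trial records and the `pass_rates` helper are hypothetical and are not taken from the benchmark scripts.

```python
from collections import defaultdict

def pass_rates(trials):
    """Return (pass@1, pass@k) from (task_id, passed) pairs, one per real rep.

    Assumed definitions (not confirmed by the source):
      pass@1 = mean over tasks of the fraction of reps that passed
      pass@k = fraction of tasks with at least one passing rep
    """
    by_task = defaultdict(list)
    for task_id, passed in trials:
        by_task[task_id].append(bool(passed))
    if not by_task:
        return None, None  # no real reps, shown as a dash in the table
    n_tasks = len(by_task)
    pass_at_1 = sum(sum(reps) / len(reps) for reps in by_task.values()) / n_tasks
    pass_at_k = sum(any(reps) for reps in by_task.values()) / n_tasks
    return pass_at_1, pass_at_k

# Hypothetical trials: task "a" passes once in two reps, "b" never, "c" always.
trials = [("a", True), ("a", False),
          ("b", False), ("b", False),
          ("c", True), ("c", True)]
p1, pk = pass_rates(trials)
print(f"pass@1={p1:.1%} pass@k={pk:.1%}")
```

Under these definitions pass@k can never be lower than pass@1 over the same trials, so a row where pass@1 exceeds pass@k suggests the two columns were computed over different task sets.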

Cost to extend coverage

Estimates below use observed per-run costs from profiles that already produced real reps. Prices come from the benchmark profile registry. Where a profile has no real reps yet, the estimate is a back-of-envelope product of registry pricing and median token counts from a comparable profile on the same model.

What it would cost to close the gaps

Three of the model rows in the table above carry zero real reps because their setups never produced a non-invalid_setup trial in the latest sweep (claude-native-sonnet-4-6, claude-sonnet-4-6 openrouter built-in, gpt-5-mini). Two other rows (fiz-openai-gpt-5-5, fiz-openrouter-claude-sonnet-4-6) have partial coverage on the 35-task openai-cheap subset but have not run --reps 5 against the full subset. The estimates below are the pure model spend needed to bring each profile to a full --reps 5 × 35-task cell (175 runs) on the cheap subset.

Profile	Source $/run	Subset cost (35 tasks × 5 reps)	Notes
`fiz-openai-gpt-5-5`	$0.84 (98 real reps observed)	≈ $147	Most expensive; `openai-cheap` was sized to keep this under ~$1/run, and the observed number landed close.
`fiz-openrouter-gpt-5-5`	not yet run	≈ $147 est.	Same model + pricing as above; assume identical token use until measured.
`fiz-openrouter-claude-sonnet-4-6`	$0.57 (15 real reps observed)	≈ $100	The 166k median input tokens drive the cost; cached-input pricing ($0.30/Mtok) would cut this if prefix-cache hits land.
`fiz-harness-claude-sonnet-4-6`	unknown (0 real reps)	≈ $100 est.	Wrapper path; assume same token profile as the direct OpenRouter Sonnet profile. Currently blocked on `invalid_setup`, not on cost.
`fiz-openrouter-gpt-5-4-mini`	$0.053 (14 real reps observed)	≈ $9	Already cheap; the bottleneck is reliability, not budget.
`fiz-harness-codex-gpt-5-4-mini`	unknown (0 real reps)	≈ $9 est.	Same model + pricing as the direct mini profile. Reliability gating, not cost.
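
The subset-cost column is the observed per-run cost multiplied by 175 runs (35 tasks × 5 reps). A minimal sketch of that arithmetic, using the per-run costs from the Detailed metrics table; the dictionary and the `subset_cost` helper are illustrative names, not part of the benchmark tooling.

```python
TASKS = 35  # size of the openai-cheap subset
REPS = 5    # --reps 5

# Observed cost per run in USD, copied from the Detailed metrics table.
observed_cost_per_run = {
    "fiz-openai-gpt-5-5": 0.840,
    "fiz-openrouter-claude-sonnet-4-6": 0.574,
    "fiz-openrouter-gpt-5-4-mini": 0.053,
}

def subset_cost(cost_per_run, tasks=TASKS, reps=REPS):
    """Model spend for one full tasks × reps cell at a flat per-run cost."""
    return cost_per_run * tasks * reps

estimates = {name: subset_cost(cost)
             for name, cost in observed_cost_per_run.items()}

for name, usd in estimates.items():
    print(f"{name}: about ${usd:.0f}")
```

This treats every run as costing the observed figure, so it ignores per-task variation in token use and any savings from cached input.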

To extend everything outside Qwen to a full --reps 5 × 35-task cell on openai-cheap, total spend is ≈ $510 in API costs (Sonnet ≈ $200 across two paths, GPT-5.5 ≈ $295 across two paths, mini ≈ $18 across two paths). Running the GPT-5.5 and Sonnet rows on the all (89-task) subset would cost roughly 2.5× their cheap-subset figure (89/35 ≈ 2.5), or ≈ $1.3k for the frontier hosted models, which is why those rows stay on the cheap subset by default.
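
The totals above can be reproduced from the rounded per-path estimates in the table. The sketch below assumes each model has exactly two paths at the same cost and that cost scales linearly with task count; the variable names are illustrative.

```python
# Rounded per-path subset estimates in USD, from the cost table above.
per_path_estimate = {"sonnet": 100, "gpt-5.5": 147, "mini": 9}
PATHS_PER_MODEL = 2  # each model is reached through two profiles

per_model = {model: usd * PATHS_PER_MODEL
             for model, usd in per_path_estimate.items()}
total_cheap = sum(per_model.values())

# Scale only the frontier hosted models from 35 tasks to all 89 tasks.
scale = 89 / 35
frontier_all = (per_model["sonnet"] + per_model["gpt-5.5"]) * scale

print(f"openai-cheap total: about ${total_cheap}")
print(f"frontier rows on all 89 tasks: about ${frontier_all:,.0f}")
```

The linear scaling is itself an assumption: the 54 tasks outside openai-cheap may use more tokens per run than the subset chosen for its low cost, which would push the real figure higher.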

The cheap rows (mini, Qwen via OpenRouter) are not budget-gated; they are blocked on the same invalid_setup issues that account for 108 of 199 attempts in the table above. Fixing that infrastructure issue, rather than spending more, is what unlocks coverage.

Model power vs pass-rate

Each marker is a profile (or external submission) plotted at its model-power score (1 = weak, 10 = frontier, per scripts/benchmark/terminalbench_model_power.json) against pass@k on the all subset. Marker size scales with median turns (larger = the agent worked harder before converging or giving up). Distance below the trend line at a given x-value is the harness loss for that profile: how much pass-rate the profile leaves on the table relative to what the underlying model delivers elsewhere.
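
How the trend line is fitted is not stated here. As a rough sketch, assuming an ordinary least-squares line through the (power score, pass@k) points, harness loss can be computed as the positive gap between the line and each profile. The profile names and scores below are made up for illustration and are not read from terminalbench_model_power.json.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct power scores")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def harness_loss(points_by_profile):
    """Gap below the trend line per profile; zero for profiles on or above it."""
    slope, intercept = fit_line(list(points_by_profile.values()))
    return {
        name: max(0.0, slope * x + intercept - y)
        for name, (x, y) in points_by_profile.items()
    }

# Hypothetical (power score, pass@k) points.
profiles = {
    "profile-a": (3, 0.20),
    "profile-b": (6, 0.50),
    "profile-c": (9, 0.50),
}
loss = harness_loss(profiles)
for name, gap in loss.items():
    print(f"{name}: harness loss {gap:.2f}")
```

Because the line is fitted to the same profiles it scores, a few badly underperforming rows drag the trend down and understate everyone's loss; fitting on external submissions only would be one way around that.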