Benchmarks

Why benchmark

Fizeau exists because we wanted a single agent runtime where the harness and the model are independently swappable. The benchmarks here exist for the same reason: they separate harness loss from model loss so you can answer questions of the form “is it the loop that’s hurting, or the model?” with evidence.

Each benchmark exercises the agent loop the same way — same prompts, same tools, same compaction policy, same tool-call accounting. We then permute the variables:

Same model, different provider/runtime. The Qwen3.6-27B profiles route the same model through OpenRouter (cloud), vLLM int4 on a local GPU (sindri), and oMLX 8-bit on Apple silicon (vidar). A pass-rate or wall-time delta between these profiles is provider/runtime loss — the cost of how the bytes reach the model, not what the model is.
Same model, different harness. The fiz-harness-* profiles wrap Claude Code, Codex, Pi, and OpenCode through fiz so the model and the API stay constant while the agent loop changes. A delta here is harness loss.
Different models, same task. The leaderboard rows on each report page show how frontier hosted models (Claude Opus 4.6, GPT-5.4, Gemini 3 Pro) score on the same task set. That’s the upper bound a small open-weight model is measured against.

Every per-turn timing — first-token latency, decode rate, prefill time — lands on disk in line-delimited JSON. The reports below come from those logs via scripts/benchmark/generate-report.py; rerunning the script regenerates every chart and table here.
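
As a rough illustration of how per-turn timings can be derived from such a log, here is a minimal sketch. The event names `llm.request`, `llm.delta`, and `llm.response` come from the measurement table on this page; the field names (`type`, `turn`, `ts`, `output_tokens`) and the sample lines are assumptions made for this example, not the actual log schema or the logic in generate-report.py.

```python
import json
from statistics import median

# Synthetic sample log; the schema is assumed for illustration only.
SAMPLE = [
    '{"type": "llm.request", "turn": 1, "ts": 0.0}',
    '{"type": "llm.delta", "turn": 1, "ts": 0.5}',
    '{"type": "llm.delta", "turn": 1, "ts": 0.75}',
    '{"type": "llm.response", "turn": 1, "ts": 2.5, "output_tokens": 100}',
]

def per_turn_timings(lines):
    """Return TTFT and decode rate for each complete turn in a JSONL log."""
    turns = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        t = turns.setdefault(ev["turn"], {})
        if ev["type"] == "llm.request":
            t["request"] = ev["ts"]
        elif ev["type"] == "llm.delta":
            t.setdefault("first_delta", ev["ts"])  # keep only the first delta
        elif ev["type"] == "llm.response":
            t["response"] = ev["ts"]
            t["output_tokens"] = ev["output_tokens"]

    rows = []
    for turn, t in sorted(turns.items()):
        if not all(k in t for k in ("request", "first_delta", "response")):
            continue  # skip turns that never produced a full response
        ttft = t["first_delta"] - t["request"]
        decode_s = t["response"] - t["first_delta"]
        rate = t["output_tokens"] / decode_s if decode_s > 0 else None
        rows.append({"turn": turn, "ttft_s": ttft, "decode_tok_s": rate})
    return rows

rows = per_turn_timings(SAMPLE)
p50_ttft = median(r["ttft_s"] for r in rows)
```

Taking the median over turns rather than the mean keeps a single slow prefill from dominating the summary, which is why the tables on this page report p50 values.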

Headline numbers

2717 in-house trial reports

active profiles

external comparators

18484 leaderboard trials referenced

Same model, four provider/runtime profiles — Qwen3.6-27B on TB-2.1 'all'

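The provider/runtime stack is the only variable here. Going by the table, the cloud profile passes about 1.9× as often as the weakest local profile (61.8% vs 32.0%), and p50 TTFT spans about 21× (0.89s vs 18.98s). That spread is the cost of self-hosting at these quantizations. The vLLM int4 profile has the fastest decode (66 tok/s) but the slowest time to first token; the OpenRouter cloud profile wins by being uniformly responsive.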

Profile	pass@k	p50 TTFT	p50 decode	p50 wall	avg cost
OpenRouter (cloud)	61.8% (55/89)	0.89s	47 tok/s	569s	$0.143
vLLM int4 local CUDA	34.1% (28/82)	18.98s	66 tok/s	845s	—
llama.cpp Q3_K_XL local CUDA	32.0% (24/75)	1.96s	18 tok/s	487s	—
oMLX 8-bit local Apple Silicon	38.3% (31/81)	10.15s	15 tok/s	930s	—
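
The gaps between rows can be recomputed directly from the pass counts and p50 TTFT values in the table. The sketch below does only that arithmetic; the dictionary layout and short profile keys are choices made for this example.

```python
# Pass counts, trial counts, and p50 TTFT copied from the table above.
profiles = {
    "openrouter": {"passed": 55, "trials": 89, "ttft_s": 0.89},
    "vllm-int4": {"passed": 28, "trials": 82, "ttft_s": 18.98},
    "llamacpp-q3": {"passed": 24, "trials": 75, "ttft_s": 1.96},
    "omlx-8bit": {"passed": 31, "trials": 81, "ttft_s": 10.15},
}

def pass_rate(p):
    """Fraction of scored tasks that passed."""
    return p["passed"] / p["trials"]

rates = {name: pass_rate(p) for name, p in profiles.items()}
ttfts = {name: p["ttft_s"] for name, p in profiles.items()}

# Ratio of the best to the worst profile on each signal.
pass_gap = max(rates.values()) / min(rates.values())
ttft_gap = max(ttfts.values()) / min(ttfts.values())

print(f"pass-rate gap: {pass_gap:.2f}x, TTFT gap: {ttft_gap:.1f}x")
```

Note that the denominators differ between rows (89, 82, 75, 81), so each rate is relative to the trials that profile actually scored.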

Same task set, frontier models — best public submission per model on TB-2.1 'all'

Sourced from the public Hugging Face leaderboard at harborframework/terminal-bench-2-leaderboard. Ensembles (OB-1, Junie_CLI multi-model) are filtered out; for each model we pick the single-model harness with the highest pass@k.

Model	via	pass@k
Claude Opus 4.6	Crux__Claude-Opus-4.6	96.6% (86/89)
GPT-5.4	Forge__GPT-5.4	89.9% (80/89)
GPT-5.3-Codex	Mux__GPT-5.3-Codex	88.8% (79/89)
Gemini 3 Pro	Ante__Gemini-3-Pro-Preview	82.0% (73/89)
Gemini 3.1 Pro	Judy__Gemini-3.1-Pro-Preview	93.3% (83/89)
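
The selection rule described above (drop ensembles, then keep the highest-scoring single-model submission per model) might be sketched as follows. The record fields (`model`, `via`, `ensemble`, `passed`, `trials`) are hypothetical, and the two rows marked "invented" are made up purely to exercise the filter; they are not leaderboard data.

```python
def best_single_model_rows(submissions):
    """Pick, per model, the non-ensemble submission with the highest pass rate."""
    best = {}
    for s in submissions:
        if s["ensemble"]:
            continue  # ensembles are excluded from the comparison
        rate = s["passed"] / s["trials"]
        current = best.get(s["model"])
        if current is None or rate > current["passed"] / current["trials"]:
            best[s["model"]] = s
    return best

submissions = [
    # Rows taken from the table above.
    {"model": "Claude Opus 4.6", "via": "Crux__Claude-Opus-4.6",
     "ensemble": False, "passed": 86, "trials": 89},
    {"model": "GPT-5.4", "via": "Forge__GPT-5.4",
     "ensemble": False, "passed": 80, "trials": 89},
    # Invented row: a weaker single-model harness for the same model.
    {"model": "Claude Opus 4.6", "via": "ExampleHarness__Claude-Opus-4.6",
     "ensemble": False, "passed": 70, "trials": 89},
    # Invented row: an ensemble that would otherwise win.
    {"model": "Claude Opus 4.6", "via": "ExampleEnsemble",
     "ensemble": True, "passed": 88, "trials": 89},
]

best = best_single_model_rows(submissions)
```

How the real generator identifies ensembles is not stated here; a boolean flag is simply the most compact way to show the filter.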

Raw data explorer

The benchmark workbench exposes every collected cell as a browser-side analytical table with search, sort, filters, pairwise model-gap slices, and aggregate views over model, quantization, runtime, hardware, tokens, cost, and outcomes.

What’s measured

signal	source	what it tells you
pass@k	TB-2.1 verifier	whether the agent solves the task in at least one of k=5 repetitions (any-pass)
TTFT (p50)	per-turn first `llm.delta` − `llm.request`	provider prefill + queue latency under realistic context
decode tok/s (p50)	per-turn output tokens ÷ (`llm.response.ts` − first delta)	steady-state generation rate post-prefill
wall (p50)	trial start → trial end	total time the agent took, end-to-end
turns (p50)	count of `llm.request` per trial	how much the agent loop iterated
cost ($)	provider pricing × tokens	only meaningful for paid profiles
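
To make the any-pass reading of pass@k concrete, here is a minimal sketch that counts a task as solved when at least one of its first k repetitions passed. The `(task, passed)` tuple layout and the task names are assumptions for this example, not the verifier's actual output format.

```python
from collections import defaultdict

def pass_at_k_any(trials, k=5):
    """Return (solved_tasks, total_tasks) under any-pass over k repetitions.

    `trials` is an iterable of (task_id, passed) pairs, one per repetition.
    """
    by_task = defaultdict(list)
    for task_id, passed in trials:
        by_task[task_id].append(bool(passed))
    solved = sum(1 for reps in by_task.values() if any(reps[:k]))
    return solved, len(by_task)

# Synthetic trials: task-a passes once in five reps, task-b never passes.
trials = [("task-a", False), ("task-a", False), ("task-a", True),
          ("task-a", False), ("task-a", False)]
trials += [("task-b", False)] * 5

solved, total = pass_at_k_any(trials)
print(f"pass@5 (any-pass): {solved}/{total}")
```

Under this definition a task that passes once in five repetitions scores the same as one that passes every time, so pass@k measures capability rather than reliability.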

A turn-by-turn breakdown bucketed by input-token length (prefill scaling) is on each per-benchmark page.
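
A bucketed breakdown of this kind could be computed roughly as below. The bucket edges and the `(input_tokens, ttft_s)` tuple layout are hypothetical choices for this sketch; the per-benchmark pages may use different boundaries.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical bucket edges, in input tokens. Each bucket is [lo, hi).
EDGES = [1_000, 4_000, 16_000, 64_000]

def bucket_label(input_tokens):
    """Map an input-token count to a label such as '1000-4000' or '64000+'."""
    i = bisect_right(EDGES, input_tokens)
    lo = 0 if i == 0 else EDGES[i - 1]
    if i == len(EDGES):
        return f"{lo}+"
    return f"{lo}-{EDGES[i]}"

def p50_ttft_by_bucket(turns):
    """Group (input_tokens, ttft_s) pairs by bucket and take the median TTFT."""
    buckets = {}
    for input_tokens, ttft_s in turns:
        buckets.setdefault(bucket_label(input_tokens), []).append(ttft_s)
    return {label: median(values) for label, values in buckets.items()}

# Synthetic turns for illustration only.
turns = [(500, 0.25), (800, 0.75), (5_000, 1.0), (70_000, 9.0)]
summary = p50_ttft_by_bucket(turns)
```

Reading TTFT against input length separates prefill scaling from queueing: a profile whose TTFT grows steeply across buckets is prefill-bound, while one that is slow even in the smallest bucket is paying a fixed latency cost.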

Available benchmarks

The left navigation links to each benchmark we run today. To add a new one, edit scripts/benchmark/profiles/*.yaml and rerun the generator.