Benchmarks
Why benchmark
Fizeau exists because we wanted a single agent runtime where the harness and the model are independently swappable. The benchmarks here exist for the same reason: they separate harness loss from model loss so you can answer questions of the form “is it the loop that’s hurting, or the model?” with evidence.
Each benchmark exercises the agent loop the same way — same prompts, same tools, same compaction policy, same tool-call accounting. We then permute the variables:
- Same model, different provider/runtime. The Qwen3.6-27B profiles route the same model through OpenRouter (cloud), vLLM int4 on a local GPU (sindri), and oMLX 8-bit on Apple silicon (vidar). A pass-rate or wall-time delta between these profiles is provider/runtime loss — the cost of how the bytes reach the model, not what the model is.
- Same model, different harness. The fiz-harness-* profiles wrap Claude Code, Codex, Pi, and OpenCode through fiz so the model and the API stay constant while the agent loop changes. A delta here is harness loss.
- Different models, same task. The leaderboard rows on each report page show how frontier hosted models (Claude Opus 4.6, GPT-5.4, Gemini 3 Pro) score on the same task set. That’s the upper bound a small open-weight model is measured against.
Every per-turn timing — first-token latency, decode rate, prefill time — lands on disk in line-delimited JSON. The reports below come from those logs via scripts/benchmark/generate-report.py; rerunning the script regenerates every chart and table here.
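As a rough illustration of how per-turn timings could be recovered from such a log, the sketch below reads line-delimited JSON and derives TTFT and decode rate using the definitions in the "What’s measured" table. The event schema is an assumption made for this example: the field names `type`, `turn`, `ts`, and `output_tokens` are hypothetical and may not match what Fizeau writes or what generate-report.py reads.

```python
import json
from statistics import median

# Hypothetical fields every complete turn must have collected.
REQUIRED = ("request", "first_delta", "response", "output_tokens")


def turn_timings(lines):
    """Return a list of (ttft_seconds, decode_tok_per_s), one per complete turn."""
    turns = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        ev = json.loads(line)
        t = turns.setdefault(ev["turn"], {})
        kind = ev["type"]
        if kind == "llm.request":
            t["request"] = ev["ts"]
        elif kind == "llm.delta":
            # Only the first delta of a turn matters for TTFT.
            t.setdefault("first_delta", ev["ts"])
        elif kind == "llm.response":
            t["response"] = ev["ts"]
            t["output_tokens"] = ev["output_tokens"]

    out = []
    for turn in sorted(turns):
        t = turns[turn]
        if not all(key in t for key in REQUIRED):
            continue  # skip turns that never finished
        ttft = t["first_delta"] - t["request"]
        decode_s = t["response"] - t["first_delta"]
        rate = t["output_tokens"] / decode_s if decode_s > 0 else None
        out.append((ttft, rate))
    return out


# Made-up events, for illustration only.
sample = [
    '{"type": "llm.request", "turn": 1, "ts": 0.0}',
    '{"type": "llm.delta", "turn": 1, "ts": 1.0}',
    '{"type": "llm.delta", "turn": 1, "ts": 1.5}',
    '{"type": "llm.response", "turn": 1, "ts": 3.0, "output_tokens": 100}',
    '{"type": "llm.request", "turn": 2, "ts": 4.0}',
    '{"type": "llm.delta", "turn": 2, "ts": 4.5}',
    '{"type": "llm.response", "turn": 2, "ts": 6.5, "output_tokens": 60}',
]
timings = turn_timings(sample)
p50_ttft = median(ttft for ttft, _ in timings)
p50_decode = median(rate for _, rate in timings if rate is not None)
```

Turns missing a request, first delta, or response are skipped rather than guessed at, so a crashed turn does not distort the medians.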
Headline numbers
| Profile | pass@k | p50 TTFT | p50 decode | p50 wall | avg cost |
|---|---|---|---|---|---|
| OpenRouter (cloud) | 61.8% (55/89) | 0.89s | 47 tok/s | 569s | $0.143 |
| vLLM int4 local CUDA | 34.1% (28/82) | 18.98s | 66 tok/s | 845s | n/a |
| llama.cpp Q3_K_XL local CUDA | 32.0% (24/75) | 1.96s | 18 tok/s | 487s | n/a |
| oMLX 8-bit local Apple Silicon | 38.3% (31/81) | 10.15s | 15 tok/s | 930s | n/a |
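To make the idea of provider/runtime loss concrete, here is a small sketch that computes pass-rate and wall-time deltas against a chosen baseline, using the figures from the table above. The `deltas` helper is hypothetical and not part of the repository. The profiles report different task counts (89, 82, 75, 81), so these deltas are indicative rather than a like-for-like comparison over an identical task set.

```python
# Headline figures copied from the table above:
# profile -> (tasks passed, tasks counted, p50 wall seconds)
profiles = {
    "OpenRouter (cloud)": (55, 89, 569),
    "vLLM int4 local CUDA": (28, 82, 845),
    "llama.cpp Q3_K_XL local CUDA": (24, 75, 487),
    "oMLX 8-bit local Apple Silicon": (31, 81, 930),
}


def deltas(profiles, baseline):
    """Return {profile: (pass-rate delta in points, p50 wall delta in seconds)}."""
    base_passed, base_total, base_wall = profiles[baseline]
    base_rate = base_passed / base_total
    rows = {}
    for name, (passed, total, wall) in profiles.items():
        if name == baseline:
            continue
        rate_delta = round(100 * (passed / total - base_rate), 1)
        rows[name] = (rate_delta, wall - base_wall)
    return rows


rows = deltas(profiles, "OpenRouter (cloud)")
for name, (rate_delta, wall_delta) in rows.items():
    print(f"{name}: {rate_delta:+.1f} pts pass@k, {wall_delta:+d}s p50 wall")
```

A negative pass-rate delta means the profile solves fewer tasks than the baseline; a negative wall delta means it finishes faster.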
harborframework/terminal-bench-2-leaderboard. Ensembles (OB-1, Junie_CLI multi-model) are filtered out; for each model we pick the single-model harness with the highest pass@k.

| Model | via | pass@k |
|---|---|---|
| Claude Opus 4.6 | Crux__Claude-Opus-4.6 | 96.6% (86/89) |
| GPT-5.4 | Forge__GPT-5.4 | 89.9% (80/89) |
| GPT-5.3-Codex | Mux__GPT-5.3-Codex | 88.8% (79/89) |
| Gemini 3 Pro | Ante__Gemini-3-Pro-Preview | 82.0% (73/89) |
| Gemini 3.1 Pro | Judy__Gemini-3.1-Pro-Preview | 93.3% (83/89) |
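The selection rule described for this table (drop ensembles, then keep the best single-model harness per model) can be sketched as follows. This is only an illustration under assumptions: the `Entry` type is hypothetical, an ensemble is taken to mean any submission that lists more than one model, and the entries are placeholders rather than real leaderboard rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    harness: str
    models: tuple  # every model the submission uses
    passed: int
    total: int


def best_single_model(entries):
    """Map each model to its highest-scoring single-model submission."""
    best = {}
    for entry in entries:
        if len(entry.models) != 1:
            continue  # assumed definition of an ensemble: more than one model
        model = entry.models[0]
        current = best.get(model)
        if current is None or entry.passed / entry.total > current.passed / current.total:
            best[model] = entry
    return best


# Placeholder entries, not real leaderboard data.
entries = [
    Entry("harness-a", ("model-x",), 70, 89),
    Entry("harness-b", ("model-x",), 75, 89),
    Entry("harness-c", ("model-x", "model-y"), 88, 89),
    Entry("harness-d", ("model-y",), 60, 89),
]
best = best_single_model(entries)
```

Ties keep the first submission seen; the source does not say how ties are broken, so that choice is arbitrary here.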
Raw data explorer
The benchmark workbench exposes every collected cell as a browser-side analytical table with search, sort, filters, pairwise model-gap slices, and aggregate views over model, quantization, runtime, hardware, tokens, cost, and outcomes.
What’s measured
| signal | source | what it tells you |
|---|---|---|
| pass@k | TB-2.1 verifier | whether the agent solves the task in at least one of k=5 repetitions (any-pass) |
| TTFT (p50) | per-turn llm.delta − llm.request | provider prefill + queue latency under realistic context |
| decode tok/s (p50) | per-turn llm.response.ts − first delta | steady-state generation rate post-prefill |
| wall (p50) | trial start → trial end | total time the agent took, end-to-end |
| turns (p50) | count of llm.request per trial | how much the agent loop iterated |
| cost ($) | provider pricing × tokens | only meaningful for paid profiles |
A turn-by-turn breakdown bucketed by input-token length (prefill scaling) is on each per-benchmark page.
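The any-pass reading of pass@k in the table above can be expressed in a few lines. The sketch assumes trial results are available as (task_id, passed) pairs, which is a hypothetical shape chosen for the example and not necessarily how the verifier output is stored.

```python
from collections import defaultdict


def pass_at_k(trials, k=5):
    """Return (tasks solved, tasks attempted) under any-pass scoring.

    trials: iterable of (task_id, passed) pairs, one per repetition.
    A task counts as solved if any of its first k repetitions passed.
    """
    by_task = defaultdict(list)
    for task_id, passed in trials:
        by_task[task_id].append(bool(passed))
    solved = sum(1 for reps in by_task.values() if any(reps[:k]))
    return solved, len(by_task)


# Made-up trial outcomes, for illustration only.
trials = [
    ("task-a", False), ("task-a", True),
    ("task-b", False), ("task-b", False),
    ("task-c", True),
]
solved, attempted = pass_at_k(trials)
print(f"pass@k: {100 * solved / attempted:.1f}% ({solved}/{attempted})")
```

In this sketch, tasks with no recorded repetitions never enter the denominator. That is an assumption, and the generator may count them differently.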
Available benchmarks
The left navigation links to each benchmark we run today. To add a new one, edit scripts/benchmark/profiles/*.yaml and rerun the generator.