Benchmarks

Why benchmark

Fizeau exists because we wanted a single agent runtime where the harness and the model are independently swappable. The benchmarks here exist for the same reason: they separate harness loss from model loss so you can answer questions of the form “is it the loop that’s hurting, or the model?” with evidence.

Each benchmark exercises the agent loop the same way: same prompts, same tools, same compaction policy, same tool-call accounting. We then vary one variable at a time:

  • Same model, different provider/runtime. The Qwen3.6-27B profiles route the same model through OpenRouter (cloud), vLLM int4 on a local GPU (sindri), and oMLX 8-bit on Apple silicon (vidar). A pass-rate or wall-time delta between these profiles is provider/runtime loss — the cost of how the bytes reach the model, not what the model is.
  • Same model, different harness. The fiz-harness-* profiles wrap Claude Code, Codex, Pi, and OpenCode through fiz so the model and the API stay constant while the agent loop changes. A delta here is harness loss.
  • Different models, same task. The leaderboard rows on each report page show how frontier hosted models (Claude Opus 4.6, GPT-5.4, Gemini 3 Pro) score on the same task set. That’s the upper bound a small open-weight model is measured against.

Every per-turn timing — first-token latency, decode rate, prefill time — lands on disk in line-delimited JSON. The reports below come from those logs via scripts/benchmark/generate-report.py; rerunning the script regenerates every chart and table here.
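As a rough illustration of how such a log could be reduced to a p50 first-token latency, here is a minimal sketch. The event names `llm.request` and `llm.delta` come from the "What's measured" table; the record fields `event`, `turn`, and `ts` are assumptions made for this example and may not match the actual schema that scripts/benchmark/generate-report.py reads.

```python
import json
import statistics

# Hypothetical record shape: {"event": ..., "turn": ..., "ts": seconds}.
# The real log schema may differ; only the event names appear on this page.
SAMPLE_LOG = """\
{"event": "llm.request", "turn": 1, "ts": 0.0}
{"event": "llm.delta", "turn": 1, "ts": 0.5}
{"event": "llm.delta", "turn": 1, "ts": 0.75}
{"event": "llm.request", "turn": 2, "ts": 10.0}
{"event": "llm.delta", "turn": 2, "ts": 12.0}
{"event": "llm.request", "turn": 3, "ts": 20.0}
{"event": "llm.delta", "turn": 3, "ts": 21.0}
"""


def ttft_per_turn(lines):
    """Return first-token latency (seconds) for each turn, in turn order."""
    request_ts = {}
    first_delta_ts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in line-delimited JSON
        rec = json.loads(line)
        turn = rec["turn"]
        if rec["event"] == "llm.request":
            request_ts[turn] = rec["ts"]
        elif rec["event"] == "llm.delta" and turn not in first_delta_ts:
            # Only the first delta of a turn marks the first token.
            first_delta_ts[turn] = rec["ts"]
    return [
        first_delta_ts[t] - request_ts[t]
        for t in sorted(request_ts)
        if t in first_delta_ts
    ]


def p50_ttft(lines):
    """Median first-token latency across all turns in the log."""
    return statistics.median(ttft_per_turn(lines))


ttfts = ttft_per_turn(SAMPLE_LOG.splitlines())
print(ttfts, p50_ttft(SAMPLE_LOG.splitlines()))
```

Reading line by line keeps memory flat for long trials, and ignoring every delta after the first is what separates prefill latency from decode time.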

Headline numbers

  • 2717 in-house trial reports
  • 23 active profiles
  • 45 external comparators
  • 18484 leaderboard trials referenced

Same model, three providers — Qwen3.6-27B on TB-2.1 'all'
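The provider/runtime stack is the only variable here. In the table below, OpenRouter's pass rate is roughly 1.6× to 1.9× that of the self-hosted profiles, and the p50 TTFT gap reaches about 21× (18.98s for vLLM int4 against 0.89s for OpenRouter); that is the cost of self-hosting at these quantizations. The vLLM int4 profile has the fastest decode but the slowest prefill; the OpenRouter cloud profile wins by being uniformly responsive.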

| Profile | pass@k | p50 TTFT | p50 decode | p50 wall | avg cost |
|---|---|---|---|---|---|
| OpenRouter (cloud) | 61.8% (55/89) | 0.89s | 47 tok/s | 569s | $0.143 |
| vLLM int4 local CUDA | 34.1% (28/82) | 18.98s | 66 tok/s | 845s | |
| llama.cpp Q3_K_XL local CUDA | 32.0% (24/75) | 1.96s | 18 tok/s | 487s | |
| oMLX 8-bit local Apple Silicon | 38.3% (31/81) | 10.15s | 15 tok/s | 930s | |

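The gaps between rows can be recomputed directly from the table. The sketch below copies the pass counts and p50 TTFT values from the rows above and expresses each self-hosted profile relative to OpenRouter. Reading the two numbers in the pass@k column as passed/total tasks is an assumption, and the function and field names are illustrative only.

```python
# Values copied from the table above; "passed"/"total" is an assumed reading
# of the "55/89"-style figures in the pass@k column.
ROWS = {
    "OpenRouter (cloud)": {"passed": 55, "total": 89, "ttft_s": 0.89},
    "vLLM int4 local CUDA": {"passed": 28, "total": 82, "ttft_s": 18.98},
    "llama.cpp Q3_K_XL local CUDA": {"passed": 24, "total": 75, "ttft_s": 1.96},
    "oMLX 8-bit local Apple Silicon": {"passed": 31, "total": 81, "ttft_s": 10.15},
}


def gap_vs_baseline(rows, baseline):
    """Return pass-rate and TTFT ratios of every profile against `baseline`."""
    base = rows[baseline]
    base_rate = base["passed"] / base["total"]
    gaps = {}
    for name, row in rows.items():
        if name == baseline:
            continue
        rate = row["passed"] / row["total"]
        gaps[name] = {
            # > 1 means the baseline passes proportionally more tasks.
            "pass_rate_ratio": base_rate / rate,
            # > 1 means the profile waits longer for its first token.
            "ttft_ratio": row["ttft_s"] / base["ttft_s"],
        }
    return gaps


gaps = gap_vs_baseline(ROWS, "OpenRouter (cloud)")
for name, g in gaps.items():
    print(f"{name}: pass {g['pass_rate_ratio']:.2f}x, TTFT {g['ttft_ratio']:.1f}x")
```

The denominators differ between profiles (89, 82, 75, 81), so these ratios compare pass rates over task counts that are not identical.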
Same task set, frontier models — best public submission per model on TB-2.1 'all'
Sourced from the public Hugging Face leaderboard at harborframework/terminal-bench-2-leaderboard. Ensembles (OB-1, Junie_CLI multi-model) are filtered out; for each model we pick the single-model harness with the highest pass@k.

| Model | via | pass@k |
|---|---|---|
| Claude Opus 4.6 | Crux__Claude-Opus-4.6 | 96.6% (86/89) |
| GPT-5.4 | Forge__GPT-5.4 | 89.9% (80/89) |
| GPT-5.3-Codex | Mux__GPT-5.3-Codex | 88.8% (79/89) |
| Gemini 3 Pro | Ante__Gemini-3-Pro-Preview | 82.0% (73/89) |
| Gemini 3.1 Pro | Judy__Gemini-3.1-Pro-Preview | 93.3% (83/89) |
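
The selection rule described above (drop ensembles, then keep the highest pass@k per model) can be sketched as follows. The `Submission` type, its fields, and the `is_ensemble` flag are hypothetical; the page does not say how the generator detects ensembles. Two of the sample entries are made up to exercise the filter and are marked as such.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    name: str            # leaderboard entry, e.g. "Crux__Claude-Opus-4.6"
    model: str
    passed: int
    total: int
    is_ensemble: bool = False  # hypothetical flag for multi-model entries

    @property
    def pass_rate(self):
        return self.passed / self.total


def best_per_model(submissions):
    """Keep the single-model submission with the highest pass rate per model."""
    best = {}
    for sub in submissions:
        if sub.is_ensemble:
            continue  # ensembles are filtered out before ranking
        current = best.get(sub.model)
        if current is None or sub.pass_rate > current.pass_rate:
            best[sub.model] = sub
    return best


SUBMISSIONS = [
    Submission("Crux__Claude-Opus-4.6", "Claude Opus 4.6", 86, 89),
    Submission("Forge__GPT-5.4", "GPT-5.4", 80, 89),
    # The next two entries are invented for illustration only.
    Submission("HypotheticalHarness__GPT-5.4", "GPT-5.4", 70, 89),
    Submission("HypotheticalEnsemble", "GPT-5.4", 88, 89, is_ensemble=True),
]

best = best_per_model(SUBMISSIONS)
for model, sub in best.items():
    print(f"{model}: {sub.name} {sub.pass_rate:.1%}")
```

On ties the first submission seen wins, which is an arbitrary choice made for this sketch.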

Raw data explorer

The benchmark workbench exposes every collected cell as a browser-side analytical table with search, sort, filters, pairwise model-gap slices, and aggregate views over model, quantization, runtime, hardware, tokens, cost, and outcomes.

What’s measured

| signal | source | what it tells you |
|---|---|---|
| pass@k | TB-2.1 verifier | does the agent solve the task, k=5 reps, any-pass |
| TTFT (p50) | per-turn llm.delta − llm.request | provider prefill + queue latency under realistic context |
| decode tok/s (p50) | per-turn llm.response.ts − first delta | steady-state generation rate post-prefill |
| wall (p50) | trial start → trial end | total time the agent took, end-to-end |
| turns (p50) | count of llm.request per trial | how much the agent loop iterated |
| cost ($) | provider pricing × tokens | only meaningful for paid profiles |
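
Two of the definitions in the table above translate into a few lines of code. This is a sketch under stated assumptions: "any-pass" is read as "a task counts as solved if at least one of its k reps passes", and the decode rate is taken as output tokens divided by the time between the first delta and the response timestamp. The function names and argument shapes are illustrative, not the generator's API.

```python
def pass_at_k_any(results):
    """Count tasks solved by at least one rep.

    `results` maps a task id to the list of per-rep outcomes (True = passed),
    e.g. five booleans per task when k=5.
    Returns (solved_tasks, total_tasks).
    """
    solved = sum(1 for reps in results.values() if any(reps))
    return solved, len(results)


def decode_tok_per_s(output_tokens, first_delta_ts, response_ts):
    """Generation rate after prefill, in tokens per second.

    Returns None when the interval is empty, for example a single-delta turn.
    """
    elapsed = response_ts - first_delta_ts
    if elapsed <= 0:
        return None
    return output_tokens / elapsed


# Made-up outcomes for three tasks with k=5 reps each.
example = {
    "task-a": [False, False, True, False, False],
    "task-b": [False, False, False, False, False],
    "task-c": [True, True, True, True, True],
}
solved, total = pass_at_k_any(example)
print(f"pass@5 (any-pass): {solved}/{total}")
print(decode_tok_per_s(100, first_delta_ts=1.0, response_ts=3.0))
```

Measuring decode from the first delta rather than from the request keeps prefill time out of the rate, which matches the "post-prefill" wording in the table.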

A turn-by-turn breakdown bucketed by input-token length (prefill scaling) is on each per-benchmark page.
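
One way such a breakdown could be computed is sketched below: assign each turn to an input-token bucket and take the median TTFT per bucket. The bucket edges are hypothetical and chosen only for illustration; the per-benchmark pages may use different boundaries.

```python
import bisect
import statistics
from collections import defaultdict

# Hypothetical bucket boundaries, in input tokens. Each bucket is [lo, hi).
BUCKET_EDGES = [1_000, 4_000, 16_000, 64_000]


def bucket_label(input_tokens, edges=BUCKET_EDGES):
    """Return a label such as "0-1000" or "64000+" for a token count."""
    i = bisect.bisect_right(edges, input_tokens)
    lo = 0 if i == 0 else edges[i - 1]
    if i == len(edges):
        return f"{lo}+"
    return f"{lo}-{edges[i]}"


def median_ttft_by_bucket(turns):
    """Group (input_tokens, ttft_seconds) pairs and take the median per bucket."""
    grouped = defaultdict(list)
    for input_tokens, ttft_s in turns:
        grouped[bucket_label(input_tokens)].append(ttft_s)
    return {label: statistics.median(values) for label, values in grouped.items()}


# Made-up turns: (input tokens, TTFT in seconds).
turns = [(500, 0.5), (800, 1.5), (5_000, 4.0), (100_000, 30.0)]
by_bucket = median_ttft_by_bucket(turns)
print(by_bucket)
```

If TTFT grows steeply from one bucket to the next, prefill is the likely bottleneck; a flat profile points at queueing or network latency instead.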

Available benchmarks

The left navigation links each benchmark we run today; add a new one by editing scripts/benchmark/profiles/*.yaml and rerunning the generator.