Fizeau
An agentic-development runtime with its own measured loop. Other tools build on Fizeau instead of writing their own agent loop: we own the harness, sampling, performance instrumentation, cost tracking, and subscription accounting so they don't have to. Local-model-first via vLLM, MLX, LM Studio, and Ollama, with cloud providers when you want them.
Why Fizeau exists
Fizeau exists for three reasons that build on each other:
- Facilitate agentic development. A reusable, embeddable agent loop with the right primitives — tool-calling, planning, compaction, retries, session logging — so building tools doesn't mean re-implementing the loop every time.
- Make agentic work measurable. Per-turn timing, prefill vs decode breakdown, cost-per-trial, subscription-quota accounting — first-class outputs, not bolted-on observability. You can't improve prompts, agents, or providers you can't measure.
- Make local models a real option. Local serving (vLLM, MLX, LM Studio, Ollama) sits on the same provider surface as cloud frontier models, and the benchmarks compare them honestly. Self-hosting at the right quantization is often cheaper and sometimes faster, but rarely the right answer for everything. You can pick per workload because the data is on the table.
What it is
Fizeau is an agent runtime with a built-in agent loop (the fiz harness): it manages the prompt, tool-call protocol, file/edit/bash tooling, planning, compaction, retries, sampling, reasoning, quotas, and session log. It is not an LLM serving runtime — it does not host weights. Upstream model traffic goes to whatever provider the profile points at (OpenAI, Anthropic, OpenRouter, vLLM, oMLX, RapidMLX, native local).
Fizeau can also run as a wrapper around a different agent CLI (Claude Code, Codex, Pi, OpenCode) — the fiz-harness-* profiles in the profile catalog use this mode, where fiz handles configuration, environment, tool-call accounting, and session logging while delegating the reasoning loop to the wrapped agent. This isolates "is the agent loop hurting?" from "is the model hurting?" — same model, different harness, different profile.
For benchmark purposes, each profile comparison in the catalog holds one axis constant and varies the other. A delta between two profiles that share a model but differ in harness is harness loss; a delta between two profiles that share a harness but differ in provider is provider/runtime loss.
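The attribution rule above can be sketched in a few lines of Go. This is a minimal illustration, not fizeau's implementation: the Profile struct, its field names, the Passed metric, and the Compare function are all hypothetical. The sketch is also slightly stricter than the text, in that it only attributes a delta when exactly one axis differs and everything else (including the model) matches.

```go
package main

import "fmt"

// Profile is a hypothetical stand-in for one catalog entry. The field names
// are assumptions made for this sketch, not fizeau's profile schema.
type Profile struct {
	Name     string
	Model    string
	Harness  string
	Provider string
	Passed   int // tasks passed on a fixed benchmark set (illustrative metric)
}

// Compare returns which single axis differs between a and b, plus the score
// delta a.Passed - b.Passed. If more than one axis differs (or none), the
// delta cannot be attributed and the axis is reported as "confounded".
func Compare(a, b Profile) (axis string, delta int) {
	delta = a.Passed - b.Passed
	sameModel := a.Model == b.Model
	sameHarness := a.Harness == b.Harness
	sameProvider := a.Provider == b.Provider
	switch {
	case sameModel && sameProvider && !sameHarness:
		return "harness", delta
	case sameModel && sameHarness && !sameProvider:
		return "provider/runtime", delta
	default:
		return "confounded", delta
	}
}

func main() {
	// All names and scores below are made up for illustration.
	native := Profile{"native-vllm", "model-x", "fiz", "vllm", 41}
	wrapped := Profile{"wrapped-vllm", "model-x", "claude-code", "vllm", 44}
	cloud := Profile{"native-cloud", "model-x", "fiz", "openrouter", 43}

	pairs := [][2]Profile{{native, wrapped}, {native, cloud}, {wrapped, cloud}}
	for _, p := range pairs {
		axis, d := Compare(p[0], p[1])
		fmt.Printf("%s vs %s: axis=%s delta=%d\n", p[0].Name, p[1].Name, axis, d)
	}
}
```

The third pair differs in both harness and provider, so the sketch refuses to attribute its delta to either axis. That is the reason to hold one axis constant in the first place.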
Built for instrumented agent work
Every surface assumes you want to know what the medium is doing. There is no separate observability layer — the runtime emits structured per-turn timing as a first-class output.
Built-in agent loop
Tool-calling LLM loop with read, write, edit, bash, find, grep, ls, patch, task. Compaction, retry, sampler, reasoning, quotas — all wired through one provider-shaped surface.
One surface, many backends
OpenAI, Anthropic, OpenRouter, vLLM, oMLX, RapidMLX, native local. Profile definitions are YAML; benchmark deltas reflect provider/runtime, not harness drift.
TTFT, decode, prefill — per turn
Every llm.request → llm.delta → llm.response chain is timed and recorded. Every request is kept, so there is no trace sampling and no aggregation loss. Bucket by context length; attribute wall-time to prefill vs generation.
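As a rough sketch of how a request → delta → response chain could be turned into per-turn numbers: the Event and TurnTiming types and the TimeTurn function below are hypothetical, and only the three event names come from the text above. The sketch treats time to first token (first llm.delta minus llm.request) as an approximation of prefill, which in practice also includes network and queueing time, and treats the rest of the turn as generation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Event is a hypothetical timed record; fizeau's real event shape may differ.
type Event struct {
	Type string // "llm.request", "llm.delta", or "llm.response"
	At   time.Time
}

// TurnTiming splits one turn's wall time at the first streamed delta.
type TurnTiming struct {
	TTFT   time.Duration // request to first delta: approximates prefill
	Decode time.Duration // first delta to response: approximates generation
	Total  time.Duration // request to response
}

// TimeTurn derives timings from the first request, the first delta, and the
// last response seen in events. It returns an error if any of them is missing.
func TimeTurn(events []Event) (TurnTiming, error) {
	var req, firstDelta, resp time.Time
	for _, e := range events {
		switch e.Type {
		case "llm.request":
			if req.IsZero() {
				req = e.At
			}
		case "llm.delta":
			if firstDelta.IsZero() {
				firstDelta = e.At
			}
		case "llm.response":
			resp = e.At
		}
	}
	if req.IsZero() || firstDelta.IsZero() || resp.IsZero() {
		return TurnTiming{}, errors.New("incomplete request/delta/response chain")
	}
	return TurnTiming{
		TTFT:   firstDelta.Sub(req),
		Decode: resp.Sub(firstDelta),
		Total:  resp.Sub(req),
	}, nil
}

func main() {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC) // arbitrary timestamp
	events := []Event{
		{"llm.request", start},
		{"llm.delta", start.Add(120 * time.Millisecond)},
		{"llm.delta", start.Add(300 * time.Millisecond)},
		{"llm.response", start.Add(900 * time.Millisecond)},
	}
	t, err := TimeTurn(events)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("ttft=%v decode=%v total=%v\n", t.TTFT, t.Decode, t.Total)
}
```

Bucketing by context length would then be a matter of grouping TurnTiming values by the prompt token count of each turn, assuming the log records one.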
Wrap Claude Code, Codex, Pi, OpenCode
fiz-harness-* profiles route through fiz as a measurement wrapper around another agent CLI. Holds the model constant; varies the harness; isolates "is the loop hurting?" from "is the model hurting?"
JSONL session logs, replayable
Every turn, every tool call, every cost figure on disk in line-delimited JSON. fiz log to list, fiz replay to render. Replays drive the per-turn timing analysis behind every chart on this site.
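Because the logs are line-delimited JSON, they can be read with nothing beyond the Go standard library. The following is a minimal sketch under stated assumptions: the Record struct and its JSON keys (turn, type, cost_usd) are hypothetical and chosen for illustration, not taken from fizeau's actual log schema, and ReadLog, ParseString, and CostByTurn are illustrative helpers rather than part of the fizeau API.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Record is a hypothetical log line. The JSON keys are assumptions for this
// sketch; check a real session log for the actual field names.
type Record struct {
	Turn    int     `json:"turn"`
	Type    string  `json:"type"`
	CostUSD float64 `json:"cost_usd"`
}

// ReadLog decodes one JSON object per line, skipping blank lines.
func ReadLog(r io.Reader) ([]Record, error) {
	var out []Record
	sc := bufio.NewScanner(r)
	// Raise the per-line limit: the default 64 KiB token size is easy to
	// exceed when a line carries a large tool result.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	line := 0
	for sc.Scan() {
		line++
		text := strings.TrimSpace(sc.Text())
		if text == "" {
			continue
		}
		var rec Record
		if err := json.Unmarshal([]byte(text), &rec); err != nil {
			return nil, fmt.Errorf("line %d: %w", line, err)
		}
		out = append(out, rec)
	}
	return out, sc.Err()
}

// ParseString is a convenience wrapper for in-memory logs.
func ParseString(s string) ([]Record, error) {
	return ReadLog(strings.NewReader(s))
}

// CostByTurn sums the recorded cost for each turn number.
func CostByTurn(recs []Record) map[int]float64 {
	totals := make(map[int]float64)
	for _, r := range recs {
		totals[r.Turn] += r.CostUSD
	}
	return totals
}

func main() {
	// Made-up sample data in the assumed shape.
	sample := `{"turn":1,"type":"llm.response","cost_usd":0.5}
{"turn":1,"type":"tool.call","cost_usd":0.25}

{"turn":2,"type":"llm.response","cost_usd":0.125}
`
	recs, err := ParseString(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(recs), "records;", CostByTurn(recs))
}
```

Reading line by line, rather than decoding the whole file at once, keeps memory use flat on long sessions and lets a malformed line be reported by number.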
Go library, no subprocess overhead
fizeau.New(...).Execute(ctx, request). Lives inside a build orchestrator (DDx) or any Go service that needs a tool-using model on its critical path.