Skip to content
Agents Comparison

Agents Comparison

A side-by-side comparison of Fizeau against every coding-agent harness that has a row in the Fizeau Terminal-Bench 2.1 leaderboard (data/aggregates.json) plus the canonical open-source CLI agents users typically choose between.

Numbers in the TB-2.1 best column are from the snapshot data/aggregates.json regenerated 2026-05-10. The figure is tasks_passed / tasks_attempted on the 89-task all subset, picking the best-scoring harness×model row for that harness. See How we measure below for what that number does and does not mean.

A ? means the field could not be confirmed from the project’s canonical source. We deliberately do not guess.

Comparison

AgentSourceLanguageLicenseDistributionEmbeddable libraryLocal-model supportBuilt-in toolsTB-2.1 bestBest row
Fizeau (fiz)easel/fizeauGoMITCLI + Go libraryYes — fizeau.New(...).Execute(ctx, req)Yes — LM Studio, Ollama, vLLM, MLX, any OpenAI-compatible9 — read, write, edit, bash, find, grep, ls, patch, task52 / 89 (pass@k, fiz-openrouter-qwen3-6-27b)1per_profile
Claude Codeanthropics/claude-codeShell + Python + TypeScriptProprietary (Anthropic Commercial Terms)CLI? (no documented programmatic API)? (not documented in README; community guides exist)? (plugins, /bug, git workflows; no enumerated list in README)46 / 89ClaudeCode__GLM-4.7
OpenAI Codex CLIopenai/codexRustApache-2.0CLI (npm, brew, binary) + /sdk directoryPartial — SDK directory present, CLI is primary surfaceYes — --oss flag plus OPENAI_API_BASE to Ollama / LM Studio / MLX2? (not enumerated in README)— (no leaderboard row for Codex CLI itself)
OpenCodesst/opencodeTypeScriptMITCLI + desktop (beta)? (npm package published; library API not documented)Yes — “not coupled to any provider … local models” per README? (not enumerated in README)46 / 88OpenCode__Claude-Opus-4.5
Gemini CLIgoogle-gemini/gemini-cliTypeScriptApache-2.0CLI (npm, brew, MacPorts, conda)Yes — @google/gemini-cli npm packageNo (Google Gemini cloud only)File ops (read/write/search), shell exec, web fetch, Google Search grounding, MCP server integration64 / 89Gemini_CLI__Gemini-3-Flash-Preview
AiderAider-AI/aiderPythonApache-2.0CLI (pip install aider-chat)? (Python package; library API not the documented surface)Yes — Ollama, LM Studio, any OpenAI-compatible API3Git auto-commit, lint, test, image/web ingest, voice input, repo map, watch mode— (no leaderboard row)
Continuecontinuedev/continueTypeScript (+ Kotlin / Python / Rust)Apache-2.0VS Code ext, JetBrains plugin, CLI (cn)? (@continuedev/cli published; library API not documented)Yes — Ollama configurable as provider in config.yamlContext providers: file, code, diff, http, terminal; plus MCP servers— (no leaderboard row)
Gooseblock/gooseRust + TypeScriptApache-2.0Desktop app, CLI, APIYes — API documented for embeddingYes — Ollama listed among 15+ providers70+ extensions via MCP— (no leaderboard row)
Forge (Code-Forge)antinomyhq/forgeRustApache-2.0CLI? (CLI-only, no documented library surface)? (cloud providers only in README)3 agents (forge / sage / muse), plus shell, git, file r/w, 3 skills (create-skill, execute-plan, github-pr-description), MCP80 / 89Forge__GPT-5.4
Junie CLIJetBrains/junieShell + PowerShell wrappersJetBrains AI Service Terms (proprietary)CLI (npm) + IDE? (npm package; library API not documented)? — BYOK for Anthropic, OpenAI, Google, xAI, OpenRouter, Copilot; local-model support listed as roadmap4? (not enumerated; MCP server install supported)79 / 89Junie_CLI__Gemini-3-Flash-Preview-Gemini-3.1-Pro-Preview-Claude-Opus-4.6-GPT-5.3-Codex
Pibadlogic/pi-monoTypeScriptMITCLI (npm)Yes — @earendil-works/pi-agent-core and pi-ai published as separate packages? (unified LLM API; local-model specifics not in README)4 — read, write, edit, bash— (no leaderboard row)

How we measure

The TB-2.1 column reports the score of each external harness’s best submission to the public Terminal-Bench 2.1 leaderboard (harborframework/terminal-bench-2-leaderboard on Hugging Face). For each Harness__Model pair we count tasks_passed / tasks_attempted on the full 89-task set, then keep the single best pair per harness. Submission dates vary; the snapshot index date is the date we last refreshed the cache.

Fizeau’s row is sourced differently: it is tasks_passed_any / tasks_touched from our local per_profile aggregate (fiz-openrouter-qwen3-6-27b is the profile with the most graded reps). That is pass@k with k=5 reps per task, not the per-rep pass@1 that the external leaderboard implicitly reports. The two numbers are not directly comparable; we report Fizeau’s anyway because it is the dominant profile in the data and a single-number summary is more useful than no summary. For apples-to-apples comparison see the Terminal-Bench 2.1 page, which exposes both metrics, the per-task slice, and the cost / wall-time / token breakdown.

External rows count one pass per task per submission. Reps are not exposed uniformly across submissions, so we cannot reconstruct per-rep pass@1 for the external column without cooperation from each submitter.

Where we don’t compare

Some agents users ask about are excluded from the table, deliberately:

  • Cursor — IDE-only product. There is no CLI or programmatic API for an external benchmark to drive, so it cannot be measured under the harness model TB-2.1 uses. It is outside the comparable surface.
  • GitHub Copilot Chat — same reason as Cursor.
  • Crux, Capy, Mux, Judy, MAYA, OB-1, OpenSage, Terminus2, Terminus-KIRA, Ante, CodeBrain-1, Deep-Agents, Droid (Factory), IndusAGICodingAgent, Simple-Codex, pilot-real, cchuter, dakou, grok-cli, BashAgent, terminus-2 — present in the leaderboard data but without a publicly canonical source repository we could verify in this pass. They are tracked in external_per_subset; their facts are unverified, so they are intentionally not in the table rather than filled with ? rows. Submit a correction (see below) and we will add them.
  • Continue, Aider, Goose, Pi — included in the table for build-vs-buy reference, even though they have not submitted to Terminal-Bench 2.1. Their TB-2.1 column reads .

How to add a row

To correct a fact or add a missing harness:

  1. Open a PR against website/content/docs/comparison/_index.md on easel/fizeau.
  2. Cite the canonical source for each fact in the row (project README, docs page, license file). Inline footnotes are the right shape.
  3. If the harness has a Terminal-Bench 2.1 leaderboard submission we missed, link the Hugging Face submission ID. The next refresh of data/aggregates.json (via scripts/benchmark/generate-report.py --refresh-leaderboard) will pick it up automatically.
  4. Do not add qualitative ranking (“best”, “easiest”, “fastest”). The table is a fact sheet; ranking belongs in the benchmarks pages where the per-task data backs it.

  1. Fizeau’s number is pass@k on 89 tasks, sourced from per_profile in aggregates.json (tasks_passed_any / tasks_touched for fiz-openrouter-qwen3-6-27b). External leaderboard rows are tasks_passed / tasks_attempted from harborframework/terminal-bench-2-leaderboard on Hugging Face. The counting rules differ — see How we measure↩︎

  2. Confirmed via OpenAI Codex docs and Ollama integration page. Codex CLI accepts any OpenAI-compatible base URL, so any local server that exposes that surface (Ollama, LM Studio, MLX, vLLM, llama.cpp’s server) works. ↩︎

  3. Confirmed via aider.chat/docs/llms.html↩︎

  4. “Expanded support for local models” listed on the JetBrains roadmap; not in the shipped 2026-Q1 beta as of this snapshot. ↩︎