Terminal-Bench 2.1
How we run it
Terminal-Bench 2.1 is a public coding-agent benchmark of 89 long-form tasks. Each task ships a prompt, an isolated Docker environment, and a deterministic verifier. An agent reads the prompt, runs shell commands, edits files inside the container, and is scored against the resulting state.
Each Fizeau profile runs through Harbor 0.3.x's installed-agent path. Harbor installs the agent runtime in the task container, runs the attempt, and then invokes the verifier separately. Profile configuration selects the provider, model, runtime, and harness without publishing private service locations. Each task runs five reps per profile; pass@1 is the per-rep success rate, and pass@k reports whether any of the five reps solved the task.
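As a minimal sketch of how the two pass metrics relate, the snippet below computes both from a flat list of graded reps. The function name `pass_metrics`, the `(task_id, reward)` input shape, and the sample rewards are illustrative assumptions, not the benchmark's actual reporting code or data.

```python
from collections import defaultdict


def pass_metrics(trials):
    """Compute (pass@1, pass@k) from graded reps.

    trials: iterable of (task_id, reward) pairs, one per graded rep.
    This input shape is an assumption made for illustration.
    """
    solved_by_task = defaultdict(list)
    for task_id, reward in trials:
        # A rep counts as solved when its reward is positive.
        solved_by_task[task_id].append(reward > 0)

    total_reps = sum(len(reps) for reps in solved_by_task.values())
    if total_reps == 0:
        raise ValueError("no graded reps")

    solved_reps = sum(sum(reps) for reps in solved_by_task.values())
    # pass@1: per-rep success rate over all graded reps.
    pass_at_1 = solved_reps / total_reps
    # pass@k: share of attempted tasks where any rep solved the task.
    pass_at_k = sum(any(reps) for reps in solved_by_task.values()) / len(solved_by_task)
    return pass_at_1, pass_at_k


# Hypothetical rewards for three tasks with five reps each.
trials = (
    [("task-a", r) for r in (1, 1, 0, 1, 1)]
    + [("task-b", r) for r in (0, 0, 1, 0, 0)]
    + [("task-c", r) for r in (0, 0, 0, 0, 0)]
)
p1, pk = pass_metrics(trials)
print(f"pass@1={p1:.3f} pass@k={pk:.3f}")
```

In this made-up example, 5 of 15 reps succeed while 2 of 3 tasks are solved at least once, so pass@k sits well above pass@1. That gap is a rough indicator of rep-to-rep variance.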
We slice the 89-task set into nested benchmarks of decreasing scope. The subset YAMLs are under scripts/benchmark/task-subset-tb21-*.yaml:
| Subset | Tasks | Selection rule |
|---|---|---|
| canary | 3 | 3-5 task canary covering SE, data-processing, and system-administration; one task per category; deterministic sort by difficulty desc then id asc |
| openai-cheap | 35 | observed native OpenAI GPT-5.5 average cost <= ~$0.90 per run where available; otherwise OpenRouter Qwen3.6 27B token count projected at GPT-5.5 pricing <= ~$1.00 per run; exclude known multi-dollar cells |
| full | 15 | filtered TB-2.1 tasks with fixed category quotas SE=5 security=3 file-ops=2 sysadmin=2 data-processing=2 debugging=1; difficulty-desc then id-asc |
| all | 89 | all 89 tasks from the Harbor terminal-bench/terminal-bench-2-1 task catalog |
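The quota-based rule for the full subset can be sketched as a deterministic sort followed by a greedy fill. This is an illustration under stated assumptions: the function `select_subset`, the dict fields, and the numeric difficulty rank are hypothetical, and the upstream filtering step the table mentions is not modeled.

```python
from collections import defaultdict


def select_subset(tasks, quotas):
    """Pick task ids per category quota in a deterministic order.

    tasks: list of dicts with 'id', 'category', and 'difficulty'
           (an integer rank, higher = harder; this encoding is assumed).
    quotas: mapping of category name to the number of tasks to keep.
    """
    # Difficulty descending, then id ascending, so ties break the same way every run.
    ordered = sorted(tasks, key=lambda t: (-t["difficulty"], t["id"]))
    taken = defaultdict(int)
    chosen = []
    for task in ordered:
        category = task["category"]
        if taken[category] < quotas.get(category, 0):
            taken[category] += 1
            chosen.append(task["id"])
    return chosen


# Hypothetical catalog entries, not real TB-2.1 task ids.
tasks = [
    {"id": "b-task", "category": "SE", "difficulty": 3},
    {"id": "a-task", "category": "SE", "difficulty": 3},
    {"id": "c-task", "category": "SE", "difficulty": 2},
    {"id": "d-task", "category": "security", "difficulty": 1},
    {"id": "e-task", "category": "security", "difficulty": 3},
]
chosen = select_subset(tasks, {"SE": 2, "security": 1})
print(chosen)
```

Because the sort key is total over difficulty and id, the same catalog and quotas always yield the same subset, which keeps runs comparable over time.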
Three perspectives on the same data
The trial set runs each task many ways. Each of the three sub-pages slices the data along a different axis:
- Models — fiz with its built-in agent loop across multiple models, on the cheap subset where cost lets us run real reps.
- Harnesses — same model, different agent loop. Includes external leaderboard rows for the same models in other harnesses.
- Providers — Qwen3.6-27B held constant, varying the host (cloud aggregator vs local CUDA vs Apple silicon). The harness/runtime story.
Headline observations
Qwen3.6-27B across providers (the headline question)
OpenRouter Qwen3.6-27B is the throughput reference. The local profiles bottleneck elsewhere:
- sindri-vllm (vLLM int4 on local CUDA): best decode rate, worst prefill. On agent loops with 50–150k context per turn, prefill dominates wall time, which explains why the median wall time is roughly 2× OpenRouter despite faster decode.
- local-omlx-qwen3-6-27b (oMLX 8-bit on Apple silicon): slow on both axes. MLX 8-bit at this model size is the rate limiter; only smaller quantization or a different runtime will move it.
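To see why a faster decoder can still lose on wall time, a rough per-turn model is time to first token plus generation time. The numbers below are placeholder values chosen only to show the shape of the effect; they are not measurements from any of the profiles, and `turn_wall_seconds` is a hypothetical helper.

```python
def turn_wall_seconds(ttft_s, output_tokens, decode_tok_per_s):
    """Approximate one turn's wall time as prefill wait plus decode time."""
    return ttft_s + output_tokens / decode_tok_per_s


# Placeholder profile A: quick prefill, moderate decode rate.
fast_prefill = turn_wall_seconds(ttft_s=5.0, output_tokens=400, decode_tok_per_s=40.0)

# Placeholder profile B: decode is twice as fast, but prefill is five times slower.
fast_decode = turn_wall_seconds(ttft_s=25.0, output_tokens=400, decode_tok_per_s=80.0)

print(fast_prefill, fast_decode)
```

With long contexts and short completions, the prefill term is the larger share of each turn, so doubling the decode rate recovers only a few seconds while slow prefill costs many more.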
Model-power signal vs harness loss
The scatter in section 6 mostly tracks the expected pattern: frontier-power models (Opus, GPT-5.5) sit at higher pass-rates than Qwen-class models. Several Qwen profiles still sit below the trend, which points to harness/runtime loss in addition to model capability.
Cost / reliability frontier
OpenRouter Qwen3.6-27B costs cash per run; local profiles cost $0 in cash but pay in wall time and reliability. For pure budget, OpenRouter Qwen wins; for ceiling-pass tasks where reliability matters, the frontier rows on the leaderboard remain ahead of any Qwen profile regardless of plumbing.
Open questions
- The oMLX profile's input-token median runs higher than the vLLM profile on the same task set. Either the MLX server replays full conversation context where vLLM compacts, or the agent loop runs more turns before the model converges. Worth a focused trace.
- sindri-vllm prefill latency is the single biggest performance lever: enabling vLLM `--enable-prefix-caching` (or boosting cache hit rate) should drop TTFT 5–10× and close most of the wall-time gap.
Method notes
- pass@1 = (graded reps with reward > 0) / (total graded reps). pass@k = unique tasks where any rep solved / unique tasks attempted. With reps=5 we do not report best-of-N because the reps are deliberately identical.
- Real runs = trials with `turns > 0` AND any tokens flowed. Filters out `invalid_setup`, network, container-startup, and zero-turn timeouts so per-trial medians (turns, tokens, wall) reflect actual model interaction.
- TTFT = (first `llm.delta` event ts) − (matching `llm.request` ts) per turn, in seconds.
- Decode tok/s = `output_tokens / (response.ts − first_delta.ts)` per turn, i.e. the post-prefill generation rate.
- Both timing metrics report as median-of-per-task-medians to dampen rep variance and outlier turns. Per-bucket timing requires ≥5 turns in the bucket to plot.
- Provider-side latency (TTFT including queue and prefill) and pure decode stay separate so wall-time can be attributed to prefill vs generation.
- External leaderboard data is the count of `reward.txt` files per submission per task on `harborframework/terminal-bench-2-leaderboard` on Hugging Face. We report `tasks_passed / tasks_attempted` rather than per-rep pass@1 because the leaderboard does not expose per-rep granularity uniformly.
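The timing definitions above can be sketched as follows. This is a minimal illustration, assuming each turn has already been reduced to three timestamps and an output-token count; the field names, the helper functions, and the sample values are hypothetical. The real-run filter here treats a positive output-token total as a stand-in for "any tokens flowed", and the ≥5-turn bucket rule is omitted.

```python
from collections import defaultdict
from statistics import median


def turn_metrics(turn):
    """Return (ttft_seconds, decode_tok_per_s or None) for one turn."""
    ttft = turn["first_delta_ts"] - turn["request_ts"]
    decode_window = turn["response_ts"] - turn["first_delta_ts"]
    # Guard against a zero-length decode window (for example, a single-delta response).
    rate = turn["output_tokens"] / decode_window if decode_window > 0 else None
    return ttft, rate


def median_of_task_medians(trials):
    """Aggregate per-turn timings as the median of per-task medians."""
    ttft_by_task = defaultdict(list)
    rate_by_task = defaultdict(list)
    for trial in trials:
        turns = trial["turns"]
        # Real-run filter (assumed form): at least one turn and some output tokens.
        if not turns or sum(t["output_tokens"] for t in turns) == 0:
            continue
        for turn in turns:
            ttft, rate = turn_metrics(turn)
            ttft_by_task[trial["task_id"]].append(ttft)
            if rate is not None:
                rate_by_task[trial["task_id"]].append(rate)
    ttft_median = median(median(values) for values in ttft_by_task.values())
    rate_median = median(median(values) for values in rate_by_task.values())
    return ttft_median, rate_median


# Hypothetical trials; timestamps are seconds.
trials = [
    {"task_id": "task-a", "turns": [
        {"request_ts": 0.0, "first_delta_ts": 2.0, "response_ts": 12.0, "output_tokens": 100},
        {"request_ts": 20.0, "first_delta_ts": 24.0, "response_ts": 29.0, "output_tokens": 100},
    ]},
    {"task_id": "task-b", "turns": [
        {"request_ts": 0.0, "first_delta_ts": 1.0, "response_ts": 5.0, "output_tokens": 120},
    ]},
    {"task_id": "task-c", "turns": []},  # zero-turn trial, dropped by the filter
]
ttft_s, decode_rate = median_of_task_medians(trials)
print(ttft_s, decode_rate)
```

Taking the median within each task first keeps one long, many-turn task from dominating the aggregate, which is the stated reason for the median-of-per-task-medians choice.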