Terminal-Bench 2.1

Snapshot: 2026-05-13 15:38:48 UTC · 2,717 trial reports · 23 active profiles

How we run it

Terminal-Bench 2.1 is a public coding-agent benchmark of 89 long-form tasks. Each task ships a prompt, an isolated Docker environment, and a deterministic verifier. An agent reads the prompt, runs shell commands, edits files inside the container, and is scored against the resulting state.

Each Fizeau profile runs through Harbor 0.3.x's installed-agent path. Harbor installs the agent runtime in the task container, runs the attempt, and then invokes the verifier separately. Profile configuration selects the provider, model, runtime, and harness without publishing private service locations. Each task runs five reps per profile; pass@1 is the per-rep success rate, and pass@k reports whether any of the five reps solved the task.
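
As a rough illustration of the two rates, the sketch below computes pass@1 and pass@k from a flat list of graded reps, following the definitions given in the method notes. The `(task_id, reward)` tuple layout and the task names are assumptions made for this example, not the format of the actual trial reports.

```python
from collections import defaultdict

def pass_at_1(reps):
    """Share of graded reps with reward > 0.

    reps: iterable of (task_id, reward) pairs (hypothetical layout).
    """
    reps = list(reps)
    if not reps:
        return 0.0
    return sum(1 for _, reward in reps if reward > 0) / len(reps)

def pass_at_k(reps):
    """Share of attempted tasks where at least one rep has reward > 0."""
    solved = defaultdict(bool)
    for task_id, reward in reps:
        solved[task_id] = solved[task_id] or reward > 0
    if not solved:
        return 0.0
    return sum(solved.values()) / len(solved)

# Two made-up tasks with five reps each: task-a is solved once, task-b never.
reps = [
    ("task-a", 1.0), ("task-a", 0.0), ("task-a", 0.0), ("task-a", 0.0), ("task-a", 0.0),
    ("task-b", 0.0), ("task-b", 0.0), ("task-b", 0.0), ("task-b", 0.0), ("task-b", 0.0),
]
print(pass_at_1(reps))  # 0.1 (1 of 10 reps passed)
print(pass_at_k(reps))  # 0.5 (1 of 2 tasks solved by any rep)
```

The gap between the two numbers is the point: a task that passes on one rep in five counts fully toward pass@k but only 20% toward pass@1.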

We slice the 89-task set into nested benchmarks of decreasing scope. The subset YAMLs are under scripts/benchmark/task-subset-tb21-*.yaml:

Subset | Tasks | Selection rule
canary | 3 | 3-5 task canary covering SE, data-processing, and system-administration; one task per category; deterministic sort by difficulty desc then id asc
openai-cheap | 35 | observed native OpenAI GPT-5.5 average cost <= ~$0.90 per run where available; otherwise OpenRouter Qwen3.6 27B token count projected at GPT-5.5 pricing <= ~$1.00 per run; exclude known multi-dollar cells
full | 15 | filtered TB-2.1 tasks with fixed category quotas SE=5 security=3 file-ops=2 sysadmin=2 data-processing=2 debugging=1; difficulty-desc then id-asc
all | 89 | all 89 tasks from the Harbor terminal-bench/terminal-bench-2-1 task catalog
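
The quota-based rule used for the full subset (fixed per-category quotas, ordered by difficulty descending and then id ascending) could be sketched as below. This is an illustration under assumptions, not the script that produced the subset YAMLs: the task record fields, the string difficulty labels, and the `DIFFICULTY_RANK` mapping are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ranking of difficulty labels; higher means harder.
DIFFICULTY_RANK = {"easy": 0, "medium": 1, "hard": 2}

def select_subset(tasks, quotas):
    """Pick task ids by category quota in a deterministic order.

    tasks:  list of dicts with 'id', 'category', 'difficulty' (assumed fields).
    quotas: dict mapping category -> number of tasks to take.
    """
    # Hardest first; ties broken by id ascending so the result is reproducible.
    ordered = sorted(tasks, key=lambda t: (-DIFFICULTY_RANK[t["difficulty"]], t["id"]))
    taken = defaultdict(int)
    picked = []
    for task in ordered:
        category = task["category"]
        if taken[category] < quotas.get(category, 0):
            taken[category] += 1
            picked.append(task["id"])
    return picked

# Made-up catalog entries, for illustration only.
tasks = [
    {"id": "b-task", "category": "SE", "difficulty": "hard"},
    {"id": "a-task", "category": "SE", "difficulty": "hard"},
    {"id": "c-task", "category": "SE", "difficulty": "medium"},
    {"id": "d-task", "category": "security", "difficulty": "easy"},
]
print(select_subset(tasks, {"SE": 2, "security": 1}))
# ['a-task', 'b-task', 'd-task']
```

Because the sort key is fully determined by difficulty and id, rerunning the selection on the same catalog yields the same subset.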

Three perspectives on the same data

The trial set covers each task under many configurations. Each of the three sub-pages slices the data along a different axis:

  • Models — fiz with its built-in agent loop across multiple models, on the cheap subset where cost lets us run real reps.
  • Harnesses — same model, different agent loop. Includes external leaderboard rows for the same models in other harnesses.
  • Providers — Qwen3.6-27B held constant, varying the host (cloud aggregator vs local CUDA vs Apple silicon). The harness/runtime story.

Headline observations

Qwen3.6-27B across providers (the headline question)

OpenRouter Qwen3.6-27B is the throughput reference. The local profiles bottleneck elsewhere:

  • sindri-vllm (vLLM int4 on local CUDA): best decode rate, worst prefill. On agent loops with 50–150k context per turn, prefill dominates wall time, which explains why the median wall time is roughly 2× OpenRouter's despite faster decode.
  • local-omlx-qwen3-6-27b (oMLX 8-bit on Apple silicon): slow on both axes. MLX 8-bit at this model size is the rate limiter; only a lower-bit quantization or a different runtime will move it.
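
To see why a faster decode rate does not guarantee a shorter turn, a per-turn wall time can be approximated from the two timing metrics defined in the method notes: TTFT plus output tokens divided by the decode rate. The sketch below uses made-up placeholder numbers chosen only to show the shape of the effect; they are not measurements from any profile in this report.

```python
def turn_wall_seconds(ttft_s, decode_tok_per_s, output_tokens):
    """Approximate one turn's wall time as prefill wait plus generation time."""
    return ttft_s + output_tokens / decode_tok_per_s

# Placeholder values, NOT measured data.
OUTPUT_TOKENS = 400
reference = turn_wall_seconds(ttft_s=2.0, decode_tok_per_s=40.0, output_tokens=OUTPUT_TOKENS)
slow_prefill = turn_wall_seconds(ttft_s=20.0, decode_tok_per_s=100.0, output_tokens=OUTPUT_TOKENS)

print(reference)                 # 12.0
print(slow_prefill)              # 24.0
print(slow_prefill / reference)  # 2.0
```

In this toy case the second configuration decodes 2.5× faster yet takes twice as long per turn, because the time to first token dwarfs the generation time.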

Model-power signal vs harness loss

The scatter in section 6 mostly tracks the expected pattern: frontier-power models (Opus, GPT-5.5) sit at higher pass rates than Qwen-class models. Several Qwen profiles still sit below the trend, which points to harness/runtime loss in addition to model capability.

Cost / reliability frontier

OpenRouter Qwen3.6-27B incurs a cash cost per run; local profiles cost $0 in cash but pay in wall time and reliability. On pure budget, OpenRouter Qwen wins; on ceiling-pass tasks where reliability matters, the frontier rows on the leaderboard remain ahead of any Qwen profile regardless of plumbing.

Open questions

  • The oMLX profile's input-token median runs higher than the vLLM profile on the same task set. Either the MLX server replays full conversation context where vLLM compacts, or the agent loop runs more turns before the model converges. Worth a focused trace.
  • sindri-vllm prefill latency is the single biggest performance lever: enabling vLLM --enable-prefix-caching (or boosting cache hit rate) should drop TTFT 5–10× and close most of the wall-time gap.

Method notes

  • pass@1 = (graded reps with reward > 0) / (total graded reps). pass@k = unique tasks where any rep solved / unique tasks attempted. With reps=5 we do not report best-of-N because the reps are deliberately identical.
  • Real runs = trials with turns > 0 AND at least one token exchanged. This filters out invalid_setup, network, container-startup, and zero-turn timeouts so that per-trial medians (turns, tokens, wall) reflect actual model interaction.
  • TTFT = (first llm.delta event ts) − (matching llm.request ts) per turn, in seconds.
  • Decode tok/s = output_tokens / (response.ts − first_delta.ts) per turn — post-prefill generation rate.
  • Both timing metrics are reported as the median of per-task medians to dampen rep variance and outlier turns. Per-bucket timing is plotted only when the bucket contains ≥5 turns.
  • Provider-side latency (TTFT including queue and prefill) and pure decode stay separate so wall-time can be attributed to prefill vs generation.
  • External leaderboard data is the count of reward.txt files per submission per task on harborframework/terminal-bench-2-leaderboard on Hugging Face. We report tasks_passed / tasks_attempted rather than per-rep pass@1 because the leaderboard does not expose per-rep granularity uniformly.
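
The timing definitions above can be sketched in a few lines. The event type names `llm.request` and `llm.delta` come from the notes; the dict-based event layout, the `llm.response` type name, and the field names `type`, `ts`, and `output_tokens` are assumptions made for this example rather than the actual log schema.

```python
from statistics import median

def turn_metrics(events):
    """Return (ttft_seconds, decode_tok_per_s) for the events of one turn.

    events: list of dicts with 'type' and 'ts' (assumed layout); the
    'llm.response' event also carries 'output_tokens'.
    """
    request_ts = next(e["ts"] for e in events if e["type"] == "llm.request")
    first_delta_ts = min(e["ts"] for e in events if e["type"] == "llm.delta")
    response = next(e for e in events if e["type"] == "llm.response")
    ttft = first_delta_ts - request_ts
    decode_window = response["ts"] - first_delta_ts
    # Guard against a zero-length window (e.g. a single-delta response).
    decode = response["output_tokens"] / decode_window if decode_window > 0 else None
    return ttft, decode

def median_of_task_medians(values_by_task):
    """Take the median per task first, then the median across tasks."""
    per_task = [median(values) for values in values_by_task.values() if values]
    return median(per_task) if per_task else None

# One made-up turn: request at t=10.0, first delta at 12.5, response at 16.5.
turn = [
    {"type": "llm.request", "ts": 10.0},
    {"type": "llm.delta", "ts": 12.5},
    {"type": "llm.delta", "ts": 13.0},
    {"type": "llm.response", "ts": 16.5, "output_tokens": 200},
]
print(turn_metrics(turn))  # (2.5, 50.0)
print(median_of_task_medians({"task-a": [1.0, 2.0, 9.0], "task-b": [4.0, 6.0]}))  # 3.5
```

Taking the per-task median before aggregating keeps a single task with many turns, or one outlier turn, from dominating the reported figure.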