Providers
The model weights are the same across every row here — Qwen3.6-27B in some quantization. The variable is everything else: where the bytes get computed, which serving engine runs them, what sampling defaults the server applies, whether prefix-cache is hit, and how much round-trip latency the network adds.
Hostnames are abstracted to the substantive characteristics. The descriptive label captures engine + quantization + GPU/CPU + OS — enough to map to a known-good machine spec without leaking inventory.
Pass-rate
| Profile / Submission | canary (3 tasks) | openai-cheap (35 tasks) | full (15 tasks) | all (89 tasks) | Provider |
|---|---|---|---|---|---|
| local-vllm-rtx3090 | 33.3% | 16.7% | 6.7% | 12.5% | |
| local-lmstudio-qwen3-6-27b | 0.0% | 0.0% | 0.0% | 0.0% | |
| fiz-openrouter-qwen3-6-27b | 100.0% | 85.7% | 86.7% | 61.8% | |
| local-rapidmlx-qwen3-6-27b | 0.0% | 8.3% | 0.0% | 6.2% | |
| sindri-llamacpp | 66.7% | 51.6% | 40.0% | 32.0% | |
| local-omlx-qwen3-6-27b | 100.0% | 61.8% | 73.3% | 38.3% |
Detailed metrics
| Profile | Harness | Attempts | Real | pass@1 | pass@k | med turns | med in | med out | med wall (s) | cost ($) | p50 TTFT (s) | p50 decode (tok/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM int4 / NVIDIA GeForce RTX 5090 Laptop GPU (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 / Windows 11 host | 88 | 7 | 2.9% | 12.5% | 2 | 3,049 | 1,073 | 90 | 0.000 | 30.01 | 89.4 | |
| lmstudio / NVIDIA GeForce RTX 5090 Laptop GPU (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 / Windows 11 host | 267 | 0 | 0.0% | 0.0% | — | — | — | — | 0.000 | — | — | |
| OpenRouter (cloud aggregator) | 334 | 315 | 64.5% | 61.8% | 15 | 98,989 | 5,948 | 569 | 0.143 | 0.89 | 46.8 | |
| RapidMLX 8-bit / Apple M1 Max (64 GB unified) | 70 | 0 | 3.6% | 6.2% | — | — | — | — | 0.000 | 30.02 | 15.7 | |
| llama-server / NVIDIA RTX 3090 Ti (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 | 75 | 67 | 32.4% | 32.0% | 13 | 85,260 | 3,361 | 487 | 0.000 | 1.96 | 18.2 | |
| oMLX 8-bit / Apple M2 Ultra (24-core CPU) (192 GB unified) | 179 | 109 | 38.8% | 38.3% | 14 | 95,916 | 5,175 | 930 | 0.000 | 10.15 | 15.4 |
Performance vs context length
Per-turn TTFT (first-token latency) and steady-state decode tok/s, bucketed by input-token length of that turn. We bucket per turn rather than per task because the agent loop's input grows monotonically inside a single task — buckets reveal how each provider scales prefill and decode under increasing context.
Buckets: 0–10k, 10–30k, 30–60k, 60–120k, 120k+ tokens. Buckets with fewer than 5 turns of data are dropped to avoid noise.
Read this as: a profile that holds steady across buckets has a working KV-cache / prefix-cache; a profile whose TTFT slopes up sharply is recomputing prefill on every turn.
TTFT (seconds, lower is better)
Decode tok/s (higher is better)
Provider details
Provider details below use public profile labels. They publish enough information to interpret the benchmark results: provider surface, runtime family, model or quantization, sampling policy, context limits, and broad hardware class for self-hosted profiles.
OpenRouter Qwen3.6-27B
- Profile:
fiz-openrouter-qwen3-6-27b - Surface: managed OpenAI-compatible API through OpenRouter.
- Model:
qwen/qwen3-6-27b-instruct. - Sampling:
temperature=0.6,top_p=0.95,top_k=20, reasoninglow. - Limits: 128k advertised context, 32k max output.
- Cost profile: low cash cost per run; current medians put it near the budget end of the hosted profiles.
- Notes: provider-side routing and caching are opaque, so this profile is best read as the managed throughput reference for Qwen3.6-27B.
sindri-vllm
- Profile:
sindri-vllm - Surface: self-hosted vLLM on a local RTX-class CUDA workstation.
- Model: Qwen3.6-27B AutoRound int4.
- Sampling:
temperature=0.6,top_p=0.95,top_k=20, reasoninglow. - Limits: 180k advertised context, 32k max output; effective usable context depends on runtime memory pressure.
- Notes: decode throughput is strong, but TTFT rises with long context. Prefix caching is the main performance lever for this profile.
sindri-llamacpp
- Profile:
sindri-llamacpp - Surface: self-hosted llama.cpp on the same local CUDA hardware class as
sindri-vllm. - Model: Qwen3.6-27B Q3_K_XL quantization.
- Sampling:
temperature=0.6,top_p=0.95,top_k=20; provider-specific reasoning hints are not sent to llama.cpp. - Notes: this profile isolates runtime and quantization differences against the same broad hardware class as the vLLM profile.
local-vllm-rtx3090
- Profile:
local-vllm-rtx3090 - Surface: self-hosted vLLM on a mobile RTX-class CUDA system.
- Model: Qwen3.6-27B AutoRound int4.
- Sampling: same as
sindri-vllm. - Notes: this profile is wired up but not yet producing enough real reps for a comparable benchmark read.
local-omlx-qwen3-6-27b
- Profile:
local-omlx-qwen3-6-27b - Surface: self-hosted oMLX on an Apple silicon workstation class.
- Model: Qwen3.6-27B MLX 8-bit.
- Sampling:
temperature=0.6,top_p=0.95,top_k=20, reasoninglow. - Limits: 128k advertised context, 32k max output.
- Notes: current results show slower TTFT and decode than the CUDA profiles at this model size.
local-rapidmlx-qwen3-6-27b
- Profile:
local-rapidmlx-qwen3-6-27b - Surface: self-hosted Rapid-MLX on an Apple silicon workstation class.
- Model: Qwen3.6-27B MLX 8-bit.
- Sampling: same as the oMLX profile.
- Notes: this profile is not yet producing comparable real reps.
local-lmstudio-qwen3-6-27b
- Profile:
local-lmstudio-qwen3-6-27b - Surface: LM Studio alternate runtime.
- Model: Qwen3.6-27B class local model.
- Notes: this profile is a placeholder until it produces real reps.