Providers

Snapshot: 2026-05-13 15:38:48 UTC · Qwen3.6-27B across 6 provider/runtime combinations

The model weights are the same across every row here — Qwen3.6-27B in some quantization. The variable is everything else: where the bytes get computed, which serving engine runs them, what sampling defaults the server applies, whether prefix-cache is hit, and how much round-trip latency the network adds.

Hostnames are abstracted to the substantive characteristics. The descriptive label captures engine + quantization + GPU/CPU + OS — enough to map to a known-good machine spec without leaking inventory.

Pass-rate

Profile / Submission	canary (3 tasks)	openai-cheap (35 tasks)	full (15 tasks)	all (89 tasks)	Provider
local-vllm-rtx3090	33.3% (1/3)	16.7% (2/12)	6.7% (1/15)	12.5% (2/16)	vllm
local-lmstudio-qwen3-6-27b	0.0% (0/3)	0.0% (0/35)	0.0% (0/15)	0.0% (0/89)	lmstudio
fiz-openrouter-qwen3-6-27b	100.0% (3/3)	85.7% (30/35)	86.7% (13/15)	61.8% (55/89)	openrouter
local-rapidmlx-qwen3-6-27b	0.0% (0/3)	8.3% (1/12)	0.0% (0/15)	6.2% (1/16)	rapid-mlx
sindri-llamacpp	66.7% (2/3)	51.6% (16/31)	40.0% (6/15)	32.0% (24/75)	llama-server
local-omlx-qwen3-6-27b	100.0% (3/3)	61.8% (21/34)	73.3% (11/15)	38.3% (31/81)	omlx

Detailed metrics

Profile	Harness	Attempts	Real	pass@1	pass@k	med turns	med in	med out	med wall (s)	cost ($)	p50 TTFT (s)	p50 decode (tok/s)
vLLM int4 / NVIDIA GeForce RTX 5090 Laptop GPU (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 / Windows 11 host	fiz (built-in agent loop)	88	7	2.9%	12.5%	2	3,049	1,073	90	0.000	30.01	89.4
lmstudio / NVIDIA GeForce RTX 5090 Laptop GPU (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 / Windows 11 host	fiz (built-in agent loop)	267	0	0.0%	0.0%	—	—	—	—	0.000	—	—
OpenRouter (cloud aggregator)	fiz (built-in agent loop)	334	315	64.5%	61.8%	15	98,989	5,948	569	0.143	0.89	46.8
RapidMLX 8-bit / Apple M1 Max (64 GB unified)	fiz (built-in agent loop)	70	0	3.6%	6.2%	—	—	—	—	0.000	30.02	15.7
llama-server / NVIDIA RTX 3090 Ti (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2	fiz (built-in agent loop)	75	67	32.4%	32.0%	13	85,260	3,361	487	0.000	1.96	18.2
oMLX 8-bit / Apple M2 Ultra (24-core CPU) (192 GB unified)	fiz (built-in agent loop)	179	109	38.8%	38.3%	14	95,916	5,175	930	0.000	10.15	15.4

Performance vs context length

Per-turn TTFT (first-token latency) and steady-state decode tok/s, bucketed by input-token length of that turn. We bucket per turn rather than per task because the agent loop's input grows monotonically inside a single task — buckets reveal how each provider scales prefill and decode under increasing context.

Buckets: 0–10k, 10–30k, 30–60k, 60–120k, 120k+ tokens. Buckets with fewer than 5 turns of data are dropped to avoid noise.

Read this as: a profile that holds steady across buckets has a working KV-cache / prefix-cache; a profile whose TTFT slopes up sharply is recomputing prefill on every turn.

TTFT (seconds, lower is better)

Decode tok/s (higher is better)

Provider details

Provider details below use public profile labels. They publish enough information to interpret the benchmark results: provider surface, runtime family, model or quantization, sampling policy, context limits, and broad hardware class for self-hosted profiles.

OpenRouter Qwen3.6-27B

Profile: fiz-openrouter-qwen3-6-27b
Surface: managed OpenAI-compatible API through OpenRouter.
Model: qwen/qwen3-6-27b-instruct.
Sampling: temperature=0.6, top_p=0.95, top_k=20, reasoning low.
Limits: 128k advertised context, 32k max output.
Cost profile: low cash cost per run; current medians put it near the budget end of the hosted profiles.
Notes: provider-side routing and caching are opaque, so this profile is best read as the managed throughput reference for Qwen3.6-27B.

sindri-vllm

Profile: sindri-vllm
Surface: self-hosted vLLM on a local RTX-class CUDA workstation.
Model: Qwen3.6-27B AutoRound int4.
Sampling: temperature=0.6, top_p=0.95, top_k=20, reasoning low.
Limits: 180k advertised context, 32k max output; effective usable context depends on runtime memory pressure.
Notes: decode throughput is strong, but TTFT rises with long context. Prefix caching is the main performance lever for this profile.

sindri-llamacpp

Profile: sindri-llamacpp
Surface: self-hosted llama.cpp on the same local CUDA hardware class as sindri-vllm.
Model: Qwen3.6-27B Q3_K_XL quantization.
Sampling: temperature=0.6, top_p=0.95, top_k=20; provider-specific reasoning hints are not sent to llama.cpp.
Notes: this profile isolates runtime and quantization differences against the same broad hardware class as the vLLM profile.

local-vllm-rtx3090

Profile: local-vllm-rtx3090
Surface: self-hosted vLLM on a mobile RTX-class CUDA system.
Model: Qwen3.6-27B AutoRound int4.
Sampling: same as sindri-vllm.
Notes: this profile is wired up but not yet producing enough real reps for a comparable benchmark read.

local-omlx-qwen3-6-27b

Profile: local-omlx-qwen3-6-27b
Surface: self-hosted oMLX on an Apple silicon workstation class.
Model: Qwen3.6-27B MLX 8-bit.
Sampling: temperature=0.6, top_p=0.95, top_k=20, reasoning low.
Limits: 128k advertised context, 32k max output.
Notes: current results show slower TTFT and decode than the CUDA profiles at this model size.

local-rapidmlx-qwen3-6-27b

Profile: local-rapidmlx-qwen3-6-27b
Surface: self-hosted Rapid-MLX on an Apple silicon workstation class.
Model: Qwen3.6-27B MLX 8-bit.
Sampling: same as the oMLX profile.
Notes: this profile is not yet producing comparable real reps.

local-lmstudio-qwen3-6-27b

Profile: local-lmstudio-qwen3-6-27b
Surface: LM Studio alternate runtime.
Model: Qwen3.6-27B class local model.
Notes: this profile is a placeholder until it produces real reps.