Skip to content

Providers

Snapshot: 2026-05-13 15:38:48 UTC · Qwen3.6-27B across 6 provider/runtime combinations

The model weights are the same across every row here — Qwen3.6-27B in some quantization. The variable is everything else: where the bytes get computed, which serving engine runs them, what sampling defaults the server applies, whether prefix-cache is hit, and how much round-trip latency the network adds.

Hostnames are abstracted to the substantive characteristics. The descriptive label captures engine + quantization + GPU/CPU + OS — enough to map to a known-good machine spec without leaking inventory.

Pass-rate

Profile / Submissioncanary (3 tasks)openai-cheap (35 tasks)full (15 tasks)all (89 tasks)Provider
local-vllm-rtx309033.3% (1/3)16.7% (2/12)6.7% (1/15)12.5% (2/16)vllm
local-lmstudio-qwen3-6-27b0.0% (0/3)0.0% (0/35)0.0% (0/15)0.0% (0/89)lmstudio
fiz-openrouter-qwen3-6-27b100.0% (3/3)85.7% (30/35)86.7% (13/15)61.8% (55/89)openrouter
local-rapidmlx-qwen3-6-27b0.0% (0/3)8.3% (1/12)0.0% (0/15)6.2% (1/16)rapid-mlx
sindri-llamacpp66.7% (2/3)51.6% (16/31)40.0% (6/15)32.0% (24/75)llama-server
local-omlx-qwen3-6-27b100.0% (3/3)61.8% (21/34)73.3% (11/15)38.3% (31/81)omlx

Detailed metrics

ProfileHarnessAttemptsRealpass@1pass@kmed turnsmed inmed outmed wall (s)cost ($)p50 TTFT (s)p50 decode (tok/s)
vLLM int4 / NVIDIA GeForce RTX 5090 Laptop GPU (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 / Windows 11 hostfiz (built-in agent loop)8872.9%12.5%23,0491,073900.00030.0189.4
lmstudio / NVIDIA GeForce RTX 5090 Laptop GPU (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2 / Windows 11 hostfiz (built-in agent loop)26700.0%0.0%0.000
OpenRouter (cloud aggregator)fiz (built-in agent loop)33431564.5%61.8%1598,9895,9485690.1430.8946.8
RapidMLX 8-bit / Apple M1 Max (64 GB unified)fiz (built-in agent loop)7003.6%6.2%0.00030.0215.7
llama-server / NVIDIA RTX 3090 Ti (24 GB) / Ubuntu 24.04.4 LTS (Noble Numbat) on WSL2fiz (built-in agent loop)756732.4%32.0%1385,2603,3614870.0001.9618.2
oMLX 8-bit / Apple M2 Ultra (24-core CPU) (192 GB unified)fiz (built-in agent loop)17910938.8%38.3%1495,9165,1759300.00010.1515.4

Performance vs context length

Per-turn TTFT (first-token latency) and steady-state decode tok/s, bucketed by input-token length of that turn. We bucket per turn rather than per task because the agent loop's input grows monotonically inside a single task — buckets reveal how each provider scales prefill and decode under increasing context.

Buckets: 0–10k, 10–30k, 30–60k, 60–120k, 120k+ tokens. Buckets with fewer than 5 turns of data are dropped to avoid noise.

Read this as: a profile that holds steady across buckets has a working KV-cache / prefix-cache; a profile whose TTFT slopes up sharply is recomputing prefill on every turn.

TTFT (seconds, lower is better)

ttft-by-context.svg

Decode tok/s (higher is better)

decode-by-context.svg

Provider details

Provider details below use public profile labels. They publish enough information to interpret the benchmark results: provider surface, runtime family, model or quantization, sampling policy, context limits, and broad hardware class for self-hosted profiles.

OpenRouter Qwen3.6-27B

  • Profile: fiz-openrouter-qwen3-6-27b
  • Surface: managed OpenAI-compatible API through OpenRouter.
  • Model: qwen/qwen3-6-27b-instruct.
  • Sampling: temperature=0.6, top_p=0.95, top_k=20, reasoning low.
  • Limits: 128k advertised context, 32k max output.
  • Cost profile: low cash cost per run; current medians put it near the budget end of the hosted profiles.
  • Notes: provider-side routing and caching are opaque, so this profile is best read as the managed throughput reference for Qwen3.6-27B.

sindri-vllm

  • Profile: sindri-vllm
  • Surface: self-hosted vLLM on a local RTX-class CUDA workstation.
  • Model: Qwen3.6-27B AutoRound int4.
  • Sampling: temperature=0.6, top_p=0.95, top_k=20, reasoning low.
  • Limits: 180k advertised context, 32k max output; effective usable context depends on runtime memory pressure.
  • Notes: decode throughput is strong, but TTFT rises with long context. Prefix caching is the main performance lever for this profile.

sindri-llamacpp

  • Profile: sindri-llamacpp
  • Surface: self-hosted llama.cpp on the same local CUDA hardware class as sindri-vllm.
  • Model: Qwen3.6-27B Q3_K_XL quantization.
  • Sampling: temperature=0.6, top_p=0.95, top_k=20; provider-specific reasoning hints are not sent to llama.cpp.
  • Notes: this profile isolates runtime and quantization differences against the same broad hardware class as the vLLM profile.

local-vllm-rtx3090

  • Profile: local-vllm-rtx3090
  • Surface: self-hosted vLLM on a mobile RTX-class CUDA system.
  • Model: Qwen3.6-27B AutoRound int4.
  • Sampling: same as sindri-vllm.
  • Notes: this profile is wired up but not yet producing enough real reps for a comparable benchmark read.

local-omlx-qwen3-6-27b

  • Profile: local-omlx-qwen3-6-27b
  • Surface: self-hosted oMLX on an Apple silicon workstation class.
  • Model: Qwen3.6-27B MLX 8-bit.
  • Sampling: temperature=0.6, top_p=0.95, top_k=20, reasoning low.
  • Limits: 128k advertised context, 32k max output.
  • Notes: current results show slower TTFT and decode than the CUDA profiles at this model size.

local-rapidmlx-qwen3-6-27b

  • Profile: local-rapidmlx-qwen3-6-27b
  • Surface: self-hosted Rapid-MLX on an Apple silicon workstation class.
  • Model: Qwen3.6-27B MLX 8-bit.
  • Sampling: same as the oMLX profile.
  • Notes: this profile is not yet producing comparable real reps.

local-lmstudio-qwen3-6-27b

  • Profile: local-lmstudio-qwen3-6-27b
  • Surface: LM Studio alternate runtime.
  • Model: Qwen3.6-27B class local model.
  • Notes: this profile is a placeholder until it produces real reps.