ADR-007: Sampling Profiles Belong in the Model Catalog
| Date | Status | Deciders | Related | Confidence |
|---|---|---|---|---|
| 2026-04-27 | Accepted | Fizeau maintainers | ADR-005, ADR-006 | High |
Context
The native agent harness sends no sampling fields (temperature, top_p, top_k, min_p, repetition_penalty) to providers by default. The wire path at internal/provider/openai/openai.go:227-256 honors any non-nil values on agent.Options and silently omits the rest, but the only source feeding those options today is providers.<name>.sampling in user config (cmd/agent/main.go:241-258). For a fresh install, every request is field-omitted.
Field-omitted means the server picks. On oMLX (vidar, grendel) and on most OpenAI-compat local servers, an omitted temperature defaults to 0.0 — greedy / argmax decoding. This was confirmed empirically against Qwen3.6-27B-MLX-8bit on both vidar:1235 and grendel:1235 on 2026-04-27: identical bodies produced bit-identical outputs across two runs, and the server’s own “code” preset did not override client values when the client supplied them.
Greedy decoding combined with reasoning-mode Qwen3.x models causes deterministic tool-call loops — the same tool invocation with the same arguments emits repeatedly until the harness’s loop-detector aborts. This is the failure mode visible in the 2026-04-27 harness-parity run (benchmark-results/beadbench/run-20260427T122221Z-1549045/): four of five local arms failed before producing output, with the one that produced output failing on agent: identical tool calls repeated, aborting loop.
The Qwen team’s own Hugging Face model cards for the reasoning-capable Qwen3 family (Qwen3-8B, Qwen3-30B-A3B, Qwen3-235B-A22B) include an explicit warning in the enable_thinking=True section: “DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.” Our bug is the harness doing exactly what the upstream guidance says not to.
The empirical fix is a non-greedy sampler bundle. The Qwen3.6-27B model card “Best Practices” section recommends T=0.6, top_p=0.95, top_k=20 for thinking-mode precise-coding tasks, with presence_penalty=0 and repetition_penalty=1.0 (i.e., both penalties disabled). The plumbing to deliver these values exists. What is missing is policy for where the values come from.
Decision
1. Sampling is catalog policy, not user configuration
Per ADR-006, manual overrides are routing-failure signals. The same logic applies to sampling: a user reaching for --temperature or editing providers.*.sampling to make a model decode usefully is a signal that the catalog failed to carry the right defaults. The right home for sampler policy is the model catalog, exactly where reasoning_levels and reasoning_control already live.
2. Three resolution layers
Sampling values resolve through a precedence chain. Each layer can set any subset of fields; nil at all layers means the wire field is omitted and the server’s own default applies (the “let the server handle it” case is preserved as a first-class outcome).
L1 — Catalog
sampling_profiles(manifest top level). Named bundles keyed by use-case:code, eventuallycreative,tool-loop,review. The active profile is selected by the caller; v1 hardcodescode.L2 — Provider config (
providers.*.sampling) (existing). Per-(user, provider) override. Already implemented; semantics unchanged.L3 — CLI flags (deferred, not in v1). Per-request override.
Higher layers stomp lower layers per field, not per bundle. If L1 sets {T:0.6, top_p:0.95, top_k:20} and L2 sets {temperature: 0}, the resolved bundle is {T:0, top_p:0.95, top_k:20}.
A per-(model, profile) override layer between L1 and L2 is explicitly out of scope for v1. It is recoverable additively if a real model demands a profile-divergent bundle; until then it is speculation.
3. Sampling composes with reasoning at the provider seam
The resolver does not know what reasoning encoding will be sent on the wire, and reasoning-mode models can want different sampler bundles than non-thinking-mode runs of the same model. To avoid the resolver and reasoningRequestOptions (internal/provider/openai/openai.go:286) silently fighting on the same request body, the openai-compat provider is the single owner of final wire-field composition. The provider receives the resolved sampling profile and the reasoning policy and is the canonical home for any future clipping or substitution rule.
For v1, the seam ships without an active composition rule: the seeded code profile (T=0.6, top_p=0.95, top_k=20) happens to match Qwen’s published “thinking-mode precise-coding” recommendation exactly, and is also non-greedy enough to be safe in the non-thinking-mode case (slightly off the upstream non-thinking optimum, but never in the loop-inducing regime). The provider seam is established as the architectural home for (model_family × reasoning_state × profile) clipping; it just doesn’t have a rule to apply yet. When a future profile adds a value that diverges between thinking and non-thinking states for some family, the rule lives at the seam — not in the resolver, not in the catalog.
This keeps the catalog flat (one sampling_profiles.code bundle, not a code × thinking-state matrix) and concentrates the interaction logic at one site when it appears.
4. Wrapped harnesses do not honor catalog sampling
Pi, codex, and claude-code drive their own SDKs and pin samplers internally. The catalog’s sampling_profiles field is descriptive metadata only for runs that resolve to those harnesses; nothing flows on the wire. The resolver returns a zero-value profile when selection.Harness != "agent", by contract.
ModelEntry.sampling_control records this state:
client_settable(default) — provider honors all five fields.harness_pinned— provider drops everything; resolver short-circuits.partial(reserved, not implemented in v1) — provider honors a subset (e.g., Anthropic Messages API:temperature,top_p,top_konly). The honored-field list is part of the catalog entry when this state is used.
The Anthropic-direct path is partial in principle but ships in v1 as client_settable with its provider-side filter unchanged; deferring the field-list enforcement is an acceptable v1 simplification until a sampling profile is seeded that includes Anthropic-incompatible fields.
5. Telemetry: source-of-truth per request
LLMRequestData (internal/session/event.go:31) gains a sampling_source field. Values: catalog, provider_config, cli, or comma-separated when fields come from multiple layers. This is the routing-failure signal for sampling, mirroring ADR-006’s override telemetry. Without it, post-hoc analysis cannot tell whether a degraded run used catalog defaults or a stale per-provider override.
6. No client-side range validation
Out-of-range values (top_p=2.5, temperature=-1) pass through unchanged. The server is the authoritative checker and will reject. We do not maintain per-server ranges.
7. Manifest evolution and feature degradation
The model catalog is a versioned data artifact published on a cadence independent of binary releases (see plan-2026-04-10-catalog-distribution-and-refresh.md, CONVERGED). The published source of truth lives at the project’s catalog channel (https://documentdrivendx.github.io/agent/catalog by default); the embedded copy is a bootstrap snapshot used only when no external manifest is installed.
This decoupling drives three rules for any feature — including ADR-007 itself — that depends on new catalog data:
Additive schema evolution. New fields are added within the existing
version. The YAML decoder operates non-strictly (verified:internal/modelcatalog/manifest.godoes not enableKnownFields/Strictmode), so old code reading a new manifest silently ignores unknown keys, and new code reading an old manifest sees missing fields as nil/zero. Schema-version bumps are reserved for breaking changes.Graceful degradation when the installed manifest predates the feature. Code that depends on new catalog data MUST behave correctly when those fields are absent. For sampling profiles specifically: the resolver’s L1 lookup fails silently when
sampling_profiles.codeis missing, returning a zero-value bundle and falling through to L2 (provider config) or to the server’s own defaults. The user does not get the new behavior, but no run breaks.catalog_versionbumps are the user-facing pull signal. Whenever a code release introduces a feature that depends on new catalog data, the embedded manifest’scatalog_versionMUST bump alongside it. This drivesfiz catalog checkto report an update is available. Optionally, the published bundle’smin_fizeau_versioncan be raised when the manifest itself depends on new code semantics — but that direction is for protecting old binaries from new manifests, not for protecting new binaries from old manifests. (Forward-looking; no such field exists ininternal/modelcatalog/today.)One actionable nudge per feature, not silent regressions. A feature whose graceful-degraded behavior is materially worse than its catalog-on behavior MUST log a single, actionable warning at first-use noting that the installed manifest predates the feature and pointing at
fiz catalog update. For ADR-007 v1: when the resolver is called with acodeprofile name and the catalog returns no profile, log once. Repeat warnings per-process are noise; per-session-per-feature is the right granularity.
Embedded manifest edits never silently override an installed external manifest. The embedded snapshot’s role is bootstrap-only.
Consequences
Positive:
- Native agent on Qwen3.x stops hitting greedy-decoding tool-call loops. The 2026-04-27 harness-parity failure mode is addressed by a single seeded
codeprofile. - ADR-006 invariant (overrides are signals) extends cleanly to sampling: any
provider_configorclisource onsampling_sourceis a candidate datum for catalog-default tightening. - Per-(model × profile) overrides remain available as an additive change without rewrite.
Negative:
- One new manifest dimension and one new struct field add catalog complexity.
- The reasoning × sampling composition rule lives at the provider seam, not in the resolver. Anyone touching
reasoningRequestOptionsmust remember it co-owns wire fields with sampling. The provider seam is documented but the coupling is real. - Wrapped-harness rows carry a sampling field that is not enforced on the wire — descriptive metadata invites confusion in catalog audits.
Mitigations:
- The implementation bead carries a table-driven test of the per-field merge and a reasoning-active wire test pinning Qwen3 thinking-mode composition behavior before any catalog values ship.
sampling_controldefaults toclient_settable; theharness_pinnedshort-circuit is exercised in the resolver tests so the wrapped-harness contract is enforceable, not aspirational.
Reference: per-server defaults and the T=0 footgun
The decision above (“sampling is catalog policy”) is grounded in a survey of
what each server applies when the client OMITS sampler fields on a
/v1/chat/completions call. The empirical surprise is that defaults differ
sharply across servers — and that several local servers default to T=0
(greedy / argmax), which interacts badly with reasoning-mode Qwen3.x.
Server defaults (field-omitted behavior)
| Server | Default temperature | Default top_p | Default top_k | Top-level non-OpenAI fields accepted |
|---|---|---|---|---|
| oMLX (jundot/omlx, mlx-lm) | 0.0 (greedy) | 1.0 | 0 (off) | top_k, min_p, repetition_penalty YES |
| LM Studio | 1.0 | 1.0 | preset-driven (silently ignored on API path) | top_k, min_p, repeat_penalty YES |
| vLLM | 1.0 | 1.0 | -1 (off) | top_k, min_p, repetition_penalty YES |
| llama.cpp / llama-server | 0.80 | 0.95 | 40 | top_k, min_p, repeat_penalty YES |
| Ollama (OpenAI-compat endpoint) | 0.8 (Modelfile may override) | 0.9 | 40 (native API only) | top_k, min_p, repetition_penalty NO — must be set in Modelfile or sent via native /api/chat options |
| Lucebox (jondot/lucebox) | follows underlying mlx-lm; same T=0 footgun as oMLX | 1.0 | 0 (off) | (verify per release) |
oMLX and Lucebox do not auto-apply HuggingFace generation_config.json
from the model. vLLM does not unless launched with --generation-config auto. llama.cpp does not (GGUF metadata only). Ollama applies Modelfile
PARAMETER lines.
Recommended sampling for reasoning-mode model families (selected)
| Family | Temp | top_p | top_k | rep_pen | Source / notes |
|---|---|---|---|---|---|
| Qwen3 thinking | 0.6 | 0.95 | 20 | 1.0 | Qwen3 model card best practices. DO NOT use greedy. |
| Qwen3.5 thinking, code | 0.6 | 0.95 | 20 | 1.0 | Qwen3.5 card; presence_penalty=1.5 for non-tool general work |
| Qwen3.6 thinking, code | 0.6 | 0.95 | 20 | 1.0 | Qwen3.6 card; same recipe |
| Qwen3-Coder (3.x variants) | 0.7 | 0.8 | 20 | 1.05 | Coder card; agentic mode is primary |
| MiniMax M2 / M2.x | 1.0 | 0.95 | 40 | unknown | MiniMax cards; lower T causes loops on long agent traces |
| GPT-OSS 20B / 120B | 1.0 | 1.0 | unknown | unknown | openai/gpt-oss; explicit T=1.0/top_p=1.0 |
| Llama 3.1 / 3.3 instruct | 0.6 | 0.9 | unknown | unknown | meta-llama generation.py; same recipe for tool use |
| DeepSeek-R1 | 0.6 | 0.95 | unknown | unknown | DeepSeek-R1 card — must stay 0.5–0.7; greedy explicitly forbidden |
| DeepSeek-V3 | 0.3 | unknown | unknown | unknown | DeepSeek API param guide (general preset) |
| Gemma 3 / 4 | 1.0 | 0.95 | 64 | unknown | Gemma model cards |
Cross-cutting findings
- oMLX (and underlying mlx-lm) defaults to T=0.0 (greedy) when the client omits temperature. This is the single biggest harness footgun and is the empirical origin of the deterministic tool-call loops on Qwen3 reasoning-mode models that motivated this ADR. Every other server defaults to T ≥ 0.7.
- Ollama’s OpenAI endpoint silently drops
top_k,min_p, andrepetition_penalty. Setting Qwen3-recommendedtop_k=20on the wire has no effect; Modelfile is the only path. The catalog resolver must know this when targeting Ollama. - vLLM does not apply HF
generation_config.jsonby default. Without--generation-config auto, the model runs at T=1.0/top_p=1.0 regardless of what the model card says. - llama-server’s defaults already include sane min_p=0.05, top_k=40, T=0.8 — omitting fields is much safer there than on oMLX/vLLM/LM Studio. Different servers fail differently.
- GPT-OSS recommends T=1.0/top_p=1.0 (no nucleus filtering at all), which is the opposite of the Qwen3 thinking recipe (T=0.6/top_p=0.95/ top_k=20). A single one-size-fits-all harness preset will mis-serve at least one of these families. This is why §1 of the decision keys sampling to the model catalog rather than a global default.
Source provenance
Per-server defaults table, per-family recommended sampling, the T=0
footgun finding, and the cross-cutting failure modes extracted from
docs/research/sampling-defaults-survey-2026-04-27.md (Sections 1, 2,
and the “5-line summary of surprising defaults”). Folded here in ADR-007
because this ADR is the policy origin for sampling-as-catalog; SD-005
covers provider config plumbing but not the per-family recommendations.
Out of scope
- CLI flags for sampler fields (
--temperature, etc.). - Per-
(model, profile)overrides onModelEntry. - Per-request profile selection (v1 hardcodes
code). partialsampling_control enforcement (Anthropic field-list clipping).- Range validation.
- Server-side
seedplumbing — empirically ignored by oMLX as of 2026-04-27 (/tmp/probe_samplers.pylog).