ADR-010: Reasoning Wire Form Belongs in the Model Catalog
| Date | Status | Deciders | Related | Confidence |
|---|---|---|---|---|
| 2026-05-11 | Accepted | Fizeau maintainers | ADR-005, ADR-006, ADR-007 | High |
Context
Fizeau’s reasoning policy (internal/reasoning/reasoning.go) normalizes
caller intent into one of two kinds:
KindNamed—low/medium/high/ etc. (a tier shortcut).KindTokens— an integer thinking budget.
Each provider’s translator emits a wire form for the upstream API. For
OpenRouter that’s a nested reasoning block with either effort: "<tier>" or max_tokens: <int>. For Qwen-native servers
(llama-server, vLLM) it’s enable_thinking: true, thinking_budget: <int>. For Anthropic-shaped servers (oMLX, ds4, lucebox) it’s
thinking: {type: enabled, budget_tokens: <int>}.
Today the OpenRouter translator at
internal/provider/openai/openai.go:453-473 chooses wire shape from
the policy Kind alone:
case KindTokens:
reasoning["max_tokens"] = policy.Tokens
case KindNamed:
reasoning["effort"] = effortEmpirical finding (2026-05-11 probe against qwen/qwen3.6-27b on
OpenRouter): the upstream silently flat-maps effort for Qwen3 — low,
medium, and high all yield ~5555 reasoning tokens. The
reasoning.max_tokens form is forwarded upstream as Qwen-native
thinking_budget and is honored. So which knob bites is not a
property of the OR transport; it’s a property of the upstream model.
Other OR-hosted models (GPT-5, Claude) honor effort because their
upstream vendors interpret it natively.
The principle from ADR-006 (“overrides are routing-failure signals”)
and ADR-007 (“sampling is catalog policy, not user configuration”)
applies again: a benchmark profile having to specify reasoning: 4096
to work around a tier flatten is the catalog failing to carry the
knowledge that OR-Qwen3 needs a token-budget wire. Callers should be
able to write reasoning: low and have the right wire shape emitted.
A latent bug accelerates the case: internal/config/config.go:925-944
keys the reasoning_wire lookup map by catalog ID (e.g.
qwen3.6-27b), but the openai translator queries with the surface
ID (qwen/qwen3.6-27b for the OR route). The catalog entry for
qwen3.6-27b has reasoning_wire: model_id, but the OR provider
never sees it — the lookup misses on key mismatch. Today this is
benign (the lookup returns the default provider and the wire path
runs); under any catalog change to the effort/tokens values, the
mismatch becomes a silent correctness bug.
Decision
1. Wire form is catalog-declared
Extend ReasoningWire (internal/modelcatalog/manifest.go:61,
internal/modelcatalog/manifest.go:90-93) with two new accepted
values:
effort— the upstream honors a discrete effort tier. Wire emitsreasoning.effort: "<tier>"(OpenRouter) or the equivalent for other transports.tokens— the upstream honors only a token budget. Wire emitsreasoning.max_tokens: <int>(OpenRouter),thinking_budget: <int>(Qwen-native), orbudget_tokens: <int>(Anthropic-shape).
The full enum becomes provider | model_id | none | effort | tokens.
provider remains the default and means “use the provider’s default
wire form” — preserving today’s behavior for any model the catalog
doesn’t classify.
2. PortableBudgets is the bidirectional conversion table
When the caller’s Kind and the catalog’s wire form disagree, the
translator converts via the existing PortableBudgets map
(low: 2048, medium: 8192, high: 32768):
- Caller passes
reasoning: low, catalog saystokens→ emitmax_tokens: 2048. - Caller passes
reasoning: 4096, catalog sayseffort→ snap to the nearest tier by PortableBudgets, rounding up on ties (principle of least surprise: a caller asking for “more thinking” gets at least what they asked for, never less).
PortableBudgets stays the single source of truth for the tier↔tokens relationship. No second mapping table.
3. Provider translators receive model and consult the catalog
openRouterReasoningOptions(policy) becomes
openRouterReasoningOptions(policy, model, wire), where wire is
the catalog’s reasoning-wire value for model. The translator picks
the wire shape from wire, applying PortableBudgets if the caller’s
form needs converting. When wire is empty or provider, the
translator falls back to today’s “pick wire shape from policy.Kind”
behavior (preserving correctness for ad-hoc model strings not in the
catalog).
Other format-specific paths
(ThinkingWireFormatQwen, ThinkingWireFormatThinkingMap) get the
same model + wire treatment for symmetry. They’re already
budget-only on the wire, so the new effort value is meaningless
for them; effort on a Qwen or ThinkingMap provider is treated
as tokens (the only thing they can express) with a single
structured-warning log so catalog mis-classifications are visible.
4. Lookup is surface-id-aware
Fix modelReasoningWireMap() to emit one map per surface ID, not
per catalog ID. The provider system already knows which surface it
serves (e.g. agent.openai for OR), so the lookup becomes
wireMap[surface][surfaceID]. The fix lands together with the
catalog value updates so no intermediate state regresses any
existing route.
5. Probe tooling is the source of truth; catalog is the cached form
The catalog’s reasoning_wire value for any model is determined by
measurement, not by interpretation of the model card prose. The
probe tool (cmd/fizeau-probe-reasoning, see beads) sends a small
matrix of requests against the live endpoint and records, per
(provider, model, knob form), whether usage.completion_tokens_details .reasoning_tokens actually moved. The verdict is what gets written
to the catalog.
Model cards remain useful operator context (they tell the human what to probe and what to expect) but they are not load-bearing for runtime correctness. We do not attempt to parse cards.
6. No runtime probing
The probe is operator tooling, not a startup hook. The runtime path remains: read catalog (compiled-in or external manifest) → emit wire. No network IO at request time for catalog purposes. This preserves cold-start latency and request determinism.
Consequences
Positive:
- Benchmark profiles stop carrying provider-quirk workarounds.
reasoning: lowworks the same on OR-Qwen3, sindri-llamacpp, and vidar-ds4. - The single “what wire shape does this model honor?” question has one home (catalog), populated by one tool (probe), with one conversion table (PortableBudgets).
- The latent surface-id lookup bug gets fixed before it bites (today
it would silently start dropping reasoning entirely if any catalog
entry’s
reasoning_wirewere set tomodel_idfor an OR model). - Adds a forcing function for catalog hygiene without a CI gate: when a sweep produces unexpected reasoning behavior, the operator re-runs the probe and either confirms or updates the catalog — process emerges from data, not policy.
Negative:
- One more catalog dimension. Operators authoring new model entries
must run the probe to populate
reasoning_wirecorrectly (or leave it as theproviderdefault and accept today’s behavior). - The wire-form decision moves from “implicit in the policy
Kindthe caller passed” to “looked up in a registry.” A new contributor readingopenRouterReasoningOptionshas one more indirection to follow. - The probe tool is operator infrastructure that needs maintenance separately from runtime code. Operators may forget to re-probe when an upstream changes routing; cells will then exhibit the pre-fix symptoms (silent flat-mapping) until someone notices.
Mitigations:
- The implementation bead pins a backwards-compat test: when
reasoning_wireis unset (i.e. on a manifest predating this change), the wire shape is selected from policyKindexactly as today. No regression for unaudited models. - A structured log (
reasoning_wire=provider, model=...) fires on the unset path so post-hoc analysis can spot models that haven’t been classified yet. - The probe tool ships with the bead, not as a follow-up. Operator hygiene without tooling is aspirational.
Out of scope
- A catalog refresh tool that pulls upstream
/modelsregistries and diffs against the catalog (pricing, context windows, supported parameters). Not blocking; ship when the pain is felt, not before. - A CI gate on
last_probed_atstaleness. Stale entries produce subtly-wrong reasoning, which a date check doesn’t catch; the control is “re-probe when results look weird,” not process. - Model-card URL or citation fields on catalog entries. The probe artifact is the audit object; cards are operator context, not schema.
- A startup probe that auto-discovers wire form per session. Explicitly rejected: runtime determinism and cold-start latency outweigh the convenience.
- Per-
(model, provider, region)wire-form variation. Today every observed case is per-model regardless of OR upstream region; recoverable additively if a real case appears.
References
- ADR-005 (smart routing replaces model routes)
- ADR-006 (overrides are routing-failure signals)
- ADR-007 (sampling profiles in catalog) — same architectural principle: behavioral knowledge belongs in the catalog, not in caller config.
internal/reasoning/reasoning.go—Policy,Kind,PortableBudgets.internal/provider/openai/openai.go:340-435—reasoningRequestOptionsandopenRouterReasoningOptions.internal/modelcatalog/manifest.go:59-93—ReasoningWireenum and validation.internal/config/config.go:925-944—modelReasoningWireMap(the surface-id lookup bug).
Amendment 2026-05-11: Introspection is the primary source of truth
Why this amendment exists
A direct probing session against the live ds4 (vidar:1236) and
sindri-llamacpp (sindri:8020) endpoints surfaced two facts the
original Decision didn’t account for:
- Both servers expose rich introspection at
/props.- ds4’s
/props.reasoningcarriessupported_efforts,aliases(the explicit low/medium/xhigh → high collapse map),default,effective_default, andthink_max_min_context(the--ctx ≥ 393216constraint for themaxtier). - llama-server’s
/propscarrieschat_template(full jinja),chat_template_caps.supports_preserve_reasoning,default_generation_settings.params.reasoning_format, andbuild_info. - Cost: 8–15 ms p50 on LAN endpoints, 316 ms p50 for OR’s
/api/v1/models(cross-internet, full-catalog payload). Negligible against per-request workload.
- ds4’s
- Wire-shape behavior differs across providers in ways the
original
effort | tokensenum cannot fully express.- llama-server requires
chat_template_kwargs.{enable_thinking, thinking_budget}— top-levelenable_thinkingis silently dropped. Verified: variant F of the 2026-05-11 sindri matrix produced 4 completion tokens (no thinking) where the kwargs envelope produced 250–526 tokens with extracted reasoning. - ds4 accepts only flat top-level
reasoning_efforton/v1/chat/completions; the Anthropic-shapethinking: {type, budget_tokens}is silently ignored. ds4’s/v1/messagesendpoint acceptsoutput_config.effortinstead — a separate shape — but we don’t route through it today. - ds4’s effort tiers
low,medium,xhighall alias to “high”; onlymaxis genuinely distinct (and only when--ctx ≥ 393216). Source:ds4_server.c:710andds4.c:15170per upstream agent reading. - sindri’s
chat_template_kwargs.thinking_budgetdoes not bind in any meaningful way — it’s a soft template hint at best. Verified:budget=128produced 1914 completion tokens with 3599 chars of reasoning vsbudget=2048producing 1434 completion tokens / 2494 chars (within sampling noise).max_tokensis the only hard limit on this profile.
- llama-server requires
These findings invert the original Decision §5–§6. The probe tool
should not be the source of truth where introspection answers the
question; introspection should be primary, with the probe relegated
to verification (does the upstream actually behave the way /props
claims?) and to discovery for opaque providers (OR’s upstream
routing, where /api/v1/models.supported_parameters lists the
reasoning family but doesn’t disambiguate which subkeys bite).
Revised Decision (supersedes §5–§6)
5’. Layered source-of-truth chain
Wire form is determined by a three-layer lookup at provider construction time:
- L1 — Live introspection via the provider’s own metadata
endpoint (
/propsfor ds4 and llama-server,/api/v1/modelsfor OR,/v1/modelsfor OpenAI/Anthropic-native, etc.). When the endpoint exposes reasoning capabilities directly (alias maps, supported parameters, chat-template flags), that information is authoritative. - L2 — Catalog
reasoning_wirefield. Used when L1 is silent (provider has no introspection, or the introspection doesn’t disambiguate the wire form). For OR-Qwen3, L1 confirmsreasoningis accepted but doesn’t say which subkey (effort/max_tokens) bites; L2 declarestokensper the probe verdict. - L3 — Static
ThinkingFormatper provider type. Final default when neither L1 nor L2 has an opinion. Today’s behavior; preserved for unaudited models.
Higher layers stomp lower layers. An operator override at L2 (catalog) beats L1 introspection — useful for testing or for working around a known server-side misclaim. A structured log emits when L2 overrides L1, mirroring ADR-006’s override-as-failure-signal pattern.
6’. Introspection adapters as a first-class concept
Each provider type that exposes useful introspection ships an
adapter under internal/provider/<name>/introspection.go,
extending the existing utilization-probe pattern at
internal/provider/ds4/utilization_probe.go. The adapter:
- Is invoked once at provider construction; result is cached for
the process lifetime keyed by
(base_url, model). - Returns a structured
ProviderIntrospectionthat augments the existingProtocolCapabilities: effectiveThinkingFormat, effectivereasoning_levels,supported_request_parameters, alias maps, server-side defaults (e.g. llama-server’sreasoning_format), and a free-form raw-snapshot field for audit. - Falls through silently to L2 if the endpoint is unreachable or returns malformed data — provider construction must not fail on introspection errors. A structured warning is logged.
Provider construction order: instantiate provider → run introspection adapter (if available) → merge into capabilities → ready. Per-request path is unchanged; no introspection IO at request time.
7. Wire-shape vocabulary expanded
The ThinkingFormat enum at internal/provider/openai/protocol_support.go
gains one new value:
ThinkingWireFormatOpenAIEffort— flat top-levelreasoning_effort: "<tier>". ds4’s wire on/v1/chat/completions. Disable form is top-levelthink: falseper ds4 docs.
The existing ThinkingWireFormatQwen is fixed to emit via
chat_template_kwargs envelope plus an optional per-request
reasoning_format override (defaulting to whatever the server’s
/props.default_generation_settings.params.reasoning_format
reports — typically auto or deepseek, both of which extract
the <think> block into message.reasoning_content). Top-level
enable_thinking/thinking_budget is removed. Verified against
sindri-llamacpp 2026-05-11.
ThinkingWireFormatThinkingMap is unchanged — it remains the
correct format for genuine Anthropic-shape providers (oMLX,
lucebox; ds4 was misclassified as ThinkingMap and switches to
OpenAIEffort).
8. Telemetry: faithful reasoning-token reporting
Cells must record reasoning_tokens truthfully even when the
upstream omits the OpenAI-shape usage.completion_tokens_details .reasoning_tokens field. Resolution chain in
internal/sdk/openaicompat/client.go:
- If
usage.completion_tokens_details.reasoning_tokensis present, use it. (OpenAI native, OR.) - Else if
message.reasoning_contentis non-empty, derive an approximation. Char-count divided by ~4 is the v1 estimator; the field is annotated asreasoning_tokens_approx=truein the cell record so analysts know the source. - Else
reasoning_tokens=0. (Honest absence, not silent truncation.)
The catalog never lies about this either: a model with a built-in
thinking-on default (ds4 default reasoning.effective_default: "high") gets recorded as such even when the caller didn’t
explicitly enable thinking.
9. Capability honesty over enforced uniformity
When a profile physically cannot honor the caller’s intent at the requested granularity (ds4 has no token budget; sindri’s budget hint doesn’t bind; ds4’s low/medium/xhigh all collapse), fizeau:
- Sends the closest representable wire form (e.g.
reasoning: lowon ds4 →reasoning_effort: "high"becauselowaliases tohighper L1 introspection). - Records both the requested intent and the actually-emitted wire
in the cell record (
reasoning_intent: low, reasoning_emitted: high, reasoning_emitted_reason: "ds4 alias map"). - Does not silently degrade to a wire form that wouldn’t produce comparable results. Cross-profile analysis is honest at read time, not forced at write time.
10. Verification asserts emitted control, not consumed depth
PortableBudgets entries are request-side control signals. They map a
portable intent such as reasoning: low to the closest representable
wire budget, for example reasoning.max_tokens: 2048 on a
tokens-shaped OpenRouter profile. They are not predictions that the
upstream will consume exactly that many reasoning tokens.
Therefore final verification must not require observed
reasoning_tokens to equal or approximate the portable budget. A
model may stop thinking early, sampling may change the amount of
hidden work, and provider accounting may report only actual consumed
tokens. The deterministic acceptance target is:
- the selected L1/L2/L3 source produces the expected wire form;
- the emitted wire is recorded alongside the requested intent;
- observed
reasoning_tokensare recorded truthfully; reasoning_tokens_approxdistinguishes direct usage accounting from fallback estimates.
For example, a smoke cell that emits reasoning.max_tokens: 2048
and records direct OpenRouter usage of reasoning_tokens: 352 has
passed the control-plane requirement. The 2048 value is the cap that
Fizeau sent, while 352 is the actual spend reported by the provider.
End goal (canonical statement)
A benchmark caller writes reasoning: <intent> once in a profile.
Across every profile in the matrix, that intent is honored to the
maximum extent the upstream supports. The actual wire emitted, the
upstream’s published capabilities, and the observed reasoning depth
are all recorded per cell. Cross-profile comparison happens at
analysis time on truthful data, not by forcing identical wire-level
behavior across providers that genuinely differ.
Out of scope (revised)
Removed from the original out-of-scope list:
- “Catalog refresh tool” — partially absorbed by L1 introspection. Refresh is still deferred for non-introspecting providers (OR, Anthropic-native) but for ds4/llama-server it’s now a runtime property.
Still out of scope:
- A startup probe that auto-discovers wire form via behavioral testing. (Cheap-introspection-at-construction is fine; expensive behavioral probing at startup remains rejected.)
- A CI gate on
last_probed_atstaleness. - Model-card URL/citation fields in catalog entries.
- vLLM introspection adapter — defer until we re-add a vLLM profile.
Newly identified bugs (filed as beads)
- Surface-id vs catalog-id lookup mismatch
(
internal/config/config.go:925-944). Fix bundled into the populate bead (fizeau-b83502c0) so it lands together with catalog value updates. - Reasoning-token extraction from
reasoning_contentwhenusage.completion_tokens_details.reasoning_tokensis absent. Affects ds4 cells today. Filed as a separate bead (independent of L1/L2/L3 work).