Skip to content

ADR-010: Reasoning Wire Form Belongs in the Model Catalog

DateStatusDecidersRelatedConfidence
2026-05-11AcceptedFizeau maintainersADR-005, ADR-006, ADR-007High

Context

Fizeau’s reasoning policy (internal/reasoning/reasoning.go) normalizes caller intent into one of two kinds:

  • KindNamedlow / medium / high / etc. (a tier shortcut).
  • KindTokens — an integer thinking budget.

Each provider’s translator emits a wire form for the upstream API. For OpenRouter that’s a nested reasoning block with either effort: "<tier>" or max_tokens: <int>. For Qwen-native servers (llama-server, vLLM) it’s enable_thinking: true, thinking_budget: <int>. For Anthropic-shaped servers (oMLX, ds4, lucebox) it’s thinking: {type: enabled, budget_tokens: <int>}.

Today the OpenRouter translator at internal/provider/openai/openai.go:453-473 chooses wire shape from the policy Kind alone:

case KindTokens:
    reasoning["max_tokens"] = policy.Tokens
case KindNamed:
    reasoning["effort"] = effort

Empirical finding (2026-05-11 probe against qwen/qwen3.6-27b on OpenRouter): the upstream silently flat-maps effort for Qwen3 — low, medium, and high all yield ~5555 reasoning tokens. The reasoning.max_tokens form is forwarded upstream as Qwen-native thinking_budget and is honored. So which knob bites is not a property of the OR transport; it’s a property of the upstream model. Other OR-hosted models (GPT-5, Claude) honor effort because their upstream vendors interpret it natively.

The principle from ADR-006 (“overrides are routing-failure signals”) and ADR-007 (“sampling is catalog policy, not user configuration”) applies again: a benchmark profile having to specify reasoning: 4096 to work around a tier flatten is the catalog failing to carry the knowledge that OR-Qwen3 needs a token-budget wire. Callers should be able to write reasoning: low and have the right wire shape emitted.

A latent bug accelerates the case: internal/config/config.go:925-944 keys the reasoning_wire lookup map by catalog ID (e.g. qwen3.6-27b), but the openai translator queries with the surface ID (qwen/qwen3.6-27b for the OR route). The catalog entry for qwen3.6-27b has reasoning_wire: model_id, but the OR provider never sees it — the lookup misses on key mismatch. Today this is benign (the lookup returns the default provider and the wire path runs); under any catalog change to the effort/tokens values, the mismatch becomes a silent correctness bug.

Decision

1. Wire form is catalog-declared

Extend ReasoningWire (internal/modelcatalog/manifest.go:61, internal/modelcatalog/manifest.go:90-93) with two new accepted values:

  • effort — the upstream honors a discrete effort tier. Wire emits reasoning.effort: "<tier>" (OpenRouter) or the equivalent for other transports.
  • tokens — the upstream honors only a token budget. Wire emits reasoning.max_tokens: <int> (OpenRouter), thinking_budget: <int> (Qwen-native), or budget_tokens: <int> (Anthropic-shape).

The full enum becomes provider | model_id | none | effort | tokens. provider remains the default and means “use the provider’s default wire form” — preserving today’s behavior for any model the catalog doesn’t classify.

2. PortableBudgets is the bidirectional conversion table

When the caller’s Kind and the catalog’s wire form disagree, the translator converts via the existing PortableBudgets map (low: 2048, medium: 8192, high: 32768):

  • Caller passes reasoning: low, catalog says tokens → emit max_tokens: 2048.
  • Caller passes reasoning: 4096, catalog says effort → snap to the nearest tier by PortableBudgets, rounding up on ties (principle of least surprise: a caller asking for “more thinking” gets at least what they asked for, never less).

PortableBudgets stays the single source of truth for the tier↔tokens relationship. No second mapping table.

3. Provider translators receive model and consult the catalog

openRouterReasoningOptions(policy) becomes openRouterReasoningOptions(policy, model, wire), where wire is the catalog’s reasoning-wire value for model. The translator picks the wire shape from wire, applying PortableBudgets if the caller’s form needs converting. When wire is empty or provider, the translator falls back to today’s “pick wire shape from policy.Kind” behavior (preserving correctness for ad-hoc model strings not in the catalog).

Other format-specific paths (ThinkingWireFormatQwen, ThinkingWireFormatThinkingMap) get the same model + wire treatment for symmetry. They’re already budget-only on the wire, so the new effort value is meaningless for them; effort on a Qwen or ThinkingMap provider is treated as tokens (the only thing they can express) with a single structured-warning log so catalog mis-classifications are visible.

4. Lookup is surface-id-aware

Fix modelReasoningWireMap() to emit one map per surface ID, not per catalog ID. The provider system already knows which surface it serves (e.g. agent.openai for OR), so the lookup becomes wireMap[surface][surfaceID]. The fix lands together with the catalog value updates so no intermediate state regresses any existing route.

5. Probe tooling is the source of truth; catalog is the cached form

The catalog’s reasoning_wire value for any model is determined by measurement, not by interpretation of the model card prose. The probe tool (cmd/fizeau-probe-reasoning, see beads) sends a small matrix of requests against the live endpoint and records, per (provider, model, knob form), whether usage.completion_tokens_details .reasoning_tokens actually moved. The verdict is what gets written to the catalog.

Model cards remain useful operator context (they tell the human what to probe and what to expect) but they are not load-bearing for runtime correctness. We do not attempt to parse cards.

6. No runtime probing

The probe is operator tooling, not a startup hook. The runtime path remains: read catalog (compiled-in or external manifest) → emit wire. No network IO at request time for catalog purposes. This preserves cold-start latency and request determinism.

Consequences

Positive:

  • Benchmark profiles stop carrying provider-quirk workarounds. reasoning: low works the same on OR-Qwen3, sindri-llamacpp, and vidar-ds4.
  • The single “what wire shape does this model honor?” question has one home (catalog), populated by one tool (probe), with one conversion table (PortableBudgets).
  • The latent surface-id lookup bug gets fixed before it bites (today it would silently start dropping reasoning entirely if any catalog entry’s reasoning_wire were set to model_id for an OR model).
  • Adds a forcing function for catalog hygiene without a CI gate: when a sweep produces unexpected reasoning behavior, the operator re-runs the probe and either confirms or updates the catalog — process emerges from data, not policy.

Negative:

  • One more catalog dimension. Operators authoring new model entries must run the probe to populate reasoning_wire correctly (or leave it as the provider default and accept today’s behavior).
  • The wire-form decision moves from “implicit in the policy Kind the caller passed” to “looked up in a registry.” A new contributor reading openRouterReasoningOptions has one more indirection to follow.
  • The probe tool is operator infrastructure that needs maintenance separately from runtime code. Operators may forget to re-probe when an upstream changes routing; cells will then exhibit the pre-fix symptoms (silent flat-mapping) until someone notices.

Mitigations:

  • The implementation bead pins a backwards-compat test: when reasoning_wire is unset (i.e. on a manifest predating this change), the wire shape is selected from policy Kind exactly as today. No regression for unaudited models.
  • A structured log (reasoning_wire=provider, model=...) fires on the unset path so post-hoc analysis can spot models that haven’t been classified yet.
  • The probe tool ships with the bead, not as a follow-up. Operator hygiene without tooling is aspirational.

Out of scope

  • A catalog refresh tool that pulls upstream /models registries and diffs against the catalog (pricing, context windows, supported parameters). Not blocking; ship when the pain is felt, not before.
  • A CI gate on last_probed_at staleness. Stale entries produce subtly-wrong reasoning, which a date check doesn’t catch; the control is “re-probe when results look weird,” not process.
  • Model-card URL or citation fields on catalog entries. The probe artifact is the audit object; cards are operator context, not schema.
  • A startup probe that auto-discovers wire form per session. Explicitly rejected: runtime determinism and cold-start latency outweigh the convenience.
  • Per-(model, provider, region) wire-form variation. Today every observed case is per-model regardless of OR upstream region; recoverable additively if a real case appears.

References

  • ADR-005 (smart routing replaces model routes)
  • ADR-006 (overrides are routing-failure signals)
  • ADR-007 (sampling profiles in catalog) — same architectural principle: behavioral knowledge belongs in the catalog, not in caller config.
  • internal/reasoning/reasoning.goPolicy, Kind, PortableBudgets.
  • internal/provider/openai/openai.go:340-435reasoningRequestOptions and openRouterReasoningOptions.
  • internal/modelcatalog/manifest.go:59-93ReasoningWire enum and validation.
  • internal/config/config.go:925-944modelReasoningWireMap (the surface-id lookup bug).

Amendment 2026-05-11: Introspection is the primary source of truth

Why this amendment exists

A direct probing session against the live ds4 (vidar:1236) and sindri-llamacpp (sindri:8020) endpoints surfaced two facts the original Decision didn’t account for:

  1. Both servers expose rich introspection at /props.
    • ds4’s /props.reasoning carries supported_efforts, aliases (the explicit low/medium/xhigh → high collapse map), default, effective_default, and think_max_min_context (the --ctx ≥ 393216 constraint for the max tier).
    • llama-server’s /props carries chat_template (full jinja), chat_template_caps.supports_preserve_reasoning, default_generation_settings.params.reasoning_format, and build_info.
    • Cost: 8–15 ms p50 on LAN endpoints, 316 ms p50 for OR’s /api/v1/models (cross-internet, full-catalog payload). Negligible against per-request workload.
  2. Wire-shape behavior differs across providers in ways the original effort | tokens enum cannot fully express.
    • llama-server requires chat_template_kwargs.{enable_thinking, thinking_budget} — top-level enable_thinking is silently dropped. Verified: variant F of the 2026-05-11 sindri matrix produced 4 completion tokens (no thinking) where the kwargs envelope produced 250–526 tokens with extracted reasoning.
    • ds4 accepts only flat top-level reasoning_effort on /v1/chat/completions; the Anthropic-shape thinking: {type, budget_tokens} is silently ignored. ds4’s /v1/messages endpoint accepts output_config.effort instead — a separate shape — but we don’t route through it today.
    • ds4’s effort tiers low, medium, xhigh all alias to “high”; only max is genuinely distinct (and only when --ctx ≥ 393216). Source: ds4_server.c:710 and ds4.c:15170 per upstream agent reading.
    • sindri’s chat_template_kwargs.thinking_budget does not bind in any meaningful way — it’s a soft template hint at best. Verified: budget=128 produced 1914 completion tokens with 3599 chars of reasoning vs budget=2048 producing 1434 completion tokens / 2494 chars (within sampling noise). max_tokens is the only hard limit on this profile.

These findings invert the original Decision §5–§6. The probe tool should not be the source of truth where introspection answers the question; introspection should be primary, with the probe relegated to verification (does the upstream actually behave the way /props claims?) and to discovery for opaque providers (OR’s upstream routing, where /api/v1/models.supported_parameters lists the reasoning family but doesn’t disambiguate which subkeys bite).

Revised Decision (supersedes §5–§6)

5’. Layered source-of-truth chain

Wire form is determined by a three-layer lookup at provider construction time:

  1. L1 — Live introspection via the provider’s own metadata endpoint (/props for ds4 and llama-server, /api/v1/models for OR, /v1/models for OpenAI/Anthropic-native, etc.). When the endpoint exposes reasoning capabilities directly (alias maps, supported parameters, chat-template flags), that information is authoritative.
  2. L2 — Catalog reasoning_wire field. Used when L1 is silent (provider has no introspection, or the introspection doesn’t disambiguate the wire form). For OR-Qwen3, L1 confirms reasoning is accepted but doesn’t say which subkey (effort/max_tokens) bites; L2 declares tokens per the probe verdict.
  3. L3 — Static ThinkingFormat per provider type. Final default when neither L1 nor L2 has an opinion. Today’s behavior; preserved for unaudited models.

Higher layers stomp lower layers. An operator override at L2 (catalog) beats L1 introspection — useful for testing or for working around a known server-side misclaim. A structured log emits when L2 overrides L1, mirroring ADR-006’s override-as-failure-signal pattern.

6’. Introspection adapters as a first-class concept

Each provider type that exposes useful introspection ships an adapter under internal/provider/<name>/introspection.go, extending the existing utilization-probe pattern at internal/provider/ds4/utilization_probe.go. The adapter:

  • Is invoked once at provider construction; result is cached for the process lifetime keyed by (base_url, model).
  • Returns a structured ProviderIntrospection that augments the existing ProtocolCapabilities: effective ThinkingFormat, effective reasoning_levels, supported_request_parameters, alias maps, server-side defaults (e.g. llama-server’s reasoning_format), and a free-form raw-snapshot field for audit.
  • Falls through silently to L2 if the endpoint is unreachable or returns malformed data — provider construction must not fail on introspection errors. A structured warning is logged.

Provider construction order: instantiate provider → run introspection adapter (if available) → merge into capabilities → ready. Per-request path is unchanged; no introspection IO at request time.

7. Wire-shape vocabulary expanded

The ThinkingFormat enum at internal/provider/openai/protocol_support.go gains one new value:

  • ThinkingWireFormatOpenAIEffort — flat top-level reasoning_effort: "<tier>". ds4’s wire on /v1/chat/completions. Disable form is top-level think: false per ds4 docs.

The existing ThinkingWireFormatQwen is fixed to emit via chat_template_kwargs envelope plus an optional per-request reasoning_format override (defaulting to whatever the server’s /props.default_generation_settings.params.reasoning_format reports — typically auto or deepseek, both of which extract the <think> block into message.reasoning_content). Top-level enable_thinking/thinking_budget is removed. Verified against sindri-llamacpp 2026-05-11.

ThinkingWireFormatThinkingMap is unchanged — it remains the correct format for genuine Anthropic-shape providers (oMLX, lucebox; ds4 was misclassified as ThinkingMap and switches to OpenAIEffort).

8. Telemetry: faithful reasoning-token reporting

Cells must record reasoning_tokens truthfully even when the upstream omits the OpenAI-shape usage.completion_tokens_details .reasoning_tokens field. Resolution chain in internal/sdk/openaicompat/client.go:

  1. If usage.completion_tokens_details.reasoning_tokens is present, use it. (OpenAI native, OR.)
  2. Else if message.reasoning_content is non-empty, derive an approximation. Char-count divided by ~4 is the v1 estimator; the field is annotated as reasoning_tokens_approx=true in the cell record so analysts know the source.
  3. Else reasoning_tokens=0. (Honest absence, not silent truncation.)

The catalog never lies about this either: a model with a built-in thinking-on default (ds4 default reasoning.effective_default: "high") gets recorded as such even when the caller didn’t explicitly enable thinking.

9. Capability honesty over enforced uniformity

When a profile physically cannot honor the caller’s intent at the requested granularity (ds4 has no token budget; sindri’s budget hint doesn’t bind; ds4’s low/medium/xhigh all collapse), fizeau:

  • Sends the closest representable wire form (e.g. reasoning: low on ds4 → reasoning_effort: "high" because low aliases to high per L1 introspection).
  • Records both the requested intent and the actually-emitted wire in the cell record (reasoning_intent: low, reasoning_emitted: high, reasoning_emitted_reason: "ds4 alias map").
  • Does not silently degrade to a wire form that wouldn’t produce comparable results. Cross-profile analysis is honest at read time, not forced at write time.

10. Verification asserts emitted control, not consumed depth

PortableBudgets entries are request-side control signals. They map a portable intent such as reasoning: low to the closest representable wire budget, for example reasoning.max_tokens: 2048 on a tokens-shaped OpenRouter profile. They are not predictions that the upstream will consume exactly that many reasoning tokens.

Therefore final verification must not require observed reasoning_tokens to equal or approximate the portable budget. A model may stop thinking early, sampling may change the amount of hidden work, and provider accounting may report only actual consumed tokens. The deterministic acceptance target is:

  • the selected L1/L2/L3 source produces the expected wire form;
  • the emitted wire is recorded alongside the requested intent;
  • observed reasoning_tokens are recorded truthfully;
  • reasoning_tokens_approx distinguishes direct usage accounting from fallback estimates.

For example, a smoke cell that emits reasoning.max_tokens: 2048 and records direct OpenRouter usage of reasoning_tokens: 352 has passed the control-plane requirement. The 2048 value is the cap that Fizeau sent, while 352 is the actual spend reported by the provider.

End goal (canonical statement)

A benchmark caller writes reasoning: <intent> once in a profile. Across every profile in the matrix, that intent is honored to the maximum extent the upstream supports. The actual wire emitted, the upstream’s published capabilities, and the observed reasoning depth are all recorded per cell. Cross-profile comparison happens at analysis time on truthful data, not by forcing identical wire-level behavior across providers that genuinely differ.

Out of scope (revised)

Removed from the original out-of-scope list:

  • “Catalog refresh tool” — partially absorbed by L1 introspection. Refresh is still deferred for non-introspecting providers (OR, Anthropic-native) but for ds4/llama-server it’s now a runtime property.

Still out of scope:

  • A startup probe that auto-discovers wire form via behavioral testing. (Cheap-introspection-at-construction is fine; expensive behavioral probing at startup remains rejected.)
  • A CI gate on last_probed_at staleness.
  • Model-card URL/citation fields in catalog entries.
  • vLLM introspection adapter — defer until we re-add a vLLM profile.

Newly identified bugs (filed as beads)

  • Surface-id vs catalog-id lookup mismatch (internal/config/config.go:925-944). Fix bundled into the populate bead (fizeau-b83502c0) so it lands together with catalog value updates.
  • Reasoning-token extraction from reasoning_content when usage.completion_tokens_details.reasoning_tokens is absent. Affects ds4 cells today. Filed as a separate bead (independent of L1/L2/L3 work).