ADR-011: Cost-Based Routing With Quota Pools
| Date | Status | Deciders | Related | Confidence |
|---|---|---|---|---|
| 2026-05-12 | Proposed | Fizeau maintainers | ADR-005, ADR-006, ADR-007, fizeau-c04be6b0, fizeau-d18e11f5 | Medium |
Context
ADR-005 established automatic routing over a joined candidate inventory. ADR-006 framed explicit harness, provider, and model pins as override signals rather than the normal operating mode. ADR-007 moved generation defaults into the catalog so callers do not compensate for missing policy with ad hoc request knobs. ADR-009 later tightened the public surface around policies, numeric power bounds, and hard pins.
The remaining routing gap is cost. For a given required model power, the router should prefer the cheapest qualified candidate that is likely to complete the request. “Cheapest” cannot mean list price alone because Fizeau can dispatch through several billing shapes:
- subscription harnesses, where marginal cost is effectively prepaid until a quota pool is exhausted;
- per-token APIs, where each request burns metered input and output tokens;
- local or fixed-cost providers, where marginal cost is zero but only if the provider is eligible for automatic routing;
- provider surfaces excluded from unpinned automatic routing by IncludeByDefault or metered-spend policy, where accidental spend or special-purpose endpoints must be impossible unless the caller explicitly pins them.
Recent implementation beads provide two pieces this ADR can rely on without
specifying new code in this bead. fizeau-c04be6b0 added catalog
quota_pool schema and the default semantic that missing pool values derive
from the provider system. fizeau-d18e11f5 wired configured provider
IncludeByDefault into routing eligibility, so excluded providers are removed
from unpinned automatic candidate sets and remain reachable only through
explicit pins.
Operator direction on 2026-05-11 was: for a given model power, choose intelligently based on cost, and let remaining usage quota influence that cost. This ADR defines that policy. Implementation follows in a separate epic after the ADR is accepted.
Decision
Candidate Eligibility Comes Before Cost
The router first builds the joined inventory described by ADR-005 and filters it before any cost ranking:
- Apply hard pins for harness, provider, and exact model identity.
- Drop candidates that fail policy requirements, capability requirements, health, context, tool, reasoning, or catalog auto-routing gates.
- Apply IncludeByDefault: a provider or catalog entry with IncludeByDefault: false is absent from the unpinned automatic candidate set. This is a hard filter, not a cost penalty.
- Apply metered opt-in: a pay-per-token provider is absent from the unpinned automatic candidate set unless provider default inclusion and explicit metered-spend opt-in both allow it.
- Apply quota-pool exhaustion: when a quota pool is known exhausted, drop every candidate in that pool.
Only surviving candidates receive cost scores. This preserves the ADR-006 principle that manual pins are override signals and prevents price scoring from accidentally making an excluded provider “almost eligible.”
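The filter order above can be sketched as a single pass over the joined inventory. This is a minimal illustration, not the planned implementation: the `Candidate` fields, the `eligible` function, and the collapsed `passes_gates` flag are hypothetical names, and the sketch assumes that an explicit provider pin bypasses the default-inclusion and metered opt-in filters while exhaustion still applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    provider: str
    model: str
    quota_pool: str
    include_by_default: bool = True  # mirrors the IncludeByDefault flag
    metered: bool = False            # pay-per-token billing
    passes_gates: bool = True        # policy, capability, health, context gates collapsed


def eligible(candidates, *, pinned_provider=None, metered_opt_in=False,
             exhausted_pools=frozenset()):
    """Return candidates that survive the hard filters, in input order."""
    pinned = pinned_provider is not None
    survivors = []
    for c in candidates:
        if pinned and c.provider != pinned_provider:
            continue  # hard pin: only the pinned provider remains
        if not c.passes_gates:
            continue  # failed a policy, capability, or health gate
        if not pinned and not c.include_by_default:
            continue  # hard filter, not a cost penalty
        if not pinned and c.metered and not metered_opt_in:
            continue  # metered spend needs explicit opt-in
        if c.quota_pool in exhausted_pools:
            continue  # known-exhausted pools are dropped whole
        survivors.append(c)
    return survivors
```

Only the list returned by `eligible` would be handed to cost scoring, so an excluded provider never receives a score at all.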
Power bounds are deliberately not a hard eligibility filter in this ADR. FEAT-004
and ADR-009 define MinPower and MaxPower as soft scoring inputs after
auto-routability gates: undershooting the requested floor is penalized more
heavily than overshooting the requested ceiling, but an otherwise eligible model
outside the band may still be ranked when its cost, health, and capability trade
off is better than the alternatives. Exact model pins keep exact identity and do
not substitute another model to satisfy power hints.
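The asymmetry between undershoot and overshoot can be illustrated with a small penalty function. FEAT-004 and ADR-009 own the real scoring; the function name and the `under_weight` and `over_weight` values here are hypothetical and chosen only to show undershoot costing more than overshoot.

```python
def power_fit_penalty(power: int, min_power: int, max_power: int,
                      under_weight: float = 2.0,
                      over_weight: float = 1.0) -> float:
    """Soft penalty for a candidate outside the requested power band.

    The weights are illustrative assumptions, not values from FEAT-004.
    """
    if power < min_power:
        # Undershooting the floor is penalized more heavily.
        return under_weight * (min_power - power)
    if power > max_power:
        # Overshooting the ceiling is penalized, but less.
        return over_weight * (power - max_power)
    return 0.0  # inside the band: no penalty
```

Because the result is a score rather than a filter, an out-of-band model stays in the ranked set and can still win when its cost, health, and capability trade-off is better.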
Effective Cost Formula
Effective cost is request-local and expressed in normalized dollars so subscription, metered, and local candidates can be compared in one ranking.
For per-token APIs:

    effective_cost =
        cost_input_per_m * estimated_input_tokens / 1_000_000
      + cost_output_per_m * estimated_output_tokens / 1_000_000

For local or fixed-cost providers, effective_cost = 0 after eligibility
filters. They can still lose to candidates with a better policy power fit when
the local candidate is materially underpowered for the requested policy, and
they still carry latency, reliability, and health signals outside this ADR’s
cost term.
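The per-token formula in this section can be worked through with a short function. The prices and token counts in the example are made-up inputs, not catalog values, and `per_token_cost` is a hypothetical name.

```python
def per_token_cost(cost_input_per_m: float, cost_output_per_m: float,
                   estimated_input_tokens: int,
                   estimated_output_tokens: int) -> float:
    """Effective cost in dollars for a per-token API candidate."""
    return (cost_input_per_m * estimated_input_tokens
            + cost_output_per_m * estimated_output_tokens) / 1_000_000


# Hypothetical candidate: $3 per million input tokens, $15 per million output
# tokens, with 12,000 estimated input tokens and a 4,096-token output budget.
cost = per_token_cost(3.0, 15.0, 12_000, 4_096)
print(f"{cost:.5f}")  # 0.09744
```

A local or fixed-cost candidate skips this calculation and contributes 0 to the cost term once it has passed the eligibility filters.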
For subscription candidates, the router computes the same nominal per-request metered cost when catalog price data exists, then discounts it by quota fraction:

    quota_fraction = remaining_quota / quota_limit
    if quota_fraction <= 0:
        drop the quota pool
    if quota_fraction >= 0.20:
        effective_cost = 0
    else:
        effective_cost = nominal_metered_cost * (1 - quota_fraction / 0.20)

This quota fraction mapping is deliberately simple. Healthy prepaid quota is
free at the margin. The final 20 percent of a pool is still usable, but it
acquires a linear scarcity cost so another qualified subscription pool, a local
candidate, or a cheap metered API that is explicitly opted into automatic
routing can win before the pool reaches zero. When the catalog lacks a
comparable per-token price for a subscription model, the router uses the
cheapest known per-token model in the same provider family and power band as the
nominal cost proxy; if no proxy exists, the scarcity cost is 0 until the pool
is exhausted.
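The quota fraction mapping in this section could look like the following sketch. The function name, the `None` return used to mean "drop the pool", and the handling of a non-positive `quota_limit` are assumptions made for illustration; a missing nominal cost is treated as 0, matching the no-proxy rule.

```python
from typing import Optional

SCARCITY_THRESHOLD = 0.20  # the final 20 percent of a pool carries scarcity cost


def subscription_cost(remaining_quota: float, quota_limit: float,
                      nominal_metered_cost: Optional[float]) -> Optional[float]:
    """Return the effective cost, or None when the pool should be dropped."""
    if quota_limit <= 0:
        return None  # assumption: a pool with no limit data is unusable here
    quota_fraction = remaining_quota / quota_limit
    if quota_fraction <= 0:
        return None  # exhausted: drop the quota pool
    if quota_fraction >= SCARCITY_THRESHOLD:
        return 0.0   # healthy prepaid quota is free at the margin
    nominal = nominal_metered_cost or 0.0  # no price proxy: scarcity cost is 0
    # Linear ramp from 0 at the threshold up to the nominal cost near empty.
    return nominal * (1 - quota_fraction / SCARCITY_THRESHOLD)
```

For example, a pool with 10 percent of its quota left and a nominal request cost of $0.10 is scored at about $0.05, halfway up the ramp.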
Quota Pool Semantics
Models belong to quota pools. A catalog quota_pool value names an explicit
pool; otherwise the effective pool is the provider system, matching
fizeau-c04be6b0.
Pool exhaustion is a hard candidate-set event. If the main OpenAI Codex pool is
exhausted, every model in that pool is dropped together. A model in a different
pool, such as gpt-5.3-codex-spark in openai-codex-spark, remains eligible
when it passes auto-routability, capability, and policy filters and remains
competitive under the caller’s soft power-fit score, even if it is not the
newest model in the family. This intentionally enables “use spark when the main
pool is empty” without encoding that fallback as a version-newest rule.
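A small sketch of the pool semantics described here, assuming catalog entries are plain dictionaries. The `openai-codex` provider system name and the `codex-main-model` identifier are placeholders; only gpt-5.3-codex-spark and openai-codex-spark come from this ADR.

```python
def effective_pool(model: dict) -> str:
    """Explicit quota_pool if set, otherwise the provider system."""
    return model.get("quota_pool") or model["provider_system"]


def drop_exhausted(models, exhausted_pools):
    """Remove every model whose effective pool is known exhausted."""
    return [m for m in models if effective_pool(m) not in exhausted_pools]


catalog = [
    # Placeholder entry standing in for any model in the main pool.
    {"id": "codex-main-model", "provider_system": "openai-codex"},
    {"id": "gpt-5.3-codex-spark", "provider_system": "openai-codex",
     "quota_pool": "openai-codex-spark"},
]

survivors = drop_exhausted(catalog, {"openai-codex"})
print([m["id"] for m in survivors])  # ['gpt-5.3-codex-spark']
```

The spark model survives because its effective pool differs from the main pool, not because of any version-ordering rule.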
Quota signals are consumed in this priority order:
- explicit subscription usage endpoints, when available;
- HTTP rate-limit or quota-exhaustion response headers, including the quota_exhausted signal already plumbed by commit 7776890e;
- in-memory recent-rejection rate for the same quota pool.
Known exhaustion wins over stale positive data. Unknown quota is not treated as exhausted; it is scored as scarce only when recent rejection evidence indicates probable depletion.
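One way to combine the three signals is sketched below. It simplifies in ways the ADR does not specify: headers are reduced to a single exhaustion flag, staleness is not modeled beyond letting any exhaustion signal win, and the `rejection_threshold` value is a hypothetical placeholder. All names are illustrative.

```python
from enum import Enum
from typing import Optional


class QuotaState(Enum):
    HEALTHY = "healthy"
    SCARCE = "scarce"
    EXHAUSTED = "exhausted"
    UNKNOWN = "unknown"


def resolve_quota_state(usage_fraction: Optional[float] = None,
                        header_exhausted: bool = False,
                        rejection_rate: Optional[float] = None,
                        rejection_threshold: float = 0.5) -> QuotaState:
    # Known exhaustion from any source wins over possibly stale positive data.
    if header_exhausted or (usage_fraction is not None and usage_fraction <= 0):
        return QuotaState.EXHAUSTED
    # First priority: an explicit subscription usage endpoint.
    if usage_fraction is not None:
        return QuotaState.SCARCE if usage_fraction < 0.20 else QuotaState.HEALTHY
    # Second priority (headers) only carries exhaustion here, handled above.
    # Third priority: recent rejections suggest probable depletion.
    if rejection_rate is not None and rejection_rate >= rejection_threshold:
        return QuotaState.SCARCE
    # Unknown quota is not treated as exhausted.
    return QuotaState.UNKNOWN
```

Returning `UNKNOWN` rather than `EXHAUSTED` when no signal is present keeps a pool routable until there is evidence against it.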
Quota Window Handling
In v1, quota window handling is local and greedy. The router uses quota available now and does not reserve quota for later requests, future sessions, or higher-value work. Reset time may appear in trace output and may clear a stale exhaustion state when observed, but it does not create a reservation policy. The quota window rule is therefore “available now,” not “save for later.”
This keeps routing deterministic and avoids a scheduler hidden inside the router. Per-tenant, per-key, and priority reservation behavior is future scope.
Token Estimate Source
The token estimate source for v1 is the request itself plus a coarse heuristic:
- use caller-provided EstimatedPromptTokens when present;
- otherwise estimate input tokens from system prompt, retained transcript, user prompt, and tool schemas using the provider tokenizer when one is available;
- fall back to ceil(bytes / 4) when no tokenizer is available.
Estimated output tokens come from the request’s explicit max-output setting when present. Otherwise v1 uses a fixed default budget by power band:
| Candidate power | Estimated output tokens |
|---|---|
| 1-4 | 2,048 |
| 5-7 | 4,096 |
| 8-10 | 8,192 |
The estimate is intentionally conservative enough for ranking and not intended to predict final billing exactly. Actual usage continues to be recorded from provider responses for reporting and future tuning.
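The estimation rules in this section can be sketched as two small functions. The function names and the `tokenizer` callable (any function returning a sequence of tokens) are assumptions for illustration; the byte heuristic and the output budgets come from the text and table above.

```python
import math
from typing import Callable, Iterable, Optional, Sequence


def estimate_input_tokens(parts: Iterable[str],
                          estimated_prompt_tokens: Optional[int] = None,
                          tokenizer: Optional[Callable[[str], Sequence]] = None) -> int:
    """parts: system prompt, retained transcript, user prompt, tool schemas."""
    if estimated_prompt_tokens is not None:
        return estimated_prompt_tokens      # caller-provided estimate wins
    text = "".join(parts)
    if tokenizer is not None:
        return len(tokenizer(text))         # provider tokenizer when available
    return math.ceil(len(text.encode("utf-8")) / 4)  # ceil(bytes / 4) fallback


def estimate_output_tokens(power: int,
                           max_output_tokens: Optional[int] = None) -> int:
    if max_output_tokens is not None:
        return max_output_tokens            # explicit max-output setting
    if not 1 <= power <= 10:
        raise ValueError(f"power out of range: {power}")
    if power <= 4:
        return 2048
    if power <= 7:
        return 4096
    return 8192
```

With no tokenizer, the 11-byte string "hello world" is estimated at 3 input tokens, which is coarse but adequate for ranking.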
Ranking and Tie-Breaking
Among eligible candidates, choose the lowest effective cost candidate whose policy power fit is sufficient for the caller’s intent. This ADR makes “cheapest qualified candidate” the primary rule, not “most powerful available candidate.” “Qualified” means the candidate passed hard constraints, policy requirements, default-inclusion and metered opt-in gates, auto-routability, liveness, quota, and capability gates, then remains competitive after the FEAT-004 soft power-fit score is applied. A free but substantially underpowered candidate should not beat an in-band routine implementation candidate solely because it is free.
Tie-breaking is deterministic:
- prefer subscription candidates over pay-as-you-go candidates when effective cost is equal;
- prefer local or fixed-cost candidates over pay-as-you-go candidates when effective cost is equal and capability requirements are still satisfied;
- prefer the lower-power candidate when both have acceptable power fit;
- prefer healthier and lower-latency candidates using ADR-005 route-health signals;
- preserve stable catalog order only as the final tie-break.
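The ranking and tie-break order can be expressed as a lexicographic sort key. This is a sketch under assumptions: the `Scored` fields are hypothetical, power fit is reduced to a boolean, and subscription and local candidates share one billing rank because this ADR orders each of them only relative to pay-as-you-go candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    name: str
    effective_cost: float
    billing: str        # "subscription", "local", or "metered"
    power: int
    health: float       # higher is healthier (hypothetical scale)
    latency_ms: float
    catalog_index: int
    power_fit_ok: bool = True


def rank_key(c: Scored):
    billing_rank = 1 if c.billing == "metered" else 0  # pay-as-you-go loses ties
    return (c.effective_cost,   # primary rule: cheapest qualified candidate
            billing_rank,       # subscription or local before metered
            c.power,            # lower power when fit is acceptable
            -c.health,          # healthier first
            c.latency_ms,       # then lower latency
            c.catalog_index)    # stable catalog order as the final tie-break


def choose(candidates):
    qualified = [c for c in candidates if c.power_fit_ok]
    return min(qualified, key=rank_key) if qualified else None
```

Tuple comparison treats costs as tied only when they are exactly equal; a real scorer might round or bucket costs before comparing.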
Fallback Chain On Mid-Request Exhaustion
The fallback chain is caller-owned retry, consistent with ADR-005. The service dispatches the top-ranked candidate once. If dispatch fails because the quota pool becomes exhausted mid-request or the provider returns a quota/rate-limit error, the service records the pool exhaustion signal, emits the failed attempt’s trace, and returns the error.
The caller may issue a new request. On that next routing pass, the exhausted quota pool is filtered out and the next cheapest eligible pool can win. The router does not silently retry an alternate model inside the same service request because that would hide cost, duplicate side effects, and blur the operator-visible route decision.
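The division of responsibility between service and caller might look like the sketch below. `QuotaExhausted`, `route_once`, `call_with_retry`, and the `dispatch` callable are hypothetical names; the ranked list is assumed to be sorted cheapest first.

```python
class QuotaExhausted(Exception):
    """Hypothetical error for a quota or rate-limit rejection."""

    def __init__(self, pool):
        super().__init__(f"quota pool exhausted: {pool}")
        self.pool = pool


def route_once(ranked, exhausted_pools, dispatch):
    """Service side: one routing pass and exactly one dispatch attempt."""
    for pool, model in ranked:
        if pool in exhausted_pools:
            continue  # filtered out on this routing pass
        try:
            return dispatch(pool, model)
        except QuotaExhausted:
            exhausted_pools.add(pool)  # record the signal for later requests
            raise                      # no silent retry inside the request
    raise LookupError("no eligible candidate")


def call_with_retry(ranked, exhausted_pools, dispatch, max_attempts=2):
    """Caller side: each attempt is a new, separately traced request."""
    for attempt in range(max_attempts):
        try:
            return route_once(ranked, exhausted_pools, dispatch)
        except QuotaExhausted:
            if attempt == max_attempts - 1:
                raise
```

Keeping the retry loop in the caller means every attempt produces its own visible route decision and cost record.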
Consequences
Positive
- Cost routing now reflects the real marginal economics of subscriptions: healthy quota is free at the margin, scarce quota is protected by a rising scarcity cost, and exhausted pools are removed.
- Quota pools provide an explicit mechanism for fallback across independent subscription allocations without relying on model recency.
- IncludeByDefault and metered opt-in compose cleanly with cost ranking because excluded providers never enter the unpinned automatic ranked set.
- The routing trace can explain cost decisions with concrete inputs: estimated tokens, price data, quota fraction, quota pool, and filter reason.
Negative
- The 20 percent quota fraction threshold is a policy choice, not an empirical optimum. It should be tuned after route traces show real depletion behavior.
- Coarse token estimates can mis-rank close candidates. This is acceptable for v1 because actual billing remains visible and estimates can improve without changing the public contract.
- Subscription models without comparable catalog pricing cannot express scarcity cost until proxy data exists; they remain zero-cost until exhausted.
- No within-request retry means callers may see one quota error before fallback takes effect on the next request.
Out of scope
- Runtime implementation of the scorer, candidate trace fields, or quota-pool filters. Those belong to the follow-up implementation epic.
- New quota signal infrastructure beyond existing rate-limit/header plumbing and future beads for subscription usage endpoints.
- Per-tenant, per-key, or priority quota reservations.
- Learning the quota fraction mapping from historical usage.
- Counterfactual dispatch to measure whether a more expensive candidate would have completed better.
References
- ADR-005: power-based automatic routing, candidate inventory, scoring, and caller-owned retry.
- ADR-006: manual pins are override signals, not the primary routing mode.
- ADR-007: catalog-owned generation policy; this ADR follows the same catalog-policy principle for cost and quota metadata.
- ADR-009: v0.11 routing surface, billing classification, and IncludeByDefault composition.
- fizeau-c04be6b0: catalog quota_pool field schema and effective-pool default semantics.
- fizeau-d18e11f5: provider IncludeByDefault routing eligibility filter.
- Commit 7776890e: existing quota_exhausted signal plumbing from HTTP rate-limit response handling.