private repo · research stage · single-tenant

Reserve

Private · Python + FastAPI + Redis + Postgres · Provider-agnostic LLM failover · 2026

The internal name is Citadel; the showcase name is Reserve.

Built for:
Apps where an LLM call sits on the critical path and a 429 from the primary provider means a broken UX, not just slower output.
Not built for:
Workloads where the answer must come from a specific model, full stop. Reserve assumes you can route to a peer of the primary if the primary is down.

The promise of a single AI provider is that you don’t have to think about reliability. The reality is that every provider has bad days, rate-limit cliffs, and quiet quality regressions, and your app gets to discover those at the worst possible moment. Reserve sits in front and reroutes around them.

§ I

The problem

Most production AI apps have a single point of failure: the model provider. When that provider rate-limits, throws 5xx, or silently regresses on a model version, the app degrades or breaks — and the engineering team finds out from users, not from monitoring. The default state of LLM infrastructure is “hopeful.”

Reserve makes the failover explicit and observable. Each request runs against a primary; if the primary errors, slows past a budget, or trips a quality gate, the request retries on a peer. The application layer never sees the difference; the operations layer sees the whole story.

§ II

Decisions

  1. kept · 2026-Q1

    A typed capability surface for providers, not a string-keyed registry. Adding Anthropic, OpenAI, Gemini, Mistral, and a self-hosted Ollama means a Python protocol class and a few hundred lines of adapter — not a config-driven black hole. (A sketch of one such adapter follows this list.)

  2. cut · 2026-Q1

    Streaming-aware failover mid-response. If the primary fails halfway through a streaming completion, Reserve does not silently restart on a peer — the partial response is surfaced and the application chooses. Hidden mid-stream switches are a debuggability disaster.

  3. kept · 2026-Q1

    A circuit breaker per provider, not per route. When OpenAI is having an incident, every route through OpenAI is suspect, not just the one that just failed. The breaker tracks at the provider level and opens for everyone using that provider, fast.

  4. deferred · 2026

    A self-hosted vector search subsystem to avoid Pinecone-class costs. The need is real but the scope is its own product; Reserve v1 stays focused on the failover layer.
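To make decision 1 concrete: a minimal sketch of what one peer adapter could look like under the Provider protocol shown in § III below. The class, the to_anthropic/from_anthropic mappers, the Budget field, and the exception constructors are illustrative assumptions, not Reserve's actual code; the point is the shape, two methods plus a breaker key.

sketch · python · hypothetical peer adapter

import httpx

class AnthropicProvider:
    name = "anthropic"
    breaker_key = "breaker:anthropic"

    def __init__(self, api_key: str) -> None:
        self._client = httpx.AsyncClient(
            base_url="https://api.anthropic.com",
            headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
        )

    async def complete(self, req: ChatRequest, budget: Budget) -> ChatResponse:
        # Translate Reserve's types to the provider's wire format and map
        # transport failures onto the exceptions route() knows how to catch.
        try:
            resp = await self._client.post(
                "/v1/messages",
                json=to_anthropic(req),        # hypothetical mapper
                timeout=budget.remaining_s,    # assumed Budget field
            )
        except httpx.TimeoutException as e:
            raise Timeout(self.name) from e
        if resp.status_code == 429:
            raise RateLimit(self.name)
        if resp.status_code >= 500:
            raise ProviderError(self.name, resp.status_code)
        return from_anthropic(resp.json())     # hypothetical mapper

    async def health(self) -> HealthSignal: ...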

§ III

System

[Architecture diagram: client → EDGE (FastAPI · OpenAI-compatible · 1 endpoint) → BREAKER (per provider · redis · sliding-window err rate) → Anthropic / OpenAI / Gemini / Mistral / Ollama (local), plus JUDGE (1% sample · flags silent regressions) and AUDIT (Postgres · every req · provider · latency). Brass path = primary lane; when a breaker opens, the next provider takes over.]
FIGURE 1. A request through the breaker hits the primary; on error or quality miss, the next provider in the rank takes the call — the application sees one endpoint.
Stack — current pins.
Layer     Implementation      Purpose
Edge      FastAPI + uvicorn   Single OpenAI-compatible endpoint surface
Routing   Provider protocol   Typed adapters · per-provider quotas
Breakers  Redis-backed        Per-provider circuit · sliding-window error rate
Quality   Sampled judge       1% of responses graded; regressions flagged
Audit     Postgres            Every request · provider · latency · outcome
Metrics   OpenTelemetry       p50/p95/p99 per provider per minute
reserve/providers/protocol.py · python · provider protocol
from typing import Protocol

# A provider is a typed capability surface, not a config blob.
# Adding a peer means implementing two methods + a circuit name —
# the rest of Reserve doesn't change. (ChatRequest, ChatResponse, Budget,
# HealthSignal, and the error types are assumed defined elsewhere in reserve/.)
class Provider(Protocol):
    name: str
    breaker_key: str

    async def complete(
        self,
        req: ChatRequest,
        budget: Budget,
    ) -> ChatResponse: ...

    async def health(self) -> HealthSignal: ...

# Failover is explicit: try primary, observe, fall back to peer.
async def route(req: ChatRequest) -> ChatResponse:
    primary, peers = pick(req)
    if breaker.open(primary.breaker_key):
        return await failover(peers, req)
    try:
        return await primary.complete(req, budget=req.budget)
    except (Timeout, RateLimit, ProviderError) as e:
        breaker.record(primary.breaker_key, e)
        return await failover(peers, req)
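route() leans on a failover() helper this excerpt doesn't show. A minimal sketch of one plausible shape, reusing the breaker and exception types above; the peer loop and the AllProvidersDown error are illustrative assumptions, not Reserve's actual code.

sketch · python · hypothetical failover helper

async def failover(peers: list[Provider], req: ChatRequest) -> ChatResponse:
    last_exc: Exception | None = None
    for peer in peers:
        if breaker.open(peer.breaker_key):
            continue  # a tripped peer is skipped, not retried
        try:
            return await peer.complete(req, budget=req.budget)
        except (Timeout, RateLimit, ProviderError) as e:
            breaker.record(peer.breaker_key, e)
            last_exc = e
    # Every peer was open or failed: surface the failure to the edge
    # instead of hiding it.
    raise AllProvidersDown(req) from last_exc  # illustrative error type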
breaker.events.log · ndjson · operations
{"t":"02:14:09Z","provider":"openai","event":"trip","window":"30s","err_rate":0.41,"reason":"5xx"}
{"t":"02:14:09Z","provider":"openai","state":"open","cooldown_until":"02:14:39Z"}
{"t":"02:14:09Z","route":"chat","peer":"anthropic","reason":"failover"}
{"t":"02:14:39Z","provider":"openai","event":"probe","state":"half-open"}
{"t":"02:14:40Z","provider":"openai","event":"recover","state":"closed","probe_ms":612}
{"t":"02:14:40Z","route":"chat","peer":"openai","reason":"primary-restored"}
FIGURE. One provider trip, one failover, one probe-and-recover. The breaker is per provider — when OpenAI tripped, every route through OpenAI failed over until the half-open probe came back clean.
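The trip → open → half-open → recover cycle in that log maps onto a small amount of Redis state. A standalone sketch of the sliding-window breaker, assuming redis-py's asyncio client; the key names, thresholds, and awaited open() are illustrative (route() above calls the breaker synchronously), not Reserve's actual implementation.

sketch · python · hypothetical Redis breaker

import time
from redis.asyncio import Redis

WINDOW_S = 30      # sliding window, matching the log above
TRIP_RATE = 0.40   # illustrative; the log shows a trip at err_rate 0.41
COOLDOWN_S = 30    # after this, the next request is the half-open probe

class Breaker:
    def __init__(self, redis: Redis) -> None:
        self.redis = redis

    async def record(self, key: str, exc: Exception | None) -> None:
        # One sorted-set member per request, scored by timestamp; errors
        # land in a second set so the window rate is errs / reqs. (route()
        # above records only errors; a full window also records successes
        # with exc=None.)
        now = time.time()
        member = f"{now}"  # production would use a unique request id
        await self.redis.zadd(f"{key}:reqs", {member: now})
        if exc is not None:
            await self.redis.zadd(f"{key}:errs", {member: now})
        floor = now - WINDOW_S
        await self.redis.zremrangebyscore(f"{key}:reqs", 0, floor)
        await self.redis.zremrangebyscore(f"{key}:errs", 0, floor)
        reqs = await self.redis.zcard(f"{key}:reqs")
        errs = await self.redis.zcard(f"{key}:errs")
        if reqs and errs / reqs >= TRIP_RATE:
            # Open for every route on this provider; the key expiring is
            # what moves the breaker to half-open.
            await self.redis.set(f"{key}:open", 1, ex=COOLDOWN_S)

    async def open(self, key: str) -> bool:
        return bool(await self.redis.exists(f"{key}:open"))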
[Screenshot: Reserve provider monitoring panel — four providers with health and latency, request-flow log showing one openai → anthropic failover, latency p99 sparkline.]
FIGURE. The operations view of one Gemini degradation incident. The breaker tripped at 22:15:26 and every chat call routed away until the half-open probe came back inside the budget.
§ IV

What I’d do differently

The quality-judge sampler should have been built into v0, not bolted on. Failover by error code is necessary but insufficient — a provider that returns 200 with quietly degraded answers is the harder failure mode, and only a sampling judge surfaces it.
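A minimal sketch of the shape that sampler could take; the grading prompt, score parser, and flag sink are hypothetical helpers, and only the 1% rate comes from the stack table above.

sketch · python · hypothetical sampling judge

import random

SAMPLE_RATE = 0.01   # 1% of responses, per the stack table
FLAG_BELOW = 0.6     # illustrative quality threshold

async def maybe_judge(req: ChatRequest, resp: ChatResponse, judge: Provider) -> None:
    if random.random() >= SAMPLE_RATE:
        return
    # Grade the response with a peer model: a 200 with a quietly degraded
    # answer is the failure mode error-code failover can't see.
    verdict = await judge.complete(grading_prompt(req, resp), budget=req.budget)
    score = parse_score(verdict)                  # hypothetical: 0..1 grade
    if score < FLAG_BELOW:
        await flag_regression(req, resp, score)   # hypothetical sink → audit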

Acknowledgments

Reserve stands on FastAPI, Redis, Postgres, OpenTelemetry, and the published OpenAI-compatible API surface that lets a router layer be agnostic to which actual model handles the work.
