private repo · research stage · single-tenant

Reserve

Private · Python + FastAPI + Redis + Postgres · Provider-agnostic LLM failover · 2026

The internal name is Citadel; the showcase name is Reserve.

Built for:
Apps where an LLM call sits on the critical path and a 429 from the primary provider means a broken UX, not just slower output.
Not built for:
Workloads where the answer must come from a specific model, full stop. Reserve assumes you can route to a peer of the primary if the primary is down.

The promise of a single AI provider is that you don’t have to think about reliability. The reality is that every provider has bad days, rate-limit cliffs, and quiet quality regressions, and your app gets to discover those at the worst possible moment. Reserve sits in front and reroutes around them.

§ I

The problem

Most production AI apps have a single point of failure: the model provider. When that provider rate-limits, throws 5xx, or silently regresses on a model version, the app degrades or breaks — and the engineering team finds out from users, not from monitoring. The default state of LLM infrastructure is “hopeful.”

Reserve makes the failover explicit and observable. Each request runs against a primary; if the primary errors, slows past a budget, or trips a quality gate, the request retries on a peer. The application layer never sees the difference; the operations layer sees the whole story.

§ II

Decisions

  1. kept · 2026-Q1

    A typed capability surface for providers, not a string-keyed registry. Adding Anthropic, OpenAI, Gemini, Mistral, and a self-hosted Ollama means a Python protocol class and a few hundred lines of adapter — not a config-driven black hole. (A sketch of one such adapter follows this list.)

  2. cut · 2026-Q1

    Streaming-aware failover mid-response. If the primary fails halfway through a streaming completion, Reserve does not silently restart on a peer — the partial response is surfaced and the application chooses. Hidden mid-stream switches are a debuggability disaster.

  3. kept · 2026-Q1

    A circuit breaker per provider, not per route. When OpenAI is having an incident, every route through OpenAI is suspect, not just the one that just failed. The breaker tracks at the provider level and opens for everyone using that provider, fast.

  4. deferred · 2026

    A self-hosted vector search subsystem to avoid Pinecone-class costs. The need is real but the scope is its own product; Reserve v1 stays focused on the failover layer.
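To make decision 1 concrete: a minimal sketch of what one peer adapter could look like under the Provider protocol shown in § III below. The class, the to_anthropic/from_anthropic mappers, the Budget field, and the exception constructors are illustrative assumptions, not Reserve's actual code; the point is the shape, two methods plus a breaker key.

sketch · python · hypothetical peer adapter

import httpx

class AnthropicProvider:
    name = "anthropic"
    breaker_key = "breaker:anthropic"

    def __init__(self, api_key: str) -> None:
        self._client = httpx.AsyncClient(
            base_url="https://api.anthropic.com",
            headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
        )

    async def complete(self, req: ChatRequest, budget: Budget) -> ChatResponse:
        # Translate Reserve's types to the provider's wire format and map
        # transport failures onto the exceptions route() knows how to catch.
        try:
            resp = await self._client.post(
                "/v1/messages",
                json=to_anthropic(req),        # hypothetical mapper
                timeout=budget.remaining_s,    # assumed Budget field
            )
        except httpx.TimeoutException as e:
            raise Timeout(self.name) from e
        if resp.status_code == 429:
            raise RateLimit(self.name)
        if resp.status_code >= 500:
            raise ProviderError(self.name, resp.status_code)
        return from_anthropic(resp.json())     # hypothetical mapper

    async def health(self) -> HealthSignal: ...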

§ III

System

[Architecture diagram: client → EDGE (FastAPI · OpenAI-compatible · 1 endpoint) → BREAKER (per provider · redis · sliding-window err rate) → Anthropic / OpenAI / Gemini / Mistral / Ollama (local), plus JUDGE (1% sample · flags silent regressions) and AUDIT (Postgres · every req · provider · latency). Brass path = primary lane; when a breaker opens, the next provider takes over.]
FIGURE 1. A request through the breaker hits the primary; on error or quality miss, the next provider in the rank takes the call — the application sees one endpoint.
Stack — current pins.
Layer     Implementation      Purpose
Edge      FastAPI + uvicorn   Single OpenAI-compatible endpoint surface
Routing   Provider protocol   Typed adapters · per-provider quotas
Breakers  Redis-backed        Per-provider circuit · sliding-window error rate
Quality   Sampled judge       1% of responses graded; regressions flagged
Audit     Postgres            Every request · provider · latency · outcome
Metrics   OpenTelemetry       p50/p95/p99 per provider per minute
reserve/providers/protocol.py · python · provider protocol
from typing import Protocol

# A provider is a typed capability surface, not a config blob.
# Adding a peer means implementing two methods + a circuit name —
# the rest of Reserve doesn't change. (ChatRequest, ChatResponse, Budget,
# HealthSignal, and the error types are assumed defined elsewhere in reserve/.)
class Provider(Protocol):
    name: str
    breaker_key: str

    async def complete(
        self,
        req: ChatRequest,
        budget: Budget,
    ) -> ChatResponse: ...

    async def health(self) -> HealthSignal: ...

# Failover is explicit: try primary, observe, fall back to peer.
async def route(req: ChatRequest) -> ChatResponse:
    primary, peers = pick(req)
    if breaker.open(primary.breaker_key):
        return await failover(peers, req)
    try:
        return await primary.complete(req, budget=req.budget)
    except (Timeout, RateLimit, ProviderError) as e:
        breaker.record(primary.breaker_key, e)
        return await failover(peers, req)
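route() leans on a failover() helper this excerpt doesn't show. A minimal sketch of one plausible shape, reusing the breaker and exception types above; the peer loop and the AllProvidersDown error are illustrative assumptions, not Reserve's actual code.

sketch · python · hypothetical failover helper

async def failover(peers: list[Provider], req: ChatRequest) -> ChatResponse:
    last_exc: Exception | None = None
    for peer in peers:
        if breaker.open(peer.breaker_key):
            continue  # a tripped peer is skipped, not retried
        try:
            return await peer.complete(req, budget=req.budget)
        except (Timeout, RateLimit, ProviderError) as e:
            breaker.record(peer.breaker_key, e)
            last_exc = e
    # Every peer was open or failed: surface the failure to the edge
    # instead of hiding it.
    raise AllProvidersDown(req) from last_exc  # illustrative error type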
breaker.events.log · ndjson · operations
{"t":"02:14:09Z","provider":"openai","event":"trip","window":"30s","err_rate":0.41,"reason":"5xx"}
{"t":"02:14:09Z","provider":"openai","state":"open","cooldown_until":"02:14:39Z"}
{"t":"02:14:09Z","route":"chat","peer":"anthropic","reason":"failover"}
{"t":"02:14:39Z","provider":"openai","event":"probe","state":"half-open"}
{"t":"02:14:40Z","provider":"openai","event":"recover","state":"closed","probe_ms":612}
{"t":"02:14:40Z","route":"chat","peer":"openai","reason":"primary-restored"}
FIGURE. One provider trip, one failover, one probe-and-recover. The breaker is per provider — when OpenAI tripped, every route through OpenAI failed over until the half-open probe came back clean.
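The trip → open → half-open → recover cycle in that log maps onto a small amount of Redis state. A standalone sketch of the sliding-window breaker, assuming redis-py's asyncio client; the key names, thresholds, and awaited open() are illustrative (route() above calls the breaker synchronously), not Reserve's actual implementation.

sketch · python · hypothetical Redis breaker

import time
from redis.asyncio import Redis

WINDOW_S = 30      # sliding window, matching the log above
TRIP_RATE = 0.40   # illustrative; the log shows a trip at err_rate 0.41
COOLDOWN_S = 30    # after this, the next request is the half-open probe

class Breaker:
    def __init__(self, redis: Redis) -> None:
        self.redis = redis

    async def record(self, key: str, exc: Exception | None) -> None:
        # One sorted-set member per request, scored by timestamp; errors
        # land in a second set so the window rate is errs / reqs. (route()
        # above records only errors; a full window also records successes
        # with exc=None.)
        now = time.time()
        member = f"{now}"  # production would use a unique request id
        await self.redis.zadd(f"{key}:reqs", {member: now})
        if exc is not None:
            await self.redis.zadd(f"{key}:errs", {member: now})
        floor = now - WINDOW_S
        await self.redis.zremrangebyscore(f"{key}:reqs", 0, floor)
        await self.redis.zremrangebyscore(f"{key}:errs", 0, floor)
        reqs = await self.redis.zcard(f"{key}:reqs")
        errs = await self.redis.zcard(f"{key}:errs")
        if reqs and errs / reqs >= TRIP_RATE:
            # Open for every route on this provider; the key expiring is
            # what moves the breaker to half-open.
            await self.redis.set(f"{key}:open", 1, ex=COOLDOWN_S)

    async def open(self, key: str) -> bool:
        return bool(await self.redis.exists(f"{key}:open"))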
[Screenshot: Reserve provider monitoring panel — four providers with health and latency, request-flow log showing one openai → anthropic failover, latency p99 sparkline.]
FIGURE. The operations view of one Gemini degradation incident. The breaker tripped at 22:15:26 and every chat call routed away until the half-open probe came back inside the budget.
§ IV

What I’d do differently

The quality-judge sampler should have been built into v0, not bolted on. Failover by error code is necessary but insufficient — a provider that returns 200 with quietly degraded answers is the harder failure mode, and only a sampling judge surfaces it.
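A minimal sketch of the shape that sampler could take; the grading prompt, score parser, and flag sink are hypothetical helpers, and only the 1% rate comes from the stack table above.

sketch · python · hypothetical sampling judge

import random

SAMPLE_RATE = 0.01   # 1% of responses, per the stack table
FLAG_BELOW = 0.6     # illustrative quality threshold

async def maybe_judge(req: ChatRequest, resp: ChatResponse, judge: Provider) -> None:
    if random.random() >= SAMPLE_RATE:
        return
    # Grade the response with a peer model: a 200 with a quietly degraded
    # answer is the failure mode error-code failover can't see.
    verdict = await judge.complete(grading_prompt(req, resp), budget=req.budget)
    score = parse_score(verdict)                  # hypothetical: 0..1 grade
    if score < FLAG_BELOW:
        await flag_regression(req, resp, score)   # hypothetical sink → audit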

Acknowledgments

Reserve stands on FastAPI, Redis, Postgres, OpenTelemetry, and the published OpenAI-compatible API surface that lets a router layer be agnostic to which actual model handles the work.
