High-availability inference: same model, every provider

By Göran Sandahl

When you call a model like aws/claude-sonnet-4-5-eu, you've actually made three choices: the model, the provider, and the region. And you've tied your app's uptime to a single endpoint.

That's fine until that one endpoint has a hiccup. The provider hits its rate limits and your traffic has nowhere to go. A region degrades and your calls start failing. A better-priced provider comes online, but you can only reach it by editing code. Every fix is a deploy, and none of them is really about your app. They're all about keeping tokens flowing.

What you actually want is something simpler: give me Claude Sonnet 4.5, and keep it available. Which cloud serves it, and from which region, is a routing decision, and that belongs in the gateway, not your code.

So that's how Opper serves models. You call a model by its plain name, and we route each request to a healthy provider that runs it. If one deployment fails, the next takes over, so your app never sees the outage. Same model, every provider, behind one name.

Name the model, not the deployment

On Opper you call a model by its plain, canonical name, with no provider prefix:

claude-sonnet-4-5
gpt-oss-120b
mistral-medium-3.5
claude-opus-4-7

Behind claude-sonnet-4-5 sit all of its deployments, be that Anthropic direct, Azure, AWS in the EU, or GCP in the US and EU. You don't enumerate them. You name the model and Opper serves it from one of them.

It works through the OpenAI-compatible endpoint, so this is the OpenAI SDK you already know. Instead of pinning a single deployment:

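import os

from openai import OpenAI

# Read the Opper key from the environment (the variable name is an assumption).
OPPER_API_KEY = os.environ["OPPER_API_KEY"]

client = OpenAI(
    base_url="https://api.opper.ai/compat/openai",
    api_key="-",  # must not be blank; the real key goes in the header below
    default_headers={"x-opper-api-key": OPPER_API_KEY},
)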

resp = client.chat.completions.create(
    model="aws/claude-sonnet-4-5-eu",  # one deployment, one region
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)

you name the model and let the gateway route it. Only the model line changes:

resp = client.chat.completions.create(
    model="claude-sonnet-4-5",  # every provider that runs it
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)

Same call and same response shape. The difference is that the second one keeps working when a provider has a bad day, and picks up new regions and deployments as we add them, without you touching the code.

You can still pin a single deployment with a full provider/model id whenever you need exact control. The plain name is the default, not the only option.
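One way to keep both options open without a code change is to read the model id from configuration. The sketch below is only an illustration: the environment variable name `PINNED_MODEL_ID` is hypothetical, and treating any id that contains a slash as pinned is an assumption based on the provider/model form shown above.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "claude-sonnet-4-5"  # plain name: the gateway picks the deployment


def resolve_model(env: Mapping[str, str] = os.environ) -> str:
    """Return a pinned id if one is configured, otherwise the plain name."""
    # PINNED_MODEL_ID is a hypothetical variable name used for illustration.
    return env.get("PINNED_MODEL_ID") or DEFAULT_MODEL


def is_pinned(model_id: str) -> bool:
    """Assumption: pinned ids carry a provider prefix, e.g. aws/claude-sonnet-4-5-eu."""
    return "/" in model_id
```

Passing `resolve_model()` as the `model` argument keeps the plain name as the default and leaves pinning as a deployment-time setting.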

What happens behind one name

Routing a plain name isn't round-robin roulette. Opper orders the deployments so the common cases are both fast and stable:

  • Sticky within a session. The first call in a session lands on a deployment, and the rest of that session prefers the same one. Providers bill the repeated part of a conversation at a discount when the same deployment handles every turn (a warm prompt cache), so staying put keeps that discount instead of resetting it.
  • Fallback on failure. If a deployment returns a 429 or 5xx, Opper moves to the next one and retries. For structured outputs it also retries JSON/XML parsing per model before moving on. The caller sees a result, not an outage. This is the same machinery described in Fallbacks and Aliases, now applied automatically to every deployment of a model.
  • Inside your policy. Routing only ever considers providers and models your org and project allow. If your rules permit only EU regions, a plain name resolves to EU deployments only, so claude-sonnet-4-5 stays in Europe without you hand-picking the EU id.
  • Even spread on cold start. With no session to be sticky to, deployments are rotated so load doesn't pile onto one provider.
flowchart LR
    A["Call claude-sonnet-4-5"] --> B["Expand to allowed deployments"]
    B --> C{"Seen this session?"}
    C -->|"Yes"| D["Reuse the warm one"]
    C -->|"No"| E["Pick the next healthy one"]
    D --> F["Send request"]
    E --> F
    F -->|"Success"| Z["Return result"]
    F -->|"429 / 5xx"| G["Try the next deployment"]
    G --> F

In addition, every attempt can optionally be traced, so you can see which deployment actually served a call, where fallbacks kicked in, and where spend went.
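The ordering rules above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Opper's implementation: the `Deployment`, `Router`, and `RetryableError` names are hypothetical, the ids other than aws/claude-sonnet-4-5-eu are made up, region policy is reduced to a set of allowed regions, and the per-model parsing retries for structured outputs are left out.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Optional


@dataclass
class Deployment:
    id: str      # illustrative id, e.g. "aws/claude-sonnet-4-5-eu"
    region: str  # "eu" or "us" in this sketch


class RetryableError(Exception):
    """Stands in for a 429 or 5xx response from a deployment."""


@dataclass
class Router:
    deployments: list[Deployment]
    allowed_regions: set[str]
    sticky: dict[str, str] = field(default_factory=dict)  # session id -> deployment id
    _cold_starts: count = field(default_factory=count)

    def candidates(self, session: Optional[str]) -> list[Deployment]:
        # Policy: only deployments in allowed regions are ever considered.
        allowed = [d for d in self.deployments if d.region in self.allowed_regions]
        if not allowed:
            return []
        warm = self.sticky.get(session) if session else None
        if warm is not None:
            # Sticky: the deployment that served this session goes first.
            return sorted(allowed, key=lambda d: d.id != warm)
        # Cold start: rotate the starting point so load is spread.
        start = next(self._cold_starts) % len(allowed)
        return allowed[start:] + allowed[:start]

    def call(self, send: Callable[[Deployment], str],
             session: Optional[str] = None) -> tuple[str, list[tuple[str, str]]]:
        attempts: list[tuple[str, str]] = []  # a minimal per-attempt trace
        for deployment in self.candidates(session):
            try:
                result = send(deployment)
            except RetryableError:
                # Fallback: record the failure and try the next deployment.
                attempts.append((deployment.id, "failed"))
                continue
            attempts.append((deployment.id, "ok"))
            if session:
                self.sticky[session] = deployment.id
            return result, attempts
        raise RuntimeError(f"all deployments failed: {attempts}")


router = Router(
    deployments=[
        Deployment("anthropic/claude-sonnet-4-5", "us"),  # made-up id
        Deployment("aws/claude-sonnet-4-5-eu", "eu"),
        Deployment("gcp/claude-sonnet-4-5-eu", "eu"),     # made-up id
    ],
    allowed_regions={"eu"},
)


def flaky_send(deployment: Deployment) -> str:
    # Simulate one provider being rate-limited.
    if deployment.id.startswith("aws/"):
        raise RetryableError("429")
    return f"served by {deployment.id}"


result, attempts = router.call(flaky_send, session="s1")
```

In this run the US deployment is never tried because the policy allows only EU regions, the AWS attempt fails, and the GCP deployment serves the call. A second call with the same session id goes straight to the GCP deployment.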

How this relates to aliases

If you've used Opper's aliases, this will feel familiar. Both turn one name into an ordered list of concrete deployments with fallback. The difference is who curates the list.

| | Plain model name | Alias |
|---|---|---|
| Who maintains it | Opper (platform) | You (per organization) |
| The name | The model's canonical name, e.g. claude-sonnet-4-5 | Whatever you choose, e.g. sonnet, fast, default-llm |
| What it expands to | Every provider and region that runs that model | The exact ordered list you define |
| Stays current as new regions/providers appear | Automatically | When you edit it |
| Best for | "Give me this model, wherever it runs best" | Org-wide conventions, mixing custom + vendor models, exact fallback order |

They compose. Point an alias at a couple of model names and you get a stable team-wide name on top of always-current deployment lists:

curl -X POST https://api.opper.ai/v2/models/aliases \
  -H "Authorization: Bearer YOUR_OPPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "default-llm",
    "fallback_models": ["claude-sonnet-4-5", "gpt-oss-120b"],
    "description": "Sonnet 4.5 across all its deployments, then open-weight as a floor"
  }'

Now your whole codebase calls "model": "default-llm". Sonnet's deployments are kept current by us; the fallback floor and the org convention are yours.
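The same alias request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above: the URL, headers, and body fields are copied from it, while the helper name `build_alias_request` is hypothetical. The response format is not shown in this post, so the sketch stops at constructing the request.

```python
import json
import urllib.request

ALIASES_URL = "https://api.opper.ai/v2/models/aliases"  # taken from the curl example


def build_alias_request(api_key: str, name: str, fallback_models: list[str],
                        description: str = "") -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an alias."""
    body = {
        "name": name,
        "fallback_models": fallback_models,  # ordered: first entry is tried first
        "description": description,
    }
    return urllib.request.Request(
        ALIASES_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_alias_request(
    "YOUR_OPPER_API_KEY",
    name="default-llm",
    fallback_models=["claude-sonnet-4-5", "gpt-oss-120b"],
    description="Sonnet 4.5 across all its deployments, then open-weight as a floor",
)
# To actually create the alias, send it: urllib.request.urlopen(req)
```

Once the alias exists, the earlier chat completion call needs only `model="default-llm"`.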

Aliases aren't limited to our pre-integrated models, either. You can add your own custom model endpoints to the same list, so one name can fall back from a self-hosted model to a vendor one, or the other way around. Everything still runs through the same gateway, so budgets, guardrails, region policy, and traces apply uniformly across your custom models and our pre-integrated ones.

Why we serve models this way

  • Painless migrations. Point releases and new regional deployments roll in behind the model name. No PR to bump an id.
  • Resilience by default. Multi-provider fallback is just how serving works. You don't wire it up per call.
  • EU residency without bookkeeping. Set the region policy once; routing respects it. claude-sonnet-4-5 stays in the EU because your project says so, not because you memorized the EU id.
  • Cheaper multi-turn work, through preserved prompt caching. Providers cache the part of a prompt they've already seen and bill it at a discount, but only if the same deployment handles each turn. Because a session sticks to one deployment, that cache stays warm and you keep the discount instead of paying full price every turn.
  • Cleaner code. Your application says what model it wants. The gateway handles where it runs.
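To get a feel for the prompt-caching point, here is a rough arithmetic sketch. Every number in it is hypothetical: the price is one abstract unit per input token, the cached discount is set to 90% purely for illustration, and real billing details such as cache-write surcharges and output tokens are ignored. It assumes each turn resends the full conversation history.

```python
from fractions import Fraction


def input_cost(turns: int, tokens_per_turn: int, price_per_token: Fraction,
               cached_discount: Fraction, sticky: bool) -> Fraction:
    """Total input cost of a conversation where every turn resends the history."""
    total = Fraction(0)
    for turn in range(1, turns + 1):
        history = (turn - 1) * tokens_per_turn  # tokens already sent in earlier turns
        fresh = tokens_per_turn                 # tokens that are new in this turn
        # The history is only discounted when the same deployment saw it before.
        history_rate = price_per_token * (1 - cached_discount) if sticky else price_per_token
        total += history * history_rate + fresh * price_per_token
    return total


# Hypothetical conversation: 10 turns of 1,000 new tokens each.
sticky_cost = input_cost(10, 1000, Fraction(1), Fraction(9, 10), sticky=True)
cold_cost = input_cost(10, 1000, Fraction(1), Fraction(9, 10), sticky=False)
print(sticky_cost, cold_cost)  # prints "14500 55000"
```

Under these made-up numbers the sticky session pays for 14,500 token-units against 55,000 when every turn lands on a cold deployment. The gap widens as the conversation gets longer, because the resent history grows with each turn.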

Get started

Open Browse Models in the Opper dashboard (under Settings, Models) to see every model in your project. Each row expands to show the deployments behind it, the regions they run in, and the price range. From there, swap a pinned provider/model id for its plain name on your next call and watch the trace to see which deployment served it.

New to Opper? Sign up, name a model, and let us route it. The full catalog also lives on opper.ai/models and in the docs. Questions? We're on Discord or at hello@opper.ai.