Smart balancer · automatic failover & latency routing

Every model on one shelf, with a smarter router.

Prometheus is the supermarket of LLMs — OpenAI, Anthropic, Grok, and Qwen behind one OpenAI-compatible endpoint. An intelligent balancer fails over when a provider goes down and reroutes by latency when one degrades. Plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice — both served multi-region.

No SDK rewrite · Pay per token · Pick your US region · No seats, no minimums

example.ts
// drop-in: point the OpenAI SDK at Prometheus
import OpenAI from "openai"

const prometheus = new OpenAI({
  baseURL: "https://api.getprometheus.org/v1",
  apiKey: process.env.PROMETHEUS_API_KEY,
})

const res = await prometheus.chat.completions.create({
  model: "claude-opus-4.5", // or gpt-5.2, grok-4.20, prometheus-axiom…
  messages: [{ role: "user", content: "Ship it." }],
})

The supermarket of LLMs

Every model on one shelf

The newest and most-used models from OpenAI, Anthropic, Grok, and Qwen, all behind a single OpenAI-compatible API and one key. No per-provider accounts and no juggling SDKs: just pick a model off the shelf and call it. A smart balancer keeps your requests flowing when a provider goes down.

  • OpenAI
  • Anthropic
  • xAI Grok
  • Qwen

Curated, not cluttered — the models worth shipping, kept current.

On the shelf today · One key

gpt-5.2 · OpenAI · Flagship reasoning & chat
claude-opus-4.5 · Anthropic · Frontier reasoning & code
grok-4.20 · xAI Grok · Flagship, 2M context
qwen3-max · Qwen · Flagship, frontier open

The balancer

A router that keeps you up

A supermarket is only useful if the shelves are never empty. Every request flows through a balancer that watches provider health in real time — failing over when one goes down, rerouting by latency when one degrades.

Automatic failover

When a provider returns errors or goes down, the balancer reroutes the same request to a healthy fallback — your call still completes instead of throwing.

Latency-aware routing

We probe provider health continuously. When a model degrades or slows, traffic shifts to the fastest healthy path so your p95 stays flat.

One contract, no client changes

Routing happens behind the endpoint. Your code keeps calling the same model name — the balancer decides where it actually lands.

balancer · live routing
Primary · OpenAI → serve
Anthropic → serve
Grok → reroute (latency)
Qwen → failover
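
To make the routing idea concrete, here is a minimal sketch of failover plus latency-aware ordering. It is an illustration only, not how the Prometheus balancer is implemented: the `ProviderState` shape, `rankProviders`, and `routeWithFailover` are hypothetical names, and the health and latency values are assumed to come from some external probe.

```typescript
// Hypothetical sketch: none of these names come from the Prometheus API.
type ProviderState = {
  name: string;
  healthy: boolean; // result of the most recent health probe (assumed input)
  p95LatencyMs: number; // recent latency measurement (assumed input)
};

// Keep healthy providers only and order them fastest first.
// filter() returns a new array, so the caller's list is not mutated by sort().
function rankProviders(providers: ProviderState[]): ProviderState[] {
  return providers
    .filter((p) => p.healthy)
    .sort((a, b) => a.p95LatencyMs - b.p95LatencyMs);
}

// Try each candidate in order; if one throws, fail over to the next.
async function routeWithFailover<T>(
  providers: ProviderState[],
  send: (provider: ProviderState) => Promise<T>,
): Promise<{ servedBy: string; result: T }> {
  let lastError: unknown = new Error("no healthy provider available");
  for (const provider of rankProviders(providers)) {
    try {
      return { servedBy: provider.name, result: await send(provider) };
    } catch (err) {
      lastError = err; // remember the failure and try the next provider
    }
  }
  throw lastError;
}
```

In this sketch a degraded provider simply sorts later in the list, while an unhealthy one is skipped entirely, which mirrors the "reroute (latency)" and "failover" cases shown above.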

Calculation models

Models trained on math

A line Prometheus builds and trains in-house, tuned on mathematical work — from arithmetic at scale to symbolic proofs and research-grade reasoning. These are the models with proper names.

Served multi-region — pick where the math runs

Fast numeric

Prometheus Sigma

prometheus-sigma

High-throughput arithmetic, unit conversion, and tabular math at scale — the fastest model in the line for everyday calculation.

Applied math

Prometheus Calculus

prometheus-calculus

Calculus, linear algebra, and numerical methods for engineering, simulation, and quant & finance modeling.

Symbolic proofs

Prometheus Theorem

prometheus-theorem

Step-by-step derivations, symbolic manipulation, and formal proofs that hold up to verification.

Flagship

Prometheus Axiom

prometheus-axiom

The deepest tier — competition- and research-grade mathematical reasoning for the problems nothing else can crack.

Voice models

Voice, optimized for speed

LLMs that generate voice, optimized end-to-end for latency: multi-region serving from the region you choose, our own cache layer, and a fine-tune pass for raw speed.

Multi-region by need

Pick the API region per request so audio is served closest to your users — lowest, steadiest latency wherever they call from.

Our own cache

A Prometheus cache layer short-circuits repeated work for an extra speed boost on the cached tier.

Fine-tuned for speed

A dedicated fine-tune pass squeezes time-to-first-audio so live calls and voice agents feel instant.

gpt-oss-120b-uncached · Standard

Low-latency voice generation served from the region you choose, built for live phone calls and voice agents.

Input: $0.15 / 1M · Output: $0.60 / 1M

gpt-oss-120b-cached · Cached boost

Same model with an extra speed boost from our own cache: the fastest path to first audio when prompts repeat.

Input: $0.20 / 1M · Output: $0.80 / 1M

Both tiers run multi-region — the cached tier adds our cache for an extra speed boost.

Multi-region · United States

Six US regions, you pick at signup

Our own model lines — calculation and voice — run across six regions spanning the United States, so requests land on the capacity closest to your users. Pick the region by need and your math and voice traffic is served from there for the lowest, steadiest latency.

N. Virginia · us-east-1 · Virginia
Our densest hub, next to the busiest internet exchanges on the East Coast.

Columbus · us-east-2 · Ohio
Low-latency Midwest coverage with plenty of dedicated GPU capacity.

Dallas · us-central-1 · Texas
Central routing that keeps both coasts within a tight latency budget.

Atlanta · us-south-1 · Georgia
Southeast presence tuned for voice and live phone agents.

San Jose · us-west-1 · California
Bay Area capacity close to where most AI products are built.

Portland · us-west-2 · Oregon
Pacific Northwest region for the lowest West Coast time-to-first-token.

Choose your region

Same API in every region — only the latency changes. Marketplace models route through the balancer; calculation and voice are served multi-region.
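
Since only latency differs between regions, choosing one can come down to measuring from where your users are. The sketch below picks the region with the lowest measured round-trip time. The region IDs are the six listed above; the `pickRegion` helper and the idea of supplying your own measurements are assumptions for illustration. How the chosen region is then passed to the API is not described on this page, so the sketch stops at selecting an ID.

```typescript
// Region IDs as listed on this page.
const REGIONS = [
  "us-east-1",
  "us-east-2",
  "us-central-1",
  "us-south-1",
  "us-west-1",
  "us-west-2",
] as const;
type Region = (typeof REGIONS)[number];

// Hypothetical helper: return the region with the lowest measured
// round-trip time. Regions without a measurement are skipped.
function pickRegion(rttMs: Partial<Record<Region, number>>): Region | undefined {
  let best: Region | undefined;
  let bestRtt = Infinity;
  for (const region of REGIONS) {
    const rtt = rttMs[region];
    if (rtt !== undefined && rtt < bestRtt) {
      best = region;
      bestRtt = rtt;
    }
  }
  return best;
}
```

For example, given measurements of 80 ms to us-east-1, 45 ms to us-central-1, and 22 ms to us-west-2, `pickRegion` returns `"us-west-2"`.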

Pricing

Pay for tokens, not complexity

Transparent, usage-based pricing per model — billed by input and output tokens. No seats, no minimums. Prices in USD per 1M tokens.

Marketplace

OpenAI · Anthropic · Grok · Qwen
Model              Provider · Description                 Context  Input        Output
gpt-5.2            OpenAI · Flagship reasoning & chat     400K     $1.925 / 1M  $15.40 / 1M
gpt-5.1            OpenAI · Balanced flagship             400K     $1.375 / 1M  $11.00 / 1M
gpt-5-mini         OpenAI · Fast everyday workhorse       400K     $0.275 / 1M  $2.20 / 1M
gpt-5-nano         OpenAI · High-volume, lowest cost      400K     $0.055 / 1M  $0.44 / 1M
claude-opus-4.5    Anthropic · Frontier reasoning & code  200K     $5.50 / 1M   $27.50 / 1M
claude-sonnet-4.5  Anthropic · Balanced, 1M context       1M       $3.30 / 1M   $16.50 / 1M
claude-haiku-4.5   Anthropic · Fast & affordable          200K     $1.10 / 1M   $5.50 / 1M
grok-4.20          xAI Grok · Flagship, 2M context        2M       $1.375 / 1M  $2.75 / 1M
grok-4.3           xAI Grok · Latest balanced             1M       $1.375 / 1M  $2.75 / 1M
grok-build-0.1     xAI Grok · Agentic coding & build      256K     $1.10 / 1M   $2.20 / 1M
qwen3-max          Qwen · Flagship, frontier open         262K     $0.858 / 1M  $4.29 / 1M
qwen3-coder        Qwen · Agentic coding, 1M context      1M       $0.242 / 1M  $1.98 / 1M
qwen3-235b-a22b    Qwen · Efficient MoE reasoning         131K     $0.50 / 1M   $2.00 / 1M
qwen-plus          Qwen · Low-cost workhorse              1M       $0.286 / 1M  $0.858 / 1M

Calculation models

Multi-region
Model                                      Tier             Input       Output
Prometheus Sigma (prometheus-sigma)        Fast numeric     $0.20 / 1M  $0.80 / 1M
Prometheus Calculus (prometheus-calculus)  Applied math     $0.50 / 1M  $2.00 / 1M
Prometheus Theorem (prometheus-theorem)    Symbolic proofs  $1.50 / 1M  $6.00 / 1M
Prometheus Axiom (prometheus-axiom)        Flagship         $3.00 / 1M  $12.00 / 1M

Voice models

Multi-region
Model                  Tier          Input       Output
gpt-oss-120b-uncached  Standard      $0.15 / 1M  $0.60 / 1M
gpt-oss-120b-cached    Cached boost  $0.20 / 1M  $0.80 / 1M

Prices in USD per 1M tokens, billed per actual usage. Marketplace traffic routes through the balancer; calculation and voice are served multi-region.
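
Because every model bills by input and output tokens at a flat per-1M rate, the cost of a call is simple arithmetic. The sketch below shows that calculation for a handful of models, with rates copied from the tables above. The `PRICES` map and `estimateCostUsd` function are hypothetical helpers, not part of any Prometheus SDK.

```typescript
// USD per 1M tokens, copied from the pricing tables on this page (a subset).
type Price = { input: number; output: number };

const PRICES: Record<string, Price> = {
  "gpt-5.2": { input: 1.925, output: 15.4 },
  "claude-opus-4.5": { input: 5.5, output: 27.5 },
  "prometheus-sigma": { input: 0.2, output: 0.8 },
  "gpt-oss-120b-uncached": { input: 0.15, output: 0.6 },
  "gpt-oss-120b-cached": { input: 0.2, output: 0.8 },
};

// Hypothetical helper: cost = (input tokens x input rate + output tokens x output rate) / 1M.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (price === undefined) {
    throw new Error(`no price listed for ${model}`);
  }
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}
```

As a worked example, 2M input tokens and 1M output tokens on claude-opus-4.5 come to 2 × $5.50 + 1 × $27.50 = $38.50.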

Get an API key

FAQ

Questions, answered

What is Prometheus?

A supermarket of LLMs behind one OpenAI-compatible endpoint. You get the newest, most-used models from OpenAI, Anthropic, Grok, and Qwen through a single key — fronted by a smart balancer — plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice models.

How does the balancer work?

Every request flows through a router that watches provider health in real time. If a provider returns errors or goes down, the balancer fails over to a healthy fallback so your call still completes. If a model degrades or slows, traffic reroutes by latency to the fastest healthy path — all behind the endpoint, with no client changes.

Which providers and models can I use?

We curate the newest and most-used models from OpenAI (the GPT-5 family), Anthropic (Claude 4.5), xAI Grok (Grok 4), and Qwen (Qwen3). Curated, not cluttered — the models worth shipping, kept current, all under one key.

What are the calculation models?

A line Prometheus trains in-house, tuned on mathematical work: Sigma for fast numeric work at scale, Calculus for applied math and quant, Theorem for symbolic proofs and derivations, and Axiom for competition- and research-grade reasoning. These are the models with proper names, and they're served multi-region.

What are the voice models?

LLMs that generate voice, optimized end-to-end for latency. gpt-oss-120b runs in two tiers: a standard uncached tier ($0.15 in / $0.60 out per 1M), and a cached tier with an extra speed boost from our own cache ($0.20 in / $0.80 out per 1M). Both are fine-tuned for fast time-to-first-audio and served multi-region.

How does multi-region work?

Our own model lines — calculation and voice — run across six US regions (N. Virginia, Columbus, Dallas, Atlanta, San Jose, and Portland). You pick the region by need so the work is served closest to your users. The endpoint and API are identical in every region — only the latency changes.

Is it really OpenAI-compatible?

Yes. Point the OpenAI SDK (or any OpenAI-style client) at the Prometheus /v1 base URL, use a prom_sk_ key, and call chat completions with the request shapes you already know. Marketplace, calculation, and voice models are all addressed by name through the same endpoint.
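
For clients that do not use the OpenAI SDK, the same call can be sketched as a plain HTTP request. The base URL comes from the example at the top of this page. The `/chat/completions` path and the Bearer authorization header follow the OpenAI convention and are assumed here rather than confirmed by this page; `buildChatRequest` is a hypothetical helper.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Base URL taken from the example on this page.
const BASE_URL = "https://api.getprometheus.org/v1";

// Hypothetical helper: build (but do not send) an OpenAI-style chat request.
// The path and Bearer auth header are assumed from the OpenAI convention.
function buildChatRequest(apiKey: string, model: string, messages: ChatMessage[]) {
  return {
    url: `${BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

The result can be handed to any HTTP client, for example `fetch(req.url, req.init)`. Switching between a marketplace, calculation, or voice model is then just a change to the `model` string.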

How does billing work?

Usage-based, per key. Models bill by input and output tokens at the rate listed on the pricing table. No seats and no monthly minimums — you pay for what you call.

Do you support streaming and tools?

Yes. Streaming chat returns standard OpenAI-style SSE chunks, and request fields such as tools, response_format, top_p, and stream_options are supported. Set stream: true for the lowest time-to-first-token.
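
If you read the stream without the SDK, the response body is a series of `data:` lines ending with a `[DONE]` sentinel, which is the usual OpenAI-style framing. The sketch below pulls the text deltas out of such a payload. `collectStreamText` is a hypothetical helper, and for simplicity it assumes the whole payload has already been buffered into one string; a production reader would parse incrementally as bytes arrive.

```typescript
// Hypothetical helper: extract the concatenated text from an
// OpenAI-style SSE payload that has been buffered into a single string.
function collectStreamText(sse: string): string {
  let text = "";
  for (const line of sse.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const data = line.slice("data:".length).trim();
    if (data === "[DONE]") break; // end-of-stream sentinel
    const chunk = JSON.parse(data) as {
      choices?: { delta?: { content?: string | null } }[];
    };
    // Chunks without choices (for example a usage-only chunk) add nothing.
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}
```

With the OpenAI SDK none of this is needed: passing `stream: true` to `chat.completions.create` returns an async iterable of already-parsed chunks.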

Bring your product to the fire

Create a key, point your OpenAI client at Prometheus, and ship. Every model on one shelf, a balancer that keeps you up, and our own math and voice models — no rewrite, no minimums.