Smart balancer · automatic failover & latency routing

Every model on one shelf, with a smarter router.

Prometheus is the supermarket of LLMs — OpenAI, Anthropic, Grok, and Qwen behind one OpenAI-compatible endpoint. An intelligent balancer fails over when a provider goes down and reroutes by latency when one degrades. Plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice — both served multi-region.

Get an API key · Read the docs

No SDK rewrite · Pay per token · Pick your US region · No seats, no minimums

example.ts

// drop-in: point the OpenAI SDK at Prometheus
import OpenAI from "openai"

const prometheus = new OpenAI({
  baseURL: "https://api.getprometheus.org/v1",
  apiKey: process.env.PROMETHEUS_API_KEY,
})

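// same request shape as the OpenAI API; only the model name changes
const res = await prometheus.chat.completions.create({
  model: "claude-opus-4.5", // or gpt-5.2, grok-4.20, prometheus-axiom…
  messages: [{ role: "user", content: "Ship it." }],
})

// print the assistant's reply
console.log(res.choices[0].message.content)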

The supermarket of LLMs

Every model on one shelf

The newest and most-used models from OpenAI, Anthropic, Grok, and Qwen, all behind a single OpenAI-compatible API and one key. No per-provider accounts, no juggling SDKs — just pick a model off the shelf and call it. A smart balancer keeps it up when a provider doesn't.

OpenAI
Anthropic
xAI Grok
Qwen

Curated, not cluttered — the models worth shipping, kept current.

On the shelf today · One key

gpt-5.2 · OpenAI

Flagship reasoning & chat

claude-opus-4.5 · Anthropic

Frontier reasoning & code

grok-4.20 · xAI Grok

Flagship, 2M context

qwen3-max · Qwen

Flagship, frontier open

The balancer

A router that keeps you up

A supermarket is only useful if the shelves are never empty. Every request flows through a balancer that watches provider health in real time — failing over when one goes down, rerouting by latency when one degrades.

Automatic failover

When a provider returns errors or goes down, the balancer reroutes the same request to a healthy fallback — your call still completes instead of throwing.

Latency-aware routing

We probe provider health continuously. When a model degrades or slows, traffic shifts to the fastest healthy path so your p95 stays flat.

One contract, no client changes

Routing happens behind the endpoint. Your code keeps calling the same model name — the balancer decides where it actually lands.
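
To make the routing idea concrete, here is a minimal sketch of latency-aware selection with failover. It is an illustration of the behavior described above, not the Prometheus implementation: the `UpstreamStatus` type, `rankUpstreams`, and `routeWithFailover` are hypothetical names, and a real balancer would also handle timeouts, retry budgets, and probe freshness.

```typescript
// Hypothetical shape for one upstream path; illustrative only.
type UpstreamStatus = {
  name: string;
  healthy: boolean;     // false when health probes report errors or downtime
  p95LatencyMs: number; // most recently probed p95 latency
};

// Keep only healthy upstreams and order them fastest first.
// filter() returns a new array, so the caller's list is not mutated by sort().
function rankUpstreams(upstreams: UpstreamStatus[]): UpstreamStatus[] {
  return upstreams
    .filter((u) => u.healthy)
    .sort((a, b) => a.p95LatencyMs - b.p95LatencyMs);
}

// Try each ranked upstream in turn; on error, fall through to the next one.
async function routeWithFailover<T>(
  upstreams: UpstreamStatus[],
  send: (upstream: UpstreamStatus) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no healthy upstream available");
  for (const upstream of rankUpstreams(upstreams)) {
    try {
      return await send(upstream);
    } catch (err) {
      lastError = err; // remember the failure and try the next upstream
    }
  }
  throw lastError;
}
```

From the client's side none of this is visible: the request goes to the same endpoint with the same model name, and the selection above happens behind it.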


balancer · live routing

Primary · OpenAI → serve

Anthropic → serve

Grok → reroute (latency)

Qwen → failover

Built by Prometheus

Two model lines that are ours

The supermarket resells the best of the market. These two we build ourselves, and we run both multi-region so you choose where the work is served.

Named line

Calculation models

Trained on math, by us

A family of models Prometheus trains in-house on mathematical work — arithmetic at scale, applied math, symbolic proofs, and research-grade reasoning. The models with proper names.

Tuned specifically for numerical and symbolic accuracy
From fast arithmetic to competition-grade proofs
Served multi-region for steady, low latency

Our own stack

Voice models

Voice, optimized for speed

LLMs that generate voice, optimized end-to-end for latency: multi-region serving you pick by need, our own cache layer, and a fine-tune pass for raw speed.

Pick the API region closest to your users
Prometheus cache layer for an extra speed boost
Fine-tuned for fast time-to-first-audio

Calculation models

Models trained on math

A line Prometheus builds and trains in-house, tuned on mathematical work — from arithmetic at scale to symbolic proofs and research-grade reasoning. These are the models with proper names.

Served multi-region — pick where the math runs

Fast numeric

Prometheus Sigma

prometheus-sigma

High-throughput arithmetic, unit conversion, and tabular math at scale — the fastest model in the line for everyday calculation.

Applied math

Prometheus Calculus

prometheus-calculus

Calculus, linear algebra, and numerical methods for engineering, simulation, and quant & finance modeling.

Symbolic proofs

Prometheus Theorem

prometheus-theorem

Step-by-step derivations, symbolic manipulation, and formal proofs that hold up to verification.

Flagship

Prometheus Axiom

prometheus-axiom

The deepest tier — competition- and research-grade mathematical reasoning for the problems nothing else can crack.

Voice models

Voice, optimized for speed

LLMs that generate voice, optimized end-to-end for latency — multi-region serving you choose by need, our own cache layer, and a fine-tune pass for raw speed.

Multi-region by need

Pick the API region per request so audio is served closest to your users — lowest, steadiest latency wherever they call from.

Our own cache

A Prometheus cache layer short-circuits repeated work for an extra speed boost on the cached tier.

Fine-tuned for speed

A dedicated fine-tune pass squeezes time-to-first-audio so live calls and voice agents feel instant.

gpt-oss-120b-uncached · Standard

Low-latency voice generation served from the region you choose — built for live phone calls and voice agents.

Input

$0.50 / 1M

Output

$1.50 / 1M

gpt-oss-120b-cached · Cached boost

Same model with an extra speed boost from our own cache — the fastest path to first audio when prompts repeat.

Input

$0.75 / 1M

Output

$2.00 / 1M

Both tiers run multi-region — the cached tier adds our cache for an extra speed boost.

Multi-region · United States

Six US regions, you pick at signup

Our own model lines — calculation and voice — run across six regions spanning the United States, so requests land on the capacity closest to your users. Pick the region by need and your math and voice traffic is served from there for the lowest, steadiest latency.

N. Virginia · us-east-1

Virginia

Our densest hub, next to the busiest internet exchanges on the East Coast.

Columbus · us-east-2

Ohio

Low-latency Midwest coverage with plenty of dedicated GPU capacity.

Dallas · us-central-1

Texas

Central routing that keeps both coasts within a tight latency budget.

Atlanta · us-south-1

Georgia

Southeast presence tuned for voice and live phone agents.

San Jose · us-west-1

California

Bay Area capacity close to where most AI products are built.

Portland · us-west-2

Oregon

Pacific Northwest region for the lowest West Coast time-to-first-token.

Choose your region

Same API in every region — only the latency changes. Marketplace models route through the balancer; calculation and voice are served multi-region.
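
If you want to choose a region from measurements rather than geography, a small helper like the sketch below may be enough. The region IDs are the six listed above; `pickRegion` and the `latencyMs` map are hypothetical, and the sketch only covers the selection step. How the chosen region is passed to the API is not described on this page, so that part is left out.

```typescript
// Region IDs as listed on this page.
const REGIONS = [
  "us-east-1",
  "us-east-2",
  "us-central-1",
  "us-south-1",
  "us-west-1",
  "us-west-2",
] as const;
type Region = (typeof REGIONS)[number];

// Return the region with the lowest measured latency, or undefined when
// no measurements were supplied. `latencyMs` holds your own round-trip
// measurements in milliseconds (a hypothetical input, not an API response).
function pickRegion(latencyMs: Partial<Record<Region, number>>): Region | undefined {
  let best: Region | undefined;
  let bestMs = Infinity;
  for (const region of REGIONS) {
    const ms = latencyMs[region];
    if (ms !== undefined && ms < bestMs) {
      best = region;
      bestMs = ms;
    }
  }
  return best;
}
```

For example, with measurements of 80 ms to `us-east-1`, 45 ms to `us-central-1`, and 22 ms to `us-west-2`, the helper returns `us-west-2`.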

Pricing

Pay for tokens, not complexity

Transparent, usage-based pricing per model — billed by input and output tokens. No seats, no minimums. Prices in USD per 1M tokens.

Marketplace

OpenAI · Anthropic · Grok · Qwen

Model	Context	Input	Output
`gpt-5.2` OpenAI · Flagship reasoning & chat	400K	$1.925 / 1M	$15.40 / 1M
`gpt-5.1` OpenAI · Balanced flagship	400K	$1.375 / 1M	$11.00 / 1M
`gpt-5-mini` OpenAI · Fast everyday workhorse	400K	$0.275 / 1M	$2.20 / 1M
`gpt-5-nano` OpenAI · High-volume, lowest cost	400K	$0.055 / 1M	$0.44 / 1M
`claude-opus-4.5` Anthropic · Frontier reasoning & code	200K	$5.50 / 1M	$27.50 / 1M
`claude-sonnet-4.5` Anthropic · Balanced, 1M context	1M	$3.30 / 1M	$16.50 / 1M
`claude-haiku-4.5` Anthropic · Fast & affordable	200K	$1.10 / 1M	$5.50 / 1M
`grok-4.20` xAI Grok · Flagship, 2M context	2M	$1.375 / 1M	$2.75 / 1M
`grok-4.3` xAI Grok · Latest balanced	1M	$1.375 / 1M	$2.75 / 1M
`grok-build-0.1` xAI Grok · Agentic coding & build	256K	$1.10 / 1M	$2.20 / 1M
`qwen3-max` Qwen · Flagship, frontier open	262K	$0.858 / 1M	$4.29 / 1M
`qwen3-coder` Qwen · Agentic coding, 1M context	1M	$0.242 / 1M	$1.98 / 1M
`qwen3-235b-a22b` Qwen · Efficient MoE reasoning	131K	$0.50 / 1M	$2.00 / 1M
`qwen-plus` Qwen · Low-cost workhorse	1M	$0.286 / 1M	$0.858 / 1M

Calculation models

Multi-region

Model	Tier	Input	Output
Prometheus Sigma `prometheus-sigma`	Fast numeric	$0.20 / 1M	$0.80 / 1M
Prometheus Calculus `prometheus-calculus`	Applied math	$0.50 / 1M	$2.00 / 1M
Prometheus Theorem `prometheus-theorem`	Symbolic proofs	$1.50 / 1M	$6.00 / 1M
Prometheus Axiom `prometheus-axiom`	Flagship	$3.00 / 1M	$12.00 / 1M

Voice models

Multi-region

Model	Tier	Input	Output
`gpt-oss-120b-uncached`	Standard	$0.50 / 1M	$1.50 / 1M
`gpt-oss-120b-cached`	Cached boost	$0.75 / 1M	$2.00 / 1M

Prices in USD per 1M tokens, billed per actual usage. Marketplace traffic routes through the balancer; calculation and voice are served multi-region.
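
As a worked example of the per-token arithmetic, the sketch below estimates the cost of a call from its token counts. The prices are copied from the tables above (a subset); `PRICES` and `estimateCostUsd` are hypothetical helper names, and the result is an estimate rather than an invoice figure.

```typescript
// USD per 1M tokens, copied from the pricing tables (subset only).
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-5-mini": { input: 0.275, output: 2.2 },
  "claude-haiku-4.5": { input: 1.1, output: 5.5 },
  "prometheus-sigma": { input: 0.2, output: 0.8 },
  "gpt-oss-120b-cached": { input: 0.75, output: 2.0 },
};

// cost = (input tokens / 1M) * input price + (output tokens / 1M) * output price
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`no price listed for ${model}`);
  return (
    (inputTokens / 1_000_000) * price.input +
    (outputTokens / 1_000_000) * price.output
  );
}

// 2M input + 500K output tokens on prometheus-sigma:
// 2 * $0.20 + 0.5 * $0.80 = $0.80
console.log(estimateCostUsd("prometheus-sigma", 2_000_000, 500_000).toFixed(2)); // prints "0.80"
```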


FAQ

Questions, answered

What is Prometheus?

A supermarket of LLMs behind one OpenAI-compatible endpoint. You get the newest, most-used models from OpenAI, Anthropic, Grok, and Qwen through a single key — fronted by a smart balancer — plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice models.

How does the balancer work?

Every request flows through a router that watches provider health in real time. If a provider returns errors or goes down, the balancer fails over to a healthy fallback so your call still completes. If a model degrades or slows, traffic reroutes by latency to the fastest healthy path — all behind the endpoint, with no client changes.

Which providers and models can I use?

We curate the newest and most-used models from OpenAI (the GPT-5 family), Anthropic (Claude 4.5), xAI Grok (Grok 4), and Qwen (Qwen3). Curated, not cluttered — the models worth shipping, kept current, all under one key.

What are the calculation models?

A line Prometheus trains in-house, tuned on mathematical work: Sigma for fast numeric work at scale, Calculus for applied math and quant, Theorem for symbolic proofs and derivations, and Axiom for competition- and research-grade reasoning. These are the models with proper names, and they're served multi-region.

What are the voice models?

LLMs that generate voice, optimized end-to-end for latency. gpt-oss-120b runs in two tiers: a standard uncached tier ($0.50 in / $1.50 out per 1M), and a cached tier with an extra speed boost from our own cache ($0.75 in / $2.00 out per 1M). Both are fine-tuned for fast time-to-first-audio and served multi-region.

How does multi-region work?

Our own model lines — calculation and voice — run across six US regions (N. Virginia, Columbus, Dallas, Atlanta, San Jose, and Portland). You pick the region by need so the work is served closest to your users. The endpoint and API are identical in every region — only the latency changes.

Is it really OpenAI-compatible?

Yes. Point the OpenAI SDK (or any OpenAI-style client) at the Prometheus /v1 base URL, use a prom_sk_ key, and call chat completions with the request shapes you already know. Marketplace, calculation, and voice models are all addressed by name through the same endpoint.

How does billing work?

Usage-based, per key. Models bill by input and output tokens at the rate listed on the pricing table. No seats and no monthly minimums — you pay for what you call.

Do you support streaming and tools?

Streaming chat returns standard OpenAI-style SSE chunks, and extra fields like tools, response_format, top_p, and stream_options are supported. Prefer stream: true for the lowest time-to-first-token.
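
For readers curious what those chunks look like on the wire, here is a minimal sketch that reassembles the reply text from a streamed body. It assumes the standard OpenAI streaming shape (`data: {json}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); `collectStreamText` is a hypothetical helper. In practice the OpenAI SDK does this parsing for you when you pass `stream: true`.

```typescript
// Concatenate the text deltas from an OpenAI-style SSE response body.
// Assumes each event is a single `data: ...` line, as in the standard format.
function collectStreamText(sseBody: string): string {
  let text = "";
  for (const line of sseBody.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and comments
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") break; // end-of-stream marker
    const chunk = JSON.parse(payload);
    // Chunks without choices (for example a usage-only chunk) add nothing.
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}
```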

Bring your product to the fire

Create a key, point your OpenAI client at Prometheus, and ship. Every model on one shelf, a balancer that keeps you up, and our own math and voice models — no rewrite, no minimums.

Get an API key · Read the docs