Smart balancer · automatic failover & latency routing

Every model on one shelf, with a smarter router.

Prometheus is the supermarket of LLMs — OpenAI, Anthropic, Grok, and Qwen behind one OpenAI-compatible endpoint. An intelligent balancer fails over when a provider goes down and reroutes by latency when one degrades. Plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice — both served multi-region.

No SDK rewrite · Pay per token · Pick your US region · No seats, no minimums

example.ts
// drop-in: point the OpenAI SDK at Prometheus
import OpenAI from "openai"

const prometheus = new OpenAI({
  baseURL: "https://api.getprometheus.org/v1",
  apiKey: process.env.PROMETHEUS_API_KEY,
})

const res = await prometheus.chat.completions.create({
  model: "claude-opus-4.5", // or gpt-5.2, grok-4.20, prometheus-axiom…
  messages: [{ role: "user", content: "Ship it." }],
})

The supermarket of LLMs

Every model on one shelf

The newest and most-used models from OpenAI, Anthropic, Grok, and Qwen, all behind a single OpenAI-compatible API and one key. No per-provider accounts, no juggling SDKs: just pick a model off the shelf and call it. A smart balancer keeps your calls completing when a provider goes down.

  • OpenAI
  • Anthropic
  • xAI Grok
  • Qwen

Curated, not cluttered — the models worth shipping, kept current.

On the shelf today · One key
gpt-5.2 · OpenAI
Flagship reasoning & chat
claude-opus-4.5 · Anthropic
Frontier reasoning & code
grok-4.20 · xAI Grok
Flagship, 2M context
qwen3-max · Qwen
Flagship, frontier open

The balancer

A router that keeps you up

A supermarket is only useful if the shelves are never empty. Every request flows through a balancer that watches provider health in real time — failing over when one goes down, rerouting by latency when one degrades.

Automatic failover

When a provider returns errors or goes down, the balancer reroutes the same request to a healthy fallback — your call still completes instead of throwing.
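The page does not describe how the balancer is built, so the following is only a minimal sketch of the failover idea itself: try providers in order and return the first successful response. The names `Provider`, `Served`, and `withFailover` are hypothetical and are not part of the Prometheus API; the real routing happens server-side, behind the endpoint.

```typescript
// Illustrative only: a hypothetical failover loop, not Prometheus's internals.
type Provider<T> = {
  name: string;
  call: () => Promise<T>; // performs the request against this provider
};

type Served<T> = { value: T; servedBy: string };

// Try each provider in order and return the first success.
// If every provider fails, reject with a summary of the errors.
async function withFailover<T>(providers: Provider<T>[]): Promise<Served<T>> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { value: await p.call(), servedBy: p.name };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${p.name}: ${reason}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

In this sketch the caller only sees an error when every path has failed, which mirrors the "your call still completes" behavior described above.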

Latency-aware routing

We probe provider health continuously. When a model degrades or slows, traffic shifts to the fastest healthy path so your p95 stays flat.
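As a rough illustration of latency-aware selection (an assumption about the general technique, not a description of the actual balancer), one simple rule is to drop unhealthy paths and pick the one with the lowest recent p95 latency. `PathHealth` and `pickFastestHealthy` are hypothetical names, and the numbers in any usage are made up.

```typescript
// Illustrative only: hypothetical health records and selection rule.
type PathHealth = {
  name: string;
  healthy: boolean;
  p95Ms: number; // recent p95 latency observed by probes, in milliseconds
};

// Pick the healthy path with the lowest recent p95.
// Returns undefined when no path is healthy.
function pickFastestHealthy(paths: PathHealth[]): PathHealth | undefined {
  return paths
    .filter((p) => p.healthy)
    .reduce<PathHealth | undefined>(
      (best, p) => (best === undefined || p.p95Ms < best.p95Ms ? p : best),
      undefined,
    );
}
```

A production router would likely also smooth the measurements and avoid flapping between paths; this sketch leaves that out.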

One contract, no client changes

Routing happens behind the endpoint. Your code keeps calling the same model name — the balancer decides where it actually lands.

balancer · live routing
Primary · OpenAI → serve
Anthropic → serve
Grok → reroute (latency)
Qwen → failover

Calculation models

Models trained on math

A line Prometheus builds and trains in-house, tuned on mathematical work — from arithmetic at scale to symbolic proofs and research-grade reasoning. These are the models with proper names.

Served multi-region — pick where the math runs

Fast numeric

Prometheus Sigma

prometheus-sigma

High-throughput arithmetic, unit conversion, and tabular math at scale — the fastest model in the line for everyday calculation.

Applied math

Prometheus Calculus

prometheus-calculus

Calculus, linear algebra, and numerical methods for engineering, simulation, and quant & finance modeling.

Symbolic proofs

Prometheus Theorem

prometheus-theorem

Step-by-step derivations, symbolic manipulation, and formal proofs that hold up to verification.

Flagship

Prometheus Axiom

prometheus-axiom

The deepest tier — competition- and research-grade mathematical reasoning for the problems nothing else can crack.

Voice models

Voice, optimized for speed

LLMs that generate voice, optimized end-to-end for latency: multi-region serving from the region you choose, our own cache layer, and a fine-tune pass for raw speed.

Multi-region by need

Pick the API region per request so audio is served closest to your users — lowest, steadiest latency wherever they call from.

Our own cache

A Prometheus cache layer short-circuits repeated work for an extra speed boost on the cached tier.

Fine-tuned for speed

A dedicated fine-tune pass squeezes time-to-first-audio so live calls and voice agents feel instant.

gpt-oss-120b-uncached · Standard

Low-latency voice generation served from the region you choose — built for live phone calls and voice agents.

Input

$0.50 / 1M

Output

$1.50 / 1M

gpt-oss-120b-cached · Cached boost

Same model with an extra speed boost from our own cache — the fastest path to first audio when prompts repeat.

Input

$0.75 / 1M

Output

$2.00 / 1M

Both tiers run multi-region — the cached tier adds our cache for an extra speed boost.

Multi-region · United States

Six US regions, you pick at signup

Our own model lines — calculation and voice — run across six regions spanning the United States, so requests land on the capacity closest to your users. Pick the region by need and your math and voice traffic is served from there for the lowest, steadiest latency.

N. Virginia · us-east-1

Virginia

Our densest hub, next to the busiest internet exchanges on the East Coast.

Columbus · us-east-2

Ohio

Low-latency Midwest coverage with plenty of dedicated GPU capacity.

Dallas · us-central-1

Texas

Central routing that keeps both coasts within a tight latency budget.

Atlanta · us-south-1

Georgia

Southeast presence tuned for voice and live phone agents.

San Jose · us-west-1

California

Bay Area capacity close to where most AI products are built.

Portland · us-west-2

Oregon

Pacific Northwest region for the lowest West Coast time-to-first-token.

Choose your region

Same API in every region — only the latency changes. Marketplace models route through the balancer; calculation and voice are served multi-region.

Pricing

Pay for tokens, not complexity

Transparent, usage-based pricing per model — billed by input and output tokens. No seats, no minimums. Prices in USD per 1M tokens.

Marketplace

OpenAI · Anthropic · Grok · Qwen
Model | Provider · Description | Context | Input | Output
gpt-5.2 | OpenAI · Flagship reasoning & chat | 400K | $1.925 / 1M | $15.40 / 1M
gpt-5.1 | OpenAI · Balanced flagship | 400K | $1.375 / 1M | $11.00 / 1M
gpt-5-mini | OpenAI · Fast everyday workhorse | 400K | $0.275 / 1M | $2.20 / 1M
gpt-5-nano | OpenAI · High-volume, lowest cost | 400K | $0.055 / 1M | $0.44 / 1M
claude-opus-4.5 | Anthropic · Frontier reasoning & code | 200K | $5.50 / 1M | $27.50 / 1M
claude-sonnet-4.5 | Anthropic · Balanced, 1M context | 1M | $3.30 / 1M | $16.50 / 1M
claude-haiku-4.5 | Anthropic · Fast & affordable | 200K | $1.10 / 1M | $5.50 / 1M
grok-4.20 | xAI Grok · Flagship, 2M context | 2M | $1.375 / 1M | $2.75 / 1M
grok-4.3 | xAI Grok · Latest balanced | 1M | $1.375 / 1M | $2.75 / 1M
grok-build-0.1 | xAI Grok · Agentic coding & build | 256K | $1.10 / 1M | $2.20 / 1M
qwen3-max | Qwen · Flagship, frontier open | 262K | $0.858 / 1M | $4.29 / 1M
qwen3-coder | Qwen · Agentic coding, 1M context | 1M | $0.242 / 1M | $1.98 / 1M
qwen3-235b-a22b | Qwen · Efficient MoE reasoning | 131K | $0.50 / 1M | $2.00 / 1M
qwen-plus | Qwen · Low-cost workhorse | 1M | $0.286 / 1M | $0.858 / 1M

Calculation models

Multi-region
Model | ID | Tier | Input | Output
Prometheus Sigma | prometheus-sigma | Fast numeric | $0.20 / 1M | $0.80 / 1M
Prometheus Calculus | prometheus-calculus | Applied math | $0.50 / 1M | $2.00 / 1M
Prometheus Theorem | prometheus-theorem | Symbolic proofs | $1.50 / 1M | $6.00 / 1M
Prometheus Axiom | prometheus-axiom | Flagship | $3.00 / 1M | $12.00 / 1M

Voice models

Multi-region
Model | Tier | Input | Output
gpt-oss-120b-uncached | Standard | $0.50 / 1M | $1.50 / 1M
gpt-oss-120b-cached | Cached boost | $0.75 / 1M | $2.00 / 1M

Prices in USD per 1M tokens, billed per actual usage. Marketplace traffic routes through the balancer; calculation and voice are served multi-region.
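To make the per-token math concrete, here is a small worked example. It assumes cost is simply tokens times the listed per-1M rate, with no rounding, minimum charge, or other adjustments; the page does not spell out those details. `Price` and `costUSD` are hypothetical helper names, and the token counts are made up.

```typescript
// Prices are USD per 1M tokens, as listed in the pricing table.
type Price = { inputPerM: number; outputPerM: number };

// Assumption: cost = tokens x rate / 1M, with no rounding or minimums.
function costUSD(price: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens * price.inputPerM + outputTokens * price.outputPerM) / 1_000_000;
}

// gpt-5-mini: $0.275 input / $2.20 output per 1M tokens.
const gpt5Mini: Price = { inputPerM: 0.275, outputPerM: 2.2 };

// 2,000,000 input tokens and 500,000 output tokens:
// 2 x $0.275 + 0.5 x $2.20 = $0.55 + $1.10 = $1.65
const total = costUSD(gpt5Mini, 2_000_000, 500_000);
console.log(total.toFixed(2));
```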

Get an API key

FAQ

Questions, answered

What is Prometheus?

A supermarket of LLMs behind one OpenAI-compatible endpoint. You get the newest, most-used models from OpenAI, Anthropic, Grok, and Qwen through a single key — fronted by a smart balancer — plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice models.

How does the balancer work?

Every request flows through a router that watches provider health in real time. If a provider returns errors or goes down, the balancer fails over to a healthy fallback so your call still completes. If a model degrades or slows, traffic reroutes by latency to the fastest healthy path — all behind the endpoint, with no client changes.

Which providers and models can I use?

We curate the newest and most-used models from OpenAI (the GPT-5 family), Anthropic (Claude 4.5), xAI Grok (Grok 4), and Qwen (Qwen3). Curated, not cluttered — the models worth shipping, kept current, all under one key.

What are the calculation models?

A line Prometheus trains in-house, tuned on mathematical work: Sigma for fast numeric work at scale, Calculus for applied math and quant, Theorem for symbolic proofs and derivations, and Axiom for competition- and research-grade reasoning. These are the models with proper names, and they're served multi-region.

What are the voice models?

LLMs that generate voice, optimized end-to-end for latency. gpt-oss-120b runs in two tiers: a standard uncached tier ($0.50 in / $1.50 out per 1M), and a cached tier with an extra speed boost from our own cache ($0.75 in / $2.00 out per 1M). Both are fine-tuned for fast time-to-first-audio and served multi-region.

How does multi-region work?

Our own model lines — calculation and voice — run across six US regions (N. Virginia, Columbus, Dallas, Atlanta, San Jose, and Portland). You pick the region by need so the work is served closest to your users. The endpoint and API are identical in every region — only the latency changes.

Is it really OpenAI-compatible?

Yes. Point the OpenAI SDK (or any OpenAI-style client) at the Prometheus /v1 base URL, use a prom_sk_ key, and call chat completions with the request shapes you already know. Marketplace, calculation, and voice models are all addressed by name through the same endpoint.

How does billing work?

Usage-based, per key. Models bill by input and output tokens at the rate listed on the pricing table. No seats and no monthly minimums — you pay for what you call.

Do you support streaming and tools?

Streaming chat returns standard OpenAI-style SSE chunks, and extra fields like tools, response_format, top_p, and stream_options are supported. Prefer stream: true for the lowest time-to-first-token.
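A minimal sketch of consuming a streamed response, assuming the chunks follow the standard OpenAI streaming shape as stated above. `ChunkLike` and `collectText` are hypothetical helper names; the commented usage assumes the `prometheus` client from example.ts and needs a real key to run.

```typescript
// Minimal shape of an OpenAI-style streaming chunk (only the fields used here).
type ChunkLike = {
  choices: { delta: { content?: string | null } }[];
};

// Concatenate the text deltas from a stream of chunks.
// Chunks with no choices or no content contribute nothing.
async function collectText(stream: AsyncIterable<ChunkLike>): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  return text;
}

// Usage with the client from example.ts (left commented; it needs a real key):
// const stream = await prometheus.chat.completions.create({
//   model: "gpt-5-mini",
//   messages: [{ role: "user", content: "Ship it." }],
//   stream: true,
// });
// console.log(await collectText(stream));
```

For a live UI you would typically render each delta as it arrives instead of collecting the whole string first.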

Bring your product to the fire

Create a key, point your OpenAI client at Prometheus, and ship. Every model on one shelf, a balancer that keeps you up, and our own math and voice models — no rewrite, no minimums.