Prometheus is the supermarket of LLMs — OpenAI, Anthropic, Grok, and Qwen behind one OpenAI-compatible endpoint. An intelligent balancer fails over when a provider goes down and reroutes by latency when one degrades. Plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice — both served multi-region.
No SDK rewrite · Pay per token · Pick your US region · No seats, no minimums
```javascript
// drop-in: point the OpenAI SDK at Prometheus
// (run as an ES module so top-level await works)
import OpenAI from "openai"

const prometheus = new OpenAI({
  baseURL: "https://api.getprometheus.org/v1",
  apiKey: process.env.PROMETHEUS_API_KEY,
})

const res = await prometheus.chat.completions.create({
  model: "claude-opus-4.5", // or gpt-5.2, grok-4.20, prometheus-axiom…
  messages: [{ role: "user", content: "Ship it." }],
})

// standard OpenAI-style response shape
console.log(res.choices[0].message.content)
```

The supermarket of LLMs
The newest and most-used models from OpenAI, Anthropic, Grok, and Qwen, all behind a single OpenAI-compatible API and one key. No per-provider accounts, no juggling SDKs: just pick a model off the shelf and call it. A smart balancer keeps your calls completing even when a provider goes down.
Curated, not cluttered — the models worth shipping, kept current.
gpt-5.2 · OpenAI
claude-opus-4.5 · Anthropic
grok-4.20 · xAI Grok
qwen3-max · Qwen

The balancer
A supermarket is only useful if the shelves are never empty. Every request flows through a balancer that watches provider health in real time — failing over when one goes down, rerouting by latency when one degrades.
When a provider returns errors or goes down, the balancer reroutes the same request to a healthy fallback — your call still completes instead of throwing.
We probe provider health continuously. When a model degrades or slows, traffic shifts to the fastest healthy path so your p95 stays flat.
Routing happens behind the endpoint. Your code keeps calling the same model name — the balancer decides where it actually lands.
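The routing policy above (skip providers that are down, prefer the fastest healthy one, fall back on error) can be modeled in a few lines. This is a minimal client-side sketch for intuition only, not Prometheus's balancer, which runs behind the endpoint. The `{ name, healthy, p95Ms }` provider shape, `rankProviders`, `routeWithFailover`, and the `callProvider` function are all hypothetical.

```javascript
// Hypothetical model of the balancer policy: health filter, latency
// ordering, and failover to the next candidate when a call throws.

/**
 * Order providers for a request: unhealthy ones dropped, fastest first.
 * `providers` is an array of { name, healthy, p95Ms } (illustrative shape).
 */
function rankProviders(providers) {
  return providers
    .filter((p) => p.healthy)          // never try a provider marked down
    .sort((a, b) => a.p95Ms - b.p95Ms); // lowest recent p95 latency first
}

/**
 * Try each ranked provider in turn and return the first success.
 * `callProvider(name)` is a hypothetical async function that throws on error.
 */
async function routeWithFailover(providers, callProvider) {
  const errors = [];
  for (const p of rankProviders(providers)) {
    try {
      return { provider: p.name, result: await callProvider(p.name) };
    } catch (err) {
      errors.push(`${p.name}: ${err.message}`); // record and fall through
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

Filtering before sorting means a provider marked down is never tried, even if its last latency reading was the best; a healthy provider that errors mid-request is simply skipped in favor of the next fastest.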
Built by Prometheus
The supermarket resells the best of the market. These two we build ourselves — and we run both multi-region so you choose where the work is served.
Calculation models
A family of models Prometheus trains in-house on mathematical work — arithmetic at scale, applied math, symbolic proofs, and research-grade reasoning. The models with proper names.
Voice models
LLMs that generate voice, optimized end-to-end for latency: multi-region serving you pick by need, our own cache layer, and a fine-tune pass for raw speed.
Calculation models
A line Prometheus builds and trains in-house, tuned on mathematical work — from arithmetic at scale to symbolic proofs and research-grade reasoning. These are the models with proper names.
Served multi-region — pick where the math runs
prometheus-sigma High-throughput arithmetic, unit conversion, and tabular math at scale — the fastest model in the line for everyday calculation.
prometheus-calculus Calculus, linear algebra, and numerical methods for engineering, simulation, and quant & finance modeling.
prometheus-theorem Step-by-step derivations, symbolic manipulation, and formal proofs that hold up to verification.
prometheus-axiom The deepest tier — competition- and research-grade mathematical reasoning for the problems nothing else can crack.
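If you route different kinds of math work programmatically, keeping the tier choice in one lookup makes it easy to change later. A minimal sketch: the task categories and the `pickCalcModel` / `calcRequest` helpers are hypothetical; only the model names come from the list above, and the request body uses the standard OpenAI chat shape.

```javascript
// Hypothetical task categories mapped to the calculation tiers above.
const CALC_TIERS = new Map([
  ["arithmetic", "prometheus-sigma"], // bulk arithmetic, unit conversion, tabular math
  ["applied", "prometheus-calculus"], // calculus, linear algebra, numerical methods
  ["proof", "prometheus-theorem"],    // derivations, symbolic manipulation, proofs
  ["research", "prometheus-axiom"],   // competition- and research-grade reasoning
]);

function pickCalcModel(task) {
  const model = CALC_TIERS.get(task);
  if (model === undefined) throw new Error(`unknown task category: ${task}`);
  return model;
}

// Build an OpenAI-style chat completion body for the chosen tier.
function calcRequest(task, prompt) {
  return {
    model: pickCalcModel(task),
    messages: [{ role: "user", content: prompt }],
  };
}
```

For example, `calcRequest("arithmetic", "Convert 5 km to miles.")` yields a body addressed to `prometheus-sigma`, ready to pass to `chat.completions.create`.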
Voice models
LLMs that generate voice, optimized end-to-end for latency — multi-region serving you choose by need, our own cache layer, and a fine-tune pass for raw speed.
Pick the API region per request so audio is served closest to your users — lowest, steadiest latency wherever they call from.
A Prometheus cache layer short-circuits repeated work for an extra speed boost on the cached tier.
A dedicated fine-tune pass squeezes time-to-first-audio so live calls and voice agents feel instant.
gpt-oss-120b-uncached · Standard
Low-latency voice generation served from the region you choose, built for live phone calls and voice agents.
Input $0.15 / 1M · Output $0.60 / 1M

gpt-oss-120b-cached · Cached boost
Same model with an extra speed boost from our own cache: the fastest path to first audio when prompts repeat.
Input $0.20 / 1M · Output $0.80 / 1M
Both tiers run multi-region — the cached tier adds our cache for an extra speed boost.
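The two voice tiers differ by a constant factor: each cached rate is exactly 4/3 of its uncached counterpart, so the speed boost costs one third more for any mix of input and output tokens. Worked from the listed rates, with T_in and T_out in millions of tokens and assuming a request bills the same token counts on either tier (the 10M-in / 2M-out workload is illustrative):

```latex
\begin{aligned}
C_{\text{uncached}} &= 0.15\,T_{\text{in}} + 0.60\,T_{\text{out}} \\
C_{\text{cached}}   &= 0.20\,T_{\text{in}} + 0.80\,T_{\text{out}}
                     = \tfrac{4}{3}\left(0.15\,T_{\text{in}} + 0.60\,T_{\text{out}}\right)
                     = \tfrac{4}{3}\,C_{\text{uncached}} \\
T_{\text{in}} = 10,\ T_{\text{out}} = 2:\quad
C_{\text{uncached}} &= 1.50 + 1.20 = \$2.70, \qquad
C_{\text{cached}} = 2.00 + 1.60 = \$3.60
\end{aligned}
```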
Multi-region · United States
Our own model lines — calculation and voice — run across six regions spanning the United States, so requests land on the capacity closest to your users. Pick the region by need and your math and voice traffic is served from there for the lowest, steadiest latency.
us-east-1 Virginia
Our densest hub, next to the busiest internet exchanges on the East Coast.
us-east-2 Ohio
Low-latency Midwest coverage with plenty of dedicated GPU capacity.
us-central-1 Texas
Central routing that keeps both coasts within a tight latency budget.
us-south-1 Georgia
Southeast presence tuned for voice and live phone agents.
us-west-1 California
Bay Area capacity close to where most AI products are built.
us-west-2 Oregon
Pacific Northwest region for the lowest West Coast time-to-first-token.
Same API in every region — only the latency changes. Marketplace models route through the balancer; calculation and voice are served multi-region.
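This page doesn't say how the chosen region is passed to the API, so the sketch below stops at making the choice. It assumes you already collect your own round-trip measurements to each region; `pickRegion` and the measurement map are hypothetical, and only the region IDs come from the list above.

```javascript
// Region IDs as listed above.
const REGIONS = [
  "us-east-1", "us-east-2", "us-central-1",
  "us-south-1", "us-west-1", "us-west-2",
];

/**
 * Pick the region with the lowest measured round-trip time.
 * `rttMs` maps region ID -> milliseconds from your own probes;
 * regions without a finite measurement are skipped.
 */
function pickRegion(rttMs) {
  let best = null;
  for (const region of REGIONS) {
    const ms = rttMs[region];
    if (typeof ms !== "number" || !Number.isFinite(ms)) continue;
    if (best === null || ms < rttMs[best]) best = region;
  }
  if (best === null) throw new Error("no region measurements");
  return best;
}
```

For a caller on the West Coast, measurements like `{ "us-east-1": 42, "us-central-1": 30, "us-west-2": 18 }` would select `us-west-2`.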
Pricing
Transparent, usage-based pricing per model — billed by input and output tokens. No seats, no minimums. Prices in USD per 1M tokens.
Marketplace models

| Model | Provider · Focus | Context | Input | Output |
|---|---|---|---|---|
| gpt-5.2 | OpenAI · Flagship reasoning & chat | 400K | $1.925 / 1M | $15.40 / 1M |
| gpt-5.1 | OpenAI · Balanced flagship | 400K | $1.375 / 1M | $11.00 / 1M |
| gpt-5-mini | OpenAI · Fast everyday workhorse | 400K | $0.275 / 1M | $2.20 / 1M |
| gpt-5-nano | OpenAI · High-volume, lowest cost | 400K | $0.055 / 1M | $0.44 / 1M |
| claude-opus-4.5 | Anthropic · Frontier reasoning & code | 200K | $5.50 / 1M | $27.50 / 1M |
| claude-sonnet-4.5 | Anthropic · Balanced, 1M context | 1M | $3.30 / 1M | $16.50 / 1M |
| claude-haiku-4.5 | Anthropic · Fast & affordable | 200K | $1.10 / 1M | $5.50 / 1M |
| grok-4.20 | xAI Grok · Flagship, 2M context | 2M | $1.375 / 1M | $2.75 / 1M |
| grok-4.3 | xAI Grok · Latest balanced | 1M | $1.375 / 1M | $2.75 / 1M |
| grok-build-0.1 | xAI Grok · Agentic coding & build | 256K | $1.10 / 1M | $2.20 / 1M |
| qwen3-max | Qwen · Flagship, frontier open | 262K | $0.858 / 1M | $4.29 / 1M |
| qwen3-coder | Qwen · Agentic coding, 1M context | 1M | $0.242 / 1M | $1.98 / 1M |
| qwen3-235b-a22b | Qwen · Efficient MoE reasoning | 131K | $0.50 / 1M | $2.00 / 1M |
| qwen-plus | Qwen · Low-cost workhorse | 1M | $0.286 / 1M | $0.858 / 1M |
Calculation models

| Model | Tier | Input | Output |
|---|---|---|---|
| Prometheus Sigma (prometheus-sigma) | Fast numeric | $0.20 / 1M | $0.80 / 1M |
| Prometheus Calculus (prometheus-calculus) | Applied math | $0.50 / 1M | $2.00 / 1M |
| Prometheus Theorem (prometheus-theorem) | Symbolic proofs | $1.50 / 1M | $6.00 / 1M |
| Prometheus Axiom (prometheus-axiom) | Flagship | $3.00 / 1M | $12.00 / 1M |
Voice models

| Model | Tier | Input | Output |
|---|---|---|---|
| gpt-oss-120b-uncached | Standard | $0.15 / 1M | $0.60 / 1M |
| gpt-oss-120b-cached | Cached boost | $0.20 / 1M | $0.80 / 1M |
Prices in USD per 1M tokens, billed per actual usage. Marketplace traffic routes through the balancer; calculation and voice are served multi-region.
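Because every model bills input and output tokens at per-1M rates, estimating a request's cost is one line of arithmetic: tokens ÷ 1,000,000 × rate, summed over input and output. A minimal sketch, assuming straight proration with no rounding or minimum charge; `PRICES` copies a few rows from the tables above and `estimateCostUSD` is a hypothetical helper, not part of the Prometheus API.

```javascript
// USD per 1M tokens, copied from a few rows of the pricing tables.
const PRICES = new Map([
  ["gpt-5.2", { input: 1.925, output: 15.4 }],
  ["claude-opus-4.5", { input: 5.5, output: 27.5 }],
  ["prometheus-sigma", { input: 0.2, output: 0.8 }],
  ["gpt-oss-120b-uncached", { input: 0.15, output: 0.6 }],
]);

// cost = inputTokens / 1e6 * inputRate + outputTokens / 1e6 * outputRate
function estimateCostUSD(model, inputTokens, outputTokens) {
  const price = PRICES.get(model);
  if (price === undefined) throw new Error(`no price listed for ${model}`);
  return (inputTokens / 1e6) * price.input + (outputTokens / 1e6) * price.output;
}
```

For example, 200K input and 50K output tokens on gpt-5.2 come to 0.2 × $1.925 + 0.05 × $15.40 = $0.385 + $0.77 = $1.155.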
FAQ
What is Prometheus?
A supermarket of LLMs behind one OpenAI-compatible endpoint. You get the newest, most-used models from OpenAI, Anthropic, Grok, and Qwen through a single key, fronted by a smart balancer, plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice models.
How does the balancer handle outages and slowdowns?
Every request flows through a balancer that watches provider health in real time. If a provider returns errors or goes down, the balancer fails over to a healthy fallback so your call still completes. If a model degrades or slows, traffic reroutes by latency to the fastest healthy path. All of this happens behind the endpoint, with no client changes.
Which models can I call?
We curate the newest and most-used models from OpenAI (the GPT-5 family), Anthropic (Claude 4.5), xAI Grok (Grok 4), and Qwen (Qwen3). Curated, not cluttered: the models worth shipping, kept current, all under one key.
What are the calculation models?
A line Prometheus trains in-house, tuned on mathematical work: Sigma for fast numeric work at scale, Calculus for applied math and quant, Theorem for symbolic proofs and derivations, and Axiom for competition- and research-grade reasoning. These are the models with proper names, and they're served multi-region.
What are the voice models?
LLMs that generate voice, optimized end-to-end for latency. gpt-oss-120b runs in two tiers: a standard uncached tier ($0.15 in / $0.60 out per 1M) and a cached tier with an extra speed boost from our own cache ($0.20 in / $0.80 out per 1M). Both are fine-tuned for fast time-to-first-audio and served multi-region.
Where are your own models served?
Our own model lines, calculation and voice, run across six US regions (N. Virginia, Columbus, Dallas, Atlanta, San Jose, and Portland). You pick the region by need so the work is served closest to your users. The endpoint and API are identical in every region; only the latency changes.
Can I keep using the OpenAI SDK?
Yes. Point the OpenAI SDK (or any OpenAI-style client) at the Prometheus /v1 base URL, use your prom_sk_ API key, and call chat completions with the request shapes you already know. Marketplace, calculation, and voice models are all addressed by name through the same endpoint.
How does billing work?
Usage-based, per key. Each model bills input and output tokens at the rates listed in the pricing tables. No seats and no monthly minimums: you pay for what you call.
Is streaming supported?
Yes. Streaming chat returns standard OpenAI-style SSE chunks, and request fields such as tools, response_format, top_p, and stream_options are supported. Set stream: true for the lowest time-to-first-token.
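With the SDK and `stream: true`, the client parses the stream for you. To show what "standard OpenAI-style SSE chunks" means on the wire, here is a minimal sketch that accumulates the assistant text from a fully buffered response body: each event is a `data: {json}` line whose text sits in `choices[0].delta.content`, and the stream ends with `data: [DONE]`. `collectStreamText` is a hypothetical helper, and buffering the whole body as a string is a simplifying assumption (a live client would parse incrementally).

```javascript
/**
 * Accumulate assistant text from a buffered OpenAI-style SSE body.
 * Event lines look like `data: {json}`; the stream ends with `data: [DONE]`.
 */
function collectStreamText(sseBody) {
  let text = "";
  for (const rawLine of sseBody.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and other fields
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;         // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    // Text deltas live in choices[0].delta.content; some chunks carry none
    // (e.g. a role-only first chunk or a usage chunk with empty choices).
    const delta = chunk.choices?.[0]?.delta?.content;
    if (typeof delta === "string") text += delta;
  }
  return text;
}
```

The empty-`choices` case matters when `stream_options` asks for usage, since that final chunk reports token counts but carries no text.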
Create a key, point your OpenAI client at Prometheus, and ship. Every model on one shelf, a balancer that keeps you up, and our own math and voice models — no rewrite, no minimums.