Prometheus is the supermarket of LLMs — OpenAI, Anthropic, Grok, and Qwen behind one OpenAI-compatible endpoint. An intelligent balancer fails over when a provider goes down and reroutes by latency when one degrades. Plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice — both served multi-region.
No SDK rewrite · Pay per token · Pick your US region · No seats, no minimums
// drop-in: point the OpenAI SDK at Prometheus
import OpenAI from "openai"

const prometheus = new OpenAI({
  baseURL: "https://api.getprometheus.org/v1",
  apiKey: process.env.PROMETHEUS_API_KEY,
})

const res = await prometheus.chat.completions.create({
  model: "claude-opus-4.5", // or gpt-5.2, grok-4.20, prometheus-axiom…
  messages: [{ role: "user", content: "Ship it." }],
})

The supermarket of LLMs
The newest and most-used models from OpenAI, Anthropic, Grok, and Qwen, all behind a single OpenAI-compatible API and one key. No per-provider accounts and no juggling SDKs: pick a model off the shelf and call it. A smart balancer keeps your calls completing when a provider goes down.
Curated, not cluttered — the models worth shipping, kept current.
gpt-5.2 · OpenAI
claude-opus-4.5 · Anthropic
grok-4.20 · xAI Grok
qwen3-max · Qwen

The balancer
A supermarket is only useful if the shelves are never empty. Every request flows through a balancer that watches provider health in real time — failing over when one goes down, rerouting by latency when one degrades.
When a provider returns errors or goes down, the balancer reroutes the same request to a healthy fallback — your call still completes instead of throwing.
We probe provider health continuously. When a model degrades or slows, traffic shifts to the fastest healthy path so your p95 stays flat.
Routing happens behind the endpoint. Your code keeps calling the same model name — the balancer decides where it actually lands.
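To make the failover and latency-routing behavior concrete, here is a minimal sketch of the general technique: filter out unhealthy upstreams, order the rest by recent latency, and fall through to the next one on error. Every name in it (`Upstream`, `routeWithFailover`, `p95LatencyMs`) is hypothetical and chosen for illustration; it is not a description of how the Prometheus balancer is actually built.

```typescript
// Illustrative only: these types and names are assumptions, not Prometheus internals.
interface Upstream<Req, Res> {
  name: string;
  healthy: boolean; // result of the most recent health probe
  p95LatencyMs: number; // recent latency measurement for this upstream
  send: (req: Req) => Promise<Res>;
}

// Try healthy upstreams from fastest to slowest; move on when one throws.
async function routeWithFailover<Req, Res>(
  upstreams: Upstream<Req, Res>[],
  req: Req,
): Promise<{ servedBy: string; response: Res }> {
  const candidates = upstreams
    .filter((u) => u.healthy) // skip upstreams that failed their probe
    .sort((a, b) => a.p95LatencyMs - b.p95LatencyMs); // fastest first

  let lastError: unknown = new Error("no healthy upstream available");
  for (const upstream of candidates) {
    try {
      const response = await upstream.send(req);
      return { servedBy: upstream.name, response };
    } catch (err) {
      lastError = err; // remember the failure and try the next upstream
    }
  }
  throw lastError;
}
```

In this sketch the caller never names an upstream: it hands over the request and learns afterwards which one served it, which mirrors the idea that routing happens behind the endpoint.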
Built by Prometheus
The supermarket resells the best of the market. These two we build ourselves, and we run both multi-region so you choose where the work is served.
Calculation models
A family of models Prometheus trains in-house on mathematical work — arithmetic at scale, applied math, symbolic proofs, and research-grade reasoning. The models with proper names.
Voice models
LLMs that generate voice, optimized end-to-end for latency: multi-region serving you pick by need, our own cache layer, and a fine-tune pass for raw speed.
Calculation models
A line Prometheus builds and trains in-house, tuned on mathematical work — from arithmetic at scale to symbolic proofs and research-grade reasoning. These are the models with proper names.
Served multi-region — pick where the math runs
prometheus-sigma: High-throughput arithmetic, unit conversion, and tabular math at scale. The fastest model in the line for everyday calculation.
prometheus-calculus: Calculus, linear algebra, and numerical methods for engineering, simulation, and quant & finance modeling.
prometheus-theorem: Step-by-step derivations, symbolic manipulation, and formal proofs that hold up to verification.
prometheus-axiom: The deepest tier, with competition- and research-grade mathematical reasoning for the problems nothing else can crack.
Voice models
LLMs that generate voice, optimized end-to-end for latency — multi-region serving you choose by need, our own cache layer, and a fine-tune pass for raw speed.
Pick the API region per request so audio is served closest to your users — lowest, steadiest latency wherever they call from.
A Prometheus cache layer short-circuits repeated work for an extra speed boost on the cached tier.
A dedicated fine-tune pass squeezes time-to-first-audio so live calls and voice agents feel instant.
gpt-oss-120b-uncached · Standard
Low-latency voice generation served from the region you choose, built for live phone calls and voice agents.
Input: $0.50 / 1M
Output: $1.50 / 1M

gpt-oss-120b-cached · Cached boost
Same model with an extra speed boost from our own cache. The fastest path to first audio when prompts repeat.
Input: $0.75 / 1M
Output: $2.00 / 1M
Both tiers run multi-region — the cached tier adds our cache for an extra speed boost.
Multi-region · United States
Our own model lines — calculation and voice — run across six regions spanning the United States, so requests land on the capacity closest to your users. Pick the region by need and your math and voice traffic is served from there for the lowest, steadiest latency.
us-east-1 · Virginia
Our densest hub, next to the busiest internet exchanges on the East Coast.
us-east-2 · Ohio
Low-latency Midwest coverage with plenty of dedicated GPU capacity.
us-central-1 · Texas
Central routing that keeps both coasts within a tight latency budget.
us-south-1 · Georgia
Southeast presence tuned for voice and live phone agents.
us-west-1 · California
Bay Area capacity close to where most AI products are built.
us-west-2 · Oregon
Pacific Northwest region for the lowest West Coast time-to-first-token.
Same API in every region — only the latency changes. Marketplace models route through the balancer; calculation and voice are served multi-region.
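One way to decide which region to pick is to measure round-trip latency from where your users are and choose the lowest. The sketch below assumes you have already collected latency samples per region yourself; `pickRegion` and the sample format are hypothetical helpers, and only the six region IDs come from the list above. How the chosen region is passed on a request is not shown on this page, so the sketch stops at selecting the ID.

```typescript
// Region IDs copied from the region list on this page.
const REGIONS = [
  "us-east-1",
  "us-east-2",
  "us-central-1",
  "us-south-1",
  "us-west-1",
  "us-west-2",
] as const;
type Region = (typeof REGIONS)[number];

// Return the region with the lowest median latency among the samples provided.
// Regions with no samples are ignored; returns undefined if nothing was measured.
function pickRegion(
  samplesMs: Partial<Record<Region, number[]>>,
): Region | undefined {
  let best: Region | undefined;
  let bestMedian = Infinity;
  for (const region of REGIONS) {
    const samples = samplesMs[region];
    if (!samples || samples.length === 0) continue;
    const sorted = [...samples].sort((a, b) => a - b);
    // Upper median for even-length inputs; robust to a single slow outlier.
    const median = sorted[Math.floor(sorted.length / 2)] ?? Infinity;
    if (median < bestMedian) {
      bestMedian = median;
      best = region;
    }
  }
  return best;
}
```

The median is used rather than the mean so that one slow probe does not push an otherwise fast region out of contention.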
Pricing
Transparent, usage-based pricing per model — billed by input and output tokens. No seats, no minimums. Prices in USD per 1M tokens.
| Model | Provider | Description | Context | Input | Output |
|---|---|---|---|---|---|
| gpt-5.2 | OpenAI | Flagship reasoning & chat | 400K | $1.925 / 1M | $15.40 / 1M |
| gpt-5.1 | OpenAI | Balanced flagship | 400K | $1.375 / 1M | $11.00 / 1M |
| gpt-5-mini | OpenAI | Fast everyday workhorse | 400K | $0.275 / 1M | $2.20 / 1M |
| gpt-5-nano | OpenAI | High-volume, lowest cost | 400K | $0.055 / 1M | $0.44 / 1M |
| claude-opus-4.5 | Anthropic | Frontier reasoning & code | 200K | $5.50 / 1M | $27.50 / 1M |
| claude-sonnet-4.5 | Anthropic | Balanced, 1M context | 1M | $3.30 / 1M | $16.50 / 1M |
| claude-haiku-4.5 | Anthropic | Fast & affordable | 200K | $1.10 / 1M | $5.50 / 1M |
| grok-4.20 | xAI Grok | Flagship, 2M context | 2M | $1.375 / 1M | $2.75 / 1M |
| grok-4.3 | xAI Grok | Latest balanced | 1M | $1.375 / 1M | $2.75 / 1M |
| grok-build-0.1 | xAI Grok | Agentic coding & build | 256K | $1.10 / 1M | $2.20 / 1M |
| qwen3-max | Qwen | Flagship, frontier open | 262K | $0.858 / 1M | $4.29 / 1M |
| qwen3-coder | Qwen | Agentic coding, 1M context | 1M | $0.242 / 1M | $1.98 / 1M |
| qwen3-235b-a22b | Qwen | Efficient MoE reasoning | 131K | $0.50 / 1M | $2.00 / 1M |
| qwen-plus | Qwen | Low-cost workhorse | 1M | $0.286 / 1M | $0.858 / 1M |

| Model | ID | Tier | Input | Output |
|---|---|---|---|---|
| Prometheus Sigma | prometheus-sigma | Fast numeric | $0.20 / 1M | $0.80 / 1M |
| Prometheus Calculus | prometheus-calculus | Applied math | $0.50 / 1M | $2.00 / 1M |
| Prometheus Theorem | prometheus-theorem | Symbolic proofs | $1.50 / 1M | $6.00 / 1M |
| Prometheus Axiom | prometheus-axiom | Flagship | $3.00 / 1M | $12.00 / 1M |

| Model | Tier | Input | Output |
|---|---|---|---|
| gpt-oss-120b-uncached | Standard | $0.50 / 1M | $1.50 / 1M |
| gpt-oss-120b-cached | Cached boost | $0.75 / 1M | $2.00 / 1M |

Prices in USD per 1M tokens, billed per actual usage. Marketplace traffic routes through the balancer; calculation and voice are served multi-region.
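Because billing is per input and output token, the cost of a call can be estimated with simple arithmetic. The sketch below copies a handful of rates from the pricing tables above; `RATES` and `estimateCostUsd` are hypothetical names, and the result is an estimate from the listed prices rather than an invoice.

```typescript
interface Rate {
  inputPerM: number; // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

// A subset of the listed rates, copied from the pricing tables on this page.
const RATES: Record<string, Rate> = {
  "gpt-5.2": { inputPerM: 1.925, outputPerM: 15.4 },
  "claude-opus-4.5": { inputPerM: 5.5, outputPerM: 27.5 },
  "prometheus-sigma": { inputPerM: 0.2, outputPerM: 0.8 },
  "prometheus-axiom": { inputPerM: 3.0, outputPerM: 12.0 },
  "gpt-oss-120b-uncached": { inputPerM: 0.5, outputPerM: 1.5 },
  "gpt-oss-120b-cached": { inputPerM: 0.75, outputPerM: 2.0 },
};

// Estimated cost in USD: tokens / 1M * rate, summed over input and output.
function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const rate = RATES[model];
  if (!rate) throw new Error(`no rate listed for model: ${model}`);
  return (
    (inputTokens / 1_000_000) * rate.inputPerM +
    (outputTokens / 1_000_000) * rate.outputPerM
  );
}
```

For example, 2M input tokens and 500K output tokens come to about $0.80 on prometheus-sigma (0.40 + 0.40) and about $11.55 on gpt-5.2 (3.85 + 7.70).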
Get an API key

FAQ
A supermarket of LLMs behind one OpenAI-compatible endpoint. You get the newest, most-used models from OpenAI, Anthropic, Grok, and Qwen through a single key — fronted by a smart balancer — plus two model lines we build ourselves: calculation models trained on math, and latency-optimized voice models.
Every request flows through a balancer that watches provider health in real time. If a provider returns errors or goes down, the balancer fails over to a healthy fallback so your call still completes. If a model degrades or slows, traffic reroutes by latency to the fastest healthy path. All of this happens behind the endpoint, with no client changes.
We curate the newest and most-used models from OpenAI (the GPT-5 family), Anthropic (Claude 4.5), xAI Grok (Grok 4), and Qwen (Qwen3). Curated, not cluttered — the models worth shipping, kept current, all under one key.
A line Prometheus trains in-house, tuned on mathematical work: Sigma for fast numeric work at scale, Calculus for applied math and quant, Theorem for symbolic proofs and derivations, and Axiom for competition- and research-grade reasoning. These are the models with proper names, and they're served multi-region.
LLMs that generate voice, optimized end-to-end for latency. gpt-oss-120b runs in two tiers: a standard uncached tier ($0.50 in / $1.50 out per 1M), and a cached tier with an extra speed boost from our own cache ($0.75 in / $2.00 out per 1M). Both are fine-tuned for fast time-to-first-audio and served multi-region.
Our own model lines — calculation and voice — run across six US regions (N. Virginia, Columbus, Dallas, Atlanta, San Jose, and Portland). You pick the region by need so the work is served closest to your users. The endpoint and API are identical in every region — only the latency changes.
Yes. Point the OpenAI SDK (or any OpenAI-style client) at the Prometheus /v1 base URL, use a prom_sk_ key, and call chat completions with the request shapes you already know. Marketplace, calculation, and voice models are all addressed by name through the same endpoint.
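For clients that do not use the OpenAI SDK, the same call can be assembled by hand. The sketch below only builds the request and sends nothing. The base URL is the one from the snippet at the top of the page; the `/chat/completions` path and `Bearer` authorization header are assumptions based on the OpenAI API convention that the endpoint is described as compatible with, and `buildChatRequest` is a hypothetical helper.

```typescript
// Base URL taken from the drop-in snippet on this page.
const BASE_URL = "https://api.getprometheus.org/v1";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Build the URL and fetch options for a chat completion without sending it.
function buildChatRequest(
  apiKey: string,
  model: string,
  messages: ChatMessage[],
) {
  return {
    url: `${BASE_URL}/chat/completions`, // assumed path, per the OpenAI convention
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed bearer auth, as in OpenAI's API
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

The result can be passed straight to `fetch(url, init)`. Switching between marketplace, calculation, and voice models then only means changing the `model` string.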
Usage-based, per key. Models bill by input and output tokens at the rate listed on the pricing table. No seats and no monthly minimums — you pay for what you call.
Streaming chat returns standard OpenAI-style SSE chunks, and extra fields like tools, response_format, top_p, and stream_options are supported. Prefer stream: true for the lowest time-to-first-token.
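To see what streaming buys you, it helps to measure time-to-first-token while assembling the reply. The sketch below consumes any async iterable of OpenAI-style chunks; `StreamChunk` declares only the fields the sketch reads, and `collectStream` is a hypothetical helper. It assumes the chunks follow the standard OpenAI streaming shape, in which text arrives in `choices[0].delta.content`.

```typescript
// Minimal shape of a streamed chunk: only the fields this sketch reads.
interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

// Concatenate streamed text and record how long the first token took to arrive.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  now: () => number = Date.now, // injectable clock, useful for testing
): Promise<{ text: string; firstTokenMs: number | null }> {
  const start = now();
  let firstTokenMs: number | null = null;
  let text = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (for example a trailing chunk with empty choices).
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece && firstTokenMs === null) firstTokenMs = now() - start;
    text += piece;
  }
  return { text, firstTokenMs };
}
```

With the OpenAI SDK, the value returned by `chat.completions.create({ ..., stream: true })` is an async iterable of chunks and should be usable as the `stream` argument here.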
Create a key, point your OpenAI client at Prometheus, and ship. Every model on one shelf, a balancer that keeps you up, and our own math and voice models — no rewrite, no minimums.