The Real Costs of Serving an LLM in Production (2026 Edition)
LLM pricing pages list "$X per million tokens" and let you draw your own conclusions. In production, that number is a fraction of what you actually spend. Here's the breakdown nobody publishes.
The headline numbers (December 2025)
USD per million tokens, input and output, for frontier models and cheaper alternatives:
| Model | Input | Output |
|---|---|---|
| GPT-4.1 | $2.50 | $10.00 |
| GPT-4.1-mini | $0.15 | $0.60 |
| Claude Opus 4.7 | $15.00 | $75.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $0.80 | $4.00 |
| Gemini 2.5 Pro | $1.25 | $5.00 |
| Llama 3.3 70B (self-hosted on AWS g5.12xlarge) | ~$0.30 | ~$0.30 |
These are sticker prices. Your actual cost per query depends on factors that matter much more.
The 4 multipliers that actually determine cost
1. Output:input ratio
A query with 200 input tokens and 1,500 output tokens on Claude Sonnet costs about $0.023, and roughly 97% of that is output. Engineers optimize prompts (input) when they should be optimizing response length (output).
Lesson: cap max_tokens. Use structured outputs that force concision. Prefer bullet points to paragraphs in agent responses.
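The arithmetic behind that split is easy to check. The sketch below is illustrative only: the prices are copied from the table earlier in this article, and the dictionary keys and the function name `query_cost` are made-up labels, not official model identifiers.

```python
# USD per million tokens (input, output), copied from the pricing table above.
# The keys are illustrative labels, not official API model names.
PRICES = {
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku": (0.80, 4.00),
}


def query_cost(model, input_tokens, output_tokens):
    """Return (total cost in USD, share of that cost spent on output)."""
    in_price, out_price = PRICES[model]
    input_cost = input_tokens * in_price / 1_000_000
    output_cost = output_tokens * out_price / 1_000_000
    total = input_cost + output_cost
    return total, output_cost / total


total, output_share = query_cost("claude-sonnet", 200, 1_500)
print(f"${total:.4f} per query, {output_share:.0%} of it output")
```

Rerunning it with a smaller output budget shows how much a tighter max_tokens cap is worth for a given workload.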
2. Retry rate
If your structured outputs fail JSON parsing 8% of the time and you retry, that's effectively a 1.08x multiplier on your bill. Some systems I've audited had 30%+ retry rates because of brittle prompts.
Lesson: invest in robust output parsing with auto-repair logic before throwing more compute at it.
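One possible shape for that auto-repair step is sketched below. It is a minimal example, not a complete parser: the function name `parse_json_with_repair` is hypothetical, and it only handles two common failure modes (chatter around the JSON object and trailing commas) using the standard library.

```python
import json
import re

# Matches a comma that sits directly before a closing brace or bracket.
# Caveat: this can also touch commas inside string values, so treat it as
# a last resort rather than a safe transformation.
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def parse_json_with_repair(raw):
    """Try strict parsing first, then progressively looser repairs.

    Returns the parsed value, or None if every attempt fails.
    """
    candidates = [raw]
    # Repair 1: keep only the span from the first "{" to the last "}",
    # which drops code fences and surrounding commentary.
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start:end + 1])
    for text in candidates:
        # Repair 2: retry each candidate with trailing commas removed.
        for attempt in (text, TRAILING_COMMA.sub(r"\1", text)):
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
    return None  # only now is a paid retry worth considering


reply = 'Sure! Here is the JSON:\n{"intent": "refund", "items": [1, 2,],}\nHope that helps.'
print(parse_json_with_repair(reply))
```

Only responses that return None need a second model call, which is where the retry multiplier comes from.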
3. Context bloat
Every long-running conversation adds history to the prompt. A chat that has been going for 40 turns might be sending 15,000 tokens of context with every new message, and you pay for all of it, every time.
Lesson: summarize old context aggressively. Use semantic memory (not raw history). Truncate or compress conversations longer than 20 turns.
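To see why this compounds, the sketch below totals the input tokens billed across a whole conversation, with and without a cap on resent history. The numbers are assumptions chosen for illustration: a 500-token system prompt and 375 tokens per turn, which puts the 40th request at about 15,500 tokens, close to the figure above. The function name and the `window` parameter are hypothetical.

```python
def total_input_tokens(turns, system_tokens, tokens_per_turn, window=None):
    """Sum the input tokens billed over one conversation.

    Every request resends the system prompt, the new message, and prior
    turns. `window` caps how many prior turns are resent (None = all).
    """
    total = 0
    for turn in range(1, turns + 1):
        prior = turn - 1 if window is None else min(turn - 1, window)
        total += system_tokens + (prior + 1) * tokens_per_turn
    return total


full = total_input_tokens(40, system_tokens=500, tokens_per_turn=375)
capped = total_input_tokens(40, system_tokens=500, tokens_per_turn=375, window=10)
print(full)    # 327500
print(capped)  # 164375
```

Under these assumptions, resending only the last 10 turns roughly halves the input tokens billed for a 40-turn chat. A summary of the dropped turns would add some tokens back, which this sketch ignores.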
4. Embedding storage and re-embedding
A 10-million-chunk vector DB on Pinecone Standard is ~$280/month. On pgvector + RDS, ~$120/month. Self-hosted on a single beefy box, ~$30/month plus your time.
Add the cost of re-embedding when you change the embedding model (it happens twice a year on average), and the operational cost of the vector DB often exceeds the inference cost for read-heavy RAG systems.
Worked example: a customer support agent
Let's price a real workload.
Assumptions:
- 100,000 conversations per month
- Average conversation: 4 turns
- Each turn: 800 input tokens (system prompt + history + retrieved chunks) + 250 output tokens
- 70% handled by Claude Haiku, 30% escalated to Sonnet
- RAG over 50,000 chunks, embedded with text-embedding-3-small
Inference cost:
| Component | Calculation | Monthly |
|---|---|---|
| Haiku tokens: 70k convs × 4 turns × (800 in + 250 out) | 224M in, 70M out @ Haiku rates | $459.20 |
| Sonnet tokens: 30k convs × 4 turns × (800 in + 250 out) | 96M in, 30M out @ Sonnet rates | $738.00 |
| Inference total | | $1,197.20 |
Infrastructure cost:
| Component | Monthly |
|---|---|
| Pinecone Standard (5M vectors) | $140 |
| Embedding generation (50k chunks × avg 400 tokens each × $0.02/M tokens) | $0.40 |
| Query embedding (400k embed calls × ~30 tokens) | $0.24 |
| Backend infra (1 small ECS service + RDS Postgres) | $180 |
| Logging & monitoring (Langfuse self-hosted) | $40 |
| Infra total | $360.64 |
Hidden costs:
| Component | Monthly |
|---|---|
| Engineer time on retry/error handling (~3 hrs/week × $100/hr) | $1,200 |
| Evaluation runs (weekly regression test on 500 queries) | $80 |
| Failed experiments / A/B tests | $200 |
| Hidden total | $1,480 |
Grand total: about $3,038/month for 100k conversations, or roughly $0.030 per conversation.
Inference is about 39% of the bill. Infra is about 12%. Engineering time and operations are about 49%, which is the part vendor pricing pages never mention.
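The inference rows can be recomputed directly from the stated assumptions, which is a useful habit before trusting any cost table. The sketch below does only that; the constant and function names are illustrative, and the rates are the Haiku and Sonnet prices from the table at the top of this article.

```python
# USD per million tokens (input, output), from the pricing table above.
RATES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00)}

# Workload assumptions from the worked example.
CONVERSATIONS = 100_000
TURNS = 4
INPUT_TOKENS_PER_TURN = 800
OUTPUT_TOKENS_PER_TURN = 250
TRAFFIC_SHARE = {"haiku": 0.70, "sonnet": 0.30}


def monthly_inference_cost(model):
    """Monthly USD spend on one model under the assumptions above."""
    turns = CONVERSATIONS * TRAFFIC_SHARE[model] * TURNS
    in_millions = turns * INPUT_TOKENS_PER_TURN / 1_000_000
    out_millions = turns * OUTPUT_TOKENS_PER_TURN / 1_000_000
    in_rate, out_rate = RATES[model]
    return in_millions * in_rate + out_millions * out_rate


for model in RATES:
    print(f"{model}: ${monthly_inference_cost(model):,.2f}")
print(f"total: ${sum(monthly_inference_cost(m) for m in RATES):,.2f}")
```

Changing TRAFFIC_SHARE is the quickest way to see how sensitive the bill is to the escalation rate.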
Where teams overspend
Patterns I see consistently:
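- Using Opus/GPT-4.1 for everything. Most queries don't need it. Routing 80% to a cheaper model cuts costs by up to 5x (about 4x at the prices above), since the remaining 20% still runs on the expensive model.
- Not caching. Identical queries hit the LLM again and again. Even a 15% cache hit rate trims roughly 15% off the inference bill.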
- Streaming as a default. Streaming has UX value but adds overhead. For batch jobs, disable it.
- Forgetting about embedding re-runs. If you change models or chunking strategy, you pay to re-embed everything.
- Storing context in the prompt instead of the retrieval layer. Long prompts feel productive but compound costs across every turn.
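The caching point in the list above is the cheapest to act on. Below is a minimal sketch of an exact-match cache, assuming an in-memory dictionary and a caller-supplied `call_model(model, prompt)` function. Both are placeholders: a production setup would more likely use a shared store with expiry, and the class name is hypothetical.

```python
import hashlib


class ExactMatchCache:
    """Serve repeated prompts from memory instead of calling the model again."""

    def __init__(self, call_model):
        # call_model is a hypothetical function: (model, prompt) -> str
        self._call_model = call_model
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        # Collapse whitespace and case so trivial variations still match.
        # Skip the lowercasing if your prompts are case-sensitive.
        normalized = " ".join(prompt.split()).lower()
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def complete(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt)
        self._store[key] = result
        return result
```

Tracking `hits` and `misses` gives the cache hit rate directly, which is the number needed to estimate the saving.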
Where to self-host
Self-hosting an open model (Llama 3.3 70B on AWS g5.12xlarge) hits cost parity with frontier APIs at roughly 5M output tokens per day, sustained. Below that, you're paying for idle GPU. Above that, self-hosting can be 60-80% cheaper.
What people forget:
- Self-hosting needs ML/SRE skills. Budget 0.5 FTE just for the inference stack.
- vLLM, TGI, and llama.cpp have very different throughput profiles. Benchmark for your actual workload.
- Open models lag frontier by 6-12 months on the hardest reasoning tasks. Match the model to the job.
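The break-even point depends on your own GPU rate and on which API price you compare against, so it is worth computing rather than memorizing. The sketch below is a deliberately simplified model: the function name and both parameters are hypothetical, and it assumes the instance runs around the clock and can sustain the resulting throughput. Staffing and idle headroom are left out.

```python
def breakeven_tokens_per_day(gpu_cost_per_hour, api_price_per_million):
    """Daily token volume at which a dedicated GPU box costs the same as an API.

    gpu_cost_per_hour: what you actually pay for the instance, in USD.
    api_price_per_million: the API price you would otherwise pay, in USD.
    """
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    daily_gpu_cost = gpu_cost_per_hour * 24
    return daily_gpu_cost / api_price_per_million * 1_000_000
```

Plug in your negotiated or reserved instance rate rather than the on-demand list price, and add the 0.5 FTE from the list above if you want a fully loaded comparison.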
The cheap-but-correct stack for 2026
If you're building a new LLM product today and want to keep costs sane:
- Default model: Claude Haiku 4.5 or GPT-4.1-mini (cheap, fast, good enough for 80% of queries)
- Escalation model: Claude Sonnet 4.6 (smart, reasonable cost)
- Embeddings: OpenAI text-embedding-3-small or Cohere embed-multilingual-v3
- Vector DB: pgvector if you have <2M chunks, Qdrant or Weaviate for larger
- Observability: Langfuse (open source, self-hostable)
- Orchestration: simple Python or LangGraph (avoid frameworks that hide the cost of every call)
This stack runs production workloads at $0.005-0.05 per query for most use cases.
The number that actually matters
Forget cost per million tokens. The number to optimize is cost per successful task completion. A $0.10 query that solves a problem is cheaper than a $0.005 query that requires three retries and a human escalation.
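One way to put a number on that metric is sketched below. The model is an assumption, not a standard formula: every task pays for its attempts, failed tasks additionally pay a human escalation cost, and the total is spread over the tasks that succeeded. The function and parameter names are hypothetical.

```python
def cost_per_successful_task(cost_per_attempt, attempts_per_task,
                             success_rate, escalation_cost=0.0):
    """Expected spend per task that ends successfully.

    Failed tasks still cost money, so spend is divided by the success rate.
    escalation_cost is the assumed price of a human picking up a failure.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    spend_per_task = (cost_per_attempt * attempts_per_task
                      + (1 - success_rate) * escalation_cost)
    return spend_per_task / success_rate


# Illustrative inputs only: one pricey query that always works, versus a
# cheap query with three retries, a 50% success rate, and a $2 escalation.
print(cost_per_successful_task(0.10, 1, 1.0))
print(cost_per_successful_task(0.005, 4, 0.5, escalation_cost=2.0))
```

The escalation cost dominates as soon as the success rate drops, which is why the cheaper query can end up the more expensive one.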
That's the metric we teach our students to track. It's the only one that maps directly to whether your LLM product is profitable.
Want to go deeper?
This article is a small slice of what we teach in the Generative AI Development Diploma at UNLu. The full program covers RAG architectures, fine-tuning, evaluation frameworks, MLOps for LLMs, and production deployment.