RAG vs Fine-Tuning: When to Use Each (and When Not To)
If you're building anything with LLMs in production, this question shows up in week one and never really goes away. The internet has a lot of bad takes on it. Here's what actually matters.
The 30-second answer
- RAG = "give the model fresh information at inference time"
- Fine-tuning = "change how the model behaves or speaks"
They solve different problems. Most teams need RAG. A few teams need fine-tuning. A small minority need both.
When you need RAG
Use RAG when the bottleneck is knowledge the model doesn't have.
Examples: - A customer support bot that needs to know your product docs (which update weekly) - A legal assistant searching internal contracts - A coding agent that needs to read the current state of a repo - Any chatbot that should never confidently make up facts
The pattern is always the same:
user query
↓
embed query
↓
search vector DB (top 5-20 chunks)
↓
inject chunks into prompt
↓
LLM generates answer grounded in chunks
The model itself doesn't change. You're just feeding it the right context at the right time.
Cost reality: a well-tuned RAG pipeline with gpt-4.1-mini or claude-haiku-4-5 runs ~$0.001-0.005 per query at scale. The infrastructure (vector DB, embeddings, orchestration) is the real engineering work.
When you need fine-tuning
Use fine-tuning when the bottleneck is how the model behaves.
Examples: - You need it to always answer in a specific JSON schema (and prompt engineering keeps failing on edge cases) - You need it to mimic a specific writing style across thousands of generations - You need it to handle a specialized vocabulary (medical, legal, internal jargon) without constantly explaining terms - You need a smaller, cheaper model to behave like a bigger one for a narrow task (distillation)
The pattern:
collect 500-5.000 high-quality examples of (input, ideal output)
↓
fine-tune base model on those examples
↓
deploy the new model
↓
serve it like any other model
Cost reality: fine-tuning an open model (Llama 3.3, Mistral) costs $50-500 in compute. Hosting it costs ~$0.50-3/hour depending on size and provider. Fine-tuning a frontier closed model (OpenAI, Anthropic) is more expensive but the operational simplicity is worth it for small teams.
When you absolutely don't need either
Be honest: a lot of "AI projects" are just prompt engineering problems wearing a costume.
Before reaching for RAG or fine-tuning, try:
- A better system prompt with explicit examples
- Tool use (function calling) instead of forcing the model to know everything
- A smaller model with a cleaner prompt instead of a bigger one with a sloppy one
If GPT-4.1 with a 200-word system prompt does the job, that is the production answer. Adding RAG or fine-tuning to feel sophisticated is technical debt with extra steps.
The decision matrix
| Symptom | First try | If that fails |
|---|---|---|
| "It hallucinates facts about our domain" | RAG | Fine-tune for grounding |
| "It doesn't follow our output format" | Better prompt + examples | Fine-tune for format |
| "It's too expensive at scale" | Smaller model + RAG | Distill via fine-tuning |
| "It doesn't know about events after its cutoff" | RAG with web search | (Don't fine-tune for facts. Ever.) |
| "It writes in the wrong tone" | System prompt with style guide | Fine-tune on tone examples |
| "It can't reason about long documents" | RAG with hierarchical chunking | Long-context model |
The combined pattern (real production systems)
The mature setup in most production LLM stacks looks like:
- Fine-tuned smaller model as the "router" (cheap, fast, specialized)
- RAG pipeline for any domain-specific knowledge
- Frontier model for the hard cases the router escalates
Example: a customer support agent might use a fine-tuned Llama 3.3 8B for 80% of queries (template responses, simple lookups), with a Claude or GPT-4.1 fallback for the 20% that need real reasoning. RAG over the docs feeds both.
This is overkill for an MVP. It's the right architecture once you're handling 100k+ queries/month and need to control costs.
What I tell teams starting out
Build the dumbest thing that works first. Track where it fails. Let the failures tell you whether you need RAG (model doesn't know things) or fine-tuning (model doesn't behave correctly) or just a better prompt.
Most teams reach for fine-tuning when RAG would have solved it. RAG is reversible, debuggable, and your data stays in your control. Fine-tuning bakes assumptions into a model — it's harder to roll back.
One more thing: evaluation matters more than the technique
Whatever you choose, you need a regression test suite of 50-200 representative queries with expected behaviors. Without that, you can't tell if a change made things better or worse. This single discipline separates production-grade LLM systems from demos that work on Tuesday and break on Wednesday.
We dedicate an entire module of the curriculum to evaluation methodology because it's the unsexy skill that determines whether anything else you build is actually working.
Want to go deeper?
This article is a small slice of what we teach in the Generative AI Development Diploma at UNLu. The full program covers RAG architectures, fine-tuning, evaluation frameworks, MLOps for LLMs, and production deployment.
Explore the IA Diploma