TL;DR · Key Takeaways
- Prompting, RAG, and fine-tuning are not rivals. They fix different problems, and most confusion comes from treating them as interchangeable.
- Prompting steers the model in the moment. RAG gives it facts it lacks. Fine-tuning changes how it behaves.
- The fastest way to choose: do you need new facts (reach for RAG) or new behavior (reach for fine-tuning)? Start with prompting either way.
- Escalate in order of cost. Prompt first, add RAG if it needs your knowledge, fine-tune last, and combine them when it helps.
Ask five teams how to make a model do what they want and you will hear the same three answers in a different order: prompt it, give it RAG, or fine-tune it. Then the argument starts, usually as if you must pick one and the others are wrong. That framing is the real mistake. These three tools solve genuinely different problems, and once you can name which problem is which, the choice stops being a debate and becomes almost mechanical. This part is the decision guide, with a verdict at the end.
What each one actually changes
Prompting, covered in Part 12, changes nothing permanent. You shape the output by what you put in the context: instructions, examples, a role, the relevant data. It is instant, free beyond the tokens, and endlessly adjustable, which is exactly why it should always be your first move. Its limit is that everything has to fit in the context window every time, and it cannot give the model knowledge or skills it fundamentally lacks.
RAG, from Part 13, changes the facts in front of the model. It fetches relevant passages from your documents and drops them into the prompt, so the model answers from real, current, citable material instead of fuzzy memory. RAG is the right tool whenever the gap is knowledge: private data, fresh data, or a body of facts too large and changeable to bake in. It does not change how the model writes or reasons, only what it knows at answer time.
Fine-tuning changes the model itself, by training it further on examples until new behaviour is baked into its weights. This is the tool when the gap is not knowledge but conduct: you need a consistent house style, a rigid output format, a specialised tone, or reliable performance on a narrow task that prompting cannot pin down. Fine-tuning is the most powerful of the three and also the most expensive and the slowest to change, because every update means another training run and a fresh dataset.
The one question that decides it
Here is the heuristic I would put on a sticky note: do you need new facts, or new behaviour? If the model gives the wrong answer because it lacks information, that is a facts problem, and fine-tuning will not fix it reliably, you want RAG. If the model has the right information but says it in the wrong way, wrong format, wrong tone, ignoring your rules, that is a behaviour problem, and stuffing more context is a losing battle, you want fine-tuning. Mixing these up is the single most common and most expensive error in applied AI: teams fine-tune to inject facts (fragile and quickly stale) or bolt on RAG to fix a formatting quirk (when one good example in the prompt would have done it).
A concrete case makes the trap obvious. Imagine a support bot that keeps citing last year’s prices. The tempting fix is to fine-tune it on the new price list so it “learns” them. Do that and you have signed up for a fresh training run every time a price changes, and you still cannot trust the numbers, because the model may blur them like any other memorised fact. The facts problem wants RAG: put the live price list in a retrievable store, and the bot quotes the right number today and tomorrow, with the source attached. Now flip it. Suppose the same bot answers correctly but rambles for three paragraphs when you need two crisp sentences in a fixed template. No amount of retrieval fixes that, because the knowledge was never the issue. That is a behaviour problem, and a short fine-tune (or, often, a couple of strong examples in the prompt) is the cure. Same bot, two complaints, two completely different tools.
| Prompting | RAG | Fine-tuning | |
|---|---|---|---|
| Changes | The instruction | The facts available | The model’s behaviour |
| Best for | Steering, quick wins | Private / current knowledge | Style, format, niche skill |
| Cost | Lowest | Medium (build a pipeline) | Highest (data + compute) |
| Freshness | Instant | Re-index and it is current | Stale until you retrain |
| Can cite sources | No | Yes | No |
They stack better than they compete
The cleanest mental model is a ladder of cost, not a menu of alternatives. Start at the bottom with prompting, because it is free and fast and solves more than people expect. Climb to RAG when the missing piece is knowledge. Climb to fine-tuning only when you have a genuine behaviour gap that prompting and examples cannot close. Each rung costs more in effort and time, so you climb only as high as the problem forces you.
And the most capable systems often use all three together. A customer-support assistant might be fine-tuned to nail the company’s tone and a strict reply format, use RAG to pull the right policy and order details, and sit behind a carefully written system prompt that sets the rules. There is no contradiction in that, because each tool is doing the job only it can do: fine-tuning for the voice, RAG for the facts, prompting for the moment-to-moment steering. Seeing them as a stack rather than a contest is most of the battle.
▾ Go Deeper (optional, for technical readers)
If you do fine-tune, you rarely need to touch all the weights. Full fine-tuning updates every parameter in the model. It is the most thorough option but also the heaviest: you need enough GPU memory to hold and update the entire model, you produce a full-size copy for every variant you train, and you risk catastrophic forgetting, where pushing hard on new behaviour degrades the general ability the model already had.
LoRA (low-rank adaptation) is the popular alternative and the reason fine-tuning got accessible. Instead of editing the original weights, it freezes them and trains a small set of extra “adapter” matrices that nudge the model’s behaviour. These adapters are tiny, often a fraction of a percent of the model’s size, so training is far cheaper, several variants can be stored and swapped cheaply, and the base model stays intact. QLoRA pushes this further by quantizing the frozen base model to 4-bit so the whole job fits on a single modest GPU. For the large majority of fine-tuning needs, LoRA-style methods match full fine-tuning closely at a tiny fraction of the cost, which is why “fine-tuning” in practice usually means LoRA. Full fine-tuning is reserved for cases where you are deeply customising a model or training at the frontier.
This is Part 15 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It ties together Part 12 on prompting and Part 13 on RAG.
The Bottom Line
Prompting, RAG, and fine-tuning answer different questions, so the verdict is not “which is best” but “what are you missing.” Need to steer the model? Prompt. Need it to know your facts? RAG. Need it to behave differently in a way instructions cannot capture? Fine-tune. Run that test honestly and the right tool is usually obvious within a sentence.
My standing verdict: climb the ladder, do not leap to the top. Start with the cheapest tool that could work, escalate only when it provably cannot, and combine freely once you understand what each rung buys. That discipline saves more money and grief than any single clever technique. With the practical “how to use models” questions settled, the series now turns to the most hyped idea of the moment, AI agents, and asks the harder question: what actually works, and what is just a good demo.
References
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
- QLoRA: efficient fine-tuning of quantized LLMs (Dettmers et al., 2023)
- When to prompt vs fine-tune (Anthropic docs)
« Part 14: vector databases | Generative AI Complete Guide | Next: Part 16, AI agents »








