Dr. Pranay Jha

RAG: How to Stop Your AI Making Things Up (GenAI Series, Part 13)

Retrieval-augmented generation lets a model answer from your own documents by fetching the relevant passages at question time. How RAG works, and why it beats fine-tuning for facts.

Generative AI Series · Part 13 of 30

TL;DR · Key Takeaways

  • RAG (retrieval-augmented generation) lets a model answer from your documents by fetching the relevant passages and handing them to the model at question time.
  • It works in two phases: ingest, chunk, embed, and store your documents once; then, for each question, retrieve the best-matching pieces and generate an answer grounded in them.
  • For facts, RAG usually beats fine-tuning: it is cheaper, easy to keep current, and lets the answer cite its source.
  • RAG does not abolish hallucination, but it sharply reduces it by putting the truth in front of the model instead of hoping it memorised it.

Sooner or later, almost everyone asks the same question of these tools: “How do I get the AI to answer from my documents?” Your company handbook, your product manuals, last quarter’s reports: the model has never seen any of it, and when you ask about it directly you get a confident guess or a polite shrug. The standard answer to this problem has a name, RAG, and it has quietly become the most important pattern in applied AI. It is also the most reliable single fix for the hallucination problem we have been circling since Part 5.

[Figure: The model, plus your documents. THE MODEL (frozen, knows only its training data) + YOUR DOCS (private, current, specific to you) → GROUNDED ANSWERS (from your material, with a source to check). RAG is the wiring that connects the two boxes on the left.]
You do not retrain the model. You give it the right pages to read at the moment it answers.

The problem RAG solves

A model knows only what was baked into its parameters during training, as Part 4 laid out. That means three big blind spots: anything private (your internal docs), anything recent (events after its cut-off), and anything niche it barely saw. Ask about any of these and the model has nothing solid to draw on, so it does what it always does and produces a plausible guess. This is the root of a huge share of real-world hallucinations: not the model being broken, just the model being asked about things it was never told.

The instinct many people have is to fine-tune: to train the model further on their documents so it “knows” them. For facts, this is usually the wrong tool, and it is worth being blunt about why. Fine-tuning is good at teaching a model new behaviour: a style, a format, a tone. It is unreliable at teaching it new facts you can trust, because the facts get blurred into the same fuzzy parameters as everything else, with no guarantee they come back out correctly and no way to cite them. And the moment a document changes, your fine-tune is stale and you are retraining again. RAG sidesteps all of that by keeping facts where they belong: in a lookup, not in the model’s weights.

How RAG works: retrieve, then generate

The name says it plainly: retrieval-augmented generation. Before the model writes a word, you retrieve the relevant material and augment the prompt with it. The pipeline runs in two phases. The first is an indexing pass over your documents, done once up front and repeated only when they change. You break each document into bite-sized chunks (a few paragraphs each), turn every chunk into an embedding vector using the trick from Part 7, and store all those vectors in a vector database built for fast similarity search. Now your knowledge is sitting in a searchable map of meaning.
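The indexing phase can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the "vector database" is a plain list, and the file names and texts are invented for the example.

```python
import re
from collections import Counter


def toy_embed(text):
    """ASSUMPTION: toy stand-in for an embedding model (a bag-of-words count vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def chunk_paragraphs(document, paragraphs_per_chunk=2):
    """Split a document on blank lines and group the paragraphs into chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
        for i in range(0, len(paragraphs), paragraphs_per_chunk)
    ]


def build_index(documents):
    """Phase 1: chunk every document, embed each chunk, store vector plus source."""
    index = []  # a plain list stands in for the vector database
    for source, text in documents.items():
        for position, chunk in enumerate(chunk_paragraphs(text)):
            index.append({
                "source": source,
                "position": position,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index


# Hypothetical documents, invented for illustration.
documents = {
    "returns-policy.txt": (
        "Returns are accepted within 14 days.\n\n"
        "Items must be unused.\n\n"
        "Refunds go to the original payment method."
    ),
    "shipping.txt": "Orders ship within 2 business days.",
}

index = build_index(documents)
for entry in index:
    print(entry["source"], entry["position"], len(entry["vector"]))
```

Keeping the source and position next to each vector is what later lets an answer point back to the passage it came from.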

The second phase runs on every question. You embed the user’s question into a vector, search the database for the handful of chunks whose vectors sit closest to it, and paste those chunks into the prompt alongside the question, with an instruction like “answer using only the context below.” The model then generates its reply from material that is physically present in its context window, not from hazy memory. Because the relevant facts are right there in front of it, the answer is far more likely to be correct, and you can show the user exactly which source passages it came from.
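The query phase described above might look like the sketch below. It makes the same simplifying assumptions as any toy: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the index is a small in-memory list with invented entries, and the final call to a language model is left out because it depends on whichever model you use.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """ASSUMPTION: toy stand-in for an embedding model (a bag-of-words count vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=2):
    """Embed the question and return the k chunks whose vectors sit closest to it."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks):
    """Paste the retrieved chunks into the prompt alongside the question."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical index entries, invented for illustration.
index = [
    {"source": "returns-policy.txt", "text": "Returns are accepted within 14 days of purchase."},
    {"source": "shipping.txt", "text": "Orders ship within 2 business days."},
    {"source": "warranty.txt", "text": "Hardware is covered by a one year warranty."},
]
for entry in index:
    entry["vector"] = toy_embed(entry["text"])

question = "Within how many days are returns accepted?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the model
```

Note that the toy embedding only matches shared words, so a question phrased as "refund window" would miss a passage that says "returns". Closing that gap between wording and meaning is the job a real embedding model does.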

[Figure: The RAG pipeline. Phase 1, indexing (once): documents → chunk → embed → vector database. Phase 2, querying (every question): user question → embed it → retrieve top matching chunks → question + chunks → prompt → model writes grounded answer. The vector database built in Phase 1 is what Phase 2 searches.]
Index your knowledge once; retrieve the relevant slice on every question.

Why grounding changes the answer

The difference in practice is stark. Ask a bare model “what is our refund window?” and it will invent a reasonable-sounding policy, because it has seen thousands of refund policies and will average one for you. Ask a RAG system the same thing and it retrieves the actual paragraph from your actual policy document, then answers from that, often quoting it. One is a confident guess; the other is your real answer with a citation attached. The model did not get smarter. You changed what was in front of it when it spoke.

RAG brings three concrete wins beyond accuracy. It is current: update a document and re-index, and the system knows the new fact immediately, no retraining. It is auditable: every answer can point to its sources, which matters enormously in regulated or high-stakes settings. And it is controllable: you decide what goes in the knowledge base, so the model answers from approved material rather than the whole internet. Those properties are why RAG, not fine-tuning, is the default architecture for “chat with your data” systems across the industry.

It is worth being honest about where RAG struggles, because it is not a cure-all. It shines on questions whose answer lives in one or a few specific passages: “what is the refund window,” “which clause covers liability.” It is much weaker on questions that need the whole corpus at once, like “how many of our 4,000 contracts mention arbitration” or “summarise every complaint from last year.” Retrieval pulls a handful of chunks, not all of them, so aggregate and counting questions quietly slip through its fingers. RAG also cannot save you from a question your documents simply do not answer; it will retrieve the closest thing and the model may still stretch to fill the gap. Knowing this boundary is what separates teams who deploy RAG well from teams who are surprised by it. It is a precision tool for fetching the right passage, not a substitute for a database query or real analysis over your full dataset.
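The counting limitation is easy to see on synthetic data. In the sketch below the 4,000 contracts are generated placeholders, and `top_k_by_keyword` is a hypothetical stand-in for any retriever that returns only a handful of results; the point is only that the model can never count more documents than retrieval hands it.

```python
# Synthetic corpus, invented for illustration: every fourth contract mentions arbitration.
contracts = {
    f"contract-{i:04d}": (
        "Disputes are resolved by arbitration."
        if i % 4 == 0
        else "Disputes are resolved in court."
    )
    for i in range(4000)
}


def top_k_by_keyword(corpus, keyword, k=5):
    """Stand-in for retrieval: return only the k highest-scoring documents."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: item[1].lower().count(keyword),
        reverse=True,
    )
    return ranked[:k]


# What a RAG prompt would contain: at most k passages.
retrieved = top_k_by_keyword(contracts, "arbitration", k=5)
seen_by_model = sum("arbitration" in text for _, text in retrieved)

# What the question actually needs: a scan over the whole corpus.
true_count = sum("arbitration" in text for text in contracts.values())

print(f"model sees {seen_by_model} matching contracts, true count is {true_count}")
```

An aggregate question like this belongs to a full scan or a database query; retrieval can then be used to explain or cite individual matches.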

[Figure: “What is our refund window?” Without RAG: “Our refund window is 30 days from purchase.” Plausible, and possibly wrong; no source, just an average. With RAG: “Returns are accepted within 14 days.” From your real policy doc; source: returns-policy.pdf, p.2.]
Same model, same question. Only one of these answers you would dare put in front of a customer.

Reality check: RAG is only as good as its retrieval. If the search pulls the wrong chunks, the model will faithfully ground its answer in the wrong material and sound just as confident doing it. Most “RAG is not working” complaints are really retrieval problems (bad chunking, weak embeddings, no reranking), and they get solved in the search layer, not by swapping the model.
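One way to check whether the search layer is the culprit is to score retrieval on its own, before any model is involved. The sketch below computes a simple hit rate at k: the share of test questions for which the chunk known to hold the answer appears among the top k results. The chunk ids and rankings are invented for illustration; in practice they would come from a small hand-labelled set of questions.

```python
def hit_rate_at_k(results, k):
    """Fraction of questions whose expected chunk id appears in the top k retrieved ids."""
    if not results:
        return 0.0
    hits = sum(1 for expected, ranked_ids in results if expected in ranked_ids[:k])
    return hits / len(results)


# Hypothetical evaluation data: (expected chunk id, ids returned by retrieval in rank order).
results = [
    ("returns-policy#2", ["shipping#1", "returns-policy#2", "warranty#1"]),
    ("warranty#1", ["warranty#1", "faq#3", "shipping#1"]),
    ("faq#7", ["faq#3", "faq#1", "shipping#1"]),
]

for k in (1, 3):
    print(f"hit rate at {k}: {hit_rate_at_k(results, k):.2f}")
```

If this number is low, changing the model will not help; the chunking, embeddings, or ranking need attention first.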

Go Deeper (optional, for technical readers)

The quiet make-or-break of a RAG system is chunking. Split documents too coarsely and each chunk carries several topics, so a retrieved passage is mostly noise and the model has to hunt for the relevant line. Split too finely and a chunk loses the context that made it meaningful, a sentence that only makes sense with the paragraph around it. There is no universal answer; teams tune chunk size and add overlap (repeating a little text between adjacent chunks) so an idea straddling a boundary is not cut in half. Structure-aware splitting, breaking on headings and sections rather than blind character counts, usually beats naive slicing.
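Overlap is simple to sketch. The function below slides a fixed-size window over a list of words and steps forward by less than the window size, so adjacent chunks share a few words. The sizes are arbitrary illustration values, and splitting on words is an assumption; real pipelines often count tokens or characters and split on headings first.

```python
def chunk_with_overlap(words, size, overlap):
    """Slide a window of `size` words, repeating `overlap` words between adjacent chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a size of 4 and an overlap of 1, the last word of each chunk is repeated as the first word of the next, so a phrase that straddles the boundary survives intact in at least one chunk.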

Retrieval quality has a few more levers. Reranking adds a second, more precise model that re-scores the top candidates from the fast vector search, pushing the genuinely relevant chunks to the top. Hybrid search blends semantic similarity with old-fashioned keyword matching, which rescues cases where an exact term (a product code, a name) matters more than meaning. And the number of chunks you retrieve is a balance: too few and you miss the answer, too many and you bloat the prompt, raise cost, and trigger the “lost in the middle” effect from Part 10. If you want to see all of this built end to end on a specific on-prem stack, with the failure modes spelled out, I walk through a full pipeline in my VMware Private AI RAG pipeline write-up.
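Hybrid search can be illustrated with a weighted blend of two scores. In this sketch the `semantic` numbers are invented placeholders for what a vector search would return, `keyword_score` is a deliberately crude exact-term overlap, and the weight `alpha` is a hypothetical tuning knob; a weighted sum is only one of several ways to combine rankings.

```python
import re


def keyword_score(query, text):
    """Share of query terms that appear verbatim in the text (crude keyword match)."""
    query_terms = set(re.findall(r"[a-z0-9-]+", query.lower()))
    text_terms = set(re.findall(r"[a-z0-9-]+", text.lower()))
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def hybrid_rank(query, candidates, alpha=0.5):
    """Blend semantic and keyword scores; alpha=1.0 means semantic only."""
    scored = [
        (alpha * c["semantic"] + (1 - alpha) * keyword_score(query, c["text"]), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


# Hypothetical candidates; the semantic scores are made-up placeholders.
candidates = [
    {"id": "a", "text": "General troubleshooting steps for network adapters.", "semantic": 0.80},
    {"id": "b", "text": "Known issue with adapter model NX-4410 after firmware update.", "semantic": 0.70},
]

query = "NX-4410 firmware issue"
semantic_only = [c["id"] for c in hybrid_rank(query, candidates, alpha=1.0)]
hybrid = [c["id"] for c in hybrid_rank(query, candidates, alpha=0.5)]
print("semantic only:", semantic_only)
print("hybrid:", hybrid)
```

On semantic score alone the general troubleshooting chunk ranks first; once the exact product code counts, the chunk that actually mentions it moves to the top. A reranker plays a similar role at a later stage, re-scoring the top candidates with a slower, more precise model.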

This is Part 13 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. RAG leans on the embeddings from Part 7, and answers the hallucination problem from Part 11.

The Bottom Line

RAG answers the question everyone eventually asks, how to make a model speak from your own knowledge, without the cost and fragility of retraining. You index your documents into a searchable map of meaning once, then on every question you fetch the most relevant pieces and let the model answer from them. The payoff is accuracy you can trace to a source, knowledge you can update in minutes, and a knowledge base you fully control.

The reframing I would hand anyone starting out: do not try to pour facts into the model, put the facts beside it. That one idea separates demos that hallucinate from systems people actually trust. RAG also leans entirely on a piece we have so far waved at, the vector database that makes “find the closest chunks” fast enough to do on every request. That is exactly where the next part goes.

Frequently Asked Questions

What does RAG stand for?

RAG stands for retrieval-augmented generation. It fetches relevant passages from your own documents and gives them to the model at question time, so it answers from real sources instead of memory.

Is RAG better than fine-tuning?

For supplying facts, usually yes. RAG is cheaper, easy to keep current, and can cite its sources, while fine-tuning is better for changing a model’s behaviour or style than for injecting reliable facts.

Does RAG eliminate hallucinations?

No, it sharply reduces them. If retrieval pulls the wrong passage the model can still answer incorrectly, so RAG lowers the risk rather than removing it entirely.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
