TL;DR · Key Takeaways
- On VMware Private AI Foundation, roughly four out of five bad RAG answers trace to ingestion, chunking and retrieval, not to the LLM. Stop tuning the prompt first.
- The single highest-ROI fix is adding a NeMo Retriever reranking NIM. It reorders the top-k chunks so the best passage is not stranded at position 4.
- Embedding asymmetry (query vs passage input_type) and same-dimension model swaps under an existing pgvector index are the two silent killers that pass every smoke test. A dimension mismatch, by contrast, fails loudly.
- Build an eval set before you scale ingestion. Without retrieval metrics you cannot tell a chunking failure from a generation failure.
A RAG demo that wows the steering committee and a RAG system that survives real users are two different animals. The demo runs against ten clean PDFs. Production runs against a decade of messy SharePoint exports, scanned contracts, and tables that mean nothing once they are flattened into prose. When answers go wrong, almost everyone reaches for the model first: swap it, tune the prompt, raise the temperature. That is the wrong end of the pipeline. On VMware Private AI Foundation with NVIDIA, the model is rarely the problem. Retrieval is.
Here are the seven failures I see most often when a PAIF RAG pipeline ships, mapped to the stage where each one actually lives, with the fix that moves the needle. Most of them never throw an error. They just quietly hand the LLM the wrong context and let it sound confident about it.
1. Chunking is the bug you blame on the model
If you fix one thing, fix this. The bulk of RAG failures originate in ingestion and chunking, not generation. Fixed-size 512-token chunks with no overlap cut sentences in half, strip tables of their headers, and split a procedure across two chunks so neither one is retrievable on its own. The retriever then returns a fragment that is topically close but semantically useless, and the LLM does its best with garbage. Hierarchical or layout-aware chunking that respects document structure routinely lifts answer accuracy from the low 60% range into the high 80s on the same corpus. Use NeMo Retriever ingestion (nv-ingest) for PDF and table extraction rather than a naive text splitter, and keep tables intact as their own units with the surrounding heading attached.
Validate before you scale: take twenty real questions, retrieve, and read the raw chunks that come back. If a human cannot answer the question from those chunks, no model will.
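To make "respects document structure" concrete, here is a minimal sketch of the idea. It is not nv-ingest and does not use its API. It assumes the document has already been parsed into (kind, text) blocks where kind is "heading", "para" or "table"; that block format, the chunk_blocks name, and the word limit are all illustrative assumptions.

```python
from typing import Iterable


def chunk_blocks(blocks: Iterable[tuple[str, str]], max_words: int = 200) -> list[str]:
    """Group parsed blocks into chunks that follow the document's structure.

    Every chunk is prefixed with the most recent heading, and a table is
    always emitted whole as its own chunk instead of being split.
    """
    chunks: list[str] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk under the current heading.
        if buffer:
            body = "\n".join(buffer)
            chunks.append(f"{heading}\n{body}" if heading else body)
            buffer.clear()

    for kind, text in blocks:
        if kind == "heading":
            flush()  # never mix two sections in one chunk
            heading = text
        elif kind == "table":
            flush()
            chunks.append(f"{heading}\n{text}" if heading else text)
        else:  # paragraph
            words = sum(len(p.split()) for p in buffer) + len(text.split())
            if buffer and words > max_words:
                flush()  # split only on paragraph boundaries
            buffer.append(text)
    flush()
    return chunks


# Hypothetical parsed document
doc = [
    ("heading", "Rotate the license token"),
    ("para", "Step 1: open the portal."),
    ("para", "Step 2: download the new token."),
    ("table", "| Field | Value |\n| TTL | 90 days |"),
    ("heading", "Troubleshooting"),
    ("para", "Check the service log."),
]
chunks = chunk_blocks(doc)
```

The two procedure steps stay together, the table travels with its heading, and nothing is cut mid-sentence. A real pipeline would count tokens with the embedding model's tokenizer rather than words.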
2. No reranker, so the best chunk sits at position 4
Raw vector search sorts by embedding similarity, which is a blunt proxy for relevance. The chunk that actually answers the question often lands at rank 4 or 5, behind noisier passages that happen to sit closer in vector space. A reranker reads the query and each candidate chunk together and rescores them with a deeper cross-encoder, pushing the genuinely relevant passage to the top. On PAIF, deploy a NeMo Retriever reranking NIM (for example nvidia/llama-3.2-nv-rerankqa-1b-v2) and retrieve wide, then rerank narrow: pull the top 20 from pgvector, rerank, and pass the top 5 to the model.
Why I recommend it: it is the highest-ROI change in the whole pipeline and it touches nothing upstream. When I would not: a tiny corpus (a few hundred chunks) where retrieval is already near-perfect, or a latency budget so tight that the extra GPU hop is unacceptable. Validate first: the reranker is another model to serve, so confirm you have the vGPU headroom before you wire it in, or it becomes failure #6.
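The retrieve-wide, rerank-narrow shape can be sketched in a few lines. This is an illustration under stated assumptions, not the NIM client: rerank_narrow and toy_overlap_score are hypothetical names, and the toy scorer is a word-overlap stand-in so the example runs anywhere. In a real deployment score_fn would wrap a call to the reranking NIM.

```python
from typing import Callable, Sequence


def rerank_narrow(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Rescore every candidate against the query and keep the best few."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# Candidates in the order vector search returned them: the best one is last.
candidates = [
    "vGPU profiles overview",
    "license server sizing",
    "network notes",
    "how to rotate the vGPU license token",
]
top = rerank_narrow("rotate the vGPU license token", candidates, toy_overlap_score, keep=2)
print(top[0])  # how to rotate the vGPU license token
```

Keeping the scorer as a plain callable means the same function works with the toy scorer in unit tests and with the real reranker in production.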
3. Embedding mismatch: the silent corruption
This one passes every demo and rots in production. NeMo Retriever embedding models are asymmetric: passages are embedded with input_type: passage and queries with input_type: query. Embed both sides the same way and similarity scores drop just enough that retrieval looks plausible but lands on the wrong chunks. The second trap is re-embedding your corpus with a different model than the one serving live queries, or changing the model under an existing index. A dimension mismatch at least fails loudly against pgvector. A same-dimension model swap fails silently and you will chase it for days. Pin the embedding model and version, store it in the collection metadata, and reindex the entire corpus whenever it changes.
# Queries and passages must use different input_type
curl -s http://embed-nim:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input": ["how do I rotate the vGPU license token"],
    "input_type": "query"
  }'
# pgvector: a dimension mismatch fails loudly (good)
# ERROR: expected 2048 dimensions, not 768
# A same-dimension model swap under the index fails SILENTLY (bad)
# retrieval still returns rows, they are just wrong
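Both traps can be guarded in code. The sketch below is an assumption-laden illustration: embed_payload and check_index_compat are hypothetical helpers, and the shape of the metadata dict (keys "model" and "dim") is invented for the example. The only things taken from above are the model name and the input_type field.

```python
import json

EMBED_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"  # model name from the curl example


def embed_payload(texts: list[str], input_type: str, model: str = EMBED_MODEL) -> str:
    """Build the JSON body for /v1/embeddings with an explicit input_type."""
    if input_type not in ("query", "passage"):
        raise ValueError(f"input_type must be 'query' or 'passage', got {input_type!r}")
    return json.dumps({"model": model, "input": texts, "input_type": input_type})


def check_index_compat(index_meta: dict, live_model: str, live_dim: int) -> None:
    """Refuse to query an index that was built by a different model or dimension."""
    if index_meta["model"] != live_model:
        raise RuntimeError(
            f"index built with {index_meta['model']}, live queries use {live_model}: reindex"
        )
    if index_meta["dim"] != live_dim:
        raise RuntimeError(f"index dim {index_meta['dim']} != live dim {live_dim}")


# Hypothetical metadata stored next to the collection at ingestion time
index_meta = {"model": EMBED_MODEL, "dim": 2048}

ingest_body = embed_payload(["Rotate the token from the licensing portal."], "passage")
query_body = embed_payload(["how do I rotate the vGPU license token"], "query")
check_index_compat(index_meta, EMBED_MODEL, 2048)  # passes silently
```

Running check_index_compat at service startup turns the silent same-dimension swap into a loud failure, which is the whole point of pinning the model in metadata.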
4. pgvector recall: the index defaults betray you
A pgvector instance with no vector index does a sequential scan and returns correct results, so it passes testing on 5,000 rows and falls over at five million. When teams finally add an index they often accept defaults and never tune the search beam. With HNSW, ef_search controls the recall-versus-latency trade: leave it low and the index quietly skips the chunk you needed. Build the HNSW index explicitly, size the Postgres instance on Data Services Manager for the working set plus index, and raise ef_search until recall plateaus. I cover the storage and sizing side of this in the dedicated pgvector part of the series.
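-- Build an HNSW index, then widen the search beam for recall
-- Note: pgvector indexes vector columns up to 2,000 dimensions; a 2,048-dim
-- embedding needs halfvec (with halfvec_cosine_ops) or a reduced output size
CREATE INDEX ON kb_chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 100; -- default is 40; raise recall at the cost of latency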
-- Confirm the planner actually uses the index, not a seq scan
EXPLAIN ANALYZE
SELECT id FROM kb_chunks ORDER BY embedding <=> :q LIMIT 20;
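"Raise ef_search until recall plateaus" needs a number to watch. A minimal sketch, assuming you can run the same query twice: once through the index, and once as an exact scan (one way is to disable index scans for the session). The helper names and the measured values below are hypothetical.

```python
def ann_recall(exact_ids: list[int], approx_ids: list[int]) -> float:
    """Share of the exact top-k that the index-backed query also returned."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


def pick_ef_search(recall_by_ef: dict[int, float], min_gain: float = 0.01) -> int:
    """Return the smallest ef_search after which recall stops improving."""
    settings = sorted(recall_by_ef)
    for current, larger in zip(settings, settings[1:]):
        if recall_by_ef[larger] - recall_by_ef[current] < min_gain:
            return current
    return settings[-1]


# One query: the exact scan found ids 1-4, the index missed id 3.
print(ann_recall([1, 2, 3, 4], [1, 2, 9, 4]))  # 0.75

# Hypothetical averages over a query sample at each ef_search setting
measured = {40: 0.82, 100: 0.95, 200: 0.97, 400: 0.972}
print(pick_ef_search(measured))  # 200
```

Average ann_recall over a few hundred real queries per setting, not one. A single query tells you nothing about the tail.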
5. Context stuffing and lost-in-the-middle
A large context window is not an excuse to dump everything into it. Models recall information at the start and end of the context far better than the middle, so burying the key passage at position 3 of 8 hurts even when the window is nowhere near full. More retrieved text is not more signal, it is more noise the model has to ignore. Keep assembled context tight, usually well under 8K tokens, and let the reranker decide what earns a slot rather than padding to top-k. Shorter, precise context beats a wall of marginally relevant chunks almost every time. This is the one failure the LLM choice can partly mask, and even then only partly.
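A token budget is easy to enforce at assembly time. The sketch below is a rough illustration, not a tokenizer-accurate implementation: pack_context is a hypothetical name, the 4,000 default is an arbitrary value under the 8K ceiling mentioned above, and whitespace-separated words stand in for real token counts.

```python
def pack_context(ranked_chunks: list[str], max_tokens: int = 4000) -> list[str]:
    """Take chunks in reranked order until the budget is spent.

    Token counts are approximated by whitespace-separated words, which is
    only a rough stand-in for the model's real tokenizer.
    """
    picked: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break  # stop rather than skip ahead to lower-ranked filler
        picked.append(chunk)
        used += cost
    return picked


# Chunks arrive best-first from the reranker.
context = pack_context(["a b c", "d e", "f g h i"], max_tokens=5)
print(context)  # ['a b c', 'd e']
```

Stopping at the first chunk that does not fit, instead of hunting for a smaller one further down, keeps the context made of the reranker's top picks only.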
6. The embedding NIM is your real throughput ceiling
Everyone sizes GPUs for the chat model. Then the first full corpus load takes a weekend because the embedding NIM is single-instance and ingestion is serialized behind it. Embedding is throughput-bound and bursty: heavy during ingestion and reindex, light at query time. Give the embedding and reranking NIMs their own GPU allocation so a reindex does not starve live inference, and batch ingestion through nv-ingest rather than firing one document at a time. If reindexing the corpus is a multi-day event, you have undersized this layer, and that pain compounds every time the embedding model changes (see failure #3).
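Two small helpers make the sizing conversation concrete. Both are illustrative assumptions: batched and reindex_hours are hypothetical names, and the throughput figure in the example is a placeholder you would replace with a number measured on your own embedding NIM.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield fixed-size slices so each embedding request carries many chunks."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def reindex_hours(total_chunks: int, chunks_per_sec: float, instances: int = 1) -> float:
    """Back-of-envelope wall-clock hours for a full re-embed of the corpus."""
    return total_chunks / (chunks_per_sec * instances) / 3600


batches = list(batched(["a", "b", "c", "d", "e"], 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]

# Placeholder throughput: 50 chunks/s per instance, five million chunks
single = round(reindex_hours(5_000_000, 50), 1)
scaled = round(reindex_hours(5_000_000, 50, instances=4), 1)
print(single, scaled)  # 27.8 6.9
```

The estimate assumes throughput scales linearly with instances and ignores extraction time, so treat it as a lower bound.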
7. No eval harness, so you are flying blind
Without retrieval metrics, every bad answer is a guess. Was the right chunk retrieved and the model ignored it, or was it never retrieved at all? You cannot tell by reading outputs. Build a golden set of question-and-answer pairs with the known-correct source chunk, and measure retrieval (recall@k, whether the right chunk made the context) separately from generation (faithfulness, whether the answer is grounded in that context). With those two numbers you can localize any regression in minutes instead of swapping models on a hunch. This is unglamorous and it is the difference between a RAG system you can operate and one you can only pray to.
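A golden set and the two measurements fit in a page of code. This is a minimal sketch, not a full harness: GoldenCase, recall_at_k and localize are hypothetical names, and answer_grounded is assumed to come from whatever faithfulness check you trust, whether a judge model or a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    gold_chunk_id: str        # the known-correct source chunk
    retrieved_ids: list[str]  # what the pipeline put in context, in rank order
    answer_grounded: bool     # faithfulness verdict for the generated answer


def recall_at_k(cases: list[GoldenCase], k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top k."""
    hits = sum(c.gold_chunk_id in c.retrieved_ids[:k] for c in cases)
    return hits / len(cases) if cases else 0.0


def localize(case: GoldenCase, k: int) -> str:
    """Name the stage a single case failed at."""
    if case.gold_chunk_id not in case.retrieved_ids[:k]:
        return "retrieval"
    return "ok" if case.answer_grounded else "generation"


# Hypothetical golden cases
cases = [
    GoldenCase("How do I rotate the token?", "c7", ["c2", "c7", "c9"], True),
    GoldenCase("What is the token TTL?", "c8", ["c1", "c3", "c4"], False),
    GoldenCase("Which port does the NIM use?", "c5", ["c5", "c6", "c2"], False),
]
print(recall_at_k(cases, k=3))            # 0.6666666666666666
print([localize(c, k=3) for c in cases])  # ['ok', 'retrieval', 'generation']
```

The second case would have looked like a hallucination from the output alone. The harness shows the right chunk never reached the model, so no prompt change would have fixed it.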
The failure-to-fix matrix
| Symptom | Likely cause | Fix |
|---|---|---|
| Confident but wrong answers | Fragmented chunks, tables flattened | Layout-aware chunking via nv-ingest |
| Right doc exists, never surfaces | Best chunk stranded below top-k | Add a NeMo Retriever reranker |
| Plausible but subtly off results | Query/passage input_type mismatch | Set input_type correctly, reindex |
| Fine at test scale, bad in prod | No index or ef_search too low | Build HNSW, raise ef_search |
| Answer ignores a chunk that is present | Lost-in-the-middle, context too long | Trim context under 8K, rerank first |
| Ingestion takes days | Embedding NIM undersized | Dedicate GPU, batch via nv-ingest |
| Cannot explain regressions | No retrieval vs generation metrics | Golden set, measure recall@k + faithfulness |
For the layers underneath this pipeline, see the related parts of this series: pgvector on Data Services Manager for the vector store design, NVIDIA NIM microservices for the model-serving layer, and the Model Store and Model Runtime for how the models get deployed in the first place.
What I’d Do
Build in this order, and resist the urge to start with the model. First, an eval set, even a rough one, so you can measure anything at all. Second, layout-aware ingestion, because no downstream fix survives bad chunks. Third, the reranker, because it is the cheapest large win on the board. Only then tune pgvector recall, context length, and finally the generation model. Nine times out of ten you will have a working system before you ever touch the LLM. The model is the last knob to turn, not the first.
Which of these seven has bitten you hardest in the field? If it is one I have not listed, I want to hear it.