Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

Building a RAG Pipeline on VMware Private AI: 7 Failures That Quietly Break Retrieval (Private AI Series, Part 14)

Most RAG failures on VMware Private AI Foundation are not the LLM. Here are the seven pipeline failures that quietly wreck retrieval quality on PAIF 9, and how I fix each one in the field.

VMware Private AI Series · Part 14 of 24

TL;DR · Key Takeaways

  • On VMware Private AI Foundation, roughly four out of five bad RAG answers trace to ingestion, chunking and retrieval, not to the LLM. Stop tuning the prompt first.
  • The single highest-ROI fix is adding a NeMo Retriever reranking NIM. It reorders the top-k chunks so the best passage is not stranded at position 4.
  • Embedding asymmetry (query vs passage input_type) and dimension mismatches against pgvector are the two silent killers that pass every smoke test.
  • Build an eval set before you scale ingestion. Without retrieval metrics you cannot tell a chunking failure from a generation failure.

Who this is for: Architects and platform teams standing up RAG workloads on PAIF 9.0 / 9.1 with NeMo Retriever, NIM and pgvector.
Prerequisites: a GPU workload domain, model serving in place, and a pgvector instance on Data Services Manager.

A RAG demo that wows the steering committee and a RAG system that survives real users are two different animals. The demo runs against ten clean PDFs. Production runs against a decade of messy SharePoint exports, scanned contracts, and tables that mean nothing once they are flattened into prose. When answers go wrong, almost everyone reaches for the model first: swap it, tune the prompt, raise the temperature. That is the wrong end of the pipeline. On VMware Private AI Foundation with NVIDIA, the model is rarely the problem. Retrieval is.

Here are the seven failures I see most often when a PAIF RAG pipeline ships, mapped to the stage where each one actually lives, with the fix that moves the needle. Most of them never throw an error. They just quietly hand the LLM the wrong context and let it sound confident about it.

[Figure: Where RAG Breaks on Private AI. Most faults live left of the LLM, before a single token is generated.
Pipeline stages: Ingest -> Chunk -> Embed -> Index (pgvector) -> Retrieve -> Rerank -> LLM (NIM).
Failure points marked on the pipeline: #1 chunking, #2 no reranker, #3 embedding mismatch, #4 recall / HNSW params, #5 context stuffing, #6 throughput, #7 no eval.
The pattern: swapping the LLM fixes failure #5 at best. The other six are upstream, and they decide what the model ever gets to see.]
The seven failure points, mapped to the stage where each one originates.

1. Chunking is the bug you blame on the model

If you fix one thing, fix this. The bulk of RAG failures originate in ingestion and chunking, not generation. Fixed-size 512-token chunks with no overlap cut sentences in half, strip tables of their headers, and split a procedure across two chunks so neither one is retrievable on its own. The retriever then returns a fragment that is topically close but semantically useless, and the LLM does its best with garbage. In my experience, hierarchical or layout-aware chunking that respects document structure routinely lifts answer accuracy from the low 60s into the high 80s (percent) on the same corpus. Use NeMo Retriever ingestion (nv-ingest) for PDF and table extraction rather than a naive text splitter, and keep tables intact as their own units with the surrounding heading attached.

Validate before you scale: take twenty real questions, retrieve, and read the raw chunks that come back. If a human cannot answer the question from those chunks, no model will.
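
To make the difference from a naive splitter concrete, here is a minimal Python sketch of structure-aware chunking. It is not how nv-ingest works internally; it only illustrates the two rules above: prose is windowed with overlap, and a table stays whole with its nearest heading attached. The (kind, text) block format, the Chunk class and the word-based sizes are all assumptions made for this illustration (real pipelines count tokens, not words).

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest section heading, carried along for context
    kind: str     # "text" or "table"
    text: str


def chunk_blocks(blocks, max_words=120, overlap_words=20):
    """Turn (kind, text) blocks into chunks that respect document structure.

    kind is "heading", "table", or anything else (treated as prose).
    Tables are never split; prose is cut into overlapping word windows.
    """
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be >= 0 and < max_words")
    chunks, heading = [], ""
    step = max_words - overlap_words
    for kind, text in blocks:
        if kind == "heading":
            heading = text.strip()
        elif kind == "table":
            # Keep the table as one unit so headers and rows stay together.
            chunks.append(Chunk(heading, "table", text))
        else:
            words = text.split()
            for start in range(0, len(words), step):
                window = words[start:start + max_words]
                chunks.append(Chunk(heading, "text", " ".join(window)))
                if start + max_words >= len(words):
                    break  # this window already reached the end of the block
    return chunks
```

Each chunk carries its heading, so a table about license tokens is still retrievable by the section title even when the table cells themselves contain no matching words.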

2. No reranker, so the best chunk sits at position 4

Raw vector search sorts by embedding similarity, which is a blunt proxy for relevance. The chunk that actually answers the question often lands at rank 4 or 5, behind noisier passages that happen to sit closer in vector space. A reranker reads the query and each candidate chunk together and rescores them with a deeper cross-encoder, pushing the genuinely relevant passage to the top. On PAIF, deploy a NeMo Retriever reranking NIM (for example nvidia/llama-3.2-nv-rerankqa-1b-v2) and retrieve wide, then rerank narrow: pull the top 20 from pgvector, rerank, and pass the top 5 to the model.

Why I recommend it: it is the highest-ROI change in the whole pipeline and it touches nothing upstream. When I would not: a tiny corpus (a few hundred chunks) where retrieval is already near-perfect, or a latency budget so tight that the extra GPU hop is unacceptable. Validate first: the reranker is another model to serve, so confirm you have the vGPU headroom before you wire it in, or it becomes failure #6.

[Figure: What a Reranker Actually Does.
Vector search only (top-k by cosine): 1. close-but-wrong passage, 2. close-but-wrong passage, 3. partial match, 4. THE answer (stranded).
After NeMo Retriever rerank: 1. THE answer (promoted), 2. partial match, 3. close-but-wrong passage, 4. close-but-wrong passage.
Same retrieved set, reordered by a cross-encoder that reads the query and chunk together. The model now sees the right passage first.]
Retrieve wide, rerank narrow: the reranker promotes the stranded answer to the top of the context.
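
The retrieve-wide, rerank-narrow step can be sketched in a few lines of Python. This is an illustration under assumptions, not the NIM client: score_fn stands in for whatever call you make to the reranking NIM, and toy_overlap_score is a made-up word-overlap scorer that exists only so the example runs without a GPU. It is not a cross-encoder.

```python
def retrieve_wide_rerank_narrow(query, candidates, score_fn, keep=5):
    """Rescore an already-retrieved candidate list and keep the best few.

    candidates: chunk texts pulled wide from the vector store (e.g. top 20).
    score_fn(query, passage) -> float, higher means more relevant.
    Ties keep the original vector-search order.
    """
    scored = [(score_fn(query, text), idx, text)
              for idx, text in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]


def toy_overlap_score(query, passage):
    """Hypothetical stand-in scorer: share of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0
```

In a real pipeline you would replace toy_overlap_score with a function that sends the query and passages to the reranker and returns its relevance scores; the surrounding logic stays the same.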

3. Embedding mismatch: the silent corruption

This one passes every demo and rots in production. NeMo Retriever embedding models are asymmetric: passages are embedded with input_type: passage and queries with input_type: query. Embed both sides the same way and similarity scores drop just enough that retrieval looks plausible but lands on the wrong chunks. The second trap is re-embedding your corpus with a different model than the one serving live queries, or changing the model under an existing index. A dimension mismatch at least fails loudly against pgvector. A same-dimension model swap fails silently and you will chase it for days. Pin the embedding model and version, store it in the collection metadata, and reindex the entire corpus whenever it changes.

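# Queries and passages must use different input_type
curl -s http://embed-nim:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
  "input": ["how do I rotate the vGPU license token"],
  "input_type": "query"
}'
# At ingestion time, send the same request shape with "input_type": "passage"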

# pgvector: a dimension mismatch fails loudly (good)
# ERROR:  expected 2048 dimensions, not 768

# A same-dimension model swap under the index fails SILENTLY (bad)
# retrieval still returns rows, they are just wrong
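
One way to make the silent case loud is to record the embedding model with the collection and compare it before serving queries. The Python sketch below assumes you keep that record yourself: EmbeddingSpec, its version field and compatibility_problems are hypothetical names for this illustration, not part of NIM or pgvector. The request body mirrors the curl call above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str       # e.g. "nvidia/llama-3.2-nv-embedqa-1b-v2"
    version: str     # hypothetical tag you record yourself (image tag, date)
    dimensions: int  # vector width stored in pgvector


def build_embedding_request(spec, texts, role):
    """Build an embeddings request body; role must be 'query' or 'passage'."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return {"model": spec.model, "input": list(texts), "input_type": role}


def compatibility_problems(index_spec, live_spec):
    """List the reasons the live embedder must not query this index."""
    problems = []
    if index_spec.dimensions != live_spec.dimensions:
        problems.append("dimension mismatch (pgvector would reject this loudly)")
    if (index_spec.model, index_spec.version) != (live_spec.model, live_spec.version):
        problems.append("model or version changed: reindex before serving queries")
    return problems
```

Run the check at service start-up and refuse to serve if the list is non-empty. A same-dimension swap then fails at deploy time instead of in front of users.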

4. pgvector recall: the index defaults betray you

A pgvector instance with no vector index does a sequential scan and returns correct results, so it passes testing on 5,000 rows and falls over at five million. When teams finally add an index they often accept defaults and never tune the search beam. With HNSW, ef_search controls the recall-versus-latency trade: leave it low and the index quietly skips the chunk you needed. Build the HNSW index explicitly, size the Postgres instance on Data Services Manager for the working set plus index, and raise ef_search until recall plateaus. I cover the storage and sizing side of this in the dedicated pgvector part of the series.

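-- Build an HNSW index, then widen the search beam for recall
-- Note: pgvector indexes the vector type only up to 2,000 dimensions, so a
-- 2048-dimension embedding needs a halfvec expression index or a smaller
-- output dimension before this statement will succeed
CREATE INDEX ON kb_chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);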

SET hnsw.ef_search = 100;   -- raise recall at the cost of latency

-- Confirm the planner actually uses the index, not a seq scan
EXPLAIN ANALYZE
SELECT id FROM kb_chunks ORDER BY embedding <=> :q LIMIT 20;
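
"Raise ef_search until recall plateaus" is easier to act on with a small measurement loop. The Python sketch below assumes you have already collected, for a sample of queries, the exact top-k ids (for instance from a run with index scans disabled so Postgres does an exact scan) and the approximate top-k ids at each ef_search setting. The function names and the min_gain threshold are illustrative choices, not pgvector settings.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def pick_ef_search(recall_by_ef, min_gain=0.01):
    """Smallest ef_search beyond which the next step gains < min_gain recall.

    recall_by_ef: {ef_search value: mean recall@k you measured at that value}.
    """
    settings = sorted(recall_by_ef)
    if not settings:
        raise ValueError("recall_by_ef must not be empty")
    for current, nxt in zip(settings, settings[1:]):
        if recall_by_ef[nxt] - recall_by_ef[current] < min_gain:
            return current
    return settings[-1]  # recall was still climbing at the largest value tried
```

With made-up measurements such as {40: 0.82, 100: 0.95, 200: 0.985, 400: 0.99}, the function picks 200: going to 400 doubles the beam for half a point of recall. Check the latency at the chosen value before you commit to it.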

5. Context stuffing and lost-in-the-middle

A large context window is not an excuse to dump everything into it. Models recall information at the start and end of the context far better than the middle, so burying the key passage at position 3 of 8 hurts even when the window is nowhere near full. More retrieved text is not more signal; it is more noise the model has to ignore. Keep assembled context tight, usually well under 8K tokens, and let the reranker decide what earns a slot rather than padding to top-k. Shorter, precise context beats a wall of marginally relevant chunks almost every time. This is the one failure the LLM choice can partly mask, and even then only partly.
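
A minimal Python sketch of tight context assembly, under two assumptions that are mine rather than anything PAIF prescribes: tokens are approximated by whitespace-separated words unless you pass a real tokenizer as count_tokens, and the kept chunks are laid out "edges first" (strongest at the start, second strongest at the end), which is one common mitigation for lost-in-the-middle.

```python
def assemble_context(ranked_chunks, max_tokens=2000, count_tokens=None):
    """Keep best-first chunks within a token budget, then order edges-first.

    ranked_chunks: chunk texts sorted best-first (e.g. reranker output).
    count_tokens: optional callable text -> int; defaults to a crude
    whitespace word count, which only approximates real token counts.
    """
    count = count_tokens or (lambda text: len(text.split()))
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count(chunk)
        if used + cost > max_tokens:
            break  # budget reached: drop this chunk and everything ranked below
        kept.append(chunk)
        used += cost
    # Rank 1 goes first, rank 2 goes last, weaker chunks fill the middle.
    front, back = [], []
    for position, chunk in enumerate(kept):
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five kept chunks ranked r1 to r5 the resulting order is r1, r3, r5, r4, r2, so the two strongest passages sit where the model attends best.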

6. The embedding NIM is your real throughput ceiling

Everyone sizes GPUs for the chat model. Then the first full corpus load takes a weekend because the embedding NIM is single-instance and ingestion is serialized behind it. Embedding is throughput-bound and bursty: heavy during ingestion and reindex, light at query time. Give the embedding and reranking NIMs their own GPU allocation so a reindex does not starve live inference, and batch ingestion through nv-ingest rather than firing one document at a time. If reindexing the corpus is a multi-day event, you have undersized this layer, and that pain compounds every time the embedding model changes (see failure #3).
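
A back-of-envelope estimate tells you early whether a reindex is a coffee break or a weekend. The Python sketch below is an illustration only: the throughput figure is something you must measure on your own embedding NIM, and the estimate assumes throughput scales linearly with instance count, which real deployments only approximate.

```python
def batched(items, batch_size):
    """Yield consecutive batches so documents are not sent one at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def estimate_ingest_hours(n_chunks, chunks_per_second_per_instance, instances=1):
    """Rough wall-clock hours to embed a corpus, assuming linear scaling."""
    if chunks_per_second_per_instance <= 0 or instances < 1:
        raise ValueError("throughput must be > 0 and instances >= 1")
    seconds = n_chunks / (chunks_per_second_per_instance * instances)
    return seconds / 3600
```

With hypothetical numbers, five million chunks at a measured 50 chunks per second on one instance is about 28 hours; four instances bring that to roughly 7 hours under the linear-scaling assumption. If the single-instance figure is already more than a day, plan the dedicated GPU allocation before the first full load.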


7. No eval harness, so you are flying blind

Without retrieval metrics, every bad answer is a guess. Was the right chunk retrieved and the model ignored it, or was it never retrieved at all? You cannot tell by reading outputs. Build a golden set of question-and-answer pairs with the known-correct source chunk, and measure retrieval (recall@k, whether the right chunk made the context) separately from generation (faithfulness, whether the answer is grounded in that context). With those two numbers you can localize any regression in minutes instead of swapping models on a hunch. This is unglamorous and it is the difference between a RAG system you can operate and one you can only pray to.

[Figure: Triage: Retrieval or Generation?
Answer is wrong -> Was the right chunk in the context?
NO -> FIX RETRIEVAL: chunking, embedding type, HNSW ef_search, add reranker.
YES -> FIX GENERATION: trim context, fix prompt, reorder, then consider model.]
One question localizes most RAG bugs. Answer it with metrics, not vibes.
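
A rough retrieval harness fits in a page of Python. The sketch below assumes a golden set shaped as dictionaries with question and source_chunk_id keys and a retrieve(question, k) callable that returns chunk ids; both shapes are hypothetical. It reports recall@k as described above, plus mean reciprocal rank, which I have added here because it shows how far down the list the right chunk lands. Faithfulness scoring of the generated answer is out of scope for this sketch.

```python
def evaluate_retrieval(golden, retrieve, k=5):
    """Score a retriever against a golden set.

    golden: list of {"question": str, "source_chunk_id": id}.
    retrieve(question, k) -> iterable of chunk ids, best first.
    """
    hits, reciprocal_ranks, misses = 0, [], []
    for item in golden:
        ids = list(retrieve(item["question"], k))[:k]
        target = item["source_chunk_id"]
        if target in ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (ids.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)
            misses.append(item["question"])
    n = len(golden)
    return {
        "recall_at_k": hits / n if n else 0.0,
        "mrr": sum(reciprocal_ranks) / n if n else 0.0,
        "misses": misses,  # questions whose source chunk never made the top-k
    }


def triage(right_chunk_in_context):
    """The one-question triage: which half of the pipeline to fix first."""
    return "fix generation" if right_chunk_in_context else "fix retrieval"
```

Every question in misses is a retrieval bug by definition, so start there. A question that is a hit but still produces a wrong answer belongs to the generation side.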

The failure-to-fix matrix

Symptom | Likely cause | Fix
Confident but wrong answers | Fragmented chunks, tables flattened | Layout-aware chunking via nv-ingest
Right doc exists, never surfaces | Best chunk stranded below top-k | Add a NeMo Retriever reranker
Plausible but subtly off results | Query/passage input_type mismatch | Set input_type correctly, reindex
Fine at test scale, bad in prod | No index or ef_search too low | Build HNSW, raise ef_search
Answer ignores a chunk that is present | Lost-in-the-middle, context too long | Trim context under 8K, rerank first
Ingestion takes days | Embedding NIM undersized | Dedicate GPU, batch via nv-ingest
Cannot explain regressions | No retrieval vs generation metrics | Golden set, measure recall@k + faithfulness

For the layers underneath this pipeline, see the related parts of this series: pgvector on Data Services Manager for the vector store design, NVIDIA NIM microservices for the model-serving layer, and the Model Store and Model Runtime for how the models get deployed in the first place.

Disclaimer: Reindexing and index changes are corpus-wide operations. Validate the embedding model and version against your target BOM, check NeMo Retriever and NIM interoperability, back up the pgvector instance, run prechecks, and test on a copy of the collection before you reindex production.

What I’d Do

Build in this order, and resist the urge to start with the model. First, an eval set, even a rough one, so you can measure anything at all. Second, layout-aware ingestion, because no downstream fix survives bad chunks. Third, the reranker, because it is the cheapest large win on the board. Only then tune pgvector recall, context length, and finally the generation model. Nine times out of ten you will have a working system before you ever touch the LLM. The model is the last knob to turn, not the first.

Which of these seven has bitten you hardest in the field? If it is one I have not listed, I want to hear it.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
