Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

FORMERLY
VMware Insight & Cloud Pathshala
What began over a decade ago as a passion for sharing knowledge has evolved into a unified platform for Enterprise AI, VMware, Cloud Architecture, Research, and Modern Infrastructure.
,

NVIDIA NeMo Retriever: RAG with Embeddings, Reranking and Guardrails (NVIDIA AI Series, Part 27)

How NVIDIA’s NeMo Retriever builds enterprise RAG: extraction, embedding and reranking NIMs, the open Nemotron Retriever models, and NeMo Guardrails, plus the retrieval failures they quietly fix.

NVIDIA AI Series · Part 27 of 30

TL;DR

NVIDIA's RAG stack is NeMo Retriever: a set of NIM microservices plus open Nemotron Retriever models that turn documents into a knowledge layer an LLM can ground on. The pipeline has five jobs, not one: extract content from messy documents (text, tables, charts), embed it, store the vectors, rerank the top hits with a citation-aware cross-encoder, and run a guardrail check on the way in and out. The reranker is the part most teams skip and the part that fixes the most retrieval failures, because a fast embedding search gets you a rough top 50 and the reranker reorders it into a precise top 5. NeMo Guardrails sits around the whole thing to filter unsafe or off-topic queries. Use the full NVIDIA stack when you want every step self-hosted under one license and accuracy that holds up in production. Roll your own only if you already operate an embedding and rerank pipeline you trust.

Who this is for: AI-infrastructure architects and platform engineers building retrieval-augmented generation on NVIDIA GPUs. Prerequisites: the serving model from Part 16 (NIM) and the models from Part 26 (Nemotron). This part is the retrieval layer that feeds the model.

A RAG system rarely fails loudly. It returns an answer, the answer reads fluently, and it is wrong because the retriever handed the model the wrong three chunks. No exception, no error log, just a confident hallucination grounded in irrelevant context. That failure mode is why retrieval quality, not model size, is usually the thing standing between a demo and a system you can put in front of users. NeMo Retriever is NVIDIA's answer to that problem, and it is built around the steps most homegrown pipelines get wrong.

The pipeline NVIDIA actually ships

NeMo Retriever is not a single model. It is an agent-ready stack: the open Nemotron Retriever models, a set of NIM microservices, and the glue to wire them into a retrieval pipeline. The containers ship with TensorRT-LLM-compiled kernels so the embedding and rerank steps run fast on NVIDIA GPUs. A query passes through five stages, and each one is a place where retrieval either holds up or quietly breaks.

Ingestion happens once, ahead of time: extraction NIMs pull text, tables, and chart data out of documents, an embedding NIM turns each chunk into a vector, and the vectors land in a vector database. At query time, the question is embedded, the vector store returns a rough candidate set, a reranking NIM reorders it for precision, and a guardrail check runs on the query and the final answer. Only then does the LLM see anything.

The NeMo Retriever RAG pathIngest once (top), retrieve per query (bottom)Extraction NIMtext/table/chartEmbedding NIMchunk to vectorVector DBstores vectorsQueryembed itVector searchrough top 50Reranking NIMprecise top 5Guardrailssafety checkLLMgrounded answer
The dashed line is the vector store feeding query-time search. Extraction and embedding are one-time ingest cost; search, rerank, and guardrails run on every query.

Embeddings and reranking are two different jobs

The single most common design mistake in RAG is treating retrieval as one step. It is two, and they have opposite trade-offs.

The embedding NIM: fast and approximate

An embedding model maps each chunk and each query into a dense vector, and similarity in that space approximates relevance. This is a bi-encoder: the document and the query are embedded independently, so you can embed your whole corpus once and reuse it. That is what makes it fast enough to search millions of chunks in milliseconds. The NeMo Retriever embedding NIM ships multilingual models covering many languages, so a query in one language can retrieve documents in another. The catch is that independent encoding is approximate: it gets you a good rough set, not a precise ranking.

The reranking NIM: slow and precise

A reranker is a cross-encoder: it looks at the query and a candidate passage together and scores how well that specific passage answers that specific query. That joint attention is far more accurate than vector similarity, but it cannot be precomputed, so you only run it on the top candidates the embedding step already found, not the whole corpus. NeMo Retriever's reranker is citation-aware, which matters when you need the answer to point at the exact source passage. The pattern is retrieve-then-rerank: embed to get a cheap top 50, rerank to get a trustworthy top 5.

Retrieve then rerankCheap and wide, then expensive and narrowMillions of chunkscorpusEmbedding searchtop 50, fastRerankertop 5, precisebi-encodercross-encoderWidth falls by orders of magnitude at each stage; cost per item rises in step.
Skipping the rerank stage is the most common reason a RAG system retrieves plausible-but-wrong context. The embedding step alone is not precise enough.
PropertyEmbedding NIMReranking NIM
Model typeBi-encoderCross-encoder
PrecomputableYes, embed corpus onceNo, per query and passage
Runs overThe whole corpusOnly the candidate set
StrengthSpeed and recallPrecision and ordering
Job in the pipelineFind candidatesChoose the final few
In practice: ingestion quality decides everything downstream. If your extraction step flattens a table into a wall of numbers or drops the chart entirely, no embedding or reranker can recover the meaning. The extraction NIMs for tables and charts exist precisely because real enterprise documents are not clean paragraphs. Budget as much attention for ingestion as for the query path.

Guardrails around the pipeline

NeMo Guardrails is the optional layer that filters or reshapes a query for safety and compliance before it enters retrieval, and checks the response before it reaches the user. It supports input rails (block or rewrite unsafe or off-topic queries, catch jailbreak attempts), topic control (keep an enterprise assistant on its intended subject), and output rails (screen the generated answer). In a RAG context this matters twice: a malicious query can try to pull restricted documents into context, and a generated answer can leak something it should not. Guardrails is policy you can run in-line as its own NIM rather than prompt text you hope the model honors.

Be clear-eyed about what it does and does not do. Guardrails reduces the rate of unsafe or off-topic completions; it does not make a system safe by itself, and an over-aggressive rail will block legitimate queries and frustrate users. Treat it as one defensive layer with measurable false-positive and false-negative rates, not a guarantee.

Guardrails on both endsCheck the query in, check the answer outUser queryInput railsafety / topicRetrieve + LLMthe pipelineOutput railscreen answerUserblocks jailbreaksblocks leaks
Rails run as their own service, not as prompt instructions. That separation is what makes the policy auditable and consistent across models.

Calling the services

Both the embedding and reranking NIMs expose simple HTTP endpoints, so wiring them into any framework is a couple of calls. The artifact below embeds a passage and then reranks a candidate set against a query.

# 1) embed text with the embedding NIM
curl -s http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-retriever-embedding","input":["How do I rotate a GPU driver?"],"input_type":"query"}'
# expected: data[0].embedding -> a float vector

# 2) rerank candidates with the reranking NIM
curl -s http://localhost:8001/v1/ranking \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-retriever-reranking","query":{"text":"How do I rotate a GPU driver?"},"passages":[{"text":"Use the GPU Operator driver upgrade flow..."},{"text":"Billing is configured under..."}]}'
# expected: rankings ordered by relevance, billing passage scored low

The failure mode to watch is a mismatched input_type. Embedding models distinguish query text from passage text, and embedding a query as if it were a passage quietly degrades recall without throwing an error. Confirm the model names and the input_type values against the NIM documentation for your version. [VERIFY exact model identifiers, endpoint paths and the input_type field names against the current NeMo Retriever NIM docs before wiring them in.]

Worked example

Take a knowledge base of 2 million chunks and a query where the right passage sits at rank 23 by pure embedding similarity. If you pass only the top 5 embedding hits to the model, that correct passage never reaches it, and the answer is grounded on the five near-misses above it. Recall at 5 fails.

Now retrieve the top 50 by embedding and rerank them. The cross-encoder reads each of the 50 against the query and lifts the true passage from rank 23 to rank 2, so it lands in the top 5 you actually feed the model. You ran the expensive cross-encoder on 50 passages, not 2 million, and turned a wrong answer into a right one for a few milliseconds of extra latency. That is the entire economic case for reranking.

Where RAG quietly breaksThe fix in the NVIDIA stack
Tables and charts lost at ingestUse the table and chart extraction NIMs
Plausible-but-wrong top 5Add the reranking NIM (retrieve-then-rerank)
Query embedded as a passageSet the correct input_type on the embedding call
Cross-language missesUse the multilingual embedding model
Unsafe or off-topic queriesAdd NeMo Guardrails input and output rails
Gotcha: chunking is the silent killer. If you split documents so a fact and the context that qualifies it land in separate chunks, retrieval can return one without the other and the model answers confidently from half the picture. No NIM fixes a bad chunking strategy. Decide chunk size and overlap against your real documents and measure retrieval quality before you blame the model.

Measure it or you are guessing

The single biggest difference between a RAG system that works and one that embarrasses you in production is whether anyone measured retrieval. You cannot eyeball it. The pipeline returns fluent answers either way, so the only way to know if the right context is reaching the model is to evaluate the retrieval step directly, separate from the generation step.

Two families of metric matter. Retrieval metrics ask whether the correct passage was found and ranked highly: recall at k tells you if the right chunk made the cut, and a rank-aware score like nDCG tells you whether it landed near the top where the model will actually weight it. Grounding metrics ask whether the generated answer is supported by the retrieved context rather than invented, which is where the citation-aware reranker earns its place because it makes that support traceable. Build a small labeled set of real queries with known-correct passages, run it on every change to chunking, embedding model, or rerank depth, and watch the numbers move. A reranker that lifts recall at 5 from 0.7 to 0.92 on your own data is worth far more than a model upgrade, and you will only know it happened if you were measuring. Treat the evaluation set as production infrastructure, not a one-time exercise, because your documents and your queries drift.

What I would actually choose

My recommendation: use the full NeMo Retriever stack, embedding plus reranking plus extraction, and add Guardrails, rather than assembling open-source pieces yourself. Why: the components are tuned to work together, ship GPU-optimized as NIMs, and keep the whole retrieval path under one self-hostable license alongside the Nemotron models from the last part, so nothing in the chain depends on an external API. When it is not the right call: if you already operate a mature embedding and rerank pipeline that you trust and have evaluated, swapping it for parity is not worth the migration. What to validate first: retrieval quality on your own documents and queries, measured with recall and answer-grounding metrics, because a RAG system that is not measured is a RAG system that is quietly wrong. For the VMware-hosted version of this exact pipeline, see the Private AI RAG pipeline walkthrough.

The Verdict

RAG is not a model problem, it is a retrieval problem, and NeMo Retriever is built around the steps that actually decide whether retrieval works: clean extraction, fast embedding, precise reranking, and a guardrail on each end. The reranker is the highest-impact component and the one most homegrown pipelines omit. Run the full stack, measure recall and grounding on your own corpus, and treat Guardrails as one defensive layer rather than a safety guarantee. If you are debugging a RAG system that gives confident wrong answers, add the reranker and fix your chunking before you reach for a bigger model.

Next we move up the stack to agents: NVIDIA Blueprints and AI-Q, where the retrieval pipeline you just built becomes a tool an autonomous agent calls. Bring your measured RAG metrics into that work, because an agent is only as reliable as the retrieval underneath it.

My take: if I am handed a struggling RAG system, I look at three things in order before touching the model. First, ingestion: are tables and charts actually being captured, or silently dropped. Second, the rerank stage: is there one at all, and is the candidate depth high enough that the right passage can be recovered. Third, chunking: do facts and their qualifying context survive in the same chunk. In my experience the model is almost never the problem. The retrieval path is, and it is fixable with the components in this part once you are measuring the right numbers.
NVIDIA AI Series · Part 27 of 30
« Previous: Part 26  |  NVIDIA AI Guide  |  Next: Part 28 »

References

NVIDIA NeMo Retriever (developer overview)
NeMo Retriever Text Reranking NIM documentation
NVIDIA RAG Blueprint (GitHub)

About The Author


Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

Architect’s Toolkit

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

You May Have Missed

Discover more from Dr. Pranay Jha

Subscribe now to keep reading and get access to the full archive.

Continue reading