TL;DR
NVIDIA's RAG stack is NeMo Retriever: a set of NIM microservices plus open Nemotron Retriever models that turn documents into a knowledge layer an LLM can ground on. The pipeline has five jobs, not one: extract content from messy documents (text, tables, charts), embed it, store the vectors, rerank the top hits with a citation-aware cross-encoder, and run a guardrail check on the way in and out. The reranker is the part most teams skip and the part that fixes the most retrieval failures, because a fast embedding search gets you a rough top 50 and the reranker reorders it into a precise top 5. NeMo Guardrails sits around the whole thing to filter unsafe or off-topic queries. Use the full NVIDIA stack when you want every step self-hosted under one license and accuracy that holds up in production. Roll your own only if you already operate an embedding and rerank pipeline you trust.
A RAG system rarely fails loudly. It returns an answer, the answer reads fluently, and it is wrong because the retriever handed the model the wrong three chunks. No exception, no error log, just a confident hallucination grounded in irrelevant context. That failure mode is why retrieval quality, not model size, is usually the thing standing between a demo and a system you can put in front of users. NeMo Retriever is NVIDIA's answer to that problem, and it is built around the steps most homegrown pipelines get wrong.
The pipeline NVIDIA actually ships
NeMo Retriever is not a single model. It is an agent-ready stack: the open Nemotron Retriever models, a set of NIM microservices, and the glue to wire them into a retrieval pipeline. The containers ship with TensorRT-LLM-compiled kernels so the embedding and rerank steps run fast on NVIDIA GPUs. The pipeline has five stages, and each one is a place where retrieval either holds up or quietly breaks.
Ingestion happens once, ahead of time: extraction NIMs pull text, tables, and chart data out of documents, an embedding NIM turns each chunk into a vector, and the vectors land in a vector database. At query time, an input guardrail screens the question, the question is embedded, the vector store returns a rough candidate set, and a reranking NIM reorders it for precision. Only then does the LLM see anything, and an output guardrail checks its answer before it reaches the user.
Embeddings and reranking are two different jobs
The single most common design mistake in RAG is treating retrieval as one step. It is two, and they have opposite trade-offs.
The embedding NIM: fast and approximate
An embedding model maps each chunk and each query into a dense vector, and similarity in that space approximates relevance. This is a bi-encoder: the document and the query are embedded independently, so you can embed your whole corpus once and reuse it. Paired with a vector index, that is what makes it fast enough to search millions of chunks in milliseconds. The NeMo Retriever embedding NIM ships multilingual models, so a query in one language can retrieve documents in another. The catch is that independent encoding is approximate: it gets you a good rough set, not a precise ranking.
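To make "similarity in that space" concrete, here is a minimal sketch of brute-force nearest-neighbor search over precomputed vectors. The three-dimensional vectors and chunk ids are made up for illustration; a real deployment would get vectors from the embedding NIM and use a vector database rather than a Python dict.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k):
    """Return the ids of the k corpus vectors most similar to the query."""
    scored = sorted(
        corpus_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Hypothetical hand-written vectors standing in for chunk embeddings
# that were computed once at ingest time.
corpus_vecs = {
    "driver-upgrade": [0.9, 0.1, 0.0],
    "billing": [0.0, 0.2, 0.9],
    "gpu-operator": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # hypothetical query embedding

nearest = top_k(query_vec, corpus_vecs, k=2)
print(nearest)
```

The corpus vectors never depend on the query, which is the property that lets you compute them once and reuse them for every search.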
The reranking NIM: slow and precise
A reranker is a cross-encoder: it looks at the query and a candidate passage together and scores how well that specific passage answers that specific query. That joint attention is far more accurate than vector similarity, but it cannot be precomputed, so you only run it on the top candidates the embedding step already found, not the whole corpus. NeMo Retriever's reranker is citation-aware, which matters when you need the answer to point at the exact source passage. The pattern is retrieve-then-rerank: embed to get a cheap top 50, rerank to get a trustworthy top 5.
| Property | Embedding NIM | Reranking NIM |
|---|---|---|
| Model type | Bi-encoder | Cross-encoder |
| Precomputable | Yes, embed corpus once | No, per query and passage |
| Runs over | The whole corpus | Only the candidate set |
| Strength | Speed and recall | Precision and ordering |
| Job in the pipeline | Find candidates | Choose the final few |
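The retrieve-then-rerank pattern can be sketched in a few lines. This is a toy under stated assumptions: cheap_score and precise_score are hypothetical word-overlap stand-ins for the embedding search and the cross-encoder, not the NIMs themselves, and the corpus is four made-up passages. What it shows is the shape of the pipeline: the cheap scorer sees everything, the precise scorer sees only the shortlist.

```python
def tokens(text):
    return text.lower().split()

def cheap_score(query, passage):
    """Stand-in for embedding similarity: counts query-word occurrences, duplicates included."""
    wanted = set(tokens(query))
    return sum(1 for tok in tokens(passage) if tok in wanted)

precise_calls = 0  # how many times the expensive scorer actually ran

def precise_score(query, passage):
    """Stand-in for a cross-encoder: fraction of distinct query words the passage covers."""
    global precise_calls
    precise_calls += 1
    wanted = set(tokens(query))
    return len(wanted & set(tokens(passage))) / len(wanted)

def retrieve_then_rerank(query, corpus, candidates, final):
    """Cheap scorer over the whole corpus, precise scorer over the shortlist only."""
    shortlist = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:candidates]
    reranked = sorted(shortlist, key=lambda p: precise_score(query, p), reverse=True)
    return reranked[:final]

corpus = [
    "gpu gpu gpu pricing and billing for driver support plans",
    "how to rotate the gpu driver with the operator upgrade flow",
    "billing is configured under account settings",
    "driver license renewal for fleet vehicles",
]
query = "rotate gpu driver"

cheap_only = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:1]
best = retrieve_then_rerank(query, corpus, candidates=3, final=1)
print(cheap_only)     # the keyword-heavy billing passage wins on the cheap score
print(best)           # the passage that covers the whole query wins after reranking
print(precise_calls)  # the expensive scorer ran on the shortlist, not the full corpus
```

The numbers are tiny on purpose. In the article's terms, candidates would be around 50 and final around 5.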
Guardrails around the pipeline
NeMo Guardrails is the optional layer that filters or reshapes a query for safety and compliance before it enters retrieval, and checks the response before it reaches the user. It supports input rails (block or rewrite unsafe or off-topic queries, catch jailbreak attempts), topic control (keep an enterprise assistant on its intended subject), and output rails (screen the generated answer). In a RAG context this matters twice: a malicious query can try to pull restricted documents into context, and a generated answer can leak something it should not. Guardrails is policy you can run in-line as its own NIM rather than prompt text you hope the model honors.
Be clear-eyed about what it does and does not do. Guardrails reduces the rate of unsafe or off-topic completions; it does not make a system safe by itself, and an over-aggressive rail will block legitimate queries and frustrate users. Treat it as one defensive layer with measurable false-positive and false-negative rates, not a guarantee.
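Measuring those two rates does not need anything from the Guardrails API itself. A minimal sketch, assuming you have a hand-labeled set of queries and have recorded whether the rail blocked each one; the labeled list below is hypothetical data for illustration.

```python
def rail_error_rates(examples):
    """Compute (false_positive_rate, false_negative_rate) for a rail.

    examples: list of (should_block, was_blocked) boolean pairs.
    """
    legit = [blocked for should, blocked in examples if not should]
    unsafe = [blocked for should, blocked in examples if should]
    if not legit or not unsafe:
        raise ValueError("need at least one legitimate and one unsafe example")
    false_positive_rate = sum(legit) / len(legit)                  # legitimate queries blocked
    false_negative_rate = sum(not b for b in unsafe) / len(unsafe)  # unsafe queries let through
    return false_positive_rate, false_negative_rate

# Hypothetical labels: four legitimate queries (one wrongly blocked),
# two unsafe queries (one wrongly allowed).
labeled = [
    (False, False), (False, False), (False, True), (False, False),
    (True, True), (True, False),
]
fpr, fnr = rail_error_rates(labeled)
print(fpr, fnr)
```

Tracking both numbers together is the point: tightening a rail to push the false-negative rate down usually pushes the false-positive rate up.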
Calling the services
Both the embedding and reranking NIMs expose simple HTTP endpoints, so wiring them into any framework is a couple of calls. The artifact below embeds a query and then reranks a candidate set against it.
# 1) embed text with the embedding NIM
curl -s http://localhost:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"nvidia/nemotron-retriever-embedding","input":["How do I rotate a GPU driver?"],"input_type":"query"}'
# expected: data[0].embedding -> a float vector
# 2) rerank candidates with the reranking NIM
curl -s http://localhost:8001/v1/ranking \
-H 'Content-Type: application/json' \
-d '{"model":"nvidia/nemotron-retriever-reranking","query":{"text":"How do I rotate a GPU driver?"},"passages":[{"text":"Use the GPU Operator driver upgrade flow..."},{"text":"Billing is configured under..."}]}'
# expected: rankings ordered by relevance, billing passage scored low
The failure mode to watch is a mismatched input_type. Embedding models distinguish query text from passage text, and embedding a query as if it were a passage quietly degrades recall without throwing an error. Confirm the model names, endpoint paths, and input_type values against the NeMo Retriever NIM documentation for your version before wiring them in.
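One way to make the input_type mistake harder to commit is to stop passing it as a free-form string at call sites. The sketch below builds the embedding request body only; it does not send anything. The JSON field names and the model name are copied from the curl example above, and the "passage" value is an assumption, so check all of them against the documentation for your NIM version.

```python
import json

# Assumed set of valid values; "query" comes from the curl example,
# "passage" is an assumption to confirm against your NIM version.
ALLOWED_INPUT_TYPES = {"query", "passage"}

# Model name copied from the article's example, not verified.
EMBED_MODEL = "nvidia/nemotron-retriever-embedding"

def embedding_payload(texts, input_type, model=EMBED_MODEL):
    """Build the JSON body for an embeddings request, rejecting unknown input types."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"unexpected input_type: {input_type!r}")
    return {"model": model, "input": list(texts), "input_type": input_type}

def query_payload(question):
    """Use at query time."""
    return embedding_payload([question], "query")

def passage_payload(chunks):
    """Use at ingest time."""
    return embedding_payload(chunks, "passage")

body = query_payload("How do I rotate a GPU driver?")
print(json.dumps(body))
```

With two named helpers, the ingest code and the query code each call the function that matches their job, and a typo in the type string fails loudly instead of silently degrading recall.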
Worked example
Take a knowledge base of 2 million chunks and a query where the right passage sits at rank 23 by pure embedding similarity. If you pass only the top 5 embedding hits to the model, that correct passage never reaches it, and the answer is grounded on the five near-misses above it. Recall at 5 fails.
Now retrieve the top 50 by embedding and rerank them. The cross-encoder reads each of the 50 against the query and lifts the true passage from rank 23 to rank 2, so it lands in the top 5 you actually feed the model. You ran the expensive cross-encoder on 50 passages, not 2 million, and turned a wrong answer into a right one for a small, bounded amount of extra latency. That is the entire economic case for reranking.
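The arithmetic behind that example is short enough to write down. This sketch only restates the numbers from the two paragraphs above; nothing here is measured.

```python
corpus_size = 2_000_000   # chunks in the knowledge base
candidates = 50           # embedding hits passed to the reranker
final_k = 5               # passages actually fed to the model
embedding_rank = 23       # where the right passage sits by embedding similarity
reranked_rank = 2         # where the reranker moves it

# Without reranking, only the top final_k embedding hits reach the model.
reaches_model_without_rerank = embedding_rank <= final_k

# With reranking, the passage must first survive into the candidate set,
# then land inside the top final_k after reordering.
in_candidate_set = embedding_rank <= candidates
reaches_model_with_rerank = in_candidate_set and reranked_rank <= final_k

# How many times fewer cross-encoder passes than scoring the whole corpus.
cross_encoder_reduction = corpus_size // candidates

print(reaches_model_without_rerank, reaches_model_with_rerank, cross_encoder_reduction)
```

Note the dependency on candidate depth: had the right passage sat at rank 60, a top-50 shortlist would have dropped it and no reranker could bring it back.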
| Where RAG quietly breaks | The fix in the NVIDIA stack |
|---|---|
| Tables and charts lost at ingest | Use the table and chart extraction NIMs |
| Plausible-but-wrong top 5 | Add the reranking NIM (retrieve-then-rerank) |
| Query embedded as a passage | Set the correct input_type on the embedding call |
| Cross-language misses | Use the multilingual embedding model |
| Unsafe or off-topic queries | Add NeMo Guardrails input and output rails |
Measure it or you are guessing
The single biggest difference between a RAG system that works and one that embarrasses you in production is whether anyone measured retrieval. You cannot eyeball it. The pipeline returns fluent answers either way, so the only way to know if the right context is reaching the model is to evaluate the retrieval step directly, separate from the generation step.
Two families of metric matter. Retrieval metrics ask whether the correct passage was found and ranked highly: recall at k tells you if the right chunk made the cut, and a rank-aware score like nDCG tells you whether it landed near the top where the model will actually weight it. Grounding metrics ask whether the generated answer is supported by the retrieved context rather than invented, which is where the citation-aware reranker earns its place because it makes that support traceable. Build a small labeled set of real queries with known-correct passages, run it on every change to chunking, embedding model, or rerank depth, and watch the numbers move. A reranker that lifts recall at 5 from 0.7 to 0.92 on your own data is worth far more than a model upgrade, and you will only know it happened if you were measuring. Treat the evaluation set as production infrastructure, not a one-time exercise, because your documents and your queries drift.
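Both retrieval metrics are small enough to implement yourself. The sketch below uses binary relevance (a chunk is either a known-correct passage or it is not); the chunk ids and the before/after rankings are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the known-correct chunks that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: rewards correct chunks more the nearer they are to the top."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, chunk_id in enumerate(ranked_ids[:k])
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# One labeled query: "gold" is the known-correct chunk.
relevant = {"gold"}
before_rerank = ["c1", "c2", "c3", "c4", "c5", "gold"]
after_rerank = ["c1", "gold", "c2", "c3", "c4", "c5"]

recall_before = recall_at_k(before_rerank, relevant, 5)
recall_after = recall_at_k(after_rerank, relevant, 5)
ndcg_after = ndcg_at_k(after_rerank, relevant, 5)
print(recall_before, recall_after, round(ndcg_after, 3))
```

Averaging these per-query scores over the whole labeled set gives the single numbers you track across changes to chunking, embedding model, or rerank depth.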
What I would actually choose
My recommendation: use the full NeMo Retriever stack, embedding plus reranking plus extraction, and add Guardrails, rather than assembling open-source pieces yourself. Why: the components are tuned to work together, ship GPU-optimized as NIMs, and keep the whole retrieval path under one self-hostable license alongside the Nemotron models from the last part, so nothing in the chain depends on an external API. When it is not the right call: if you already operate a mature embedding and rerank pipeline that you trust and have evaluated, swapping it for parity is not worth the migration. What to validate first: retrieval quality on your own documents and queries, measured with recall and answer-grounding metrics, because a RAG system that is not measured is a RAG system that is quietly wrong. For the VMware-hosted version of this exact pipeline, see the Private AI RAG pipeline walkthrough.
The Verdict
RAG is not a model problem, it is a retrieval problem, and NeMo Retriever is built around the steps that actually decide whether retrieval works: clean extraction, fast embedding, precise reranking, and a guardrail on each end. The reranker is the highest-impact component and the one most homegrown pipelines omit. Run the full stack, measure recall and grounding on your own corpus, and treat Guardrails as one defensive layer rather than a safety guarantee. If you are debugging a RAG system that gives confident wrong answers, add the reranker and fix your chunking before you reach for a bigger model.
Next we move up the stack to agents: NVIDIA Blueprints and AI-Q, where the retrieval pipeline you just built becomes a tool an autonomous agent calls. Bring your measured RAG metrics into that work, because an agent is only as reliable as the retrieval underneath it.
References
NVIDIA NeMo Retriever (developer overview)
NeMo Retriever Text Reranking NIM documentation
NVIDIA RAG Blueprint (GitHub)



