TL;DR
NVIDIA Blueprints are reference workflows: deployable reference architectures that wire NIM, NeMo Retriever, and the NeMo Agent Toolkit into a working application you can fork instead of building from a blank page. AI-Q is the flagship agentic blueprint, an enterprise research agent that classifies intent, then runs either shallow tool-calling for quick cited answers or deep multi-step planning for long-form reports, all grounded in your data. Underneath sits the NeMo Agent Toolkit, a framework-agnostic library that wraps LangChain, LlamaIndex, CrewAI, and plain Python agents alike, and adds the two things production agents actually need: a profiler that traces tokens and latency down to each tool call, and observability through OpenTelemetry and tools like Phoenix and Langfuse. Start from a blueprint when one fits your use case; the value is the wiring and the evaluation suite, not the novelty. And instrument agents before you trust them, because an unprofiled agent is a cost and reliability risk you cannot see.
Every vendor will tell you agents are easy now. The demo always works. Then you ship it, and a single user question fans out into nine model calls and four tool invocations, the token bill is ten times what you projected, one of the agents quietly loops, and you have no idea which step caused the latency spike because nothing is instrumented. The hard part of agentic AI is not getting an agent to run. It is running one you can see into, afford, and trust. NVIDIA's answer comes in two pieces: Blueprints that give you a working starting point, and the NeMo Agent Toolkit that makes the running system observable.
What a Blueprint actually is
An NVIDIA Blueprint is a reference workflow: an open, deployable example that composes the building blocks from earlier in this series into a complete application for a common use case. It is not a product you buy and it is not a black box. It is reference code and a reference architecture, published on build.nvidia.com and GitHub, that you fork and adapt. The point is to skip the blank-page problem. Instead of deciding from scratch how a NIM, a retriever, a vector store, and an agent framework fit together, you start from a version that already works and change what your use case needs.
Most blueprints share the same three ingredients you have already met: performance-tuned NIM microservices for the models, NeMo Retriever microservices for grounding, and the NeMo Agent Toolkit for orchestration. What changes between blueprints is the arrangement and the application logic on top. A RAG blueprint wires them one way; the AI-Q research agent wires them another. Because the parts are the same NIMs you would deploy anyway, a blueprint is a head start, not a lock-in.
AI-Q: the enterprise research agent
AI-Q is the flagship agentic blueprint, an open reference example for an agent that connects to your enterprise data, reasons over it, and returns trusted, cited insights. It is built on LangChain DeepAgents and accelerated by the NeMo Agent Toolkit, and it ships with benchmarks and an evaluation suite so you can measure quality rather than guess at it. The design is worth studying even if you never deploy it verbatim, because it encodes a pattern most production agents need.
At the front is an orchestration node that classifies intent. A quick factual question takes the shallow path: bounded tool-calling that returns a fast answer with source citations. A request for a thorough report takes the deep path: long-running, multi-step planning that decomposes the task, gathers evidence across many retrieval calls, and assembles a long-form, citation-backed document. The split matters because using the heavy deep-research machinery for every query is the agentic version of the thinking tax, slow and expensive for questions that did not need it. Routing by intent is the same cost discipline as the model routing from Part 26, applied one level up.
| Property | Shallow research | Deep research |
|---|---|---|
| Trigger | Quick factual question | Report-style request |
| Steps | Bounded, few tool calls | Many, multi-step planned |
| Latency | Fast | Long-running |
| Output | Short cited answer | Long-form cited report |
| Cost per query | Low | High, use deliberately |
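The routing split is easy to sketch in Python. This is a minimal illustration, not AI-Q's implementation: the keyword cues, the `Route` type, and the step budgets are all hypothetical stand-ins, and a real orchestration node would more likely classify intent with a model call than with string matching.

```python
from dataclasses import dataclass

# Hypothetical cue words; AI-Q's actual classifier is not shown in this article.
DEEP_CUES = ("report", "in-depth", "comprehensive", "analysis", "compare")


@dataclass(frozen=True)
class Route:
    path: str        # "shallow" or "deep"
    max_steps: int   # cap on tool calls, so neither path can run unbounded


def classify_intent(query: str) -> Route:
    """Send report-style requests down the deep path, everything else shallow."""
    text = query.lower()
    if any(cue in text for cue in DEEP_CUES):
        return Route(path="deep", max_steps=40)
    return Route(path="shallow", max_steps=4)


if __name__ == "__main__":
    for q in ("What was Q3 GPU spend?",
              "Write a comprehensive report on Q3 GPU spend"):
        route = classify_intent(q)
        print(f"{route.path:8s} max_steps={route.max_steps:2d}  {q}")
```

The step caps matter as much as the classification: the shallow path is bounded by construction, and the deep path is something a query has to earn.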
The NeMo Agent Toolkit
The toolkit (previously called AgentIQ) is the open-source library underneath, and its most important design decision is that it is framework-agnostic. It works alongside LangChain, LlamaIndex, CrewAI, Microsoft Semantic Kernel, Google ADK, and plain Python agents, rather than asking you to replatform onto a new agent framework. You wrap what you already have, so adopting the toolkit for its profiling and observability does not force you to throw away the agent code you have written. It also speaks Model Context Protocol both ways, acting as an MCP client to consume remote tools and as an MCP server that exposes your own tools.

    # a workflow is declared in YAML, pointing at NIM endpoints and tools
    # workflow.yml
    llms:
      default:
        _type: nim
        model_name: nvidia/nemotron-3-super-120b-a12b
    workflow:
      _type: react_agent
      llm_name: default                     # the entry under llms: that the agent uses
      tool_names: [retriever, calculator]   # each tool also needs a definition under functions:

    # run it, with the profiler on
    nat run --config_file workflow.yml --input 'Summarize Q3 GPU spend' --profile

    # expected: the answer, plus a per-step trace of tokens and latency
    # failure mode: a tool that never returns leaves the agent looping; the
    # profiler shows one node consuming the entire latency budget

The CLI entry point and config schema have changed across versions, including the rename from AgentIQ, so confirm the current command name (`nat` or `aiq`), the profiling option, and the YAML keys against the NeMo Agent Toolkit docs for your installed release.
Profiling and observability are not optional
This is the part teams discover too late. An agent is a non-deterministic program that calls expensive functions in a loop, and without instrumentation you cannot answer the two questions that decide whether it is shippable: where does the time go, and where does the money go. The toolkit's profiler traces an entire workflow down to the individual tool and agent, recording input and output tokens and timings at each node so you can find the bottleneck and the cost sink. Its observability integrates with OpenTelemetry-based systems and dedicated backends like Phoenix, Langfuse, Weave, and LangSmith, so the traces land in tooling your platform team already runs.
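To make "per-node trace" concrete, here is a standard-library sketch of the kind of record a profiler keeps. It is not the toolkit's API: the `traced` decorator, the `TRACE` store, and the whitespace token count are hypothetical stand-ins, and a real profiler would use the model's tokenizer and export spans rather than print a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory trace store: node name -> accumulated stats.
TRACE = defaultdict(
    lambda: {"calls": 0, "seconds": 0.0, "tokens_in": 0, "tokens_out": 0}
)


def rough_tokens(text: str) -> int:
    """Crude whitespace token count; a real profiler uses the model's tokenizer."""
    return len(text.split())


def traced(node: str):
    """Decorator that records call count, wall time, and token counts per node."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(prompt: str) -> str:
            start = time.perf_counter()
            try:
                result = fn(prompt)
            finally:
                # Time and call count are recorded even if the node raises.
                TRACE[node]["seconds"] += time.perf_counter() - start
                TRACE[node]["calls"] += 1
            TRACE[node]["tokens_in"] += rough_tokens(prompt)
            TRACE[node]["tokens_out"] += rough_tokens(result)
            return result
        return wrapper
    return decorator


@traced("web_search")
def web_search(prompt: str) -> str:
    return "full page text " * 50   # stand-in for a verbose tool result


@traced("calculator")
def calculator(prompt: str) -> str:
    return "42"


if __name__ == "__main__":
    web_search("GPU spend Q3")
    calculator("6 * 7")
    # Largest token producers first: this ordering is what surfaces a cost sink.
    for node, stats in sorted(TRACE.items(), key=lambda kv: -kv[1]["tokens_out"]):
        print(node, stats)
```

Sorting nodes by tokens or seconds is the whole trick. Once every node reports the same four numbers, the bottleneck and the cost sink stop being guesses.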
The reason this is mandatory rather than nice-to-have is the token math. A multi-agent system re-sends history, tool outputs, and reasoning at every turn, so a task can generate many times the tokens of a single chat. If you cannot see which step is responsible, you cannot bring the cost down, and you cannot tell a genuine improvement from a regression. Treat tracing as a launch requirement, the same way you would not ship a web service with no metrics.
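The token math is worth doing once by hand. The sketch below assumes the simplest possible model, which is my assumption rather than a measurement: every turn adds a fixed 500 tokens and re-sends the full history so far. The 13 turns echo the nine model calls and four tool invocations from the opening example.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the whole history so far.

    Turn k sends k * tokens_per_turn tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns)
    = tokens_per_turn * turns * (turns + 1) / 2.
    """
    return tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    single_chat = cumulative_input_tokens(turns=1, tokens_per_turn=500)
    agent_task = cumulative_input_tokens(turns=13, tokens_per_turn=500)
    print(f"single chat: {single_chat} tokens")
    print(f"agent task:  {agent_task} tokens")
    print(f"multiplier:  {agent_task / single_chat:.0f}x")
```

Under these assumptions the agent task costs 45,500 input tokens against 500 for a single chat turn, a 91x multiplier. Real systems trim and cache history, so the true figure is lower, but the growth is quadratic in the number of turns unless something stops it.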
Worked example
An internal research agent was projected at a manageable monthly cost and came in at roughly five times that. The team assumed the 120B model was the culprit and planned to swap to something smaller. Before doing that, they turned on the profiler.
The trace showed one web-search tool consuming about 70 percent of the tokens, because it re-injected the full retrieved page text into the context on every reasoning step instead of a summary. The fix was a one-line change to summarize tool output before it re-entered context, and it cut cost by more than half with no model change. Without the per-node trace, they would have spent a week swapping models and never touched the actual problem. That is the whole argument for instrumenting agents.
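The shape of that fix can be sketched with made-up numbers. Everything here is illustrative and chosen only to echo the proportions in the story, not taken from the team's trace: a 3,500-token page, 1,500 tokens of other context per step, eight reasoning steps, and a hypothetical `summarize` that simply truncates where a real one would call a model.

```python
def tokens_per_task(steps: int, tool_tokens: int, other_tokens: int) -> int:
    """Input tokens for one task when tool output re-enters context on every step."""
    return steps * (tool_tokens + other_tokens)


def summarize(page_text: str, max_words: int = 350) -> str:
    """Hypothetical stand-in for a summarizer: keep only the first max_words words."""
    return " ".join(page_text.split()[:max_words])


if __name__ == "__main__":
    page = "word " * 3500    # illustrative 3,500-token page (whitespace tokens)
    other = 1500             # prompt, history, and reasoning per step
    page_tokens = len(page.split())
    summary_tokens = len(summarize(page).split())

    before = tokens_per_task(steps=8, tool_tokens=page_tokens, other_tokens=other)
    after = tokens_per_task(steps=8, tool_tokens=summary_tokens, other_tokens=other)

    print(f"tool share before: {page_tokens / (page_tokens + other):.0%}")
    print(f"tokens before={before} after={after} saved={1 - after / before:.0%}")
```

With these inputs the tool accounts for 70 percent of the tokens before the change, and summarizing its output saves 63 percent per task. The model never changed, which is why swapping it would not have helped.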
| Question | Start from a Blueprint when | Build from scratch when |
|---|---|---|
| Use case fit | A blueprint matches your pattern | Your workflow is genuinely novel |
| Existing agents | Greenfield, no agent code yet | You have agents (wrap with the toolkit) |
| Time to value | You want a running start today | You have time to design the wiring |
| Evaluation | You want the bundled eval suite | You will build your own |
The wider blueprint catalog
AI-Q is the flagship, but it is one of a growing catalog of NVIDIA Blueprints, and they all follow the same composition principle. There are blueprints for straightforward RAG, for multimodal search and summarization over large video archives, for digital-human customer-service avatars, and for document-heavy enterprise workflows, among others. Each is a different arrangement of the same NIM and NeMo building blocks plus use-case-specific application logic, published as open reference code you can read and fork.
The practical way to use the catalog is as a pattern library, not a product shelf. Even when no blueprint matches your use case exactly, reading the closest one shows you how NVIDIA expects the pieces to fit: which retrieval calls go where, how the agent loop is structured, where guardrails attach, and how evaluation is wired in. That reference value survives even when you end up writing your own application, and it is often worth more than the code you actually reuse. Browse the catalog on build.nvidia.com before you design an agent from first principles, because someone has usually already solved the wiring for something close to your problem, and starting from a known-good arrangement beats rediscovering it.
What I would actually choose
My recommendation: if a blueprint matches your use case, start from it rather than building the wiring yourself, and adopt the NeMo Agent Toolkit for profiling and observability regardless of which agent framework you use. Why: the blueprint saves you the integration work and gives you an evaluation suite, and the toolkit gives you the per-node visibility that production agents cannot run safely without. When it is not the right call: if your workflow is genuinely novel, force-fitting a blueprint costs more than starting clean, though you should still wrap the result in the toolkit. What to validate first: cost and latency per query under realistic traffic with the profiler on, because the demo cost and the production cost are rarely the same number.
The position I will defend: the interesting engineering in agentic AI is no longer making an agent work, it is making one observable, affordable, and reliable. NVIDIA's blueprints and agent toolkit are aimed squarely at that second problem, which is the one that actually decides whether your agent reaches production.
The Verdict
NVIDIA Blueprints give you a working agentic application to fork instead of a blank page, AI-Q shows the intent-routing pattern that keeps multi-step agents affordable, and the NeMo Agent Toolkit wraps whatever framework you already use and adds the profiling and observability that turn an agent from a demo into a system you can operate. The single highest-return action in this whole space is turning the profiler on before you trust the cost projection. If you are about to ship an agent, instrument it first, route by intent, and measure reliability across many runs rather than one.
Next we close the technical arc with operations: GPU observability and honest multi-tenancy with DCGM, the signals that tell you whether your expensive cluster is actually busy. The instrumentation instinct from this part carries straight into it.
References
NVIDIA AI-Q Blueprint (build.nvidia.com)
NVIDIA NeMo Agent Toolkit
AI-Q NVIDIA Blueprint (GitHub)



