Dr. Pranay Jha

NVIDIA Blueprints and Agentic AI: AI-Q and the NeMo Agent Toolkit (NVIDIA AI Series, Part 28)

NVIDIA Blueprints, the AI-Q enterprise research agent, and the framework-agnostic NeMo Agent Toolkit: how to build agents you can profile, afford, and trust in production.

NVIDIA AI Series · Part 28 of 30

TL;DR

NVIDIA Blueprints are reference workflows: deployable architectures that wire NIM, NeMo Retriever, and the NeMo Agent Toolkit into a working application you can fork instead of building from a blank page. AI-Q is the flagship agentic blueprint, an enterprise research agent that classifies intent and then runs either shallow tool-calling for quick cited answers or deep multi-step planning for long-form reports, all grounded in your data. Underneath sits the NeMo Agent Toolkit, a framework-agnostic library that wraps LangChain, LlamaIndex, CrewAI, and plain Python agents alike, and adds the two things production agents actually need: a profiler that traces tokens and latency down to each tool call, and observability through OpenTelemetry and tools like Phoenix and Langfuse. Start from a blueprint when one fits your use case; the value is the wiring and the evaluation suite, not the novelty. And instrument agents before you trust them, because an unprofiled agent is a cost and reliability risk you cannot see.

Who this is for: AI-infrastructure architects and platform engineers moving from single-model serving to multi-step agents. Prerequisites: the model layer from Part 26 and the retrieval layer from Part 27. This part is the agent and application layer that sits on top of both.

Every vendor will tell you agents are easy now. The demo always works. Then you ship it, and a single user question fans out into nine model calls and four tool invocations, the token bill is ten times what you projected, one of the agents quietly loops, and you have no idea which step caused the latency spike because nothing is instrumented. The hard part of agentic AI is not getting one to run. It is running one you can see into, afford, and trust. NVIDIA's answer comes in two pieces: Blueprints that give you a working starting point, and the NeMo Agent Toolkit that makes the running system observable.

What a Blueprint actually is

An NVIDIA Blueprint is a reference workflow: an open, deployable example that composes the building blocks from earlier in this series into a complete application for a common use case. It is not a product you buy and it is not a black box. It is reference code and a reference architecture, published on build.nvidia.com and GitHub, that you fork and adapt. The point is to skip the blank-page problem. Instead of deciding from scratch how a NIM, a retriever, a vector store, and an agent framework fit together, you start from a version that already works and change what your use case needs.

Most blueprints share the same three ingredients you have already met: performance-tuned NIM microservices for the models, NeMo Retriever microservices for grounding, and the NeMo Agent Toolkit for orchestration. What changes between blueprints is the arrangement and the application logic on top. A RAG blueprint wires them one way; the AI-Q research agent wires them another. Because the parts are the same NIMs you would deploy anyway, a blueprint is a head start, not a lock-in.

Figure: A Blueprint is a composition. Same building blocks, arranged into a working application: NIM microservices (the models), NeMo Retriever (grounding), and the NeMo Agent Toolkit (orchestration) feed the Blueprint app (fork and adapt; open reference code you own).
The blueprint is the wiring and the application logic. The components inside are the same NIM and NeMo services you would run regardless, so adopting one does not lock you into anything new.

AI-Q: the enterprise research agent

AI-Q is the flagship agentic blueprint, an open reference example for an agent that connects to your enterprise data, reasons over it, and returns trusted, cited insights. It is built on LangChain DeepAgents and accelerated by the NeMo Agent Toolkit, and it ships with benchmarks and an evaluation suite so you can measure quality rather than guess at it. The design is worth studying even if you never deploy it verbatim, because it encodes a pattern most production agents need.

At the front is an orchestration node that classifies intent. A quick factual question takes the shallow path: bounded tool-calling that returns a fast answer with source citations. A request for a thorough report takes the deep path: long-running, multi-step planning that decomposes the task, gathers evidence across many retrieval calls, and assembles a long-form, citation-backed document. The split matters because using the heavy deep-research machinery for every query is the agentic version of the thinking tax, slow and expensive for questions that did not need it. Routing by intent is the same cost discipline as the model routing from Part 26, applied one level up.

Figure: AI-Q routes by intent. Cheap path for quick answers, deep path for reports. Orchestration (classify intent) sends a query to either Shallow research (bounded tool-calling, citations) or Deep research (multi-step planning, long-form cited report); NeMo Retriever and NIM models ground every step.
Both paths ground their output in NeMo Retriever and the NIM models. The orchestration node is the cost control: it keeps the expensive deep path off questions that did not need it.
| Property | Shallow research | Deep research |
|---|---|---|
| Trigger | Quick factual question | Report-style request |
| Steps | Bounded, few tool calls | Many, multi-step planned |
| Latency | Fast | Long-running |
| Output | Short cited answer | Long-form cited report |
| Cost per query | Low | High, use deliberately |
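
The routing idea is easy to sketch in a few lines of Python. This is not AI-Q's code: the cue words, the Route class, and the tool-call budgets are all hypothetical, and a real orchestration node would classify intent with a model call rather than keyword matching. What it shows is the shape of the decision, classify first, then attach a budget to the path.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical cue words; a real router would classify intent with a model call.
DEEP_CUES = ("report", "analysis", "compare", "in depth", "comprehensive")


@dataclass(frozen=True)
class Route:
    name: str
    max_tool_calls: int  # hard budget that bounds cost on this path


# Illustrative budgets, not values taken from AI-Q.
SHALLOW = Route("shallow", max_tool_calls=3)
DEEP = Route("deep", max_tool_calls=40)


def classify_intent(query: str) -> Route:
    """Pick the cheap path unless the query asks for report-style output."""
    text = query.lower()
    return DEEP if any(cue in text for cue in DEEP_CUES) else SHALLOW


def answer(query: str, run: Callable[[str, Route], str]) -> str:
    """Classify first, then hand the query and its budget to the agent runner."""
    return run(query, classify_intent(query))
```

Calling classify_intent("What is our Q3 GPU spend?") lands on the shallow route, while a request for a comprehensive report lands on the deep one. The budget on each route is the part worth keeping even if you replace the classifier.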

The NeMo Agent Toolkit

The toolkit (previously called AgentIQ) is the open-source library underneath, and its most important design decision is that it is framework-agnostic. It works alongside LangChain, LlamaIndex, CrewAI, Microsoft Semantic Kernel, Google ADK, and plain Python agents, rather than asking you to replatform onto a new agent framework. You wrap what you already have. It also speaks Model Context Protocol both ways, acting as an MCP client to consume remote tools and exposing your own tools over an MCP server. The practical consequence of wrapping rather than replatforming is that adopting the toolkit for its profiling and observability does not force you to throw away the agent code you have written.

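# workflow.yml
# a workflow is declared in YAML, pointing at NIM endpoints and tools
# note: the tools listed under tool_names must also be declared in a
# top-level functions: section, omitted here because their types depend
# on your setup
llms:
  default:
    _type: nim
    model_name: nvidia/nemotron-3-super-120b-a12b
workflow:
  _type: react_agent
  llm_name: default   # which entry under llms: the agent uses
  tool_names: [retriever, calculator]

# shell: run it, with the profiler on
# note: depending on the release, profiling may instead be configured in the
# config's eval section and run through nat eval rather than a --profile flag
nat run --config_file workflow.yml --input 'Summarize Q3 GPU spend' --profile
# expected: the answer, plus a per-step trace of tokens and latency
# failure mode: a tool that never returns leaves the agent looping; the
# profiler shows one node consuming the entire latency budget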

The CLI entry point and config schema have changed across versions, including the rename from AgentIQ (when the command was aiq rather than nat), so confirm the current command name and YAML keys against the toolkit docs for your installed release.

Profiling and observability are not optional

This is the part teams discover too late. An agent is a non-deterministic program that calls expensive functions in a loop, and without instrumentation you cannot answer the two questions that decide whether it is shippable: where does the time go, and where does the money go. The toolkit's profiler traces an entire workflow down to the individual tool and agent, recording input and output tokens and timings at each node so you can find the bottleneck and the cost sink. Its observability integrates with OpenTelemetry-based systems and dedicated backends like Phoenix, Langfuse, Weave, and LangSmith, so the traces land in tooling your platform team already runs.
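
To make "recording tokens and timings at each node" concrete, here is a minimal sketch of the idea using only the Python standard library. It is not the toolkit's profiler or its API: the Span and Trace classes and the count_tokens helper are hypothetical, and the whitespace split is a crude stand-in for a real tokenizer.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    node: str
    input_tokens: int = 0
    output_tokens: int = 0
    seconds: float = 0.0


@dataclass
class Trace:
    spans: list = field(default_factory=list)

    @contextmanager
    def node(self, name: str):
        """Time one node and keep its span even if the node raises."""
        span = Span(node=name)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.seconds = time.perf_counter() - start
            self.spans.append(span)

    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.spans)


def count_tokens(text: str) -> int:
    # Crude whitespace proxy, not a real tokenizer.
    return len(text.split())


trace = Trace()
with trace.node("retriever") as span:
    docs = "quarterly gpu spend by team"  # stand-in for a real tool call
    span.input_tokens = count_tokens("gpu spend")
    span.output_tokens = count_tokens(docs)
```

In practice you would let the toolkit or an OpenTelemetry exporter do this for you. The sketch is only meant to show how little data per node you need before the cost question becomes answerable.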

The reason this is mandatory rather than nice-to-have is the token math. A multi-agent system re-sends history, tool outputs, and reasoning at every turn, so a single task can consume many times the tokens of a single chat exchange. If you cannot see which step is responsible, you cannot bring the cost down, and you cannot tell a genuine improvement from a regression. Treat tracing as a launch requirement, the same way you would not ship a web service with no metrics.
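
The token math is worth working through once. The sketch below assumes the simplest case, where every turn re-sends the full history and each turn appends a fixed number of tokens; the function name and the numbers (a 500-token system prompt, 2,000 tokens added per turn) are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(turns: int, system_tokens: int, added_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the full history so far."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        total += history           # the whole history goes in as input this turn
        history += added_per_turn  # tool output and reasoning are appended
    return total


five_turns = cumulative_input_tokens(5, system_tokens=500, added_per_turn=2000)
ten_turns = cumulative_input_tokens(10, system_tokens=500, added_per_turn=2000)
```

Under these assumptions five turns cost 22,500 input tokens and ten turns cost 95,000. Doubling the number of turns roughly quadruples the bill, because the history grows linearly and the total is the sum over that growth.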

Figure: What the profiler shows. Tokens and latency per node, so the cost sink is visible. Agent (root): total 41k tok / 7.2s. Retriever tool: 6k tok / 0.6s. Web-search tool: 29k tok / 5.1s (hot spot). Summarizer: 6k tok / 1.5s. One node is 70% of the tokens. Fix that, not the model.
Numbers are illustrative. The pattern is the point: the cost is almost never spread evenly, and the profiler tells you which single node to attack first.
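
Once you have a per-node trace, finding the node to attack is a one-liner. The sketch below uses the illustrative numbers from the figure; the dictionary layout and function names are my own assumptions, not the profiler's output format.

```python
# Illustrative per-node numbers from the figure above, not real measurements.
trace = {
    "retriever_tool": {"tokens": 6_000, "seconds": 0.6},
    "web_search_tool": {"tokens": 29_000, "seconds": 5.1},
    "summarizer": {"tokens": 6_000, "seconds": 1.5},
}


def token_shares(nodes: dict) -> dict:
    """Fraction of total tokens consumed by each node."""
    total = sum(n["tokens"] for n in nodes.values())
    return {name: n["tokens"] / total for name, n in nodes.items()}


def hot_spot(nodes: dict) -> tuple:
    """Return the node with the largest token share, and that share."""
    shares = token_shares(nodes)
    name = max(shares, key=shares.get)
    return name, shares[name]


name, share = hot_spot(trace)
print(f"{name}: {share:.0%} of tokens")
```

The same function works on the seconds field if latency rather than cost is the complaint. Sort by share rather than taking the maximum when you want the top few nodes instead of just one.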

Worked example

An internal research agent was projected at a manageable monthly cost and came in at roughly five times that. The team assumed the 120B model was the culprit and planned to swap to something smaller. Before doing that, they turned on the profiler.

The trace showed one web-search tool consuming about 70 percent of the tokens, because it re-injected the full retrieved page text into the context on every reasoning step instead of a summary. The fix was a one-line change to summarize tool output before it re-entered context, and it cut cost by more than half with no model change. Without the per-node trace, they would have spent a week swapping models and never touched the actual problem. That is the whole argument for instrumenting agents.
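
The shape of that fix can be sketched as follows. This is not the team's code: compact and context_words are hypothetical helpers, truncation stands in for a real summarization call, and words stand in for tokens.

```python
def compact(tool_output: str, max_words: int = 50) -> str:
    """Stand-in for summarization: keep only the first max_words words.

    A real fix would call a summarizer model here instead of truncating.
    """
    words = tool_output.split()
    if len(words) <= max_words:
        return tool_output
    return " ".join(words[:max_words]) + " [truncated]"


def context_words(page: str, steps: int, compacted: bool) -> int:
    """Words re-sent across all reasoning steps while the tool output stays in context."""
    kept = compact(page) if compacted else page
    return steps * len(kept.split())


page = "word " * 2000  # a hypothetical 2,000-word retrieved page
raw = context_words(page, steps=8, compacted=False)
small = context_words(page, steps=8, compacted=True)
```

With these made-up numbers the raw page costs 16,000 words across eight reasoning steps and the compacted one costs 408. The ratio is exaggerated by the toy inputs, but the mechanism is the same one the trace exposed: the saving is multiplied by every step that re-reads the context.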

| Question | Start from a Blueprint when | Build from scratch when |
|---|---|---|
| Use case fit | A blueprint matches your pattern | Your workflow is genuinely novel |
| Existing agents | Greenfield, no agent code yet | You have agents (wrap with the toolkit) |
| Time to value | You want a running start today | You have time to design the wiring |
| Evaluation | You want the bundled eval suite | You will build your own |

Gotcha: an agent is non-deterministic, which breaks the testing habits you brought from regular software. The same input can take a different path on two runs, so a single passing test proves little. You need an evaluation set run many times and judged on outcome, plus guardrails on tool use, because an agent with access to real tools can take real, wrong actions. Treat agent reliability as a statistical property you measure, not a behavior you verify once.
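
Measuring reliability as a statistic can look like the sketch below: run each case many times, count outcomes, and report a pass rate with an interval rather than a single green tick. The evaluate helper and the flaky agent are hypothetical stand-ins, and the Wilson score interval is my choice of method here, not something the bundled evaluation suite is known to use.

```python
import math
import random


def pass_rate_interval(passes: int, runs: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95 percent)."""
    if runs == 0:
        raise ValueError("need at least one run")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


def evaluate(agent, cases, repeats: int) -> tuple:
    """Run every (prompt, check) case `repeats` times and count passing outcomes."""
    passes = runs = 0
    for prompt, check in cases:
        for _ in range(repeats):
            runs += 1
            passes += bool(check(agent(prompt)))
    return passes, runs


# Hypothetical flaky agent: right about 80 percent of the time.
rng = random.Random(0)


def flaky_agent(prompt: str) -> str:
    return "42" if rng.random() < 0.8 else "unknown"


passes, runs = evaluate(flaky_agent, [("answer?", lambda out: out == "42")], repeats=200)
low, high = pass_rate_interval(passes, runs)
```

The interval is the useful part. Two builds whose intervals overlap heavily have not been shown to differ, which is exactly the regression-versus-improvement question a single run cannot answer.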

The wider blueprint catalog

AI-Q is the flagship, but it is one of a growing catalog of NVIDIA Blueprints, and they all follow the same composition principle. There are blueprints for straightforward RAG, for multimodal search and summarization over large video archives, for digital-human customer-service avatars, and for document-heavy enterprise workflows, among others. Each is a different arrangement of the same NIM and NeMo building blocks plus use-case-specific application logic, published as open reference code you can read and fork.

The practical way to use the catalog is as a pattern library, not a product shelf. Even when no blueprint matches your use case exactly, reading the closest one shows you how NVIDIA expects the pieces to fit: which retrieval calls go where, how the agent loop is structured, where guardrails attach, and how evaluation is wired in. That reference value survives even when you end up writing your own application, and it is often worth more than the code you actually reuse. Browse the catalog on build.nvidia.com before you design an agent from first principles, because someone has usually already solved the wiring for something close to your problem, and starting from a known-good arrangement beats rediscovering it.

What I would actually choose

My recommendation: if a blueprint matches your use case, start from it rather than building the wiring yourself, and adopt the NeMo Agent Toolkit for profiling and observability regardless of which agent framework you use. Why: the blueprint saves you the integration work and gives you an evaluation suite, and the toolkit gives you the per-node visibility that production agents cannot run safely without. When it is not the right call: if your workflow is genuinely novel, force-fitting a blueprint costs more than starting clean, though you should still wrap the result in the toolkit. What to validate first: cost and latency per query under realistic traffic with the profiler on, because the demo cost and the production cost are rarely the same number.

The position I will defend: the interesting engineering in agentic AI is no longer making an agent work, it is making one observable, affordable, and reliable. NVIDIA's blueprints and agent toolkit are aimed squarely at that second problem, which is the one that actually decides whether your agent reaches production.

The Verdict

NVIDIA Blueprints give you a working agentic application to fork instead of a blank page, AI-Q shows the intent-routing pattern that keeps multi-step agents affordable, and the NeMo Agent Toolkit wraps whatever framework you already use and adds the profiling and observability that turn an agent from a demo into a system you can operate. The single highest-return action in this whole space is turning the profiler on before you trust the cost projection. If you are about to ship an agent, instrument it first, route by intent, and measure reliability across many runs rather than one.

Next we close the technical arc with operations: GPU observability and honest multi-tenancy with DCGM, the signals that tell you whether your expensive cluster is actually busy. The instrumentation instinct from this part carries straight into it.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
