Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


What the NVIDIA AI Stack Actually Is, End to End (NVIDIA AI Series, Part 1)

NVIDIA AI is not one product; it is a stack roughly nine layers deep, from silicon to agents. Part 1 maps the whole thing: GPUs, CUDA, the operators, TensorRT-LLM, Triton, Dynamo, NIM, NeMo, Nemotron, Blueprints, and the AI Enterprise wrapper that supports it all.

NVIDIA AI Series · Part 1 of 30
TL;DR · Key Takeaways
  • NVIDIA AI is not one product. It is a stack roughly nine layers deep, from the GPU silicon up to agent workflows, and the pieces only pay off when they fit together.
  • The layers, bottom up: silicon, system software (driver/CUDA), cluster orchestration (GPU Operator, Run:ai), optimize-and-serve (TensorRT-LLM, Triton, Dynamo), packaging (NIM), build (NeMo), models (Nemotron), and workflows (Blueprints, AI-Q).
  • NVIDIA AI Enterprise is the commercial wrapper: it is what makes the open-source pieces supportable, patched and secure, not a separate product layer.
  • Everything ends in a standard API. A NIM exposes an OpenAI-compatible endpoint, which is why the stack drops into existing apps without rewrites.
  • My take: learn the stack as layers, adopt the one you need, and resist buying all of it on day one. This series walks each layer; Part 1 is the map.
Who this is for: AI-infrastructure architects, platform engineers and technical leads who have to design, buy, or operate the NVIDIA AI stack on-prem or on a private cloud.
Prerequisites: general familiarity with GPUs and containers. No prior NVIDIA-specific knowledge assumed; this is the orientation post for the series.

NVIDIA AI is not a product. It is a stack, roughly nine layers deep, from the silicon up to the agent framework, and most teams meet it one piece at a time: a GPU here, a NIM container there, a NeMo fine-tune later. That piecemeal view is how you end up with tools that do not fit together and a support contract that covers half of them. The whole point of understanding the stack is knowing which layer a problem belongs to before you reach for a tool. This series maps the whole thing, layer by layer. Part 1 is the map.

The stack in one view

Read the stack bottom to top: each layer exists to make the one above it usable. Silicon is useless without drivers; drivers are useless at scale without orchestration; a trained model is useless in production until something serves it behind an API. The mistake is treating any single layer as the whole platform. A GPU is not an AI platform, and neither is a model.

[Figure: The NVIDIA AI stack, bottom to top. Each layer exists to make the one above it usable. NVIDIA AI Enterprise (support, security and lifecycle) spans every layer. Layers as drawn, top to bottom:]
  • Agents & workflows: Blueprints, AI-Q, Omniverse
  • Foundation models: Nemotron, open and partner models
  • Build & customize: NeMo (Curator, Customizer, Guardrails, Retriever)
  • Package: NIM inference microservices (standard API)
  • Optimize & serve: TensorRT-LLM, Triton, Dynamo
  • Cluster & orchestration: GPU Operator, Run:ai, Kubernetes
  • System software: GPU driver, CUDA, cuDNN, Container Toolkit
  • Silicon: Hopper, Blackwell (B200/B300), Rubin GPUs; NVLink / NVSwitch, the fabric that ties GPUs together
Nine layers, plus the AI Enterprise wrapper that spans all of them. You rarely adopt all of them at once; you adopt the layer your problem lives in.

Layer by layer, and what each one is for

Here is the same stack as a reference, with the job each layer does. The rest of this series takes one or two of these per part and goes deep.

| Layer | Key components | What it does |
| --- | --- | --- |
| Silicon | Hopper, Blackwell, Rubin; NVLink/NVSwitch | Compute and the scale-up interconnect |
| System software | Driver, CUDA, cuDNN, Container Toolkit | Makes the GPU usable to software and containers |
| Orchestration | GPU Operator, Network Operator, Run:ai, Kubernetes | Provisions and schedules GPUs at scale |
| Optimize & serve | TensorRT, TensorRT-LLM, Triton, Dynamo | Compiles and serves models efficiently |
| Package | NIM inference microservices | A model behind a standard API, in a container |
| Build & customize | NeMo: Curator, Customizer, Guardrails, Retriever | Curate data, fine-tune, guard, and retrieve |
| Models | Nemotron family, open and partner models | The starting weights you build on |
| Workflows & agents | Blueprints, AI-Q, Omniverse | Reference apps and agentic workflows |
| Commercial wrapper | NVIDIA AI Enterprise | Support, security and lifecycle across all layers |

The bottom three layers are plumbing

Silicon, system software and orchestration are where infrastructure teams live. The GPU choice (Hopper for availability, Blackwell for density, Rubin on the horizon) sets your ceiling; the driver and CUDA stack set compatibility; the GPU Operator and Run:ai decide whether GPUs are shared fairly or hoarded by the first team to grab them. Most production pain that looks like an AI problem is actually a plumbing problem in these three layers: a driver mismatch, an unscheduled GPU, a saturated fabric.
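A driver mismatch is the cheapest of those plumbing faults to catch early. Here is a minimal sketch, assuming you have already collected one line per GPU from `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` on each node. The node names, version strings and function names below are made up for illustration; how you gather the output (SSH, a DaemonSet, your monitoring agent) is up to you.

```python
def driver_versions(reports):
    """Collect the distinct driver versions reported by each node.

    `reports` maps a node name to raw text with one CSV line per GPU,
    in the form "<gpu name>, <driver version>".
    """
    versions = {}
    for node, text in reports.items():
        seen = set()
        for line in text.strip().splitlines():
            # Split from the right so only the last field is read as the version.
            _name, version = line.rsplit(",", 1)
            seen.add(version.strip())
        versions[node] = seen
    return versions


def mismatched_versions(reports):
    """Return the sorted distinct versions across the fleet, or [] if all agree."""
    distinct = set()
    for seen in driver_versions(reports).values():
        distinct |= seen
    return sorted(distinct) if len(distinct) > 1 else []


# Hypothetical sample: node names and version strings are for illustration only.
sample = {
    "gpu-node-1": "NVIDIA H100 80GB HBM3, 550.54.15\nNVIDIA H100 80GB HBM3, 550.54.15",
    "gpu-node-2": "NVIDIA H100 80GB HBM3, 535.161.08",
}

print(mismatched_versions(sample))
```

An empty list means every GPU in the fleet reports the same driver. Anything else is worth resolving before you blame the model.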

The middle layers are where models become services

Optimize-and-serve and packaging are the layers people underestimate. A model checkpoint is not a service. TensorRT-LLM compiles it for the target GPU, Triton or Dynamo serves it, and a NIM wraps the whole thing in a container with a standard API and sensible defaults. The difference between a research demo and something on call at 2am is almost entirely in these layers.

How the pieces actually connect

The layers are not just a stack; they are a pipeline. You start from a model, optionally customize it with NeMo, compile and serve it, package it as a NIM, and consume it from an app or agent. The payoff of the whole chain is the last step: a standard endpoint your application already knows how to call.

[Figure: From model to a callable endpoint. The chain ends in a standard API your app already speaks. Model (Nemotron / open) → Customize (NeMo, optional) → Optimize/serve (TensorRT-LLM) → Package (NIM container) → App / agent.]
You can enter this chain at any box. Many teams skip customize entirely and just serve a model through a NIM.

That last box matters more than it looks. A NIM exposes an OpenAI-compatible API, so the application calling it does not need to know anything about GPUs, TensorRT or the model format underneath.

# A NIM is the whole stack behind one standard endpoint
curl -s http://nim.local:8000/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/llama-3.1-nemotron-70b-instruct","messages":[{"role":"user","content":"hello"}]}'

Expected result: a standard chat-completions JSON response, identical in shape to what your app would get from a public LLM API, served entirely on your own GPUs. That compatibility is the quiet reason the NVIDIA stack is easy to adopt: the top of the stack looks like the API your developers already use.
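The same call from application code looks like any other chat-completions client. The sketch below uses only the Python standard library and reuses the host and model name from the curl example. The helper names `build_chat_request` and `extract_reply` are my own, and whether your deployment expects an `Authorization` header at all is an assumption to check against your setup.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key=None):
    """Build (but do not send) a chat-completions request for an OpenAI-compatible endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only attach a bearer token if your endpoint is configured to require one.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat-completions JSON response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://nim.local:8000",
    "nvidia/llama-3.1-nemotron-70b-instruct",
    "hello",
)

# Sending needs a reachable endpoint, so it is left commented out here:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     print(extract_reply(resp.read()))
```

Nothing in that client mentions GPUs, TensorRT or NIM. Swap the base URL and it would talk to any other OpenAI-compatible service, which is the point.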

NVIDIA AI Enterprise: the wrapper, not a layer

Most of the stack is available as open source. NVIDIA AI Enterprise is the commercial subscription that wraps it with enterprise support, security patching, validated versions and a defined lifecycle, plus orchestration (Run:ai) and the packaged NIM and NeMo microservices. Think of it as the difference between pulling a community container and running a vendor-supported, CVE-patched build with someone to call. Part 2 of this series digs into exactly what the subscription includes and when it is worth it; for now, place it correctly in your mental model: it spans every layer rather than sitting on top as one more box.

In practice: the first question I ask a team adopting NVIDIA AI is not which GPU, it is which layer they actually need. A team that just wants to self-host a chatbot needs the serving and packaging layers and a model. They do not need NeMo training, and buying it on day one is money and complexity spent on a layer they will not touch for a year.

Where to enter the stack

You do not start at the bottom and climb. You start from what you are trying to do and pull in only the layers that goal touches. Four common entry points cover most teams.

[Figure: Start from the goal, not the silicon. Each goal pulls in a different primary tool. What do you need? Run a model: NIM. Customize a model: NeMo. Scale inference: Dynamo + TensorRT-LLM. Build an agent: Blueprints / AI-Q.]
The same GPUs and system software sit under all four. The goal decides which upper layer you actually adopt.
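If it helps to see the four entry points as data rather than a diagram, here is a small sketch. The goal strings and the `shortlist` helper are hypothetical names of mine; the mapping itself is just the four entry points described above, with GPUs and system software as the shared base.

```python
# What sits under every goal, per the caption above.
SHARED_BASE = ["GPUs", "system software"]

# Goal -> primary upper-layer tool(s), taken from the four entry points.
ENTRY_POINTS = {
    "run a model": ["NIM"],
    "customize a model": ["NeMo"],
    "scale inference": ["Dynamo", "TensorRT-LLM"],
    "build an agent": ["Blueprints", "AI-Q"],
}


def shortlist(goal):
    """Return the shared base plus the primary tool(s) for one goal."""
    key = goal.strip().lower()
    if key not in ENTRY_POINTS:
        known = ", ".join(sorted(ENTRY_POINTS))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {known}")
    return {"base": list(SHARED_BASE), "adopt": list(ENTRY_POINTS[key])}


print(shortlist("Run a model"))
```

The useful part is what the function refuses to return: one goal yields one entry point, not the whole catalog.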
Worked example
A team wants an internal support assistant on their own GPUs. Map it to layers: they need silicon (a couple of L40S or H100-class GPUs), system software and the GPU Operator, a NIM serving a Nemotron or open model, and a retrieval layer (NeMo Retriever) over their docs. They touch five of the nine layers. They do not need NeMo training, Omniverse, or a multi-node fabric. Scoping it this way turns a vague AI project into a concrete, costable shortlist, and keeps them from buying an NVL72 rack to run a chatbot.

How this series is organized

The 30 parts follow the stack: foundations and the GPU lineup, then GPU infrastructure (partitioning, NVLink, InfiniBand vs Spectrum-X, storage, power), the software platform (drivers, operators, NGC), inference (NIM, TensorRT-LLM, Triton, Dynamo, economics), customization and training (NeMo), models and agents (Nemotron, RAG, Blueprints), and finally operations and the verdict. Where a topic meets VMware, this series links to the VMware Private AI series rather than repeating it: this series is the NVIDIA stack itself; that one is how to run it on VCF.

Gotcha
The names move fast and overlap. NeMo is both a training framework and a set of microservices; NIM packages models but some NIMs also package NeMo microservices; Blueprints bundle several of these together. Do not try to memorize the marketing taxonomy. Anchor on the job each tool does in the pipeline (build, optimize, serve, package, orchestrate) and the names stop being confusing.

Two journeys: run a model, or build one

Almost every team on this stack is on one of two journeys, and confusing them is the most common planning error I see. The run journey is short: take an existing model, serve it through a NIM, and point an app or agent at the endpoint. It touches the bottom layers plus packaging, it is mostly an infrastructure and operations exercise, and most enterprises should start here because it delivers a working capability in weeks without a data-science team.

The build journey is longer and pulls in the NeMo layer: curate data, fine-tune or align a model, evaluate it, then serve it. It is a data and ML engineering exercise as much as an infrastructure one, and it pays off only when an off-the-shelf model genuinely cannot do the job, which is rarer than most teams assume. The honest test before committing to the build path: have you proven that a well-prompted, retrieval-grounded stock model fails the task? If you have not, you are on the run journey, and that is good news for your timeline and budget. Knowing which journey you are on tells you which half of this series to read first.
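That test can be written down as a rule. This is a sketch under stated assumptions: you have an evaluation set for the task, you have run a well-prompted, retrieval-grounded stock model against it, and you have agreed a pass rate the business would accept. The function name and the threshold are illustrative, not a recommended number.

```python
def choose_journey(stock_results, required_pass_rate):
    """Pick "run" or "build" from stock-model evaluation results.

    stock_results: list of booleans, one per evaluation case, True if the
        stock model (prompted and retrieval-grounded) handled the case.
    required_pass_rate: fraction between 0 and 1 that counts as good enough.
    """
    if not stock_results:
        # No evidence that the stock model fails, so stay on the run journey.
        return "run"
    pass_rate = sum(stock_results) / len(stock_results)
    return "run" if pass_rate >= required_pass_rate else "build"


# Hypothetical evaluation: 9 of 10 cases pass against a 90% bar.
print(choose_journey([True] * 9 + [False], 0.9))
```

The empty-list branch encodes the point from the paragraph above: without proof of failure, the default is the run journey.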

The Bottom Line

Treat NVIDIA AI as a stack, learn it as layers, and adopt only the layers your goal touches. The reason is cost and focus: the stack is deep and expensive, and the teams that struggle are the ones that bought the whole thing before they knew which layer their problem lived in. My recommendation for anyone starting: pin down the goal, map it to layers using the table above, stand up the bottom three (silicon, system software, orchestration) properly because everything rests on them, and enter the upper stack at exactly one point. When would I go broader on day one? Only for a dedicated AI platform team chartered to serve many use cases, where breadth is the job. For everyone else, narrow and deep beats wide and shallow. Next up, Part 2: what NVIDIA AI Enterprise actually includes and whether the subscription is worth it. Which layer does your current project actually live in?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
