Dr. Pranay Jha

NVIDIA Nemotron Foundation Models: Open Weights from Nano to Ultra (NVIDIA AI Series, Part 26)

NVIDIA’s Nemotron family explained: genuinely open weights, data and recipes; the hybrid Mamba-Transformer MoE architecture; Nano, Super and Ultra; and when to self-host open models instead of calling a proprietary API.

NVIDIA AI Series · Part 26 of 30

TL;DR

Nemotron is NVIDIA's family of open-weight models, and open here is the real thing: weights, training data, and the full training recipe are all published. The current generation is Nemotron 3, in three sizes, Nano (31.6B total / 3.6B active), Super (120B / 12B active), and Ultra (550B / 55B active), all built on a hybrid Mamba-Transformer MoE backbone with a 1M-token context and native NVFP4 pretraining for Blackwell. The point of Nemotron is not to beat the frontier closed models on every benchmark. It is to give you a permissively licensed model you can run on your own GPUs, fine-tune on your data, and route intelligently: Nano for cheap per-step work, Super for planning, a proprietary model reserved for the genuinely hard calls. Pick Nemotron when data control, cost-per-token, and customization matter more than the last few points of frontier accuracy.

Who this is for: AI-infrastructure architects and platform engineers deciding which models to host and standardize on. Prerequisites: the inference stack from Part 16 (NIM) and the customization options from Part 23. This part is about the models themselves, not the serving layer.

Most teams treat "open model" and "open weights" as the same thing. They are not. A lot of so-called open models ship a weights file and nothing else: no training data, no recipe, no way to reproduce or audit what went in. Nemotron is the rare case where NVIDIA publishes the weights, the datasets, and the end-to-end training recipe. That difference is the whole reason Nemotron matters to an infrastructure team, and it is the thread running through this part.

What Nemotron actually is

Nemotron is a family of open models NVIDIA builds and trains itself, aimed squarely at agentic reasoning: code, tool use, multi-step planning, and long-context analysis. The current generation is Nemotron 3, and it comes in three sizes that share an architecture and differ in capacity. They are released with open weights on Hugging Face, packaged as NVIDIA NIM for serving, and accompanied by the pretraining and post-training datasets plus a reproducible recipe. You can pull a checkpoint, inspect the data it was trained on, and rerun the published evaluation yourself.

That openness is licensed for commercial use. Super ships under the NVIDIA Nemotron Open Model License, which is written to let enterprises keep data control and deploy anywhere. Ultra was also published with weights, data, and recipes under a permissive open license. The family spans more than one license, and Nano, Super and Ultra do not all carry identical license files, so confirm the exact license name, version, and text for the specific checkpoint you ship at deployment time. The terms matter most if you redistribute.

[Figure: The Nemotron 3 size ladder. Same architecture, three capacity points (MoE: total / active per token). Nano: 31.6B / 3.6B active, 1M context, ~25 GB VRAM, cheap per-step agent work. Super: 120B / 12B active, 1M context, 1-2 GPU class, planning and multi-agent. Ultra: 550B / 55B active, long-context reasoning, multi-GPU / rack, hardest open reasoning tasks.]
Active parameters, not total, drive inference cost. Nano activates 3.6B per token despite holding 31.6B, which is why it runs on a single consumer-class GPU.
| Model | Total / active params | Context | Typical placement | Best for |
| --- | --- | --- | --- | --- |
| Nemotron 3 Nano | 31.6B / 3.6B | 1M tokens | Single GPU, ~25 GB VRAM | High-volume per-step agent tasks |
| Nemotron 3 Super | 120B / 12B | 1M tokens | 1-2 GPUs | Planning, multi-agent, hard coding |
| Nemotron 3 Ultra | 550B / 55B | Long context (exact length unverified) | Multi-GPU / rack | The hardest open reasoning work |
| Nano Omni | Nano-class, multimodal | 1M tokens | Single GPU, ~25 GB VRAM | Vision + audio + text in one model |
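
The total-versus-active split is easier to reason about with rough numbers. The sketch below is back-of-the-envelope only. It assumes about 0.5 bytes per parameter for 4-bit weights (ignoring scale factors, cache, and runtime overhead) and the common rule of thumb of about 2 FLOPs per active parameter per generated token. Both assumptions are mine, not NVIDIA figures.

```python
# Rough sizing sketch: total parameters set weight memory, active parameters set per-token compute.
# Assumptions (not from NVIDIA): 0.5 bytes/param for 4-bit weights, ~2 FLOPs per active param per token.

MODELS = {
    "nano": {"total_b": 31.6, "active_b": 3.6},
    "super": {"total_b": 120.0, "active_b": 12.0},
    "ultra": {"total_b": 550.0, "active_b": 55.0},
}

def weight_memory_gb(total_b: float, bytes_per_param: float = 0.5) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_b * bytes_per_param  # billions of params * bytes/param = GB

def gflops_per_token(active_b: float) -> float:
    """Approximate forward-pass compute per generated token, in GFLOPs."""
    return 2.0 * active_b  # 2 FLOPs per active parameter

for name, p in MODELS.items():
    print(f"{name}: ~{weight_memory_gb(p['total_b']):.1f} GB weights, "
          f"~{gflops_per_token(p['active_b']):.1f} GFLOPs/token")
```

Under these assumptions Nano's weights come to roughly 15.8 GB, which leaves room inside the ~25 GB figure for cache and runtime overhead, while its per-token compute tracks the 3.6B active parameters rather than the 31.6B total.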

The architecture that makes it cheap to run

Nemotron 3 is not a plain transformer, and the differences are the reason it serves long context at a price that makes sense for always-on agents. Four design choices do most of the work.

Hybrid Mamba-Transformer backbone

The backbone interleaves Mamba-2 state-space layers with transformer attention layers. Mamba layers process sequences in linear time, which is what makes a 1M-token context practical instead of theoretical, because attention cost grows with the square of sequence length and a pure-attention model at that context would be ruinous. The attention layers are kept at key depths to preserve precise recall, the needle-in-a-haystack lookups that pure state-space models struggle with. You get long-context memory without the quadratic bill.
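
To see why the quadratic term matters at this scale, here is a minimal scaling sketch. It drops all constants and compares only growth rates, so the numbers are ratios, not measured costs. The 8K baseline is an arbitrary choice of mine.

```python
# Scaling sketch: how per-sequence cost grows with context length.
# Constants are dropped; only the growth rate matters here.

def attention_cost(n_tokens: int) -> int:
    """Self-attention compares every token with every other token: O(n^2)."""
    return n_tokens * n_tokens

def state_space_cost(n_tokens: int) -> int:
    """A state-space layer does a fixed amount of work per token: O(n)."""
    return n_tokens

short, long = 8_000, 1_000_000
attn_growth = attention_cost(long) // attention_cost(short)
ssm_growth = state_space_cost(long) // state_space_cost(short)
print(attn_growth, ssm_growth)  # 15625 125
```

Going from 8K to 1M tokens multiplies the linear-time work by 125 and the attention work by 15,625, which is why keeping attention layers sparse is the lever that matters.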

Latent MoE and multi-token prediction

Super introduces latent MoE: tokens are projected into a compressed low-rank space before the router sends them to experts, so the model can consult roughly 4x as many specialist experts for the same compute as a standard MoE. That buys finer specialization, distinct experts for Python versus SQL, without raising per-token cost. On top of that, multi-token prediction trains shared-weight heads to forecast several future tokens at once, which both sharpens chain-of-thought during training and provides built-in speculative decoding at inference for meaningful wall-clock speedups on structured output like code and tool calls.
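
The speculative-decoding half of that claim can be sketched with a toy verification step. This is the generic greedy accept-or-correct rule, not NVIDIA's implementation. The function name and the token lists are hypothetical, and in a real system the `target` tokens would come from one batched forward pass of the full model over the drafted positions.

```python
# Greedy speculative-decoding verification step (toy version).
# draft: tokens proposed cheaply (for example by multi-token prediction heads).
# target: what the full model would emit at each drafted position, plus one extra token.

def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Keep the longest prefix where draft and target agree, then append the
    target's token at the first disagreement (or its bonus token if all agree)."""
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold one more token than draft")
    accepted = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            accepted.append(verified)  # correction replaces the rejected token
            return accepted
        accepted.append(proposed)
    accepted.append(target[-1])  # every draft token accepted: take the bonus token
    return accepted
```

Each verification pass yields at least one token and up to `len(draft) + 1`, which is where the wall-clock gain on predictable output such as code and tool calls comes from.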

Native NVFP4 pretraining

Most quantized models are trained in higher precision and compressed afterward, which loses accuracy. Nemotron 3 runs the majority of its pretraining math in NVFP4, the NVIDIA 4-bit floating-point format built for Blackwell, so the model learns to be accurate inside 4-bit arithmetic from the first gradient step. Super was pretrained this way on 25 trillion tokens drawn from 10 trillion unique curated tokens. The payoff at inference is a smaller memory footprint and a large speedup on B200 versus FP8 on H100, which ties straight back to the precision discussion in Part 4.
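
For intuition about what 4-bit floating point can represent, here is a toy block-scaled quantizer. Treat it as a simplified illustration under my own assumptions: eight E2M1 magnitudes, 16-value blocks, and a per-block scale held as an ordinary Python float. The real format stores scales in reduced precision, and this post-hoc rounding shows the number format only, not how native NVFP4 pretraining works.

```python
# Toy block-scaled 4-bit float quantizer in the spirit of NVFP4.
# Assumptions: E2M1 magnitudes below, 16-value blocks, scale kept as a Python float.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16

def quantize_block(values: list[float]) -> tuple[float, list[float]]:
    """Return (scale, dequantized values) for one block."""
    peak = max(abs(v) for v in values)
    scale = peak / 6.0 if peak > 0 else 1.0  # map the largest magnitude onto 6.0
    out = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return scale, out

def quantize(values: list[float]) -> list[float]:
    """Quantize a flat list block by block and return the dequantized values."""
    result = []
    for i in range(0, len(values), BLOCK):
        _, q = quantize_block(values[i:i + BLOCK])
        result.extend(q)
    return result
```

With only eight magnitudes per sign available, the per-block scale does most of the work. That is the intuition behind training inside the format rather than compressing into it afterward.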

[Figure: One repeating block. Mamba carries the sequence, attention preserves recall, MoE adds capacity. Layer order: Mamba-2 (sequence), Latent MoE (capacity), Mamba-2 (sequence), Attention (recall), Mamba-2 (sequence), Latent MoE (capacity). This block repeats through the depth of the network; only one attention layer per block keeps the quadratic cost contained.]
Most layers are Mamba-2 (linear-time). Attention appears sparingly. That ratio is the lever behind cheap 1M-token context. Exact layer counts vary by model size; verify against the model card.
In practice: the 1M-token context is the feature that actually changes agent design. Multi-agent systems re-send history, tool outputs, and reasoning at every turn, and that context explosion is where agents drift off-task. A model that holds the whole working set in context without the attention bill lets you keep the agent coherent over long runs instead of constantly summarizing and truncating.
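
A quick way to size the context-explosion problem is to count how many turns fit before you are forced to truncate. The sketch assumes the agent re-sends its full history every turn and adds a fixed number of tokens per turn. The 4,000-token figure and the 128K comparison window are hypothetical, so measure your own traces.

```python
# Context growth when an agent re-sends its full history every turn.
# tokens_per_turn is a hypothetical average; measure your own traces.

def context_at_turn(turn: int, tokens_per_turn: int) -> int:
    """Tokens sent on a given turn (1-indexed) if nothing is truncated."""
    return turn * tokens_per_turn

def turns_until_full(window: int, tokens_per_turn: int) -> int:
    """Number of whole turns that fit before the window overflows."""
    return window // tokens_per_turn

print(turns_until_full(128_000, 4_000))    # 32
print(turns_until_full(1_000_000, 4_000))  # 250
```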

Open versus proprietary: what you are really choosing

The honest framing is not Nemotron versus a frontier closed model on a benchmark leaderboard. It is about who controls the model and where it runs. An open-weight Nemotron lives on your GPUs, inside your network, under a license that permits commercial use and customization. A proprietary frontier model lives behind someone else's API, usually scores a few points higher on the hardest reasoning evals, and sends your prompts off-premises. Those are different products solving different constraints.

Where open wins: regulated or air-gapped environments, high-volume workloads where per-token API cost dominates, and anywhere you need to fine-tune on proprietary data without shipping it out. Where it does not: a low-volume application that needs the absolute best reasoning on novel problems and has no data-residency constraint, where paying a frontier API per call is cheaper than standing up and operating GPUs. Be honest about which situation you are in before you commit a cluster.

| Dimension | Open Nemotron (self-hosted) | Proprietary frontier API |
| --- | --- | --- |
| Data control | Stays in your network | Leaves your perimeter |
| Cost model | Fixed GPU cost, cheap at volume | Per-token, cheap at low volume |
| Customization | Full LoRA / SFT / RL on your data | Limited or none |
| Peak reasoning | Strong, class-leading among open | Usually a few points higher |
| Operational burden | You run the GPUs and the stack | Vendor runs it |
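
The cost-model row reduces to a break-even calculation. The sketch below uses placeholder prices that are not quotes from any vendor, and it assumes the GPU node can actually serve the volume in question. Swap in your own amortized node cost and API rate.

```python
# Break-even sketch between a fixed-cost GPU node and a per-token API.
# All prices are placeholders, not quotes from any vendor.

def api_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly API bill for a given token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def break_even_tokens(gpu_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume above which the fixed GPU cost is the cheaper option."""
    return gpu_usd_per_month / usd_per_million_tokens * 1_000_000

gpu_cost = 6_000.0   # hypothetical: node amortization plus power and operations
api_price = 3.0      # hypothetical: USD per million tokens
print(break_even_tokens(gpu_cost, api_price))
```

Below the break-even volume the API is cheaper, and above it the fixed node wins. That is the same low-volume versus high-volume split the table describes.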

The Super-plus-Nano routing pattern

The most useful thing NVIDIA published alongside the models is a deployment pattern, not a benchmark. In a real agent system you do not send every sub-task to your biggest model. You route by difficulty. Nano handles the high-volume, targeted steps: a simple merge request, a routine tool call, a classification. Super takes the planning and the hard reasoning: understanding a whole codebase, sequencing a multi-tool workflow. The genuinely expert-level calls, the ones where the last few points of accuracy pay for themselves, can still go to a proprietary model. This is a cost architecture as much as a quality one.

[Figure: Route by difficulty, not by default. Most tokens go to the cheapest model that can do the job. An agent router scores difficulty and sends simple, high-volume steps to Nano, planning and hard reasoning to Super, and expert-level calls only to a proprietary model. Volume of calls falls sharply as difficulty rises.]
The cost win comes from the shape of the traffic: the overwhelming majority of agent steps are easy and go to Nano, while only a thin tail reaches Super or a proprietary model.
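
A router of this kind can be very small. The sketch below is one possible shape, not NVIDIA's published pattern. The tier names, the thresholds, and the idea of a single difficulty score between 0 and 1 are all assumptions for illustration, and how you produce that score (a classifier, heuristics, a small model) is the real design work.

```python
# Difficulty-based router sketch. Thresholds and the scoring input are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    name: str
    max_difficulty: float  # inclusive upper bound on the difficulty score

# Ordered cheapest first, so the first match is the cheapest capable tier.
ROUTES = [
    Route("nano", 0.4),
    Route("super", 0.85),
    Route("proprietary", 1.0),
]

def pick_model(difficulty: float) -> str:
    """Return the cheapest tier whose ceiling covers the difficulty score (0.0 to 1.0)."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    for route in ROUTES:
        if difficulty <= route.max_difficulty:
            return route.name
    return ROUTES[-1].name
```

Keeping the tiers in an ordered list means you tune the traffic split by moving two thresholds, with no change to the calling code.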

Worked example

Take an internal coding agent doing 2 million steps a month, where roughly 90 percent are routine (lint fixes, small diffs, lookups) and 10 percent need real planning. Route the 1.8M routine steps to Nano on a single GPU and the 200K hard steps to Super on a two-GPU node. Nano activates 3.6B parameters per token against Super's 12B, so its per-step compute is under a third of Super's and a small fraction of what a dense 120B model would need, which keeps the bulk of the volume cheap.

Compare that to sending all 2M steps to Super, or worse, to a proprietary API at per-token rates: the routine 90 percent is where the money leaks. The routing pattern is not a micro-optimization, it is the difference between an agent platform that pencils out and one that does not. Size the split from your own traffic before you pick GPUs.
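
The arithmetic behind that comparison fits in a few lines. The step counts come from the example above. The per-step costs are placeholders I picked only to show the shape of the calculation, so replace them with figures measured on your own hardware.

```python
# Cost comparison for the worked example: 2M steps per month, 90% routine.
# Per-step costs are placeholders chosen only to show the shape of the calculation.

TOTAL_STEPS = 2_000_000
ROUTINE_SHARE = 0.9

def split_steps(total: int, routine_share: float) -> tuple[int, int]:
    """Split total steps into (routine, hard) counts."""
    routine = round(total * routine_share)
    return routine, total - routine

def monthly_cost(routine: int, hard: int, routine_cost: float, hard_cost: float) -> float:
    """Total monthly cost given per-step costs for each class of step."""
    return routine * routine_cost + hard * hard_cost

routine, hard = split_steps(TOTAL_STEPS, ROUTINE_SHARE)
NANO_COST, SUPER_COST = 0.0002, 0.002   # hypothetical USD per step

routed = monthly_cost(routine, hard, NANO_COST, SUPER_COST)
all_super = monthly_cost(routine, hard, SUPER_COST, SUPER_COST)
print(routine, hard, round(routed, 2), round(all_super, 2))
```

With these placeholder rates the routed split costs well under a quarter of sending everything to Super. The exact ratio depends entirely on your real per-step costs and traffic mix.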

Serving a Nemotron model

Operationally, Nemotron models ship as NVIDIA NIM, so serving one is the same workflow from Part 16: run the container, hit the OpenAI-compatible endpoint. The artifact below starts a Super NIM and sends a chat request.

docker run --rm --gpus all --shm-size=16g \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest

# then call the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/nemotron-3-super-120b-a12b","messages":[{"role":"user","content":"Plan the steps to refactor this module."}],"max_tokens":512}'

# expected: JSON with choices[0].message.content holding the plan

The failure mode to expect is not a crash but a refusal to start. A wrong or unentitled NGC_API_KEY, or a GPU that cannot hold the model at the chosen precision, leaves the container exiting during model load. Check the entitlement and the GPU memory first. The exact image tag changes with releases, so confirm the current tag on build.nvidia.com before you pin it, and check whether your GPU needs a specific NIM profile before deploying.
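
If your agent code is in Python, the same request can be sent with the standard library alone. This is a minimal sketch that mirrors the curl call above. It assumes the container is reachable on localhost:8000 and that the model name matches the one your NIM actually serves. The helper names are mine.

```python
# Minimal Python client for the OpenAI-compatible chat endpoint, standard library only.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the docker run port mapping above
MODEL = "nvidia/nemotron-3-super-120b-a12b"

def build_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build the POST request for a single-turn chat completion."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling `chat("Plan the steps to refactor this module.")` against a running container should return the same plan text the curl example puts in `choices[0].message.content`.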

Gotcha: open weights and open data do not mean no license obligations. The Nemotron licenses permit broad commercial use and customization, but they still carry terms, and the family spans more than one license across Nano, Super and Ultra. Read the actual license file for the checkpoint you ship, especially if you plan to redistribute a fine-tuned derivative.

Beyond the core LLMs

Nemotron is more than three text models. The family extends into the supporting models an agent platform actually needs, and they carry the same open, self-hostable posture. Nemotron Vision handles visual understanding for document and screen tasks. Nemotron RAG ships embedding and reranking models that feed retrieval pipelines. Guardrail models provide content and safety filtering you can run in-line. There is even a Nemotron Speech line for ASR. The strategic point for an architect is consistency: if you standardize on the Nemotron family, the embedding model, the reranker, the safety filter, and the reasoning model all live under licenses you can deploy on your own GPUs, with no part of the chain quietly depending on an external API.

That matters most for retrieval-augmented generation, where a single user query touches an embedding model, a vector search, a reranker, a guardrail check, and finally the LLM. If four of those five steps are open and self-hosted but one is a closed API, you have not actually kept your data on-premises. The Nemotron RAG and Guardrail models exist to close that gap, and they are the direct input to the next part. Treat model selection as a decision about the whole pipeline, not just the chat model at the end of it.
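
That whole-pipeline view is easy to encode as a check you run against your deployment inventory. The stage names follow the paragraph above. The hosting labels and the example values are hypothetical.

```python
# Pipeline residency check: one external stage is enough to send data off-premises.
# Stage names follow the paragraph above; the hosting flags are example values.

PIPELINE = {
    "embedding": "self_hosted",
    "vector_search": "self_hosted",
    "reranker": "self_hosted",
    "guardrail": "external_api",
    "llm": "self_hosted",
}

def external_stages(pipeline: dict[str, str]) -> list[str]:
    """Return the stages that call outside your network, in pipeline order."""
    return [stage for stage, hosting in pipeline.items() if hosting != "self_hosted"]

def stays_on_premises(pipeline: dict[str, str]) -> bool:
    """True only if every stage is self-hosted."""
    return not external_stages(pipeline)

print(external_stages(PIPELINE))  # ['guardrail']
```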

What I would actually choose

My recommendation: standardize your agent platform on Nemotron Nano and Super as the default workhorses, and keep a proprietary frontier model on a short leash for the hardest tail. Why: you get data control, a fixed and favorable cost curve at volume, and full freedom to fine-tune, which is exactly what an enterprise platform needs. When it is not the right call: a low-volume product with no data-residency constraint that needs best-in-class reasoning on novel problems, where operating GPUs costs more than paying an API. What to validate first: run your own evaluation on your own tasks, because published benchmarks rarely match your workload, and confirm the license terms for the specific checkpoint you intend to ship.
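
Running your own evaluation does not need a framework to get started. The sketch below is a deliberately tiny harness. `generate` stands in for whatever client calls your model, and each task carries its own pass/fail check. Both are assumptions about how you structure your tests, not part of any NVIDIA tooling.

```python
# Tiny evaluation harness sketch: score candidate models on your own tasks.
# `generate` is any callable mapping a prompt to text; the check per task is yours to define.
from typing import Callable

Task = tuple[str, Callable[[str], bool]]  # (prompt, pass/fail check on the output)

def pass_rate(generate: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose output passes its check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
    return passed / len(tasks)
```

Run the same task list through Nano, Super, and your proprietary option, and the pass rates give you a first, workload-specific basis for where the routing thresholds should sit.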

The trap I see most often is treating model selection as a single global choice. It is not. The right answer is usually a routed mix, and Nemotron exists precisely to be the open, customizable, self-hosted majority of that mix.

The Verdict

Nemotron is the model family that lets an infrastructure team own its stack instead of renting it. The hybrid Mamba-Transformer MoE design makes long-context agents affordable, the genuinely open release (weights, data, recipe) makes the models auditable and customizable, and the Nano-Super-Ultra ladder maps cleanly onto a route-by-difficulty cost architecture. It will not always top a closed frontier model on the hardest reasoning eval, and it should not have to: reserve the proprietary model for the thin tail and run the rest on Nemotron. If you are standing up an agent platform this quarter, benchmark Nano and Super on your own tasks before you sign an API contract sized for all of your traffic.

Next we wire these models into retrieval: NeMo Retriever, embeddings, reranking, and Guardrails for grounded, safe answers. Bring the model choice you made here into that pipeline.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
