Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


The NVIDIA NeMo Framework: Training and Fine-Tuning at Scale (NVIDIA AI Series, Part 22)

What the NVIDIA NeMo framework is: Megatron-Core parallelism, NeMo 2.0 Python recipes and NeMo-Run, Megatron Bridge for Hugging Face interop, and when to fine-tune instead of pretrain.

NVIDIA AI Series · Part 22 of 30

TL;DR

NeMo is NVIDIA's open-source training framework. It sits on top of Megatron-Core and Transformer Engine and gives you the parallelism, recipes, and launch tooling to train or fine-tune large models without rewriting the distributed stack from scratch. The thing you actually configure is parallelism: tensor, pipeline, data, sequence/context, and expert, and the effective data-parallel degree falls out of the other choices. NeMo 2.0 replaced the old YAML configs with Python-native recipes, and NeMo-Run launches the same recipe unchanged on your laptop, on Slurm, or on Kubernetes. Megatron Bridge converts checkpoints to and from Hugging Face, so you can pull an open model, fine-tune it with NeMo's scaling, and convert it back. Reach for NeMo when you are doing real multi-node training or fine-tuning; for a quick LoRA on a single GPU, lighter tools are fine. Almost nobody should be pretraining from scratch.

Who this is for: AI-infrastructure architects and platform engineers who need to run LLM training or fine-tuning jobs on GPU clusters. Prerequisites: comfort with Docker, distributed PyTorch basics, and GPU memory budgeting from Part 4. The next two parts go deep on customization (Part 23) and multi-node training; this part is the framework underneath both.

Here is the question every team asks first: do I need NeMo, or do I just write PyTorch? For a single GPU and a small model, plain PyTorch is fine. The moment your model does not fit on one GPU, you are in the business of splitting weights, activations, and optimizer states across devices and keeping them in sync, and that is a stack most teams should not be hand-rolling. NeMo exists so you configure parallelism instead of implementing it. The cost of getting that wrong is measured in idle GPUs and weeks of debugging, which is exactly the cost NeMo is built to remove.

What NeMo actually is

NeMo is NVIDIA's fully open-source, end-to-end framework for pretraining, post-training, and reinforcement learning of large generative models, covering LLMs, multimodal, and speech. It runs from a single GPU up to thousand-node clusters, and it works with both Hugging Face / PyTorch models and Megatron-native models. The point is not that it does something you could not do yourself; it is that it packages the hard distributed-training engineering into configuration you set rather than code you write and debug.

Three layers do the heavy lifting. Megatron-Core is the parallelism engine: a GPU-optimized PyTorch library that implements tensor, pipeline, sequence, context, and expert parallelism, and for which NVIDIA reports superlinear weak scaling (utilization rising as the model grows) across thousands of H100 GPUs. Transformer Engine provides the mixed-precision kernels, FP8 on Hopper and Blackwell and FP4 on Blackwell, and leaving it out forfeits a large share of your throughput. PyTorch Lightning, wrapped by NeMo's trainer with a MegatronStrategy, provides the training-loop abstraction that routes distributed calls down into Megatron-Core. You configure the top; the lower layers do the work.

Figure: The NeMo stack. You configure the top; the layers below do the distributed work. Layers, bottom to top: GPUs + NVLink + InfiniBand; CUDA + Transformer Engine (FP8 / FP4); Megatron-Core (parallelism engine); NeMo collections + recipes; NeMo-Run executor (local / Slurm / Kubernetes).
Drop Transformer Engine and you lose FP8/FP4 throughput. Drop Megatron-Core and you are reimplementing parallelism. NeMo's value is wiring these together so you only set the recipe.

Parallelism is the thing you configure

Almost every meaningful decision in a NeMo job is a parallelism choice. Get the degrees right and the cluster runs near its theoretical throughput; get them wrong and you either run out of memory or leave most of the GPUs idle. There are five dimensions, and they interact.

Tensor and pipeline parallelism

Tensor parallelism (set by tensor_model_parallel_size) splits the weight matrices inside each layer across GPUs, recombining partial results with an all-reduce at layer boundaries. It is bandwidth-hungry and belongs inside a node where NVLink is available, typically at degree 2, 4, or 8. Pipeline parallelism (pipeline_model_parallel_size) assigns groups of whole layers to different GPUs and streams micro-batches through the stages. It crosses nodes happily because it only passes activations between stages, not sharded weights, but it introduces a pipeline bubble that virtual pipeline scheduling reduces at the cost of more communication.
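To make the "split the weights, then all-reduce" idea concrete, here is a toy sketch in pure Python. It is not NeMo or Megatron-Core code: the helper names (matmul, row_parallel_linear) are hypothetical, each "GPU" is just a list slice, and the all-reduce is modeled as an elementwise sum of the shards' partial outputs.

```python
# Toy illustration of tensor parallelism on one linear layer, in pure Python.
# No real devices or NeMo calls are involved; all names here are hypothetical.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def row_parallel_linear(x, w, tp):
    """Split w by rows (and x by the matching columns) across tp shards,
    compute one partial product per shard, then sum them (the all-reduce)."""
    k = len(w)
    assert k % tp == 0, "hidden size must divide evenly across shards"
    step = k // tp
    partials = []
    for rank in range(tp):
        lo, hi = rank * step, (rank + 1) * step
        x_shard = [row[lo:hi] for row in x]   # columns lo..hi of the input
        w_shard = w[lo:hi]                    # rows lo..hi of the weight
        partials.append(matmul(x_shard, w_shard))
    # all-reduce: elementwise sum of every shard's partial output
    n, m = len(x), len(w[0])
    return [[sum(p[i][j] for p in partials) for j in range(m)] for i in range(n)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 tokens, hidden size 4
w = [[1, 0], [0, 1], [2, 1], [1, 3]]    # 4 x 2 weight matrix
full = matmul(x, w)                     # single-device reference
sharded = row_parallel_linear(x, w, tp=2)
print(full == sharded)                  # the sharded result matches the reference
```

Every shard has to exchange its full partial output on each layer, which is why this kind of split wants the fastest link available.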

Sequence, context, and expert parallelism

Sequence parallelism shards the parts of a layer that tensor parallelism leaves replicated, cutting activation memory. Context parallelism (context_parallel_size) splits along the sequence dimension, which is what makes very long context lengths trainable. Expert parallelism (expert_model_parallel_size) distributes the experts of a mixture-of-experts model across GPUs. Data parallelism is the one degree you do not set directly: the effective data-parallel size is world_size / (TP × PP × CP), which reduces to world_size / (TP × PP) when context parallelism is off (CP = 1). It falls out of the other choices, and gradients are all-reduced across replicas at every step.
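The derived data-parallel degree can be computed before a job is ever submitted. The sketch below is plain arithmetic, not a NeMo API: data_parallel_size is a hypothetical helper, and it assumes context parallelism consumes GPUs alongside TP and PP, so that with CP = 1 it matches world_size / (TP × PP).

```python
# Hypothetical pre-flight helper: derive the data-parallel degree from the
# degrees you do set. Plain arithmetic, independent of any NeMo installation.

def data_parallel_size(world_size, tp, pp, cp=1):
    """Return DP = world_size / (TP * PP * CP), or raise if it is not an integer."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP = {model_parallel} does not divide world_size = {world_size}"
        )
    return world_size // model_parallel

print(data_parallel_size(32, tp=8, pp=4))         # 4 nodes x 8 GPUs, one model copy
print(data_parallel_size(64, tp=8, pp=4))         # 8 nodes, two replicas
print(data_parallel_size(64, tp=8, pp=4, cp=2))   # context parallelism uses the extra GPUs

try:
    data_parallel_size(32, tp=8, pp=3)            # 24 does not divide 32
except ValueError as err:
    print("invalid layout:", err)
```

Running a check like this on paper or in a notebook turns a failed launch into an error you see before the scheduler does.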

Figure: Where each split lives. Fast links inside a node, slower links across nodes and replicas. Replica 1 is a pipeline across two nodes (Node A: stage 1, Node B: stage 2), with TP across NVLink inside each node and pipeline activations crossing the node link. Replica 2 is an identical copy; data parallelism = world_size / (TP x PP), and gradients all-reduce between replicas. TP stays inside a node; PP spans nodes; DP replicates the whole thing.
Placement matters as much as the degree. Tensor parallelism on a slow link, or pipeline parallelism where you meant tensor, will cripple throughput even when the math fits.
Dimension | What it splits | Link it needs | Typical setting
Tensor (TP) | Weights within a layer | NVLink, in-node | 2, 4, or 8
Pipeline (PP) | Groups of whole layers | Cross-node OK | 2 to 16+
Data (DP) | The batch (full replicas) | Cross-node OK | Derived, not set
Context (CP) | The sequence length | Cross-node OK | 1 to 8 for long context
Expert (EP) | MoE experts | Cross-node OK | MoE models only

NeMo 2.0 and NeMo-Run

NeMo 2.0 made one change that matters more than it sounds: it replaced the old YAML-driven configuration with Python-native recipes built from run.Config and run.Partial objects. A recipe is now ordinary Python you can inspect in an IDE, version in git, and test, instead of a wall of YAML you edit by guesswork. NeMo-Run is the companion launcher. You write the recipe once and hand it to an executor, and the same recipe runs unchanged locally, on Slurm, or on Kubernetes. The executor is a parameter, not a rewrite, which is what makes the path from a laptop experiment to a thousand-GPU run a configuration change rather than a porting project.

import nemo_run as run
from nemo.collections import llm

# a fine-tuning recipe with explicit parallelism
recipe = llm.llama3_70b.finetune_recipe(
    num_nodes=4, num_gpus_per_node=8,
)
recipe.trainer.strategy.tensor_model_parallel_size = 8
recipe.trainer.strategy.pipeline_model_parallel_size = 4

# same recipe, choose where it runs
run.run(recipe, executor=run.SlurmExecutor(...))   # or LocalExecutor / a K8s executor

# expected: a launched job; logs show the global batch and parallel degrees
# constraint: world_size (32) must equal TP*PP*DP. Here 8*4 already uses all
# 32 GPUs, so DP=1, which is valid. If TP*PP did not divide 32, the job would
# fail at init; fix it by changing the node count or lowering TP/PP.

Confirm the exact recipe names and strategy field paths against the NeMo version you run, since the 2.x API is still moving.

Worked example

Fine-tune a 70B model on 4 nodes of 8 H100s, so world_size is 32. Set tensor parallel to 8 to keep each layer's weights inside one NVLink-connected node, and pipeline parallel to 4 to span the four nodes. That uses 8 times 4, which is 32, the entire cluster, so data parallelism is 1: every GPU is part of a single model copy.
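The same layout can be sanity-checked against the memory budget with back-of-envelope arithmetic. This is a rough sketch under stated assumptions, not a NeMo calculation: it counts model state only (no activations), assumes the state shards evenly across TP × PP, and uses the common rule of thumb of about 16 bytes per parameter for full fine-tuning with Adam in mixed precision (2 for weights, 2 for gradients, 12 for FP32 optimizer state). A LoRA run would need far less optimizer state. The helper name per_gpu_gb is hypothetical.

```python
# Back-of-envelope per-GPU memory for model state only (activations excluded).
# All byte counts below are assumptions for illustration, not measured values.

PARAMS = 70_000_000_000      # 70B parameters, from the worked example
TP, PP = 8, 4                # layout from the worked example (DP = 1)
BYTES_WEIGHTS_BF16 = 2       # BF16 weights only
BYTES_FULL_FT_ADAM = 16      # assumed: 2 weights + 2 grads + 12 FP32 optimizer state

def per_gpu_gb(params, bytes_per_param, tp, pp):
    """Model-state gigabytes per GPU, assuming an even split across TP * PP."""
    return params * bytes_per_param / (tp * pp) / 1e9

weights_only = per_gpu_gb(PARAMS, BYTES_WEIGHTS_BF16, TP, PP)
full_finetune = per_gpu_gb(PARAMS, BYTES_FULL_FT_ADAM, TP, PP)
print(f"weights only:      {weights_only:.3f} GB per GPU")
print(f"full fine-tune:    {full_finetune:.1f} GB per GPU")
```

Under these assumptions the model state leaves room on an 80 GB H100, but activations still have to fit in what remains, which is where sequence length and micro-batch size come in.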

If you wanted data parallelism of 2 to push more batch through, you would need 8 nodes (64 GPUs) at the same TP and PP, or you would drop TP to 4 and fit two replicas on the four nodes, at the cost of each GPU holding a weight shard twice as large and of a gradient all-reduce between the replicas. There is no free lunch: every reshuffle trades memory headroom against communication. The point is that you reason about it with the simple identity world_size = TP × PP × DP (with context parallelism off) before you ever submit the job.
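The alternatives for the 32-GPU cluster can be enumerated directly from the identity. The sketch below is illustrative arithmetic only: layouts is a hypothetical helper, it restricts TP to 1, 2, 4, or 8 so tensor parallelism stays inside an 8-GPU node, and it ignores context and expert parallelism as well as whether the model's layer count divides evenly across pipeline stages.

```python
# Enumerate (TP, PP, DP) layouts that satisfy world_size = TP * PP * DP.
# Hypothetical helper for planning; it does not call NeMo.

def layouts(world_size, tp_choices=(1, 2, 4, 8)):
    """Return every (tp, pp, dp) triple whose product equals world_size."""
    found = []
    for tp in tp_choices:
        for pp in range(1, world_size + 1):
            if world_size % (tp * pp) == 0:
                found.append((tp, pp, world_size // (tp * pp)))
    return found

# Which layouts give two replicas on the 4-node, 32-GPU cluster?
two_replicas = [layout for layout in layouts(32) if layout[2] == 2]
for tp, pp, dp in two_replicas:
    share = 1 / (tp * pp)   # fraction of the model weights held by each GPU
    print(f"TP={tp} PP={pp} DP={dp} -> each GPU holds 1/{tp * pp} of the weights ({share:.4f})")
```

Every two-replica layout on 32 GPUs leaves each GPU with 1/16 of the weights instead of 1/32, so the real choice is how to divide that 16-way split between TP and PP.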

Megatron Bridge and Hugging Face interop

The historical friction with Megatron-style training was checkpoints: models lived on Hugging Face in one format and trained in Megatron in another, and moving between them was a fragile conversion script. Megatron Bridge, a library inside the NeMo framework, makes that conversion bidirectional and verified for current models. You pull an open checkpoint from Hugging Face, fine-tune it with NeMo's parallelism, and convert the result back to the Hugging Face format for serving or sharing. In the current stack it also acts as the packaging layer that brings in the latest throughput work, including full-iteration CUDA graphs and communication-overlap optimizations. For most teams this is the feature that makes NeMo practical: you are not locked into a proprietary checkpoint format to get the scaling.

Gotcha: the most common reason a NeMo job refuses to start is a parallelism mismatch. The product of your tensor, pipeline, and (for MoE) expert degrees has to divide the world size cleanly, and what is left becomes data parallelism. If TP times PP does not divide the GPU count, the job fails at init, not at step 100. Compute the split on paper before you submit, and treat a job that will not launch as an arithmetic bug, not a cluster problem.

Framework or microservices, and which you need

A frequent and costly confusion: NeMo the framework is not the same thing as NeMo microservices. The framework is Python code you compose and run on your own cluster to train or fine-tune a model with full control over parallelism. The microservices, including NeMo Customizer, Evaluator, Retriever, and Guardrails, are REST APIs you run on Kubernetes that wrap managed fine-tuning, evaluation, retrieval, and safety behind an endpoint. They share a name and some internals, and they solve different problems. Provision the wrong one and you either hand a research team an API that hides the controls they need, or you saddle a platform team with a training stack they did not want to operate.

The split usually follows the team. A platform group offering self-service customization to application developers leans on the microservices, because the value there is an API a non-specialist can call. A training or research group pushing parallelism and throughput lives in the framework, because the value there is control. Plenty of shops run both: the framework produces a tuned checkpoint, and the microservices serve and govern it. The lifecycle below is the path a model takes from an open base to something you deploy, and it is the same regardless of which side you start on.

Dimension | NeMo framework | NeMo microservices
Interface | Python code and recipes | REST API
Runs on | Any executor (local / Slurm / K8s) | Kubernetes
Control | Maximum, every parallelism knob | Managed, opinionated defaults
Best for | Training and research teams | Platform self-service
Skill needed | Distributed training | API integration
Figure: From open base to deployment. The framework owns the middle; serving picks up at the end. Stages: Open base (Nemotron / Llama) -> NeMo fine-tune (parallelism + TE) -> Megatron Bridge (convert to HF) -> Package as NIM (optimized serve) -> Deploy (to production). Spans: NeMo framework; NIM serving (Part 16).
Megatron Bridge is the hinge: it lets you train in Megatron and hand a Hugging Face checkpoint to the serving layer without a lossy custom conversion.

What I would actually choose

My recommendation: use NeMo for any real multi-node training or fine-tuning run, and reach for it specifically when you need Megatron-Core's parallelism and FP8/FP4 throughput at scale. Why: it removes the distributed-training engineering that is genuinely hard to get right, and the Hugging Face bridge means you keep portability. When it is not the right call: a quick LoRA on a single GPU, or a small experiment, where NeMo's machinery is more setup than the task warrants and a lighter fine-tuning library will move faster. What to validate first: your parallelism plan against the GPU count and the memory budget, because that arithmetic decides whether the job runs at all and whether it runs efficiently.

And the position worth stating plainly: almost nobody should pretrain from scratch. Pretraining a frontier model takes millions of GPU-hours and a data-curation pipeline most organizations do not have. Fine-tuning an open base like Nemotron or Llama with NeMo gets you the customization you actually need on a budget you can defend. Treat pretraining as the rare exception, not the default ambition.

The Verdict

NeMo is the framework that turns large-scale training from a distributed-systems project into a configuration exercise. Megatron-Core supplies the parallelism, Transformer Engine supplies the low-precision throughput, NeMo 2.0 supplies Python-native recipes, NeMo-Run supplies executor-agnostic launch, and Megatron Bridge keeps you portable to and from Hugging Face. The skill that matters is parallelism planning, and the identity world_size equals TP times PP times DP is the first thing to internalize. If you are about to fine-tune a model bigger than a single GPU, start with NeMo and start with the parallelism math, not the model.

Next we take the framework and apply it to customization specifically: LoRA, SFT, and RLHF, and when each one is worth the data and compute it demands. Bring the parallelism intuition from this part into that work.


References

NVIDIA NeMo Framework User Guide
NVIDIA Megatron-Core
NVIDIA Megatron-Bridge (GitHub)


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
