Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

FORMERLY
VMware Insight & Cloud Pathshala
What began over a decade ago as a passion for sharing knowledge has evolved into a unified platform for Enterprise AI, VMware, Cloud Architecture, Research, and Modern Infrastructure.
,

Multi-Node LLM Training: Scheduling, Checkpointing and Fault Tolerance (NVIDIA AI Series, Part 25)

At thousands of GPUs, failures are routine. This part covers gang scheduling (Slurm vs Kubernetes vs NVIDIA Run:ai), async distributed checkpointing with NeMo, and the NVIDIA Resiliency Extension stack for fault tolerance, straggler detection, and elastic restart.

NVIDIA AI Series · Part 25 of 30
TL;DR
  • At 1,000+ GPUs, node failures are not edge cases. They are scheduled maintenance you did not plan for.
  • Slurm remains the production choice for pure training throughput; Kubernetes wins when you share infrastructure with inference and CI; NVIDIA Run:ai sits on top of Kubernetes and adds gang scheduling and quota management.
  • Async distributed checkpointing with torch_dist format cuts checkpoint stall from 2-3 minutes to under 10 seconds on large clusters.
  • The NVIDIA Resiliency Extension layers fault tolerance, straggler detection, and local checkpointing. No single layer covers every failure mode. Stack them.
  • The math: at 512 nodes with a 72-hour MTBF per node, expected time to first failure is about 8.5 minutes. Checkpoint every 15 minutes, lose 7.5 minutes on average.
Who This Is For: Platform and infrastructure engineers running or planning multi-node GPU clusters for LLM pretraining or fine-tuning. Assumes familiarity with distributed training concepts (data parallelism, tensor parallelism) and basic Kubernetes or Slurm operations. Read Part 22 (NeMo framework overview) and Part 24 (NeMo Curator) first if you are new to the NeMo stack.

Your 256-node training job dies 18 hours in. The NCCL watchdog fired. One node kernel-panicked, the collective stalled, and PyTorch eventually timed out on TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC. You check the checkpoint. The last valid save was 47 minutes ago. That is 47 minutes times 2,048 GPUs of wasted compute, gone. This is not a hypothetical. At scale, this scenario describes a Tuesday.

Multi-node LLM training has three problems that compound each other: scheduling the right nodes in the right topology, writing checkpoints fast enough that a failure does not cost hours, and detecting and recovering from faults before the entire collective deadlocks. This part covers all three, with the specific tools NVIDIA ships and the trade-offs I have seen in practice.

The Scheduling Problem: Gang Placement or Nothing

Distributed training is an all-or-nothing workload. Every rank must start simultaneously and stay in lockstep for the life of the job. A scheduler that starts 200 of 256 pods and leaves the rest pending does not give you a partially functional training job. It gives you 200 processes that will block on their first AllReduce and eventually time out.

This is the fundamental tension between general-purpose container schedulers and HPC-style batch systems. Gang scheduling is the requirement: every process in the job starts together or the job does not start at all. Beyond gang scheduling, topology awareness matters enormously. An H100 DGX node communicates over NVLink at 900 GB/s within the node and drops to InfiniBand or Ethernet across nodes. A scheduler that spreads your 64-GPU job across 8 nodes in three different racks, crossing multiple spine switches, will see meaningfully worse AllReduce bandwidth than one that keeps those 8 nodes on the same leaf switch or in the same NVLink domain.

Figure 1: Topology-Aware Job Placement
Scheduler to job to node topology; rail-optimized placement vs random placement
SCHEDULER (Slurm / K8s / Run:ai) TRAINING JOB (gang alloc) LEAF SWITCH rail-optimized Node group A Node group B Node group C Node group D Topology-aware: all nodes on same leaf = max NVLink + low-hop IB bandwidth Random placement crosses spines; AllReduce bandwidth collapses
Gang scheduling + topology placement: all ranks start together and land on nodes sharing the same leaf switch, preserving NVLink and low-hop InfiniBand bandwidth for collective operations.

Slurm, Kubernetes, and NVIDIA Run:ai: The Real Comparison

Three schedulers dominate production GPU training. Here is a direct comparison without the marketing layer:

Dimension Slurm Kubernetes (native) NVIDIA Run:ai on K8s
Gang scheduling Native; all-or-nothing by design Not native; needs Volcano, Coscheduler, or DRA Built-in gang scheduling via Run:ai gang plugin
Topology awareness Mature; topology.conf, switch constraints, TRES Improving; NVIDIA Topograph + DRA ComputeDomains in 2026 NVIDIA topology identifiers exposed; GPU affinity policies
Container support Pyxis/Enroot add-on; Slinky for K8s bridge Native; first-class OCI containers Native (K8s); Run:ai console for image management
Fault tolerance hooks NeMo ft_launcher + Resiliency Ext; Slurm requeue PyTorchJob restartPolicy; manual pod replacement Run:ai job policies; integrates NeMo resiliency
Multi-tenant quota Fair-share scheduling; partitions; QOS ResourceQuota per namespace; coarse Projects, departments, GPU quotas with over-provision
Operational overhead High; HPC admin expertise required Moderate; cloud-native team can manage Low for users; licensing cost adds up per GPU
Best fit Dedicated training clusters, HPC heritage teams Mixed inference + training, cloud-native orgs Shared GPU platforms needing SLA-backed scheduling

NVIDIA Slinky bridges the gap: it runs Slurm inside Kubernetes, letting you submit Slurm jobs to a K8s-managed pool. NVIDIA has tested Slinky on clusters up to 8,000 GPUs and reports performance matching bare-metal Slurm deployments. The topology support in Slinky 1.1.0 (2026) includes dynamic topology registration keyed to the underlying Kubernetes node labels, which means NVLink domain and InfiniBand fabric hierarchy are visible to the Slurm scheduler without manual topology.conf updates. [VERIFY: Slinky 1.1.0 GA date and max tested node count]

Gotcha: Kubernetes-native gang scheduling with PodGroup (Volcano) works in principle but in practice I have seen training jobs get partial allocations after a node eviction because the scheduler re-evaluated individual pods rather than the full gang. With Slurm or Run:ai, the entire allocation is atomic. On a shared K8s cluster without a dedicated gang scheduler, plan to lose training runs to partial evictions.

Checkpointing: The Math Behind the Interval

Checkpointing is the safety net, but naive checkpointing at large scale introduces its own problems. A synchronous checkpoint on a 500B parameter model in BF16 across 1,024 GPUs is writing roughly 1 TB of state to shared network storage. If that write takes 3 minutes and you checkpoint every 30 minutes, you are burning 10% of your compute time on I/O stalls. Checkpoint too infrequently, and a failure costs you hours.

The MTBF Calculation You Should Actually Run

For a cluster of N nodes each with an individual MTBF of M hours, the expected time to first failure in the cluster is M/N hours. At 512 nodes with a 72-hour per-node MTBF (realistic for GPU nodes under sustained training load including NVLink faults, driver crashes, and network timeouts), the expected time to first cluster-level failure is 72/512 = 0.14 hours, or about 8.5 minutes. If your checkpoint interval is 30 minutes, average lost work per failure is 15 minutes. At 512 x 8 GPUs per node = 4,096 GPUs, that is 4,096 GPU-hours of wasted compute per failure event, at whatever your GPU cost rate is.

The right checkpoint interval is the one that minimizes total wasted GPU-hours: wasted = (checkpoint_overhead_hours) + (MTBF_cluster / 2). At an 8.5-minute mean inter-failure interval, any checkpoint that completes in under 4 minutes and runs every 10-15 minutes is worth the I/O cost. This is exactly why async distributed checkpointing matters.

Figure 2: Sync vs Async Checkpoint Timeline
Synchronous checkpoint blocks training; async overlaps I/O with the next forward pass
SYNC CHECKPOINT Training steps 1-100 CHECKPOINT STALL ~3 min Training steps 101-200 CHECKPOINT STALL ~3 min Training steps 201+ ASYNC CHECKPOINT Training steps 1-100 Training steps 101-200 Training steps 201-300 Background write (async worker) Async: schedule save at step 100, continue training immediately. Background worker flushes to storage. Requires torch_dist format.
Synchronous checkpointing stalls training during every write. Async checkpointing snapshots state in memory then offloads the write to a background worker, overlapping I/O with the next training steps.

Checkpoint Strategy Trade-offs

Strategy Write Latency Recovery Time Storage Required Best When
Sync to NFS/Lustre 2-5 min (large model) Fast (shared FS) 1x model size Small clusters, simple setup
Async (torch_dist) 6-15 sec visible stall Fast (shared FS) 1x model size Most production clusters; best latency reduction
Local checkpoint + replicate Very fast (local NVMe) Fast if nodes survive Nx (replication degree) FS contention is the bottleneck; high MTBF clusters
Distributed async (Megatron-Core) Near-zero stall Moderate (reassemble shards) 1x across all nodes Largest scale; tensor/pipeline parallel runs

NeMo and Megatron-Core use the torch_dist format for distributed checkpointing. Each rank writes its own shard in parallel. PyTorch async checkpointing (available since PyTorch 2.4+) takes a snapshot of the optimizer and model states into CPU memory, then flushes in a background thread. PyTorch’s own benchmarks show a 7B parameter model checkpoint reduced from 148.8 seconds to 6.3 seconds effective stall — a 23x improvement. At 70B parameters the absolute numbers scale, but the relative benefit holds.

Local checkpointing in the NeMo Resiliency Extension writes first to node-local NVMe, then replicates shards to a configurable number of peer nodes. This avoids parallel NFS writes from hundreds of nodes hammering a shared filesystem. The trade-off: if a node fails before replication completes, you lose its shard. Set replication degree to at least 2 for any cluster where simultaneous multi-node failure is plausible.

Fault Tolerance: What NVIDIA Ships and What Actually Helps

The NVIDIA Resiliency Extension (nvidia-resiliency-ext) packages several fault-tolerance capabilities under one roof. The table from the NeMo Megatron Bridge docs (which now hosts the authoritative resiliency documentation) breaks it down clearly:

  • Fault tolerance with ft_launcher: hang detection via section-based watchdog; automatic in-job restart (within the same Slurm allocation) or new job launch. Requires Slurm and ft_launcher, not torchrun.
  • NVRx straggler detection: monitors per-rank GPU performance scores, both relative and absolute. Reports every 300 seconds by default with negligible throughput impact tested up to 1,000 H100 GPUs on Llama 3.1 training. Does not terminate training unless stop_if_detected=True is set explicitly.
  • Preemption handling: catches SIGTERM, saves a checkpoint, exits cleanly. Slurm-specific via the PreemptionPlugin.
  • Async checkpoint save: background workers with the torch_dist format only. Other formats (zarr, fsdp_dtensor) do not support this path.
  • In-process restart (experimental): restarts training within the same OS process, avoiding container/CUDA context re-init overhead. Requires PyTorch >= 2.5.1 and NCCL >= 2.26.2. Not compatible with NeMo-Run or Slurm preemption plugins.

The practical caveat from the docs is worth repeating verbatim: no single resiliency feature covers all failure modes. Layer them. In production this means: fault tolerance for hang detection and restart, straggler detection to catch degraded nodes before they cause hangs, async checkpoint to minimize replay cost, and preemption handling for shared clusters with time limits.

Figure 3: Failure Detection and Restart Flow
From NCCL hang to checkpoint restore and job restart
NODE FAULT GPU/NIC/driver NCCL HANG AllReduce stalls WATCHDOG ft_launcher fires TERMINATE all ranks killed REQUEUE Slurm requeues job RESTORE CKPT load last valid state RESUME training continues Total recovery time: restart overhead + checkpoint load (typically 5-15 min on large clusters) Wasted compute = (steps since last checkpoint) x (GPU count) x (step time) In-process restart (experimental) skips container and CUDA context re-init, cutting recovery to seconds
The full failure-to-restart path: NCCL hang triggers the ft_launcher watchdog, which kills all ranks, Slurm requeues the job, and training resumes from the last valid checkpoint. The total wall-clock cost is the restart overhead plus replay time.

NCCL 2.27 and ncclCommShrink

NCCL 2.27 introduced ncclCommShrink, which allows a training or inference job to remove faulted ranks from the communicator without tearing down and rebuilding from scratch. Recovery from a fault previously required every rank to call ncclCommAbort on the old communicator, then coordinate a fresh ncclCommInit with a new ncclUniqueId. ncclCommShrink with the NCCL_SHRINK_ABORT flag cancels hung operations and builds a new communicator from the surviving ranks, reusing the rank metadata from the old one. This is particularly useful for elastic inference serving where worker count changes; for training, where all ranks must participate in every collective, shrink is typically followed by a checkpoint restore and job requeue rather than continued training with a smaller communicator.

One common misconfiguration: if TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC is left at default (600 seconds on some builds), a stalled AllReduce will sit for 10 minutes before PyTorch’s watchdog fires. Meanwhile, healthy ranks burn GPU cycles on empty iterations. Set this to a value shorter than your checkpoint interval. 180 seconds is a reasonable starting point for a cluster where AllReduce operations genuinely complete in under 60 seconds. Also set NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD for your InfiniBand fabric; the defaults are sized for small clusters.

The Production Artifact: PyTorchJob with Checkpoint and Restart

Below is a representative Kubernetes PyTorchJob manifest for a multi-node NeMo training run with distributed checkpointing and fault-tolerant restart policy. This uses the NVIDIA GPU Operator for device plugin support and assumes a shared NFS mount at /checkpoints. The critical failure mode follows the manifest.

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: nemo-llm-pretrain-256node
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/nemo:25.04
              command:
                - ft_launcher
                - --fault-tol-cfg-path=/cfg/ft_config.yaml
                - --
                - python
                - /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py
                - --config-path=/cfg
                - --config-name=pretrain_70b.yaml
              env:
                - name: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
                  value: "180"
                - name: NCCL_NVLS_ENABLE
                  value: "0"
                - name: NCCL_DEBUG
                  value: "WARN"
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/hca: 1
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
                - name: cfg
                  mountPath: /cfg
          volumes:
            - name: checkpoints
              nfs:
                server: nfs-server.cluster.local
                path: /exports/checkpoints
            - name: cfg
              configMap:
                name: nemo-pretrain-cfg
    Worker:
      replicas: 255
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/nemo:25.04
              command:
                - ft_launcher
                - --fault-tol-cfg-path=/cfg/ft_config.yaml
                - --
                - python
                - /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py
                - --config-path=/cfg
                - --config-name=pretrain_70b.yaml
              env:
                - name: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
                  value: "180"
                - name: NCCL_NVLS_ENABLE
                  value: "0"
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/hca: 1
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
                - name: cfg
                  mountPath: /cfg
          volumes:
            - name: checkpoints
              nfs:
                server: nfs-server.cluster.local
                path: /exports/checkpoints
            - name: cfg
              configMap:
                name: nemo-pretrain-cfg

Expected behavior: ft_launcher on each pod monitors the training process with section-based watchdog timers (different timeouts for init, training step, and checkpoint). On a transient NCCL hang, ft_launcher terminates the process, Kubernetes sees a non-zero exit code, and restarts the pod per restartPolicy: OnFailure. NeMo auto-resumes from the latest checkpoint directory on restart.

Common failure mode — the single straggler: One pod on a degraded node completes forward passes 3x slower than the rest. Because AllReduce is a synchronous barrier, every other rank waits. The straggler does not fail; it just runs slowly. The NCCL watchdog does not fire because NCCL is making forward progress. Your effective training throughput collapses and you do not see an error anywhere. This is exactly the failure straggler detection solves. Enable it explicitly in your NeMo config:

# In your NeMo YAML trainer config
trainer:
  plugins:
    - _target_: nemo.lightning.pytorch.plugins.MegatronDataSamplerPlugin
straggler_detection:
  enabled: true
  straggler_report_time_interval: 300
  stop_if_detected: true
  relative_gpu_performance_threshold: 0.7

Worked Example

Setup: 512 H100 nodes (4,096 GPUs), 70B parameter Llama-family model, BF16, tensor parallel 8 + pipeline parallel 4 + data parallel 16. Checkpoint size approximately 280 GB distributed across all nodes (shard per rank).

MTBF math: Per-node MTBF 72 hours. Cluster MTBF = 72 / 512 = 0.14 hours = 8.5 minutes. At a 30-minute checkpoint interval: expected wasted GPU-hours per failure = 4,096 x (15 min / 60) = 1,024 GPU-hours. At a 10-minute checkpoint interval with async write (8-second stall): effective checkpoint overhead = 8 sec / (10 min x 60 sec) = 1.3%. Expected wasted GPU-hours per failure = 4,096 x (5 min / 60) = 341 GPU-hours.

Conclusion: Cutting checkpoint interval from 30 to 10 minutes with async write reduces expected wasted GPU-hours per failure by 67%, at a cost of 1.3% throughput overhead. On a 30-day training run at $2/GPU-hour, that is roughly $640 saved per failure event. [AUTHOR: add actual cluster failure rate from a production run to validate this estimate]

Straggler impact observed in practice: One degraded NVLink link reducing a node to 60% throughput brought overall job MFU (model FLOP utilization) from 38% to 24% — a 37% effective throughput loss — with no errors in any log. Straggler detection with a 0.7 relative threshold would have terminated that rank in the next 300-second report cycle and triggered a restart from checkpoint, recovering full throughput. [AUTHOR: add anecdote from specific customer engagement]

Figure 4: MTBF vs Scale — Wasted GPU-Hours
As node count increases, cluster MTBF collapses; checkpoint interval becomes critical
Node count Cluster MTBF (hours) 64 128 256 512 1024 4096 0 0.5h 1h 1.5h 30m ckpt 15m ckpt 10m ckpt Cluster MTBF (72h per-node MTBF)
Cluster MTBF falls inversely with node count. At 512 nodes the expected inter-failure interval is under 10 minutes. Dashed lines show where various checkpoint intervals sit relative to cluster MTBF — anything above the MTBF line means checkpoints longer than expected time to next failure.

What I Would Do in Production

The verdict on a resilient training setup is not a single product choice. It is a stack of decisions that compound:

Scheduler: For a dedicated training cluster, Slurm with topology.conf built from your actual InfiniBand and NVLink fabric, and Slinky if you need container portability. For a shared platform that also runs inference and CI, Kubernetes with NVIDIA Run:ai for gang scheduling and GPU quota management. Do not run pure Kubernetes with only native PodGroup gang scheduling at scale — the partial eviction problem is real and nasty.

Checkpointing: Async torch_dist format, checkpoint every 10-15 minutes for clusters above 256 nodes, local checkpoint + replication if your shared FS is a contention point. Keep two valid checkpoint generations on disk (current and previous), because a checkpoint written during a hardware fault can itself be corrupt.

Resiliency: Layer all of: fault tolerance with ft_launcher (Slurm), straggler detection with a 0.7 relative threshold and stop_if_detected=True, preemption handling if running on shared Slurm with time limits, and async checkpoint save. Set TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC to 180 and monitor NCCL error logs actively — the default timeout of 600 seconds wastes 10 minutes of GPU time per hang event.

When not to use in-process restart: Do not use the experimental in-process restart feature for hardware-level faults (switch failures, NIC failures, GPU resets) — it is designed for software exceptions and deadlocks only. For hardware faults, you need a full job requeue and NCCL communicator rebuild. Mixing both paths without clear failure classification will leave you with a job that half-recovered and corrupt optimizer state.

Validate first: Before running a 30-day training job, inject failure intentionally. Kill one pod. Verify the watchdog fires within your configured timeout, the job requeues, and the checkpoint loads cleanly. Then kill a pod during a checkpoint write and verify the resulting checkpoint is not corrupt. This is the drill that actually validates your resilience stack — not reading the docs.

In Practice: Two straggler detectors exist in the NeMo/Megatron codebase: the NVRx version (from nvidia-resiliency-ext) and a legacy MCore implementation. Enable only the NVRx version. Running both simultaneously causes double-counting of performance scores and can trigger false positives that terminate healthy ranks. The NeMo Megatron Bridge docs explicitly call this out.
Disclaimer: NCCL environment variable names, PyTorch version requirements, and NeMo config schema fields change across releases. Verify the exact parameters against the NeMo container version you are using before applying to a production training run. The values in this post are verified against nemo:25.04 and NCCL 2.27.3 as of June 2026.

The Verdict

Multi-node training at thousands of GPUs is less about writing the training loop and more about building an operations layer that treats failure as the default state. The math is not optional: at 512 nodes with a realistic per-node MTBF, you will see a failure roughly every 8.5 minutes. If your checkpoint interval is longer than that and your restart is manual, you will burn most of your compute on replay rather than progress.

The NVIDIA toolchain gives you the pieces: Slurm with topology-aware gang scheduling, NeMo with ft_launcher and the Resiliency Extension, async torch_dist checkpointing, and NCCL 2.27 with ncclCommShrink for elastic recovery. None of these work in isolation. The teams I have seen run long training jobs without wasting massive amounts of compute are the ones that treat their resilience stack as a product with a test suite, not as a checkbox.

The next part of this series covers NVIDIA foundation models: the Nemotron family and the open vs proprietary question. For the VCF-specific deployment of NeMo training clusters on NVIDIA AI Enterprise, the Private AI Series covers orchestration and platform configuration.

If your training cluster is below 64 nodes and you are not seeing failures, your checkpoint interval matters less than getting the topology placement right. At 256+ nodes, the resilience stack is non-optional. At 1,024+ nodes, it is the difference between completing a training run and burning a budget.

NVIDIA AI Series · Part 25 of 30
« Previous: Part 24  |  NVIDIA AI Guide  |  Next: Part 26 »

References

About The Author


Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

Architect’s Toolkit

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

Discover more from Dr. Pranay Jha

Subscribe now to keep reading and get access to the full archive.

Continue reading