- At 1,000+ GPUs, node failures are not edge cases. They are scheduled maintenance you did not plan for.
- Slurm remains the production choice for pure training throughput; Kubernetes wins when you share infrastructure with inference and CI; NVIDIA Run:ai sits on top of Kubernetes and adds gang scheduling and quota management.
- Async distributed checkpointing with
torch_distformat cuts checkpoint stall from 2-3 minutes to under 10 seconds on large clusters. - The NVIDIA Resiliency Extension layers fault tolerance, straggler detection, and local checkpointing. No single layer covers every failure mode. Stack them.
- The math: at 512 nodes with a 72-hour MTBF per node, expected time to first failure is about 8.5 minutes. Checkpoint every 15 minutes, lose 7.5 minutes on average.
Your 256-node training job dies 18 hours in. The NCCL watchdog fired. One node kernel-panicked, the collective stalled, and PyTorch eventually timed out on TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC. You check the checkpoint. The last valid save was 47 minutes ago. That is 47 minutes times 2,048 GPUs of wasted compute, gone. This is not a hypothetical. At scale, this scenario describes a Tuesday.
Multi-node LLM training has three problems that compound each other: scheduling the right nodes in the right topology, writing checkpoints fast enough that a failure does not cost hours, and detecting and recovering from faults before the entire collective deadlocks. This part covers all three, with the specific tools NVIDIA ships and the trade-offs I have seen in practice.
The Scheduling Problem: Gang Placement or Nothing
Distributed training is an all-or-nothing workload. Every rank must start simultaneously and stay in lockstep for the life of the job. A scheduler that starts 200 of 256 pods and leaves the rest pending does not give you a partially functional training job. It gives you 200 processes that will block on their first AllReduce and eventually time out.
This is the fundamental tension between general-purpose container schedulers and HPC-style batch systems. Gang scheduling is the requirement: every process in the job starts together or the job does not start at all. Beyond gang scheduling, topology awareness matters enormously. An H100 DGX node communicates over NVLink at 900 GB/s within the node and drops to InfiniBand or Ethernet across nodes. A scheduler that spreads your 64-GPU job across 8 nodes in three different racks, crossing multiple spine switches, will see meaningfully worse AllReduce bandwidth than one that keeps those 8 nodes on the same leaf switch or in the same NVLink domain.
Slurm, Kubernetes, and NVIDIA Run:ai: The Real Comparison
Three schedulers dominate production GPU training. Here is a direct comparison without the marketing layer:
| Dimension | Slurm | Kubernetes (native) | NVIDIA Run:ai on K8s |
|---|---|---|---|
| Gang scheduling | Native; all-or-nothing by design | Not native; needs Volcano, Coscheduler, or DRA | Built-in gang scheduling via Run:ai gang plugin |
| Topology awareness | Mature; topology.conf, switch constraints, TRES | Improving; NVIDIA Topograph + DRA ComputeDomains in 2026 | NVIDIA topology identifiers exposed; GPU affinity policies |
| Container support | Pyxis/Enroot add-on; Slinky for K8s bridge | Native; first-class OCI containers | Native (K8s); Run:ai console for image management |
| Fault tolerance hooks | NeMo ft_launcher + Resiliency Ext; Slurm requeue | PyTorchJob restartPolicy; manual pod replacement | Run:ai job policies; integrates NeMo resiliency |
| Multi-tenant quota | Fair-share scheduling; partitions; QOS | ResourceQuota per namespace; coarse | Projects, departments, GPU quotas with over-provision |
| Operational overhead | High; HPC admin expertise required | Moderate; cloud-native team can manage | Low for users; licensing cost adds up per GPU |
| Best fit | Dedicated training clusters, HPC heritage teams | Mixed inference + training, cloud-native orgs | Shared GPU platforms needing SLA-backed scheduling |
NVIDIA Slinky bridges the gap: it runs Slurm inside Kubernetes, letting you submit Slurm jobs to a K8s-managed pool. NVIDIA has tested Slinky on clusters up to 8,000 GPUs and reports performance matching bare-metal Slurm deployments. The topology support in Slinky 1.1.0 (2026) includes dynamic topology registration keyed to the underlying Kubernetes node labels, which means NVLink domain and InfiniBand fabric hierarchy are visible to the Slurm scheduler without manual topology.conf updates. [VERIFY: Slinky 1.1.0 GA date and max tested node count]
PodGroup (Volcano) works in principle but in practice I have seen training jobs get partial allocations after a node eviction because the scheduler re-evaluated individual pods rather than the full gang. With Slurm or Run:ai, the entire allocation is atomic. On a shared K8s cluster without a dedicated gang scheduler, plan to lose training runs to partial evictions.
Checkpointing: The Math Behind the Interval
Checkpointing is the safety net, but naive checkpointing at large scale introduces its own problems. A synchronous checkpoint on a 500B parameter model in BF16 across 1,024 GPUs is writing roughly 1 TB of state to shared network storage. If that write takes 3 minutes and you checkpoint every 30 minutes, you are burning 10% of your compute time on I/O stalls. Checkpoint too infrequently, and a failure costs you hours.
The MTBF Calculation You Should Actually Run
For a cluster of N nodes each with an individual MTBF of M hours, the expected time to first failure in the cluster is M/N hours. At 512 nodes with a 72-hour per-node MTBF (realistic for GPU nodes under sustained training load including NVLink faults, driver crashes, and network timeouts), the expected time to first cluster-level failure is 72/512 = 0.14 hours, or about 8.5 minutes. If your checkpoint interval is 30 minutes, average lost work per failure is 15 minutes. At 512 x 8 GPUs per node = 4,096 GPUs, that is 4,096 GPU-hours of wasted compute per failure event, at whatever your GPU cost rate is.
The right checkpoint interval is the one that minimizes total wasted GPU-hours: wasted = (checkpoint_overhead_hours) + (MTBF_cluster / 2). At an 8.5-minute mean inter-failure interval, any checkpoint that completes in under 4 minutes and runs every 10-15 minutes is worth the I/O cost. This is exactly why async distributed checkpointing matters.
Checkpoint Strategy Trade-offs
| Strategy | Write Latency | Recovery Time | Storage Required | Best When |
|---|---|---|---|---|
| Sync to NFS/Lustre | 2-5 min (large model) | Fast (shared FS) | 1x model size | Small clusters, simple setup |
| Async (torch_dist) | 6-15 sec visible stall | Fast (shared FS) | 1x model size | Most production clusters; best latency reduction |
| Local checkpoint + replicate | Very fast (local NVMe) | Fast if nodes survive | Nx (replication degree) | FS contention is the bottleneck; high MTBF clusters |
| Distributed async (Megatron-Core) | Near-zero stall | Moderate (reassemble shards) | 1x across all nodes | Largest scale; tensor/pipeline parallel runs |
NeMo and Megatron-Core use the torch_dist format for distributed checkpointing. Each rank writes its own shard in parallel. PyTorch async checkpointing (available since PyTorch 2.4+) takes a snapshot of the optimizer and model states into CPU memory, then flushes in a background thread. PyTorch’s own benchmarks show a 7B parameter model checkpoint reduced from 148.8 seconds to 6.3 seconds effective stall — a 23x improvement. At 70B parameters the absolute numbers scale, but the relative benefit holds.
Local checkpointing in the NeMo Resiliency Extension writes first to node-local NVMe, then replicates shards to a configurable number of peer nodes. This avoids parallel NFS writes from hundreds of nodes hammering a shared filesystem. The trade-off: if a node fails before replication completes, you lose its shard. Set replication degree to at least 2 for any cluster where simultaneous multi-node failure is plausible.
Fault Tolerance: What NVIDIA Ships and What Actually Helps
The NVIDIA Resiliency Extension (nvidia-resiliency-ext) packages several fault-tolerance capabilities under one roof. The table from the NeMo Megatron Bridge docs (which now hosts the authoritative resiliency documentation) breaks it down clearly:
- Fault tolerance with ft_launcher: hang detection via section-based watchdog; automatic in-job restart (within the same Slurm allocation) or new job launch. Requires Slurm and ft_launcher, not torchrun.
- NVRx straggler detection: monitors per-rank GPU performance scores, both relative and absolute. Reports every 300 seconds by default with negligible throughput impact tested up to 1,000 H100 GPUs on Llama 3.1 training. Does not terminate training unless
stop_if_detected=Trueis set explicitly. - Preemption handling: catches SIGTERM, saves a checkpoint, exits cleanly. Slurm-specific via the PreemptionPlugin.
- Async checkpoint save: background workers with the
torch_distformat only. Other formats (zarr, fsdp_dtensor) do not support this path. - In-process restart (experimental): restarts training within the same OS process, avoiding container/CUDA context re-init overhead. Requires PyTorch >= 2.5.1 and NCCL >= 2.26.2. Not compatible with NeMo-Run or Slurm preemption plugins.
The practical caveat from the docs is worth repeating verbatim: no single resiliency feature covers all failure modes. Layer them. In production this means: fault tolerance for hang detection and restart, straggler detection to catch degraded nodes before they cause hangs, async checkpoint to minimize replay cost, and preemption handling for shared clusters with time limits.
NCCL 2.27 and ncclCommShrink
NCCL 2.27 introduced ncclCommShrink, which allows a training or inference job to remove faulted ranks from the communicator without tearing down and rebuilding from scratch. Recovery from a fault previously required every rank to call ncclCommAbort on the old communicator, then coordinate a fresh ncclCommInit with a new ncclUniqueId. ncclCommShrink with the NCCL_SHRINK_ABORT flag cancels hung operations and builds a new communicator from the surviving ranks, reusing the rank metadata from the old one. This is particularly useful for elastic inference serving where worker count changes; for training, where all ranks must participate in every collective, shrink is typically followed by a checkpoint restore and job requeue rather than continued training with a smaller communicator.
One common misconfiguration: if TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC is left at default (600 seconds on some builds), a stalled AllReduce will sit for 10 minutes before PyTorch’s watchdog fires. Meanwhile, healthy ranks burn GPU cycles on empty iterations. Set this to a value shorter than your checkpoint interval. 180 seconds is a reasonable starting point for a cluster where AllReduce operations genuinely complete in under 60 seconds. Also set NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD for your InfiniBand fabric; the defaults are sized for small clusters.
The Production Artifact: PyTorchJob with Checkpoint and Restart
Below is a representative Kubernetes PyTorchJob manifest for a multi-node NeMo training run with distributed checkpointing and fault-tolerant restart policy. This uses the NVIDIA GPU Operator for device plugin support and assumes a shared NFS mount at /checkpoints. The critical failure mode follows the manifest.
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
name: nemo-llm-pretrain-256node
namespace: ml-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
containers:
- name: pytorch
image: nvcr.io/nvidia/nemo:25.04
command:
- ft_launcher
- --fault-tol-cfg-path=/cfg/ft_config.yaml
- --
- python
- /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py
- --config-path=/cfg
- --config-name=pretrain_70b.yaml
env:
- name: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
value: "180"
- name: NCCL_NVLS_ENABLE
value: "0"
- name: NCCL_DEBUG
value: "WARN"
resources:
limits:
nvidia.com/gpu: 8
rdma/hca: 1
volumeMounts:
- name: checkpoints
mountPath: /checkpoints
- name: cfg
mountPath: /cfg
volumes:
- name: checkpoints
nfs:
server: nfs-server.cluster.local
path: /exports/checkpoints
- name: cfg
configMap:
name: nemo-pretrain-cfg
Worker:
replicas: 255
restartPolicy: OnFailure
template:
spec:
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
containers:
- name: pytorch
image: nvcr.io/nvidia/nemo:25.04
command:
- ft_launcher
- --fault-tol-cfg-path=/cfg/ft_config.yaml
- --
- python
- /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py
- --config-path=/cfg
- --config-name=pretrain_70b.yaml
env:
- name: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
value: "180"
- name: NCCL_NVLS_ENABLE
value: "0"
resources:
limits:
nvidia.com/gpu: 8
rdma/hca: 1
volumeMounts:
- name: checkpoints
mountPath: /checkpoints
- name: cfg
mountPath: /cfg
volumes:
- name: checkpoints
nfs:
server: nfs-server.cluster.local
path: /exports/checkpoints
- name: cfg
configMap:
name: nemo-pretrain-cfg
Expected behavior: ft_launcher on each pod monitors the training process with section-based watchdog timers (different timeouts for init, training step, and checkpoint). On a transient NCCL hang, ft_launcher terminates the process, Kubernetes sees a non-zero exit code, and restarts the pod per restartPolicy: OnFailure. NeMo auto-resumes from the latest checkpoint directory on restart.
Common failure mode — the single straggler: One pod on a degraded node completes forward passes 3x slower than the rest. Because AllReduce is a synchronous barrier, every other rank waits. The straggler does not fail; it just runs slowly. The NCCL watchdog does not fire because NCCL is making forward progress. Your effective training throughput collapses and you do not see an error anywhere. This is exactly the failure straggler detection solves. Enable it explicitly in your NeMo config:
# In your NeMo YAML trainer config
trainer:
plugins:
- _target_: nemo.lightning.pytorch.plugins.MegatronDataSamplerPlugin
straggler_detection:
enabled: true
straggler_report_time_interval: 300
stop_if_detected: true
relative_gpu_performance_threshold: 0.7
Worked Example
Setup: 512 H100 nodes (4,096 GPUs), 70B parameter Llama-family model, BF16, tensor parallel 8 + pipeline parallel 4 + data parallel 16. Checkpoint size approximately 280 GB distributed across all nodes (shard per rank).
MTBF math: Per-node MTBF 72 hours. Cluster MTBF = 72 / 512 = 0.14 hours = 8.5 minutes. At a 30-minute checkpoint interval: expected wasted GPU-hours per failure = 4,096 x (15 min / 60) = 1,024 GPU-hours. At a 10-minute checkpoint interval with async write (8-second stall): effective checkpoint overhead = 8 sec / (10 min x 60 sec) = 1.3%. Expected wasted GPU-hours per failure = 4,096 x (5 min / 60) = 341 GPU-hours.
Conclusion: Cutting checkpoint interval from 30 to 10 minutes with async write reduces expected wasted GPU-hours per failure by 67%, at a cost of 1.3% throughput overhead. On a 30-day training run at $2/GPU-hour, that is roughly $640 saved per failure event. [AUTHOR: add actual cluster failure rate from a production run to validate this estimate]
Straggler impact observed in practice: One degraded NVLink link reducing a node to 60% throughput brought overall job MFU (model FLOP utilization) from 38% to 24% — a 37% effective throughput loss — with no errors in any log. Straggler detection with a 0.7 relative threshold would have terminated that rank in the next 300-second report cycle and triggered a restart from checkpoint, recovering full throughput. [AUTHOR: add anecdote from specific customer engagement]
What I Would Do in Production
The verdict on a resilient training setup is not a single product choice. It is a stack of decisions that compound:
Scheduler: For a dedicated training cluster, Slurm with topology.conf built from your actual InfiniBand and NVLink fabric, and Slinky if you need container portability. For a shared platform that also runs inference and CI, Kubernetes with NVIDIA Run:ai for gang scheduling and GPU quota management. Do not run pure Kubernetes with only native PodGroup gang scheduling at scale — the partial eviction problem is real and nasty.
Checkpointing: Async torch_dist format, checkpoint every 10-15 minutes for clusters above 256 nodes, local checkpoint + replication if your shared FS is a contention point. Keep two valid checkpoint generations on disk (current and previous), because a checkpoint written during a hardware fault can itself be corrupt.
Resiliency: Layer all of: fault tolerance with ft_launcher (Slurm), straggler detection with a 0.7 relative threshold and stop_if_detected=True, preemption handling if running on shared Slurm with time limits, and async checkpoint save. Set TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC to 180 and monitor NCCL error logs actively — the default timeout of 600 seconds wastes 10 minutes of GPU time per hang event.
When not to use in-process restart: Do not use the experimental in-process restart feature for hardware-level faults (switch failures, NIC failures, GPU resets) — it is designed for software exceptions and deadlocks only. For hardware faults, you need a full job requeue and NCCL communicator rebuild. Mixing both paths without clear failure classification will leave you with a job that half-recovered and corrupt optimizer state.
Validate first: Before running a 30-day training job, inject failure intentionally. Kill one pod. Verify the watchdog fires within your configured timeout, the job requeues, and the checkpoint loads cleanly. Then kill a pod during a checkpoint write and verify the resulting checkpoint is not corrupt. This is the drill that actually validates your resilience stack — not reading the docs.
The Verdict
Multi-node training at thousands of GPUs is less about writing the training loop and more about building an operations layer that treats failure as the default state. The math is not optional: at 512 nodes with a realistic per-node MTBF, you will see a failure roughly every 8.5 minutes. If your checkpoint interval is longer than that and your restart is manual, you will burn most of your compute on replay rather than progress.
The NVIDIA toolchain gives you the pieces: Slurm with topology-aware gang scheduling, NeMo with ft_launcher and the Resiliency Extension, async torch_dist checkpointing, and NCCL 2.27 with ncclCommShrink for elastic recovery. None of these work in isolation. The teams I have seen run long training jobs without wasting massive amounts of compute are the ones that treat their resilience stack as a product with a test suite, not as a checkbox.
The next part of this series covers NVIDIA foundation models: the Nemotron family and the open vs proprietary question. For the VCF-specific deployment of NeMo training clusters on NVIDIA AI Enterprise, the Private AI Series covers orchestration and platform configuration.
If your training cluster is below 64 nodes and you are not seeing failures, your checkpoint interval matters less than getting the topology placement right. At 256+ nodes, the resilience stack is non-optional. At 1,024+ nodes, it is the difference between completing a training run and burning a budget.
References
- NVIDIA NeMo Megatron Bridge: Resiliency Features (docs.nvidia.com, 2026)
- Building Scalable and Fault-Tolerant NCCL Applications (NVIDIA Technical Blog, Nov 2025)
- Enabling Fast Inference and Resilient Training with NCCL 2.27 (NVIDIA Technical Blog)
- Reducing Model Checkpointing Times by Over 10x with PyTorch Distributed Asynchronous Checkpointing (pytorch.org)
- Running AI Workloads on Rack-Scale Supercomputers: Topology-Aware Scheduling (NVIDIA Technical Blog)
- NVIDIA Slinky: Run Slurm Inside Kubernetes (nvidia.com)
- NVIDIA Resiliency Extension (github.com/NVIDIA)



