Dr. Pranay Jha


NVLink and NVSwitch: How NVIDIA Builds the Scale-Up Fabric (NVIDIA AI Series, Part 7)

Fifth-generation NVLink delivers 1.8 TB/s per GPU, and NVSwitch builds a non-blocking 130 TB/s all-to-all fabric across 72 GPUs in the GB200 NVL72. Here is how the domain forms, why it determines your tensor and expert parallelism strategy, and where the boundary falls.

NVIDIA AI Series · Part 7 of 30
TL;DR: NVLink (5th-gen, 1.8 TB/s per GPU) and NVSwitch form a non-blocking all-to-all fabric inside a node and across an NVL72 rack (72 GPUs, 130 TB/s aggregate). That fabric is the reason tensor parallelism and expert parallelism work at scale without InfiniBand bottlenecks. But the domain has a hard boundary: once your parallelism strategy crosses the NVLink domain you fall onto InfiniBand, and the latency and bandwidth gap is large enough to dictate how you split your model. Know the boundary before you plan your parallelism layout.
Who this is for: AI infrastructure architects and platform engineers sizing or operating DGX/HGX/NVL72 systems for LLM training or inference. You should already know what a GPU is and why it has HBM. If you are brand new to the stack, start with the NVIDIA AI guide. If you want the scale-out (InfiniBand/Spectrum-X) side, that is Part 8.

Run nvidia-smi topo -m on a DGX H100 and you will see every GPU-to-GPU cell marked NV18. That code means 18 active NVLink links connect those two GPUs through the switch fabric, each link carrying about 50 GB/s bidirectional on Hopper, for a combined 900 GB/s between any pair (Blackwell doubles the per-link rate to about 100 GB/s, for 1.8 TB/s). Now imagine needing that same bandwidth simultaneously between all eight GPUs in the node, and then across all 72 GPUs in a rack, without any link becoming a bottleneck. That is the engineering problem NVSwitch solves. Getting this fabric right determines whether a 405-billion-parameter model fits inside one NVLink domain or needs to be sharded across InfiniBand, which is a fundamentally different and slower proposition. The fabric is not a footnote to GPU specs; it is the reason the GB200 NVL72 can serve trillion-parameter models in real time.

NVLink Generations: What Changed and Why It Matters

NVLink started on Pascal in 2016 with a modest 160 GB/s per GPU. The progression since then has been consistent: per-GPU bandwidth has grown by roughly 1.5x to 2x with each GPU generation. But the architectural decision to use NVSwitch as a crossbar fabric, rather than daisy-chaining GPUs in a ring or mesh, is what changed the game in the Volta generation (NVLink 2) and made all subsequent scaling practical.

| Generation | Architecture | Links per GPU | BW per GPU (bidirectional) | Max GPU Domain |
| --- | --- | --- | --- | --- |
| NVLink 1 | Pascal (P100) | 4 | 160 GB/s | 2 (peer only) |
| NVLink 2 | Volta (V100) | 6 | 300 GB/s | 8 (via NVSwitch) |
| NVLink 3 | Ampere (A100) | 12 | 600 GB/s | 8 (via NVSwitch) |
| NVLink 4 | Hopper (H100) | 18 | 900 GB/s | 8 (via NVSwitch) |
| NVLink 5 | Blackwell (B200/GB200) | 18 | 1,800 GB/s | 72 (NVL72), 576 (multi-node) |
| NVLink 6 | Rubin (Vera Rubin) | 36 | 3,600 GB/s | 72 (NVL72, 260 TB/s), 576+ (multi-node) |

Two things stand out in that table. First, the link count per GPU stayed at 18 from Hopper to Blackwell; NVIDIA doubled bandwidth by increasing signaling speed per lane, not by adding more physical wires. Second, Blackwell is the first generation where the NVSwitch-based domain broke out of the 8-GPU node limit and scaled to 72 GPUs in a rack. That expansion required a new NVLink Switch chip with enough crossbar capacity to run all 72 GPUs at full 1.8 TB/s, non-blocking. On Rubin (NVLink 6), link count doubles to 36 and per-GPU bandwidth reaches 3.6 TB/s, pushing the NVL72 aggregate to 260 TB/s.

[Figure: bar chart of NVLink bidirectional bandwidth per GPU (GB/s) by generation: Pascal 160, Volta 300, Ampere 600, Hopper 900, Blackwell 1,800, Rubin 3,600.]
Fig 1: NVLink per-GPU bidirectional bandwidth growth across six generations. Blackwell doubles Hopper; Rubin doubles Blackwell again.

How NVSwitch Builds the All-to-All Fabric

NVLink by itself is a point-to-point protocol. Connect GPU A to GPU B and they can talk at 1.8 TB/s. But a training AllReduce across eight GPUs needs every GPU to send to every other GPU simultaneously. A direct mesh of eight GPUs would require 28 separate cables and each GPU would need a port for every other peer. That does not scale.

NVSwitch solves this with a non-blocking crossbar. Every GPU connects its NVLink lanes to an NVSwitch chip (or to multiple NVSwitch chips on larger systems), and the switch fabric routes any GPU-to-GPU transfer without head-of-line blocking. The crossbar ensures that when GPU 0 is sending to GPU 3 and GPU 4 is sending to GPU 7, both transfers run simultaneously at full link speed with no interference. This is the same property that makes a 64-port non-blocking Ethernet switch more useful than a daisy-chained ring, and it matters enormously for collective operations.
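The scaling argument against a direct mesh can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each GPU splits its links evenly across peers; the function names are illustrative and not part of any NVIDIA tool.

```python
def mesh_connections(num_gpus: int) -> int:
    """Point-to-point connections needed for a full direct mesh: n*(n-1)/2."""
    return num_gpus * (num_gpus - 1) // 2


def mesh_links_per_pair(links_per_gpu: int, num_gpus: int) -> int:
    """Whole links a GPU can dedicate to each peer when it splits
    its links evenly across all (num_gpus - 1) peers."""
    return links_per_gpu // (num_gpus - 1)


if __name__ == "__main__":
    for n in (8, 72):
        print(
            f"{n} GPUs: {mesh_connections(n)} mesh connections, "
            f"{mesh_links_per_pair(18, n)} of 18 links available per pair"
        )
```

With 18 links per GPU, an 8-GPU mesh leaves only two whole links for any one pair, and a 72-GPU mesh leaves none. Behind a switch, all 18 links are available to whichever peer a GPU is talking to at that moment.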

[Figure: NVSwitch crossbar in a DGX H100 (8 GPUs, 4 NVSwitch chips). GPU 0 through GPU 7 each connect to NVSwitch 0 through NVSwitch 3, so all GPU pairs can communicate simultaneously at full NVLink speed.]
Fig 2: NVSwitch crossbar inside a DGX/HGX H100 node. Four NVSwitch chips create a full crossbar for 8 GPUs; any pair can transfer at 900 GB/s (NVLink 4) with no blocking.

SHARP: In-Network Reductions

Starting with the third-generation NVSwitch in Hopper systems, NVIDIA added SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) engines inside the switch chip itself; the Blackwell-generation NVLink Switch extends them with FP8 support. SHARP offloads the reduction step of an AllReduce to the switch hardware, so gradient synchronization during training does not need to pass partial sums from GPU to GPU. The switch sums tensors in flight. At 72-GPU scale this is meaningful: instead of exchanging partial results with its 71 peers over many steps, each GPU sends its data to the switch once and receives the reduced result back. NVIDIA cites a 4x bandwidth efficiency gain from SHARP FP8 support on the Blackwell-generation NVLink Switch; treat the exact multiplier as unverified for production workloads until you have measured it.

In practice: On H100 DGX nodes, SHARP is active for collective operations when NCCL is recent enough to include NVLink SHARP (NVLS) support, which arrived in the 2.17 release line, and the NVSwitch firmware supports it. You do not normally configure SHARP manually in NCCL; it is detected and activated automatically. The place things break is when some nodes in a cluster are on older NVSwitch firmware that does not report SHARP capability, causing NCCL to fall back to ring-based AllReduce for the entire job rather than only on those nodes. Validate firmware homogeneity across all nodes before you run distributed training at scale.
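The homogeneity check itself is easy to script once you have one firmware version string per node. The sketch below assumes you have already collected those strings into a dictionary (how you collect them, whether via BMC, DCGM, or your inventory system, is outside this example); the node names and version strings are hypothetical.

```python
from collections import defaultdict


def group_by_firmware(versions: dict[str, str]) -> dict[str, list[str]]:
    """Group node names by the firmware version string each node reported."""
    groups: dict[str, list[str]] = defaultdict(list)
    for node, firmware in sorted(versions.items()):
        groups[firmware].append(node)
    return dict(groups)


def is_homogeneous(versions: dict[str, str]) -> bool:
    """True when every node reports the same firmware version."""
    return len(set(versions.values())) <= 1


if __name__ == "__main__":
    # Hypothetical inventory: node name -> NVSwitch firmware version string.
    inventory = {"node01": "fw-A", "node02": "fw-A", "node03": "fw-B"}
    if not is_homogeneous(inventory):
        for firmware, nodes in group_by_firmware(inventory).items():
            print(f"{firmware}: {', '.join(nodes)}")
```

Running a check like this as a pre-flight step in the job launcher is one way to catch a mixed-firmware cluster before it shows up as an unexplained throughput drop.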

The NVL72: How 72-GPU All-to-All Works at Rack Scale

The GB200 NVL72 is a 72-GPU rack-scale system: 36 Grace CPU sockets paired with 72 Blackwell GPUs, all connected in a single NVLink domain by nine NVLink Switch trays. The total aggregate GPU-to-GPU bandwidth is 130 TB/s. Every GPU in the rack can talk to every other GPU at its full 1.8 TB/s link speed, simultaneously, with no blocking. NVIDIA describes this as 72 GPUs acting as a single GPU, which is marketing language, but the underlying physics is real: the latency and bandwidth inside the domain are homogeneous regardless of which two GPUs are communicating. That homogeneity is what allows you to place tensor-parallel shards arbitrarily across all 72 GPUs without worrying about slower hops for certain shard pairs.
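The 130 TB/s figure is simply the per-GPU NVLink bandwidth multiplied by the GPU count, rounded. A minimal sketch of that arithmetic, using the per-GPU numbers quoted in this article and integer GB/s to avoid floating-point noise:

```python
def aggregate_gb_per_s(num_gpus: int, per_gpu_gb_per_s: int) -> int:
    """Aggregate NVLink bandwidth of a non-blocking domain, in GB/s."""
    return num_gpus * per_gpu_gb_per_s


if __name__ == "__main__":
    # (label, GPU count, per-GPU bidirectional NVLink bandwidth in GB/s)
    domains = [
        ("DGX H100 (NVLink 4)", 8, 900),
        ("GB200 NVL72 (NVLink 5)", 72, 1800),
    ]
    for label, gpus, per_gpu in domains:
        total = aggregate_gb_per_s(gpus, per_gpu)
        print(f"{label}: {total / 1000:.1f} TB/s")
```

The NVL72 result is 129.6 TB/s, which NVIDIA rounds to 130 TB/s. The multiplication only holds because the fabric is non-blocking; on an oversubscribed fabric the aggregate would be lower than the sum of the endpoints.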

Roughly 13.5 TB of HBM3e is spread across the 72 GPUs (about 30 TB of fast memory once the Grace CPUs' LPDDR5X is counted), and the HBM is reachable by any GPU in the domain via NVLink with no CPU mediation. This is what makes serving a 405B-parameter model (Llama 3.1 405B, for example) in real time practical: the weights fit in the domain aggregate memory and the activations can be streamed across the fabric without copying through host DRAM or going across a PCIe root complex.

[Figure: GB200 NVL72 rack as a single NVLink domain. Nine NVLink Switch trays (non-blocking crossbar with SHARP FP8 engines) sit between two groups of GPU compute trays. Totals: 72 Blackwell GPUs, 30 TB fast memory (HBM3e plus Grace LPDDR5X), 130 TB/s NVLink fabric.]
Fig 3: GB200 NVL72 rack topology. Nine NVLink Switch trays form the crossbar spine; all GPU trays connect to all switch trays. The 72 GPUs share a single NVLink domain with 130 TB/s aggregate bandwidth.

Multi-Rack NVLink: Up to 576 GPUs

For workloads that require more than 72 GPUs within the NVLink domain, NVIDIA supports multi-rack NVLink configurations that can span up to 576 Blackwell GPUs. This is achieved by connecting NVLink Switch trays across racks using NVLink cables, extending the same non-blocking all-to-all fabric to a larger set of GPUs. The bandwidth homogeneity is maintained; every GPU-to-GPU pair still communicates at 1.8 TB/s. This differs fundamentally from InfiniBand scale-out, where you accept higher latency and lower effective bisection bandwidth as you cross switch tiers.

NVLink-C2C: The Grace-to-Blackwell Link

NVLink-C2C (Chip-to-Chip) is a variant of the NVLink protocol implemented as a direct chip-to-chip interconnect. In the Grace Blackwell Superchip, one Grace CPU and two Blackwell GPUs are packaged together, and NVLink-C2C provides the CPU-GPU link at 900 GB/s bidirectional. Inside each Blackwell GPU, a separate 10 TB/s die-to-die link joins the GPU's two dies so they behave as one device. The 3.6 TB/s figure quoted for the full GB200 Superchip is the combined NVLink 5 bandwidth of its two GPUs (2 x 1.8 TB/s) into the switch fabric, not the bandwidth between the GPUs and the Grace CPU.

Why does the CPU-GPU link matter for an AI infrastructure architect? Because it determines how efficiently the CPU can stage data for the GPU and how much CPU-side memory (the Grace CPU has up to 480 GB of LPDDR5X) participates in the memory hierarchy. NVLink-C2C makes the Grace memory directly addressable by the GPU at close to NVLink speeds, which changes the effective working set for large-context inference. The GPU does not have to first copy data to its own HBM; it can read Grace DRAM across the C2C link. That matters for very long context lengths where KV caches overflow HBM.

Worked example

Consider Llama 3.1 405B in FP8 precision: the model weights require roughly 405 GB. A single NVL72 with 72 GPUs at up to 186 GB HBM3e each has about 13.4 TB of HBM available, so 405B fits easily in GPU memory with room for KV caches. Tensor parallelism (TP) = 72 means every attention and MLP block triggers an AllReduce across all 72 GPUs over NVLink at 1.8 TB/s. Compare this to an 8-GPU Hopper node at 900 GB/s with 640 GB of HBM (8 x 80 GB): the FP8 weights alone fit, but once KV caches, activations, or training state push past that capacity you need pipeline parallelism (PP) across nodes, which puts inter-node InfiniBand latency and pipeline bubbles on the critical path. The NVL72 avoids those pipeline bubbles for this model size. For a 1T-parameter model, the FP8 weights (~1 TB) still fit in one rack for inference, but training adds gradients and optimizer state that can multiply the footprint well past a single rack's HBM; at that point you need multiple NVL72 racks and you are back to pipeline or expert parallelism crossing the InfiniBand boundary.
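The capacity check in the worked example can be written as a small helper. This is a rough sketch, not a sizing tool: the 80% headroom factor and the 16 bytes per parameter used for training state are assumptions for illustration, and per-GPU HBM capacity is a parameter you should take from your platform's datasheet.

```python
def footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory footprint in GB: 1e9 params * bytes / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def fits_in_domain(
    params_billion: float,
    bytes_per_param: float,
    num_gpus: int,
    hbm_gb_per_gpu: float,
    headroom: float = 0.8,  # assumed: leave 20% for KV cache and activations
) -> bool:
    """True if the footprint fits within `headroom` of the domain's total HBM."""
    budget_gb = num_gpus * hbm_gb_per_gpu * headroom
    return footprint_gb(params_billion, bytes_per_param) <= budget_gb


if __name__ == "__main__":
    # 8 x 80 GB Hopper node, Llama 3.1 405B.
    print("FP8 weights: ", fits_in_domain(405, 1, 8, 80))
    print("BF16 weights:", fits_in_domain(405, 2, 8, 80))
    # Assumed ~16 bytes/param for weights + gradients + optimizer state.
    print("Training:    ", fits_in_domain(405, 16, 8, 80))
```

The same function applies to an NVL72 by passing 72 GPUs and the per-GPU HBM capacity of the configuration you are buying.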

Why Scale-Up Changes Your Parallelism Strategy

There are three main parallelism strategies for large models: tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism (EP) for mixture-of-experts models. Each has a different sensitivity to the interconnect.

Tensor parallelism splits each weight matrix across GPUs. Every forward pass triggers two AllReduce operations per transformer layer. At high tensor-parallel degree, the AllReduce volume is large and the latency of the operation is directly on the critical path. Tensor parallelism works well only inside a single NVLink domain. Once you cross the InfiniBand boundary, the latency of an AllReduce at 200 Gb/s InfiniBand is 10 to 50x higher than across NVLink, and TP degree above 8 over InfiniBand is nearly always net-negative for throughput. The practical consequence: keep TP within one NVLink domain (8 GPUs for Hopper, 72 for Blackwell NVL72).
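To get a feel for the gap, a bandwidth-only ring AllReduce model is enough. The sketch below assumes the textbook ring cost of 2(N-1)/N times the message size divided by per-direction bandwidth, ignores per-hop latency and in-switch SHARP reductions, and uses a hypothetical 1 GB message. The 450 GB/s figure is half of Hopper's 900 GB/s bidirectional NVLink; 25 GB/s is 200 Gb/s InfiniBand expressed in bytes.

```python
def ring_allreduce_seconds(
    message_bytes: float, num_gpus: int, bandwidth_bytes_per_s: float
) -> float:
    """Bandwidth-only ring AllReduce estimate: each GPU sends and receives
    2*(N-1)/N of the message over its link. Latency terms are ignored."""
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    message = 1e9        # hypothetical 1 GB AllReduce payload
    nvlink4 = 450e9      # bytes/s per direction (half of 900 GB/s bidirectional)
    infiniband = 25e9    # bytes/s per direction (200 Gb/s)

    t_nvlink = ring_allreduce_seconds(message, 8, nvlink4)
    t_ib = ring_allreduce_seconds(message, 8, infiniband)
    print(f"NVLink 4:   {t_nvlink * 1e3:.2f} ms")
    print(f"InfiniBand: {t_ib * 1e3:.2f} ms")
    print(f"Ratio:      {t_ib / t_nvlink:.0f}x")
```

Under these assumptions the bandwidth term alone gives an 18x gap; latency per hop, which this model leaves out, generally widens it for the small messages typical of tensor parallelism.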

Pipeline parallelism stages model layers across GPUs sequentially. The communication cost is the activation tensor sent between pipeline stages; this is much smaller than an AllReduce, so it tolerates higher latency and works reasonably well across InfiniBand. The cost is pipeline bubbles: GPUs sit idle while the pipeline fills and drains. With p pipeline stages and m microbatches per step, the bubble fraction of a GPipe-style schedule is (p - 1) / (m + p - 1), which grows as you add pipeline stages and shrinks as you add microbatches; with a single microbatch it is (p - 1) / p. The NVL72 lets you set TP = 72 and PP = 1 for models that fit in 72 GPUs worth of memory, eliminating pipeline bubbles entirely for those models.
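The bubble cost is easy to tabulate. The sketch below uses the commonly cited estimate for GPipe- and 1F1B-style schedules, (p - 1) / (m + p - 1) for p stages and m microbatches, and assumes every stage takes the same time per microbatch. With a single microbatch it reduces to (p - 1) / p.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Fraction of step time spent idle in pipeline fill and drain,
    assuming equal per-stage time: (p - 1) / (m + p - 1)."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


if __name__ == "__main__":
    for p in (1, 4, 8):
        for m in (1, 8, 32):
            print(f"PP={p:2d} microbatches={m:2d} bubble={bubble_fraction(p, m):.3f}")
```

PP = 1 always gives a bubble of zero, which is the case the NVL72 makes reachable for models that fit in one domain. For deeper pipelines, raising the microbatch count is the main lever, at the cost of more activation memory.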

Expert parallelism for MoE models routes tokens to a subset of expert layers. The communication pattern is an AlltoAll: each GPU sends token embeddings to whichever GPU holds the relevant expert, receives outputs back, and then aggregates. AlltoAll has a similar latency sensitivity to AllReduce; it should stay within the NVLink domain. MoE models like Mixtral or NVIDIA's own Nemotron MoE variants run far more efficiently when the expert-parallel group stays inside one NVLink domain. With a 72-GPU NVLink domain, you can run EP up to 72 (given a model with at least that many experts), meaning every expert shard is reachable without leaving the fabric.

Parallelism Strategy vs. Required Interconnect

| Strategy | Communication Pattern | Interconnect Needed |
| --- | --- | --- |
| Tensor Parallelism | AllReduce (2x/layer) | NVLink only (latency-sensitive AllReduce) |
| Pipeline Parallelism | Point-to-point activations | InfiniBand (tolerates latency) |
| Expert Parallelism (MoE) | AlltoAll per token | NVLink preferred (AlltoAll latency-sensitive) |
| Data Parallelism | Gradient AllReduce | InfiniBand (async overlap) |
Fig 4: Parallelism strategy vs. interconnect sensitivity. Tensor and expert parallelism must stay inside the NVLink domain; pipeline and data parallelism can cross InfiniBand.

Reading the NVLink Topology in Production

Two commands cover most NVLink fabric health checks: nvidia-smi topo -m for the topology matrix and nvidia-smi nvlink -s -i <gpu_index> for per-link status on a specific GPU.

Topology Matrix: What to Look For

# On a DGX H100 (8 x H100, NVLink 4 / 18 links per GPU):
$ nvidia-smi topo -m

# Truncated output (GPU rows only):
#             GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7
# GPU0         X    NV18   NV18   NV18   NV18   NV18   NV18   NV18
# GPU1        NV18    X    NV18   NV18   NV18   NV18   NV18   NV18
# GPU2        NV18  NV18    X     NV18   NV18   NV18   NV18   NV18
# GPU3        NV18  NV18   NV18    X     NV18   NV18   NV18   NV18
# GPU4        NV18  NV18   NV18   NV18    X     NV18   NV18   NV18
# GPU5        NV18  NV18   NV18   NV18   NV18    X     NV18   NV18
# GPU6        NV18  NV18   NV18   NV18   NV18   NV18    X     NV18
# GPU7        NV18  NV18   NV18   NV18   NV18   NV18   NV18    X

# NV18 = 18 NVLink links active between this GPU pair
# If any cell shows PIX, NODE, or SYS instead of NV18, that pair
# is NOT connected via NVLink and will route through PCIe or the
# CPU interconnect (QPI/UPI). That GPU pair will see dramatically
# lower bandwidth (~64 GB/s bidirectional on PCIe Gen 4 x16, ~128 GB/s
# on Gen 5, vs 900 GB/s NVLink 4), breaking tensor-parallel performance.

# Per-link status for GPU 0:
$ nvidia-smi nvlink -s -i 0

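# Expected output (exact format varies by driver version; recent
# drivers print a per-link speed rather than the word "Active"):
# GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
#         Link 0: 26.562 GB/s
#         Link 1: 26.562 GB/s
#         ...
#         Link 17: 26.562 GB/s
# All 18 links must report a speed. A link reported as inactive reduces
# GPU0's NVLink bandwidth by 1/18th: ~50 GB/s of 900 GB/s on Hopper,
# ~100 GB/s of 1.8 TB/s on Blackwell.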

# Check link error counters (replay errors indicate physical layer issues):
$ nvidia-smi nvlink --errorcounters -i 0

Gotcha

If nvidia-smi topo -m shows SYS between two GPUs that should be NVLink-connected, suspect the switch side (a failed NVSwitch or switch tray) before you suspect a single bad cable. A subtler failure is a dead NVSwitch chip that leaves every GPU pair reachable but removes one path from the crossbar: the links do not necessarily appear inactive, and bisection bandwidth drops silently. NCCL may not warn you; your training throughput will drop by roughly 1/num_switches of what the crossbar should deliver. On an NVL72 with 9 switch trays, one dead tray means roughly 11% less fabric bandwidth. Validate switch tray health through DCGM or the BMC before blaming NCCL configuration or model hyperparameters for unexplained throughput degradation.

The Domain Boundary: Where NVLink Ends and InfiniBand Begins

The NVLink domain has a hard architectural boundary. For a standard Hopper DGX/HGX node, that boundary is 8 GPUs. For a Blackwell NVL72 rack, it is 72 GPUs. For multi-rack NVLink configurations, it can reach 576 Blackwell GPUs. Beyond that boundary, all GPU-to-GPU traffic goes through InfiniBand (or Spectrum-X Ethernet), which is covered in Part 8.

| Platform | NVLink Domain Size | Aggregate BW | Best Fits (Model Size) | Scale-Out Needed? |
| --- | --- | --- | --- | --- |
| DGX H100 / HGX H100 | 8 GPUs | 7.2 TB/s | Up to ~70B params (FP8) | Yes, for 70B+ training |
| GB200 NVL72 (single rack) | 72 GPUs | 130 TB/s | Up to ~700B params (FP8) | Only for 700B+ training |
| Multi-rack NVLink (Blackwell) | Up to 576 GPUs | >1 PB/s (unverified) | Trillion-param+ training | Not for TP/EP; PP for larger |
| Vera Rubin NVL72 | 72 GPUs | 260 TB/s | Trillion-param inference (shipping date unverified) | Only for multi-trillion |

The domain boundary matters practically when you are sizing a parallelism configuration. If your model fits within one NVLink domain with your chosen precision, you can run TP = (domain size) and set PP = 1. If your model requires more GPUs than the domain, you must add pipeline parallelism across the InfiniBand fabric, accepting pipeline bubbles and the associated throughput hit. This is the central design trade-off of the NVL72: for models up to roughly 700B parameters in FP8, the entire training run can stay inside NVLink. For larger models or very large batch sizes, you go multi-rack and accept the InfiniBand overhead on the pipeline dimension only.
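That sizing rule reduces to a short function. This is a simplified sketch of the decision described above, not a framework API: it assumes you already know how many GPUs the model needs for memory, ignores data parallelism, and does not check whether your framework or the model's head count supports the resulting TP degree.

```python
import math


def tp_pp_layout(gpus_needed: int, domain_size: int) -> tuple[int, int]:
    """Return (TP, PP): keep tensor parallelism inside one NVLink domain
    and add pipeline stages across domains only when the model overflows it."""
    if gpus_needed < 1 or domain_size < 1:
        raise ValueError("gpus_needed and domain_size must be >= 1")
    if gpus_needed <= domain_size:
        return gpus_needed, 1
    return domain_size, math.ceil(gpus_needed / domain_size)


if __name__ == "__main__":
    for needed, domain in ((72, 72), (144, 72), (72, 8)):
        tp, pp = tp_pp_layout(needed, domain)
        print(f"need {needed} GPUs, domain {domain}: TP={tp}, PP={pp}")
```

A model that needs 72 GPUs runs as TP = 72, PP = 1 on an NVL72 but as TP = 8, PP = 9 on Hopper nodes, which is where the pipeline bubbles and the InfiniBand hops come from.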

My Verdict: When NVLink Domain Size Actually Changes What You Do

The NVLink domain size matters most at two thresholds: when your model barely fits in (or barely overflows) the domain, and when your serving latency requirement is tight enough that pipeline bubbles are unacceptable.

For inference of dense models in the 7B to 70B range, an 8-GPU Hopper node with NVLink is entirely sufficient. You do not need an NVL72 unless you are serving at very high concurrency and can fill all 72 GPUs with distinct model replicas. The NVL72 makes the most sense for two scenarios: (1) a single very large model (405B+) that must fit in one domain for real-time inference, and (2) MoE models where expert parallelism degree needs to equal or exceed 72 to avoid cross-domain AlltoAll.

When the NVLink domain size does NOT change what you do: if you are running data parallel training of smaller models (7B, 13B, 70B) and overlapping gradient communication with the backward pass, InfiniBand is sufficient and the NVL72 is overkill. At the same GPU count and model size, you will likely get better economics from nine DGX H100 nodes (72 GPUs as nine independent 8-GPU NVLink domains, each hosting its own model replicas) than from one NVL72. The NVL72 is not a universal upgrade; it is a platform for a specific class of workload.

What to validate before committing to either platform: first, confirm your model size and precision against the domain memory capacity. Second, profile your AllReduce and AlltoAll communication volume at the target TP/EP degree using NCCL tests at realistic batch sizes. Third, check whether your training framework (NeMo, Megatron-LM, or others) supports the TP degree you are targeting at that precision. Frameworks have per-degree optimizations that are not always available at TP = 72. For serving, measure time-to-first-token and throughput at your target concurrency on an 8-GPU node first; if you are not bottlenecked on NVLink bandwidth, the NVL72 will not help.

My take: The 5th-gen NVLink at 1.8 TB/s per GPU and the 130 TB/s NVL72 fabric are genuinely new capabilities, not incremental improvements. For the class of workloads they target, they change what is possible. But they are expensive, power-hungry, and operationally complex (nine NVLink Switch trays, liquid cooling, specialized cabling). Do not plan your AI infrastructure around them unless you have confirmed that your model size and serving latency targets require them. For most enterprise AI deployments through 2025 and into 2026, a well-configured DGX H100 cluster with InfiniBand HDR or NDR is the right answer. The NVL72 is for frontier model providers and research labs running trillion-parameter workloads. The Vera Rubin NVL72 with NVLink 6 at 260 TB/s will push those limits further still when it ships.

If you are building a new AI cluster and unsure where the NVLink domain boundary matters for your workload, map your model size, precision, and parallelism strategy against the table above before hardware procurement. That table, combined with an NCCL bandwidth test on a borrowed or cloud instance of your target platform, will tell you more than any benchmark white paper. Continue to Part 8: InfiniBand vs. Spectrum-X Ethernet for the scale-out side of this equation, or return to the full NVIDIA AI guide for context on the rest of the stack.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
