Run nvidia-smi topo -m on a DGX H100 and you will see every GPU-to-GPU cell marked NV18. That code means 18 active NVLink links bonded between those two GPUs, each link carrying around 50 GB/s bidirectional on Hopper, for a combined 900 GB/s available between any pair (Blackwell doubles the per-link rate to reach 1.8 TB/s). Now imagine needing that same bandwidth simultaneously between all eight GPUs in the node, and then across all 72 GPUs in a rack, without any link becoming a bottleneck. That is the engineering problem NVSwitch solves. Getting this fabric right determines whether a 405-billion-parameter model fits inside one NVLink domain or needs to be sharded across InfiniBand, which is a fundamentally different and slower proposition. The fabric is not a footnote to GPU specs; it is the reason the GB200 NVL72 can serve trillion-parameter models in real time.
NVLink Generations: What Changed and Why It Matters
NVLink started on Pascal in 2016 with a modest 160 GB/s per GPU. The progression since then has been consistent: per-GPU bandwidth has grown by 1.5x to 2x with every GPU generation. But the architectural decision to use NVSwitch as a crossbar fabric, rather than daisy-chaining GPUs in a ring or mesh, changed the game in the Volta generation (NVLink 2, 2017) and made all subsequent scaling practical.
| Generation | Architecture | Links per GPU | BW per GPU (bidir) | Max GPU Domain |
|---|---|---|---|---|
| NVLink 1 | Pascal (P100) | 4 | 160 GB/s | 2 (peer only) |
| NVLink 2 | Volta (V100) | 6 | 300 GB/s | 8 (via NVSwitch) |
| NVLink 3 | Ampere (A100) | 12 | 600 GB/s | 8 (via NVSwitch) |
| NVLink 4 | Hopper (H100) | 18 | 900 GB/s | 8 (via NVSwitch) |
| NVLink 5 | Blackwell (B200/GB200) | 18 | 1,800 GB/s | 72 (NVL72), 576 (multi-node) |
| NVLink 6 | Rubin (Vera Rubin) | 36 | 3,600 GB/s | 72 (NVL72, 260 TB/s), 576+ (multi-node) |
Two things stand out in that table. First, the link count per GPU stayed at 18 from Hopper to Blackwell; NVIDIA doubled bandwidth by increasing signaling speed per lane, not by adding more physical wires. Second, Blackwell is the first generation where the NVSwitch-based domain broke out of the 8-GPU node limit and scaled to 72 GPUs in a rack. That expansion required a new NVLink Switch chip with enough crossbar capacity to run all 72 GPUs at full 1.8 TB/s, non-blocking. On Rubin (NVLink 6), link count doubles to 36 and per-GPU bandwidth reaches 3.6 TB/s, pushing the NVL72 aggregate to 260 TB/s.
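The first observation can be checked with simple division. The sketch below uses only the links-per-GPU and bandwidth columns from the table (the dictionary and function names are illustrative) and computes the bidirectional bandwidth each link contributes:

```python
# Links per GPU and bidirectional GB/s per GPU, copied from the table above.
GENERATIONS = {
    "NVLink 1 (P100)": (4, 160),
    "NVLink 2 (V100)": (6, 300),
    "NVLink 3 (A100)": (12, 600),
    "NVLink 4 (H100)": (18, 900),
    "NVLink 5 (B200/GB200)": (18, 1800),
    "NVLink 6 (Rubin)": (36, 3600),
}


def per_link_gb_s(links: int, per_gpu_gb_s: int) -> float:
    """Bidirectional bandwidth contributed by a single link."""
    return per_gpu_gb_s / links


for name, (links, bandwidth) in GENERATIONS.items():
    print(f"{name:<22} {links:>2} links x {per_link_gb_s(links, bandwidth):5.1f} GB/s")
```

Hopper and Blackwell both use 18 links, but the per-link figure moves from 50 to 100 GB/s; Rubin keeps 100 GB/s per link and doubles the link count.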
How NVSwitch Builds the All-to-All Fabric
NVLink by itself is a point-to-point protocol. Connect GPU A to GPU B and they can talk at 1.8 TB/s. But a training AllReduce across eight GPUs needs every GPU to send to every other GPU simultaneously. A direct mesh of eight GPUs would require 28 separate cables and each GPU would need a port for every other peer. That does not scale.
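To see how quickly a direct mesh gets out of hand, here is a minimal sketch of the link count for a full mesh, where every pair of GPUs gets its own connection. The function names are illustrative; the formula is the standard n(n-1)/2 pair count.

```python
def full_mesh_links(n_gpus: int) -> int:
    """Point-to-point connections needed so every GPU pair is directly linked."""
    return n_gpus * (n_gpus - 1) // 2


def peers_per_gpu(n_gpus: int) -> int:
    """In a full mesh, each GPU must dedicate ports to every other GPU."""
    return n_gpus - 1


for n in (8, 72):
    print(f"{n} GPUs: {full_mesh_links(n)} links, {peers_per_gpu(n)} peers per GPU")
# 8 GPUs: 28 links, 7 peers per GPU
# 72 GPUs: 2556 links, 71 peers per GPU
```

With a fixed number of links per GPU, every additional peer also thins the bandwidth available to each pair, which is the second reason the mesh approach stops working.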
NVSwitch solves this with a non-blocking crossbar. Every GPU connects its NVLink lanes to an NVSwitch chip (or to multiple NVSwitch chips on larger systems), and the switch fabric routes any GPU-to-GPU transfer without head-of-line blocking. The crossbar ensures that when GPU 0 is sending to GPU 3 and GPU 4 is sending to GPU 7, both transfers run simultaneously at full link speed with no interference. This is the same property that makes a 64-port non-blocking Ethernet switch more useful than a daisy-chained ring, and it matters enormously for collective operations.
SHARP: In-Network Reductions
NVIDIA builds SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) engines into the NVSwitch chip itself; they first appeared in the Hopper-generation switch, and the NVLink 5 generation adds FP8 support. SHARP offloads the reduction step of AllReduce to the switch hardware, so gradient accumulation during training does not require every GPU to exchange and sum partial results with every peer. The NVSwitch chip can sum FP8 tensors in-flight. At 72-GPU scale this is meaningful: instead of each GPU trading partial sums with 71 peers, each GPU sends its contribution to the switch once and receives the reduced result back. NVIDIA cites a 4x bandwidth efficiency gain from SHARP FP8 support on the NVLink 5 switch. [VERIFY exact efficiency multiplier in production workloads]
The NVL72: How 72-GPU All-to-All Works at Rack Scale
The GB200 NVL72 is a 72-GPU rack-scale system: 36 Grace CPU sockets paired with 72 Blackwell GPUs, all connected in a single NVLink domain by nine NVLink Switch trays. The total aggregate GPU-to-GPU bandwidth is 130 TB/s. Every GPU in the rack can talk to every other GPU at its full 1.8 TB/s link speed, simultaneously, with no blocking. NVIDIA describes this as 72 GPUs acting as a single GPU, which is marketing language, but the underlying physics is real: the latency and bandwidth inside the domain are homogeneous regardless of which two GPUs are communicating. That homogeneity is what allows you to place tensor-parallel shards arbitrarily across all 72 GPUs without worrying about slower hops for certain shard pairs.
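The 130 TB/s aggregate is the per-GPU figure multiplied out. A quick sketch of that arithmetic follows; the function name is illustrative, and the result is a sum of per-GPU link rates rather than a measured bisection number.

```python
def aggregate_tb_s(n_gpus: int, per_gpu_gb_s: int) -> float:
    """Sum of per-GPU bidirectional NVLink bandwidth across a domain, in TB/s."""
    return n_gpus * per_gpu_gb_s / 1000


print(aggregate_tb_s(72, 1800))  # GB200 NVL72: 129.6, quoted as 130 TB/s
print(aggregate_tb_s(8, 900))    # 8-GPU Hopper node: 7.2
```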
The roughly 13.4 TB of HBM3e across all 72 GPUs is reachable by any GPU in the domain via NVLink with no CPU mediation (NVIDIA's headline 30 TB of "fast memory" also counts the Grace CPUs' LPDDR5X). This is what makes serving a 405B-parameter model (Llama 3.1 405B, for example) in real time practical: the weights fit in the domain aggregate memory and the activations can be streamed across the fabric without copying through host DRAM or going across a PCIe root complex.
Multi-Rack NVLink: Up to 576 GPUs
For workloads that require more than 72 GPUs within the NVLink domain, NVIDIA supports multi-rack NVLink configurations that can span up to 576 Blackwell GPUs. This is achieved by connecting NVLink Switch trays across racks using NVLink cables, extending the same non-blocking all-to-all fabric to a larger set of GPUs. The bandwidth homogeneity is maintained; every GPU-to-GPU pair still communicates at 1.8 TB/s. This differs fundamentally from InfiniBand scale-out, where you accept higher latency and lower effective bisection bandwidth as you cross switch tiers.
NVLink-C2C: The Grace-to-Blackwell Link
NVLink-C2C (Chip-to-Chip) is a variant of the NVLink protocol implemented as a direct chip-to-chip interconnect. In the GB200 Grace Blackwell Superchip, one Grace CPU and two Blackwell GPUs are packaged on one module, with NVLink-C2C providing 900 GB/s of bidirectional bandwidth (450 GB/s per direction) for the CPU-GPU link. Each Blackwell GPU is itself built from two dies joined by a 10 TB/s die-to-die link. The 3.6 TB/s figure quoted for the full GB200 Superchip is the combined NVLink 5 bandwidth of its two GPUs (2 x 1.8 TB/s) into the NVSwitch fabric, not CPU-GPU bandwidth.
Why does the CPU-GPU link matter for an AI infrastructure architect? Because it determines how efficiently the CPU can stage data for the GPU and how much CPU-side memory (the Grace CPU has up to 480 GB of LPDDR5X) participates in the memory hierarchy. NVLink-C2C makes the Grace memory directly addressable by the GPU at close to NVLink speeds, which changes the effective working set for large-context inference. The GPU does not have to first copy data to its own HBM; it can read Grace DRAM across the C2C link. That matters for very long context lengths where KV caches overflow HBM.
Worked example
Consider Llama 3.1 405B in FP8 precision: the model weights require roughly 405 GB. A single NVL72 with 72 GPUs at roughly 186 GB of HBM3e each has about 13.4 TB of HBM available, so 405B fits easily in GPU memory with room for KV caches. Tensor parallelism (TP) = 72 means every transformer layer triggers AllReduce operations across all 72 GPUs over NVLink at 1.8 TB/s. Compare this to an 8-GPU node with Hopper at 900 GB/s: once weights plus KV cache outgrow the node, you need pipeline parallelism (PP) across nodes, which adds pipeline bubbles and puts inter-node InfiniBand latency on every stage boundary. The NVL72 eliminates those pipeline bubbles for this model size. For a 1T-parameter model in FP8 (~1 TB weights), the weights alone still fit in one rack, but training state (gradients, optimizer moments, activations) multiplies the footprint several times over, so training at that scale needs multiple NVL72 racks; at that point you are back to pipeline or expert parallelism crossing the InfiniBand boundary.
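The memory side of this example is straightforward to reproduce. The sketch below counts weights only; the per-GPU HBM capacity is an assumed input (replace it with the figure for your part), and the function names are illustrative.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * bytes_per_param


def hbm_share(params_billion: float, bytes_per_param: float,
              n_gpus: int, hbm_gb_per_gpu: float) -> float:
    """Fraction of the domain's total HBM consumed by weights alone."""
    return weights_gb(params_billion, bytes_per_param) / (n_gpus * hbm_gb_per_gpu)


HBM_GB_PER_GPU = 186  # assumption: per-GPU HBM3e capacity; substitute your own

for label, params_b in (("405B", 405), ("1T", 1000)):
    share = hbm_share(params_b, 1, 72, HBM_GB_PER_GPU)  # 1 byte per param = FP8
    print(f"{label} FP8 weights use {share:.1%} of a 72-GPU domain's HBM")
```

KV caches and activations come on top of this for inference, and gradients plus optimizer state come on top for training, so a small weight share does not by itself mean a training run fits.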
Why Scale-Up Changes Your Parallelism Strategy
There are three main parallelism strategies for large models: tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism (EP) for mixture-of-experts models. Each has a different sensitivity to the interconnect.
Tensor parallelism splits each weight matrix across GPUs. Every forward pass triggers two AllReduce operations per transformer layer. At high tensor-parallel degree, the AllReduce volume is large and the latency of the operation is directly on the critical path. Tensor parallelism works well only inside a single NVLink domain. Once you cross the InfiniBand boundary, the latency of an AllReduce at 200 Gb/s InfiniBand is 10 to 50x higher than across NVLink, and TP degree above 8 over InfiniBand is nearly always net-negative for throughput. The practical consequence: keep TP within one NVLink domain (8 GPUs for Hopper, 72 for Blackwell NVL72).
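A rough feel for that gap comes from the textbook bandwidth-only cost model for a ring AllReduce, in which each GPU moves 2(n-1)/n of the message. This is a sketch under stated assumptions, not a model of what NCCL actually does: it ignores latency, protocol overhead, and in-switch reduction, and the message size and per-direction bandwidths are illustrative inputs.

```python
def ring_allreduce_seconds(message_bytes: float, n_gpus: int, link_gb_s: float) -> float:
    """Bandwidth-only ring AllReduce estimate.

    Each GPU sends and receives 2 * (n - 1) / n of the message; latency,
    protocol overhead, and in-switch reduction are ignored.
    """
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_bytes / (link_gb_s * 1e9)


MESSAGE_BYTES = 64 * 2**20  # hypothetical 64 MiB tensor
NVLINK4_GB_S = 450          # assumption: 900 GB/s bidirectional, halved for one direction
IB_200G_GB_S = 25           # 200 Gb/s InfiniBand = 25 GB/s per direction

t_nvlink = ring_allreduce_seconds(MESSAGE_BYTES, 8, NVLINK4_GB_S)
t_ib = ring_allreduce_seconds(MESSAGE_BYTES, 8, IB_200G_GB_S)
print(f"NVLink 4: {t_nvlink * 1e6:.0f} us, InfiniBand: {t_ib * 1e6:.0f} us, "
      f"ratio {t_ib / t_nvlink:.0f}x")
```

Under these assumptions the ratio is simply the bandwidth ratio, 18x, which sits inside the 10 to 50x range quoted above; per-hop latency, which this model omits, widens the gap further for small messages.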
Pipeline parallelism stages model layers across GPUs sequentially. The communication cost is the activation tensor sent between pipeline stages; this is much smaller than an AllReduce, so it tolerates higher latency and works reasonably well across InfiniBand. The cost is pipeline bubbles: stages sit idle while the pipeline fills and drains. With m microbatches per step, the bubble fraction is roughly (PP - 1) / (m + PP - 1), which reduces to (PP - 1) / PP for a single microbatch and grows as you add more pipeline stages. The NVL72 lets you set TP = 72 and PP = 1 for models that fit in 72 GPUs worth of memory, eliminating pipeline bubbles entirely for those models.
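The bubble cost can be made concrete with the usual GPipe-style estimate, (p - 1) / (m + p - 1) for p stages and m microbatches, which collapses to (PP - 1) / PP when there is a single microbatch. The sketch below is a simplified model (equal-cost stages, no interleaving), and the function name is illustrative.

```python
def bubble_fraction(pp_stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (pp_stages - 1) / (microbatches + pp_stages - 1)


for pp, m in ((1, 1), (8, 1), (8, 32), (16, 32)):
    print(f"PP={pp:>2}, microbatches={m:>2}: bubble = {bubble_fraction(pp, m):.1%}")
```

More microbatches amortize the bubble, but only PP = 1 removes it, which is the case the NVL72 makes available for models that fit in one domain.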
Expert parallelism for MoE models routes each token to a subset of experts. The communication pattern is an AlltoAll: each GPU sends token embeddings to whichever GPU holds the relevant expert, receives outputs back, and then aggregates. AlltoAll has a similar latency sensitivity to AllReduce; it should stay within the NVLink domain. MoE models like Mixtral or NVIDIA's own Nemotron MoE variants run far more efficiently when the expert-parallel group fits inside the NVLink domain. With a 72-GPU NVLink domain, you can run EP = 72, meaning all 72 expert shards are reachable without leaving the fabric.
Reading the NVLink Topology in Production
Two commands tell you everything about your NVLink fabric health: nvidia-smi topo -m for the topology matrix and nvidia-smi nvlink -s -i <gpu_index> for per-link status on a specific GPU.
Topology Matrix: What to Look For
# On a DGX H100 (8 x H100, NVLink 4 / 18 links per GPU):
$ nvidia-smi topo -m
# Truncated output (GPU rows only):
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
# GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18
# GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18
# GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18
# GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18
# GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18
# GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18
# GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18
# GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X
# NV18 = 18 NVLink links bonded between this GPU pair
# If any cell shows PIX, NODE, or SYS instead of NV18, that pair
# is NOT connected via NVLink and will route through PCIe or the CPU interconnect.
# That GPU pair will see dramatically lower bandwidth (~128 GB/s bidirectional
# on PCIe Gen 5 x16 vs 900 GB/s NVLink 4), breaking tensor-parallel performance.
# Per-link status for GPU 0:
$ nvidia-smi nvlink -s -i 0
# Expected output:
# GPU 00000000:03:00.0
# Link 0: Active
# Link 1: Active
# ...
# Link 17: Active
# All 18 links must show Active. A single Inactive link reduces
# GPU0's effective NVLink bandwidth by 1/18th: ~50 GB/s of 900 GB/s
# on Hopper, ~100 GB/s of 1.8 TB/s on Blackwell.
# Check link error counters (replay errors indicate physical layer issues):
$ nvidia-smi nvlink --errorcounters -i 0
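Eyeballing an 8x8 matrix works; a 72-GPU one does not. The following is a minimal sketch of a parser that flags GPU pairs whose cell is neither X nor an NV code. It assumes the whitespace-separated matrix layout shown above; the function name and the sample text (with a deliberately broken GPU1/GPU2 pair) are hypothetical.

```python
import re

GPU_LABEL = re.compile(r"^GPU\d+$")


def non_nvlink_pairs(topo_text: str) -> list[tuple[str, str, str]]:
    """Return (row_gpu, col_gpu, cell) for GPU pairs not joined by NVLink."""
    columns: list[str] = []
    problems = []
    for line in topo_text.splitlines():
        tokens = line.split()
        if not tokens or not GPU_LABEL.match(tokens[0]):
            continue  # skip NIC rows, legend lines, and blanks
        if not columns:
            # The first GPU line is the header: keep only the GPUn column labels.
            columns = [t for t in tokens if GPU_LABEL.match(t)]
            continue
        row_gpu, cells = tokens[0], tokens[1:1 + len(columns)]
        for col_gpu, cell in zip(columns, cells):
            if cell != "X" and not cell.startswith("NV"):
                problems.append((row_gpu, col_gpu, cell))
    return problems


# Hypothetical 4-GPU sample with one broken pair.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3
GPU0     X      NV18    NV18    NV18
GPU1    NV18     X      SYS     NV18
GPU2    NV18    SYS      X      NV18
GPU3    NV18    NV18    NV18     X
"""

print(non_nvlink_pairs(SAMPLE))
# [('GPU1', 'GPU2', 'SYS'), ('GPU2', 'GPU1', 'SYS')]
```

On a real node you would pass in the captured output of nvidia-smi topo -m instead of SAMPLE; an empty result means every GPU pair reports an NVLink connection.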
Gotcha
If nvidia-smi topo -m shows SYS between two GPUs that should be NVLink-connected, suspect a failed NVSwitch or switch tray before you suspect a bad cable. A subtler failure is a dead NVSwitch chip that removes one path from the crossbar: the links can still appear active while bisection bandwidth silently drops. NCCL will not necessarily warn you; your training throughput will drop by roughly 1/num_switches of what the crossbar should deliver. On an NVL72 with 9 switch trays, one dead tray means roughly 11% less fabric bandwidth. Validate switch tray health through DCGM or the BMC before blaming NCCL configuration or model hyperparameters for unexplained throughput degradation.
The Domain Boundary: Where NVLink Ends and InfiniBand Begins
The NVLink domain has a hard architectural boundary. For a standard Hopper DGX/HGX node, that boundary is 8 GPUs. For a Blackwell NVL72 rack, it is 72 GPUs. For multi-rack NVLink configurations, it can reach 576 Blackwell GPUs. Beyond that boundary, all GPU-to-GPU traffic goes through InfiniBand (or Spectrum-X Ethernet), which is covered in Part 8.
| Platform | NVLink Domain Size | Aggregate BW | Best Fits (Model Size) | Scale-Out Needed? |
|---|---|---|---|---|
| DGX H100 / HGX H100 | 8 GPUs | 7.2 TB/s | Up to ~70B params (FP8) | Yes, for 70B+ training |
| GB200 NVL72 (single rack) | 72 GPUs | 130 TB/s | Up to ~700B params (FP8) | Only for 700B+ training |
| Multi-rack NVLink (Blackwell) | Up to 576 GPUs | >1 PB/s [VERIFY] | Trillion-param+ training | Not for TP/EP; PP for larger |
| Vera Rubin NVL72 | 72 GPUs | 260 TB/s | Trillion-param inference [VERIFY shipping date] | Only for multi-trillion |
The domain boundary matters practically when you are sizing a parallelism configuration. If your model fits within one NVLink domain with your chosen precision, you can run TP = (domain size) and set PP = 1. If your model requires more GPUs than the domain, you must add pipeline parallelism across the InfiniBand fabric, accepting pipeline bubbles and the associated throughput hit. This is the central design trade-off of the NVL72: for models up to roughly 700B parameters in FP8, the entire training run can stay inside NVLink. For larger models or very large batch sizes, you go multi-rack and accept the InfiniBand overhead on the pipeline dimension only.
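That sizing rule can be written down as a small helper. This is a sketch of the rule of thumb only: gpus_needed is a hypothetical input you would derive from your own memory estimate, and the function ignores data parallelism, attention-head divisibility, and framework limits on TP degree.

```python
import math


def plan_parallelism(gpus_needed: int, domain_size: int) -> tuple[int, int]:
    """Return (tp, pp) following the rule of thumb above.

    TP spans one NVLink domain; PP counts how many domains the model is
    staged across (PP > 1 means crossing the scale-out fabric).
    """
    if gpus_needed < 1 or domain_size < 1:
        raise ValueError("gpus_needed and domain_size must be positive")
    pp = math.ceil(gpus_needed / domain_size)
    return domain_size, pp


print(plan_parallelism(64, 72))   # fits in one NVL72: (72, 1)
print(plan_parallelism(200, 72))  # spills over: (72, 3)
print(plan_parallelism(16, 8))    # two Hopper nodes: (8, 2)
```

Any result with PP greater than 1 is a signal to estimate the pipeline bubble and inter-stage traffic before committing to the configuration.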
My Verdict: When NVLink Domain Size Actually Changes What You Do
The NVLink domain size matters most at two thresholds: when your model barely fits in (or barely overflows) the domain, and when your serving latency requirement is tight enough that pipeline bubbles are unacceptable.
For inference of dense models in the 7B to 70B range, an 8-GPU Hopper node with NVLink is entirely sufficient. You do not need an NVL72 unless you are serving at very high concurrency and can fill all 72 GPUs with distinct model replicas. The NVL72 makes the most sense for two scenarios: (1) a single very large model (405B+) that must fit in one domain for real-time inference, and (2) MoE models where expert parallelism degree needs to equal or exceed 72 to avoid cross-domain AlltoAll.
When the NVLink domain size does NOT change what you do: if you are running data parallel training of smaller models (7B, 13B, 70B) and overlapping gradient communication with the backward pass, InfiniBand is sufficient and the NVL72 is overkill. You will often get better economics by spreading the same GPU count across standard 8-GPU nodes (nine DGX H100 nodes match one NVL72's 72 GPUs) and running independent model replicas on each than by buying one NVL72 for the same model size. The NVL72 is not a universal upgrade; it is a platform for a specific class of workload.
What to validate before committing to either platform: first, confirm your model size and precision against the domain memory capacity. Second, profile your AllReduce and AlltoAll communication volume at the target TP/EP degree using NCCL tests at realistic batch sizes. Third, check whether your training framework (NeMo, Megatron-LM, or others) supports the TP degree you are targeting at that precision. Frameworks have per-degree optimizations that are not always available at TP = 72. For serving, measure time-to-first-token and throughput at your target concurrency on an 8-GPU node first; if you are not bottlenecked on NVLink bandwidth, the NVL72 will not help.
If you are building a new AI cluster and unsure where the NVLink domain boundary matters for your workload, map your model size, precision, and parallelism strategy against the table above before hardware procurement. That table, combined with an NCCL bandwidth test on a borrowed or cloud instance of your target platform, will tell you more than any benchmark white paper. Continue to Part 8: InfiniBand vs. Spectrum-X Ethernet for the scale-out side of this equation, or return to the full NVIDIA AI guide for context on the rest of the stack.