The Network and Storage Behind Large-Scale AI (GenAI Series, Part 26)

At scale, the network between GPUs is often the real bottleneck. How NVLink, InfiniBand and RoCE, collective operations like all-reduce, and high-throughput storage keep GPUs fed.

by Dr. Pranay Jha

June 18, 2026

Generative AI Series · Part 26 of 30

TL;DR · Key Takeaways

At scale, the network between GPUs is often the real bottleneck. Expensive accelerators can sit idle waiting to talk to each other.
Inside a server, GPUs link over ultra-fast NVLink. Between servers, InfiniBand or high-speed RoCE Ethernet carry the traffic.
Distributed AI constantly runs collective operations (like all-reduce) to keep GPUs in sync, handled by libraries such as NCCL.
Storage for AI is about throughput: streaming huge datasets and loading giant models fast, not the small random-operation speed that classic databases need.

Buy a rack of the most powerful GPUs on the market, wire them together carelessly, and you will have built an expensive disappointment. This catches people out because all the attention goes to the accelerators and almost none to the plumbing between them. Yet at scale, AI workloads spend a startling fraction of their time not computing but moving data: GPUs pass data back and forth to stay coordinated, and they read data in from storage. If those paths are too slow, your costly GPUs wait, and a waiting GPU bills exactly the same as a working one. The previous parts kept bumping into the speed of the links; this part finally goes there, into the network and storage that govern large-scale AI.

The interconnect: how GPUs talk

There are two layers of connection, and they operate at very different speeds. Inside a single server, the GPUs are linked directly by NVLink, a dedicated, extremely high-bandwidth bridge that lets cards in the same box exchange data far faster than ordinary connections allow. This is why tensor parallelism from Part 25, which needs constant chatter between GPUs on every token, is kept within one server: NVLink is fast enough to make that chatter cheap. It is the express lane, and it only exists inside the box.

The numbers make the hierarchy vivid. Inside an eight-GPU server, NVLink gives each card hundreds of gigabytes per second to its neighbours: roughly 900 GB/s per GPU on an H100-class system, counting both directions, or about 450 GB/s each way. Step outside the box and even a fast InfiniBand link, NDR at 400 gigabits per second per port (around 50 GB/s each way), is close to an order of magnitude slower. That cliff between intra-node and inter-node bandwidth is one of the most important facts in cluster design, and it is exactly why the parallelism that needs the most constant chatter is pinned inside a single server.
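
To put that cliff in concrete terms, here is a back-of-the-envelope sketch. It assumes the 900 GB/s NVLink figure counts both directions (so about 450 GB/s each way), ignores latency and protocol overhead, and uses a hypothetical 10 GB payload chosen purely for illustration.

```python
# Ideal transfer times across the two layers of interconnect.
# Bandwidths are ballpark figures from the text, not measurements.

NVLINK_BIDIRECTIONAL_GB_PER_S = 900.0  # H100-class NVLink, both directions combined
NVLINK_ONE_WAY_GB_PER_S = NVLINK_BIDIRECTIONAL_GB_PER_S / 2  # assumption: symmetric link
INFINIBAND_NDR_GBIT_PER_S = 400.0      # per port, each direction


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8.0


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move a payload, ignoring latency and overhead."""
    return payload_gb / bandwidth_gb_per_s


payload_gb = 10.0  # hypothetical payload size
ib_gb_per_s = gbit_to_gbyte(INFINIBAND_NDR_GBIT_PER_S)

nvlink_s = transfer_seconds(payload_gb, NVLINK_ONE_WAY_GB_PER_S)
ib_s = transfer_seconds(payload_gb, ib_gb_per_s)

print(f"InfiniBand NDR: {ib_gb_per_s:.0f} GB/s per direction")
print(f"NVLink:     {nvlink_s * 1000:.1f} ms")
print(f"InfiniBand: {ib_s * 1000:.1f} ms")
print(f"Inter-node is about {ib_s / nvlink_s:.0f}x slower")
```

Real links deliver somewhat less than these headline rates, so treat the ratio as an order-of-magnitude guide rather than a benchmark.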

Between servers, the traffic crosses the data-center network, and here the choice of fabric matters enormously. InfiniBand is the long-standing favourite for AI and high-performance computing because of its very low latency and its support for RDMA (remote direct memory access), which lets one machine read or write another’s memory without involving either CPU in the transfer. The main alternative is RoCE (RDMA over Converged Ethernet), which brings RDMA to high-speed Ethernet and appeals to organisations that would rather run one familiar network technology everywhere. Either way, the goal is the same: move data between servers fast enough that the GPUs are not left idling. When people build AI clusters, the interconnect is not an afterthought; it is a primary design decision that deserves as much attention as the GPUs themselves.

This also quietly inverts a decades-old assumption about data-center networks. Traditional applications generate mostly “north-south” traffic (data flowing in and out to users), so networks were tuned for that. AI clusters are dominated by “east-west” traffic (GPUs talking to each other within the cluster), which often amounts to vastly more bytes than ever reach a user. A few users asking questions can sit behind a torrent of internal gradient and activation exchange. Networks built on the old north-south assumption can choke on this, which is why AI fabrics are designed east-west first, with fat, non-blocking paths between every server so any GPU can reach any other at full speed. If you are planning AI infrastructure on a network designed for ordinary web traffic, this is the assumption most likely to bite you.

Fast within the box, structured between boxes. The whole cluster is only as quick as its slowest hop.

Collective communication: keeping GPUs in sync

Why do GPUs need to talk so much in the first place? Because when you split a job across many of them, they each hold a piece of the work and must regularly combine their pieces. The classic example is training: every GPU computes updates on its slice of the data, and then all of them must agree on a single combined update before the next step. That agreement is a collective operation, the most important being all-reduce, which sums a value across every GPU and hands the total back to all of them. It happens constantly, and until it finishes, everyone waits.
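
What all-reduce promises is easy to show in a few lines. The sketch below only simulates the result in plain Python, with each inner list standing in for one GPU's gradients; it is not how NCCL implements the operation, and the function name is made up for illustration.

```python
from typing import List


def all_reduce_sum(per_gpu_values: List[List[float]]) -> List[List[float]]:
    """Simulate all-reduce(sum): every GPU ends up holding the elementwise total."""
    total = [sum(column) for column in zip(*per_gpu_values)]
    # Every participant receives its own copy of the same combined result.
    return [list(total) for _ in per_gpu_values]


# Four hypothetical GPUs, each holding a two-element gradient.
gradients = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]

synced = all_reduce_sum(gradients)
print(synced[0])  # [16.0, 20.0]
print(all(row == synced[0] for row in synced))  # True
```

The point to notice is that the input differs per GPU while the output is identical everywhere, which is exactly the agreement the next training step depends on.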

These operations are orchestrated by specialised libraries, most famously NCCL (the NVIDIA Collective Communications Library), which knows how to route an all-reduce across NVLink and the network fabric in the most efficient pattern for the hardware. This matters to everyone, not just cluster engineers, because a collective is a synchronization point: the whole group moves at the pace of the slowest participant and the slowest link. A single underperforming GPU or a congested network hop drags the entire cluster down with it. This is why large AI systems care so much about uniform, fast, well-tuned communication: an imbalance anywhere becomes everyone’s problem.
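
The straggler effect can be sketched with a deliberately simple model: a step takes as long as the slowest GPU's compute plus the collective that follows. The timings are hypothetical, and the model assumes no overlap between compute and communication.

```python
from typing import Sequence


def synchronized_step_seconds(
    compute_seconds: Sequence[float], all_reduce_seconds: float
) -> float:
    """Step time when every GPU must wait for the slowest one before syncing."""
    return max(compute_seconds) + all_reduce_seconds


ALL_REDUCE_SECONDS = 0.02  # hypothetical cost of the collective

healthy = [0.10] * 8                 # eight GPUs, all equally fast
one_straggler = [0.10] * 7 + [0.15]  # same cluster with one slow GPU

healthy_step = synchronized_step_seconds(healthy, ALL_REDUCE_SECONDS)
slow_step = synchronized_step_seconds(one_straggler, ALL_REDUCE_SECONDS)

print(f"healthy cluster: {healthy_step:.2f} s per step")
print(f"one slow GPU:    {slow_step:.2f} s per step")
print(f"slowdown for all eight GPUs: {slow_step / healthy_step:.2f}x")
```

Seven of the eight GPUs are as fast as before, yet the whole cluster pays for the eighth.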

Collectives are the heartbeat of distributed AI, and a heartbeat is only as steady as its weakest link.

Storage: it is about throughput, not IOPS

The third piece of plumbing is storage, and AI stresses it in an unusual way. A traditional database cares about IOPS, the number of small, random read-and-write operations per second, because it is constantly looking up individual records. AI cares about throughput, the sheer volume of data moved per second in big sequential streams. Training reads enormous datasets end to end, over and over. Serving has to load tens of gigabytes of model weights into GPU memory quickly, and any restart means doing it again. Checkpointing a large training run periodically writes a massive snapshot to disk. All of these are throughput problems (big files moving fast), not random-access problems.
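
The two metrics are linked by the size of each operation, which is why a system can score well on one and poorly on the other. A small sketch, using made-up workload figures and decimal units:

```python
def throughput_mb_per_s(operations_per_s: float, io_size_kb: float) -> float:
    """Throughput implied by an operation rate and the size of each operation."""
    return operations_per_s * io_size_kb / 1000.0


# Hypothetical database-style workload: many tiny random reads.
database_style = throughput_mb_per_s(operations_per_s=100_000, io_size_kb=4)

# Hypothetical AI-style workload: far fewer, much larger sequential reads.
streaming_style = throughput_mb_per_s(operations_per_s=2_000, io_size_kb=1_000)

print(f"database-style:  {database_style:.0f} MB/s at 100,000 IOPS")
print(f"streaming-style: {streaming_style:.0f} MB/s at 2,000 IOPS")
```

In this illustration the streaming workload moves five times the data with one-fiftieth of the operations, so an IOPS figure alone says little about whether a storage tier can feed GPUs.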

Get this wrong and the symptoms are sneaky. GPUs that should be training sit idle waiting for the next batch of data to arrive from slow storage, a problem politely called being “data-loading bound” and impolitely called paying for GPUs to wait on disks. The same bottleneck makes model loading and autoscaling sluggish, because spinning up a new replica means hauling those tens of gigabytes off storage before it can serve anyone. This is why serious AI infrastructure pairs fast GPUs with high-throughput, often parallel, storage systems designed to feed many GPUs at once. Techniques like GPUDirect Storage push this further by letting data move straight from storage into GPU memory, bypassing the CPU and its extra memory copies, which lifts a bottleneck that otherwise caps how fast you can load weights and stream training data. Storage is the least glamorous part of the stack and one of the most common reasons expensive clusters underperform.
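
A rough sizing check makes both symptoms visible. Everything below is an assumption for illustration: the 140 GB weight file, the two storage tiers, and the per-GPU sample rate are hypothetical, and the arithmetic ignores caching and prefetching.

```python
def load_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Ideal time to stream a file of the given size off storage."""
    return size_gb / read_gb_per_s


def required_read_gb_per_s(
    num_gpus: int, samples_per_s_per_gpu: float, sample_mb: float
) -> float:
    """Aggregate read rate needed so that no GPU waits for training data."""
    return num_gpus * samples_per_s_per_gpu * sample_mb / 1000.0


def is_data_loading_bound(required_gb_per_s: float, available_gb_per_s: float) -> bool:
    """True when storage cannot deliver data as fast as the GPUs consume it."""
    return required_gb_per_s > available_gb_per_s


WEIGHTS_GB = 140.0  # hypothetical model weights
TIERS = {"slow tier": 2.0, "parallel tier": 20.0}  # hypothetical GB/s

need = required_read_gb_per_s(num_gpus=64, samples_per_s_per_gpu=1_000, sample_mb=0.25)
print(f"training needs {need:.0f} GB/s of sustained reads")

for name, rate in TIERS.items():
    bound = is_data_loading_bound(need, rate)
    print(
        f"{name}: weights load in {load_seconds(WEIGHTS_GB, rate):.0f} s, "
        f"data-loading bound: {bound}"
    )
```

On the slow tier a new replica waits over a minute for its weights and the training job starves; on the faster tier both problems disappear, for a storage cost that is usually small next to the GPUs.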

Sizing AI storage by database instincts is a classic mismatch. AI wants a firehose, not a fast index.

Reality check: a GPU cluster is a system, not a pile of GPUs, and it runs at the speed of its weakest link. I have seen six- and seven-figure GPU investments throttled by an undersized network or storage tier that cost a fraction of the accelerators. Budget the interconnect and the storage as first-class parts of the design, because the most expensive thing in the building is a GPU that is waiting.

Go Deeper (optional, for technical readers)

Why does network bandwidth cap how far you can scale? Because of the synchronization tax. In data-parallel training, every step ends with an all-reduce of the gradients across all GPUs, and the size of that exchange is proportional to the model’s parameter count. As you add more GPUs to go faster on a fixed amount of work per step, the compute each one does shrinks roughly in proportion, but the gradient exchange stays about the same size however many GPUs share it, so beyond some point you are adding GPUs that spend most of their time talking rather than computing. The result is sub-linear scaling: doubling the GPUs gives well under double the speed, and eventually almost none. The ratio of compute to communication is the lever, which is why faster interconnects (more bandwidth, lower latency) directly raise the ceiling on useful cluster size.
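
The synchronization tax can be sketched with a simple cost model. It assumes a ring all-reduce whose bandwidth cost per GPU is 2(N-1)/N times the gradient size divided by the link speed, a fixed amount of compute split evenly across GPUs, and no overlap or latency. The workload numbers are hypothetical.

```python
def ring_all_reduce_seconds(num_gpus: int, gradient_gb: float, link_gb_per_s: float) -> float:
    """Bandwidth-only cost of a ring all-reduce (latency ignored)."""
    if num_gpus == 1:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * gradient_gb / link_gb_per_s


def speedup(num_gpus: int, single_gpu_compute_s: float,
            gradient_gb: float, link_gb_per_s: float) -> float:
    """Speedup over one GPU when compute divides evenly but sync does not."""
    step = single_gpu_compute_s / num_gpus + ring_all_reduce_seconds(
        num_gpus, gradient_gb, link_gb_per_s
    )
    return single_gpu_compute_s / step


COMPUTE_S = 8.0     # hypothetical: one step on a single GPU
GRADIENT_GB = 14.0  # hypothetical: about 7 billion parameters at 2 bytes each

for link_gb_per_s in (50.0, 200.0):  # hypothetical link speeds
    print(f"link at {link_gb_per_s:.0f} GB/s")
    for n in (8, 16, 32, 64):
        s = speedup(n, COMPUTE_S, GRADIENT_GB, link_gb_per_s)
        print(f"  {n:>3} GPUs: {s:5.1f}x speedup, {s / n:.0%} efficiency")
```

Under these assumptions efficiency falls steadily as GPUs are added, and the faster link pushes the whole curve upward, which is the sense in which the interconnect sets the ceiling.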

Engineers fight this on several fronts. Overlapping communication with computation hides some of the cost: the backward pass produces gradients for the last layers first, so those can be sent while the earlier layers are still computing theirs. Topology-aware collective algorithms (ring and tree all-reduce, which NCCL selects between) minimise how much data crosses the slowest links. And the parallelism strategies from Part 25 are chosen partly by their communication profile: tensor parallelism is bandwidth-hungry and stays on NVLink inside a node, while pipeline and data parallelism are arranged to push less traffic across the slower inter-node fabric. For multi-node serving and training specifically, the network is frequently the true scaling limit, not the GPUs, which is why this layer gets so much engineering love. If you want to see these principles applied to a concrete enterprise platform, I cover the segmentation and traffic paths in my networking for Private AI workloads write-up.
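
For readers curious what a ring all-reduce actually does, here is a toy simulation in plain Python. It is a teaching sketch, not NCCL's implementation: each rank holds one number per chunk, the data is split into as many chunks as there are ranks, and every "send" is just a list update.

```python
from typing import List


def ring_all_reduce(data: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce(sum). data[r][c] is rank r's value for chunk c."""
    n = len(data)
    assert all(len(row) == n for row in data), "one chunk per rank in this sketch"
    buf = [list(row) for row in data]

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot what each rank sends so all sends happen "simultaneously".
        sends = [((r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            buf[(r + 1) % n][chunk] += value  # neighbour accumulates

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            buf[(r + 1) % n][chunk] = value  # neighbour overwrites with the total

    return buf


ranks = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [100.0, 200.0, 300.0],
]
result = ring_all_reduce(ranks)
print(result[0])  # [111.0, 222.0, 333.0]
print(all(row == result[0] for row in result))  # True
```

Each rank only ever talks to its next neighbour and sends 2(N-1) chunks, each 1/N of the data, so no single link has to carry the whole exchange. That property is what makes the ring pattern attractive when bandwidth is the constraint.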

This is Part 26 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It extends the multi-GPU scaling of Part 25 and the memory limits of Part 23.

The Bottom Line

Large-scale AI is a team sport, and the network and storage are the field it plays on. GPUs inside a box talk over fast NVLink; boxes talk over InfiniBand or RoCE; and they constantly run collective operations like all-reduce, coordinated by libraries such as NCCL, that move at the pace of the slowest link. Storage, meanwhile, has to be a firehose of throughput to keep those GPUs fed with data, not the random-access index a database wants.

The principle to carry away is simple and easy to forget when the GPU spec sheets are dazzling: a cluster runs at the speed of its weakest link, and the weakest link is often the cheapest component everyone ignored. Design the interconnect and storage as deliberately as the accelerators. With serving infrastructure now mapped end to end, one big strategic question remains before the frontier: where should all of this actually live, on your own premises, in the cloud, or somewhere in between? That honest verdict is the next part.

References

NCCL: collective communication library (NVIDIA)
NVLink and high-speed GPU interconnect (NVIDIA)
Networking for Private AI workloads (drpranayjha.com)

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
