Dr. Pranay Jha


What It Takes to Train a Model Across Thousands of GPUs (GenAI Series, Part 28)

Training a frontier model coordinates thousands of GPUs for months. How data, tensor, pipeline and expert parallelism, the memory math, and checkpointing make it possible.


Generative AI Series · Part 28 of 30

TL;DR · Key Takeaways

  • Training a frontier model means coordinating thousands of GPUs for weeks or months. No single machine comes remotely close.
  • The work is split four ways at once: data, tensor, pipeline, and expert parallelism, layered together into a careful 3D-or-more arrangement.
  • Memory, not just compute, forces these choices. Training stores weights, gradients, and bulky optimizer states, several times the model’s own size.
  • At this scale, hardware fails constantly. Frequent checkpointing and fast restart are not niceties; they are what make a months-long run possible at all.

Training a frontier model is one of the largest coordinated computations humanity routinely performs. Picture tens of thousands of GPUs, spread across thousands of servers, all working on the same model for weeks or months without pause, at a cost running into tens or hundreds of millions of dollars. At that scale the hard problems stop being about the math of learning, which we covered back in Part 6, and become problems of pure coordination: how do you make that many machines act as one brain, and keep them working when, inevitably, pieces of them break? This part is a tour of the frontier, the engineering behind the models everyone else builds on.

Why one machine cannot come close

Two walls make single-machine training impossible at the frontier. The first is memory. A large model’s weights alone can exceed any one GPU’s VRAM, and training needs far more memory than just the weights, as we will see. The second is time. Even if a model somehow fit, processing the trillions of tokens of training data through it on one GPU would take not months but centuries. The only way out is to spread the work across a vast number of GPUs that compute in parallel, which immediately raises the central question of distributed training: how exactly do you divide one model and one dataset across thousands of chips so they cooperate rather than collide?
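To put a rough number on the time wall, here is a back-of-the-envelope sketch. It leans on a widely used rule of thumb that training a dense transformer costs about 6 floating-point operations per parameter per token. The model size, token count, peak throughput, and utilization below are hypothetical round numbers chosen for illustration, not figures from any specific run.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def training_years(params, tokens, peak_flops, utilization, num_gpus=1):
    """Estimate wall-clock years using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    effective_flops_per_sec = peak_flops * utilization * num_gpus
    return total_flops / effective_flops_per_sec / SECONDS_PER_YEAR

# Hypothetical inputs: 70B parameters, 15T tokens,
# 1e15 FLOP/s peak per GPU, 40% of peak actually achieved.
one_gpu = training_years(70e9, 15e12, 1e15, 0.4)
cluster = training_years(70e9, 15e12, 1e15, 0.4, num_gpus=16_384)

print(f"one GPU:     {one_gpu:,.0f} years")
print(f"16,384 GPUs: {cluster * 365:,.1f} days (assuming perfect scaling)")
```

Under these assumptions a single GPU would need roughly five centuries, while a 16,384-GPU cluster with perfect scaling would need a little over a week and a half. Real runs scale less than perfectly, which is much of what the rest of this part is about.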

The answer is not one technique but several, used together. Each addresses a different part of the problem: splitting the data, splitting the math within a layer, splitting the layers, and splitting the model’s components. Large runs combine the first three into a layered scheme often called 3D parallelism, and mixture-of-experts models add expert parallelism as a fourth dimension. Understanding the four building blocks is understanding how frontier training works.

[Figure: Four ways to split the job. Data parallelism: copy the whole model to each GPU, give each a different slice of data, then sync (all-reduce) the gradients. Tensor parallelism: split the math inside each layer across several GPUs at once; needs the fastest links (NVLink). Pipeline parallelism: different layers on different GPUs, data flows through like a line; stretches across many servers. Expert parallelism: for mixture-of-experts models, spread the experts across GPUs and route each token to the right ones.]
Real frontier runs use all four at once, nested together into one enormous, carefully tuned arrangement.

The four parallelisms, briefly

Three of these we have already met. Data parallelism copies the whole model onto each GPU and feeds each copy a different slice of the data, then combines everyone’s learning with the all-reduce from Part 26. It is the simplest and most common, and it scales throughput, but it requires the model to fit on a single GPU, which frontier models do not. Tensor parallelism splits the math inside each layer across several GPUs that work on the same data in lockstep, kept on the fastest links because of the constant chatter. Pipeline parallelism places different layers on different GPUs so a batch flows through them like an assembly line, tolerating slower links and letting a model far too big for one box span many.
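The data-parallel idea can be checked in miniature. The sketch below uses a toy one-parameter model and plain Python lists in place of GPUs; `all_reduce_mean` is a hypothetical stand-in for a real collective, not any library’s API. It shows that averaging the gradients from equal-sized data slices gives the same gradient as processing the whole batch in one place.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up holding the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5                      # every "GPU" holds an identical copy of the model
num_workers = 4
shard = len(xs) // num_workers

# Each worker computes a gradient on its own slice of the batch.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]
synced = all_reduce_mean(local_grads)   # what every worker applies
full = grad_mse(w, xs, ys)              # single-machine reference

print(synced[0], full)
```

The equivalence holds here because the slices are the same size. Since every worker applies the same averaged gradient, the model copies stay identical from step to step.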

The fourth, expert parallelism, is specific to mixture-of-experts models, which we will meet properly in the next part. Such models contain many specialised sub-networks (“experts”), and only a few are used for any given token. Expert parallelism distributes those experts across GPUs and routes each token to the handful it needs, which is its own coordination puzzle because the routing is dynamic and uneven. Stack all four together (data across groups, pipeline across stages, tensor within a stage, experts spread across the lot) and you get the multi-dimensional parallelism that frontier training actually uses. Designing that arrangement for a specific cluster, balancing each split against the speed of each tier of interconnect, is a deep speciality in its own right.
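To see why dynamic routing produces uneven load, here is a toy simulation. Everything in it is an illustrative assumption: the expert count, the top-2 routing, the round-robin placement of experts on GPUs, and the random scores standing in for a learned router.

```python
import random
from collections import Counter

def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

random.seed(0)
num_experts, num_gpus, top_k, num_tokens = 8, 4, 2, 1000

# Hypothetical placement: experts assigned to GPUs round-robin.
expert_to_gpu = {e: e % num_gpus for e in range(num_experts)}

# Hypothetical router: each expert has a fixed popularity plus per-token noise.
popularity = [random.uniform(0, 1) for _ in range(num_experts)]

gpu_load = Counter()
for _ in range(num_tokens):
    scores = [p + random.gauss(0, 0.5) for p in popularity]
    for expert in route_top_k(scores, top_k):
        gpu_load[expert_to_gpu[expert]] += 1

for gpu in range(num_gpus):
    print(f"GPU {gpu}: {gpu_load[gpu]} token-expert assignments")
```

The totals per GPU will generally differ, and the busiest GPU sets the pace for the step. Real systems add measures such as load-balancing objectives and per-expert capacity limits to narrow that gap.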

The memory that forces the design

People assume training is split up to go faster, and it is, but memory is often the harder master. Inference needs only the weights in memory; training needs much more. For every parameter you must also store its gradient (how to adjust it) and the optimizer state: extra bookkeeping numbers the training algorithm keeps for each parameter to make learning stable. Adam, the most common optimizer, keeps two such numbers per parameter. In mixed-precision training the full set of weights, gradients, and optimizer states adds up to many times the size of the model itself, typically around sixteen bytes per parameter against the two bytes inference needs.

This is why a model that would run comfortably on a few GPUs needs far more to train. It is also why a whole family of techniques, with names like ZeRO and FSDP, exists to shard not just the model but the gradients and optimizer states across GPUs, so that no single chip has to hold the full bulky set. On top of all that sit the activations, the intermediate values from the forward pass that must be kept around for the backward pass, which balloon with batch size and sequence length and are often traded against compute by recomputing them on demand. The upshot is that the parallelism strategy is chosen as much to make everything fit in memory as to make it fast. Memory is the constraint that shapes the whole design.

[Figure: One model, thousands of GPUs, one fabric. Racks of 8-GPU servers joined by a high-speed fabric; tensor parallel within a server, pipeline and data parallel across the fabric.]
The cluster’s physical hierarchy mirrors the parallelism: fastest splits stay closest together.
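One way to picture “fastest splits stay closest together” is as a mapping from a GPU’s global rank to its position in each parallel dimension. The sketch below is an assumed layout for illustration: the tensor-parallel index varies fastest, so that with 8 GPUs per server a tensor-parallel group of 8 never leaves its server. Real frameworks choose their own orderings, and the function name is hypothetical.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global GPU rank to (dp, pp, tp) indices, with tp varying fastest."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp                 # neighbours within one server
    pp_idx = (rank // tp) % pp         # pipeline stage, spans servers
    dp_idx = rank // (tp * pp)         # which replica of the whole pipeline
    return dp_idx, pp_idx, tp_idx

GPUS_PER_SERVER = 8
tp, pp, dp = 8, 4, 2                   # hypothetical 64-GPU layout

for rank in (0, 7, 8, 31, 32, 63):
    server = rank // GPUS_PER_SERVER
    print(rank, "server", server, "(dp, pp, tp) =", rank_to_coords(rank, tp, pp, dp))
```

In this layout ranks 0 to 7 share a server and differ only in their tensor-parallel index, so the chattiest traffic stays on the fastest links, while pipeline and data-parallel traffic crosses the fabric.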

At this scale, things break, constantly

Here is the part outsiders never anticipate. When you run tens of thousands of GPUs for months, hardware failure stops being a rare event and becomes a routine one. Individual GPUs fail, servers crash, network links flap, memory corrupts. Across a cluster that large, something breaks every few hours on average. And because all those GPUs are working on one tightly coupled computation, a single failure can stall the entire run: the whole synchronized group grinds to a halt waiting for the piece that died. Left unmanaged, a frontier training run would never finish, because it would never go more than a few hours without something tripping it.
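The arithmetic behind “every few hours” is simple if you assume failures are independent and happen at a steady rate: the cluster’s failure rate is the per-GPU rate multiplied by the number of GPUs. The per-GPU figure below is a hypothetical placeholder, not a measured value.

```python
def cluster_hours_between_failures(gpu_mtbf_hours, num_gpus):
    """Mean time between failures for the whole cluster, assuming
    independent failures at a constant per-GPU rate."""
    return gpu_mtbf_hours / num_gpus

# Hypothetical: each GPU fails about once every 50,000 hours (~5.7 years).
gpu_mtbf_hours = 50_000
num_gpus = 16_384

mtbf = cluster_hours_between_failures(gpu_mtbf_hours, num_gpus)
run_hours = 90 * 24
print(f"cluster-wide failure roughly every {mtbf:.1f} hours")
print(f"expected failures in a 90-day run: {run_hours / mtbf:.0f}")
```

A component that looks very reliable on its own (years between failures) still produces a failure every few hours once you run sixteen thousand of them together, and this count ignores servers, links, and software.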

The defence is checkpointing: periodically saving the entire training state (the weights, the optimizer numbers, and the run’s position in the training data) to storage, so that when something fails the run can resume from the last save rather than starting over. Get this wrong and a failure six weeks in could cost you six weeks. Get it right and you lose only the work since the last checkpoint, typically minutes to an hour. This is why so much frontier-training engineering is really fault-tolerance engineering: detecting failures fast, swapping in spare hardware, and restarting from a checkpoint with minimal lost work. It is also where the high-throughput storage of Part 26 earns its keep, because writing a multi-terabyte checkpoint quickly, without stalling the GPUs, is its own demanding problem. The glamorous part is the model; the part that actually determines whether you finish is the unglamorous machinery of surviving constant failure.
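The save-and-resume loop can be sketched in a few lines. This is a toy under stated assumptions: a single number stands in for the model, JSON on local disk stands in for distributed checkpoint storage, and the failure is raised on purpose. The function names are illustrative, not from any training framework.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write
    never corrupts the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}

def train(path, total_steps, checkpoint_every, fail_at=None):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["weight"] += 0.1          # stand-in for a real optimizer step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(path, total_steps=100, checkpoint_every=10, fail_at=57)
except RuntimeError:
    pass                                # a real system would swap in spare hardware here

resumed_from = load_checkpoint(path)["step"]
final = train(path, total_steps=100, checkpoint_every=10)
print(f"failed at step 57, resumed from step {resumed_from}, finished at step {final['step']}")
```

The failure at step 57 costs only the seven steps since the checkpoint at step 50. Checkpointing more often shrinks that loss but spends more time writing, which is the trade-off a real run has to tune.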

[Figure: Survive failure: checkpoint and restart. Train and save checkpoints often; a GPU or node fails; detect and swap in spare hardware; resume from the last checkpoint. Lose minutes, not weeks.]
The checkpoint is the safety net that turns inevitable failures into minor setbacks.
Reality check: almost no one needs to do this, and that is worth saying plainly. Frontier pretraining is the domain of a handful of very well-resourced labs. For everyone else, the practical lesson is the opposite of “learn to train across 10,000 GPUs”: it is to appreciate what those labs absorb so you do not have to, and to build on their models through the fine-tuning, RAG, and serving this series spent its first two-thirds on. Knowing how the sausage is made is useful; making it yourself rarely is.
Go Deeper (optional, for technical readers)

The memory math is what forces each parallelism choice, so it is worth making concrete. In mixed-precision training with the common Adam optimizer, per parameter you typically hold: a 16-bit weight (2 bytes) and its 16-bit gradient (2 bytes), plus a 32-bit master copy of the weight and Adam’s two 32-bit moment estimates (12 bytes). That is roughly 16 bytes per parameter just for the model and optimizer, eight times the 2 bytes inference needs. A 70-billion-parameter model therefore implies about 1.1 terabytes of state before you count activations, far more than any single GPU can hold. ZeRO and FSDP attack this by sharding the optimizer states, gradients, and even the weights across the data-parallel GPUs, so each holds only a fraction and gathers the rest on demand, trading extra communication for drastically lower per-GPU memory.
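The same arithmetic, written out as a small calculator. It uses the 2 + 2 + 12 bytes-per-parameter breakdown above and the usual description of ZeRO’s three stages (shard the optimizer states, then also the gradients, then also the weights). The 64-GPU group size is a hypothetical choice, and activations and other overheads are ignored.

```python
def bytes_per_gpu(params, num_gpus, zero_stage):
    """Per-GPU bytes for weights, gradients, and Adam state in mixed precision.
    zero_stage 0 = plain data parallelism (everything replicated)."""
    weights = 2 * params      # 16-bit weights
    grads = 2 * params        # 16-bit gradients
    optim = 12 * params       # 32-bit master weights + two Adam moments
    if zero_stage >= 1:
        optim /= num_gpus     # stage 1: shard optimizer states
    if zero_stage >= 2:
        grads /= num_gpus     # stage 2: also shard gradients
    if zero_stage >= 3:
        weights /= num_gpus   # stage 3: also shard the weights
    return weights + grads + optim

params = 70_000_000_000       # 70B parameters
num_gpus = 64                 # hypothetical data-parallel group size

for stage in range(4):
    gb = bytes_per_gpu(params, num_gpus, stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:,.1f} GB per GPU")
```

With nothing sharded, each GPU would need the full 1,120 GB. With everything sharded across 64 GPUs the figure drops to 17.5 GB each, at the cost of gathering weights over the network whenever a layer needs them.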

The other big lever is activation memory, which scales with batch size times sequence length times model width and can rival the weights. Activation checkpointing (also called gradient checkpointing) discards most activations during the forward pass and recomputes them during the backward pass, trading roughly a third more compute (in effect one extra forward pass) for a large memory saving. Putting it together, a frontier configuration is solved like a constraint problem: tensor parallelism degree is set so a layer’s compute and memory fit within a fast NVLink island; pipeline depth is set so the model’s layers fit across nodes while keeping pipeline bubbles small; data parallelism multiplies the whole thing for throughput; and ZeRO-style sharding plus activation checkpointing squeeze the memory until it all fits. Change the GPU, the interconnect, or the model shape, and the optimal arrangement shifts. This is why large-scale training is a genuine engineering discipline, not a button you press.
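“Keeping pipeline bubbles small” can also be made concrete. For a simple schedule in which each batch is cut into microbatches that flow through the stages one after another (the GPipe-style schedule, assumed here for illustration), the fraction of time stages sit idle is (stages − 1) / (microbatches + stages − 1). Other schedules have different formulas.

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a simple GPipe-style pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

stages = 8                     # hypothetical pipeline depth
for microbatches in (1, 8, 64):
    idle = bubble_fraction(stages, microbatches)
    print(f"{stages} stages, {microbatches:>2} microbatches: {idle:.1%} idle")
```

With a single microbatch, seven of the eight stages are idle at any moment. Cutting the batch into 64 microbatches brings the idle share below 10%, which is why deep pipelines rely on many small microbatches, and why more microbatches in turn put pressure on activation memory.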

This is Part 28 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It is the training-scale counterpart to the serving-scale Part 25 and Part 26.

The Bottom Line

Training at the frontier is a feat of coordination as much as computation. You split one model and one dataset across thousands of GPUs using four kinds of parallelism at once (data, tensor, pipeline, and expert), arranged to fit both the math and the punishing memory budget that training’s gradients and optimizer states impose. And because hardware fails constantly at that scale, you wrap the whole thing in relentless checkpointing so that inevitable breakages cost minutes instead of weeks.

The honest takeaway is double-edged: this is genuinely some of the most impressive systems engineering in the world, and almost none of us should ever attempt it. Its real value to everyone downstream is that a handful of labs absorb this staggering complexity so the rest of us can build on the results. One of the architectural ideas reshaping how those results are built, and the “expert parallelism” we kept deferring, is mixture-of-experts, and that, along with where AI architecture is heading next, is the subject of the penultimate part.




About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
