Dr. Pranay Jha

Why GenAI Runs on GPUs, and the Memory Wall That Limits It (GenAI Series, Part 23)

Models run on GPUs for parallel matrix math, but generating text is limited by memory, not compute. Why bandwidth caps speed, VRAM caps what runs, and the KV cache fills the gap.

Read time: 13 minutes

Generative AI Series · Part 23 of 30

TL;DR · Key Takeaways

  • Models run on GPUs because their math is thousands of small operations done at once, exactly what a GPU’s many parallel cores are built for.
  • The surprise: when generating text, a GPU is mostly waiting for memory, not crunching numbers. To make each token it must read the entire model from memory.
  • So memory bandwidth, how fast data moves, often matters more than raw compute, and VRAM capacity is the hard ceiling on what you can run at all.
  • The KV cache is a second, growing memory cost that scales with context length and is frequently the thing that fills up your GPU.

Here is a fact that surprises almost everyone the first time: a GPU running a large language model spends most of its time doing nothing. Not idle in the sense of unused, but stalled, waiting for data to arrive from memory so its powerful arithmetic units have something to chew on. We picture these chips as relentless number-crunchers, and for some jobs they are. But generating text, one token at a time, turns out to be less a compute problem than a logistics problem, and that single insight explains the design of every serving system in the rest of this phase. To see why, start with the more basic question: why a GPU at all?

[Figure: Few strong cores, or thousands of small ones. CPU: a few powerful cores, great at sequential work. GPU: thousands of small cores, great at doing one thing in bulk.]
A model is mostly the same simple operation repeated a trillion times. That is a GPU’s native language.

Why GPUs, not CPUs

A CPU is a brilliant generalist. It has a handful of powerful cores designed to do complicated, varied tasks quickly, one after another. That is perfect for running an operating system or a spreadsheet, where each step depends on the last. A GPU is the opposite kind of specialist: thousands of simpler cores that all do the same operation at the same time on different data. It was built to colour millions of pixels at once for video games, and that talent turned out to be exactly what neural networks need.

Recall from Part 6 that a model is, underneath, a tower of matrix multiplications, vast grids of numbers multiplied and added. A matrix multiply is the same simple multiply-and-add repeated across thousands of independent positions, with no step waiting on another. Hand that to a CPU’s few cores and they grind through it sequentially; hand it to a GPU’s thousands of cores and they do enormous chunks of it simultaneously. For the workload that defines AI, the GPU is not a little faster than a CPU, it is orders of magnitude faster, which is why the entire field effectively runs on them. So far this is the story most people know. The interesting part is what limits the GPU once you are using it.
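The independence claim above can be shown in a few lines. This is a minimal sketch in plain Python, not how any real library does it: the helper names `matmul_cell` and `matmul` are made up for illustration, and the matrices are tiny. Each output cell is computed from its own row and column only, so the cells can be filled in any order (or, on a GPU, all at once) and the result is the same.

```python
def matmul_cell(a, b, row, col):
    """One output cell: a chain of multiply-adds that depends on no other cell."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))


def matmul(a, b, order=None):
    """Multiply matrices given as lists of rows, filling cells in any order."""
    rows, cols = len(a), len(b[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if order is not None:
        cells = order(cells)  # hypothetical hook: reorder the work list
    out = [[0] * cols for _ in range(rows)]
    for r, c in cells:
        out[r][c] = matmul_cell(a, b, r, c)
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
forward = matmul(a, b)
backward = matmul(a, b, order=lambda cells: list(reversed(cells)))
print(forward)              # [[19, 22], [43, 50]]
print(forward == backward)  # True
```

Because no cell waits on another, the work list can be split across as many cores as you have, which is the property a GPU exploits.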

The twist: the bottleneck is moving data, not crunching it

A GPU has two relevant capabilities, and people fixate on the wrong one. There is its compute, how many calculations per second it can perform, and there is its memory bandwidth, how fast it can move data between its memory and its cores. For generating text token by token, the second one is usually the binding constraint. Here is why: to produce a single next token, the GPU must read every one of the model’s weights out of memory and use each one exactly once. A 70-billion-parameter model means reading about 140GB of weights at 16-bit precision, and still tens of gigabytes after quantization, for one token. The actual arithmetic on those weights is quick; the slow part is hauling all that data from memory to the cores.

So during text generation the cores are frequently sitting idle, waiting on the memory pipe to deliver the next slab of weights. This is the memory wall: performance limited not by how fast the chip can think but by how fast it can be fed. It reframes what makes a GPU good for inference. A card with monstrous compute but modest memory bandwidth will underperform on token generation, while bandwidth is the spec that often predicts real throughput. It also explains, as Part 22 hinted, why batching is magic: if you read the weights once and use them for fifty requests at the same time, you have amortised the expensive memory read across fifty tokens instead of one. The compute was never the problem.
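The memory wall turns into a back-of-envelope bound you can compute. The sketch below is an idealised estimate, not a benchmark: it assumes a 70B model quantized to 4 bits (about 35GB of weights) and a round 3 TB/s of memory bandwidth, both illustrative figures, and it ignores KV-cache reads and compute time. The function names are my own.

```python
def decode_step_seconds(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream every weight once."""
    return weight_bytes / bandwidth_bytes_per_s


def max_tokens_per_second(weight_bytes, bandwidth_bytes_per_s, batch_size=1):
    """Upper bound on total tokens/s if each weight read serves the whole batch."""
    return batch_size / decode_step_seconds(weight_bytes, bandwidth_bytes_per_s)


GB = 1e9
TB = 1e12

weights = 35 * GB     # assumed: 70B parameters at 0.5 bytes each (4-bit)
bandwidth = 3 * TB    # assumed round figure for a data-center GPU

single = max_tokens_per_second(weights, bandwidth)
batched = max_tokens_per_second(weights, bandwidth, batch_size=50)
print(f"{decode_step_seconds(weights, bandwidth) * 1000:.1f} ms per step")  # 11.7 ms per step
print(f"{single:.1f} tokens/s for one request")                             # 85.7 tokens/s
print(f"{batched:.0f} tokens/s across a batch of 50")                       # 4286 tokens/s
```

The step time is the same in both cases; batching simply gets fifty tokens out of one pass over the weights. Real systems fall short of the batched figure because the KV cache also has to be read and compute eventually becomes the limit.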

[Figure: The memory wall, in one picture. GPU memory (holds the model weights) feeds the cores (fast, but often waiting) through a pipe that is the bottleneck: all weights are read for every token.]
The cores are quick. Feeding them the whole model for every single token is what takes the time.

VRAM: the ceiling you cannot argue with

Bandwidth governs speed, but memory capacity, the amount of VRAM on the card, governs whether you can run a model at all. Everything the GPU needs during inference has to fit in that memory at once: the model weights, the working activations, and the KV cache we are about to meet. If the total exceeds the card’s VRAM, the model does not run slowly, it simply fails to load, the dreaded out-of-memory error. This is the hardest wall in the whole stack, because no amount of patience gets you past it. It is why quantization from Part 20 matters so much in practice: halving the bytes per weight can be the difference between fitting on the GPU you have and not running at all.

This capacity ceiling is the reason GPU memory size, the 24GB, 48GB, 80GB on a card’s spec sheet, is the first number professionals look at. It sets the menu of what you can serve. A model’s weights might take 40GB, leaving the rest for everything else, and how much “everything else” you need depends heavily on how many users you serve and how long their conversations run. Sizing a deployment is, more than anything, a memory budgeting exercise, working out what must coexist in VRAM and ensuring it fits with room to spare.

To make this concrete, take the data-center cards in common use as of 2026. An NVIDIA A100 ships with 40GB or 80GB; an H100 has 80GB; and newer accelerators push to 141GB (H200) or 192GB (AMD’s MI300X). Their memory bandwidth, the figure that governs the speed wall above, runs from roughly 1.6 to 2 TB/s on an A100, depending on the variant, to over 3 TB/s on an SXM H100. Now do the arithmetic for a 70-billion-parameter model: at FP16 (2 bytes per parameter) the weights alone are about 140GB, which will not fit on a single 80GB card, so you either split it across two GPUs or quantize to INT4 (about 35GB) to land it on one. That single calculation, weights in bytes versus a card’s VRAM, is the first sizing question every infrastructure team answers. And while this series uses NVIDIA as the running example because it dominates the field, the same memory math applies to AMD’s ROCm cards and other accelerators: the vendor changes, the wall does not.
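The weights-versus-VRAM calculation is simple enough to script. This is a rough sketch with names of my own choosing; it counts weights only, treats 1 GB as 1e9 bytes, and ignores the KV cache, activations, and the small overhead that quantization formats add for scales.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion, precision):
    """Weight footprint in GB (1 GB = 1e9 bytes): parameters x bytes each."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billion, precision, vram_gb):
    """Fewest cards whose combined VRAM holds the weights alone."""
    return math.ceil(weight_gb(params_billion, precision) / vram_gb)


for precision in ("fp16", "int4"):
    size = weight_gb(70, precision)
    cards = min_gpus_for_weights(70, precision, vram_gb=80)
    print(f"70B at {precision}: {size:.0f} GB of weights, at least {cards} x 80GB card(s)")
# 70B at fp16: 140 GB of weights, at least 2 x 80GB card(s)
# 70B at int4: 35 GB of weights, at least 1 x 80GB card(s)
```

Treat the card count as a floor, not a recommendation: it says the weights fit, not that there is room left to serve anyone.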

The KV cache, the memory that grows as you talk

There is a second memory cost that quietly dominates at scale, and it follows directly from attention. To generate each new token, the model needs to attend back over all the previous tokens. Recomputing the attention information for the entire history on every step would be wildly wasteful, so models cache it: for every token processed, they store its key and value vectors and reuse them. This KV cache is what lets generation stay fast. But it has to live in VRAM, and it grows with every token in the conversation, for every concurrent user.

That growth is the catch. A long context or a large batch of simultaneous users can make the KV cache balloon until it rivals or exceeds the size of the model weights themselves. Suddenly the thing filling your GPU is not the model but the accumulated conversation state of everyone using it. This is the real, physical reason long context windows are expensive and why serving many users at once is a memory juggling act rather than a compute one. Much of the cleverness in modern inference engines, which the next part is about, comes down to managing this one growing cache so that more users and longer contexts fit in the same fixed VRAM.

[Figure: What fills the GPU memory. Short conversation, few users: model weights (fixed), KV cache, free space. Long conversations, many users: model weights (fixed), the KV cache has grown, free space nearly gone. The weights are constant; the KV cache is the part that quietly eats the GPU.]
Run out of room here and you serve fewer users or shorter contexts. It is the central trade in serving.

Reality check: when sizing GPUs, do not just check that the model weights fit. The mistake I see most often is budgeting VRAM for the weights and forgetting the KV cache, then watching the system fall over the moment real users hold long conversations. Plan memory for the weights plus the worst-case cache at your target concurrency and context length, or your “it fits” will become “it crashed under load.”
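
The budgeting rule in that reality check can be written down as a quick check. This is a sketch under stated assumptions, not a sizing tool: the function names, the 0.3MB-per-token cache figure, the 2GB overhead, and the 10% headroom are all illustrative values I picked, and you would substitute measurements from your own model and engine.

```python
def vram_budget_gb(weights_gb, kv_mb_per_token, context_tokens, concurrent_users,
                   overhead_gb=0.0):
    """Worst-case VRAM: weights + every user at full context + other overhead."""
    kv_gb = kv_mb_per_token * context_tokens * concurrent_users / 1000
    return weights_gb + kv_gb + overhead_gb


def fits(total_gb, vram_gb, headroom=0.10):
    """True if the total fits while keeping a fraction of VRAM free."""
    return total_gb <= vram_gb * (1 - headroom)


# Assumed scenario: 35GB of weights on one 80GB card, 8,000-token contexts.
for users in (1, 16):
    total = vram_budget_gb(35, kv_mb_per_token=0.3, context_tokens=8000,
                           concurrent_users=users, overhead_gb=2.0)
    print(f"{users:>2} user(s): {total:.1f} GB needed, fits: {fits(total, 80)}")
#  1 user(s): 39.4 GB needed, fits: True
# 16 user(s): 75.4 GB needed, fits: False
```

The same model on the same card passes with one user and fails with sixteen, which is exactly the "it fits" that becomes "it crashed under load."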

Go Deeper (optional, for technical readers)

The formal tool here is arithmetic intensity: the ratio of compute operations to bytes moved from memory. Plot a workload on a roofline chart and if its arithmetic intensity is low, it is memory-bound (limited by bandwidth); if high, it is compute-bound (limited by the cores). Token-by-token generation, the decode phase, has dismally low arithmetic intensity: for each weight you read from memory you do roughly one multiply-add, so you are almost entirely bandwidth-bound and the expensive compute units idle. This is the precise, measurable reason single-stream generation wastes a GPU’s arithmetic might.
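The roofline idea reduces to taking a minimum. The sketch below assumes round, illustrative hardware numbers (1,000 TFLOP/s of FP16 compute and 3 TB/s of bandwidth, not any specific card's datasheet) and counts one multiply-add as 2 FLOPs per 2-byte FP16 weight, which gives decode an intensity of about 1 FLOP per byte.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOPs per byte) where the two limits meet."""
    return peak_flops / bandwidth


PEAK = 1000e12      # assumed FP16 compute in FLOP/s (illustrative)
BANDWIDTH = 3e12    # assumed memory bandwidth in bytes/s (illustrative)

# Decode at FP16: one multiply-add (2 FLOPs) per 2-byte weight read.
decode_intensity = 2 / 2

used = attainable_flops(decode_intensity, PEAK, BANDWIDTH)
print(f"ridge point: {ridge_point(PEAK, BANDWIDTH):.0f} FLOPs/byte")  # 333 FLOPs/byte
print(f"decode intensity: {decode_intensity:.0f} FLOP/byte")          # 1 FLOP/byte
print(f"share of peak compute reachable: {used / PEAK:.1%}")          # 0.3%
```

Under these assumptions a single decode stream can reach well under one percent of the chip's arithmetic peak, which is the idle-cores picture in numbers.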

It is worth seeing the KV-cache size as a formula, because it is what you actually budget against. For each token you cache a key and a value in every layer, so the memory is roughly 2 (key and value) × layers × hidden size × bytes per number, per token, multiplied by the tokens in context and the number of concurrent sequences. A rough worked figure: a 70B-class model with 80 layers and an 8192 hidden dimension at 2 bytes per number stores on the order of 2.5MB of KV cache per token. Fill a 100,000-token context and that is roughly 250GB for one long conversation, more than the weights themselves and far past any single card. Modern models use grouped-query attention to share keys and values across heads, cutting this by a large factor, but even after that reduction the KV cache is the dominant memory pressure at long context and high concurrency, which is exactly why every serving trick in the next parts is ultimately about taming this one number.
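Here is that formula as code, using the same 80-layer, 8192-wide, 2-byte example. It is a sketch with names of my own: the `kv_share` parameter and the 8-of-64 head ratio used for grouped-query attention are assumptions for illustration, so check your model's actual configuration. The results land slightly above the rounded 2.5MB and 250GB in the text because the code uses decimal megabytes.

```python
def kv_bytes_per_token(layers, hidden_size, bytes_per_value=2, kv_share=1.0):
    """2 (key + value) x layers x hidden size x bytes, scaled for shared heads.

    kv_share is the fraction of heads that keep their own keys and values:
    1.0 for full multi-head attention, kv_heads / query_heads under GQA.
    """
    return 2 * layers * hidden_size * bytes_per_value * kv_share


def kv_cache_gb(per_token_bytes, context_tokens, sequences=1):
    """Total cache in GB (1 GB = 1e9 bytes) for all concurrent sequences."""
    return per_token_bytes * context_tokens * sequences / 1e9


full = kv_bytes_per_token(layers=80, hidden_size=8192)
gqa = kv_bytes_per_token(layers=80, hidden_size=8192, kv_share=8 / 64)  # assumed ratio

print(f"full attention: {full / 1e6:.2f} MB per token")                            # 2.62 MB
print(f"full attention, 100,000 tokens: {kv_cache_gb(full, 100_000):.0f} GB")      # 262 GB
print(f"GQA (assumed 8:64), 100,000 tokens: {kv_cache_gb(gqa, 100_000):.0f} GB")   # 33 GB
```

Even with the assumed eightfold reduction, one 100,000-token conversation costs about as much memory as the 4-bit weights of the model itself, and that is before multiplying by concurrent users.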

This also explains the two-phase personality of inference. Prefill, processing the input prompt, handles many tokens at once, so it reuses each weight across many positions, giving high arithmetic intensity and behaving compute-bound. Decode, generating output one token at a time, reuses each weight only once per step, giving low intensity and behaving memory-bound. The same model on the same GPU is limited by different things in these two phases, which is why advanced serving systems schedule and even physically separate them. Batching is the main lever against the decode problem: stacking many sequences together raises arithmetic intensity by reusing each weight read across the whole batch, pushing decode back toward the compute roofline and lifting tokens-per-second-per-dollar. If you want the memory math worked through for real deployments, including how to size weights plus KV cache against a card’s VRAM, I cover it in my Private AI sizing and cost write-up.
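To see how batching moves decode along the roofline, you can estimate intensity as a function of batch size. This is a simplified sketch: it counts only weight traffic, uses the same illustrative hardware figures as a round-number GPU (1,000 TFLOP/s, 3 TB/s, both assumptions), and ignores KV-cache reads, which grow with batch and context and push the real break-even point further out.

```python
import math


def decode_intensity(batch_size, bytes_per_weight=2):
    """FLOPs per byte in decode: each weight read does one multiply-add per sequence."""
    return 2 * batch_size / bytes_per_weight


def break_even_batch(peak_flops, bandwidth, bytes_per_weight=2):
    """Smallest batch whose intensity reaches the ridge point (peak / bandwidth)."""
    return math.ceil(peak_flops / bandwidth * bytes_per_weight / 2)


PEAK = 1000e12      # assumed FP16 compute in FLOP/s (illustrative)
BANDWIDTH = 3e12    # assumed memory bandwidth in bytes/s (illustrative)

for batch in (1, 8, 64):
    print(f"batch {batch:>2}: {decode_intensity(batch):.0f} FLOPs/byte")
print(f"compute-bound from roughly batch {break_even_batch(PEAK, BANDWIDTH)}")
# batch  1: 1 FLOPs/byte
# batch  8: 8 FLOPs/byte
# batch 64: 64 FLOPs/byte
# compute-bound from roughly batch 334
```

The takeaway is the direction, not the exact number: every extra sequence in the batch is almost free until the cache, not the weights, starts to dominate memory traffic.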

This is Part 23, the start of Phase 5, in a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It pays off quantization (Part 20) and the cost breakdown (Part 22).

GPU sizing cheat sheet (weights only)

Model size   FP16 weights   INT4 weights   Typically fits on
7B           ~14 GB         ~4 GB          A workstation/laptop GPU
13B          ~26 GB         ~7 GB          One mid-range GPU
70B          ~140 GB        ~35 GB         2+ GPUs at FP16, or 1x 40-80GB at INT4
175B         ~350 GB        ~88 GB         A multi-GPU node

Weights only. Add the KV cache (which grows with context and concurrency) on top.

The Bottom Line

Generative AI runs on GPUs because its core work, massive parallel matrix math, is precisely what GPUs were born to do. But once you are on the GPU, the limit is rarely raw compute. To generate each token the chip must haul the entire model out of memory, so memory bandwidth sets the speed and the cores spend much of their time waiting, the memory wall. And memory capacity, VRAM, sets the harder ceiling: the weights, the activations, and the ever-growing KV cache must all fit, or nothing runs.

Hold these two ideas, bandwidth limits speed and capacity limits possibility, and the rest of Phase 5 falls into place, because every serving trick ahead is an answer to one of them. The KV cache in particular is the pressure point: manage it well and you fit more users and longer contexts on the same hardware; manage it badly and you crash. That management is exactly what inference engines are built for, and comparing the leading ones is where we go next.

Frequently Asked Questions

Why do generative AI models need GPUs?

Their core work is massive parallel matrix multiplication, which GPUs with thousands of cores perform far faster than CPUs. Running large models on CPUs alone is impractically slow.

What is the memory wall in AI?

The memory wall is when a GPU spends most of its time waiting for data rather than computing. Generating each token requires reading the entire model from memory, so memory bandwidth, not raw compute, often limits speed.

How much GPU memory does a 70B model need?

Roughly 140GB at 16-bit precision, which exceeds a single 80GB card, so you split it across GPUs or quantize to about 35GB at 4-bit to fit on one, before counting the KV cache.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
