Dr. Pranay Jha

GPU Memory and Precision: HBM3e, HBM4 and What Actually Fits (NVIDIA AI Series, Part 4)

A 70B model in FP16 needs 140 GB of weights before a single token of context. Here is the GPU memory and precision math that decides what fits, why HBM (not FLOPS) is the real ceiling, and where FP8 and NVFP4 buy you headroom.

NVIDIA AI Series · Part 4 of 30

TL;DR

HBM capacity, not raw FLOPS, is what decides whether a model runs on a given GPU. Your memory budget is three things: model weights, the KV cache, and a few GB of overhead. Weights scale linearly with precision, so dropping from FP16 to FP8 halves the weight footprint and NVFP4 roughly quarters it.

The capacity ladder in 2026 runs H100 80 GB, H200 141 GB, B200 192 GB, B300 288 GB, and Rubin with HBM4 at 288 GB. FP8 is the safe inference default on Hopper and newer; NVFP4 on Blackwell buys another ~1.8x over FP8 with under ~1% accuracy loss on many models, but you must measure it on your model, not trust the brochure.

Who this is for: Infrastructure architects and platform engineers sizing GPU servers for LLM inference or fine-tuning. Prerequisites: you know what an LLM parameter is and have read Part 3 on the GPU lineup. No CUDA programming required.

A 70-billion-parameter model in FP16 needs 140 GB of weights on the GPU before you serve a single token of context. That one number ends a lot of architecture debates. It is why a single H200 with 141 GB cannot host a 70B model in FP16 with any room left for a request, and why the conversation about "which GPU" is really a conversation about memory and precision. FLOPS get the headlines. Memory pays the bills.

The number that decides everything: bytes per parameter

Weight footprint is parameters times bytes per parameter. Nothing else. Precision is the dial that sets the bytes, and it moves the footprint in big linear steps. FP32 is four bytes and lives almost entirely in training reference math now. BF16 and FP16 are two bytes and are the baseline most people picture. FP8 is one byte. The 4-bit formats are half a byte plus a small scale overhead. Read the table top to bottom and you are watching a 70B model shrink from a two-GPU problem into a single-GPU problem.

| Precision | Bytes / param | 70B weights | Where it lives |
|---|---|---|---|
| FP32 | 4 | 280 GB | Reference / select training math |
| BF16 / FP16 | 2 | 140 GB | Baseline weights and training |
| FP8 (E4M3) | 1 | 70 GB | Inference default, Hopper and newer |
| NVFP4 / MXFP4 | ~0.5 + scale | ~38 GB | Blackwell inference (and some training) |

Weight footprint for a 70B model by precision. The NVFP4 figure includes block-scale overhead, which is why it is above the naive 35 GB.

[Figure: Precision is a capacity lever. 70B weight footprint as precision drops: FP32 280 GB, FP16 140 GB, FP8 70 GB, NVFP4 ~38 GB, plotted against the 141 GB H200 line.]
Halving the bytes per parameter halves the weight bar. FP8 is the first format where a 70B model clears the 141 GB H200 line with room to spare.
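The rule "parameters times bytes per parameter" is small enough to script. The following is a minimal sketch, not a sizing tool: the function name `weight_gb` and the dictionary are illustrative, sizes are in decimal GB as in the table, and the 4-bit row deliberately ignores block scales, so it reproduces the naive 35 GB rather than the ~38 GB figure.

```python
# Weight footprint = parameters x bytes per parameter (decimal GB, as in the table).
# The 4-bit entry ignores block-scale overhead, so it shows the naive figure.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "FP8": 1.0, "FP4 (no scales)": 0.5}


def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Return the weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters times N bytes is N * 1e9 bytes, i.e. N GB per billion parameters.
    return params_billions * bytes_per_param


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>16}: {weight_gb(70, nbytes):6.1f} GB")
```

Swap 70 for your own parameter count (in billions) to get the first term of the memory budget.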

HBM is the real constraint, not FLOPS

Why care about the memory type at all? Because LLM inference is memory-bound for most of its life. During decode, generating each token means streaming the entire weight set from HBM through the tensor cores. A GPU that can do enormous math but can only feed it 3.35 TB/s of weights spends most of its time waiting on memory. That is the whole reason H200 exists as a Hopper refresh: same compute as H100, but 141 GB of HBM3e at 4.8 TB/s instead of 80 GB at 3.35 TB/s. More capacity, more bandwidth, more tokens per second, no new compute silicon.
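To see why bandwidth translates into tokens per second, here is a rough ceiling calculation. It is a sketch under a simplifying assumption: one sequence, every token reads all weights from HBM exactly once, and KV cache reads, kernel overheads and batching are ignored. Real throughput will be lower; the function name is illustrative.

```python
def decode_tokens_per_sec_ceiling(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Upper bound on single-sequence decode rate if each token streams all weights once."""
    bytes_per_sec = bandwidth_tb_s * 1e12   # TB/s to bytes/s
    bytes_per_token = weights_gb * 1e9      # GB of weights read per generated token
    return bytes_per_sec / bytes_per_token


# 70B model in FP8 (70 GB of weights), bandwidth figures from the text above
for gpu, bandwidth in [("H100", 3.35), ("H200", 4.8)]:
    print(gpu, round(decode_tokens_per_sec_ceiling(bandwidth, 70), 1))
# H100 47.9
# H200 68.6
```

The ratio between the two results is just the ratio of the bandwidths, which is the point: with the same compute, the H200's ceiling moves only because memory got faster.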

The 2026 capacity ladder

| GPU | Memory | Type | Bandwidth | Notes |
|---|---|---|---|---|
| H100 SXM | 80 GB | HBM3 | 3.35 TB/s | Hopper baseline |
| H200 | 141 GB | HBM3e | 4.8 TB/s | Hopper refresh, capacity bump |
| B200 | 192 GB | HBM3e | 8.0 TB/s | Blackwell, dual-die |
| B300 / GB300 | 288 GB | HBM3e | 8.0 TB/s | Blackwell Ultra, ~1,400 W |
| Rubin R100 | 288 GB | HBM4 | ~13 TB/s (unverified) | Vera Rubin, ~H2 2026 |

Per-GPU memory. HBM4 on Rubin is the bandwidth story for 2026; capacity holds at 288 GB while bandwidth jumps.
In practice: When someone asks for "an H100 box" for a 70B model, I push back. In FP8 a 70B model is 70 GB of weights, which fits one 80 GB H100 with almost nothing left for KV cache. You end up serving one or two requests. An H200 or a single Blackwell GPU is the honest minimum for real concurrency. The cheaper card is not cheaper once it cannot hold the working set.

What actually fits: the three-part memory budget

Every byte of GPU memory at inference time goes to one of three places. Get this model in your head and you can size any deployment on the back of a napkin.

Weights are fixed: parameters times bytes, as above. KV cache grows with how much context you hold live, across every concurrent request. Overhead is activations plus the runtime, usually a few GB. Weights you can shrink with precision. The KV cache is the part that quietly eats a GPU when traffic climbs, because it scales with sequence length times batch size.

[Figure: One GPU, three claims on HBM. 70B in FP8 on a 141 GB H200: weights 70 GB (fixed by precision; FP8 halves this vs FP16), KV cache ~55 GB (grows with context length x concurrency; the real risk), plus overhead.]
FP8 weights (70 GB) leave 71 GB on an H200, of which roughly 55 to 65 GB is realistically available for KV cache once overhead and headroom come out. In FP16 the weights alone (140 GB) leave nothing.

Worked example

Llama-3-70B has 80 layers, 8 KV heads (grouped-query attention) and a head dimension of 128. KV cache per token in FP16 is 2 (K and V) × 80 × 8 × 128 × 2 bytes = 327,680 bytes, about 320 KB per token.

At an 8K context that is ~2.5 GB for a single sequence. Serve 32 concurrent 8K sequences and the KV cache alone is ~80 GB. Add 70 GB of FP8 weights and you are at ~150 GB, over a single H200. The fix is one of three things: fewer concurrent long contexts, KV-cache quantization to FP8 (which roughly halves that 80 GB), or a 192 GB B200.
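The worked example can also be turned around to ask how many full-length sequences fit beside the weights. This is a back-of-the-napkin sketch: the 5 GB overhead is an assumed placeholder for "a few GB", the function name is illustrative, and it assumes every sequence holds its full context at once, with no paging or prefix sharing.

```python
def max_concurrent_sequences(hbm_gb, weights_gb, overhead_gb, kv_bytes_per_token, seq_len):
    """Largest whole number of full-length sequences whose KV cache fits beside the weights."""
    free_bytes = (hbm_gb - weights_gb - overhead_gb) * 1e9
    kv_bytes_per_seq = kv_bytes_per_token * seq_len
    return max(0, int(free_bytes // kv_bytes_per_seq))


KV_FP16 = 2 * 80 * 8 * 128 * 2   # 327,680 bytes per token, from the worked example
KV_FP8 = KV_FP16 // 2            # FP8 KV cache stores one byte per element

# 70 GB of FP8 weights, 8K context, assumed 5 GB overhead
print(max_concurrent_sequences(141, 70, 5, KV_FP16, 8192))  # H200, FP16 KV -> 24
print(max_concurrent_sequences(141, 70, 5, KV_FP8, 8192))   # H200, FP8 KV  -> 49
print(max_concurrent_sequences(80, 70, 5, KV_FP16, 8192))   # H100, FP16 KV -> 1
```

Under these assumptions an H200 tops out at 24 concurrent 8K sequences with an FP16 KV cache, short of the 32 in the example, while an 80 GB H100 holds one.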

This is why the sizing question is never "does the model fit" but "does the model plus the traffic I expect fit". See the deeper TCO math in the Private AI sizing and cost guide.

FP8 and FP4: precision is a capacity lever, with a catch

Hopper introduced FP8 and Blackwell added native FP4 through its second-generation Transformer Engine. The Blackwell FP4 tensor cores deliver roughly double the throughput of FP8 on the same silicon, and the format halves the weight bytes again. That is two wins at once: more memory headroom and more math per second. The catch is accuracy. Four bits is a coarse representation, and how you scale those bits decides whether the model holds up.

NVFP4 vs MXFP4

Both formats store each weight as a 4-bit E2M1 value and share a scale across a small block, so outliers do not wreck the whole tensor. The difference is granularity. NVFP4 uses 16-element blocks with a higher-precision FP8 (E4M3) scale. MXFP4, the open OCP standard, uses 32-element blocks with a power-of-two E8M0 scale. Smaller blocks and a finer scale mean NVFP4 tracks outliers better and generally lands lower perplexity for the same model and calibration set.

| Trait | NVFP4 | MXFP4 |
|---|---|---|
| Block size | 16 elements | 32 elements |
| Scale type | FP8 E4M3 | E8M0 (power of two) |
| Origin | NVIDIA Blackwell | OCP open standard |
| Relative accuracy | Higher per-block | Coarser, simpler |
| Memory vs FP16 | ~3.5x smaller | ~3.5x smaller |

NVFP4 carries about twice the scale overhead of MXFP4 (one 8-bit scale per 16 weights instead of per 32) to get finer granularity. On Blackwell, NVFP4 is the one to reach for first.

[Figure: Block scaling: finer blocks, better outliers. NVFP4 uses a 16-element block with an FP8 scale; MXFP4 uses a 32-element block with an E8M0 scale.]
One shared scale covers fewer weights in NVFP4, so a single large value distorts less of the block.
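The scale overhead is easy to quantify from the block sizes above. The sketch below assumes one 8-bit scale per block and nothing else; it ignores any per-tensor scale and layers kept in higher precision, so treat the outputs as approximations. The function name and the `FORMATS` dictionary are illustrative.

```python
def effective_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits stored per weight once the shared block scale is amortised over the block."""
    return element_bits + scale_bits / block_size


# (element bits, scale bits, block size) taken from the comparison table
FORMATS = {"NVFP4": (4, 8, 16), "MXFP4": (4, 8, 32)}

for name, spec in FORMATS.items():
    bits = effective_bits(*spec)
    weights_gb = 70 * bits / 8   # 70B parameters, decimal GB
    print(f"{name}: {bits} bits/weight, ~{weights_gb:.1f} GB for 70B, "
          f"{16 / bits:.2f}x smaller than FP16")
```

This gives 4.5 bits per weight for NVFP4 and 4.25 for MXFP4, or roughly 39 GB and 37 GB for a 70B model, which brackets the ~38 GB figure used earlier.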
Gotcha: "Under 1% accuracy loss" for NVFP4 is a real result on many models, but it is not a guarantee for yours. Reasoning and code models are more sensitive, and a calibration set that does not look like your traffic can quietly cost you several points on a benchmark you care about. Quantize, then run your own eval. Never ship a 4-bit model on a vendor slide alone.

When not to cut precision

Lower precision is not free, and there are deployments where I leave it alone. The first is training and fine-tuning: you quantize for inference, but the master weights and optimizer states stay in BF16 or FP32, because gradient math at four bits diverges in ways that are expensive to debug. The second is small models. A 7B model in FP16 is 14 GB and fits almost any card, so the accuracy risk of quantizing buys you nothing you needed. The third is anything where a wrong answer is costly: medical coding, legal extraction, financial reconciliation. For those I would rather pay for a second GPU than explain a regression that a four-bit rounding error introduced.

There is also a latency angle people miss. At very low batch sizes the bottleneck is loading weights, so cutting weight bytes with FP8 or NVFP4 genuinely shortens each token. At high batch sizes you become compute-bound, and it is the FP4 tensor-core throughput that helps, not the smaller weights. Knowing which regime you sit in tells you whether precision is buying you capacity, speed, or both. If you cannot say which, measure before you commit, because the answer changes your GPU count.
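One way to guess which regime you are in is to estimate the batch size at which compute time catches up with weight-loading time. This is a first-order sketch resting on several assumptions: about 2 FLOPs per parameter per generated token, weights read once per decode step regardless of batch, KV cache traffic ignored, and a peak FLOPS value that is a hypothetical round number rather than a datasheet figure. Substitute the numbers for your own GPU and precision.

```python
def crossover_batch(peak_flops: float, bandwidth_bytes_s: float, bytes_per_param: float) -> float:
    """Batch size where compute time per decode step equals weight-streaming time.

    memory time  = params * bytes_per_param / bandwidth
    compute time = 2 * params * batch / peak_flops
    Setting them equal cancels the parameter count.
    """
    return peak_flops * bytes_per_param / (2.0 * bandwidth_bytes_s)


PEAK_FLOPS = 2e15     # hypothetical peak for illustration only, not a datasheet value
BANDWIDTH = 3.35e12   # H100 HBM3 bandwidth from the text, in bytes per second

batch = crossover_batch(PEAK_FLOPS, BANDWIDTH, bytes_per_param=1.0)  # FP8 weights
print(f"compute-bound above roughly batch {batch:.0f}")
```

Well below that batch size, smaller weights shorten each token; well above it, tensor-core throughput is what matters. In practice the boundary is blurry, so measuring is still the safer path.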

Measure it: a KV budget you can run

Before you commit to a GPU count, put real numbers behind the KV cache. This snippet is the same math from the worked example, parameterized. Run it for your model and your expected concurrency, and check it against free memory with nvidia-smi.

```python
# kv_budget.py: estimate KV cache footprint on one GPU

def kv_cache_gb(layers, kv_heads, head_dim, bytes_per_elem, seq_len, batch):
    """Return the KV cache size for `batch` sequences of `seq_len` tokens.

    The result is divided by 1024**3, so strictly it is in GiB.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch / 1024**3

# Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 KV (2 bytes)
print(round(kv_cache_gb(80, 8, 128, 2, 8192, 32), 1))   # -> 80.0  (GB)

# Same traffic with FP8 KV cache (1 byte) halves it
print(round(kv_cache_gb(80, 8, 128, 1, 8192, 32), 1))   # -> 40.0  (GB)
```

```shell
# confirm real free HBM before you load the model
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# expected on an idle H200 (illustrative; exact MiB values vary with driver and configuration):
# memory.total [MiB], memory.used [MiB], memory.free [MiB]
# 144384 MiB, 1 MiB, 144383 MiB
```

Failure mode to watch: if free memory is well under total on an "idle" GPU, something else is resident (a stuck process, another container, or a leaked allocation). Loading on top of that is how you get a confusing out-of-memory error halfway through warm-up.

Leave headroom, never pack to the line

One mistake I see often: someone computes weights plus KV cache, sees it lands at 138 GB on a 141 GB H200, and calls it a fit. It is not. The serving runtime allocates the KV cache in blocks and reserves a pool up front, fragmentation means you rarely get the last few percent, and CUDA graphs plus the framework take their own slice. Plan to use roughly 85 to 90 percent of stated HBM and keep the rest as a buffer. The cost of being wrong is not a warning in a log; it is an out-of-memory crash under peak load, which is exactly when you cannot afford one.
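The headroom rule reduces to a one-line check. A minimal sketch, with an illustrative function name and the 85 to 90 percent guidance from the paragraph above as the default; the right fraction for your runtime is something to measure, not assume.

```python
def fits_with_headroom(required_gb: float, hbm_gb: float, usable_fraction: float = 0.90) -> bool:
    """True if weights + KV cache + overhead fit within the usable share of stated HBM."""
    return required_gb <= hbm_gb * usable_fraction


print(fits_with_headroom(138, 141))        # False: only ~126.9 GB of a 141 GB H200 is usable at 90%
print(fits_with_headroom(138, 192))        # True: a 192 GB B200 leaves ~172.8 GB usable
print(fits_with_headroom(110, 141, 0.85))  # True: 110 GB is under ~119.9 GB
```

The 138 GB case from the paragraph fails the check on an H200 even at the generous 90 percent setting.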

The same logic argues for sizing on the next model up, not the one you run today. Weights grow, context windows grow, and a deployment sized to the millimetre this quarter becomes a forklift upgrade next quarter. A little capacity headroom now is far cheaper than re-architecting a serving tier in six months. If you are choosing between two GPUs and the math is close, take the one with more memory.

HBM4 and what changes in 2026

Rubin brings HBM4. The headline for 2026 is bandwidth, not capacity: per-GPU memory holds around 288 GB while bandwidth jumps to roughly 13 TB/s (a pre-launch figure that is still unverified), well over double the H200's 4.8 TB/s. For memory-bound inference that bandwidth translates fairly directly into tokens per second, which is why HBM4 matters even when the capacity number looks flat. The other reality is supply. HBM is the bottleneck of the whole industry right now, and capacity is allocated quarters in advance. If your roadmap assumes Rubin in volume the moment it ships, build a fallback on Blackwell.

My take: I size in FP8 first because it is predictable and the accuracy hit is negligible on most models. I treat NVFP4 as a deliberate optimization you earn with an eval, not a default. And I always size the KV cache for the worst realistic concurrency, not the average, because the average is what looks fine in a demo and the peak is what pages you at 2 a.m.
Disclaimer: Quantizing a production model changes its outputs. Validate on your own evaluation set in a staging environment before you route live traffic, and keep the FP8 or FP16 build available for rollback.

The Verdict

Size by memory, not by FLOPS. Compute the three-part budget (weights plus KV cache plus a few GB of overhead) and pick the smallest GPU configuration where your model and your peak traffic both fit with margin. FP8 is the recommended inference default. Why: it halves the weight footprint with negligible accuracy loss on most models. When not: almost never for inference on Hopper or newer. What to validate: that your serving runtime actually has FP8 kernels for your model. Reach for NVFP4 on Blackwell when you need the extra headroom or throughput and you have run the eval to prove the accuracy holds. Skip 4-bit entirely if you cannot afford the validation effort or the model is accuracy-critical.

Next up in the series we leave the chip and look at how these GPUs are assembled into systems: DGX, HGX, MGX and the NVL72 racks. If you are sizing a deployment this week, run the KV budget snippet against your real model and post your numbers in the comments. I will tell you whether your GPU choice holds.




About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
