TL;DR
HBM capacity, not raw FLOPS, is what decides whether a model runs on a given GPU. Your memory budget is three things: model weights, the KV cache, and a few GB of overhead. Weights scale linearly with precision, so dropping from FP16 to FP8 halves the weight footprint and NVFP4 roughly quarters it.
The capacity ladder in 2026 runs H100 80 GB, H200 141 GB, B200 192 GB, B300 288 GB, and Rubin with HBM4 at 288 GB. FP8 is the safe inference default on Hopper and newer; NVFP4 on Blackwell cuts weight memory by another ~1.8x relative to FP8 with under ~1% accuracy loss on many models, but you must measure it on your model, not trust the brochure.
A 70-billion-parameter model in FP16 needs 140 GB of weights on the GPU before you serve a single token of context. That one number ends a lot of architecture debates. It is why a single H200 with 141 GB cannot host a 70B model in FP16 with any room left for a request, and why the conversation about "which GPU" is really a conversation about memory and precision. FLOPS get the headlines. Memory pays the bills.
The number that decides everything: bytes per parameter
Weight footprint is parameters times bytes per parameter. Nothing else. Precision is the dial that sets the bytes, and it moves the footprint in big linear steps. FP32 is four bytes and lives almost entirely in training reference math now. BF16 and FP16 are two bytes and are the baseline most people picture. FP8 is one byte. The 4-bit formats are half a byte plus a small scale overhead. Walk down that list and you are watching a 70B model shrink from a two-GPU problem into a single-GPU problem.
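To make the linear steps concrete, here is a minimal sketch that applies the parameters-times-bytes rule to a 70B model. The function name weight_gb and the dictionary are mine, for illustration only, and the 4-bit row deliberately leaves out the per-block scale overhead, so treat it as a floor rather than a measured footprint.

```python
# Bytes per parameter for the formats named above. The 4-bit entry ignores the
# small per-block scale overhead, so real 4-bit footprints run slightly higher.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16/FP16": 2.0,
    "FP8": 1.0,
    "FP4 (no scales)": 0.5,
}


def weight_gb(params_billions, bytes_per_param):
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billions * bytes_per_param


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>16}: {weight_gb(70, nbytes):6.1f} GB")
# FP32 280.0, BF16/FP16 140.0, FP8 70.0, FP4 (no scales) 35.0
```

Swap 70 for your own parameter count in billions; nothing else in the calculation changes.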
HBM is the real constraint, not FLOPS
Why care about the memory type at all? Because LLM inference is memory-bound for most of its life. Generating each token means streaming the entire weight set from HBM through the tensor cores. A GPU that can do enormous math but can only feed it weights at 3.35 TB/s spends most of its time waiting on memory. That is the whole reason H200 exists as a Hopper refresh: same compute as H100, but 141 GB of HBM3e at 4.8 TB/s instead of 80 GB at 3.35 TB/s. More capacity, more bandwidth, more tokens per second, no new compute silicon.
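A rough way to see the bandwidth effect is to divide memory bandwidth by the weight bytes read per token. The sketch below does that for a single sequence. It is a ceiling, not a prediction: it assumes 70 GB of FP8 weights as the example, ignores KV cache reads and kernel overhead, and the function name is hypothetical.

```python
def decode_ceiling_tokens_per_s(weight_gb, bandwidth_tb_s):
    """Upper bound on single-sequence decode speed when every token
    requires reading all weights from HBM once."""
    weight_bytes = weight_gb * 1e9
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# Bandwidth figures are the ones quoted in the text; 70 GB is an assumed
# FP8 weight footprint for a 70B model.
for gpu, bandwidth in [("H100", 3.35), ("H200", 4.8)]:
    print(gpu, round(decode_ceiling_tokens_per_s(70, bandwidth), 1))
# H100 47.9
# H200 68.6
```

Real throughput lands below these numbers, but the ratio between the two GPUs is the part that tends to carry over.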
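The 2026 capacity ladder
Per-GPU HBM, using the figures quoted elsewhere in this article:
H100: 80 GB at 3.35 TB/s
H200: 141 GB of HBM3e at 4.8 TB/s
B200: 192 GB
B300: 288 GB
Rubin: around 288 GB of HBM4, quoted at roughly 13 TB/s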
What actually fits: the three-part memory budget
Every byte of GPU memory at inference time goes to one of three places. Get this model in your head and you can size any deployment on the back of a napkin.
Weights are fixed: parameters times bytes, as above. KV cache grows with how much context you hold live, across every concurrent request. Overhead is activations plus the runtime, usually a few GB. Weights you can shrink with precision. The KV cache is the part that quietly eats a GPU when traffic climbs, because it scales with sequence length times batch size.
Worked example
Llama-3-70B has 80 layers, 8 KV heads (grouped-query attention) and a head dimension of 128. KV cache per token in FP16 is 2 (K and V) × 80 × 8 × 128 × 2 bytes = 327,680 bytes, about 320 KB per token.
At an 8K context that is ~2.5 GB for a single sequence. Serve 32 concurrent 8K sequences and the KV cache alone is ~80 GB. Add 70 GB of FP8 weights and you are at ~150 GB, over a single H200. There are three ways out: fewer concurrent long contexts, KV-cache quantization to FP8 (which roughly halves that 80 GB), or a 192 GB B200.
This is why the sizing question is never "does the model fit" but "does the model plus the traffic I expect fit". See the deeper TCO math in the Private AI sizing and cost guide.
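The three-part budget is simple enough to write down as a sum. This sketch uses the worked-example numbers (70 GB of FP8 weights, 80 GB of FP16 KV cache or 40 GB in FP8). The 4 GB overhead is my placeholder for "a few GB", not a measured value, and the function names are illustrative.

```python
def total_budget_gb(weights_gb, kv_cache_gb, overhead_gb=4.0):
    """Three-part memory budget: weights + KV cache + runtime overhead.

    overhead_gb defaults to an assumed 4 GB; measure your own runtime.
    """
    return weights_gb + kv_cache_gb + overhead_gb


def fits(required_gb, hbm_gb):
    """True if the budget is within the stated HBM capacity (no headroom)."""
    return required_gb <= hbm_gb


fp16_kv = total_budget_gb(70, 80)  # FP8 weights, FP16 KV cache
fp8_kv = total_budget_gb(70, 40)   # FP8 weights, FP8 KV cache

print(fp16_kv, fits(fp16_kv, 141))  # 154.0 False (over a single H200)
print(fp8_kv, fits(fp8_kv, 141))    # 114.0 True
print(fits(fp16_kv, 192))           # True (a B200 holds it)
```

This check compares against the full stated capacity. In practice you want a safety margin on top, which shrinks the usable number further.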
FP8 and FP4: precision is a capacity lever, with a catch
Hopper introduced FP8 and Blackwell added native FP4 through its second-generation Transformer Engine. The Blackwell FP4 tensor cores deliver roughly double the throughput of FP8 on the same silicon, and the format halves the weight bytes again. That is two wins at once: more memory headroom and more math per second. The catch is accuracy. Four bits is a coarse representation, and how you scale those bits decides whether the model holds up.
NVFP4 vs MXFP4
Both formats store each weight as a 4-bit E2M1 value and share a scale across a small block, so outliers do not wreck the whole tensor. The difference is granularity. NVFP4 uses 16-element blocks with a higher-precision FP8 (E4M3) scale. MXFP4, the open OCP standard, uses 32-element blocks with a power-of-two E8M0 scale. Smaller blocks and a finer scale mean NVFP4 tracks outliers better and generally lands lower perplexity for the same model and calibration set.
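The block size and scale width also set the real storage cost per weight. As a sketch: each block carries one 8-bit scale (E4M3 and E8M0 are both 8 bits wide), so the overhead is 8 bits divided by the block size. I am ignoring any per-tensor scale here, on the assumption that it is negligible at this size. The function names are hypothetical.

```python
def effective_bits_per_weight(value_bits, block_size, scale_bits):
    """Stored bits per weight once the shared block scale is amortized."""
    return value_bits + scale_bits / block_size


def weight_gb(params_billions, bits_per_weight):
    """Weight footprint in GB for a given effective bit width."""
    return params_billions * bits_per_weight / 8


nvfp4_bits = effective_bits_per_weight(4, 16, 8)  # 16-element blocks, FP8 scale
mxfp4_bits = effective_bits_per_weight(4, 32, 8)  # 32-element blocks, E8M0 scale

print(nvfp4_bits, round(weight_gb(70, nvfp4_bits), 1))  # 4.5 39.4
print(mxfp4_bits, round(weight_gb(70, mxfp4_bits), 1))  # 4.25 37.2
```

So NVFP4 pays about a quarter of a bit per weight more than MXFP4 for its finer scaling, roughly 2 GB on a 70B model.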
When not to cut precision
Lower precision is not free, and there are deployments where I leave it alone. The first is training and fine-tuning: you quantize for inference, but the master weights and optimizer states stay in BF16 or FP32, because gradient math at four bits diverges in ways that are expensive to debug. The second is small models. A 7B model in FP16 is 14 GB and fits almost any card, so the accuracy risk of quantizing buys you nothing you needed. The third is anything where a wrong answer is costly: medical coding, legal extraction, financial reconciliation. For those I would rather pay for a second GPU than explain a regression that a four-bit rounding error introduced.
There is also a latency angle people miss. At very low batch sizes the bottleneck is loading weights, so cutting weight bytes with FP8 or NVFP4 genuinely shortens each token. At high batch sizes you become compute-bound, and it is the FP4 tensor-core throughput that helps, not the smaller weights. Knowing which regime you sit in tells you whether precision is buying you capacity, speed, or both. If you cannot say which, measure before you commit, because the answer changes your GPU count.
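If you want a first guess at which regime you are in before measuring, a simple roofline-style estimate helps. The sketch assumes each decode step reads every weight once and spends about 2 FLOPs per parameter per sequence, and it ignores KV cache traffic. The peak FLOPS value below is a round placeholder, not a spec for any GPU, so substitute the figure for your hardware and precision.

```python
def crossover_batch(peak_flops, bandwidth_bytes_per_s, bytes_per_param):
    """Batch size at which weight-streaming time equals matmul time.

    Per decode step, for a model with P parameters:
      memory time  ~ P * bytes_per_param / bandwidth
      compute time ~ 2 * P * batch / peak_flops
    Setting the two equal, P cancels out.
    """
    return peak_flops * bytes_per_param / (2 * bandwidth_bytes_per_s)


def regime(batch, peak_flops, bandwidth_bytes_per_s, bytes_per_param):
    """Classify a batch size as memory-bound or compute-bound."""
    limit = crossover_batch(peak_flops, bandwidth_bytes_per_s, bytes_per_param)
    return "memory-bound" if batch < limit else "compute-bound"


PEAK_FLOPS = 1.0e15   # placeholder: 1 PFLOP/s, NOT a real spec value
BANDWIDTH = 3.35e12   # 3.35 TB/s, the H100 figure quoted in the text

print(round(crossover_batch(PEAK_FLOPS, BANDWIDTH, 1.0)))  # 149
print(regime(8, PEAK_FLOPS, BANDWIDTH, 1.0))               # memory-bound
print(regime(512, PEAK_FLOPS, BANDWIDTH, 1.0))             # compute-bound
```

Treat the crossover as an order-of-magnitude guide. Long contexts add KV cache reads that push the real boundary around, which is one more reason to measure.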
Measure it: a KV budget you can run
Before you commit to a GPU count, put real numbers behind the KV cache. This snippet is the same math from the worked example, parameterized. Run it for your model and your expected concurrency, and check it against free memory with nvidia-smi.
# kv_budget.py: estimate KV cache footprint on one GPU
def kv_cache_gb(layers, kv_heads, head_dim, bytes_per_elem, seq_len, batch):
    """Return the KV cache size in GB for `batch` sequences of `seq_len` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch / 1024**3

# Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 KV (2 bytes)
print(round(kv_cache_gb(80, 8, 128, 2, 8192, 32), 1))  # -> 80.0 (GB)

# Same traffic with FP8 KV cache (1 byte) halves it
print(round(kv_cache_gb(80, 8, 128, 1, 8192, 32), 1))  # -> 40.0 (GB)

# Then, in a shell: confirm real free HBM before you load the model
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# illustrative output on an idle H200 (exact MiB values vary by driver):
# memory.total [MiB], memory.used [MiB], memory.free [MiB]
# 144384 MiB, 1 MiB, 144383 MiB
Failure mode to watch: if free memory is well under total on an "idle" GPU, something else is resident (a stuck process, another container, or a leaked allocation). Loading on top of that is how you get a confusing out-of-memory error halfway through warm-up.
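You can automate that check instead of eyeballing it. The sketch below runs the same nvidia-smi query with the csv,noheader,nounits format so the output is plain integers in MiB, then flags any GPU with more memory in use than a threshold. The 1024 MiB threshold and the function names are my assumptions, and the parser will raise if a field comes back as something other than a number.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'total, used, free' rows (MiB, one line per GPU) into dicts."""
    rows = []
    for line in text.strip().splitlines():
        total, used, free = (int(field.strip()) for field in line.split(","))
        rows.append({"total_mib": total, "used_mib": used, "free_mib": free})
    return rows


def busy_gpus(rows, max_used_mib=1024):
    """Indices of GPUs using more than max_used_mib (assumed threshold)."""
    return [i for i, row in enumerate(rows) if row["used_mib"] > max_used_mib]


def query_gpus():
    """Run nvidia-smi and return parsed rows; needs the NVIDIA driver installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

On a GPU host, busy_gpus(query_gpus()) should return an empty list before you load the model. Anything else is worth chasing down first.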
Leave headroom, never pack to the line
One mistake I see often: someone computes weights plus KV cache, sees it lands at 138 GB on a 141 GB H200, and calls it a fit. It is not. The serving runtime allocates the KV cache in blocks and reserves a pool up front, fragmentation means you rarely get the last few percent, and CUDA graphs plus the framework take their own slice. Plan to use roughly 85 to 90 percent of stated HBM and keep the rest as a buffer. The cost of being wrong is not a warning in a log; it is an out-of-memory crash under peak load, which is exactly when you cannot afford one.
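Here is that rule as a check you can drop into a sizing script. It is a sketch: the 0.9 default is the upper end of the 85 to 90 percent guideline, not a value any runtime guarantees, and the function names are illustrative.

```python
def usable_hbm_gb(stated_hbm_gb, usable_fraction=0.9):
    """HBM you should plan to use after reserving a safety buffer."""
    return stated_hbm_gb * usable_fraction


def fits_with_headroom(required_gb, stated_hbm_gb, usable_fraction=0.9):
    """True only if the budget fits inside the planned usable fraction."""
    return required_gb <= usable_hbm_gb(stated_hbm_gb, usable_fraction)


print(round(usable_hbm_gb(141), 1))   # 126.9
print(fits_with_headroom(138, 141))   # False: 138 GB is not a fit on an H200
print(fits_with_headroom(138, 192))   # True: it fits on a 192 GB B200
```

Pass usable_fraction=0.85 if your runtime fragments badly or you expect bursty traffic.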
The same logic argues for sizing on the next model up, not the one you run today. Weights grow, context windows grow, and a deployment sized to the millimetre this quarter becomes a forklift upgrade next quarter. A little capacity headroom now is far cheaper than re-architecting a serving tier in six months. If you are choosing between two GPUs and the math is close, take the one with more memory.
HBM4 and what changes in 2026
Rubin brings HBM4. The headline for 2026 is bandwidth, not capacity: per-GPU memory holds around 288 GB while bandwidth is quoted at roughly 13 TB/s, a figure to confirm against NVIDIA's final specifications, and well over double the H200's 4.8 TB/s. For memory-bound inference that bandwidth translates fairly directly into tokens per second, which is why HBM4 matters even when the capacity number looks flat. The other reality is supply. HBM is the bottleneck of the whole industry right now, and capacity is allocated quarters in advance. If your roadmap assumes Rubin in volume the moment it ships, build a fallback on Blackwell.
The Verdict
Size by memory, not by FLOPS. Compute the three-part budget, weights plus KV cache plus a few GB of overhead, and pick the smallest GPU configuration where your model and your peak traffic both fit with margin. FP8 is the recommended inference default because it halves the weight footprint relative to FP16 with negligible accuracy loss. There is almost never a reason to avoid it for inference on Hopper or newer; the one thing to validate is that your serving runtime actually has FP8 kernels for your model. Reach for NVFP4 on Blackwell when you need the extra headroom or throughput and you have run the eval to prove the accuracy holds. Skip 4-bit entirely if you cannot afford the validation effort or the model is accuracy-critical.
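As a sketch of the "smallest configuration that fits with margin" rule for a single GPU, the snippet below walks the capacity ladder from the TL;DR and returns the first card whose usable memory covers the budget. The 0.9 usable fraction and the example totals (70 GB of FP8 weights, 40 or 80 GB of KV cache, an assumed 4 GB of overhead) are illustrative inputs, not recommendations.

```python
# Per-GPU HBM capacity in GB, as listed in the TL;DR. Rubin is left out
# because its stated capacity matches B300.
HBM_GB = {"H100": 80, "H200": 141, "B200": 192, "B300": 288}


def smallest_fit(required_gb, usable_fraction=0.9):
    """Return the smallest GPU whose usable HBM covers required_gb, else None."""
    for name, capacity in sorted(HBM_GB.items(), key=lambda item: item[1]):
        if required_gb <= capacity * usable_fraction:
            return name
    return None  # no single GPU fits; plan for multi-GPU serving


print(smallest_fit(114))  # H200 (FP8 weights + FP8 KV cache + overhead)
print(smallest_fit(154))  # B200 (FP8 weights + FP16 KV cache + overhead)
print(smallest_fit(300))  # None
```

A None result is the signal to look at tensor parallelism or a lower precision, not to shave the margin.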
Next up in the series we leave the chip and look at how these GPUs are assembled into systems: DGX, HGX, MGX and the NVL72 racks. If you are sizing a deployment this week, run the KV budget snippet against your real model and post your numbers in the comments. I will tell you whether your GPU choice holds.
References
NVIDIA H200 product page
NVIDIA Blackwell architecture and the second-generation Transformer Engine
NVIDIA Technical Blog: Introducing NVFP4 for efficient low-precision inference
NVIDIA Technical Blog: Mastering LLM inference, KV cache math



