- Two jobs decide the whole choice. Training is bound by FLOPS and interconnect; inference is bound by memory capacity and bandwidth. Buy for the job, not the headline FLOPS on the slide.
- Hopper is not dead. H100 (80 GB HBM3, 3.35 TB/s) is still the broad workhorse; H200 (141 GB HBM3e, 4.8 TB/s) is the cheaper way to fit and serve a big model today.
- Blackwell is the current build target. B200 (180 GB HBM3e, about 8 TB/s) and B300 / Blackwell Ultra (288 GB HBM3e, 8 TB/s, 15 PFLOPS dense FP4) add FP4 and FP6, which Hopper does not have.
- The rack is the real product. GB200 NVL72 ties 72 Blackwell GPUs over NVLink 5 into one memory domain; GB300 NVL72 pushes roughly 1.1 ExaFLOPS of FP4 per rack.
- Rubin is the next jump: 288 GB HBM4, about 22 TB/s, NVLink 6, full production announced at GTC in 2026 with partner systems landing H2 2026. Plan for it, do not stall a build waiting on it.
Pick a GPU by the biggest FLOPS number on the keynote slide and you will buy compute you cannot feed. I have watched teams order a rack of the fastest chip available, then serve a 70B model at a fraction of the throughput they paid for, because the model was waiting on memory bandwidth the whole time and the tensor cores sat idle. The data-center lineup looks like a simple ladder, H100 to H200 to B200 to B300 to Rubin, but the right rung depends on what the silicon is being asked to do. This part lays out the current lineup as of 2026, what actually changed generation to generation, and how to choose for training versus inference without overspending on either.
Two jobs, two different bottlenecks
Before any model number matters, separate the two jobs a data-center GPU does, because they stress different parts of the chip. Training a model pushes raw math throughput and the interconnect between GPUs: you are running huge batches forward and backward, syncing gradients across dozens or hundreds of devices, and the limiter is FLOPS plus how fast GPUs talk to each other. Inference is a different animal. Once a model is trained, serving it is dominated by how much memory you have (the weights plus the KV cache for every concurrent request have to live on the GPU) and how fast you can stream that memory (token generation is memory-bandwidth bound, not compute bound). The same chip is good at both, but the spec that saves you is different in each case.
This is why a memory upgrade with the same compute can matter more than a faster core. The H200 is the same Hopper architecture as the H100 with more and faster memory, and it serves large language models meaningfully faster purely because it is less starved. The memory wall is the single most useful lens for reading this lineup.
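A rough way to put a number on that memory wall is sketched below. It assumes single-stream decode where every generated token streams the full weight set out of HBM once, and it ignores KV-cache reads, batching and kernel overhead, so treat the result as a ceiling and not a prediction. The function name and the 70 GB weight figure (a 70B model at one byte per parameter) are illustrative assumptions.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode rate: one full pass over the weights per token."""
    bytes_per_s = bandwidth_tb_s * 1e12
    weight_bytes = weight_gb * 1e9
    return bytes_per_s / weight_bytes

WEIGHT_GB = 70.0  # assumption: 70B parameters at FP8, one byte per parameter

h100 = decode_ceiling_tokens_per_s(3.35, WEIGHT_GB)  # H100 SXM bandwidth
h200 = decode_ceiling_tokens_per_s(4.8, WEIGHT_GB)   # H200 bandwidth

print(f"H100 ceiling: {h100:.1f} tok/s")
print(f"H200 ceiling: {h200:.1f} tok/s")
print(f"ratio: {h200 / h100:.2f}x")
```

Under these assumptions the ceiling scales with bandwidth alone; FLOPS never enters the formula, which is the point.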
Hopper: still the workhorse
H100 vs H200
Hopper is the generation most production estates are actually running today, and it is not going away because Blackwell exists. The H100 SXM ships with 80 GB of HBM3 at 3.35 TB/s and a Transformer Engine that supports FP8. It trains and serves almost anything, and the broad install base means tooling, drivers and known-good configurations are mature. The H200 keeps the identical Hopper compute but swaps in 141 GB of HBM3e at 4.8 TB/s. That is roughly 76 percent more capacity and about 43 percent more bandwidth for the same FLOPS.
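The two percentages fall straight out of the spec-sheet numbers. A minimal check, with `pct_gain` as a made-up helper name:

```python
def pct_gain(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`."""
    return (new / old - 1.0) * 100.0

capacity_gain = pct_gain(141, 80)     # GB: H200 vs H100
bandwidth_gain = pct_gain(4.8, 3.35)  # TB/s: H200 vs H100

print(f"capacity: +{capacity_gain:.0f}%  bandwidth: +{bandwidth_gain:.0f}%")
```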
For inference that gap is the whole point. More memory means a bigger model, a longer context or more concurrent requests fit on one card; more bandwidth means tokens come out faster. Reported gains land around 1.4 to 1.9 times the H100 throughput on large models, with no change to the training math. If your problem is that a model barely fits or barely serves on H100, H200 is a cheaper fix than jumping a whole architecture. What neither Hopper card has is FP4 or FP6, and that is the line Blackwell crosses.
Blackwell: B200, B300 and the FP4 step
Blackwell is the current architecture you build new on. The B200 carries 180 GB of HBM3e at roughly 8 TB/s, about 2.4 times the H100 bandwidth, and a second-generation Transformer Engine that adds FP6 and FP4. FP4 is the headline: dropping to four-bit precision for inference, where the math tolerates it, lets a Blackwell card push far more tokens per second and pack a quantized model into far less memory than FP8 or FP16 would need. NVIDIA positions Blackwell at up to about 4 times H100 training and up to 15 times H100 inference on large models. Those are best-case marketing figures, but the FP4 path is real and Hopper simply cannot run it.
B300 / Blackwell Ultra
The B300, branded Blackwell Ultra, pushes memory to 288 GB of HBM3e (12-high stacks) at 8 TB/s and delivers about 15 PFLOPS of dense FP4, at a 1,400 W TDP. The extra capacity is aimed squarely at reasoning models and long-context inference, where the KV cache balloons and 180 GB starts to bind. The cost is power and heat: at 1,400 W a card, air cooling is off the table for dense configurations, which is why the high end ships as liquid-cooled rack systems rather than loose PCIe cards.
Gotcha
FP4 is not free accuracy. It is a quantization format, and pushing a model to four bits can move quality on tasks that are sensitive to it. The throughput and memory wins are real, but you validate output quality on your own evals before you bank the FP4 numbers in a capacity plan. A chip that runs FP4 is not the same as your model running correctly at FP4.
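To see why four bits can move quality, here is a toy fake-quantization sketch. It rounds a block of values to the eight magnitudes a 4-bit E2M1 float can represent, using one shared scale per block. This is a simplification I am assuming for illustration, not NVIDIA's implementation: real formats use small fixed block sizes and store the scale in reduced precision, and the weight values here are made up.

```python
# Magnitudes representable in 4-bit E2M1 floating point (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quant_fp4(block: list[float]) -> list[float]:
    """Round a block to the FP4 grid with one shared scale, then dequantize."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / FP4_E2M1[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in block:
        target = abs(v) / scale
        nearest = min(FP4_E2M1, key=lambda g: abs(g - target))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out

weights = [0.02, -0.45, 0.31, 1.20, -0.07, 0.66, -0.93, 0.11]  # made-up values
dequant = fake_quant_fp4(weights)
max_abs_err = max(abs(a - b) for a, b in zip(weights, dequant))

print([round(v, 3) for v in dequant])
print(f"max abs error: {max_abs_err:.3f}")
```

Small values collapse to zero and the grid gets coarse near the top of the range. Whether that matters is a property of your model and your task, which is why the eval comes before the capacity plan.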
The lineup at a glance
| GPU | Arch | Memory | Bandwidth | Low precision | Best at |
|---|---|---|---|---|---|
| H100 SXM | Hopper | 80 GB HBM3 | 3.35 TB/s | FP8 | General training + inference |
| H200 SXM | Hopper | 141 GB HBM3e | 4.8 TB/s | FP8 | Memory-bound inference |
| B200 | Blackwell | 180 GB HBM3e | ~8 TB/s | FP4 / FP6 / FP8 | High-throughput training + inference |
| B300 (Ultra) | Blackwell Ultra | 288 GB HBM3e | 8 TB/s | FP4 (15 PFLOPS dense) | Reasoning + long-context inference |
| Rubin | Rubin | 288 GB HBM4 | ~22 TB/s | NVFP4 | Next-gen, H2 2026 |
Figures are SXM / module-class parts from NVIDIA and reported specs as of 2026. Rubin per-GPU compute figures are still being confirmed and are marked [VERIFY] in the text below.
The rack is the real product: GB200 and GB300 NVL72
At the top of the stack you do not buy a GPU, you buy a rack. The GB200 NVL72 connects 72 Blackwell GPUs and 36 Grace CPUs over fifth-generation NVLink at 1.8 TB/s per GPU, so the whole rack behaves like one large memory domain rather than 72 separate cards behind a slower network. That matters because the biggest models and the largest training runs need GPUs to share state at NVLink speed, not Ethernet speed. GB300 NVL72 is the Blackwell Ultra version of the same idea, delivering roughly 1.1 ExaFLOPS of FP4 per rack, about 1.5 times the GB200 rack.
The reason this is a system and not a card is physics. Seventy-two GPUs at Blackwell power draw means a liquid-cooled rack pulling well over a hundred kilowatts, with a copper NVLink spine carrying the all-to-all traffic. You cannot approximate it by buying loose GPUs and wiring them together; the interconnect is the value. If your workload is a single very large model or a frontier-scale training job, the NVL72 is the unit of capacity. If your workload is many independent smaller models, you are usually better served by individual H200 or B200 nodes, and the rack is overkill.
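The rack-level figures are just the per-GPU numbers multiplied by 72, and it is worth doing the multiplication once. The sketch below uses the B300 figures quoted earlier (15 PFLOPS dense FP4 and 1,400 W per GPU) and the NVLink 5 rate of 1.8 TB/s per GPU. The power line counts GPUs only, which is my simplification; Grace CPUs, switches and cooling come on top.

```python
GPUS_PER_RACK = 72

def rack_total(per_gpu: float, gpus: int = GPUS_PER_RACK) -> float:
    """Sum a per-GPU figure across the rack."""
    return per_gpu * gpus

fp4_exaflops = rack_total(15) / 1000     # 15 PFLOPS dense FP4 per B300
nvlink_tb_s = rack_total(1.8)            # 1.8 TB/s NVLink 5 per GPU
gpu_power_kw = rack_total(1400) / 1000   # 1,400 W per B300, GPUs only

print(f"GB300 NVL72 FP4: {fp4_exaflops:.2f} ExaFLOPS")
print(f"NVLink aggregate: {nvlink_tb_s:.1f} TB/s")
print(f"GPU power alone: {gpu_power_kw:.1f} kW")
```

The GPUs alone land at about 100 kW, so "well over a hundred kilowatts" for the full rack is not an exaggeration.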
Rubin: what is coming in H2 2026
Rubin, paired with the Vera CPU as the Vera Rubin platform, is the next architecture after Blackwell. NVIDIA announced full production at GTC in 2026, with partner systems arriving in the second half of the year. The per-GPU memory stays at 288 GB but moves to HBM4, and the bandwidth jumps to about 22 TB/s, which is the part that matters: that is roughly 4.6 times the H200 and far past Blackwell, aimed directly at the bandwidth-bound nature of inference. NVLink moves to its sixth generation at 3.6 TB/s. Reported per-GPU figures of around 50 PFLOPS NVFP4 inference and 35 PFLOPS NVFP4 training, and a transistor count near 336 billion, are circulating but I would treat the exact per-GPU compute numbers as [VERIFY] until they are confirmed against NVIDIA datasheets. The rack-scale Vera Rubin NVL72 is quoted at about 3.6 ExaFLOPS of NVFP4 with 260 TB/s of all-to-all NVLink bandwidth.
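One sanity check needs no datasheet: quoted rack totals should divide back to the per-GPU figures. The sketch below only tests whether the numbers above agree with each other. Agreement is not confirmation, and the variable names are mine.

```python
GPUS_PER_RACK = 72

rack_nvfp4_exaflops = 3.6   # quoted Vera Rubin NVL72 figure
rack_nvlink_tb_s = 260.0    # quoted all-to-all NVLink figure

implied_pflops_per_gpu = rack_nvfp4_exaflops * 1000 / GPUS_PER_RACK
implied_nvlink_per_gpu = rack_nvlink_tb_s / GPUS_PER_RACK

hbm_vs_h200 = 22 / 4.8   # Rubin HBM4 bandwidth vs H200
hbm_vs_b200 = 22 / 8     # Rubin HBM4 bandwidth vs B200

print(f"implied per-GPU NVFP4: {implied_pflops_per_gpu:.0f} PFLOPS")
print(f"implied per-GPU NVLink: {implied_nvlink_per_gpu:.2f} TB/s")
print(f"HBM bandwidth: {hbm_vs_h200:.1f}x H200, {hbm_vs_b200:.2f}x B200")
```

The rack figure divides back to the 50 PFLOPS inference number and to roughly 3.6 TB/s of NVLink per GPU, so the reported figures are at least self-consistent.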
How to actually choose
Start from the workload, not the catalog. Ask three questions in order: is this training or inference, does one model have to span many GPUs at once, and what does the model plus its working memory actually weigh. The answers point at a tier without you ever ranking chips by FLOPS. A single large training run or a frontier model that does not fit on one node points at NVL72. Memory-bound serving of models that fit on a card points at H200 or B200 depending on whether FP4 buys you anything. Models that already run fine on what you have point at buying more of the same, not a generational jump.
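Those questions can be written down as a small decision function. This is a sketch of the reasoning above, not a sizing tool: the `Workload` fields and `suggest_tier` are hypothetical names, the 180 GB cut-off is simply the B200 capacity, and the branch for training that fits on a node is my assumption, since the text gives no rule for that case.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    is_training: bool
    spans_many_gpus: bool          # one model must be sharded beyond a single node
    footprint_gb: float            # weights plus working memory at the real precision
    fp4_validated: bool = False    # quality checked at FP4 on your own evals
    runs_fine_today: bool = False  # current hardware already meets the need

def suggest_tier(w: Workload) -> str:
    if w.runs_fine_today:
        return "more of the same"
    if w.spans_many_gpus:
        return "NVL72 rack"
    if w.is_training:
        return "B200 nodes"  # assumption: no rule is given for node-sized training
    if w.fp4_validated:
        return "B300" if w.footprint_gb > 180 else "B200"
    return "H200"

print(suggest_tier(Workload(is_training=False, spans_many_gpus=False, footprint_gb=120)))
print(suggest_tier(Workload(is_training=True, spans_many_gpus=True, footprint_gb=4000)))
```

Note that FLOPS never appears as an input. The order of the checks carries the argument: existing capacity first, then interconnect, then precision.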
Worked example
Serving a 70B-parameter model. At FP16 the weights alone are about 140 GB, which does not fit on an 80 GB H100 without splitting across two cards. On an H200 at 141 GB the weights fit on a single card, but you still need headroom for the KV cache of every concurrent request, so in practice you serve it at FP8 (roughly 70 GB of weights) and spend the rest of the 141 GB on KV cache and concurrency. Move to a B200 and FP4 quantization (roughly 35 GB of weights, quality permitting) and one card holds the model plus far more concurrent sessions, which is where the throughput multiplier comes from. The lesson: the precision you can run decides how many GPUs you need as much as the model size does, so size memory against your real serving precision and batch, not the FP16 paper weight.
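The same arithmetic as a reusable sketch. The weight sizes follow directly from bytes per parameter. The KV-cache side needs model-shape numbers the example above does not state, so I am assuming a Llama-70B-like layout (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and a 10 GB reserve for activations and runtime. Swap in your own model's values.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

def kv_gb_per_token(layers: int, kv_heads: int, head_dim: int,
                    bytes_per_value: float = 2.0) -> float:
    """KV-cache footprint of one token: a key and a value vector in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value / 1e9

def max_cached_tokens(capacity_gb: float, params_billions: float, precision: str,
                      kv_per_token_gb: float, reserve_gb: float = 10.0) -> int:
    """Tokens of KV cache that fit after weights and a fixed reserve (assumed)."""
    free = capacity_gb - weights_gb(params_billions, precision) - reserve_gb
    return max(0, int(free / kv_per_token_gb))

kv = kv_gb_per_token(layers=80, kv_heads=8, head_dim=128)  # assumed model shape

h100_fp16 = max_cached_tokens(80, 70, "fp16", kv)
h200_fp8 = max_cached_tokens(141, 70, "fp8", kv)
b200_fp4 = max_cached_tokens(180, 70, "fp4", kv)

for name, tokens in [("H100 FP16", h100_fp16), ("H200 FP8", h200_fp8), ("B200 FP4", b200_fp4)]:
    print(f"{name}: {tokens:,} tokens of KV cache")
```

The token budget is shared across all concurrent requests, so divide it by your context length to get a rough concurrency figure per card.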
The Verdict
Choose by the bottleneck, not the brochure. For most inference work in 2026, H200 is the value pick: it fits the models people actually serve and the stack is stable and available. Reach for Blackwell, B200 or B300, when FP4 genuinely changes your throughput or memory math and you have validated quality at that precision; the FP4 step is the real reason to move off Hopper, not the bigger headline FLOPS. Buy the NVL72 rack only when one model must span many GPUs at NVLink speed, because the interconnect is what you are paying for and it is wasted on many independent small models. Plan a Rubin lane for HBM4-hungry workloads landing in H2 2026, but do not stall this year's build waiting for it. The thing I would validate first, every time, is the memory math at your real serving precision: weights plus KV cache against capacity and bandwidth. Get that right and the chip almost picks itself. What model size and precision is your deployment actually going to serve, and have you done that math against the card you are about to buy?
References
- NVIDIA H200 Tensor Core GPU (NVIDIA)
- NVIDIA GB200 NVL72 (NVIDIA)
- NVIDIA Blackwell Architecture (NVIDIA)
- NVIDIA Data Center GPU Specs: A Complete Comparison Guide (IntuitionLabs)



