Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


The NVIDIA Data-Center GPU Lineup: Hopper vs Blackwell vs Rubin (NVIDIA AI Series, Part 3)

The NVIDIA data-center GPU lineup from Hopper to Blackwell to Rubin, compared for training and inference: memory, bandwidth, FP4 and rack-scale NVL72, with a clear way to choose.

NVIDIA AI Series · Part 3 of 30
TL;DR · Key Takeaways
  • Two jobs decide the whole choice. Training is bound by FLOPS and interconnect; inference is bound by memory capacity and bandwidth. Buy for the job, not the headline FLOPS on the slide.
  • Hopper is not dead. H100 (80 GB HBM3, 3.35 TB/s) is still the broad workhorse; H200 (141 GB HBM3e, 4.8 TB/s) is the cheaper way to fit and serve a big model today.
  • Blackwell is the current build target. B200 (180 GB HBM3e, about 8 TB/s) and B300 / Blackwell Ultra (288 GB HBM3e, 8 TB/s, 15 PFLOPS dense FP4) add FP4 and FP6, which Hopper does not have.
  • The rack is the real product. GB200 NVL72 ties 72 Blackwell GPUs over NVLink 5 into one memory domain; GB300 NVL72 pushes roughly 1.1 ExaFLOPS of FP4 per rack.
  • Rubin is the next jump: 288 GB HBM4, about 22 TB/s, NVLink 6, full production announced at GTC in 2026 with partner systems landing H2 2026. Plan for it, do not stall a build waiting on it.
Who this is for: architects and platform engineers choosing GPUs for an AI build, and the capacity or finance partners who have to sign off on the bill.
Prerequisites: a rough idea of your workload mix (training, fine-tuning, inference) and the model sizes you intend to run. Part 1 maps the stack these GPUs sit under; Part 2 covers what the AI Enterprise license on top of them costs.

Pick a GPU by the biggest FLOPS number on the keynote slide and you will buy compute you cannot feed. I have watched teams order a rack of the fastest chip available, then serve a 70B model at a fraction of the throughput they paid for, because the model was waiting on memory bandwidth the whole time and the tensor cores sat idle. The data-center lineup looks like a simple ladder, H100 to H200 to B200 to B300 to Rubin, but the right rung depends on what the silicon is being asked to do. This part lays out the current lineup as of 2026, what actually changed generation to generation, and how to choose for training versus inference without overspending on either.

Two jobs, two different bottlenecks

Before any model number matters, separate the two jobs a data-center GPU does, because they stress different parts of the chip. Training a model pushes raw math throughput and the interconnect between GPUs: you are running huge batches forward and backward, syncing gradients across dozens or hundreds of devices, and the limiter is FLOPS plus how fast GPUs talk to each other. Inference is a different animal. Once a model is trained, serving it is dominated by how much memory you have (the weights plus the KV cache for every concurrent request have to live on the GPU) and how fast you can stream that memory (token generation is memory-bandwidth bound, not compute bound). The same chip is good at both, but the spec that saves you is different in each case.

This is why a memory upgrade with the same compute can matter more than a faster core. The H200 is the same Hopper architecture as the H100 with more and faster memory, and it serves large language models meaningfully faster purely because it is less starved. The memory wall is the single most useful lens for reading this lineup.

[Figure: What each job actually stresses. Buy for the bottleneck, not the slide. Training is bound by FLOPS and interconnect; it wants dense compute, NVLink and a fast scale-out fabric; the spec that helps is FP8/FP4 PFLOPS. Inference is bound by memory and bandwidth; it wants capacity for weights plus KV cache and high HBM bandwidth; the spec that helps is GB and TB/s.]
Training chases compute and interconnect; inference chases memory and bandwidth. The headline FLOPS number only tells half the story.
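
A rough way to see the memory wall in numbers is a bandwidth ceiling on token generation. The sketch below assumes a deliberately simple model: a single request, where every generated token streams the full set of weights from HBM once. It ignores KV cache reads, batching, and multi-GPU splits, so it gives an upper bound and not a benchmark. The function name and the 70 GB weight figure are illustrative assumptions.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s, weights_gb):
    """Upper bound on single-stream decode rate.

    Assumes each generated token reads all model weights from HBM once,
    so the rate cannot exceed bandwidth divided by weight size.
    """
    bytes_per_s = bandwidth_tb_s * 1e12
    bytes_per_token = weights_gb * 1e9
    return bytes_per_s / bytes_per_token


# Hypothetical example: a 70B-parameter model held at FP8 (about 70 GB of weights).
WEIGHTS_GB = 70

for name, bandwidth in (("H100", 3.35), ("H200", 4.8)):
    ceiling = decode_ceiling_tokens_per_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s per stream")
```

Under these assumptions the ceiling scales with bandwidth alone, so the H200's advantage here comes entirely from its memory system and not from its tensor cores.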

Hopper: still the workhorse

H100 vs H200

Hopper is the generation most production estates are actually running today, and it is not going away because Blackwell exists. The H100 SXM ships with 80 GB of HBM3 at 3.35 TB/s and a Transformer Engine that supports FP8. It trains and serves almost anything, and the broad install base means tooling, drivers and known-good configurations are mature. The H200 keeps the identical Hopper compute but swaps in 141 GB of HBM3e at 4.8 TB/s. That is roughly 76 percent more capacity and about 43 percent more bandwidth for the same FLOPS.
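
The two percentages above are plain ratios of the published specs. A minimal sketch of the arithmetic, with a hypothetical helper name, in case you want to rerun it against other parts:

```python
def percent_uplift(new, old):
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


capacity_uplift = percent_uplift(141, 80)      # H200 vs H100 capacity, GB
bandwidth_uplift = percent_uplift(4.8, 3.35)   # H200 vs H100 bandwidth, TB/s

print(f"capacity: +{capacity_uplift:.0f}%  bandwidth: +{bandwidth_uplift:.0f}%")
# prints "capacity: +76%  bandwidth: +43%"
```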

For inference that gap is the whole point. More memory means a bigger model or a longer context or more concurrent requests fit on one card; more bandwidth means tokens come out faster. Reported gains land around 1.4 to 1.9 times the H100 throughput on large models, with no change to the training math. If your problem is that a model barely fits or barely serves on H100, H200 is the cheaper fix than jumping a whole architecture. What neither Hopper card has is FP4 or FP6, and that is the line Blackwell crosses.

My take: for a lot of inference shops the honest 2026 move is not the newest chip, it is H200 at the right price. It fits the models people actually serve, the stack is boringly stable, and the supply situation is easier than the front of the queue for Blackwell and Rubin. I reach for Blackwell when FP4 or rack-scale memory genuinely changes what is possible, not by default.

Blackwell: B200, B300 and the FP4 step

Blackwell is the current architecture you build new on. The B200 carries 180 GB of HBM3e at roughly 8 TB/s, about 2.4 times the H100 bandwidth, and a second-generation Transformer Engine that adds FP6 and FP4. FP4 is the headline: dropping to four-bit precision for inference, where the math tolerates it, lets a Blackwell card push far more tokens per second and pack a quantized model into far less memory than FP8 or FP16 would need. NVIDIA positions Blackwell at up to about 4 times H100 training and up to 15 times H100 inference on large models, and while those are best-case marketing figures, the FP4 path is real and Hopper simply cannot run it.

B300 / Blackwell Ultra

The B300, branded Blackwell Ultra, pushes memory to 288 GB of HBM3e (12-high stacks) at 8 TB/s and delivers about 15 PFLOPS of dense FP4, at a 1,400 W TDP. The extra capacity is aimed squarely at reasoning models and long-context inference, where the KV cache balloons and 180 GB starts to bind. The cost is power and heat: at 1,400 W a card, air cooling is off the table for dense configurations, which is why the high end ships as liquid-cooled rack systems rather than loose PCIe cards.

Gotcha

FP4 is not free accuracy. It is a quantization format, and pushing a model to four bits can move quality on tasks that are sensitive to it. The throughput and memory wins are real, but you validate output quality on your own evals before you bank the FP4 numbers in a capacity plan. A chip that runs FP4 is not the same as your model running correctly at FP4.
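
To make the gotcha concrete, here is a toy sketch of where the error comes from. It rounds a block of values to the E2M1 grid (one sign bit, two exponent bits, one mantissa bit), which leaves only eight magnitudes, and it uses one absmax scale per block. The function names and the sample block are hypothetical. This is a simplified illustration only; it is not NVIDIA's NVFP4 implementation or any Transformer Engine API.

```python
# Magnitudes representable by a 4-bit E2M1 float.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(values):
    """Round a block of floats to the E2M1 grid with one absmax scale per block."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = absmax / E2M1_MAGNITUDES[-1]
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append((nearest if v >= 0 else -nearest) * scale)
    return quantized


def max_abs_error(original, quantized):
    """Largest per-element difference introduced by quantization."""
    return max(abs(a - b) for a, b in zip(original, quantized))


block = [6.0, 3.0, 1.0, 0.7, -2.4]   # hypothetical weight block
rounded = quantize_block_fp4(block)
print(rounded, max_abs_error(block, rounded))
```

Values that already sit on the grid survive untouched, while 0.7 and -2.4 snap to their nearest neighbours. Whether that kind of rounding matters for your model is exactly what your own evals have to answer.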

[Figure: Memory per GPU, generation to generation. Bar length is HBM capacity (GB); labels show bandwidth. H100: 80 GB HBM3, 3.35 TB/s. H200: 141 GB HBM3e, 4.8 TB/s. B200: 180 GB HBM3e, ~8 TB/s. B300: 288 GB HBM3e, 8 TB/s. Rubin (dashed: announced, partner systems H2 2026): 288 GB HBM4, ~22 TB/s.]
Capacity climbs then plateaus at 288 GB; Rubin's real move is the bandwidth leap to HBM4, which is what inference is starved for.

The lineup at a glance

| GPU | Arch | Memory | Bandwidth | Low precision | Best at |
|---|---|---|---|---|---|
| H100 SXM | Hopper | 80 GB HBM3 | 3.35 TB/s | FP8 | General training + inference |
| H200 SXM | Hopper | 141 GB HBM3e | 4.8 TB/s | FP8 | Memory-bound inference |
| B200 | Blackwell | 180 GB HBM3e | ~8 TB/s | FP4 / FP6 / FP8 | High-throughput training + inference |
| B300 (Ultra) | Blackwell Ultra | 288 GB HBM3e | 8 TB/s | FP4 (15 PFLOPS dense) | Reasoning + long-context inference |
| Rubin | Rubin | 288 GB HBM4 | ~22 TB/s | NVFP4 | Next-gen, H2 2026 |

Figures are SXM / module-class parts from NVIDIA and reported specs as of 2026. Rubin per-GPU compute figures are still being confirmed and are marked [VERIFY] in the text below.

The rack is the real product: GB200 and GB300 NVL72

At the top of the stack you do not buy a GPU, you buy a rack. The GB200 NVL72 connects 72 Blackwell GPUs and 36 Grace CPUs over fifth-generation NVLink at 1.8 TB/s per GPU, so the whole rack behaves like one large memory domain rather than 72 separate cards behind a slower network. That matters because the biggest models and the largest training runs need GPUs to share state at NVLink speed, not Ethernet speed. GB300 NVL72 is the Blackwell Ultra version of the same idea, delivering roughly 1.1 ExaFLOPS of FP4 per rack, about 1.5 times the GB200 rack.
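
The rack figure is close to what you get by multiplying the per-GPU numbers quoted in this article by 72. A minimal sketch of that roll-up follows, using the B300 figures from earlier in this part. The helper name is hypothetical, and the totals are simple products; they are not datasheet values.

```python
def rack_totals(gpus, fp4_pflops_per_gpu, hbm_gb_per_gpu, nvlink_tb_s_per_gpu):
    """Roll per-GPU specs up to rack level by simple multiplication."""
    return {
        "fp4_exaflops": gpus * fp4_pflops_per_gpu / 1000.0,
        "hbm_tb": gpus * hbm_gb_per_gpu / 1000.0,
        "nvlink_tb_s": gpus * nvlink_tb_s_per_gpu,
    }


# GB300 NVL72, using the per-GPU figures quoted in this article.
gb300 = rack_totals(
    gpus=72,
    fp4_pflops_per_gpu=15,      # dense FP4
    hbm_gb_per_gpu=288,
    nvlink_tb_s_per_gpu=1.8,    # NVLink 5
)

for key, value in gb300.items():
    print(f"{key}: {value:.2f}")
```

The compute product comes out just under 1.1 ExaFLOPS, in line with the rack quote, and the same multiplication shows why the rack is read as one memory domain: the HBM behind the NVLink spine adds up to roughly 20 TB.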

The reason this is a system and not a card is physics. Seventy-two GPUs at Blackwell power draw is a liquid-cooled rack pulling well over a hundred kilowatts, with a copper NVLink spine carrying the all-to-all traffic. You cannot approximate it by buying loose GPUs and wiring them together; the interconnect is the value. If your workload is a single very large model or a frontier-scale training job, the NVL72 is the unit of capacity. If your workload is many independent smaller models, you are usually better served by individual H200 or B200 nodes, and the rack is overkill.

[Figure: Scale-up rack vs separate nodes. The interconnect is the product. NVL72: one memory domain over an NVLink 5 spine, 72 GPUs at 1.8 TB/s per GPU, all-to-all at NVLink speed; best for one huge model or frontier training. Separate nodes: linked by Ethernet / InfiniBand (slower than NVLink); best for many smaller models.]
The NVL72 earns its cost only when GPUs must share state at NVLink speed. For many independent models, separate nodes are the cheaper answer.

Rubin: what is coming in H2 2026

Rubin, paired with the Vera CPU as the Vera Rubin platform, is the next architecture after Blackwell. NVIDIA announced full production at GTC in 2026, with partner systems arriving in the second half of the year. The per-GPU memory stays at 288 GB but moves to HBM4, and the bandwidth jumps to about 22 TB/s, which is the part that matters: that is roughly 4.6 times the H200 and far past Blackwell, aimed directly at the bandwidth-bound nature of inference. NVLink moves to its sixth generation at 3.6 TB/s per GPU. Reported per-GPU figures of around 50 PFLOPS NVFP4 inference and 35 PFLOPS NVFP4 training, and a transistor count near 336 billion, are circulating, but I would treat the exact per-GPU compute numbers as [VERIFY] until they are confirmed against NVIDIA datasheets. The rack-scale Vera Rubin NVL72 is quoted at about 3.6 ExaFLOPS of NVFP4 with 260 TB/s of all-to-all NVLink bandwidth.
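
Until datasheets settle the per-GPU numbers, one cheap check is whether the circulating per-GPU figures are at least consistent with the rack-level quote. The sketch below does only that. It assumes 72 GPUs per rack and a 5 percent tolerance, both of which are my choices, and the helper names are hypothetical. Passing the check shows internal consistency and nothing more.

```python
GPUS_PER_RACK = 72


def implied_rack_value(per_gpu_value, gpus=GPUS_PER_RACK):
    """Rack-level figure implied by multiplying a per-GPU figure by GPU count."""
    return per_gpu_value * gpus


def within_tolerance(implied, quoted, rel_tol=0.05):
    """True if the implied figure is within rel_tol of the quoted figure."""
    return abs(implied - quoted) <= rel_tol * quoted


implied_pflops = implied_rack_value(50)    # reported NVFP4 inference PFLOPS per GPU
implied_nvlink = implied_rack_value(3.6)   # NVLink 6 TB/s per GPU

print(within_tolerance(implied_pflops, 3600))  # rack quote: 3.6 ExaFLOPS = 3600 PFLOPS
print(within_tolerance(implied_nvlink, 260))   # rack quote: 260 TB/s
print(round(22 / 4.8, 1))                      # Rubin vs H200 bandwidth ratio
```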

In practice: a Rubin announcement does not mean you pause a Blackwell build. New architectures ship into constrained supply first, command a premium, and need new drivers, new cooling and sometimes new power. The teams that win plan a Rubin lane for the workloads that need HBM4 bandwidth, and keep buying Blackwell and Hopper for everything shipping this year. Treat the roadmap as a planning input, not a reason to stall.

How to actually choose

Start from the workload, not the catalog. Ask three questions in order: is this training or inference, does one model have to span many GPUs at once, and what does the model plus its working memory actually weigh. The answers point at a tier without you ever ranking chips by FLOPS. A single large training run or a frontier model that does not fit on one node points at NVL72. Memory-bound serving of models that fit on a card points at H200 or B200 depending on whether FP4 buys you anything. Models that already run fine on what you have point at buying more of the same, not a generational jump.

[Figure: Which GPU tier? Workload first, chip second. Must one model span many GPUs? Yes: GB200 / GB300 NVL72 (frontier training or a huge model). No, it fits one card: per-node serving. Need FP4 or maximum throughput? Yes: B200 / B300. No: H200 (fits and serves, best value), or more of the same H100 / H200.]
Three questions, not a FLOPS ranking. Most teams land on H200 or B200; the rack is for the genuine frontier case.
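
The decision flow above is small enough to write down as a function. This is a sketch of the reasoning in this section and not a sizing tool. The function name, the parameter names, and the order in which the last two checks run are my assumptions.

```python
def choose_tier(spans_many_gpus, fp4_helps, current_fleet_is_enough=False):
    """Map the workload questions in this section to a GPU tier.

    spans_many_gpus: one model must span many GPUs at NVLink speed.
    fp4_helps: FP4 changes throughput or memory math and quality is validated.
    current_fleet_is_enough: the model already runs fine on what you own.
    """
    if spans_many_gpus:
        return "GB200 / GB300 NVL72"
    if current_fleet_is_enough:
        return "more of the same (H100 / H200)"
    if fp4_helps:
        return "B200 / B300"
    return "H200"


print(choose_tier(spans_many_gpus=True, fp4_helps=False))
print(choose_tier(spans_many_gpus=False, fp4_helps=True))
print(choose_tier(spans_many_gpus=False, fp4_helps=False))
```

Note that FLOPS never appears as an input; every branch is a property of the workload.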

Worked example

Serving a 70B-parameter model. At FP16 the weights alone are about 140 GB, which does not fit on an 80 GB H100 without splitting across two cards. On an H200 at 141 GB the weights fit on a single card, but you still need headroom for the KV cache of every concurrent request, so in practice you serve it at FP8 (roughly 70 GB of weights) and spend the rest of the 141 GB on KV cache and concurrency. Move to a B200 and FP4 quantization (roughly 35 GB of weights, quality permitting) and one card holds the model plus far more concurrent sessions, which is where the throughput multiplier comes from. The lesson: the precision you can run decides how many GPUs you need as much as the model size does, so size memory against your real serving precision and batch, not the FP16 paper weight.
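
The memory math in this example can be sketched in a few lines. The weight sizes follow directly from parameter count and precision. The KV cache part needs the model's shape, which the example does not give, so the values below (80 layers, 8 KV heads, head dimension 128, KV cache kept at 2 bytes per value) are assumptions chosen to resemble a typical 70B model with grouped-query attention. Activations and framework overhead are ignored, so real budgets will be smaller.

```python
def weights_gb(params_billion, bits_per_param):
    """Weight footprint in GB for a given parameter count and precision."""
    return params_billion * bits_per_param / 8.0


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_budget(hbm_gb, params_billion, bits_per_param, layers, kv_heads, head_dim):
    """Tokens of KV cache that fit in the HBM left over after the weights."""
    free_bytes = (hbm_gb - weights_gb(params_billion, bits_per_param)) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_token(layers, kv_heads, head_dim))


# Assumed model shape (not from the article).
SHAPE = dict(layers=80, kv_heads=8, head_dim=128)

for label, hbm_gb, bits in (
    ("H200 FP16", 141, 16),
    ("H200 FP8", 141, 8),
    ("B200 FP4", 180, 4),
):
    print(label, weights_gb(70, bits), "GB weights,",
          kv_token_budget(hbm_gb, 70, bits, **SHAPE), "KV tokens")
```

The token budget is shared across all concurrent requests, so dividing it by your typical context length gives a rough concurrency ceiling per card under these assumptions.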

Running this on VMware? The GPU choice interacts with how you partition and present the card to VMs. The Private AI series GPU selection guide covers the L40S, H100, H200 and Blackwell trade-offs specifically for VMware Cloud Foundation deployments, which is the natural next step once you have picked a tier here.

The Verdict

Choose by the bottleneck, not the brochure. For most inference work in 2026, H200 is the value pick: it fits the models people actually serve and the stack is stable and available. Reach for Blackwell, B200 or B300, when FP4 genuinely changes your throughput or memory math and you have validated quality at that precision; the FP4 step is the real reason to move off Hopper, not the bigger headline FLOPS. Buy the NVL72 rack only when one model must span many GPUs at NVLink speed, because the interconnect is what you are paying for and it is wasted on many independent small models. Plan a Rubin lane for HBM4-hungry workloads landing in H2 2026, but do not stall this year's build waiting for it. The thing I would validate first, every time, is the memory math at your real serving precision: weights plus KV cache against capacity and bandwidth. Get that right and the chip almost picks itself. What model size and precision is your deployment actually going to serve, and have you done that math against the card you are about to buy?


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
