TL;DR · Key Takeaways
- For most enterprise Private AI work (RAG, mid-size inference, light fine-tuning), the L40S is the default. Reach for H100/H200 only when the model needs more than one GPU or you are training.
- The GPU name is not the decision. Form factor (PCIe vs SXM), NVLink, memory size and vGPU profile support change the answer more than the marketing tier does.
- The RTX PRO 6000 Blackwell Server Edition (96 GB, MIG up to 4x) is the new value pick for inference density. Worth a hard look before you default to H100.
- A100 is a poor new-buy in 2026: no FP8, end of its road. Only sensible if you are extending an existing fleet.
- Whatever you pick, validate the exact server + GPU + ESXi combination in the Broadcom Compatibility Guide before a PO goes out.
Almost every Private AI design review I sit in opens the same way: someone has already decided they need HGX H100 because that is what the reference architecture diagram showed. Then we look at the actual workload, a handful of 8B to 13B models behind a RAG front end, and the room goes quiet. The cluster they specced will spend most of its life at 20% utilization, and the budget has already taken an eight-figure hit. Picking the GPU is the single most expensive decision in a Private AI build, and it is the one people anchor on the fastest.
This post is about choosing the physical GPU model. It is not about how you slice it once it is in the host. Time-slicing, MIG and passthrough are a separate decision that gets its own treatment in Part 6, so I will keep partitioning out of the way here and focus on the card you actually buy.
Start with the workload, not the spec sheet
Private AI Foundation will happily run on anything in the Broadcom Compatibility Guide. That is the trap. The platform supporting a GPU tells you nothing about whether it fits your workload. Three questions decide the tier, and you can answer them before you look at a single product page.
First, does your largest model fit in one GPU? A 13B model in FP8 lands around 14 GB of weights, comfortable in 48 GB with room for KV cache and concurrency. A 70B model in FP8 needs roughly 70 GB before context, which pushes you to an 80 GB H100 or a 96 GB Blackwell card, or to sharding across GPUs. Second, do you need to shard a single model across cards? If yes, you need NVLink, and that means SXM or an HGX board, not a PCIe card. Third, are you training or only serving? Training and heavy fine-tuning live on H100/H200 class hardware with NVLink. Inference, including RAG, almost never does.
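The first question is simple arithmetic, and a minimal sketch makes it repeatable. It assumes weights take roughly one byte per parameter in FP8 (two in FP16, half in INT4) and counts weights only. The 10% headroom default is a hypothetical placeholder for runtime overhead, not a measured figure, and real KV cache needs can be much larger.

```python
# Approximate bytes per parameter for common precisions (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str = "fp8") -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_on_one_gpu(params_billion: float, gpu_gb: float,
                    precision: str = "fp8",
                    headroom_fraction: float = 0.1) -> bool:
    """True if the weights fit after reserving a fraction of GPU memory.

    headroom_fraction is an assumed reserve for runtime overhead;
    tune it to your serving stack and context lengths.
    """
    usable_gb = gpu_gb * (1.0 - headroom_fraction)
    return weight_gb(params_billion, precision) <= usable_gb


# Which of the single-card options can hold a 70B model in FP8?
for gpu_name, gpu_gb in [("L40S", 48), ("H100", 80), ("RTX PRO 6000", 96)]:
    print(gpu_name, fits_on_one_gpu(70, gpu_gb))
```

Under these assumptions the 70B model misses on 48 GB and only just clears 80 GB, which is why the 96 GB card becomes interesting for that size class.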
Get those three answers and the GPU shortlist collapses to one or two options. Most enterprise Private AI workloads I see are inference and RAG over models under 34B. That is L40S or RTX PRO 6000 territory, full stop.
The contenders, side by side
Here is the shortlist as a selection matrix. Memory and interconnect are the two columns that decide most arguments, so read those first. The “sweet spot” column is my read from field work, not a vendor claim.
| GPU | Memory | Interconnect | MIG | Sweet spot for Private AI |
|---|---|---|---|---|
| L40S | 48 GB GDDR6 | PCIe only, no NVLink | No | Default for RAG and inference up to ~34B; dual-purpose so it also serves VDI/graphics |
| RTX PRO 6000 Blackwell | 96 GB GDDR7 | PCIe only, no NVLink | Yes, up to 4 x 24 GB | Inference density and consolidation; up to ~5x L40S throughput on agentic LLM serving |
| H100 (SXM) | 80 GB HBM3 | NVLink / NVSwitch | Yes, up to 7 instances | Training, fine-tuning, 70B+ models, multi-GPU sharding |
| H200 (SXM/HGX) | 141 GB HBM3e | NVLink / NVSwitch | Yes, up to 7 instances | Large-context inference and big models where memory is the bottleneck |
| A100 | 40 / 80 GB HBM2e | NVLink (SXM) | Yes, up to 7 instances | Extending an existing fleet only; no FP8, weak new-buy in 2026 |
One column on that table does more damage than any other when people ignore it: the interconnect column. It is the difference between a card you can gang together for one big model and a card you cannot.
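If it helps to treat the matrix as data, here is a small sketch that filters it by the two deciding columns. The values are copied from the table above, with A100 entered at its 80 GB variant; the `Gpu` class and `candidates` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    memory_gb: int
    nvlink: bool
    mig: bool


# Values taken from the selection matrix; A100 shown as the 80 GB SXM part.
MATRIX = [
    Gpu("L40S", 48, nvlink=False, mig=False),
    Gpu("RTX PRO 6000 Blackwell", 96, nvlink=False, mig=True),
    Gpu("H100 (SXM)", 80, nvlink=True, mig=True),
    Gpu("H200 (SXM/HGX)", 141, nvlink=True, mig=True),
    Gpu("A100 (SXM)", 80, nvlink=True, mig=True),
]


def candidates(min_memory_gb: float, needs_nvlink: bool) -> list[str]:
    """Names of cards that meet a per-GPU memory floor and, if required, NVLink."""
    return [
        g.name
        for g in MATRIX
        if g.memory_gb >= min_memory_gb and (g.nvlink or not needs_nvlink)
    ]


print(candidates(min_memory_gb=90, needs_nvlink=False))
print(candidates(min_memory_gb=0, needs_nvlink=True))
```

Asking for NVLink removes both PCIe-only cards regardless of how much memory they carry, which is the point of reading that column first.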
Form factor decides more than the model name
The mistake that costs real money is buying a GPU by name and ignoring its form factor. “H100” is not one product. The PCIe H100 and the SXM H100 differ in power, cooling, and the thing that matters most for large models: NVLink. PCIe cards talk to each other over the PCIe bus. SXM cards on an HGX board talk over NVLink and NVSwitch at an order of magnitude more bandwidth. If your plan is to shard a 70B model across two cards, a pair of PCIe H100s will technically work and then disappoint you, because the interconnect becomes the wall.
This is why the L40S and RTX PRO 6000 are limited to models that fit on a single card. They are PCIe parts with no NVLink. That is not a flaw, it is a fit. For data-parallel inference, where every replica is an independent copy of the model, you never needed NVLink in the first place, and you save the HGX premium. For a single model that does not fit in one card, PCIe is the wrong tool and you should be on SXM.
There is a second form-factor angle that is easy to miss. The L40, L40S and A40 are dual-purpose cards that can host heterogeneous vGPU profiles on one device, so the same GPU can serve a VDI pool and an inference workload side by side. The H100 is a compute-dedicated part and does not flex that way. If you have any ambition of consolidating graphics and AI on the same host, that pushes you toward the L-series, and it is the kind of detail that does not show up in a TFLOPS comparison.
Power, cooling and the three-host floor
The GPU you choose writes a check the data center has to cash. Eight SXM H100s at roughly 700 W each is a different facility than four L40S at 350 W. I have watched a design stall for months because the chosen HGX node needed rack power and liquid or high-static-pressure cooling the existing room could not deliver. The card was certified, the building was not.
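The gap is easy to put a number on. This sketch uses only the per-card wattages quoted above and counts GPU draw alone; CPUs, memory, fans and power-supply losses are deliberately left out, so treat the result as a lower bound per host.

```python
def gpu_power_kw(gpu_count: int, watts_per_gpu: float) -> float:
    """GPU-only power draw for one host, in kilowatts."""
    return gpu_count * watts_per_gpu / 1000.0


hgx_h100_kw = gpu_power_kw(8, 700)  # eight SXM H100s at roughly 700 W each
l40s_kw = gpu_power_kw(4, 350)      # four L40S at 350 W each

print(f"HGX H100 host: {hgx_h100_kw:.1f} kW of GPU draw")
print(f"L40S host:     {l40s_kw:.1f} kW of GPU draw")
```

That works out to 5.6 kW against 1.4 kW per host before anything else in the chassis is counted, a factor of four that the rack and the cooling plant both have to absorb.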
Two platform facts shape the cluster regardless of card. Private AI Foundation wants at least three GPU-enabled ESX hosts in the initial cluster of a workload domain, so you are buying GPUs in threes at minimum even for a pilot. And vGPU mode requires SR-IOV enabled in BIOS plus the matching NVIDIA vGPU host driver VIB in your vSphere Lifecycle Manager image, while passthrough mode skips the vGPU driver but gives up sharing. That driver-and-firmware dependency is worth confirming against your exact GPU before you commit, because not every card and server BIOS combination exposes SR-IOV cleanly.
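The three-host floor turns into a purchase quantity quickly. A minimal sketch, assuming every host in the cluster carries the same number of GPUs; `hosts_needed_for_load` is a hypothetical input you would get from your own sizing work.

```python
MIN_GPU_HOSTS = 3  # initial-cluster floor described above


def gpus_to_buy(hosts_needed_for_load: int, gpus_per_host: int) -> int:
    """Total GPUs to order once the three-host minimum is applied."""
    hosts = max(MIN_GPU_HOSTS, hosts_needed_for_load)
    return hosts * gpus_per_host


# A pilot whose load fits on one host still means three hosts of GPUs.
print(gpus_to_buy(hosts_needed_for_load=1, gpus_per_host=2))
```

Even a pilot that would fit on a single host with two cards ends up as a six-card order, which is worth knowing before the per-card price is multiplied out.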
Confirm that your exact server and GPU combination has an ESXi 9.1 entry in the Broadcom Compatibility Guide, check vGPU host driver and GPU Operator interoperability for your VCF build, verify NVAIE entitlement, and validate facility power and cooling per host before any purchase or production rollout.
Where each card actually earns its place
The L40S is the workhorse and my default recommendation for a first Private AI cluster. It runs the models most enterprises actually deploy, it is dual-purpose so it does not strand capacity, and it keeps power and cost sane. When someone asks me to “just pick something safe,” this is it.
The RTX PRO 6000 Blackwell Server Edition is the card I am increasingly steering inference-heavy clients toward. 96 GB on a single PCIe card covers models that used to force an 80 GB H100, NVIDIA cites up to roughly 5x the LLM inference throughput of an L40S on agentic workloads, and MIG lets you carve it into four isolated 24 GB instances for tenant separation. For a pure serving estate it can beat H100 on dollars per token. The caution is newness: confirm it is in your server vendor BOM and in the compatibility guide for your ESXi build before you bank on it.
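To see what the four-way split buys you, here is a rough sketch of how many isolated tenants one card can host. It models only the 4 x 24 GB layout mentioned above and ignores any other profile sizes. The per-tenant KV cache and runtime figure is a hypothetical input, and usable memory per instance may be slightly below the nominal 24 GB in practice.

```python
CARD_GB = 96
MIG_INSTANCES = 4
INSTANCE_GB = CARD_GB / MIG_INSTANCES  # 24 GB per instance in the four-way split


def tenants_per_card(weights_gb: float, kv_and_runtime_gb: float) -> int:
    """Isolated tenants one card can host under the four-way split.

    Returns 4 if a tenant fits in one 24 GB instance, 1 if it needs the
    whole unpartitioned card, and 0 if it does not fit on the card at all.
    """
    need_gb = weights_gb + kv_and_runtime_gb
    if need_gb <= INSTANCE_GB:
        return MIG_INSTANCES
    if need_gb <= CARD_GB:
        return 1
    return 0


print(tenants_per_card(weights_gb=13, kv_and_runtime_gb=6))  # 13B-class in FP8
print(tenants_per_card(weights_gb=34, kv_and_runtime_gb=6))  # 34B-class in FP8
```

With these assumed figures a 13B-class model gets four isolated tenants per card, while a 34B-class model takes the whole card, so the density argument holds mainly for the smaller models.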
The H100 and H200 earn their premium when memory or interconnect is the real constraint: training, serious fine-tuning, 70B-plus models, or anything that has to shard one model across cards over NVLink. H200’s 141 GB also helps long-context inference where KV cache, not weights, is what overflows. If you are not doing those things, you are paying for headroom you will not use. B200 is frontier-scale and rare in enterprise Private AI today; if you genuinely need it you already know.
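The KV cache point is worth a quick calculation. The sketch below uses the usual transformer accounting: two tensors (keys and values) per layer, sized by KV heads, head dimension, sequence length and batch. The layer and head counts are illustrative assumptions for a 70B-class model with grouped-query attention, not figures from any vendor, and the FP16 cache precision is also an assumption.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * seq_len * batch * bytes_per_elem)
    return total_bytes / 1e9


# Assumed 70B-class shape: 80 layers, 8 KV heads, head dimension 128.
weights_gb = 70.0  # FP8 weights, as in the sizing discussion earlier
cache_gb = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=32_768, batch=4)

print(f"KV cache: {cache_gb:.1f} GB, total: {weights_gb + cache_gb:.1f} GB")
```

Under these assumptions four concurrent 32k-token sequences add roughly 43 GB of cache on top of 70 GB of weights, which overflows an 80 GB card but sits inside 141 GB.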
For the full sizing picture across hosts, memory and concurrency, it is worth pairing this with the AI infrastructure sizing and cost calculator, and with the host and driver readiness work covered in Part 4 on planning and prerequisites. The component layout from Part 2 on architecture and components shows where the GPU layer fits in the wider stack.
The Bottom Line
My verdict, stated plainly. Default to the L40S for a first Private AI cluster running RAG and inference up to roughly 34B; it is the lowest-regret choice. Put the RTX PRO 6000 Blackwell on the shortlist whenever the estate is inference-dominated and you care about density and dollars per token, and prefer it over H100 there once it clears your compatibility guide. Spend on H100 or H200 only when training, large models, or NVLink sharding genuinely require it, and let H200’s memory decide between the two. Skip A100 for new buys, and treat B200 as the exception, not the plan. The recurring failure is not buying too little, it is buying H100 by reflex for a workload an L40S would have served at a third of the cost. Pick for the workload in front of you, not the diagram.
What is the GPU you are weighing for your next Private AI cluster, and what is making the call hard? Tell me in the comments.
References
- Requirements for Deploying VMware Private AI Foundation with NVIDIA 9.1 (Broadcom TechDocs)
- VMware Private AI Foundation with NVIDIA Server Guidance (VCF Blog)
- VMware Private AI Foundation with NVIDIA on HGX Servers for Inference (VCF Blog)
- NVIDIA RTX PRO 6000 Blackwell Server Edition (NVIDIA Blog)