TL;DR · Key Takeaways
- Plan the GPU-enabled workload domain first. You need at least 3 GPU-enabled ESX hosts in the initial cluster, VCF Single Sign-On on the instance that holds those hosts, and the Private AI Foundation add-on license assigned to that domain.
- The thing that breaks first-time deployments is not hardware, it is the version triangle: VCF build, the NVIDIA vGPU host driver VIB, the NVAIE guest driver, and the GPU Operator (25.10.x) all have to agree. Validate that matrix before you order anything.
- MIG cannot back NVIDIA NIM microservices. If your plan is model serving with NIM, do not architect around MIG for density.
- Most teams over-buy CPU and RAM and under-plan the east-west network. Inference leaves general-purpose cores idle while GPU-to-GPU traffic is what actually constrains you.
- VCF 9.1 adds exclusive GPU passthrough without an NVAIE license and keeps vMotion with low-latency GPU pathways. That changes the licensing math for some designs.
Here is the failure pattern I see most often. A team buys two beautiful 8-way H100 servers, racks them, and then spends three weeks fighting driver mismatches before a single model serves a token. The GPUs were never the hard part. Planning Private AI Foundation well is mostly about getting four moving versions to agree and sizing for the resource that actually runs out, which is almost never the one people budget for.
This is Part 4. Part 1 covered what the platform is, Part 2 walked the architecture layer by layer, and Part 3 untangled the licensing. This part is the design and readiness pass you run before you commit a bill of materials.
Start With the Workload Domain, Not the GPU
Private AI Foundation is not a product you install on top of a host. It is a set of capabilities you light up inside a GPU-enabled VCF workload domain. The management domain stays general purpose. The GPU hosts live in a separate VI workload domain so you can lifecycle them, license them, and fail them independently of the control plane.
Three hard requirements gate the whole design. You need a minimum of three GPU-enabled ESX hosts in the initial cluster, both for vSAN quorum and so a host can go down for driver remediation without stranding workloads. VCF Single Sign-On has to be configured on the VCF instance that contains those GPU hosts. And the Private AI Foundation add-on license has to land on the GPU workload domain, with a copy on the management domain only if you want the guided deployment UI in the vSphere Client.
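The three gates above are easy to turn into a pre-purchase check. The following is a minimal sketch, assuming a hypothetical `GpuDomainPlan` record that you fill in by hand; the class and field names are mine for illustration and do not correspond to any VCF API.

```python
from dataclasses import dataclass


@dataclass
class GpuDomainPlan:
    """Hypothetical planning record; field names are illustrative only."""
    gpu_hosts: int                 # GPU-enabled ESX hosts in the initial cluster
    sso_configured: bool           # VCF Single Sign-On on the instance holding the GPU hosts
    addon_license_on_domain: bool  # Private AI Foundation add-on assigned to the GPU domain


def unmet_requirements(plan: GpuDomainPlan) -> list[str]:
    """Return the hard requirements from the text that the plan does not meet."""
    gaps = []
    if plan.gpu_hosts < 3:
        gaps.append(f"need at least 3 GPU-enabled ESX hosts, plan has {plan.gpu_hosts}")
    if not plan.sso_configured:
        gaps.append("VCF Single Sign-On is not configured on the GPU hosts' VCF instance")
    if not plan.addon_license_on_domain:
        gaps.append("Private AI Foundation add-on license is not assigned to the GPU workload domain")
    return gaps


# The two-server purchase from the opening example fails the host-count gate.
two_host_plan = GpuDomainPlan(gpu_hosts=2, sso_configured=True, addon_license_on_domain=True)
for gap in unmet_requirements(two_host_plan):
    print("BLOCKER:", gap)
```

An empty list means the plan clears the stated gates. It says nothing about the optional management-domain license for the guided deployment UI, which is a convenience rather than a blocker.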
The Hardware Decision: Which GPU, How Many Hosts
Match the GPU to the workload, not to the spec sheet. The Broadcom Compatibility Guide is the source of truth for what is actually supported on ESX 9.0, so validate any card there before you buy. In the field the choice usually collapses to three buckets, and the cost difference between them is large enough that guessing is expensive.
| Workload profile | Sensible GPU | Why | Watch out for |
|---|---|---|---|
| RAG / small-to-mid LLM inference | L40S | 48 GB, strong inference per dollar, standard dual-slot PCIe card (350 W) | No NVLink, so it scales out, not up |
| Large model serving, high concurrency | H100 / H200 SXM | 80 to 141 GB, NVLink and NVSwitch for tensor parallelism | HGX systems, power and cooling, longer lead times |
| Fine-tuning / training | H100 / H200, 4x or 8x per host | GPUDirect RDMA across hosts for multi-node jobs | East-west fabric becomes the bottleneck, not the GPU |
| Mixed / shared dev | L40S with time slicing | Many light tenants per card | Noisy-neighbor effects under load |
My take: for a first deployment that is mostly RAG and inference, two or three L40S hosts get you to production faster and cheaper than a single HGX box, and they give you real host-level redundancy that one fat server cannot. Reach for H100 or H200 SXM when a single model genuinely will not fit in 48 GB, or when tensor parallelism across NVLink is the only way to hit your latency target. Buying H200 for a 7B-parameter RAG assistant is the most common over-spend I flag in design workshops.
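A quick back-of-the-envelope check makes the "will it fit in 48 GB" question concrete. This is a rough sketch, not a sizing tool: it counts weight memory at 2 bytes per parameter (FP16) and adds a headroom fraction for KV cache and runtime overhead. The 30% headroom is an assumption I picked for illustration; real headroom depends on context length, batch size, and the serving stack.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights alone, in GB (billions of params x bytes each)."""
    return params_billions * bytes_per_param


def fits_on_one_gpu(params_billions: float, gpu_gb: float,
                    bytes_per_param: float = 2.0, headroom: float = 0.30) -> bool:
    """True if weights plus an ASSUMED headroom fraction fit in one GPU's memory.

    headroom stands in for KV cache, activations, and runtime overhead;
    0.30 is an illustrative guess, not a measured or vendor figure.
    """
    needed = weights_gb(params_billions, bytes_per_param) * (1.0 + headroom)
    return needed <= gpu_gb


L40S_GB = 48  # per-card memory quoted in the table above

for size in (7, 13, 70):
    verdict = "fits" if fits_on_one_gpu(size, L40S_GB) else "does not fit"
    print(f"{size}B at FP16: ~{weights_gb(size):.0f} GB of weights, {verdict} in {L40S_GB} GB")
```

Under these assumptions a 7B model needs a small share of one L40S, while a 70B model at FP16 does not fit on a single 48 GB card at all. That is the point where NVLink-class hardware or quantization enters the conversation.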
The Version Triangle That Actually Bites
This is where deployments stall. Four versions have to be interoperable at the same time, and none of them moves on the same release cadence. The VCF build dictates the vSphere and ESX version. The NVIDIA vGPU host driver VIB has to match that ESX version. The NVAIE guest driver in the deep learning VM or container has to match the host driver branch. And the GPU Operator that manages drivers inside VKS clusters has to match both the guest driver and the Kubernetes version of the cluster.
As a concrete reference point, a current-generation bring-up pairs GPU Operator 25.10.x with NVAIE 7 vGPU drivers in the 580.x branch (for example the 580.105.08 build) and an H100XM-80C class vGPU profile on H100 SXM5 hardware. Those numbers will move, which is exactly the point: pin them against the release notes for your specific VCF build before you commit, not after.
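One way to keep those pins honest is to write them down once and compare against what you actually observe. The sketch below is a minimal example under stated assumptions: the pin values come from the reference point above, the `observed` values are hypothetical and typed in by hand, and `matches_branch` is a helper of my own, not part of any NVIDIA or VMware tooling.

```python
def matches_branch(version: str, pin: str) -> bool:
    """Check a dotted version against a pin such as '580.x' or '25.10.x'.

    An 'x' component in the pin matches any value; every other component
    must be equal. The version may carry extra trailing components.
    """
    v_parts = version.split(".")
    p_parts = pin.split(".")
    if len(v_parts) < len(p_parts):
        return False
    return all(p == "x" or p == v for p, v in zip(p_parts, v_parts))


# Pins taken from the reference point in the text; re-derive them from the
# release notes for your own VCF build before relying on them.
PINS = {"gpu_operator": "25.10.x", "vgpu_driver": "580.x"}

# Hypothetical observed values, e.g. copied by hand from each component.
observed = {"gpu_operator": "25.10.0", "vgpu_driver": "580.105.08"}

drift = {name: ver for name, ver in observed.items()
         if not matches_branch(ver, PINS[name])}
print("version drift:", drift or "none")
```

This only catches drift from the pins you wrote down. Whether the pins themselves are mutually compatible is still a question for the release notes.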
The mechanical part of readiness is getting the host driver into the cluster image. You download the vGPU host driver VIB from the NVIDIA licensing portal, add it to a vSphere Lifecycle Manager image in SDDC Manager, and remediate the hosts. Confirm it took before you go further.
```shell
# On the ESX host: confirm the vGPU host driver VIB is installed
esxcli software vib list | grep -i nvd

# On the ESX host: confirm the driver sees the GPU and ESX enumerates it
# (vGPU on current data center GPUs also needs SR-IOV enabled in the host BIOS)
nvidia-smi
esxcli graphics device list

# On the ESX host: list the vGPU profiles the host manager supports
nvidia-smi vgpu -s

# From a workstation with access to the VKS cluster: confirm the GPU Operator pods are healthy
kubectl get pods -n gpu-operator
```
Sharing Mode: Time Slicing, MIG or Passthrough
This decision shapes the rest of the design, so make it deliberately. There is one trap that catches people who optimize for density without reading the fine print: you cannot back NVIDIA NIM microservices with MIG. If your model-serving plan is NIM, MIG is off the table for those GPUs, full stop. That single limitation rules out the partitioning strategy a lot of teams reach for first.
Passthrough gives a VM the entire physical GPU with no virtualization overhead, and in VCF 9.1 you can do exclusive passthrough without an NVAIE license while keeping vMotion through the new low-latency GPU data pathways built on BlueField-3 and ConnectX-7. Time slicing shares one card across several vGPU-backed VMs by interleaving in time, which is the right default for RAG, light inference, and dev. MIG carves a card into hardware-isolated slices with guaranteed memory and compute, which is excellent for strict multi-tenant isolation and useless to you if you are serving with NIM.
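The rules in this section reduce to a short elimination. Here is a small sketch that encodes only what the text states; the function and its three yes/no inputs are my own simplification for planning conversations, and the result is a shortlist to validate, not a statement of what the platform will accept.

```python
def candidate_sharing_modes(serves_with_nim: bool,
                            needs_hard_isolation: bool,
                            needs_whole_gpu: bool) -> list[str]:
    """Return the sharing modes left after applying the constraints in the text."""
    if needs_whole_gpu:
        return ["passthrough"]
    modes = ["time slicing", "MIG", "passthrough"]
    if serves_with_nim:
        modes.remove("MIG")  # per the text, MIG cannot back NIM microservices
    if needs_hard_isolation:
        # Time slicing interleaves tenants on shared hardware, so drop it
        # when hardware-level isolation is the requirement.
        modes.remove("time slicing")
    return modes


print(candidate_sharing_modes(serves_with_nim=True, needs_hard_isolation=False, needs_whole_gpu=False))
print(candidate_sharing_modes(serves_with_nim=False, needs_hard_isolation=True, needs_whole_gpu=False))
```

Note what happens when you need both NIM and hard isolation: only passthrough survives, which means one tenant per physical GPU and a very different host count.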
Host BOM and the Over-Provisioning Trap
Here is the counterintuitive part of sizing a Private AI host. The GPU is the constraint; the CPU and RAM around it usually are not. Field measurements on inference hosts routinely show only a small fraction of the cores busy and a small fraction of system memory in use, while the GPU sits pinned. I have seen an inference node use roughly two dozen of more than two hundred logical cores and a quarter of two terabytes of RAM. If you size CPU and RAM like a general-purpose cluster, you pay for silicon that idles.
Spend that budget on the network instead. For multi-GPU and multi-node work the east-west fabric is what actually limits throughput. GPUDirect RDMA, NVLink within a host, and NVSwitch across an HGX baseboard are what keep tensor-parallel jobs fed. A pair of 8-way H100 hosts with a slow fabric between them will underperform a well-connected design every time.
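To put a number on the idle capacity described in this section, here is the arithmetic as a tiny sketch. The inputs are round stand-ins for the anecdote (about 24 of 200 logical cores, about 512 GB of 2048 GB RAM); they are assumptions for illustration, so substitute your own measurements.

```python
def idle_fraction(used: float, total: float) -> float:
    """Share of a resource that sits unused, between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - used / total


# Round numbers standing in for the anecdote above (ASSUMED, not measured here):
# about 24 of 200 logical cores busy, about 512 GB of 2048 GB RAM in use.
cpu_idle = idle_fraction(used=24, total=200)
ram_idle = idle_fraction(used=512, total=2048)

print(f"CPU idle: {cpu_idle:.0%}, RAM idle: {ram_idle:.0%}")
```

With those inputs, roughly 88% of the cores and 75% of the memory are doing nothing. That is the budget the text suggests moving to the east-west fabric.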
What I’d Do
For a first Private AI Foundation deployment I would stand up a three-host L40S workload domain, run passthrough or full vGPU profiles rather than MIG, and pin the entire version triangle against the release notes for my exact VCF build before a single purchase order goes out. I would size CPU and RAM modestly, spend the saved budget on the east-west fabric, and only move to H100 or H200 SXM once a real model proves it needs NVLink. The teams that struggle are almost always the ones that bought the biggest GPU server first and worked out the software interoperability afterward. Reverse that order and the deployment goes quiet.
Next in the series we get hands-on: deploying Private AI Foundation on VCF 9 end to end. What is the first thing you check on a new GPU host, the driver or the fabric? Tell me how you sequence it.
References
- Broadcom TechDocs: Requirements for Deploying VMware Private AI Foundation with NVIDIA
- VCF Blog: VCF 9.1, the Cost-Effective Private Cloud Platform for Production AI
- VCF Blog: Introducing the Private AI Sizing Guide
- NVIDIA GPU Operator Release Notes