TL;DR · Key Takeaways
- A GPU workload domain is not a normal VI workload domain with cards added later. The firmware, the vLCM image, and the host graphics mode all have to be right before SDDC Manager ever builds the cluster.
- Put the NVIDIA vGPU host driver VIB inside the cluster vSphere Lifecycle Manager image. Hand-installing it per host is the single most common reason the fourth host you add six months later silently drifts.
- Two firmware settings decide whether vGPU works at all: SR-IOV and large-BAR memory mapping. Miss them and the driver loads cleanly while zero vGPU profiles appear.
- Private AI Services only runs on a single-zone Supervisor, and MIG cannot back NVIDIA NIMs. Both constraints shape the cluster you build here; neither can be deferred to a later step.
- PAIF 9.0 on VCF 9.0 needs three licenses stacked: VCF subscription, the Private AI Foundation add-on, and an NVIDIA vGPU entitlement.
Most Private AI Foundation projects do not fail at the model. They fail three weeks earlier, on a workload domain that was built like every other VI domain and then had GPUs bolted on. The host driver was installed by hand on three nodes, SR-IOV was never touched in firmware, and the Supervisor was stretched across two zones for “high availability.” Everything looks green in SDDC Manager. Then the first deep learning VM cannot find a vGPU profile, Private AI Services refuses to install, and nobody can say why. This post is the procedure that prevents that, end to end, the way I run it on a real engagement.
What you are actually building
A GPU workload domain in PAIF 9.0 is a stack, and every layer below the AI workload has to be configured in order or the layer above it never comes up. The firmware exposes the GPU correctly to ESX, the host driver turns the physical card into schedulable vGPU, the host graphics mode makes those profiles available to VMs, and only then can vSphere Supervisor, deep learning VMs, VKS clusters, and Private AI Services consume them. Skip a layer and the symptom always shows up higher than the cause, which is why this work is so frustrating to debug after the fact.
Before you touch a host: BOM and licensing
PAIF 9.0 runs on top of VCF 9.0, and the licensing is stacked, not single. You need a VCF subscription, the Private AI Foundation with NVIDIA add-on, and a licensed NVIDIA vGPU entitlement that includes the host driver VIB and the guest OS drivers. The add-on is what enables the guided deployment UI, the VCF Automation catalog items, the pgvector database in Data Services Manager, and Private AI Services. Assign the add-on to the GPU-enabled workload domain. Assigning it to the management domain only activates the guided UI in the vSphere Client; it does not light up the workloads.
One detail people miss when they cost this out: you can run GPU workloads and read GPU metrics in vCenter and VCF Operations under the base VCF license alone. The add-on buys the automation and the managed services, not raw GPU access. That distinction matters when you are deciding whether a small proof of concept needs the full add-on on day one. Validate your exact host and GPU against the Broadcom Compatibility Guide for GPU and Accelerators before you buy anything, because an unsupported card is a dead end no license fixes.
| Layer | Component (PAIF 9.0) | What to verify |
|---|---|---|
| Platform | VCF 9.0 | Management domain healthy, SSO configured on the instance with the GPU hosts |
| GPU host driver | NVIDIA vGPU host manager VIB (for example NVD-AIE_ESXi_9.x driver) | Version compatible with VCF 9.0, downloaded from nvid.nvidia.com |
| Guest stack | NVIDIA GPU Operator 25.10.x (guest driver v580.x) | Operator version matched to the VKS Kubernetes release and the host driver |
| Hosts | 3 or more GPU-enabled ESX hosts | Identical GPU model and firmware across the initial cluster |
| License | VCF subscription + PAIF add-on + NVIDIA vGPU | Add-on assigned to the GPU workload domain |
The procedure, end to end
Step 1: firmware. On every host that will carry a GPU, set the firmware before the host ever joins a cluster. Enable SR-IOV, enable Intel VT-d or AMD IOMMU, and turn on the large memory mapping settings your vendor calls Above-4G Decoding and large or 64-bit BAR / MMIO. On many platforms you also want MMIO space sized to at least the total GPU framebuffer plus headroom. Leave ECC on for the data-center GPUs. These are the settings that decide whether the GPU presents correctly; the OS layer cannot compensate for a firmware that hides the device.
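The MMIO sizing rule in step 1 is plain arithmetic, so I like to write it down before opening the firmware setup screen. This is a minimal sketch, assuming a hypothetical helper named `mmio_min_gb` and a headroom figure you pick yourself; the right headroom depends on your server vendor's guidance, and the numbers in the example are illustrative only.

```shell
# Hypothetical helper: minimum MMIO window in GB for one host.
# Arguments: GPU count, framebuffer per GPU in GB, headroom in GB.
# The headroom value is an assumption you supply, not a vendor figure.
mmio_min_gb() {
    gpus=$1
    fb_gb=$2
    headroom_gb=$3
    echo $(( gpus * fb_gb + headroom_gb ))
}

# Example: 4 GPUs with 80 GB of framebuffer each, 64 GB of assumed headroom.
mmio_min_gb 4 80 64   # prints 384
```

If the firmware only offers fixed MMIO sizes, pick the first option at or above the number this prints.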
Step 2: bake the driver into the vLCM image. This is the step that separates a domain that ages well from one that rots. Add the NVIDIA vGPU host manager VIB to a vSphere Lifecycle Manager image in SDDC Manager, and base the workload domain on that image. Do not install the VIB host by host and forget it. The reason is operational, not cosmetic: when you add a fourth or fifth host in nine months, vLCM remediates it to the image automatically, driver included. A hand-installed estate drifts the moment someone expands the cluster, and a vGPU version skew between hosts produces failures that are miserable to trace.
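Drift of the kind described in step 2 is easy to detect once you have the driver version from every host in one place. The sketch below assumes you have already collected `hostname version` pairs, one per line, by whatever means you prefer; `vib_drift` is a hypothetical helper name, and the hostnames and versions in the usage comment are made up.

```shell
# Hypothetical helper: read "host version" pairs on stdin and report
# whether every host carries the same NVIDIA host driver version.
# Prints "consistent" and returns 0, or prints "drift" followed by
# a per-version host count and returns 1.
vib_drift() {
    awk 'NF >= 2 { count[$2]++ }
         END {
             n = 0
             for (v in count) n++
             if (n <= 1) { print "consistent"; exit 0 }
             print "drift"
             for (v in count) print count[v], "host(s) on", v
             exit 1
         }'
}

# Usage (illustrative data):
#   printf 'esx01 9.0.1\nesx02 9.0.1\nesx04 9.0.0\n' | vib_drift
```

With the driver in the vLCM image this check should always come back consistent; I run it anyway after every cluster expansion.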
Step 3: deploy the workload domain. In SDDC Manager, create the VI workload domain on the GPU hosts, selecting the vLCM image that contains the host driver VIB. The workload domain vCenter is deployed into the management domain SSO domain. Confirm that the workload domain vCenter and the management vCenter are joined in a vCenter group through vCenter Linking, because the guided Private AI deployment UI and Private AI Services both expect that linkage. Start with at least three GPU hosts in the initial cluster; this is a hard floor for the domain, not a sizing suggestion.
Step 4: set the host graphics mode. For vGPU, every GPU device on every host has to be set to Shared Direct. This is the step that actually exposes vGPU profiles to virtual machines, and it is the one most often skipped. Place the host in maintenance mode, open Configure > Hardware > Graphics, edit each device, set Device Type to Shared Direct, then reboot and exit maintenance mode. If you are running Private AI Services on passthrough instead, toggle DirectPath I/O on the PCI devices rather than setting Shared Direct. Do not mix the two intentions on one card without understanding the dual-purpose vGPU mode trade-off.
| Setting | Where | Value for vGPU |
|---|---|---|
| SR-IOV | Server firmware | Enabled |
| VT-d / IOMMU | Server firmware | Enabled |
| Above-4G / large BAR (MMIO) | Server firmware | Enabled, sized to framebuffer + headroom |
| Graphics device type | ESX host graphics | Shared Direct |
| Passthrough (alternative) | ESX PCI devices | DirectPath I/O, for Private AI Services only |
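The graphics setting from step 4 can also be spot-checked from the host shell. The sketch below is hedged on two assumptions: that `esxcli graphics host get` prints a `Default Graphics Type:` line as it does on earlier ESXi releases, and that `SharedPassthru` is the CLI name for what the UI calls Shared Direct. It reports the host default only, not each device, so treat it as a quick sanity check and confirm the per-device setting in the UI. `is_shared_direct` is a hypothetical helper name.

```shell
# Hypothetical helper: read `esxcli graphics host get` output on stdin
# and succeed only if the host default graphics type is SharedPassthru
# (assumed here to be the CLI name for Shared Direct in the UI).
is_shared_direct() {
    awk -F': *' '/Default Graphics Type/ {
                     gsub(/[ \t\r]+$/, "", $2)
                     if ($2 == "SharedPassthru") ok = 1
                 }
                 END { exit (ok ? 0 : 1) }'
}

# Usage on an ESX host:
#   esxcli graphics host get | is_shared_direct && echo "default is Shared Direct"
```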
Step 5: enable the Supervisor. Enable vSphere Supervisor on the GPU cluster. This is where a quiet constraint bites: Private AI Services supports only single-zone Supervisors. If you reflexively built a three-zone Supervisor for availability, Private AI Services will not install on it, and you will be rebuilding the Supervisor later. Decide up front whether this domain is for the full Private AI Services experience or for deep learning VMs and VKS clusters that tolerate multi-zone, and build the Supervisor accordingly.
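The single-zone rule in step 5 is simple enough to encode as a design gate that runs before anyone enables the Supervisor. This is only a sketch: `supervisor_plan_ok` is a hypothetical helper, and its two arguments (planned zone count, and `yes` or `no` for whether Private AI Services is on the roadmap) are inputs you supply from your design, not values read from vCenter.

```shell
# Hypothetical design gate: arguments are the planned Supervisor zone
# count and "yes"/"no" for whether Private AI Services is on the roadmap.
supervisor_plan_ok() {
    zones=$1
    wants_pais=$2
    if [ "$wants_pais" = "yes" ] && [ "$zones" -ne 1 ]; then
        echo "blocked: Private AI Services needs a single-zone Supervisor"
        return 1
    fi
    echo "ok"
}

# Usage:
#   supervisor_plan_ok 3 yes   # blocked
#   supervisor_plan_ok 1 yes   # ok
```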
Step 6: license and VM classes. Assign the Private AI Foundation add-on license to this workload domain. Then create the vGPU-based VM classes that map to the vGPU profiles you exposed in step 4. A VM class that asks for a C-type compute profile your hosts do not present will never schedule, so the profiles on the card and the profiles in the VM class have to agree. This is also where you choose your sharing model deliberately, which leads to the one constraint that trips up density-focused teams.
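The agreement between VM class and host profile in step 6 can be checked mechanically before you publish a class. The sketch below assumes you feed it the list of vGPU types the host reports, one per line, for example from `nvidia-smi vgpu -s`; `profile_presented` is a hypothetical helper, and the profile names in the usage comment are illustrative rather than taken from a real host.

```shell
# Hypothetical helper: succeed if the vGPU profile a VM class requests
# appears, as a whole line, in the list of profiles read from stdin.
# Leading and trailing whitespace on each line is ignored.
profile_presented() {
    wanted=$1
    awk -v want="$wanted" '
        { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }
        $0 == want { found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Usage on an ESX host (profile name is illustrative):
#   nvidia-smi vgpu -s | profile_presented "NVIDIA A100-40C" \
#       || echo "VM class would never schedule"
```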
Step 7: validate before you hand it over. Do not declare the domain ready because SDDC Manager is green. Prove the GPU path end to end. The commands below are the quickest way to confirm the host driver is live and the card is healthy before anyone tries to provision a workload.
```shell
# On the ESX host: is the vGPU host driver loaded and the GPU healthy?
nvidia-smi

# Confirm the host driver VIB is installed on this host
esxcli software vib list | grep -i nvd

# If you ever switch a host from MIG back to passthrough,
# turn MIG off per GPU first, then remove the host driver.
# Use the VIB name reported by the list command above.
nvidia-smi -i 0 -mig 0
esxcli software vib remove -n NVD-AIE_ESXi_9.1.0_Driver
reboot
```
The mistakes that actually bite
Three things account for most of the wasted days I see on this build. First, firmware that was never touched: the host driver loads, nvidia-smi even works, but no vGPU profiles appear because SR-IOV or the large-BAR mapping is off. Always confirm firmware before you blame the driver. Second, version skew between the ESX host driver and the guest stack. GPU Operator 25.10.x ships a v580.x guest driver, and pairing that with an older host driver produces guest pods that crash-loop with opaque CUDA errors. Treat the host driver, the GPU Operator, and the VKS Kubernetes release as one interoperability set and pin them together.
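A cheap first-pass guard against the version skew described above is to compare driver branches on both sides. The sketch below assumes a deliberately simplified rule, that host and guest drivers should share the same major branch (the digits before the first dot). That is my heuristic, not NVIDIA's support matrix, which remains the authority. `same_driver_branch` is a hypothetical helper and the version strings in the usage comment are illustrative.

```shell
# Hypothetical first-pass check: do the host and guest drivers come
# from the same major branch (the digits before the first dot)?
# An optional leading "v" is stripped from either argument.
same_driver_branch() {
    host=${1#v}
    guest=${2#v}
    [ "${host%%.*}" = "${guest%%.*}" ]
}

# Usage (versions are illustrative; read the real ones with
# `nvidia-smi --query-gpu=driver_version --format=csv,noheader`
# on the host and inside the guest):
#   same_driver_branch 570.124 580.82 || echo "branch mismatch"
```

A match here does not prove the pair is supported; a mismatch is a strong hint to stop and check the matrix before chasing CUDA errors.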
Third, the MIG-versus-NIM trap. Teams chasing density enable Multi-Instance GPU on the cluster, then discover that NVIDIA NIM microservices, the thing they actually wanted to run, are not supported on MIG. If your roadmap includes NIM-based inference or Private AI Services, do not enable MIG on those hosts. Use time-slicing or full-profile vGPU instead and accept the lower density. This is a design decision you make here, on the workload domain, not something you can quietly flip later without disruption. For the deeper trade-off between vGPU, MIG, and passthrough, see GPU partitioning for VMware Private AI, and for how many hosts and what GPU to buy, see the reference architecture and sizing post.
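To keep MIG from sneaking onto hosts earmarked for NIMs, the MIG mode of each GPU can be checked from the host shell. The sketch below assumes the `index, mode` CSV lines that `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader` prints; `mig_enabled_gpus` is a hypothetical helper name.

```shell
# Hypothetical helper: read "index, mig-mode" CSV lines on stdin and
# list every GPU that still has MIG enabled. Returns 1 if any is found.
mig_enabled_gpus() {
    awk -F', *' '$2 ~ /^Enabled/ { print "GPU " $1 ": MIG enabled"; bad = 1 }
                 END { exit (bad ? 1 : 0) }'
}

# Usage on an ESX host:
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader \
#       | mig_enabled_gpus || echo "do not place NIM workloads here"
```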
What I’d Do
If I am building this from scratch, I freeze three decisions before I rack a thing: the host driver goes in the vLCM image from day one, the Supervisor is single-zone if Private AI Services is on the roadmap, and MIG stays off unless I have proven I will never run NIMs on those hosts. Get those three right and the rest of the procedure is mechanical. Get them wrong and you are rebuilding the domain after the data scientists have already started, which is the worst possible time. Build the domain against your design from the planning and prerequisites work, and if you want the generic VI workload domain mechanics underneath this, the VI workload domain deployment walkthrough covers the base.
What is the one firmware or Supervisor setting that has burned you on a GPU domain? Tell me in the comments, because the list of things that hide below SDDC Manager green is longer than any doc admits.