Dr. Pranay Jha

Prepare a GPU Workload Domain for VMware Private AI Foundation (Private AI Series, Part 8)

A field-tested, bottom-up procedure for standing up a GPU-accelerated workload domain on VCF 9.0 for Private AI Foundation: firmware, the vLCM vGPU driver, Shared Direct, a single-zone Supervisor, and the mistakes that actually bite.

VMware Private AI Series · Part 8 of 24

TL;DR · Key Takeaways

  • A GPU workload domain is not a normal VI workload domain with cards added later. The firmware, the vLCM image, and the host graphics mode all have to be right before SDDC Manager ever builds the cluster.
  • Put the NVIDIA vGPU host driver VIB inside the cluster vSphere Lifecycle Manager image. Hand-installing it per host is the single most common reason the fourth host you add six months later silently drifts.
  • Two firmware settings decide whether vGPU works at all: SR-IOV and large-BAR memory mapping. Miss them and the driver loads cleanly while zero vGPU profiles appear.
  • Private AI Services only runs on a single-zone Supervisor, and MIG cannot back NVIDIA NIMs. Both constraints shape the cluster you build here, not a later step.
  • PAIF 9.0 on VCF 9.0 needs three licenses stacked: VCF subscription, the Private AI Foundation add-on, and an NVIDIA vGPU entitlement.

Who this is for: VI administrators and VCF architects standing up the first GPU-accelerated workload domain for Private AI Foundation.

Prerequisites: a healthy VCF 9.0 management domain, GPU hosts on the compatibility guide, the vGPU host driver VIB downloaded, and the licenses below in hand.

Most Private AI Foundation projects do not fail at the model. They fail three weeks earlier, on a workload domain that was built like every other VI domain and then had GPUs bolted on. The host driver was installed by hand on three nodes, SR-IOV was never touched in firmware, and the Supervisor was stretched across two zones for “high availability.” Everything looks green in SDDC Manager. Then the first deep learning VM cannot find a vGPU profile, Private AI Services refuses to install, and nobody can say why. This post is the procedure that prevents that, end to end, the way I run it on a real engagement.

What you are actually building

A GPU workload domain in PAIF 9.0 is a stack, and every layer below the AI workload has to be configured in order or the layer above it never comes up. The firmware exposes the GPU correctly to ESX, the host driver turns the physical card into schedulable vGPU, the host graphics mode makes those profiles available to VMs, and only then can vSphere Supervisor, deep learning VMs, VKS clusters, and Private AI Services consume them. Skip a layer and the symptom always shows up higher than the cause, which is why this work is so frustrating to debug after the fact.

Figure: The GPU Workload Domain Stack. Configure bottom-up; the cause of a failure always sits below the symptom.

  1. Server firmware: SR-IOV, VT-d, Above-4G / large-BAR MMIO, ECC
  2. ESX + vGPU host driver (in the vLCM image): NVD-AIE host manager VIB, version-matched to VCF 9.0
  3. Host graphics (Shared Direct): vGPU profiles for VMs, or DirectPath I/O for passthrough
  4. vSphere Supervisor (single zone): VM classes, VKS, Private AI Services Supervisor Service
  5. AI workloads: deep learning VMs, VKS GPU clusters, Private AI Services

Scope of Part 8: layers 1 to 4 are what you prepare in this post. Layer 5 is the payoff, covered from Part 9 on.
The dependency stack: firmware and host driver are not optional extras, they are the foundation everything else sits on.

Before you touch a host: BOM and licensing

PAIF 9.0 runs on top of VCF 9.0, and the licensing is stacked rather than a single entitlement. You need a VCF subscription, the Private AI Foundation with NVIDIA add-on, and a licensed NVIDIA vGPU entitlement that includes the host driver VIB and the guest OS drivers. The add-on is what enables the guided deployment UI, the VCF Automation catalog items, the pgvector database in Data Services Manager, and Private AI Services. Assign the add-on to the GPU-enabled workload domain. Assigning it to the management domain only activates the guided UI in the vSphere Client; it does not light up the workloads.

One detail people miss when they cost this out: you can run GPU workloads and read GPU metrics in vCenter and VCF Operations under the base VCF license alone. The add-on buys the automation and the managed services, not raw GPU access. That distinction matters when you are deciding whether a small proof of concept needs the full add-on on day one. Validate your exact host and GPU against the Broadcom Compatibility Guide for GPU and Accelerators before you buy anything, because an unsupported card is a dead end no license fixes.

Layer | Component (PAIF 9.0) | What to verify
Platform | VCF 9.0 | Management domain healthy, SSO configured on the instance with the GPU hosts
GPU host driver | NVIDIA vGPU host manager VIB (for example NVD-AIE_ESXi_9.x driver) | Version compatible with VCF 9.0, downloaded from nvid.nvidia.com
Guest stack | NVIDIA GPU Operator 25.10.x (guest driver v580.x) | Operator version matched to the VKS Kubernetes release and the host driver
Hosts | 3 or more GPU-enabled ESX hosts | Identical GPU model and firmware across the initial cluster
License | VCF subscription + PAIF add-on + NVIDIA vGPU | Add-on assigned to the GPU workload domain

The procedure, end to end

Figure: Build order, firmware first, validation last.

  1. Firmware settings
  2. vLCM image + driver VIB
  3. Deploy WLD via SDDC Manager
  4. Shared Direct
  5. Supervisor, single zone
  6. License + VM classes
  7. Validate with nvidia-smi
Seven steps. Do them in order; the validation at step 7 only passes if 1 through 4 were done right.

Step 1: firmware. On every host that will carry a GPU, set the firmware before the host ever joins a cluster. Enable SR-IOV, enable Intel VT-d or AMD IOMMU, and turn on the large memory mapping settings your vendor calls Above-4G Decoding and large or 64-bit BAR / MMIO. On many platforms you also want MMIO space sized to at least the total GPU framebuffer plus headroom. Leave ECC on for the data-center GPUs. These are the settings that decide whether the GPU presents correctly; the OS layer cannot compensate for a firmware that hides the device.
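The MMIO sizing rule in step 1 is simple arithmetic, and it is worth writing down before you sit in front of the firmware setup screen. The sketch below is an illustration only: the function name `mmio_floor_gib` and the headroom percentage are my assumptions, not vendor values, and your server vendor's guidance wins when it differs.

```shell
#!/bin/sh
# Sketch: estimate a floor for the firmware MMIO window on one GPU host.
# Assumption: headroom is a percentage you choose; it is not a vendor-published value.

# mmio_floor_gib GPU_COUNT FRAMEBUFFER_GIB_PER_GPU HEADROOM_PERCENT
mmio_floor_gib() {
    gpus=$1
    fb_gib=$2
    headroom_pct=$3
    total_fb=$((gpus * fb_gib))
    # Integer arithmetic, rounded up so the estimate never undershoots.
    echo $(( (total_fb * (100 + headroom_pct) + 99) / 100 ))
}

# Example: 4 GPUs with 80 GiB of framebuffer each, 25 percent headroom.
mmio_floor_gib 4 80 25   # prints 400
```

Treat the result as the minimum to look for among the MMIO size options your firmware offers, then pick the next value at or above it.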

Step 2: bake the driver into the vLCM image. This is the step that separates a domain that ages well from one that rots. Add the NVIDIA vGPU host manager VIB to a vSphere Lifecycle Manager image in SDDC Manager, and base the workload domain on that image. Do not install the VIB host by host and forget it. The reason is operational, not cosmetic: when you add a fourth or fifth host in nine months, vLCM remediates it to the image automatically, driver included. A hand-installed estate drifts the moment someone expands the cluster, and a vGPU version skew between hosts produces failures that are miserable to trace.

Figure: Driver flows through the image, not into hosts by hand. nvid.nvidia.com (vGPU host VIB) → SDDC Manager vLCM cluster image (VIB added here) → remediate cluster to image → every host, new ones too. One source of truth means host four, added later, is never a special case.
Driver lifecycle through vLCM: this is the difference between a domain that scales cleanly and one that drifts.
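If you inherit a cluster where the driver was installed by hand, a quick way to see the drift is to compare the driver version each host reports against the version in the image. This is a minimal sketch under stated assumptions: the function `find_driver_drift`, the hostnames, and the version strings are all hypothetical, and the "hostname version" input is something you would collect yourself, for example from the `esxcli software vib list` output on each host.

```shell
#!/bin/sh
# Sketch: flag hosts whose vGPU host driver version differs from the image version.
# Assumption: input is one "hostname version" pair per line, gathered by you.

# find_driver_drift EXPECTED_VERSION   (reads the inventory on stdin)
find_driver_drift() {
    expected=$1
    # Appending "" forces a string comparison, so 9.10 and 9.1 are never treated as equal.
    awk -v want="$expected" '($2 "") != (want "") { print $1 }'
}

# Hypothetical inventory: three hosts built from the image, one added by hand later.
inventory='esx-gpu-01 9.1.0
esx-gpu-02 9.1.0
esx-gpu-03 9.1.0
esx-gpu-04 9.0.0'

printf '%s\n' "$inventory" | find_driver_drift 9.1.0   # prints esx-gpu-04
```

An empty result is what a vLCM-managed cluster should give you every time. Anything else is a host to remediate before it causes a skew failure.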

Step 3: deploy the workload domain. In SDDC Manager, create the VI workload domain on the GPU hosts, selecting the vLCM image that contains the host driver VIB. The workload domain vCenter is deployed into the management domain SSO domain. Confirm that the workload domain vCenter and the management vCenter are joined in a vCenter group through vCenter Linking, because the guided Private AI deployment UI and Private AI Services both expect that linkage. Start with at least three GPU hosts in the initial cluster; this is a hard floor for the domain, not a sizing suggestion.
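Before I open the workload domain wizard, I like to check the host list against the two rules that matter for the initial cluster: at least three GPU hosts, and one GPU model and firmware level across all of them. The sketch below assumes a hand-built inventory of "hostname gpu_model gpu_firmware" lines; the function name `preflight_gpu_cluster` and the sample values are hypothetical.

```shell
#!/bin/sh
# Sketch: pre-flight check for the initial GPU cluster.
# Assumption: stdin carries one "hostname gpu_model gpu_firmware" line per host.

preflight_gpu_cluster() {
    awk '
        NF >= 3 { hosts++; models[$2] = 1; firmware[$3] = 1 }
        END {
            for (m in models) nmodels++
            for (f in firmware) nfw++
            if (hosts < 3)        print "fail: fewer than three GPU hosts"
            else if (nmodels > 1) print "fail: mixed GPU models"
            else if (nfw > 1)     print "fail: mixed GPU firmware"
            else                  print "ok"
        }'
}

# Hypothetical three-host cluster with identical cards.
printf 'esx-gpu-01 modelA fw1\nesx-gpu-02 modelA fw1\nesx-gpu-03 modelA fw1\n' | preflight_gpu_cluster
```

The check reports only the first problem it finds, in the order you would fix them: host count, then model, then firmware.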

Step 4: set the host graphics mode. For vGPU, every GPU device on every host has to be set to Shared Direct. This is the step that actually exposes vGPU profiles to virtual machines, and it is the one most often skipped. Place the host in maintenance mode, open Configure, Hardware, Graphics, edit each device, set Device Type to Shared Direct, then reboot and exit maintenance mode. If you are running Private AI Services on passthrough instead, toggle DirectPath I/O on the PCI devices rather than Shared Direct. Do not mix the two intentions on one card without understanding the dual-purpose vGPU mode trade-off.

Setting | Where | Value for vGPU
SR-IOV | Server firmware | Enabled
VT-d / IOMMU | Server firmware | Enabled
Above-4G / large BAR (MMIO) | Server firmware | Enabled, sized to framebuffer + headroom
Graphics device type | ESX host graphics | Shared Direct
Passthrough (alternative) | ESX PCI devices | DirectPath I/O, for Private AI Services only

Step 5: enable the Supervisor. Enable vSphere Supervisor on the GPU cluster. This is where a quiet constraint bites: Private AI Services supports only single-zone Supervisors. If you reflexively built a three-zone Supervisor for availability, Private AI Services will not install on it, and you will be rebuilding the Supervisor later. Decide up front whether this domain is for the full Private AI Services experience or for deep learning VMs and VKS clusters that tolerate multi-zone, and build the Supervisor accordingly.
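The Supervisor decision is small enough to put in a design checklist. The sketch below encodes only the rule stated above: if Private AI Services is on the roadmap, the Supervisor has to be single zone. The function name `supervisor_zone_check` and its yes/no argument are hypothetical.

```shell
#!/bin/sh
# Sketch: check a planned Supervisor design against the single-zone constraint.

# supervisor_zone_check NEEDS_PRIVATE_AI_SERVICES(yes|no) ZONE_COUNT
supervisor_zone_check() {
    needs_pais=$1
    zones=$2
    if [ "$needs_pais" = "yes" ] && [ "$zones" -ne 1 ]; then
        echo "rebuild: Private AI Services supports only single-zone Supervisors"
    else
        echo "ok"
    fi
}

supervisor_zone_check yes 3   # prints the rebuild warning
supervisor_zone_check no 3    # prints ok
```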

Step 6: license and VM classes. Assign the Private AI Foundation add-on license to this workload domain. Then create the vGPU-based VM classes that map to the vGPU profiles you exposed in step 4. A VM class that asks for a C-type compute profile your hosts do not present will never schedule, so the profiles on the card and the profiles in the VM class have to agree. This is also where you choose your sharing model deliberately, which leads to the one constraint that trips up density-focused teams.
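The agreement between VM classes and host profiles can be checked on paper before anyone tries to deploy. This sketch assumes you have a list of the profiles your hosts present, one per line; the function `class_can_schedule` and the profile names are hypothetical placeholders, not real vGPU profile names.

```shell
#!/bin/sh
# Sketch: does the profile a VM class requests exist among the profiles the hosts present?
# Assumption: stdin carries the presented profiles, one per line.

# class_can_schedule REQUESTED_PROFILE
class_can_schedule() {
    requested=$1
    # -F fixed string, -x whole-line match, -q quiet: only the exact profile counts.
    if grep -Fxq -- "$requested"; then
        echo "schedulable"
    else
        echo "never schedules: profile not presented"
    fi
}

# Hypothetical profiles presented by the cluster.
printf 'example-40c\nexample-80c\n' | class_can_schedule example-80c   # prints schedulable
printf 'example-40c\n' | class_can_schedule example-80c                # prints the failure
```

The whole-line match is deliberate: a class that asks for a profile that merely resembles one on the card still will not schedule.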

Step 7: validate before you hand it over. Do not declare the domain ready because SDDC Manager is green. Prove the GPU path end to end. The commands below are the quickest way to confirm the host driver is live and the card is healthy before anyone tries to provision a workload.

# On the ESX host: is the vGPU host driver loaded and the GPU healthy?
nvidia-smi

# Confirm the host driver VIB from the image is actually installed on this host
esxcli software vib list | grep -i nvd

# If you ever switch a host from MIG back to passthrough,
# turn MIG off per GPU first, then remove the host driver
nvidia-smi -i 0 -mig 0
esxcli software vib remove -n NVD-AIE_ESXi_9.1.0_Driver
reboot

Figure: Validation, follow the failure to its real cause.

  1. nvidia-smi shows the GPU? No → driver not in image, or SR-IOV / BAR off in firmware
  2. Shared Direct set on every device? No → host left in default graphics mode, reboot pending
  3. vGPU profile in VM class schedules? No → profile on card and VM class disagree

All three pass → domain is ready to hand over.
Each failed check points straight at the layer below it, which is why bottom-up build order pays off here.
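The three checks can be written as a small triage function, which is handy when you are handing the runbook to someone else. This is a sketch of the decision flow only: the function name `diagnose` is hypothetical, and the three yes/no answers are ones you supply after checking by hand.

```shell
#!/bin/sh
# Sketch: map the three validation answers to the most likely cause.

# diagnose SMI_SEES_GPU SHARED_DIRECT_SET VM_CLASS_SCHEDULES   (each yes|no)
diagnose() {
    if [ "$1" != "yes" ]; then
        echo "driver not in image, or SR-IOV / large BAR off in firmware"
    elif [ "$2" != "yes" ]; then
        echo "host left in default graphics mode, or reboot pending"
    elif [ "$3" != "yes" ]; then
        echo "profile on card and VM class disagree"
    else
        echo "domain is ready to hand over"
    fi
}

diagnose yes no yes   # prints the graphics mode cause
```

The checks run in stack order, so the first "no" names the lowest broken layer and later answers are ignored until it is fixed.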

The mistakes that actually bite

Three things account for most of the wasted days I see on this step. First, firmware that was never touched: the host driver loads, nvidia-smi even works, but no vGPU profiles appear because SR-IOV or the large-BAR mapping is off. Always confirm firmware before you blame the driver. Second, version skew between the ESX host driver and the guest stack. GPU Operator 25.10.x ships a v580.x guest driver, and pairing that with an older host driver produces guest pods that crash-loop with opaque CUDA errors. Treat the host driver, the GPU Operator, and the VKS Kubernetes release as one interoperability set and pin them together.
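A coarse first-pass check for the second mistake is to compare the major branch of the host driver with the major branch of the guest driver. The sketch below is an assumption-heavy illustration: the rule "host branch must not be older than guest branch" is my simplification of the skew described above, the function names and version strings are hypothetical, and the vendor interoperability matrix remains the authority.

```shell
#!/bin/sh
# Sketch: flag a host driver whose major branch is older than the guest driver's.
# Assumption: versions look like MAJOR.MINOR[.PATCH]; only MAJOR is compared.

# driver_branch VERSION   (for example 580.1.2 gives 580)
driver_branch() {
    echo "${1%%.*}"
}

# check_driver_pair HOST_DRIVER_VERSION GUEST_DRIVER_VERSION
check_driver_pair() {
    host_branch=$(driver_branch "$1")
    guest_branch=$(driver_branch "$2")
    if [ "$host_branch" -lt "$guest_branch" ]; then
        echo "skew: host driver branch $host_branch is older than guest branch $guest_branch"
    else
        echo "ok"
    fi
}

# Hypothetical version strings.
check_driver_pair 570.1.2 580.3.4   # prints the skew warning
```

A passing result here does not prove the pair is supported; it only catches the obvious case before pods start crash-looping.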

Third, the MIG-versus-NIM trap. Teams chasing density enable Multi-Instance GPU on the cluster, then discover that NVIDIA NIM microservices, the thing they actually wanted to run, are not supported on MIG. If your roadmap includes NIM-based inference or Private AI Services, do not enable MIG on those hosts. Use time-slicing or full-profile vGPU instead and accept the lower density. This is a design decision you make here, on the workload domain, not something you can quietly flip later without disruption. For the deeper trade-off between vGPU, MIG, and passthrough, see GPU partitioning for VMware Private AI, and for how many hosts and what GPU to buy, see the reference architecture and sizing post.

Disclaimer: This procedure changes production infrastructure. Validate your exact host and GPU against the Broadcom Compatibility Guide, confirm host-driver to GPU-Operator to VKS interoperability before you commit, back up SDDC Manager and vCenter, run firmware changes in a maintenance window, and prove the GPU path on one host before remediating the whole cluster.

What I’d Do

If I am building this from scratch, I freeze three decisions before I rack a thing: the host driver goes in the vLCM image from day one, the Supervisor is single-zone if Private AI Services is on the roadmap, and MIG stays off unless I have proven I will never run NIMs on those hosts. Get those three right and the rest of the procedure is mechanical. Get them wrong and you are rebuilding the domain after the data scientists have already started, which is the worst possible time. Build the domain against your design from the planning and prerequisites work, and if you want the generic VI workload domain mechanics underneath this, the VI workload domain deployment walkthrough covers the base.

What is the one firmware or Supervisor setting that has burned you on a GPU domain? Tell me in the comments, because the list of things that hide below SDDC Manager green is longer than any doc admits.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
