Dr. Pranay Jha


VMware Private AI Foundation Planning and Prerequisites: GPU Hosts, Drivers and Readiness (Private AI Series, Part 4)

A practitioner’s planning guide for VMware Private AI Foundation with NVIDIA on VCF 9: GPU host selection, the vGPU driver and GPU Operator interoperability matrix, sharing-mode choices, and the readiness checks that decide whether your first deployment lands clean.

VMware Private AI Series · Part 4 of 24

TL;DR · Key Takeaways

  • Plan the GPU-enabled workload domain first. You need at least 3 GPU-enabled ESX hosts in the initial cluster, VCF Single Sign-On on the instance that holds those hosts, and the Private AI Foundation add-on license assigned to that domain.
  • What breaks first-time deployments is not the hardware. It is the version triangle, which in practice has four parts: the VCF build, the NVIDIA vGPU host driver VIB, the NVAIE guest driver, and the GPU Operator (25.10.x) all have to agree. Validate that matrix before you order anything.
  • MIG cannot back NVIDIA NIM microservices. If your plan is model serving with NIM, do not architect around MIG for density.
  • Most teams over-buy CPU and RAM and under-plan the east-west network. Inference leaves general-purpose cores idle while GPU-to-GPU traffic is what actually constrains you.
  • VCF 9.1 adds exclusive GPU passthrough without an NVAIE license and keeps vMotion with low-latency GPU pathways. That changes the licensing math for some designs.
Who this is for: VCF architects and platform admins scoping a Private AI Foundation deployment.
Prerequisites: a working VCF 9 fleet, and the licensing picture from Part 3 of this series.

Here is the failure pattern I see most often. A team buys two beautiful 8-way H100 servers, racks them, and then spends three weeks fighting driver mismatches before a single model serves a token. The GPUs were never the hard part. Planning Private AI Foundation well is mostly about getting four moving versions to agree and sizing for the resource that actually runs out, which is almost never the one people budget for.

This is Part 4. Part 1 covered what the platform is, Part 2 walked the architecture layer by layer, and Part 3 untangled the licensing. This part is the design and readiness pass you run before you commit a bill of materials.

Start With the Workload Domain, Not the GPU

Private AI Foundation is not a product you install on top of a host. It is a set of capabilities you light up inside a GPU-enabled VCF workload domain. The management domain stays general purpose. The GPU hosts live in a separate VI workload domain so you can lifecycle them, license them, and fail them independently of the control plane.

Three hard requirements gate the whole design. You need a minimum of three GPU-enabled ESX hosts in the initial cluster, both for vSAN quorum and so a host can go down for driver remediation without stranding workloads. VCF Single Sign-On has to be configured on the VCF instance that contains those GPU hosts. And the Private AI Foundation add-on license has to land on the GPU workload domain, with a copy on the management domain only if you want the guided deployment UI in the vSphere Client.
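
The three gates above are simple enough to encode as a pre-order checklist. What follows is a minimal sketch, not a VCF API: the `GpuDomainPlan` record and its field names are hypothetical, and you would fill them in by hand from your own design document.

```python
from dataclasses import dataclass


@dataclass
class GpuDomainPlan:
    """Hypothetical planning record; field names are illustrative, not a VCF API."""
    gpu_host_count: int
    sso_configured: bool
    paif_license_on_gpu_domain: bool
    paif_license_on_mgmt_domain: bool = False
    want_guided_ui: bool = False


def unmet_requirements(plan: GpuDomainPlan) -> list[str]:
    """Return the gating requirements from the text that the plan does not meet."""
    problems = []
    if plan.gpu_host_count < 3:
        problems.append("need at least 3 GPU-enabled ESX hosts in the initial cluster")
    if not plan.sso_configured:
        problems.append("VCF Single Sign-On is not configured on the GPU hosts' VCF instance")
    if not plan.paif_license_on_gpu_domain:
        problems.append("Private AI Foundation add-on license is not on the GPU workload domain")
    # The management-domain copy only matters if the guided deployment UI is wanted.
    if plan.want_guided_ui and not plan.paif_license_on_mgmt_domain:
        problems.append("guided deployment UI needs the add-on license on the management domain")
    return problems


if __name__ == "__main__":
    plan = GpuDomainPlan(gpu_host_count=2, sso_configured=True,
                         paif_license_on_gpu_domain=True)
    for problem in unmet_requirements(plan):
        print("BLOCKER:", problem)
```

An empty list means the plan clears the three gates; it says nothing yet about driver interoperability, which is a separate check.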

[Figure: Where Private AI Foundation Lives. Management domain stays general purpose; GPUs sit in their own workload domain.]
  • Management Domain: vCenter, NSX, SDDC Manager, VCF Operations, VCF Automation; PAIF add-on (for guided UI only); no GPUs required here.
  • GPU Workload Domain: 3+ GPU-enabled ESX hosts; vGPU host driver VIB in vLCM image; SR-IOV enabled; VCF SSO configured; PAIF add-on + NVIDIA vGPU license.
  • Running in the GPU domain: Deep Learning VMs; vGPU-backed VKS clusters; GPU Operator; Private AI Services 2.x; Model Store, runtime, pgvector via Data Services Manager.
The platform is a set of capabilities inside a dedicated GPU workload domain, not an overlay on the management domain.

The Hardware Decision: Which GPU, How Many Hosts

Match the GPU to the workload, not to the spec sheet. The Broadcom Compatibility Guide is the source of truth for what is actually supported on ESXi 9.0, so validate any card there before you buy. In the field the choice usually collapses to three buckets, and the cost difference between them is large enough that guessing is expensive.

| Workload profile | Sensible GPU | Why | Watch out for |
| --- | --- | --- | --- |
| RAG / small-to-mid LLM inference | L40S | 48 GB, strong inference per dollar, 350 W dual-slot PCIe card | No NVLink, so it scales out, not up |
| Large model serving, high concurrency | H100 / H200 SXM | 80 to 141 GB, NVLink and NVSwitch for tensor parallelism | HGX systems, power and cooling, longer lead times |
| Fine-tuning / training | H100 / H200, 4x or 8x per host | GPUDirect RDMA across hosts for multi-node jobs | East-west fabric becomes the bottleneck, not the GPU |
| Mixed / shared dev | L40S with time slicing | Many light tenants per card | Noisy-neighbor effects under load |

My take: for a first deployment that is mostly RAG and inference, three L40S hosts get you to production faster and cheaper than a single HGX box, and they give you real host-level redundancy that one fat server cannot. Reach for H100 or H200 SXM when a single model genuinely will not fit in 48 GB, or when tensor parallelism across NVLink is the only way to hit your latency target. Buying H200 for a 7B-parameter RAG assistant is the most common over-spend I flag in design workshops.
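
Whether a model "fits in 48 GB" is back-of-the-envelope arithmetic you can run before the workshop. The sketch below assumes weight memory is parameter count times bytes per parameter, plus a flat 20% allowance for KV cache and runtime. That allowance is my own planning assumption, not a measured figure, and real KV-cache use depends on context length and concurrency.

```python
def estimated_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                      overhead_factor: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus a flat allowance for KV cache and runtime.

    bytes_per_param=2.0 corresponds to FP16/BF16 weights. overhead_factor is an
    assumed planning margin, not a measured value.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB = params_billion * bytes
    weights_gb = params_billion * bytes_per_param
    return weights_gb * overhead_factor


def fits_on_single_gpu(params_billion: float, gpu_memory_gb: float,
                       bytes_per_param: float = 2.0,
                       overhead_factor: float = 1.2) -> bool:
    """True if the rough estimate fits in one GPU's memory."""
    need = estimated_vram_gb(params_billion, bytes_per_param, overhead_factor)
    return need <= gpu_memory_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        need = estimated_vram_gb(size)
        verdict = fits_on_single_gpu(size, 48)
        print(f"{size}B at FP16: ~{need:.0f} GB needed, fits one 48 GB L40S: {verdict}")
```

Under these assumptions a 7B model at FP16 lands well inside a 48 GB card, which is why an H200 for that workload is hard to justify. Treat the output as a first filter and confirm with a real load test.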


The Version Triangle That Actually Bites

This is where deployments stall. Four versions have to be interoperable at the same time, and none of them moves on the same release cadence. The VCF build dictates the vSphere and ESXi version. The NVIDIA vGPU host driver VIB has to match that ESXi version. The NVAIE guest driver in the deep learning VM or container has to match the host driver branch. And the GPU Operator that manages drivers inside VKS clusters has to match both the guest driver and the Kubernetes version of the cluster.

As a concrete reference point, a current-generation bring-up pairs GPU Operator 25.10.x with NVAIE 7 vGPU drivers in the 580.x branch (for example the 580.105.08 build) and an H100XM-80C class vGPU profile on H100 SXM5 hardware. Those numbers will move, which is exactly the point: pin them against the release notes for your specific VCF build before you commit, not after.

[Figure: The Four Versions Must Agree. Pin each one against the release notes for your VCF build before ordering. VCF build (sets the ESXi version) → vGPU host driver VIB (matches ESXi, in vLCM image) → NVAIE guest driver (580.x branch, same branch as the host driver) → GPU Operator (25.10.x, matches K8s + guest driver).]
Break any one edge and you get a driver that loads but will not allocate a vGPU, or a NIM that never schedules.
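
One way to keep the four versions honest is to write the pins down as data and compare them with what a host or cluster actually reports. This is a sketch under stated assumptions: the `VersionPins` record is hypothetical, the ESX version string and the GPU Operator patch level are placeholders, and "same branch" is approximated as "same leading version component". The authoritative answer is always the interoperability matrix in the release notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionPins:
    """Versions pinned from release notes. Field names are illustrative."""
    esx_version: str    # dictated by the VCF build
    host_driver: str    # vGPU host driver VIB version
    guest_driver: str   # NVAIE guest driver version
    gpu_operator: str   # GPU Operator release


def branch(version: str) -> str:
    """Driver branch as the leading component, e.g. '580' from '580.105.08'."""
    return version.split(".")[0]


def minor_series(version: str) -> str:
    """First two components, e.g. '25.10' from '25.10.0'."""
    return ".".join(version.split(".")[:2])


def mismatches(pinned: VersionPins, observed: VersionPins) -> list[str]:
    """List every edge where the observed stack departs from the pinned one."""
    issues = []
    if observed.esx_version != pinned.esx_version:
        issues.append("ESX version differs from the pinned VCF build")
    if branch(observed.host_driver) != branch(pinned.host_driver):
        issues.append("host driver VIB is on a different branch than pinned")
    if branch(observed.guest_driver) != branch(observed.host_driver):
        issues.append("guest driver branch does not match the host driver branch")
    if minor_series(observed.gpu_operator) != minor_series(pinned.gpu_operator):
        issues.append("GPU Operator is outside the pinned release series")
    return issues


if __name__ == "__main__":
    # '9.0.x' and the '.0' patch level on the operator are placeholders, not real pins.
    pinned = VersionPins("9.0.x", "580.105.08", "580.105.08", "25.10.0")
    observed = VersionPins("9.0.x", "580.105.08", "580.105.08", "25.10.0")
    print(mismatches(pinned, observed) or "all four versions agree with the pins")
```

Keeping the pins in version control next to the bill of materials gives you a record of what was validated when the release notes move.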

The mechanical part of readiness is getting the host driver into the cluster image. You download the vGPU host driver VIB from the NVIDIA licensing portal, add it to a vSphere Lifecycle Manager image in SDDC Manager, and remediate the hosts. Confirm it took before you go further.

# Confirm the vGPU host driver VIB is installed on the ESX host
# (VIB names vary by release, so match both the NVD and NVIDIA prefixes)
esxcli software vib list | grep -iE 'nvd|nvidia'

# Confirm the host driver has loaded and the physical GPUs are visible
nvidia-smi
esxcli graphics device list

# List the vGPU types (profiles) the host supports
nvidia-smi vgpu -s

# With the VKS cluster kubeconfig loaded, confirm the GPU Operator pods are healthy
kubectl get pods -n gpu-operator
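
If you want the last check to be scriptable instead of eyeballed, `kubectl get pods -n gpu-operator -o json` returns a pod list you can parse. The sketch below makes one assumption about what "healthy" means: a pod is fine if it has completed (`Succeeded`) or is `Running` with every container ready. The sample pod names are made up for illustration.

```python
import json


def unhealthy_pods(pod_list_json: str) -> list[str]:
    """Return names of pods that are neither Succeeded nor Running with all containers ready."""
    bad = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed one-shot pods are fine
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            bad.append(name)
    return bad


# Sample input shaped like `kubectl get pods -o json`; the pod names are invented.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "example-operator-pod"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "example-validator-pod"},
         "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "example-driver-pod"},
         "status": {"phase": "Pending"}},
    ]
})

if __name__ == "__main__":
    failing = unhealthy_pods(SAMPLE)
    print("all pods healthy" if not failing else f"unhealthy: {failing}")
```

In practice you would feed the function the real command output, for example by capturing it with `subprocess.run` and passing `stdout` in.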

Sharing Mode: Time Slicing, MIG or Passthrough

This decision shapes the rest of the design, so make it deliberately. There is one trap that catches people who optimize for density without reading the fine print: you cannot back NVIDIA NIM microservices with MIG. If your model-serving plan is NIM, MIG is off the table for those GPUs, full stop. That single limitation rules out the partitioning strategy a lot of teams reach for first.

[Figure: Choosing a GPU Sharing Mode. Will one workload use the whole GPU? If yes (dedicated): passthrough (DirectPath I/O), full card, maximum performance. If no (shared): are you serving with NIM? If yes, MIG is not allowed, so use vGPU time slicing (shared, NIM-compatible). If no, MIG gives hard isolation. Default for a first deployment: passthrough or a full vGPU profile. Add MIG only when you need hard isolation and are not using NIM.]
Decide sharing mode against the workload and the NIM constraint, not against a density target on a slide.

Passthrough gives a VM the entire physical GPU with no virtualization overhead, and in VCF 9.1 you can do exclusive passthrough without an NVAIE license while keeping vMotion through the new low-latency GPU data pathways built on BlueField-3 and ConnectX-7. Time slicing shares one card across several vGPU-backed VMs by interleaving in time, which is the right default for RAG, light inference, and dev. MIG carves a card into hardware-isolated slices with guaranteed memory and compute, which is excellent for strict multi-tenant isolation and useless to you if you are serving with NIM.
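
The choice between the three modes reduces to a few yes/no questions, so it can be written down as a small function. This is a sketch of the reasoning in this section, with hypothetical names. It encodes the NIM constraint as stated here and falls back to time slicing when hard isolation is not required.

```python
from enum import Enum


class SharingMode(Enum):
    PASSTHROUGH = "passthrough"
    TIME_SLICING = "vGPU time slicing"
    MIG = "MIG"


def choose_sharing_mode(whole_gpu_per_workload: bool,
                        serving_with_nim: bool,
                        needs_hard_isolation: bool = False) -> SharingMode:
    """Pick a sharing mode from the three questions discussed in the text."""
    if whole_gpu_per_workload:
        return SharingMode.PASSTHROUGH
    if serving_with_nim:
        # MIG is ruled out for NIM-backed serving, whatever the density target.
        return SharingMode.TIME_SLICING
    if needs_hard_isolation:
        return SharingMode.MIG
    return SharingMode.TIME_SLICING


if __name__ == "__main__":
    print(choose_sharing_mode(False, True, needs_hard_isolation=True).value)
```

Note the ordering: the NIM question is asked before the isolation question, so a tenant that wants both NIM and hard isolation gets time slicing and a conversation about dedicating cards instead.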

Host BOM and the Over-Provisioning Trap

Here is the counterintuitive part of sizing a Private AI host. The GPU is the constraint; the CPU and RAM around it usually are not. Field measurements on inference hosts routinely show only a small fraction of the cores busy and a small fraction of system memory in use, while the GPU sits pinned. I have seen an inference node use roughly two dozen of more than two hundred logical cores and a quarter of two terabytes of RAM. If you size CPU and RAM like a general-purpose cluster, you pay for silicon that idles.
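
To put a number on that idle silicon, the arithmetic is just one minus used over total. The figures in the example are illustrative stand-ins for the anecdote above, not measurements: I assume 24 busy cores out of 224, and 0.5 TB used out of 2 TB.

```python
def idle_share(used: float, total: float) -> float:
    """Fraction of a resource left idle, given how much of it is in use."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - used / total


if __name__ == "__main__":
    # Assumed example figures: 24 of 224 logical cores, 0.5 of 2.0 TB of RAM.
    print(f"idle cores: {idle_share(24, 224):.0%}")
    print(f"idle RAM:   {idle_share(0.5, 2.0):.0%}")
```

Run it against your own monitoring data before trimming the spec, since training and data-preparation workloads can look very different from inference.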

Spend that budget on the network instead. For multi-GPU and multi-node work the east-west fabric is what actually limits throughput. GPUDirect RDMA, NVLink within a host, and NVSwitch across an HGX baseboard are what keep tensor-parallel jobs fed. A pair of 8-way H100 hosts with a slow fabric between them will underperform a well-connected design every time.

[Figure: Where the Budget Should Go. Anatomy of a single GPU host, sized for inference and RAG.]
  • GPU + interconnect (priority): the real constraint. Spend here first.
  • East-west network fabric (priority): GPUDirect RDMA, NVLink, NVSwitch.
  • CPU cores (do not over-buy): often mostly idle on inference.
  • System RAM (do not over-buy): right-size to the model, not the chassis max.
  • Reserve the idle headroom: spare cores and RAM are not waste if you consolidate other VMs onto the host.
  • Use the Private AI Sizing Guide to set the management and workload domain figures. Validate power and cooling for dense GPU hosts before the rack arrives.
Size the GPU and the fabric first; treat CPU and RAM as headroom you can reclaim, not as the main spend.
Disclaimer: Before you order hardware or remediate hosts, validate the target bill of materials against the Broadcom Compatibility Guide, confirm the vGPU host driver, NVAIE guest driver, and GPU Operator versions are mutually supported for your exact VCF build, enable SR-IOV in BIOS per your vendor guidance, back up SDDC Manager and vLCM image state, run prechecks, and prove the stack on one cluster before you scale it out.

What I’d Do

For a first Private AI Foundation deployment I would stand up a three-host L40S workload domain, run passthrough or full vGPU profiles rather than MIG, and pin the entire version triangle against the release notes for my exact VCF build before a single purchase order goes out. I would size CPU and RAM modestly, spend the saved budget on the east-west fabric, and only move to H100 or H200 SXM once a real model proves it needs NVLink. The teams that struggle are almost always the ones that bought the biggest GPU server first and worked out the software interoperability afterward. Reverse that order and the deployment goes quiet.

Next in the series we get hands-on: deploying Private AI Foundation on VCF 9 end to end. What is the first thing you check on a new GPU host, the driver or the fabric? Tell me how you sequence it.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
