Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


How to Deploy VMware Private AI Foundation with NVIDIA on VCF 9 (VCF 9 Series, Part 26)

A field-tested runbook for deploying VMware Private AI Foundation with NVIDIA on VCF 9: the two deployment paths, the three licenses you need, GPU host prep, the right sharing mode, and the guided workflow, plus the gotchas that stall bring-up.

VCF 9 Series · Part 26 of 36

TL;DR · Key Takeaways

  • Decide the path first. A Deep Learning VM-only deployment validates the GPU plumbing fast; Private AI Services needs a Supervisor and the NVIDIA GPU Operator on top.
  • You need three separate entitlements: a VCF subscription, the Private AI Foundation add-on, and an NVIDIA AI Enterprise (vGPU) license bought from NVIDIA, not from Broadcom.
  • SR-IOV in the host BIOS and the matching vGPU host driver VIB in the vSphere Lifecycle Manager image are the two prep steps that stall the most bring-ups.
  • MIG cannot back NVIDIA NIM microservices. Plan time-slicing or whole-GPU assignment for inference, and keep MIG for training or notebook isolation.
  • Assign the add-on license to the GPU workload domain, and also to the management domain if you want the guided deployment UI in the vSphere Client.

Who this is for: VCF administrators and architects standing up GPU-accelerated AI on an existing VCF 9 fleet.
Prerequisites: a healthy VCF 9.0 or 9.1 instance, at least three GPU-enabled ESX hosts, and access to NVIDIA Licensing and the NGC catalog.

Most stalled Private AI deployments I see do not fail in the clever places. They fail because SR-IOV was never enabled in the host BIOS, or because someone carved a GPU into MIG slices on a host that then quietly refused to serve a NIM. The platform itself is well behaved once the layer underneath it is correct. This runbook walks the deployment of VMware Private AI Foundation with NVIDIA on VCF 9 in the order that keeps you clear of those traps, with the specific checks I run at each stage.

Decide the deployment path before you touch a host

There are two shapes this deployment can take, and picking one up front saves a lot of rework. The first is the Deep Learning VM path: you provision a GPU-backed VM from the catalog, it boots with the NVIDIA driver and CUDA stack already in place, and you have somewhere to run a model in an afternoon. No Kubernetes, no Supervisor. The second is the Private AI Services path, which runs model serving, the Model Store, the API Gateway, and the Data Indexing and Retrieval Service on a vSphere Supervisor with the NVIDIA GPU Operator inside VKS clusters. That path gives you multi-tenancy, scalable endpoints, and RAG tooling, but it is a bigger build.

My rule: prove the GPU plumbing with a Deep Learning VM before you enable a Supervisor. If a DLVM cannot see the GPU, no amount of Kubernetes will fix that, and you will have added a layer of abstraction on top of a broken foundation. If you are going down the Services route, the Supervisor and VKS design is its own topic, covered in the vSphere Supervisor and VKS architecture reference design.

Figure: Private AI Foundation on VCF 9, deployment sequence. Work top to bottom and validate each stage before starting the next.

  1. Licensing: VCF subscription + PAIF add-on + NVIDIA AI Enterprise
  2. Prepare GPU hosts: SR-IOV in BIOS, vGPU host driver VIB in the vLCM image
  3. Choose the sharing mode: time-slicing or whole GPU for NIM; MIG for training only
  4. Run the guided workflow: vSphere Client + VCF Automation Service Catalog
  5. Validate and monitor: nvidia-smi, GPU dashboards, test the model endpoint

Two paths at step 4:
  • A. Deep Learning VM: no Supervisor. Fastest way to prove GPU plumbing and run a single model or notebook.
  • B. Private AI Services: Supervisor + GPU Operator on VKS. Model Runtime, API Gateway, RAG, multi-tenancy.

Both paths consume the same prepared GPU hosts from steps 1 to 3. Air-gapped? VCF Automation pulls NGC content to local repos.
The five stages, and where the Deep Learning VM and Private AI Services paths diverge.

Get the three licenses right, and in the right place

This is the step people get wrong on paper before they get anything wrong in the lab. Private AI Foundation needs three distinct entitlements. You need a VCF subscription for the underlying platform. You need the VMware Private AI Foundation with NVIDIA add-on license from Broadcom, which is what activates the catalog items in VCF Automation, the pgvector PostgreSQL provisioning through Data Services Manager, the VMware-delivered Deep Learning VM image, the guided deployment workflow, and Private AI Services. And you need an NVIDIA AI Enterprise license for the vGPU host driver and guest drivers, which you buy from NVIDIA, not from Broadcom. Teams routinely budget the two VMware pieces and forget the NVIDIA one, then discover at install time that the host driver will not license.

Placement has a subtlety worth calling out. You assign the Private AI Foundation add-on license to the GPU-enabled workload domain where the AI workloads run. But if you also want the guided deployment UI to appear in the vSphere Client, you must assign the add-on license to the management domain as well. That single detail is why some admins do everything correctly and still cannot find the deployment wizard. Note too that you can run AI workloads and read GPU metrics in vCenter and VCF Operations under the base VCF license; the add-on is specifically for the automated provisioning, the DLVM image, pgvector, and Private AI Services.
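The placement rules above are easy to turn into a pre-flight checklist. What follows is a minimal sketch, assuming you answer each question by hand: the function name and the yes/no arguments are hypothetical, and nothing here queries Broadcom or NVIDIA licensing systems.

```shell
#!/bin/sh
# Hypothetical pre-flight checklist for the three entitlements.
# Each argument is a manual yes/no answer:
#   $1 VCF subscription in place
#   $2 PAIF add-on assigned to the GPU workload domain
#   $3 PAIF add-on assigned to the management domain
#   $4 NVIDIA AI Enterprise license purchased from NVIDIA
check_entitlements() {
    missing=0
    [ "$1" = yes ] || { echo "MISSING: VCF subscription"; missing=1; }
    [ "$2" = yes ] || { echo "MISSING: PAIF add-on on the GPU workload domain"; missing=1; }
    # Not a hard failure: workloads still run, but the wizard stays hidden.
    [ "$3" = yes ] || echo "NOTE: PAIF add-on is not on the management domain; the guided deployment UI will not appear in the vSphere Client"
    [ "$4" = yes ] || { echo "MISSING: NVIDIA AI Enterprise license (bought from NVIDIA, not Broadcom)"; missing=1; }
    if [ "$missing" -eq 0 ]; then
        echo "OK: required entitlements present"
    fi
    return "$missing"
}
```

For example, `check_entitlements yes yes no yes` returns success but prints the management-domain note, which is exactly the situation where everything works and the wizard is still missing.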

Figure: Three licenses, three sources. Teams budget the two VMware pieces and forget the NVIDIA one.

  • VCF subscription: the platform. From Broadcom. Assign to vCenter.
  • Private AI Foundation add-on: catalog, DLVM, pgvector, Services. From Broadcom. Assign to the GPU domain and the management domain.
  • NVIDIA AI Enterprise: vGPU host and guest drivers. From NVIDIA, not from Broadcom.

Assign the PAIF add-on to the management domain too, or the deployment wizard never appears.
Get all three entitlements and their placement right before you touch a host.

Prepare the GPU hosts

Disclaimer: Driver VIBs, SR-IOV, and vLCM remediation are disruptive host-level changes. Validate the vGPU driver against the VCF interoperability matrix and your exact GPU model, confirm the Deep Learning VM image supports your vGPU branch, back up SDDC Manager and vCenter, and remediate one host at a time in maintenance mode before touching production capacity.

You need at least three GPU-enabled ESX hosts in the initial cluster of the workload domain. On each host, two things have to be true before vGPU works. First, SR-IOV has to be enabled both in the server BIOS and on the graphics device itself; the device-level setting is made through the VMware Host Client. Skip this and the driver loads but vGPU profiles never appear; this is the single most common bring-up blocker I run into. Second, the NVIDIA vGPU host driver VIB, downloaded from the NVIDIA Licensing portal at nvid.nvidia.com, has to be installed on every host. The clean way to do that on VCF is to add the VIB to the vSphere Lifecycle Manager image for the cluster in SDDC Manager and remediate, so every host stays consistent and new hosts inherit the driver automatically.

# On each GPU host: confirm the NVIDIA vGPU host driver VIB is present
esxcli software vib list | grep -i nvd

# Confirm the GPU is visible and the host driver is loaded
nvidia-smi

# Confirm the GPU shows up on the PCI bus as expected
esxcli hardware pci list | grep -i -A3 nvidia
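
With three or more hosts, running the VIB check by hand gets tedious. Here is a rough sketch of looping it over SSH. It assumes SSH is enabled on the hosts and key-based root login works. The function names are hypothetical, and the match pattern is my assumption about how the NVIDIA package or vendor appears in the listing, so adjust it to what your hosts actually print.

```shell
#!/bin/sh
# Reads `esxcli software vib list` output on stdin and succeeds if any line
# mentions an NVIDIA package. Matching "nvd" or "nvidia" case-insensitively
# is an assumption about naming, not a documented contract.
has_nvidia_vib() {
    grep -i -q -E 'nvd|nvidia'
}

# Hypothetical wrapper: report per host whether the VIB was found.
# Usage: check_hosts esx01.example.local esx02.example.local
check_hosts() {
    for host in "$@"; do
        # -n keeps ssh from swallowing the caller's stdin inside the loop.
        if ssh -n "root@$host" esxcli software vib list | has_nvidia_vib; then
            echo "$host: NVIDIA VIB present"
        else
            echo "$host: NVIDIA VIB NOT found"
        fi
    done
}
```

A host that reports "NOT found" after remediation usually means the VIB never made it into the cluster image, so check the image before you check the host.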

One version trap to flag before you download anything. The VMware-delivered Deep Learning VM image is supported only on hosts running the NVAIE 6.x vGPU branch or earlier. If you reflexively grab the newest vGPU release, you can end up with a host driver that the DLVM image will not pair with. Check the driver branch against the Deep Learning VM image release notes first, then pull the matching VIB. Pinning the vGPU branch deliberately is cheaper than rolling back a remediated cluster.
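
To make the branch check mechanical, a small guard like the one below can sit in front of the download step. It is a sketch under stated assumptions: you supply both the release string and the highest major branch the Deep Learning VM image release notes list as supported, and the function name is hypothetical. Nothing is looked up automatically.

```shell
#!/bin/sh
# Succeeds if the major version of a vGPU/NVAIE release string (e.g. "6.2")
# is at or below the highest major branch the DLVM image supports.
#   $1 release string of the driver you intend to download
#   $2 highest supported major branch, read from the DLVM release notes
branch_supported() {
    release=$1
    max_major=$2
    major=${release%%.*}          # keep everything before the first dot
    case $major in
        ''|*[!0-9]*)
            echo "unparseable release: $release" >&2
            return 2
            ;;
    esac
    [ "$major" -le "$max_major" ]
}
```

Used as `branch_supported 6.2 6 && echo "safe to pull this VIB"`, it refuses a newer major branch before the VIB ever reaches the cluster image.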


Choose the GPU sharing mode deliberately

VCF 9 supports two GPU sharing modes for vGPU, time-slicing and Multi-Instance GPU (MIG), plus passthrough for the GPU Operator path that backs Private AI Services. This choice is not cosmetic, and here is the part that bites people: you cannot use MIG with NVIDIA NIM microservices. If your plan was to slice an H100 into MIG instances and run a fleet of NIMs across them, that does not work. NIM expects a whole GPU or a time-sliced vGPU profile, not a MIG slice.

So match the mode to the workload. For inference with NIM, use time-slicing or assign a full GPU; for the Private AI Services route the GPU Operator typically consumes the device in passthrough. Reserve MIG for hard isolation of training jobs, multi-user notebook environments, or anywhere you want guaranteed memory and compute partitions and are not serving NIMs on those instances. Getting this wrong is one of the GPU and vGPU mistakes that break Private AI Foundation, so decide it before you remediate hosts, because flipping a host between MIG and time-slicing is not a live change.
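
The matching rule can be written down as a lookup, which is handy in a design review or a provisioning script. This is a minimal sketch: the workload labels and function names are hypothetical, and the only hard rule encoded is the one stated above, that NIM cannot run on a MIG slice.

```shell
#!/bin/sh
# Prints the sharing mode suggested for a workload type.
# Labels (nim, training, notebooks, services) are hypothetical.
suggest_mode() {
    case $1 in
        nim)                echo "time-slicing or whole-gpu" ;;
        training|notebooks) echo "mig" ;;
        services)           echo "passthrough" ;;
        *) echo "unknown workload: $1" >&2; return 2 ;;
    esac
}

# Succeeds (meaning "blocked") only for the one combination that cannot work.
#   $1 workload label, $2 chosen sharing mode
is_blocked() {
    [ "$1" = nim ] && [ "$2" = mig ]
}
```

`is_blocked nim mig && echo "pick time-slicing or a whole GPU"` catches the mistake at design time, before a host is remediated into the wrong mode.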

Figure: Pick the GPU sharing mode for the workload. You cannot run NVIDIA NIM on a MIG slice.

  • Time-slice / whole GPU: NIM inference
  • MIG: training, notebooks, isolation. NOT for NIM
  • Passthrough: GPU Operator (Private AI Services)

Flipping a host between MIG and time-slicing is not a live change; decide before you remediate.
Match the mode to the workload: NIM wants a whole or time-sliced GPU, never MIG.

Run the guided deployment workflow

With licensing and hosts in place, the guided deployment workflow in the vSphere Client becomes available. The fastest validation is the Deep Learning VM. From the VCF Automation Service Catalog you request a DLVM, select the vGPU profile, and the platform provisions a VM that boots with the NVIDIA driver, CUDA libraries, and the common frameworks already configured from the NGC catalog. When you build a customized image, the only NVIDIA component you must supply is the vGPU guest driver; the rest comes from the image. This is the step that tells you, definitively, whether the GPU stack is healthy.

For retrieval-augmented generation, provision the vector database as a PostgreSQL instance with the pgvector extension through VMware Data Services Manager rather than hand-rolling a database VM. It is the supported path, it carries enterprise support, and it keeps the embeddings and chat history on managed storage. On the Private AI Services side, enabling a Supervisor unlocks the Model Store for secure model delivery, Model Runtime for scalable endpoints, an OpenAI-compatible API Gateway with authentication and load balancing in front of your models, the Agent Builder Service, and the Data Indexing and Retrieval Service that chunks and vectorizes private documents into knowledge bases. If you operate an air-gapped site, all of this is supported: VCF Automation pulls the NGC containers and libraries into on-premises repositories that only an administrator refreshes, so the Deep Learning VMs never reach out to the internet.

Validate before you scale

Do not hand the environment to data scientists until you have confirmed three things end to end: the guest sees the GPU, the model answers, and the monitoring is wired up. Inside the Deep Learning VM, nvidia-smi should report the expected vGPU profile and memory. Then exercise an endpoint through the API Gateway to confirm the serving path, not just the driver.

# Inside the Deep Learning VM: confirm the guest driver and the assigned vGPU
nvidia-smi

# Confirm the serving path through the OpenAI-compatible API Gateway
curl -s http://<api-gateway-address>/v1/models | jq .
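
Eyeballing nvidia-smi is fine for one VM. For a repeatable check, the guest's reported memory can be compared against the profile you requested. The sketch below assumes you know the expected framebuffer size in MiB for your vGPU profile; the function name is hypothetical, and the input shape is what `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` prints.

```shell
#!/bin/sh
# Reads "name, memory_total_mib" lines on stdin and succeeds only if at
# least one GPU is listed and every GPU reports at least the expected MiB.
#   $1 expected minimum memory in MiB for the assigned vGPU profile
check_vgpu_memory() {
    awk -F', *' -v want="$1" '
        { seen++; if ($NF + 0 < want + 0) bad++ }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}
```

Inside the Deep Learning VM that becomes `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits | check_vgpu_memory <expected-mib>`. A failure with no GPU listed points at the guest driver; a failure with too little memory points at the wrong profile.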

On the monitoring side, VCF 9 surfaces GPU health and utilization in vCenter and VCF Operations at host, cluster, and VM level, including vGPU memory reservation and usage, compute utilization, and the GPU slowdown and shutdown temperature thresholds that protect the hardware. Set those dashboards up now, while the system is quiet, so when a vGPU profile is over-provisioned or a GPU is running hot you see it before a data scientist files a ticket about slow inference. Capacity for AI is unforgiving: an idle GPU is wasted money, and an oversubscribed one is a queue of unhappy users.
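
Until the dashboards are built, a quick spot check from the shell can flag the two failure modes named above. This is a stand-in, not a replacement for vCenter or VCF Operations. It is a sketch assuming input in the shape printed by `nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu --format=csv,noheader,nounits`; both thresholds are values you choose, and the function name is hypothetical.

```shell
#!/bin/sh
# Reads "index, util_percent, temp_c" lines on stdin and prints one line
# for every GPU that looks idle or hot.
#   $1 utilization below this percentage counts as idle
#   $2 temperature at or above this value (Celsius) counts as hot
flag_gpus() {
    awk -F', *' -v idle="$1" -v hot="$2" '
        $2 + 0 < idle + 0 { print "GPU " $1 ": idle (" $2 "% util)" }
        $3 + 0 >= hot + 0 { print "GPU " $1 ": hot (" $3 " C)" }
    '
}
```

Pick the temperature ceiling well under the slowdown threshold your GPU reports, so the warning arrives before the hardware starts protecting itself.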

What I’d Do

Stand up the Deep Learning VM path first, on a single GPU host with time-slicing, and get one model answering from that VM before you enable a Supervisor or wire up Private AI Services. It is tempting to switch everything on at once, but every layer you add before the GPU plumbing is proven just hides where a problem actually lives. Validate the foundation, then grow into multi-tenancy, RAG, and scaled endpoints on a base you trust. Which path are you taking first, the quick DLVM proof or straight into Private AI Services?


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
