Installing the NVIDIA GPU Operator and vGPU Drivers for VMware Private AI Foundation (Private AI Series, Part 9)

A practical runbook for installing the NVIDIA GPU Operator and matching vGPU host and guest drivers on VMware Private AI Foundation, with the validation checks and version-skew traps that decide whether GPUs actually schedule.

by Dr. Pranay Jha

June 15, 2026

Updated for VCF 9.1. Reviewed against VMware Private AI Foundation on VCF 9.1 and Private AI Services 2.1. Version numbers and product names are current as of 2026.

VMware Private AI Series · Part 9 of 24

TL;DR · Key Takeaways

The GPU Operator is the piece that makes a GPU schedulable inside a VKS cluster. The vGPU host VIB and the guest driver are separate things, and both have to agree.
PAIF 9.1 ships GPU Operator 25.10.1 with a guest driver on the 580.x branch. The host VIB must be on a compatible branch or the driver pod never goes Ready.
You need an NVIDIA AI Enterprise entitlement for the host VIB, the guest driver, and NGC image pulls. No NVAIE, no install.
The VCF Automation Quickstart can install the operator for you. Do the manual install when you need control, custom profiles, or you are air-gapped.
Most bring-up failures are licensing or host-to-guest branch skew, not the operator code itself.

Who this is for: VCF architects and admins standing up Private AI Foundation on GPU hosts. Prerequisites: a GPU-ready VI workload domain, the NVIDIA vGPU host VIB available, an NVAIE entitlement, an NGC API key, and a VKS cluster (or a Deep Learning VM) to target.

Your GPU hosts are licensed, the workload domain is built, and nvidia-smi works inside the VM. Then the GPU Operator lands in CrashLoopBackOff, the device plugin never advertises a single GPU, and your pods sit in Pending. This is the step where most Private AI deployments stall, and it is almost never the operator that is broken. It is the driver chain underneath it.

This part assumes you have already prepared a GPU workload domain and decided how you are partitioning the GPU, whether that is a full vGPU profile, MIG, or passthrough. Here we install the piece that turns those GPUs into something Kubernetes can actually schedule: the NVIDIA GPU Operator, plus the matching vGPU drivers on the host and the guest.

The bring-up sequence. A green operator means nothing if step 1 or step 2 is wrong.

What the GPU Operator actually does

The GPU Operator is a set of Kubernetes operators and DaemonSets that automate everything between a bare node and a schedulable GPU: the guest driver container, the NVIDIA container toolkit, the device plugin that advertises nvidia.com/gpu to the scheduler, and a validator that proves the stack works before workloads land. On a vGPU node it also handles license check-out through the guest driver. What it does not do is install the host-side VIB. That lives on ESXi and is your job.

The guest driver and the host VIB are different installs on different layers. Branch mismatch between them is the classic failure.
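
One quick way to see which of those components actually landed is to compare the DaemonSets in the operator namespace against the list you expect. The sketch below assumes the usual upstream DaemonSet names and the gpu-operator namespace. Both are assumptions, so confirm them on your own cluster first. missing_daemonsets is a hypothetical helper, not something the operator ships.

```shell
# Sketch: report which expected GPU Operator DaemonSets are absent.
# ASSUMPTION: these are the common upstream default names; verify them with
# `kubectl get ds -n gpu-operator` on your own cluster.
expected_daemonsets="nvidia-driver-daemonset
nvidia-container-toolkit-daemonset
nvidia-device-plugin-daemonset
nvidia-operator-validator"

# Reads DaemonSet names on stdin (one per line) and prints the missing ones.
missing_daemonsets() {
  local present name
  present="$(cat)"
  while IFS= read -r name; do
    # -x matches the whole line, -F treats the name as a literal string
    if ! printf '%s\n' "$present" | grep -qxF "$name"; then
      printf '%s\n' "$name"
    fi
  done <<EOF
$expected_daemonsets
EOF
}
```

Feed it the live list with kubectl get ds -n gpu-operator -o name | sed 's|.*/||' | missing_daemonsets. Empty output means every expected component has at least been created. It says nothing yet about whether they are Ready.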

Prerequisites you must validate first

Before you touch Kubernetes, prove the host layer. The host VIB has to be installed and loaded on every GPU host in the cluster, and the version has to be one the guest driver branch can talk to. Confirm it on each host:

# on each ESXi GPU host
esxcli software vib list | grep -i nvd
nvidia-smi
nvidia-smi -q | grep -i 'Driver Version'

# the host must report a vGPU host driver, not just a GPU present
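
If you are checking more than a couple of hosts, it helps to pull the version string out of that output so you can record it per host. This is a minimal sketch that assumes the nvidia-smi -q output contains a line of the form "Driver Version : <version>". The function names are hypothetical.

```shell
# Reads `nvidia-smi -q` text on stdin and prints the first driver version found.
driver_version() {
  awk -F': *' '/Driver Version/ { print $2; exit }'
}

# Same as above, but fails loudly when no version is reported,
# for example when the host VIB is installed but not loaded.
require_driver_version() {
  local v
  v="$(driver_version)"
  if [ -z "$v" ]; then
    echo "no driver version reported" >&2
    return 1
  fi
  printf '%s\n' "$v"
}
```

Run it as nvidia-smi -q | require_driver_version, either on the host or against output you captured over SSH, and keep the result next to the host name in your bill of materials.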

You also need three entitlements lined up: an NVIDIA AI Enterprise license that covers the host VIB and the guest driver, an NGC API key so the operator can pull the driver and validator images, and a vGPU license source (a self-hosted Delegated License Service appliance, or NVIDIA's Cloud License Service reached through the licensing portal) that issues the client configuration token the guest driver presents. If you have read the planning and prerequisites part, this should already be in place. One field note: HTTPS proxies with authentication or self-signed certificates are not supported for vGPU license registration. If you are behind that kind of proxy, stand up a local DLS appliance instead and save yourself a day.

Two install paths, and when to use each

There are two supported ways to get the operator onto a VKS cluster. The VCF Automation Private AI Quickstart generates a blueprint that downloads the NVAIE installer and runs it for you as part of provisioning. Or you run the same installer by hand against an existing cluster. They produce the same end state. They differ in how much control you keep.

Dimension	VCF Automation Quickstart	Manual install
Effort	Low, generated blueprint	Higher, you run the script
Control	Opinionated, limited overrides	Full, you pin versions and values
Driver version	Chosen for you	You set GPU_OPERATOR_VERSION and branch
Air-gapped	Supported via local staging	Supported, you stage NGC content
Best for	Standard RAG and NIM catalog items	Custom clusters, special profiles, debugging

My rule: use the Quickstart for the first deployment so you have a known-good reference, then do a manual install at least once so you actually understand the secrets and versions it set on your behalf. When something breaks at 11pm, the team that ran it by hand is the one that can fix it.

The manual install, step by step

1. Confirm the host VIB and licensing are healthy on every GPU host in the target cluster.
2. Create the licensing-config secret from your gridd.conf and client_configuration_token.tok so the guest driver can check out a license.
3. Set your NGC API key so the operator can pull the driver, toolkit, and validator images.
4. Download the NVAIE GPU Operator installer bundle from NGC.
5. Point KUBECONFIG at the VKS cluster and run the installer, pinning the NVAIE and operator versions.
6. Watch the operator namespace until the driver, toolkit, device plugin, and validator pods are all Ready.
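
The licensing secret is the step people most often get subtly wrong, usually by pointing at an empty or missing token file. The sketch below checks both files before calling kubectl. It assumes the gpu-operator namespace and that both files sit in one directory. create_licensing_secret and the KUBECTL override are hypothetical conveniences, not part of the NVAIE installer.

```shell
# Sketch: create the licensing-config secret the guest driver reads.
# ASSUMPTIONS: namespace gpu-operator; both files live in the same directory.
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to print the command instead

create_licensing_secret() {
  local dir="${1:-.}" ns="${2:-gpu-operator}" f
  # Refuse to continue if either input is missing or zero bytes.
  for f in gridd.conf client_configuration_token.tok; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      return 1
    fi
  done
  "$KUBECTL" create secret generic licensing-config \
    --namespace "$ns" \
    --from-file="$dir/gridd.conf" \
    --from-file="$dir/client_configuration_token.tok"
}
```

Calling create_licensing_secret /path/to/licensing with KUBECTL=echo first is a cheap way to review the exact command before it touches the cluster.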

# pin the versions explicitly, do not rely on latest
export NVAIE_VERSION='6'
export GPU_OPERATOR_VERSION='25.10.1'

# pull the NVAIE installer bundle from NGC
ngc --org=nvidia registry resource download-version nvidia/vgpu/gpu-operator-installer-6:6.3

# target the VKS cluster
export KUBECONFIG=/etc/vks-kubeconfig/k8s-cluster-kubeconfig-admin

# run the NVAIE operator installer
bash ./gpu-operator-installer-6_v6.3/gpu-operator-nvaie-6.3.sh install

The Quickstart blueprint wraps that same script call in a predictable sequence: download the installer, set the kubeconfig, install the operator, count the GPUs the node should expose, wait for the DaemonSet, validate the license, then install the RAG and NIM workloads. If you are debugging a stuck deployment, walk that same order. The failure is almost always at the check_gpu_count or validate_license step, not at the install itself.
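
Walking the sequence in order is easy to script so that the first failing stage is named instead of buried. This is a sketch only: check_gpu_count and validate_license borrow the step names mentioned above, but every function body here is a hypothetical stand-in that you would replace with your own checks.

```shell
# Runs each named check in order; stops at and names the first one that fails.
run_in_order() {
  local stage
  for stage in "$@"; do
    if ! "$stage"; then
      echo "first failing stage: $stage"
      return 1
    fi
  done
  echo "all stages passed"
}

# HYPOTHETICAL stand-ins: replace each body with a real check.
check_install()    { true; }   # e.g. the operator pods exist
check_gpu_count()  { true; }   # e.g. the node advertises the expected GPU count
check_daemonset()  { true; }   # e.g. the DaemonSets have rolled out
validate_license() { true; }   # e.g. the guest driver reports a license
```

A call such as run_in_order check_install check_gpu_count check_daemonset validate_license then tells you which layer to look at first, which is the point of debugging in sequence.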

Validate before you celebrate

A finished install is not a working GPU. Run the checks in order, because each one rules out a different layer. If the validator pod is green but the device plugin advertises zero GPUs, your vGPU profile or host VIB is wrong, not your cluster.

Validate left to right. Reinstalling a broken stack just hides which layer failed.

# all operator pods should be Running, validator Completed
kubectl get pods -n gpu-operator
kubectl get ds  -n gpu-operator

# the node should now advertise GPUs
kubectl get node -o json | grep -i 'nvidia.com/gpu'

# check the validator and the license
kubectl logs -n gpu-operator -l app=nvidia-operator-validator --all-containers

# prove it end to end from inside a pod
kubectl exec -it <gpu-pod> -- nvidia-smi
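
Grepping the node JSON tells you the resource exists somewhere, but not which nodes are missing it. A per-node view makes the "validator green, zero GPUs" case obvious. The sketch below assumes the GPU count is published under status.allocatable as nvidia.com/gpu. The helper names are hypothetical.

```shell
# Prints "<node><TAB><allocatable nvidia.com/gpu>" for every node.
gpu_per_node() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Reads that tab-separated listing on stdin and prints the nodes
# that advertise no GPU (empty field or a count of zero).
nodes_without_gpu() {
  awk -F'\t' '$2 == "" || $2 + 0 == 0 { print $1 }'
}
```

Run gpu_per_node | nodes_without_gpu. Control plane nodes will show up in the output too, so read it against the list of nodes you expect to carry GPUs.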

The version-skew trap, read this twice

This is the failure that burns the most hours, so it gets the space it deserves. The guest driver running in the operator pod and the host VIB on ESXi are independent installs, but they are not independent in behavior. If the host driver is on a newer major vGPU branch than the guest driver, the driver pod transitions to a failed state and never recovers. The operator looks broken. It is not. The two driver branches simply do not match.

Treat the host VIB and guest driver as one matched pair you upgrade together.
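
A matched pair is easy to assert mechanically before a change window. The sketch below applies the conservative rule this article recommends, same major branch on both sides, by comparing the part of each version before the first dot. That rule is an assumption of this sketch. NVIDIA's published compatibility matrix is the authority on which cross-branch pairs are actually supported. same_branch and report_pair are hypothetical helpers.

```shell
# Succeeds only when both versions are non-empty and share a major branch,
# where the branch is the part before the first dot.
same_branch() {
  local host="$1" guest="$2"
  [ -n "$host" ] && [ -n "$guest" ] && [ "${host%%.*}" = "${guest%%.*}" ]
}

# Prints a one-line verdict and returns non-zero on skew.
report_pair() {
  if same_branch "$1" "$2"; then
    echo "OK: host $1 and guest $2 are on branch ${1%%.*}"
  else
    echo "SKEW: host $1 (branch ${1%%.*}) vs guest $2 (branch ${2%%.*})"
    return 1
  fi
}
```

Calling report_pair with the host version and the guest driver version you intend to deploy gives you a gate you can drop into a pre-change checklist or a pipeline step.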

There is an escape hatch. You can set DISABLE_VGPU_VERSION_CHECK to true in the driver config so the operator stops enforcing the branch match. Use it only as a bridge during a staged upgrade, never as a steady state. Bypassing the check lets a known-incompatible pair run, and the symptoms it produces later (random Xid errors, workloads that hang on the GPU) are far worse to diagnose than a driver pod that honestly refuses to start.

# bridge only, during a staged host-then-guest upgrade
driver:
  env:
    - name: DISABLE_VGPU_VERSION_CHECK
      value: 'true'
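
Because the bypass is meant to be temporary, it is worth having something that notices when it has been left on. This sketch does a plain text search of a values file laid out like the snippet above. It is not a YAML parser, and the file layout is an assumption. bypass_enabled is a hypothetical helper.

```shell
# Succeeds (exit 0) if the values file still enables the version-check bypass.
# Text match only: expects the value on the line right after the name entry.
bypass_enabled() {
  local values_file="$1"
  grep -A1 'name: *DISABLE_VGPU_VERSION_CHECK' "$values_file" \
    | grep -Eqi "value: *['\"]?true['\"]?"
}
```

Running bypass_enabled against your values file in a scheduled check, and alerting when it succeeds outside a change window, keeps the bridge from quietly becoming the steady state.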

In practice, the cleanest operating model is to pin the host VIB and the guest driver as a single matched pair, document the pair, and upgrade both in the same change window. Most teams break this the first time they patch ESXi for an unrelated CVE and quietly move the host driver forward. The GPUs go offline the moment the nodes reschedule. If you hit issues during bring-up, the broader GPU and vGPU mistakes that break Private AI Foundation are worth a read alongside this.

Disclaimer: This installs kernel drivers and a privileged operator on production GPU nodes. Validate the host VIB against your VCF and vGPU bill of materials, confirm host-to-guest driver branch compatibility, back up cluster state, run the install on one node pool first, and keep the previous driver bundle on hand for rollback.

What I’d Do

For a first deployment, let the VCF Automation Quickstart install the operator and prove the path end to end, then rebuild it manually once so you understand every secret and version it set. Pin your GPU Operator and driver versions in writing, and treat the host VIB and guest driver as a single matched pair you upgrade together. Skip that discipline and the next routine host patch is what takes your GPUs offline. Which vGPU driver branch are you standardizing on across your fleet?

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
