Dr. Pranay Jha

Installing the NVIDIA GPU Operator and vGPU Drivers for VMware Private AI Foundation (Private AI Series, Part 9)

A practical runbook for installing the NVIDIA GPU Operator and matching vGPU host and guest drivers on VMware Private AI Foundation, with the validation checks and version-skew traps that decide whether GPUs actually schedule.

Updated for VCF 9.1. Reviewed against VMware Private AI Foundation on VCF 9.1 and Private AI Services 2.1. Version numbers and product names are current as of 2026.
VMware Private AI Series · Part 9 of 24

TL;DR · Key Takeaways

  • The GPU Operator is the piece that makes a GPU schedulable inside a VKS cluster. The vGPU host VIB and the guest driver are separate things, and both have to agree.
  • PAIF 9.1 ships GPU Operator 25.10.1 with a guest driver on the 580.x branch. The host VIB must be on a compatible branch or the driver pod never goes Ready.
  • You need an NVIDIA AI Enterprise entitlement for the host VIB, the guest driver, and NGC image pulls. No NVAIE, no install.
  • The VCF Automation Quickstart can install the operator for you. Do the manual install when you need control, custom profiles, or you are air-gapped.
  • Most bring-up failures are licensing or host-to-guest branch skew, not the operator code itself.
Who this is for: VCF architects and admins standing up Private AI Foundation on GPU hosts.
Prerequisites: a GPU-ready VI workload domain, the NVIDIA vGPU host VIB available, an NVAIE entitlement, an NGC API key, and a VKS cluster (or a Deep Learning VM) to target.

Your GPU hosts are licensed, the workload domain is built, and nvidia-smi works inside the VM. Then the GPU Operator lands in CrashLoopBackOff, the device plugin never advertises a single GPU, and your pods sit in Pending. This is the step where most Private AI deployments stall, and it is almost never the operator that is broken. It is the driver chain underneath it.

This part assumes you have already prepared a GPU workload domain and decided how you are partitioning the GPU, whether that is a full vGPU profile, MIG, or passthrough. Here we install the piece that turns those GPUs into something Kubernetes can actually schedule: the NVIDIA GPU Operator, plus the matching vGPU drivers on the host and the guest.

[Figure: The GPU enablement path. Every box has to pass before the next one means anything. (1) Host vGPU VIB: esxcli, host driver. (2) NVAIE + NGC: entitlement, secret. (3) Install operator: GPU Operator 25.10.1. (4) Validate: DaemonSet, nvidia-smi. (5) Schedule: NIM, RAG, DLVM.]
The bring-up sequence. A green operator means nothing if step 1 or step 2 is wrong.

What the GPU Operator actually does

The GPU Operator is a set of Kubernetes operators and DaemonSets that automate everything between a bare node and a schedulable GPU: the guest driver container, the NVIDIA container toolkit, the device plugin that advertises nvidia.com/gpu to the scheduler, and a validator that proves the stack works before workloads land. On a vGPU node it also handles license check-out through the guest driver. What it does not do is install the host-side VIB. That lives on ESXi and is your job.

[Figure: Where each driver lives. Two drivers, two layers. The operator owns the top one only. Layers from top to bottom: AI workload pod (NIM microservice, RAG service, training job); GPU Operator, managed (guest driver container, container toolkit, device plugin, vGPU validator); VKS node VM (guest vGPU driver 580.x, assigned vGPU profile); ESXi host (NVIDIA vGPU host VIB from NVAIE, SR-IOV and ECC settings); physical GPU (L40S / H100 / H200). Layers are marked as either operator scope or your scope.]
The guest driver and the host VIB are different installs on different layers. Branch mismatch between them is the classic failure.

Prerequisites you must validate first

Before you touch Kubernetes, prove the host layer. The host VIB has to be installed and loaded on every GPU host in the cluster, and the version has to be one the guest driver branch can talk to. Confirm it on each host:

# on each ESXi GPU host
esxcli software vib list | grep -i nvd
nvidia-smi
nvidia-smi -q | grep -i 'Driver Version'

# the host must report a vGPU host driver, not just a GPU present
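Checking hosts one at a time makes it easy to miss the single host that is a release behind. The sketch below assumes you have already collected the `nvidia-smi -q` output from each host by whatever means you normally use. It pulls the driver version out of that output and flags any host that differs from the first one. The function names and the `host version` input format are illustrative and not part of any NVIDIA or VMware tooling. The version strings in the usage note are sample values.

```shell
# Extract the driver version from `nvidia-smi -q` output read on stdin.
# Prints the value of the first "Driver Version" line, or nothing if absent.
driver_version() {
  awk -F': *' 'tolower($1) ~ /driver version/ { print $2; exit }'
}

# Read "host version" pairs on stdin, one per line.
# Prints every host that reports no version or differs from the first host.
# Returns non-zero if any host was flagged.
check_hosts_consistent() {
  awk '
    NF == 0    { next }
    NF < 2     { print $1 ": no driver version reported"; bad = 1; next }
    ref == ""  { ref = $2 "" }
    $2 != ref  { print $1 ": " $2 " (expected " ref ")"; bad = 1 }
    END        { exit (bad ? 1 : 0) }
  '
}
```

For example, feeding it `esx01 580.95.05` and `esx02 570.133.10` on two lines flags `esx02` and returns a non-zero status, which is the signal to stop before touching Kubernetes.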

You also need three entitlements lined up: an NVIDIA AI Enterprise license that covers the host VIB and the guest driver, an NGC API key so the operator can pull the driver and validator images, and a vGPU license source (a Delegated License Service appliance or the NVIDIA licensing portal token). If you have read the planning and prerequisites part, this should already be in place. One field note: HTTPS proxies with authentication or self-signed certificates are not supported for vGPU license registration. If you are behind that kind of proxy, stand up a local DLS appliance instead and save yourself a day.

Two install paths, and when to use each

There are two supported ways to get the operator onto a VKS cluster. The VCF Automation Private AI Quickstart generates a blueprint that downloads the NVAIE installer and runs it for you as part of provisioning. Or you run the same installer by hand against an existing cluster. They produce the same end state. They differ in how much control you keep.

Dimension       | VCF Automation Quickstart          | Manual install
Effort          | Low, generated blueprint           | Higher, you run the script
Control         | Opinionated, limited overrides     | Full, you pin versions and values
Driver version  | Chosen for you                     | You set GPU_OPERATOR_VERSION and branch
Air-gapped      | Supported via local staging        | Supported, you stage NGC content
Best for        | Standard RAG and NIM catalog items | Custom clusters, special profiles, debugging

My rule: use the Quickstart for the first deployment so you have a known-good reference, then do a manual install at least once so you actually understand the secrets and versions it set on your behalf. When something breaks at 11pm, the team that ran it by hand is the one that can fix it.

The manual install, step by step

  1. Confirm the host VIB and licensing are healthy on every GPU host in the target cluster.
  2. Create the licensing-config secret from your gridd.conf and client_configuration_token.tok so the guest driver can check out a license.
  3. Set your NGC API key so the operator can pull the driver, toolkit, and validator images.
  4. Download the NVAIE GPU Operator installer bundle from NGC.
  5. Point KUBECONFIG at the VKS cluster and run the installer, pinning the NVAIE and operator versions.
  6. Watch the operator namespace until the driver, toolkit, device plugin, and validator pods are all Ready.
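
Steps 2 and 3 can be sketched with plain kubectl. This is a minimal sketch with several assumptions. The `gpu-operator` namespace is taken from the validation commands in this article and must already exist. The `ngc-secret` name and the two function names are illustrative. The NVAIE installer may create both objects itself, so read the script before creating them by hand.

```shell
# Assumed namespace; override by exporting NAMESPACE before sourcing this.
NAMESPACE="${NAMESPACE:-gpu-operator}"

# Step 2: licensing-config secret built from gridd.conf and the client token.
# $1 = path to gridd.conf, $2 = path to client_configuration_token.tok
create_licensing_secret() {
  kubectl create secret generic licensing-config \
    --namespace "$NAMESPACE" \
    --from-file=gridd.conf="$1" \
    --from-file=client_configuration_token.tok="$2"
}

# Step 3: image pull secret for nvcr.io.
# NGC expects the literal username $oauthtoken with the API key as password.
# "ngc-secret" is an assumed name, not one confirmed by the installer.
create_ngc_secret() {
  kubectl create secret docker-registry ngc-secret \
    --namespace "$NAMESPACE" \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="${NGC_API_KEY:?set NGC_API_KEY first}"
}
```

Call `create_licensing_secret ./gridd.conf ./client_configuration_token.tok` and then `create_ngc_secret` with `NGC_API_KEY` exported. Keeping the key in an environment variable rather than on the command line keeps it out of your shell history.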

# pin the versions explicitly, do not rely on latest
export NVAIE_VERSION='6'
export GPU_OPERATOR_VERSION='25.10.1'

# pull the NVAIE installer bundle from NGC
ngc --org=nvidia registry resource download-version nvidia/vgpu/gpu-operator-installer-6:6.3

# target the VKS cluster
export KUBECONFIG=/etc/vks-kubeconfig/k8s-cluster-kubeconfig-admin

# run the NVAIE operator installer
bash ./gpu-operator-installer-6_v6.3/gpu-operator-nvaie-6.3.sh install

Behind that one script call, the Quickstart blueprint runs a predictable sequence: download the installer, set the kubeconfig, install the operator, count the GPUs the node should expose, wait for the DaemonSet, validate the license, then install the RAG and NIM workloads. If you are debugging a stuck deployment, walk that same order. The failure is almost always at check_gpu_count or validate_license, not at the install itself.
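
One way to walk that order by hand is a small runner that executes each stage in turn and stops at the first failure. This is a hypothetical sketch, not the blueprint's own code. Only the names `check_gpu_count` and `validate_license` come from the sequence above. Every stage is assumed to be a shell function or command you define yourself that returns non-zero on failure.

```shell
# Run each named stage in order; stop and report at the first failure.
# Each argument must be the name of a function or command on PATH.
run_stages() {
  for stage in "$@"; do
    printf '==> %s\n' "$stage"
    if ! "$stage"; then
      printf 'FAILED at %s: fix this layer before moving on\n' "$stage" >&2
      return 1
    fi
  done
  printf 'all stages passed\n'
}
```

A call such as `run_stages install_operator check_gpu_count wait_for_daemonset validate_license` then names the failing stage, instead of leaving you to infer it from a pod in CrashLoopBackOff.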


Validate before you celebrate

A finished install is not a working GPU. Run the checks in order, because each one rules out a different layer. If the validator pod is green but the device plugin advertises zero GPUs, your vGPU profile or host VIB is wrong, not your cluster.

[Figure: Validate in this order. Each check isolates a different layer of the stack. GPU count (node exposes nvidia.com/gpu), then DaemonSet Ready (driver, plugin, validator), then License OK (gridd checked out), then nvidia-smi in pod (GPU visible to workload). Any check fails? Stop. Fix that layer before moving right. Do not reinstall.]
Validate left to right. Reinstalling a broken stack just hides which layer failed.

# all operator pods should be Running, validator Completed
kubectl get pods -n gpu-operator
kubectl get ds  -n gpu-operator

# the node should now advertise GPUs
kubectl get node -o json | grep -i 'nvidia.com/gpu'

# check the validator and the license
kubectl logs -n gpu-operator -l app=nvidia-vgpu-validator

# prove it end to end from inside a pod
kubectl exec -it <gpu-pod> -- nvidia-smi
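
Reading those outputs by eye works on one node and gets error-prone on twenty. As a rough sketch, the two helpers below turn the first two checks into pass or fail results. Both read text on stdin, so they only assume the default column layout of `kubectl get ds --no-headers` (NAME, DESIRED, CURRENT, READY, ...) and a two-column `node count` listing. The function names are illustrative.

```shell
# List DaemonSets whose READY count is below DESIRED.
# Input: output of `kubectl get ds -n gpu-operator --no-headers` on stdin.
# Returns non-zero if any DaemonSet is lagging.
daemonsets_not_ready() {
  awk '$4 < $2 { print $1 " ready " $4 "/" $2; bad = 1 }
       END     { exit (bad ? 1 : 0) }'
}

# Sum allocatable GPUs across nodes.
# Input: "node count" pairs on stdin, for example from
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# Nodes without the resource print "<none>" and count as zero.
total_gpus() {
  awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}
```

If `total_gpus` prints 0 while `daemonsets_not_ready` reports nothing, the operator is healthy and the problem sits below it, in the vGPU profile or the host VIB.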

The version-skew trap, read this twice

This is the failure that burns the most hours, so it gets the space it deserves here. The guest driver running in the operator pod and the host VIB on ESXi are independent installs, but they are not independent in behavior. If the host driver is on a newer major vGPU branch than the guest driver, the driver pod transitions to a failed state and never recovers. The operator looks broken. It is not. The two driver branches simply do not match.

[Figure: Host and guest branches must agree. Matched branch: host VIB vGPU 18.x (580.x), guest driver 580.x; driver pod Ready, GPUs advertised. This is the only state you ship. Mismatched branch: host VIB newer than guest, guest driver on an older branch; driver pod fails, no GPUs. Last resort: DISABLE_VGPU_VERSION_CHECK.]
Treat the host VIB and guest driver as one matched pair you upgrade together.

There is an escape hatch. You can set DISABLE_VGPU_VERSION_CHECK to true in the driver config so the operator stops enforcing the branch match. Use it only as a bridge during a staged upgrade, never as a steady state. Bypassing the check lets a known-incompatible pair run, and the symptoms it produces later (random Xid errors, workloads that hang on the GPU) are far worse to diagnose than a driver pod that honestly refuses to start.

# bridge only, during a staged host-then-guest upgrade
driver:
  env:
    - name: DISABLE_VGPU_VERSION_CHECK
      value: 'true'

In practice, the cleanest operating model is to pin the host VIB and the guest driver as a single matched pair, document the pair, and upgrade both in the same change window. Most teams break this the first time they patch ESXi for an unrelated CVE and quietly move the host driver forward. The GPUs go offline the moment the nodes reschedule. If you hit issues during bring-up, the broader GPU and vGPU mistakes that break Private AI Foundation are worth a read alongside this.
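
The matched-pair policy is easy to enforce as a pre-change check. The sketch below compares only the leading number of two driver version strings and fails when they differ. That rule encodes this article's "one matched pair" policy and is stricter than NVIDIA's published compatibility matrix, which remains the authority. The function names are hypothetical and the versions in the usage note are sample values.

```shell
# Major branch of a driver version string, e.g. 580.95.05 -> 580
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

# Succeeds only when host and guest versions share a major branch.
# $1 = host VIB driver version, $2 = guest driver version
same_branch() {
  host=$(driver_branch "$1")
  guest=$(driver_branch "$2")
  if [ -n "$host" ] && [ "$host" = "$guest" ]; then
    printf 'matched: host %s, guest %s (branch %s)\n' "$1" "$2" "$host"
  else
    printf 'MISMATCH: host %s vs guest %s\n' "$1" "$2" >&2
    return 1
  fi
}
```

Run `same_branch 580.95.02 580.95.05` as a gate in the ESXi patch runbook. A non-zero exit is the cue to schedule the guest driver update in the same change window.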

Disclaimer: This installs kernel drivers and a privileged operator on production GPU nodes. Validate the host VIB against your VCF and vGPU bill of materials, confirm host-to-guest driver branch compatibility, back up cluster state, run the install on one node pool first, and keep the previous driver bundle on hand for rollback.

What I’d Do

For a first deployment, let the VCF Automation Quickstart install the operator and prove the path end to end, then rebuild it manually once so you understand every secret and version it set. Pin your GPU Operator and driver versions in writing, and treat the host VIB and guest driver as a single matched pair you upgrade together. Skip that discipline and the next routine host patch is what takes your GPUs offline. Which vGPU driver branch are you standardizing on across your fleet?



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
