TL;DR · Key Takeaways
- The GPU Operator is the piece that makes a GPU schedulable inside a VKS cluster. The vGPU host VIB and the guest driver are separate things, and both have to agree.
- PAIF 9.1 ships GPU Operator 25.10.1 with a guest driver on the 580.x branch. The host VIB must be on a compatible branch or the driver pod never goes Ready.
- You need an NVIDIA AI Enterprise entitlement for the host VIB, the guest driver, and NGC image pulls. No NVAIE, no install.
- The VCF Automation Quickstart can install the operator for you. Do the manual install when you need control, custom profiles, or you are air-gapped.
- Most bring-up failures are licensing or host-to-guest branch skew, not the operator code itself.
Your GPU hosts are licensed, the workload domain is built, and nvidia-smi works inside the VM. Then the GPU Operator lands in CrashLoopBackOff, the device plugin never advertises a single GPU, and your pods sit in Pending. This is the step where most Private AI deployments stall, and it is almost never the operator that is broken. It is the driver chain underneath it.
This part assumes you have already prepared a GPU workload domain and decided how you are partitioning the GPU, whether that is a full vGPU profile, MIG, or passthrough. Here we install the piece that turns those GPUs into something Kubernetes can actually schedule: the NVIDIA GPU Operator, plus the matching vGPU drivers on the host and the guest.
What the GPU Operator actually does
The GPU Operator is a set of Kubernetes operators and DaemonSets that automate everything between a bare node and a schedulable GPU: the guest driver container, the NVIDIA container toolkit, the device plugin that advertises nvidia.com/gpu to the scheduler, and a validator that proves the stack works before workloads land. On a vGPU node it also handles license check-out through the guest driver. What it does not do is install the host-side VIB. That lives on ESXi and is your job.
Prerequisites you must validate first
Before you touch Kubernetes, prove the host layer. The host VIB has to be installed and loaded on every GPU host in the cluster, and the version has to be one the guest driver branch can talk to. Confirm it on each host:
# on each ESXi GPU host
esxcli software vib list | grep -i nvd
nvidia-smi
nvidia-smi -q | grep -i 'Driver Version'
# the host must report a vGPU host driver, not just a GPU present
You also need three entitlements lined up: an NVIDIA AI Enterprise license that covers the host VIB and the guest driver, an NGC API key so the operator can pull the driver and validator images, and a vGPU license source (a Delegated License Service appliance or the NVIDIA licensing portal token). If you have read the planning and prerequisites part, this should already be in place. One field note: HTTPS proxies with authentication or self-signed certificates are not supported for vGPU license registration. If you are behind that kind of proxy, stand up a local DLS appliance instead and save yourself a day.
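A quick preflight catches missing inputs before they surface as a stuck driver pod. This is a minimal sketch, not part of the NVAIE tooling: the function name is hypothetical, and it assumes the NGC key is exported as NGC_API_KEY and the license token file sits at a path you pass in.

```shell
#!/usr/bin/env bash
# Preflight sketch: confirm the install inputs exist before touching the cluster.
# Assumptions: NGC_API_KEY is the variable holding your NGC key, and the token
# file defaults to client_configuration_token.tok in the current directory.

preflight() {
  local token_file="${1:-client_configuration_token.tok}"
  local rc=0

  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "missing: NGC_API_KEY is not set" >&2
    rc=1
  fi

  # -s is true only if the file exists and is not empty
  if [ ! -s "$token_file" ]; then
    echo "missing: license token '$token_file' is absent or empty" >&2
    rc=1
  fi

  return "$rc"
}
```

Call it as `preflight /path/to/client_configuration_token.tok` and stop if it returns non-zero. It only proves the inputs exist, not that the entitlement behind them is valid.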
Two install paths, and when to use each
There are two supported ways to get the operator onto a VKS cluster. The VCF Automation Private AI Quickstart generates a blueprint that downloads the NVAIE installer and runs it for you as part of provisioning. Or you run the same installer by hand against an existing cluster. They produce the same end state. They differ in how much control you keep.
| Dimension | VCF Automation Quickstart | Manual install |
|---|---|---|
| Effort | Low, generated blueprint | Higher, you run the script |
| Control | Opinionated, limited overrides | Full, you pin versions and values |
| Driver version | Chosen for you | You set GPU_OPERATOR_VERSION and branch |
| Air-gapped | Supported via local staging | Supported, you stage NGC content |
| Best for | Standard RAG and NIM catalog items | Custom clusters, special profiles, debugging |
My rule: use the Quickstart for the first deployment so you have a known-good reference, then do a manual install at least once so you actually understand the secrets and versions it set on your behalf. When something breaks at 11pm, the team that ran it by hand is the one that can fix it.
The manual install, step by step
- Confirm the host VIB and licensing are healthy on every GPU host in the target cluster.
- Create the licensing-config secret from your gridd.conf and client_configuration_token.tok so the guest driver can check out a license.
- Set your NGC API key so the operator can pull the driver, toolkit, and validator images.
- Download the NVAIE GPU Operator installer bundle from NGC.
- Point KUBECONFIG at the VKS cluster and run the installer, pinning the NVAIE and operator versions.
- Watch the operator namespace until the driver, toolkit, device plugin, and validator pods are all Ready.
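The licensing secret and NGC key steps in the list above can be sketched with plain kubectl. Treat this as an illustration under assumptions: the gpu-operator namespace and the ngc-secret name are assumed here, the run and create_secrets helpers are hypothetical, and the NVAIE installer may create some of these objects itself, so check what your bundle expects first. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: create the licensing secret and the NGC image pull secret.
# Assumptions: namespace "gpu-operator", pull secret name "ngc-secret",
# gridd.conf and client_configuration_token.tok in the current directory.
# DRY_RUN=1 (the default) prints commands instead of running them.
# Note: the dry-run output includes the key, so do not paste it into tickets.

NAMESPACE="${NAMESPACE:-gpu-operator}"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

create_secrets() {
  run kubectl create namespace "$NAMESPACE"

  # licensing material the guest driver uses to check out a vGPU license
  run kubectl create secret generic licensing-config -n "$NAMESPACE" \
    --from-file=gridd.conf \
    --from-file=client_configuration_token.tok

  # NGC uses the literal user name $oauthtoken with the API key as password
  run kubectl create secret docker-registry ngc-secret -n "$NAMESPACE" \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="${NGC_API_KEY:?set NGC_API_KEY first}"
}
```

Run `create_secrets` once to review the commands, then `DRY_RUN=0 create_secrets` to apply them.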
# pin the versions explicitly, do not rely on latest
export NVAIE_VERSION='6'
export GPU_OPERATOR_VERSION='25.10.1'
# pull the NVAIE installer bundle from NGC
ngc --org=nvidia registry resource download-version nvidia/vgpu/gpu-operator-installer-6:6.3
# target the VKS cluster
export KUBECONFIG=/etc/vks-kubeconfig/k8s-cluster-kubeconfig-admin
# run the NVAIE operator installer
bash ./gpu-operator-installer-6_v6.3/gpu-operator-nvaie-6.3.sh install
Behind that one script call, the Quickstart blueprint runs a predictable sequence: download the installer, set the kubeconfig, install the operator, count the GPUs the node should expose, wait for the DaemonSet, validate the license, then install the RAG and NIM workloads. If you are debugging a stuck deployment, walk that same order. The failure is almost always at check_gpu_count or validate_license, not at the install itself.
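If you want to reproduce the GPU-count step by hand, something like the following works. It is a sketch in the spirit of check_gpu_count, not the blueprint's actual code: the helper names are hypothetical, and it assumes the count you care about is the allocatable nvidia.com/gpu value summed across nodes.

```shell
#!/usr/bin/env bash
# Sketch of a manual GPU-count check. The blueprint's own logic may differ.

# Print "<node> <allocatable nvidia.com/gpu>" per node (needs a live cluster).
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    --no-headers
}

# Sum the second column. Nodes without GPUs show "<none>" and count as 0.
sum_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Succeed only if the cluster advertises at least the expected number of GPUs.
check_gpu_count() {
  local expected="$1" actual
  actual="$(list_gpu_nodes | sum_gpus)"
  echo "expected=${expected} actual=${actual}"
  [ "$actual" -ge "$expected" ]
}
```

`check_gpu_count 4` returns non-zero while the device plugin is still advertising fewer than four GPUs, which makes it easy to wrap in a retry loop.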
Validate before you celebrate
A finished install is not a working GPU. Run the checks in order, because each one rules out a different layer. If the validator pod is green but the device plugin advertises zero GPUs, your vGPU profile or host VIB is wrong, not your cluster.
# all operator pods should be Running, validator Completed
kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
# the node should now advertise GPUs
kubectl get node -o json | grep -i 'nvidia.com/gpu'
# check the validator and the license
kubectl logs -n gpu-operator -l app=nvidia-operator-validator
# prove it end to end from inside a pod
kubectl exec -it <gpu-pod> -- nvidia-smi
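If you do not have a GPU pod to exec into yet, a throwaway pod that requests one GPU and runs nvidia-smi does the job. This is a sketch under assumptions: the pod name, file name, and helper function are made up for illustration, and you supply the image yourself (any CUDA base image you can pull, or one you have staged locally if you are air-gapped).

```shell
#!/usr/bin/env bash
# Sketch: write a one-shot pod manifest that requests a single GPU.
# The pod name "gpu-smoke-test" and the default file name are assumptions.

write_gpu_test_pod() {
  local image="$1" out="${2:-gpu-smoke-test.yaml}"
  cat > "$out" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
  # print the path so the caller can feed it to kubectl
  echo "$out"
}

# Usage against a live cluster (image is a placeholder you fill in):
#   kubectl apply -f "$(write_gpu_test_pod <your-cuda-image>)"
#   kubectl logs gpu-smoke-test
#   kubectl delete pod gpu-smoke-test
```

A pod that stays Pending here points back at the device plugin and the resource count, while one that starts and fails inside nvidia-smi points at the driver or the license.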
The version-skew trap, read this twice
This is the failure that burns the most hours, so give it the attention it deserves. The guest driver running in the operator pod and the host VIB on ESXi are independent installs, but they are not independent in behavior. If the host driver is on a newer major vGPU branch than the guest driver, the driver pod transitions to a failed state and never recovers. The operator looks broken. It is not. The two driver branches simply do not match.
There is an escape hatch. You can set DISABLE_VGPU_VERSION_CHECK to true in the driver config so the operator stops enforcing the branch match. Use it only as a bridge during a staged upgrade, never as a steady state. Bypassing the check lets a known-incompatible pair run, and the symptoms it produces later (random Xid errors, workloads that hang on the GPU) are far worse to diagnose than a driver pod that honestly refuses to start.
# bridge only, during a staged host-then-guest upgrade
driver:
  env:
    - name: DISABLE_VGPU_VERSION_CHECK
      value: 'true'
In practice, the cleanest operating model is to pin the host VIB and the guest driver as a single matched pair, document the pair, and upgrade both in the same change window. Most teams break this the first time they patch ESXi for an unrelated CVE and quietly move the host driver forward. The GPUs go offline the moment the nodes reschedule. If you hit issues during bring-up, the broader GPU and vGPU mistakes that break Private AI Foundation are worth a read alongside this.
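One way to keep the matched pair honest is a small check you run in the change window before and after patching. The sketch below is an illustration, not NVIDIA's compatibility logic: the function names are hypothetical, and it treats any difference in major branch as skew, which is stricter than the published compatibility matrix. The matrix remains the authority.

```shell
#!/usr/bin/env bash
# Sketch: compare the major branch of a host and a guest driver version.
# Assumption: versions look like "<branch>.<minor>.<patch>", and any branch
# mismatch is flagged, which is stricter than NVIDIA's compatibility matrix.

# "580.12.34" -> "580"
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

check_driver_pair() {
  local host="$1" guest="$2"
  if [ "$(driver_branch "$host")" = "$(driver_branch "$guest")" ]; then
    echo "OK: host ${host} and guest ${guest} share branch $(driver_branch "$host")"
  else
    echo "SKEW: host ${host} vs guest ${guest}" >&2
    return 1
  fi
}
```

Feed it the host version from nvidia-smi on ESXi and the guest version you pinned for the operator. For example, `check_driver_pair 580.1.1 580.2.2` passes and `check_driver_pair 590.1.1 580.2.2` fails (both version strings are made-up placeholders).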
What I’d Do
For a first deployment, let the VCF Automation Quickstart install the operator and prove the path end to end, then rebuild it manually once so you understand every secret and version it set. Pin your GPU Operator and driver versions in writing, and treat the host VIB and guest driver as a single matched pair you upgrade together. Skip that discipline and the next routine host patch is what takes your GPUs offline. Which vGPU driver branch are you standardizing on across your fleet?
References
- VMware Private AI Foundation with NVIDIA 9.0.x Release Notes (Broadcom TechDocs)
- Using NVIDIA vGPU with the GPU Operator (NVIDIA Docs)
- Generative AI with VMware Private AI Foundation with NVIDIA on VCF 9 (VCF Blog)