Troubleshooting VMware Private AI Foundation: 7 Failures That Actually Bite (Private AI Series, Part 23)

The seven failures I hit most often on VMware Private AI Foundation with NVIDIA, from a dark GPU on the ESXi host to a NIM pod crashing on CUDA out of memory, with the real error strings and the checks that isolate each layer.

by

Dr. Pranay Jha

June 16, 2026


VMware Private AI Series · Part 23 of 24

TL;DR · Key Takeaways

A Private AI stack fails in layers. The symptom you see is usually one or two layers above the actual cause, so work bottom-up: host, then profile, then guest license, then Kubernetes, then the model.
If nvidia-smi on the ESXi host shows nothing, stop touching the guest. The GPU never came up on the host.
The single most common day-one failure is host and guest vGPU drivers from different branches. Match the branch before anything else.
NIM pods that crash on CUDA out of memory are almost never a node-size problem. They are a model-profile problem.
In VCF 9, DirectPath Profile Pool exhaustion now looks like a quiet scheduling failure, not a power-on error. Check pool capacity before you blame the host.

Who this is for: admins and architects running VMware Private AI Foundation with NVIDIA on VCF 9.0 or 9.1. Prerequisites: a deployed GPU workload domain, shell access to an ESXi host and to the Supervisor or TKG cluster, and the NGC credentials used at deployment.

A Private AI stack fails in layers, and the error you read is almost never where the problem lives. nvidia-smi reports no devices, so people reinstall the guest driver when the GPU never came up on the host. A NIM pod crashes on CUDA out of memory, so people ask for a bigger node when the real fix is a smaller model profile. After enough bring-ups you stop reading the symptom and start asking which layer is lying. These are the seven failures I hit most often, from the ESXi host up to the inference pod, with the actual error strings and the checks that isolate each one.

The Private AI stack and the layer each of the seven failures actually sits in.

Debug in this order, not the order the errors arrive

The decision flow below is the order I actually follow. Each gate is a one-line check that either clears a whole layer or stops you there. Run them in sequence, starting at the host and working up the stack, and you will isolate almost any Private AI inference failure in five commands or fewer.

Five gates, bottom of the stack to the top. Each one clears or convicts a layer.
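A minimal sketch of that ordering, assuming you record each gate by hand as ok or fail. The function name and the layer labels are illustrative (the labels follow the bottom-up order in the takeaways), and nothing here queries the platform.

```shell
# Hypothetical helper: report the lowest failing layer.
# Arguments are the five gate results, bottom of the stack first:
#   host  profile  license  kubernetes  model
# Each argument is the literal string "ok" or "fail".
lowest_red_layer() {
  for layer in host profile license kubernetes model; do
    if [ "$1" != "ok" ]; then
      echo "$layer"
      return 0
    fi
    shift
  done
  echo "none"
}
```

Calling lowest_red_layer ok ok fail ok fail reports license, not model: the higher failure is ignored until the lower one is fixed.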

F1 · The GPU is dark on the host

Symptom: in a Deep Learning VM or a TKG worker, nvidia-smi returns "No devices were found" or the blunter "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver." The instinct is to reinstall the guest driver. Resist it. Check the host first, because a guest can only see a vGPU that the host successfully created.

# On the ESXi host
[root@esx01:~] esxcli software vib list | grep -i nvd
[root@esx01:~] nvidia-smi -q | head

# Is the GPU even enumerated on the PCI bus?
[root@esx01:~] lspci | grep -i nvidia

# Kernel log when the adapter never inits:
NVRM: GPU 0000:31:00.0: RmInitAdapter failed! (0x62:0x40:1860)
NVRM: GPU 0000:31:00.0: rm_init_adapter failed, device minor number 0

If lspci does not list the card, this is hardware: a poorly seated GPU, a dead riser, or a slot that is not bifurcated the way the BOM assumed. No driver fixes that.

If the card is listed but the VIB is missing, the host vGPU manager was never installed or fell off during a vSphere Lifecycle Manager remediation. The fix is to put the NVIDIA host driver VIB back into the cluster image and remediate, not to install it by hand on one host where it will drift out again.

If you see RmInitAdapter failed with the card present and the VIB loaded, the usual suspects are an ECC state change or a memory training failure after a firmware update. In practice, the trigger I see most often is a GPU that was flashed or moved between hosts without a clean power cycle. A full host power drain, not a soft reboot, clears it more often than any driver reinstall.
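The three branches above can be sketched as one triage function. This is an illustration, not a supported tool: the function name and the result labels are made up, and it only inspects text you pass in (the lspci output, the VIB list, and whatever kernel log excerpt you pulled the NVRM lines from).

```shell
# Hypothetical triage helper for F1. Takes three strings:
#   $1 = output of lspci
#   $2 = output of esxcli software vib list
#   $3 = kernel log text that would contain the NVRM lines
classify_host_gpu() {
  lspci_out=$1 vib_out=$2 klog=$3
  if ! printf '%s\n' "$lspci_out" | grep -qi nvidia; then
    echo "hardware"        # card not enumerated on the PCI bus
  elif ! printf '%s\n' "$vib_out" | grep -qi nvd; then
    echo "vib-missing"     # card present, host driver VIB absent
  elif printf '%s\n' "$klog" | grep -q 'RmInitAdapter failed'; then
    echo "adapter-init"    # card and VIB present, adapter never initialised
  else
    echo "host-ok"         # nothing wrong at this layer; move up the stack
  fi
}
```

On a host you might call it as classify_host_gpu "$(lspci)" "$(esxcli software vib list)" "$klog", where klog holds the log excerpt you collected.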

F2 · The guest driver runs but the GPU is throttled or unlicensed

Symptom: the GPU appears in the guest, workloads start, then performance collapses after about twenty minutes, or the driver reports an unlicensed state outright. vGPU enforces licensing by degrading clocks once the grace period ends, so this looks like a performance bug long before it looks like a licensing one.

$ nvidia-smi -q | grep -A3 -i license
    vGPU Software Licensed Product
        License Status : Unlicensed (Restricted)

# Token lands here on a Linux guest:
$ ls /etc/nvidia/ClientConfigToken/
$ systemctl restart nvidia-gridd && nvidia-smi -q | grep -i 'License Status'

Three things cause this in a Private AI deployment. First, the client configuration token is missing, expired, or not a valid JWT, so validate the token format before anything else. Second, the guest cannot reach the license server, which on Private AI usually means a proxy in the path: an HTTP proxy that requires authentication is not supported for vGPU license registration, and an HTTPS proxy with a self-signed certificate is not supported for downloading the guest driver. If your environment forces auth proxies, you need a direct route or an exception for the license endpoint. Third, and the one people miss, is branch mismatch: the guest driver (the 580.x series in the current release) must come from the same vGPU branch as the host VIB. A guest driver one major branch ahead of the host will load, register, and then misbehave in ways that read like flaky hardware.
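Two of those checks are mechanical enough to script. The sketch below is a structural check only: it confirms that a token has the three-segment shape of a JWT, without verifying the signature or the expiry, and it compares driver branches on the assumption that the branch is the number before the first dot. Both function names are illustrative.

```shell
# Structural check only: a JWT is three non-empty base64url segments
# joined by dots. This does not verify the signature or the expiry.
looks_like_jwt() {
  case $1 in
    *[!A-Za-z0-9._-]*) return 1 ;;   # whitespace, newlines, padding, etc.
  esac
  printf '%s\n' "$1" | grep -Eq '^[^.]+\.[^.]+\.[^.]+$'
}

# Compare the major number of two driver version strings.
# Assumes the branch is identified by the part before the first dot.
same_driver_branch() {
  [ -n "$1" ] && [ "${1%%.*}" = "${2%%.*}" ]
}
```

For example, pass the token file contents as looks_like_jwt "$(cat <token-file>)", and feed same_driver_branch the two version strings reported by nvidia-smi --query-gpu=driver_version --format=csv,noheader on the host and in the guest.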

Host VIB and guest driver are a matched pair. Treat the GPU Operator driver image the same way.

F3 · The VM will not power on, or the pool is silently full

On VCF 9 this failure changed shape. ESXi 9 no longer makes you pin every device to a profile before power-on, and reconfiguration now happens transparently at VM lifecycle events. That removed the old "profile not available" power-on error for most cases. What replaced it is quieter. VCF 9 introduced DirectPath Profile Pools, a cluster-wide view of consumed and remaining vGPU capacity. When a pool is exhausted, a new Deep Learning VM or a scaled-up worker does not fail loudly. It waits, or it lands without the GPU it expected, and you find out three layers up when a pod cannot schedule. Before you blame a host, check the pool.

A profile mismatch still bites when you mix time-sliced and MIG-backed profiles on the same physical GPU, which is not allowed: a GPU is either in time-slicing mode or MIG mode, not both, and a request for the wrong kind will never place.
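For a quick sanity check on whether a pool can be full at all, the arithmetic can be sketched as below. This is back-of-envelope only and is not how vCenter computes pool capacity: it assumes every GPU in the pool is time-sliced with a single equal-sized profile, and the function name and inputs are hypothetical.

```shell
# Back-of-envelope only; not how vCenter computes pool capacity.
# Assumes every GPU in the pool is time-sliced with one equal-sized profile.
#   $1 = physical GPUs in the pool   $2 = memory per card (GB)
#   $3 = profile size (GB)           $4 = vGPU instances already consumed
pool_remaining() {
  gpus=$1 card_gb=$2 profile_gb=$3 consumed=$4
  per_gpu=$(( card_gb / profile_gb ))   # whole instances that fit on one card
  echo $(( gpus * per_gpu - consumed ))
}
```

If the result is zero or negative, the next VM that asks for that profile has nowhere to land, whatever the individual hosts report.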

F4 · The GPU Operator is green but the GPU is not allocatable

Symptom: every pod in gpu-operator looks fine at a glance, yet nvidia.com/gpu shows zero on every node. The operator orchestrates the driver, device plugin, and validators, so a failure in any one of them leaves the node with no advertised GPU even when the others are healthy.

$ kubectl get pods -n gpu-operator
NAME                                  READY   STATUS             RESTARTS
nvidia-driver-daemonset-xxxxx         0/1     CrashLoopBackOff   6
nvidia-device-plugin-daemonset-yyyy   0/1     Init:0/1           0

# What the node actually advertises:
# (the dot in nvidia.com must be escaped, or jsonpath treats it as a path separator)
$ kubectl get node <worker> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
0

$ kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx | tail -20

With GPU Operator 25.10.x the failure I see most is the driver container trying to build or load a driver whose branch does not match the host VIB, the same handshake from F2 but inside a container. On a Private AI deployment you generally want the operator running in vGPU mode against the pre-installed guest driver rather than letting it manage a driver that fights the host. If the driver pod is healthy but the device plugin is stuck in Init, the validator cannot see a working GPU, which loops you back to F1 and F2. Walk the operator pods in dependency order, driver first, then toolkit, then device plugin, then validator, and fix the first one that is red rather than restarting all of them. For the full install path, see installing the NVIDIA GPU Operator and vGPU drivers.
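The "fix the first red pod in dependency order" rule can be sketched as a filter over the pod listing. This assumes the default GPU Operator pod name prefixes and the column layout of kubectl get pods --no-headers (NAME, READY, STATUS first); the function name is illustrative, and a deployment that uses a pre-installed driver may have no driver daemonset at all.

```shell
# Reads `kubectl get pods -n gpu-operator --no-headers` on stdin and prints
# the first unhealthy pod in dependency order: driver, toolkit, device
# plugin, validator. Returns 1 if none of those components is unhealthy.
first_red_pod() {
  pods=$(cat)
  for prefix in nvidia-driver-daemonset nvidia-container-toolkit-daemonset \
                nvidia-device-plugin-daemonset nvidia-operator-validator; do
    hit=$(printf '%s\n' "$pods" | awk -v p="$prefix" '
      index($1, p) == 1 {
        split($2, ready, "/")
        healthy = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
        if (!healthy) { print $1, $3; exit }
      }')
    if [ -n "$hit" ]; then
      echo "$hit"
      return 0
    fi
  done
  return 1
}
```

Piping the listing shown earlier through first_red_pod names the driver daemonset rather than the device plugin, which is the point: the plugin is only waiting on the driver.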

F5 · The NIMCache never reaches Ready

The NIM Operator pattern is a hard dependency: the NIMCache controller spins up a Kubernetes Job named <nimcache-name>-job to pull model artifacts from NGC, and the NIMService will not start until that cache is Ready. When inference never comes up, check the cache before the service.

$ kubectl get nimcache -A
NAMESPACE   NAME             STATUS     AGE
nim         meta-llama3-8b   NotReady   14m

$ kubectl get jobs -n nim | grep meta-llama3-8b
$ kubectl logs -n nim job/meta-llama3-8b-job --tail=30
# 401 Unauthorized  -> NGC API key / image pull secret
# i/o timeout to nvcr.io -> egress or proxy
# No space left on device -> PVC too small for the model

Three causes cover almost every stuck cache. Authentication: the NGC API key or the image pull secret is wrong or expired, which shows as a 401 in the job log. Network: the job cannot reach nvcr.io, which in a Private AI environment usually means egress rules or the same auth-proxy problem from F2. Disk: the cache PVC is too small for the model, and large checkpoints fill it mid-download. Size the cache volume for the biggest model you intend to serve plus headroom, because teams routinely under-provision it for the small model they tested with and then fail on the real one. In an air-gapped deployment the cache pulls from your local mirror instead, so a stuck cache there points at the mirror, covered in the air-gapped deployment guide.
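As a rough sketch, the three causes map onto three strings in the job log, so a first-pass classifier is a few greps. The function name and the result labels are illustrative, and matching is case-insensitive because the exact casing of these messages varies by tool.

```shell
# Reads a NIMCache job log on stdin and names the most likely cause.
# Checks in a fixed order and reports the first match only.
classify_cache_failure() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -qi '401 unauthorized'; then
    echo "auth"       # NGC API key or image pull secret
  elif printf '%s\n' "$log" | grep -qi 'i/o timeout'; then
    echo "network"    # egress rules or a proxy in the path
  elif printf '%s\n' "$log" | grep -qi 'no space left on device'; then
    echo "disk"       # cache PVC too small for the model
  else
    echo "unknown"
  fi
}
```

Typical use would be kubectl logs -n nim job/meta-llama3-8b-job | classify_cache_failure. An unknown result means read the log yourself; the sketch only knows these three patterns.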

F6 · The NIM pod crashes on CUDA out of memory

Symptom: the NIMService pod starts, loads for a minute, then drops into CrashLoopBackOff with an out-of-memory error in the log. The reflex is to ask for a bigger node. That is usually the wrong move. The pod already has the GPU it asked for; the model profile it selected does not fit that GPU.

$ kubectl logs -n nim meta-llama3-8b-nimservice-xxxx | tail
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
# or:
ERROR  No supported model profile for detected hardware

# See what the model actually offers for this GPU:
$ kubectl exec -n nim <pod> -- list-model-profiles

NIM picks a profile based on the GPU it detects, and on a partitioned GPU (a MIG slice or a fractional vGPU) the visible memory is a fraction of the card. The fix is to choose a profile that fits: a higher tensor-parallel or pipeline-parallel profile that spreads the model across more partitions, or a lower-precision profile such as FP8 or NVFP4. FP8 roughly halves the weight footprint relative to FP16 or BF16, and NVFP4 roughly halves it again. No supported model profile is the related cousin: the model has no profile that matches the slice you gave it at all, which means the partition is too small for this model and you need a larger vGPU profile or a full GPU. My take: decide the profile from the GPU memory math up front rather than letting NIM auto-select and discovering the ceiling in a crash loop. The sizing arithmetic is in the sizing and cost part, and the serving-layer design is in deploying NIM microservices.
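A rough version of that memory math, as a sketch: weights take about (parameters × bits per parameter ÷ 8) bytes, split across the tensor-parallel degree. The headroom figure for KV cache and activations is an assumption you supply, NVFP4 is approximated as 4 bits, and the function names are illustrative. Treat the result as a first filter, not a substitute for what list-model-profiles reports.

```shell
# Rough weight footprint per GPU in GB (integer, rounded up).
#   $1 = parameters in billions
#   $2 = bits per parameter (16 for FP16/BF16, 8 for FP8, about 4 for NVFP4)
#   $3 = tensor-parallel degree (GPUs or partitions sharing the weights)
weights_gb_per_gpu() {
  echo $(( ($1 * $2 + 8 * $3 - 1) / (8 * $3) ))
}

# Succeeds when weights plus headroom fit in the partition.
#   $4 = memory visible in the partition (GB)
#   $5 = headroom for KV cache and activations (GB, an assumed figure)
profile_fits() {
  need=$(( $(weights_gb_per_gpu "$1" "$2" "$3") + $5 ))
  [ "$need" -le "$4" ]
}
```

With an 8-billion-parameter model, an illustrative 16 GB partition and 4 GB of assumed headroom, profile_fits 8 16 1 16 4 fails while profile_fits 8 8 1 16 4 succeeds: the lower-precision profile fits the same slice the 16-bit one crashes on.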

F7 · The pod sits in Pending and no GPU is offered

Symptom: the NIM or workload pod never schedules. kubectl describe pod shows the familiar line.

$ kubectl describe pod <pod> | grep -A2 Events
  Warning  FailedScheduling  0/3 nodes are available:
  3 Insufficient nvidia.com/gpu.

If F4 already proved the node advertises GPUs, then either they are all in use or the request cannot be satisfied as written. With time-slicing, a node advertises more replicas than physical GPUs, and each replica is a shared slice rather than a whole card; when the device plugin is configured to advertise those replicas under a renamed resource such as nvidia.com/gpu.shared, a plain nvidia.com/gpu request on a sliced node never matches. With MIG, the requested profile name must exist on a node. A mismatched nodeSelector or missing tolerations for the GPU taint will also hold a pod in Pending forever while the GPUs sit idle. This is exactly where good monitoring pays for itself, because allocatable-versus-requested is a dashboard you should already have from GPU monitoring with VCF Operations.
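Without a dashboard, the allocatable side of that comparison can be totalled from the command line. A minimal sketch, assuming input lines of the form "node count" on stdin; the function name is illustrative.

```shell
# Sum the GPU column of "NODE GPU" lines read from stdin.
# Non-numeric values such as <none> count as zero.
total_allocatable_gpus() {
  awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}
```

One way to feed it, assuming the resource is advertised under the default name: kubectl get nodes --no-headers -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | total_allocatable_gpus. If the total is above zero and pods still sit in Pending, look at the resource name, selectors, and tolerations rather than capacity.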

The symptom, cause and fix at a glance

Symptom	Likely cause	First fix
No devices were found (F1)	Host VIB missing or GPU not enumerated	Check lspci and the VIB on the host; remediate the cluster image; full power drain
Clocks throttle / Unlicensed (F2)	Bad token, auth proxy, or host/guest branch mismatch	Validate JWT token, bypass auth proxy, match driver branches
VM waits or lands without GPU (F3)	DirectPath Profile Pool exhausted or mode mismatch	Check pool capacity; do not mix time-slicing and MIG on one GPU
nvidia.com/gpu = 0 (F4)	GPU Operator driver or device plugin failing	Fix the first red operator pod in dependency order
NIMCache NotReady (F5)	401, egress block, or undersized PVC	Read the cache job log; fix key, route or volume size
CUDA out of memory (F6)	Model profile too big for the partition	Pick a higher TP/PP or FP8/NVFP4 profile, not a bigger node
Pod Pending, Insufficient gpu (F7)	Slice/profile request unmatched or taints	Align request to time-slicing/MIG; fix selectors and tolerations

Disclaimer: these are diagnostic steps, not blind remediations. Before any host driver change, profile change, or power drain, validate the target BOM, confirm host and guest vGPU branch interoperability, back up VM and cluster state, run prechecks, and test on a non-production host first. A full power drain takes down everything still running on the host, so evacuate the host through vSphere before you pull power.

What I'd Do

Stop debugging Private AI top-down. The errors surface at the model layer because that is where you are looking, but the cause is usually a host VIB, a driver branch, a license token, or a pool that quietly ran dry. Run the five gates from the bottom of the stack, fix the lowest red layer first, and most of these clear in minutes instead of an afternoon. Two of the seven, the branch handshake and the over-sized model profile, account for more wasted hours than the other five combined, so check those reflexively. Which of these has burned you the most: the dark GPU on the host, or the NIM pod that swears it is out of memory?


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.


Tags: GPU Operator, nim, PAIF, Private AI Series, Troubleshooting, vGPU, VMware Private AI
