TL;DR · Key Takeaways
- A Private AI stack fails in layers. The symptom you see is usually one or two layers above the actual cause, so work bottom-up: host, then profile, then guest license, then Kubernetes, then the model.
- If nvidia-smi on the ESXi host shows nothing, stop touching the guest. The GPU never came up on the host.
- The single most common day-one failure is host and guest vGPU drivers from different branches. Match the branch before anything else.
- NIM pods that crash on CUDA out of memory are almost never a node-size problem. They are a model-profile problem.
- In VCF 9, DirectPath Profile Pool exhaustion now looks like a quiet scheduling failure, not a power-on error. Check pool capacity before you blame the host.
A Private AI stack fails in layers, and the error you read is almost never where the problem lives. nvidia-smi reports no devices, so people reinstall the guest driver when the GPU never came up on the host. A NIM pod crashes on CUDA out of memory, so people ask for a bigger node when the real fix is a smaller model profile. After enough bring-ups you stop reading the symptom and start asking which layer is lying. These are the seven failures I hit most often, from the ESXi host up to the inference pod, with the actual error strings and the checks that isolate each one.
Debug in this order, not the order the errors arrive
The decision flow below is the order I actually follow. Each gate is a one-line check that either clears a whole layer or stops you there. Run them top to bottom and you will isolate almost any Private AI inference failure in under five commands.
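The bottom-up order can be sketched as a small gate runner that stops at the first layer that fails. This is a minimal sketch, not a finished tool: `run_gates` and the `layer:command` argument format are hypothetical names chosen for illustration, and the real checks run in different places (ESXi shell, guest, kubectl client), so in practice you would substitute whatever command proves each layer from where you are sitting.

```shell
# Sketch: run layer checks bottom-up and stop at the first failure.
# Each argument is "layer:command"; the commands are supplied by the caller.
run_gates() {
  for gate in "$@"; do
    layer=${gate%%:*}   # text before the first colon
    check=${gate#*:}    # everything after the first colon
    if sh -c "$check" >/dev/null 2>&1; then
      echo "PASS $layer"
    else
      echo "STOP $layer"
      return 1          # do not look at higher layers yet
    fi
  done
  echo "ALL CLEAR"
}
```

Called as `run_gates "host:true" "license:false" "kubernetes:true"`, with `true` and `false` standing in for real checks, it reports the host as passing, stops at the license layer, and never evaluates the Kubernetes gate.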
F1 · The GPU is dark on the host
Symptom: in a Deep Learning VM or a TKG worker, nvidia-smi returns No devices were found or the blunt NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. The instinct is to reinstall the guest driver. Resist it. Check the host first, because a guest can only see a vGPU that the host successfully created.
# On the ESXi host
[root@esx01:~] esxcli software vib list | grep -i nvd
[root@esx01:~] nvidia-smi -q | head
# Is the GPU even enumerated on the PCI bus?
[root@esx01:~] lspci | grep -i nvidia
# Kernel log when the adapter never inits:
NVRM: GPU 0000:31:00.0: RmInitAdapter failed! (0x62:0x40:1860)
NVRM: GPU 0000:31:00.0: rm_init_adapter failed, device minor number 0
If lspci does not list the card, this is hardware: a reseated GPU, a dead riser, or a slot that is not bifurcated the way the BOM assumed. No driver fixes that. If the card is listed but the VIB is missing, the host vGPU manager was never installed or fell off during a vSphere Lifecycle Manager remediation. The fix is to put the NVIDIA host driver VIB back into the cluster image and remediate, not to install it by hand on one host where it will drift out again. If you see RmInitAdapter failed with the card present and the VIB loaded, the usual cause is an ECC state change or a memory training failure after a firmware update. In practice, the most common RmInitAdapter trigger I see is a GPU that was flashed or moved between hosts without a clean power cycle. A full host power drain, not a soft reboot, clears it more often than any driver reinstall.
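The three host checks above reduce to a short decision order. The sketch below assumes you have captured the output of lspci, the VIB list, and the kernel log as text; `classify_host` and its return labels are hypothetical, and the VIB match on "nvd" simply mirrors the grep used earlier.

```shell
# $1 = lspci output, $2 = "esxcli software vib list" output, $3 = kernel log text
classify_host() {
  if ! printf '%s\n' "$1" | grep -qi nvidia; then
    echo "hardware"       # card is not on the PCI bus; no driver fixes that
  elif ! printf '%s\n' "$2" | grep -qi nvd; then
    echo "vib-missing"    # put the VIB back in the cluster image and remediate
  elif printf '%s\n' "$3" | grep -q 'RmInitAdapter failed'; then
    echo "init-failed"    # card and VIB present; try a full host power drain
  else
    echo "host-ok"        # host layer is clear; move up to the guest
  fi
}
```

The order matters: each branch is only meaningful once the one before it has been ruled out, which is why a missing card is reported before a missing VIB.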
F2 · The guest driver runs but the GPU is throttled or unlicensed
Symptom: the GPU appears in the guest, workloads start, then performance collapses after about twenty minutes, or the driver reports an unlicensed state outright. vGPU enforces licensing by degrading clocks once the grace period ends, so this looks like a performance bug long before it looks like a licensing one.
$ nvidia-smi -q | grep -A3 -i license
vGPU Software Licensed Product
License Status : Unlicensed (Restricted)
# Token lands here on a Linux guest:
$ ls /etc/nvidia/ClientConfigToken/
$ systemctl restart nvidia-gridd && nvidia-smi -q | grep -i 'License Status'
Three things cause this in a Private AI deployment. First, the client configuration token is missing, expired, or not a valid JWT, so validate the token format before anything else. Second, the guest cannot reach the license server, which on Private AI usually means a proxy in the path: an HTTP proxy that requires authentication is not supported for vGPU license registration, and an HTTPS proxy with a self-signed certificate is not supported for downloading the guest driver. If your environment forces auth proxies, you need a direct route or an exception for the license endpoint. Third, and the one people miss, is branch mismatch: the guest driver, the 580.x series in the current release, must come from the same vGPU branch as the host VIB. A guest driver one major branch ahead of the host will load, register, and then misbehave in ways that read like flaky hardware.
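Two of these checks are easy to script. The sketch below is a shape check only: `looks_like_jwt` confirms three dot-separated base64url segments and does not verify the signature or expiry, and `same_branch` treats the leading number of the driver version as the branch, which is a simplification I am assuming for illustration. Both function names are hypothetical.

```shell
# Shape check only: three non-empty base64url segments separated by dots.
# Does not validate the signature or the expiry claim.
looks_like_jwt() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$'
}

# Compare the leading number of two driver version strings.
# $1 = host driver version, $2 = guest driver version
same_branch() {
  if [ "${1%%.*}" = "${2%%.*}" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}
```

Feed `looks_like_jwt` the contents of the token file under /etc/nvidia/ClientConfigToken/, and give `same_branch` the driver versions reported on the host and in the guest. A "mismatch" result is worth fixing before spending any time on the license server.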
F3 · The VM will not power on, or the pool is silently full
On VCF 9 this failure changed shape. ESXi 9 no longer makes you pin every device to a profile before power-on, and reconfiguration now happens transparently at VM lifecycle events. That removed the old profile not available power-on error for most cases. What replaced it is quieter. VCF 9 introduced DirectPath Profile Pools, a cluster-wide view of consumed and remaining vGPU capacity. When a pool is exhausted, a new Deep Learning VM or a scaled-up worker does not fail loudly. It waits, or it lands without the GPU it expected, and you find out three layers up when a pod cannot schedule. Before you blame a host, check the pool. A profile mismatch still bites when you mix time-sliced and MIG-backed profiles on the same physical GPU, which is not allowed: a GPU is either in time-slicing mode or MIG mode, not both, and a request for the wrong kind will never place.
F4 · The GPU Operator is green but the GPU is not allocatable
Symptom: every pod in gpu-operator looks fine at a glance, yet nvidia.com/gpu shows zero on every node. The operator orchestrates the driver, device plugin, and validators, so a failure in any one of them leaves the node with no advertised GPU even when the others are healthy.
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS
nvidia-driver-daemonset-xxxxx 0/1 CrashLoopBackOff 6
nvidia-device-plugin-daemonset-yyyy 0/1 Init:0/1 0
# What the node actually advertises:
$ kubectl get node <worker> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
0
$ kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx | tail -20
With GPU Operator 25.10.x the failure I see most is the driver container trying to build or load a driver whose branch does not match the host VIB, the same handshake from F2 but inside a container. On a Private AI deployment you generally want the operator running in vGPU mode against the pre-installed guest driver rather than letting it manage a driver that fights the host. If the driver pod is healthy but the device plugin is stuck in Init, the validator cannot see a working GPU, which loops you back to F1 and F2. Walk the operator pods in dependency order, driver first, then toolkit, then device plugin, then validator, and fix the first one that is red rather than restarting all of them. For the full install path, see installing the NVIDIA GPU Operator and vGPU drivers.
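Walking the pods in dependency order can be sketched as a filter over the `kubectl get pods` listing. This assumes the default GPU Operator pod name prefixes and the plain four-column output shown above; `first_red_pod` is a hypothetical helper, and a pod counts as healthy here only if it is Completed or Running with all containers ready.

```shell
# $1 = text output of "kubectl get pods -n gpu-operator"
# Prints the first unhealthy pod in dependency order; returns 1 if none is found.
first_red_pod() {
  pods=$1
  for prefix in nvidia-driver-daemonset nvidia-container-toolkit-daemonset \
                nvidia-device-plugin-daemonset nvidia-operator-validator; do
    bad=$(printf '%s\n' "$pods" | awk -v p="$prefix" '
      index($1, p) == 1 {
        split($2, r, "/")   # READY column, e.g. 0/1
        healthy = ($3 == "Completed") || ($3 == "Running" && r[1] == r[2])
        if (!healthy) { print $1; exit }
      }')
    if [ -n "$bad" ]; then
      echo "$bad"
      return 0
    fi
  done
  return 1
}
```

Run against the listing above, it names the driver daemonset pod rather than the device plugin, which is the point: the device plugin stuck in Init is a consequence, not the cause.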
F5 · The NIMCache never reaches Ready
The NIM Operator pattern is a hard dependency: the NIMCache controller spins up a Kubernetes Job named <nimcache-name>-job to pull model artifacts from NGC, and the NIMService will not start until that cache is Ready. When inference never comes up, check the cache before the service.
$ kubectl get nimcache -n nim
NAME STATUS AGE
meta-llama3-8b NotReady 14m
$ kubectl get jobs -n nim | grep meta-llama3-8b
$ kubectl logs -n nim job/meta-llama3-8b-job --tail=30
# 401 Unauthorized -> NGC API key / image pull secret
# i/o timeout to nvcr.io -> egress or proxy
# No space left on device -> PVC too small for the model
Three causes cover almost every stuck cache. Authentication: the NGC API key or the image pull secret is wrong or expired, which shows as a 401 in the job log. Network: the job cannot reach nvcr.io, which in a Private AI environment usually means egress rules or the same auth-proxy problem from F2. Disk: the cache PVC is too small for the model, and large checkpoints fill it mid-download. Size the cache volume for the biggest model you intend to serve plus headroom, because teams routinely under-provision it for the small model they tested with and then fail on the real one. In an air-gapped deployment the cache pulls from your local mirror instead, so a stuck cache there points at the mirror, covered in the air-gapped deployment guide.
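The three causes map directly onto the three log strings shown above, so the first pass over a cache job log can be mechanical. A minimal sketch, with `classify_cache_log` and its labels as hypothetical names; it matches only the strings quoted in this section and reports anything else as unknown.

```shell
# $1 = text of the cache job log (for example the last 30 lines)
classify_cache_log() {
  case $1 in
    *"401 Unauthorized"*)        echo "auth" ;;      # NGC API key or pull secret
    *"i/o timeout"*)             echo "network" ;;   # egress rules or proxy
    *"No space left on device"*) echo "disk" ;;      # cache PVC too small
    *)                           echo "unknown" ;;
  esac
}
```

For example, `classify_cache_log "$(kubectl logs -n nim job/meta-llama3-8b-job --tail=30)"` turns the log tail into one of the four labels. An "unknown" result means read the log by hand rather than assume the cache is healthy.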
F6 · The NIM pod crashes on CUDA out of memory
Symptom: the NIMService pod starts, loads for a minute, then drops into CrashLoopBackOff with an out-of-memory error in the log. The reflex is to ask for a bigger node. That is usually the wrong move. The pod already has the GPU it asked for; the model profile it selected does not fit that GPU.
$ kubectl logs -n nim meta-llama3-8b-nimservice-xxxx | tail
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
# or:
ERROR No supported model profile for detected hardware
# See what the model actually offers for this GPU:
$ kubectl exec -n nim <pod> -- list-model-profiles
NIM picks a profile based on the GPU it detects, and on a partitioned GPU, whether a MIG slice or a fractional vGPU, the visible memory is a fraction of the card. The fix is to choose a profile that fits: a higher tensor-parallel or pipeline-parallel profile that spreads the model across more partitions, or a lower-precision profile such as FP8 or NVFP4, which cuts the weight footprint to roughly a half or a quarter of FP16 respectively. No supported model profile is the related cousin: the model has no profile at all that matches the slice you gave it, which means the partition is too small for this model and you need a larger vGPU profile or a full GPU. My take: decide the profile from the GPU memory math up front rather than letting NIM auto-select and discovering the ceiling in a crash loop. The sizing arithmetic is in the sizing and cost part, and the serving-layer design is in deploying NIM microservices.
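The memory math is simple enough to run before you deploy. The sketch below estimates weight memory only, as parameters times bytes per parameter divided by the tensor-parallel degree, and calls a profile a fit if the weights stay under 80 percent of visible memory. The 80 percent headroom for KV cache and activations is my assumption, not a NIM rule, and the function names are hypothetical. Bytes per parameter are roughly 2 for FP16, 1 for FP8, and 0.5 for NVFP4.

```shell
# $1 = parameters in billions, $2 = bytes per parameter, $3 = tensor-parallel degree
# Prints weight memory per partition in GiB, one decimal place.
weights_gib() {
  awk -v p="$1" -v b="$2" -v tp="$3" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / tp / (1024 * 1024 * 1024) }'
}

# Same first three arguments, plus $4 = GPU memory visible to the pod in GiB.
# Assumes weights must fit in 80% of visible memory (headroom is a guess).
profile_fits() {
  awk -v w="$(weights_gib "$1" "$2" "$3")" -v v="$4" \
    'BEGIN { if (w <= v * 0.8) print "fits"; else print "too-big" }'
}
```

For an 8B model on a 12 GiB slice, `profile_fits 8 2 1 12` reports too-big at FP16, while `profile_fits 8 1 1 12` (FP8) and `profile_fits 8 2 2 12` (FP16 across two partitions) both report fits. That is the same conclusion the crash loop would have given you, minus the crash loop.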
F7 · The pod sits in Pending and no GPU is offered
Symptom: the NIM or workload pod never schedules. kubectl describe pod shows the familiar line.
$ kubectl describe pod <pod> | grep -A2 Events
Warning FailedScheduling 0/3 nodes are available:
3 Insufficient nvidia.com/gpu.
If F4 already proved the node advertises GPUs, then either they are all in use or the request cannot be satisfied as written. With time-slicing, a node advertises more replicas than physical GPUs, so a request for a whole GPU on a sliced node never matches; with MIG, the requested profile name must exist on a node. Mismatched nodeSelector or missing tolerations for the GPU taint will also hold a pod in Pending forever while the GPUs sit idle. This is exactly where good monitoring pays for itself, because allocatable-versus-requested is a dashboard you should already have from GPU monitoring with VCF Operations.
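The reasoning above splits a Pending pod into three cases, which can be written down as a small check on three numbers. A minimal sketch, assuming you have read the node's allocatable nvidia.com/gpu count, the count already requested by running pods, and the pending pod's request; `pending_reason` and its labels are hypothetical.

```shell
# $1 = allocatable nvidia.com/gpu on the node
# $2 = nvidia.com/gpu already requested by running pods
# $3 = nvidia.com/gpu requested by the pending pod
pending_reason() {
  if [ "$1" -eq 0 ]; then
    echo "not-advertised"        # node offers no GPU at all; go back to F4
  elif [ $(( $1 - $2 )) -lt "$3" ]; then
    echo "all-in-use"            # capacity exists but is already claimed
  else
    echo "check-request-shape"   # resource name, MIG profile, selector, tolerations
  fi
}
```

The third label is the interesting one: if the numbers say there is room and the pod still will not schedule, the request itself is the problem, so compare the resource name, nodeSelector, and tolerations in the pod spec against what the node actually advertises.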
The symptom, cause and fix at a glance
| Symptom | Likely cause | First fix |
|---|---|---|
| No devices were found (F1) | Host VIB missing or GPU not enumerated | Check lspci and the VIB on the host; remediate the cluster image; full power drain |
| Clocks throttle / Unlicensed (F2) | Bad token, auth proxy, or host/guest branch mismatch | Validate JWT token, bypass auth proxy, match driver branches |
| VM waits or lands without GPU (F3) | DirectPath Profile Pool exhausted or mode mismatch | Check pool capacity; do not mix time-slicing and MIG on one GPU |
| nvidia.com/gpu = 0 (F4) | GPU Operator driver or device plugin failing | Fix the first red operator pod in dependency order |
| NIMCache NotReady (F5) | 401, egress block, or undersized PVC | Read the cache job log; fix key, route or volume size |
| CUDA out of memory (F6) | Model profile too big for the partition | Pick a higher TP/PP or FP8/NVFP4 profile, not a bigger node |
| Pod Pending, Insufficient gpu (F7) | Slice/profile request unmatched or taints | Align request to time-slicing/MIG; fix selectors and tolerations |
What I'd Do
Stop debugging Private AI top-down. The errors surface at the model layer because that is where you are looking, but the cause is usually a host VIB, a driver branch, a license token, or a pool that quietly ran dry. Run the five gates from the bottom of the stack, fix the lowest red layer first, and most of these clear in minutes instead of an afternoon. Two of the seven, the branch handshake and the over-sized model profile, account for more wasted hours than the other five combined, so check those reflexively. Which of these has burned you the most: the dark GPU on the host, or the NIM pod that swears it is out of memory?
References
- VMware Private AI Foundation with NVIDIA 9.1, Broadcom TechDocs
- The NVIDIA vGPU Guest Driver Is Shown as Unlicensed, Broadcom TechDocs
- NVIDIA NIM Operator documentation
- Troubleshooting GPU Memory Out-of-Memory Errors, NVIDIA NIM
- NVIDIA GPU not detected in ESXi host after installing drivers, Broadcom Knowledge