Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Troubleshooting VMware Private AI Foundation: 7 Failures That Actually Bite (Private AI Series, Part 23)

The seven failures I hit most often on VMware Private AI Foundation with NVIDIA, from a dark GPU on the ESXi host to a NIM pod crashing on CUDA out of memory, with the real error strings and the checks that isolate each layer.

VMware Private AI Series · Part 23 of 24

TL;DR · Key Takeaways

  • A Private AI stack fails in layers. The symptom you see is usually one or two layers above the actual cause, so work bottom-up: host, then profile, then guest license, then Kubernetes, then the model.
  • If nvidia-smi on the ESXi host shows nothing, stop touching the guest. The GPU never came up on the host.
  • The single most common day-one failure is host and guest vGPU drivers from different branches. Match the branch before anything else.
  • NIM pods that crash on CUDA out of memory are almost never a node-size problem. They are a model-profile problem.
  • In VCF 9, DirectPath Profile Pool exhaustion now looks like a quiet scheduling failure, not a power-on error. Check pool capacity before you blame the host.
Who this is for: admins and architects running VMware Private AI Foundation with NVIDIA on VCF 9.0 or 9.1.
Prerequisites: a deployed GPU workload domain, shell access to an ESXi host and to the Supervisor or TKG cluster, and the NGC credentials used at deployment.

A Private AI stack fails in layers, and the error you read is almost never where the problem lives. nvidia-smi reports no devices, so people reinstall the guest driver when the GPU never came up on the host. A NIM pod crashes on CUDA out of memory, so people ask for a bigger node when the real fix is a smaller model profile. After enough bring-ups you stop reading the symptom and start asking which layer is lying. These are the seven failures I hit most often, from the ESXi host up to the inference pod, with the actual error strings and the checks that isolate each one.

Where each failure lives. The symptom is rarely in the same layer as the cause. Read it bottom-up (top of the stack listed first):
  • Inference app / RAG client: retrieval and prompt issues live here (see Part 14)
  • NIM / model serving (NIMCache, NIMService): cache never ready, pod OOM, no supported profile (F5, F6)
  • Kubernetes + NVIDIA GPU Operator 25.10.x: driver daemonset crash, GPU not allocatable, pods Pending (F4, F7)
  • Guest OS + vGPU guest driver 580.x + license: driver unlicensed, branch mismatch with the host (F2)
  • vGPU profile + DirectPath Profile Pool: profile not available, pool capacity exhausted (F3)
  • ESXi host + vGPU host VIB + physical GPU: VIB missing, GPU not enumerated, RmInitAdapter failed (F1). Start debugging here.
The Private AI stack and the layer each of the seven failures actually sits in.

Debug in this order, not the order the errors arrive

The decision flow below is the order I actually follow. Each gate is a one-line check that either clears a whole layer or stops you there. Run them in sequence, from the host up, and you will isolate almost any Private AI inference failure in five commands or fewer.

Isolate the failing layer:
  • Gate 1: Host nvidia-smi sees the GPU? No: host VIB / physical (F1). Yes: go to gate 2.
  • Gate 2: Guest nvidia-smi works? No: profile (F3) / license (F2). Yes: go to gate 3.
  • Gate 3: GPU allocatable in K8s? No: GPU Operator (F4). Yes: go to gate 4.
  • Gate 4: NIMCache Ready? No: cache job (F5). Yes: go to gate 5.
  • Gate 5: NIMService running? No: OOM (F6) / scheduling (F7). Yes: stack healthy, look at the app.
Five gates, bottom of the stack to the top. Each one clears or convicts a layer.
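
The gate order can be written down as a small shell function. This is a minimal sketch, not part of any VMware or NVIDIA tooling: first_failing_gate is a hypothetical helper that takes the yes/no answer to each gate, in stack order from the host up, and names the first layer that fails.

```shell
# Hypothetical helper: arguments are the yes/no answers to the five gates,
# ordered from the host up. Prints the first layer that fails.
first_failing_gate() {
  local host_gpu=$1 guest_gpu=$2 k8s_alloc=$3 cache_ready=$4 service_up=$5
  if [ "$host_gpu" != yes ]; then
    echo "F1: host VIB or physical GPU"
  elif [ "$guest_gpu" != yes ]; then
    echo "F3/F2: vGPU profile or guest license"
  elif [ "$k8s_alloc" != yes ]; then
    echo "F4: GPU Operator"
  elif [ "$cache_ready" != yes ]; then
    echo "F5: NIMCache job"
  elif [ "$service_up" != yes ]; then
    echo "F6/F7: model profile OOM or scheduling"
  else
    echo "stack healthy: look at the app"
  fi
}
```

The if/elif chain is the point: a later gate is never evaluated while an earlier one is red, which is the same discipline as fixing the lowest failing layer first.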

F1 · The GPU is dark on the host

Symptom: in a Deep Learning VM or a TKG worker, nvidia-smi returns No devices were found or the blunt NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. The instinct is to reinstall the guest driver. Resist it. Check the host first, because a guest can only see a vGPU that the host successfully created.

# On the ESXi host: is the NVIDIA host driver VIB installed, and does it see the GPU?
[root@esx01:~] esxcli software vib list | grep -i -E 'nvd|nvidia'
[root@esx01:~] nvidia-smi -q | head

# Is the GPU even enumerated on the PCI bus?
[root@esx01:~] lspci | grep -i nvidia

# Kernel log when the adapter never inits:
[root@esx01:~] grep -i nvrm /var/log/vmkernel.log | tail
NVRM: GPU 0000:31:00.0: RmInitAdapter failed! (0x62:0x40:1860)
NVRM: GPU 0000:31:00.0: rm_init_adapter failed, device minor number 0

If lspci does not list the card, this is hardware: a GPU that is not seated properly, a dead riser, or a slot that is not bifurcated the way the BOM assumed. No driver fixes that. If the card is listed but the VIB is missing, the host vGPU manager was never installed or fell off during a vSphere Lifecycle Manager remediation. The fix is to put the NVIDIA host driver VIB back into the cluster image and remediate, not to install it by hand on one host where it will drift out again.

If you see RmInitAdapter failed with the card present and the VIB loaded, the usual suspects are an ECC state change or a memory training failure after a firmware update. In practice, the trigger I see most often is a GPU that was flashed or moved between hosts without a clean power cycle. A full host power drain, not a soft reboot, clears it more often than any driver reinstall.
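
When several GPUs share a host, it helps to know which PCI addresses logged the failure. The sketch below assumes the log lines have the shape shown above and that you run it on a workstation against a copy of the kernel log; rminit_failed_gpus is a hypothetical helper name.

```shell
# Hypothetical helper: read kernel log text on stdin and print the PCI address
# of every GPU that logged "RmInitAdapter failed", one address per line.
rminit_failed_gpus() {
  grep 'RmInitAdapter failed' \
    | sed -n 's/.*GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed.*/\1/p' \
    | sort -u
}
```

Usage, assuming the log was copied from the host as vmkernel.log: `rminit_failed_gpus < vmkernel.log`. An empty result with the card listed in lspci points back at the VIB rather than at adapter initialization.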

F2 · The guest driver runs but the GPU is throttled or unlicensed

Symptom: the GPU appears in the guest, workloads start, then performance collapses after about twenty minutes, or the driver reports an unlicensed state outright. vGPU enforces licensing by degrading clocks once the grace period ends, so this looks like a performance bug long before it looks like a licensing one.

$ nvidia-smi -q | grep -A3 -i license
    vGPU Software Licensed Product
        License Status : Unlicensed (Restricted)

# Token lands here on a Linux guest:
$ ls /etc/nvidia/ClientConfigToken/
$ systemctl restart nvidia-gridd && nvidia-smi -q | grep -i 'License Status'

Three things cause this in a Private AI deployment. First, the client configuration token is missing, expired, or not a valid JWT, so validate the token format before anything else. Second, the guest cannot reach the license server, which on Private AI usually means a proxy in the path: an HTTP proxy that requires authentication is not supported for vGPU license registration, and an HTTPS proxy with a self-signed certificate is not supported for downloading the guest driver. If your environment forces auth proxies, you need a direct route or an exception for the license endpoint. Third, and the one people miss, is branch mismatch: the guest driver, the 580.x series in the current release, must come from the same vGPU branch as the host VIB. A guest driver one major branch ahead of the host will load, register, and then misbehave in ways that read like flaky hardware.
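
A quick shape check on the token catches truncated or mangled copies before you chase the network. This is a sketch that takes the token to be a JWT as described above: looks_like_jwt is a hypothetical helper, and it checks structure only, not the signature or the expiry.

```shell
# Hypothetical helper: return 0 if the argument has the three-segment shape of
# a JWT (header.payload.signature, each segment base64url characters only).
# Shape check only: it does not verify the signature or the expiry.
looks_like_jwt() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$'
}
```

Usage on a Linux guest might look like `looks_like_jwt "$(cat /etc/nvidia/ClientConfigToken/*.tok)" || echo "token is malformed"`, where the .tok file pattern is an assumption about how the token file is named in your deployment.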

The branch handshake that quietly fails:
  • ESXi host VIB (vGPU host manager): branch, e.g. 580.x
  • Guest vGPU driver (in DLVM or TKG node): must match the host branch
  • Match: licensed, full clocks.
  • Mismatch: loads, registers, then degrades or reports unlicensed.
  • The GPU Operator driver container counts as the guest driver on TKG worker nodes.
Host VIB and guest driver are a matched pair. Treat the GPU Operator driver image the same way.
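
The comparison itself is easy to script. The sketch below assumes the branch is the leading number of the driver version string, in line with the 580.x naming used here; driver_branch and same_branch are hypothetical helpers, and the version numbers in the usage note are placeholders, not real release numbers.

```shell
# Hypothetical helper: extract the branch (the leading number) from a driver
# version string, for example 580.65.06 -> 580.
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: return 0 when host and guest versions share a branch.
same_branch() {
  [ "$(driver_branch "$1")" = "$(driver_branch "$2")" ]
}
```

Collect each version with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on the host and in the guest, then run something like `same_branch 580.65.06 580.82.07 || echo "branch mismatch"`.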

F3 · The VM will not power on, or the pool is silently full

On VCF 9 this failure changed shape. ESXi 9 no longer makes you pin every device to a profile before power-on, and reconfiguration now happens transparently at VM lifecycle events. That removed the old profile not available power-on error for most cases. What replaced it is quieter. VCF 9 introduced DirectPath Profile Pools, a cluster-wide view of consumed and remaining vGPU capacity. When a pool is exhausted, a new Deep Learning VM or a scaled-up worker does not fail loudly. It waits, or it lands without the GPU it expected, and you find out three layers up when a pod cannot schedule. Before you blame a host, check the pool. A profile mismatch still bites when you mix time-sliced and MIG-backed profiles on the same physical GPU, which is not allowed: a GPU is either in time-slicing mode or MIG mode, not both, and a request for the wrong kind will never place.


F4 · The GPU Operator is green but the GPU is not allocatable

Symptom: every pod in gpu-operator looks fine at a glance, yet nvidia.com/gpu shows zero on every node. The operator orchestrates the driver, device plugin, and validators, so a failure in any one of them leaves the node with no advertised GPU even when the others are healthy.

$ kubectl get pods -n gpu-operator
NAME                                  READY   STATUS             RESTARTS
nvidia-driver-daemonset-xxxxx         0/1     CrashLoopBackOff   6
nvidia-device-plugin-daemonset-yyyy   0/1     Init:0/1           0

# What the node actually advertises (the dot in nvidia.com must be escaped in jsonpath):
$ kubectl get node <worker> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
0

$ kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx --tail=20

With GPU Operator 25.10.x the failure I see most is the driver container trying to build or load a driver whose branch does not match the host VIB, the same handshake from F2 but inside a container. On a Private AI deployment you generally want the operator running in vGPU mode against the pre-installed guest driver rather than letting it manage a driver that fights the host. If the driver pod is healthy but the device plugin is stuck in Init, the validator cannot see a working GPU, which loops you back to F1 and F2. Walk the operator pods in dependency order, driver first, then toolkit, then device plugin, then validator, and fix the first one that is red rather than restarting all of them. For the full install path, see installing the NVIDIA GPU Operator and vGPU drivers.
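
Walking the pods in dependency order can be sketched as a filter over the pod listing. Assumptions: the pod name prefixes below match your GPU Operator release, and a pod counts as healthy when it is Running with all containers ready, or Completed. first_red_operator_pod is a hypothetical helper.

```shell
# Hypothetical helper: read `kubectl get pods -n gpu-operator` output on stdin
# and print the first unhealthy pod in dependency order:
# driver, toolkit, device plugin, validator.
first_red_operator_pod() {
  awk '
    BEGIN {
      n = split("nvidia-driver-daemonset nvidia-container-toolkit-daemonset nvidia-device-plugin-daemonset nvidia-operator-validator", order, " ")
    }
    {
      split($2, r, "/")                      # READY column, e.g. 0/1
      healthy = ($3 == "Completed") || ($3 == "Running" && r[1] == r[2])
      if (healthy) next
      for (i = 1; i <= n; i++)
        if (index($1, order[i]) == 1 && !(i in bad)) bad[i] = $1 " (" $3 ")"
    }
    END {
      for (i = 1; i <= n; i++) if (i in bad) { print bad[i]; exit }
    }
  '
}
```

Usage: `kubectl get pods -n gpu-operator | first_red_operator_pod`. The header row and pods outside the four known prefixes are ignored, so the output is either one pod to fix first or nothing at all.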

F5 · The NIMCache never reaches Ready

The NIM Operator pattern is a hard dependency: the NIMCache controller spins up a Kubernetes Job named <nimcache-name>-job to pull model artifacts from NGC, and the NIMService will not start until that cache is Ready. When inference never comes up, check the cache before the service.

$ kubectl get nimcache -n nim
NAME            STATUS       AGE
meta-llama3-8b  NotReady     14m

$ kubectl get jobs -n nim | grep meta-llama3-8b
$ kubectl logs -n nim job/meta-llama3-8b-job --tail=30
# 401 Unauthorized  -> NGC API key / image pull secret
# i/o timeout to nvcr.io -> egress or proxy
# No space left on device -> PVC too small for the model

Three causes cover almost every stuck cache. Authentication: the NGC API key or the image pull secret is wrong or expired, which shows as a 401 in the job log. Network: the job cannot reach nvcr.io, which in a Private AI environment usually means egress rules or the same auth-proxy problem from F2. Disk: the cache PVC is too small for the model, and large checkpoints fill it mid-download. Size the cache volume for the biggest model you intend to serve plus headroom, because teams routinely under-provision it for the small model they tested with and then fail on the real one. In an air-gapped deployment the cache pulls from your local mirror instead, so a stuck cache there points at the mirror, covered in the air-gapped deployment guide.
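
The three causes map directly onto strings in the job log, so a first-pass triage can be scripted. This is a sketch under the assumption that the log contains the same error strings listed above; nimcache_failure_cause is a hypothetical helper and the first matching pattern wins.

```shell
# Hypothetical helper: read the NIMCache job log on stdin and name the most
# likely cause based on the error strings it contains.
nimcache_failure_cause() {
  local log
  log=$(cat)
  case "$log" in
    *"401 Unauthorized"*)        echo "auth: check the NGC API key and image pull secret" ;;
    *"No space left on device"*) echo "disk: cache PVC too small for the model" ;;
    *"i/o timeout"*)             echo "network: check egress or proxy to nvcr.io" ;;
    *)                           echo "unknown: read the full job log" ;;
  esac
}
```

Usage: `kubectl logs -n nim job/meta-llama3-8b-job --tail=30 | nimcache_failure_cause`. Treat the answer as a pointer to the right layer, not as a diagnosis.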

F6 · The NIM pod crashes on CUDA out of memory

Symptom: the NIMService pod starts, loads for a minute, then drops into CrashLoopBackOff with an out-of-memory error in the log. The reflex is to ask for a bigger node. That is usually the wrong move. The pod already has the GPU it asked for; the model profile it selected does not fit that GPU.

$ kubectl logs -n nim meta-llama3-8b-nimservice-xxxx | tail
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
# or:
ERROR  No supported model profile for detected hardware

# See what the model actually offers for this GPU:
$ kubectl exec -n nim <pod> -- list-model-profiles

NIM picks a profile based on the GPU it detects, and on a partitioned GPU (a MIG slice or a fractional vGPU) the visible memory is a fraction of the card. The fix is to choose a profile that fits: a higher tensor-parallel or pipeline-parallel profile that spreads the model across more partitions, or a lower-precision profile such as FP8, which roughly halves the weight footprint relative to FP16/BF16, or NVFP4, which cuts it further. No supported model profile is the related cousin: the model has no profile that matches the slice you gave it at all, which means the partition is too small for this model and you need a larger vGPU profile or a full GPU. My take: decide the profile from the GPU memory math up front rather than letting NIM auto-select and discovering the ceiling in a crash loop. The sizing arithmetic is in the sizing and cost part, and the serving-layer design is in deploying NIM microservices.
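
The memory math can be roughed out before anything is deployed. The sketch below is a back-of-the-envelope estimate only: it counts model weights as parameters times bits per weight, and holds back a fraction of visible memory for KV cache and runtime overhead. The 0.30 reserve is an assumed placeholder, not an NVIDIA figure, and weights_gb and fits_partition are hypothetical helpers.

```shell
# Hypothetical helper: approximate weight footprint in GB,
# parameters (in billions) x bits per weight / 8.
weights_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

# Hypothetical helper: return 0 if the weights fit in the visible GPU memory
# after holding back a reserve fraction (default 0.30, an assumed placeholder)
# for KV cache and runtime overhead.
fits_partition() {
  local params_b=$1 bits=$2 visible_gb=$3 reserve=${4:-0.30}
  awk -v p="$params_b" -v bits="$bits" -v v="$visible_gb" -v r="$reserve" \
    'BEGIN { exit !((p * bits / 8) <= v * (1 - r)) }'
}
```

For an 8B-parameter model on a hypothetical 20 GB partition, `fits_partition 8 16 20` fails while `fits_partition 8 8 20` succeeds, which is the same conclusion as picking the FP8 profile over a bigger node.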

F7 · The pod sits in Pending and no GPU is offered

Symptom: the NIM or workload pod never schedules. kubectl describe pod shows the familiar line.

$ kubectl describe pod <pod> | grep -A5 '^Events'
  Warning  FailedScheduling  0/3 nodes are available:
  3 Insufficient nvidia.com/gpu.

If F4 already proved the node advertises GPUs, then either they are all in use or the request cannot be satisfied as written. With time-slicing, a node advertises more replicas than physical GPUs, so a request for a whole GPU on a sliced node never matches; with MIG, the requested profile name must exist on a node. Mismatched nodeSelector or missing tolerations for the GPU taint will also hold a pod in Pending forever while the GPUs sit idle. This is exactly where good monitoring pays for itself, because allocatable-versus-requested is a dashboard you should already have from GPU monitoring with VCF Operations.
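
The scheduler message usually says which of these cases you are in. The sketch below matches on fragments of common kube-scheduler event messages; the exact wording varies by Kubernetes version, so treat the patterns as assumptions. pending_reason is a hypothetical helper, and when a message lists several reasons the first matching pattern wins.

```shell
# Hypothetical helper: classify a FailedScheduling message.
# Patterns are fragments of common kube-scheduler wording and may need
# adjusting for your Kubernetes version.
pending_reason() {
  case "$1" in
    *taint*)
      echo "toleration: add a toleration for the GPU taint" ;;
    *"didn't match Pod's node affinity/selector"*)
      echo "selector: fix the nodeSelector or node affinity" ;;
    *"Insufficient nvidia.com/gpu"*)
      echo "capacity: GPUs in use, or the request does not match what nodes advertise" ;;
    *)
      echo "other: read the full event" ;;
  esac
}
```

Usage: pass the message text from the Events section, for example `pending_reason "0/3 nodes are available: 3 Insufficient nvidia.com/gpu."`.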

The symptom, cause and fix at a glance

Symptom | Likely cause | First fix
No devices were found (F1) | Host VIB missing or GPU not enumerated | Check lspci and the VIB on the host; remediate the cluster image; full power drain
Clocks throttle / Unlicensed (F2) | Bad token, auth proxy, or host/guest branch mismatch | Validate JWT token, bypass auth proxy, match driver branches
VM waits or lands without GPU (F3) | DirectPath Profile Pool exhausted or mode mismatch | Check pool capacity; do not mix time-slicing and MIG on one GPU
nvidia.com/gpu = 0 (F4) | GPU Operator driver or device plugin failing | Fix the first red operator pod in dependency order
NIMCache NotReady (F5) | 401, egress block, or undersized PVC | Read the cache job log; fix key, route or volume size
CUDA out of memory (F6) | Model profile too big for the partition | Pick a higher TP/PP or FP8/NVFP4 profile, not a bigger node
Pod Pending, Insufficient gpu (F7) | Slice/profile request unmatched or taints | Align request to time-slicing/MIG; fix selectors and tolerations

Disclaimer: these are diagnostic steps, not blind remediations. Before any host driver change, profile change, or power drain, validate the target BOM, confirm host and guest vGPU branch interoperability, back up VM and cluster state, run prechecks, and test on a non-production host first. A full power drain takes down every workload on the host, so put the host in maintenance mode and evacuate it through vSphere before you pull power.

What I'd Do

Stop debugging Private AI top-down. The errors surface at the model layer because that is where you are looking, but the cause is usually a host VIB, a driver branch, a license token, or a pool that quietly ran dry. Run the five gates from the bottom of the stack, fix the lowest red layer first, and most of these clear in minutes instead of an afternoon. Two of the seven, the branch handshake and the over-sized model profile, account for more wasted hours than the other five combined, so check those reflexively. Which of these has burned you the most: the dark GPU on the host, or the NIM pod that swears it is out of memory?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
