TL;DR · Key Takeaways
- VCF Operations 9.x adds real GPU and vGPU metrics at the cluster, host and VM levels, plus a built-in Private AI (GPU) dashboard. It is excellent for capacity, right-sizing and chargeback, and blind to in-guest hardware health.
- The metric most teams trust, DCGM_FI_DEV_GPU_UTIL, only tells you a context was scheduled. It will read 100% on a GPU that is doing almost no useful work.
- Alert on engine-active and memory-bandwidth ratios (GR_ENGINE_ACTIVE, DRAM_ACTIVE), framebuffer pressure, XID errors, ECC row-remap, and vGPU license status. Most outages I see trace back to those last four.
- Run two planes: VCF Operations for the infrastructure view, DCGM plus your serving layer (NIM, Triton) for the workload view. Neither one alone catches a stalled inference endpoint.
Here is a pattern I have watched play out on more than one Private AI engagement. The platform team opens the GPU dashboard, sees every accelerator pinned near 100%, and reports the cluster as healthy and fully booked. Meanwhile the data science team is filing tickets because training throughput dropped by a third overnight and an inference endpoint is timing out. Both teams are looking at real numbers. The problem is that the headline number on most GPU dashboards answers a question nobody actually cares about.
This post is about what to watch instead. VCF Operations gives you a genuinely useful GPU monitoring stack now, but the defaults will let a degrading GPU and a throttled vGPU hide in plain sight. I will walk the signal stack from silicon to dashboard, show where each metric lives, call out the ones that mislead, and end with the short list I actually wire alerts to.
Where each GPU signal actually lives
The single biggest source of monitoring confusion on Private AI is that GPU telemetry comes from two different places that do not see the same thing. VCF Operations and the vSphere Client read GPU metrics through the ESXi host and vCenter: that is the infrastructure plane, and it sees the physical device and the vGPU allocation. DCGM, the NVIDIA Data Center GPU Manager, runs inside the guest or the Kubernetes node and reads the device directly: that is the workload plane, and it sees what the application is doing to the hardware. On a Private AI estate the DCGM exporter ships pre-installed in every Deep Learning VM, and on VKS clusters the NVIDIA GPU Operator deploys it for you.
Knowing which plane owns which signal is what stops you chasing the wrong dashboard at 2am. Here is the map.
If you only take one thing from this section: VCF Operations is the right tool for capacity and right-sizing decisions, and it is the wrong tool for answering "is this specific GPU about to fail." That answer lives in DCGM.
The utilization mirage
The metric everyone reaches for is GPU utilization. In DCGM that is DCGM_FI_DEV_GPU_UTIL, and in nvidia-smi it is the "GPU-Util" column. Here is the catch, straight from how NVIDIA defines it: that number is the percentage of the sampling window during which at least one kernel was running. It says a context was scheduled. It says nothing about how many of the streaming multiprocessors were busy or whether memory bandwidth was the bottleneck.
So a tiny kernel that touches one SM and leaves the other hundred-plus idle reports the same 100% as a fully saturated training step. I have seen teams scale out a cluster because the dashboard showed every GPU "maxed", when the real fix was a batch-size and data-loader change that took one GPU from 100% reported utilization and 9% real engine activity to genuine saturation.
The fix is to stop treating utilization as a saturation metric and start reading the profiling counters that DCGM exposes. These work per-MIG-instance too, which plain GPU_UTIL does not.
# The metrics that actually describe saturation (DCGM field names)
DCGM_FI_PROF_GR_ENGINE_ACTIVE # graphics/compute engine active ratio (MIG-aware)
DCGM_FI_PROF_SM_ACTIVE # fraction of SMs with at least one warp resident
DCGM_FI_PROF_SM_OCCUPANCY # fraction of warp slots actually filled
DCGM_FI_PROF_DRAM_ACTIVE # memory bandwidth utilization (catches memory-bound)
DCGM_FI_DEV_FB_USED # framebuffer / VRAM used (MiB)
DCGM_FI_DEV_FB_FREE # framebuffer / VRAM free (MiB)
# Live read inside a Deep Learning VM or VKS node
dcgmi dmon -e 1001,1002,1003,1005,252
nvidia-smi dmon -s pucvmet -d 1
My take: GR_ENGINE_ACTIVE and DRAM_ACTIVE together tell you almost everything GPU_UTIL pretends to. If engine-active is low while utilization is high, you are input-bound or kernel-launch-bound, and adding hardware makes the bill worse without making the job faster. This is exactly the kind of judgement the VCF Operations capacity view cannot make for you, because it never sees these counters.
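One way to make that reading mechanical is a small classifier over a single DCGM sample. Treat the following as a sketch, not a reference implementation: the function name `classify_gpu_sample` and all three thresholds are my own illustrative assumptions, not values published by NVIDIA or used by VCF Operations. It expects GPU_UTIL as a percentage and the two profiling fields as ratios between 0.0 and 1.0.

```python
def classify_gpu_sample(gpu_util_pct, gr_engine_active, dram_active,
                        busy_util_pct=90.0, low_engine=0.30, high_dram=0.80):
    """Label one sample. All thresholds are assumptions; tune them per workload."""
    if gpu_util_pct < busy_util_pct:
        return "not-busy"
    if gr_engine_active < low_engine:
        # A context is scheduled, but the compute engine is mostly idle.
        return "input-or-launch-bound"
    if dram_active >= high_dram:
        # The engine is active and memory bandwidth is the ceiling.
        return "memory-bound"
    return "compute-active"


# The "100% reported utilization, 9% real engine activity" case described earlier.
print(classify_gpu_sample(100.0, 0.09, 0.05))
```

Only the last label is a plausible argument for more hardware, and even then only after the batch size and data loader have been ruled out.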
The signals the default dashboards never show you
The Private AI (GPU) dashboard in VCF Operations is built around utilization, memory, temperature and power: the things that matter for capacity and placement. What it does not surface is hardware health, and those are the signals that turn a quiet degradation into a 3am page. Four of them are worth wiring up before anything else.
- XID errors (DCGM_FI_DEV_XID_ERRORS). These are the GPU's own fault codes. An XID 79 (GPU has fallen off the bus) or repeated XID 48 (double-bit ECC) means the device is already in trouble. They never reach vCenter as a GPU event.
- ECC and row-remap (DCGM_FI_DEV_ROW_REMAP_PENDING, DCGM_FI_DEV_ROW_REMAP_FAILURE). On HBM-based parts a pending row remap is the early warning that a memory bank is going. Drain that host on your schedule, not when the job crashes.
- vGPU license status (DCGM_FI_DEV_VGPU_LICENSE_STATUS). If a guest cannot reach the licensing service, NVIDIA throttles the vGPU. Throughput collapses, GPU_UTIL still looks fine, and nobody thinks to check licensing because the GPU is "busy." This is one of the most common silent slowdowns on vGPU estates.
- Thermal and power throttle reasons (clocks-throttle reasons, DCGM_FI_DEV_POWER_VIOLATION). A GPU clocking down under a thermal cap looks fully utilized while quietly delivering less work each second.
For a refresher on how the general VCF Operations monitoring model works, the broader observability picture is covered in VCF Operations monitoring and observability. The point here is that GPU hardware health is the one area where you cannot rely on the infrastructure plane alone.
# Confirm the exporter is up on a VKS cluster
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter
kubectl -n gpu-operator get servicemonitor
# vGPU view from the ESXi host (per-vGPU, license state, fb usage)
nvidia-smi vgpu -q
# Prometheus alert rule: pending row-remap on any GPU (rule-file YAML, shown commented)
# - alert: GpuRowRemapPending
#   expr: DCGM_FI_DEV_ROW_REMAP_PENDING > 0
#   for: 5m
#   labels:
#     severity: critical
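If you want to check the health fields without waiting on a Prometheus rule, the exporter's text output is easy to scan directly. The sketch below rests on a few assumptions: `SAMPLE` is made-up data in the Prometheus exposition format with a trimmed label set (a real exporter emits more labels), `health_findings` is a hypothetical helper name, and samples are assumed to carry no trailing timestamp. It also assumes these fields are enabled in your exporter's metrics list, which is worth confirming before relying on it.

```python
import re

# Made-up sample in the Prometheus exposition format; labels are trimmed.
SAMPLE = '''\
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="node-a"} 79
DCGM_FI_DEV_ROW_REMAP_PENDING{gpu="0",Hostname="node-a"} 1
DCGM_FI_DEV_ROW_REMAP_PENDING{gpu="1",Hostname="node-a"} 0
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 100
'''

HEALTH_FIELDS = {
    "DCGM_FI_DEV_XID_ERRORS",
    "DCGM_FI_DEV_ROW_REMAP_PENDING",
    "DCGM_FI_DEV_ROW_REMAP_FAILURE",
}
LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def health_findings(exposition_text):
    """Return (metric, gpu, value) for every non-zero hardware-health sample."""
    findings = []
    for line in exposition_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and label-less samples
        name, raw_labels, raw_value = match.groups()
        value = float(raw_value)
        if name in HEALTH_FIELDS and value != 0:
            labels = dict(LABEL.findall(raw_labels))
            findings.append((name, labels.get("gpu", "?"), value))
    return findings


for finding in health_findings(SAMPLE):
    print(finding)
```

The GPU_UTIL line is ignored on purpose: a GPU reporting 100% utilization and a pending row remap is exactly the case the default dashboard makes look healthy.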
Symptom, cause, where to look
When a GPU workload misbehaves on Private AI, the diagnosis almost always comes down to picking the right plane and the right metric. This is the matrix I keep handy.
| Symptom | Likely cause | Where to look |
|---|---|---|
| GPU_UTIL pinned at 100%, throughput flat or dropping | Input-bound or kernel-launch-bound; SMs mostly idle | GR_ENGINE_ACTIVE, SM_ACTIVE, DRAM_ACTIVE in DCGM |
| Throughput halved overnight, no config change | vGPU license lapse, or thermal/power throttle | VGPU_LICENSE_STATUS, clocks-throttle reasons |
| CUDA job aborts intermittently, GPU "disappears" | Hardware fault, falling off the bus, double-bit ECC | XID_ERRORS, ROW_REMAP_PENDING in DCGM |
| OOM at model load despite "free" capacity on dashboard | Wrong vGPU profile, fragmented framebuffer | FB_USED / FB_FREE per vGPU; profile sizing |
| Endpoint slow but GPU metrics look healthy | Queueing in the serving layer, not the GPU | NIM / Triton queue depth and request latency |
What I actually alert on
Dashboards are for humans looking; alerts are for problems finding you. After several Private AI builds, the short list that earns its place is narrower than most teams expect. Capacity and utilization belong on a dashboard you review weekly, not on a pager. The pager gets hardware health and the two failure modes that silently waste money.
- Any non-zero ROW_REMAP_PENDING or repeated XID_ERRORS on a host: critical, drain and inspect.
- vGPU reporting as unlicensed for more than a few minutes: critical, throughput is already degraded.
- GR_ENGINE_ACTIVE low while a job claims the GPU for an extended window: a waste-of-capacity warning, route it to the workload owner, not ops.
- Framebuffer above roughly 90% on a vGPU profile: warning, you are one model load from an OOM.
- Sustained power or thermal violation: warning, you are paying for clocks you are not getting.
Everything else, including raw GPU_UTIL, stays on the VCF Operations dashboard for capacity and chargeback conversations. If you are still deciding how to slice the hardware in the first place, the trade-offs between time-sliced vGPU, MIG and passthrough change which of these metrics even apply, and I worked through that in GPU partitioning for VMware Private AI.
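The pager list above can be sketched as a single function over one per-GPU sample. Everything specific here is an assumption for illustration: the dict keys, the helper names `framebuffer_used_ratio` and `pager_alerts`, and both thresholds. It also evaluates one sample at a time, so the "repeated", "few minutes" and "sustained" qualifiers are left to the alerting system's own duration handling. For simplicity it fires on any non-zero XID.

```python
def framebuffer_used_ratio(fb_used_mib, fb_free_mib):
    """Used share of FB_USED + FB_FREE; reserved framebuffer is not counted."""
    total = fb_used_mib + fb_free_mib
    return fb_used_mib / total if total > 0 else 0.0


def pager_alerts(sample, fb_warn_ratio=0.90, idle_engine_ratio=0.30):
    """Turn one per-GPU sample (hypothetical dict layout) into (severity, reason) pairs."""
    alerts = []
    if sample.get("row_remap_pending", 0) > 0 or sample.get("xid", 0) != 0:
        alerts.append(("critical", "hardware fault: drain and inspect"))
    if not sample.get("vgpu_licensed", True):
        alerts.append(("critical", "vGPU unlicensed: throughput degraded"))
    engine = sample.get("gr_engine_active", 1.0)
    if sample.get("job_attached", False) and engine < idle_engine_ratio:
        alerts.append(("warning", "engine mostly idle: notify workload owner"))
    fb_ratio = framebuffer_used_ratio(sample.get("fb_used_mib", 0),
                                      sample.get("fb_free_mib", 0))
    if fb_ratio > fb_warn_ratio:
        alerts.append(("warning", "framebuffer nearly full"))
    # Assumed input: growth of the power/thermal violation counter since the last sample.
    if sample.get("violation_counter_increase", 0) > 0:
        alerts.append(("warning", "power or thermal throttling"))
    return alerts


example = {"xid": 0, "row_remap_pending": 0, "vgpu_licensed": False,
           "job_attached": True, "gr_engine_active": 0.62,
           "fb_used_mib": 22500, "fb_free_mib": 2000}
for severity, reason in pager_alerts(example):
    print(severity, reason)
```

Raw GPU_UTIL does not appear anywhere in the function, which is the point: it informs capacity reviews, not pages.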
What I'd Do
Use VCF Operations for what it is genuinely good at: GPU capacity, right-sizing, placement and showing each tenant what they consume. Do not ask it to tell you a GPU is failing, because that signal lives in DCGM, and DCGM is already running in your Deep Learning VMs and VKS clusters. Wire a Prometheus scrape and five alert rules against the exporter, lead every performance investigation with engine-active rather than utilization, and treat the vGPU license state as a first-class health signal. That combination catches the failures that the default dashboard, on its own, will quietly let through.
What is the most misleading GPU metric you have been burned by? I would like to hear which signal earned a permanent spot on your pager.
References
- Broadcom TechDocs: Monitoring VMware Private AI Foundation with NVIDIA
- Broadcom TechDocs: Private AI (GPU) Dashboards in VCF Operations
- NVIDIA dcgm-exporter: default metrics and DCGM field reference