Dr. Pranay Jha

GPU Monitoring with VCF Operations for VMware Private AI: The Signals That Actually Catch a Failing Workload (Private AI Series, Part 17)

VCF Operations gives you GPU dashboards out of the box, but the metric most teams trust is the one that lies. Here is what to watch on a Private AI Foundation estate, why GPU utilization misleads, and the hardware-health signals the default dashboards never surface.

VMware Private AI Series · Part 17 of 24

TL;DR · Key Takeaways

  • VCF Operations 9.x adds real GPU and vGPU metrics at the cluster, host and VM levels, plus a built-in Private AI (GPU) dashboard. It is excellent for capacity, right-sizing and chargeback, and blind to in-guest hardware health.
  • The metric most teams trust, DCGM_FI_DEV_GPU_UTIL, only tells you that at least one kernel was running during the sample window. It will read 100% on a GPU that is doing almost no useful work.
  • Alert on engine-active and memory-bandwidth ratios (GR_ENGINE_ACTIVE, DRAM_ACTIVE), framebuffer pressure, XID errors, ECC row-remap, and vGPU license status. Most outages I see trace back to those last four.
  • Run two planes: VCF Operations for the infrastructure view, DCGM plus your serving layer (NIM, Triton) for the workload view. Neither one alone catches a stalled inference endpoint.

Who this is for: VCF architects and platform operators running GPU workload domains on Private AI Foundation.

Prerequisites: a deployed GPU workload domain, vGPU or MIG-backed hosts, and VCF Operations collecting from vCenter.

Here is a pattern I have watched play out on more than one Private AI engagement. The platform team opens the GPU dashboard, sees every accelerator pinned near 100%, and reports the cluster as healthy and fully booked. Meanwhile the data science team is filing tickets because training throughput dropped by a third overnight and an inference endpoint is timing out. Both teams are looking at real numbers. The problem is that the headline number on most GPU dashboards answers a question nobody actually cares about.

This post is about what to watch instead. VCF Operations gives you a genuinely useful GPU monitoring stack now, but the defaults will let a degrading GPU and a throttled vGPU hide in plain sight. I will walk the signal stack from silicon to dashboard, show where each metric lives, call out the ones that mislead, and end with the short list I actually wire alerts to.

Where each GPU signal actually lives

The single biggest source of monitoring confusion on Private AI is that GPU telemetry comes from two different places that do not see the same thing. VCF Operations and the vSphere Client read GPU metrics through the ESXi host and vCenter: that is the infrastructure plane, and it sees the physical device and the vGPU allocation. DCGM, the NVIDIA Data Center GPU Manager, runs inside the guest or the Kubernetes node and reads the device directly: that is the workload plane, and it sees what the application is doing to the hardware. On a Private AI estate the DCGM exporter ships pre-installed in every Deep Learning VM, and on VKS clusters the NVIDIA GPU Operator deploys it for you.

Knowing which plane owns which signal is what stops you chasing the wrong dashboard at 2am. Here is the map.

Figure: Two planes, one GPU (which tool sees which signal on a Private AI host)
  • Infrastructure plane (right-sizing, placement, who owns the capacity): physical GPU (L40S / H100 / H200: SMs, HBM, ECC, power, temperature, XID) → ESXi host + vGPU manager (per-vGPU allocation, host-level GPU counters) → vCenter (VM / host / cluster GPU metrics) → VCF Operations (Private AI (GPU) dashboard, capacity, chargeback).
  • Workload plane (is the model actually doing useful work): DCGM in guest / VKS node (profiling counters, ECC, XID, license) → dcgm-exporter + Prometheus (scrape, alert rules, Grafana) → serving layer (NIM / Triton: queue depth, request latency, throughput).
  • Hardware-health signals (XID, ECC, row-remap, license) live on the workload plane. The infrastructure plane will not raise them.
VCF Operations owns the infrastructure plane. DCGM and the serving layer own the workload plane. You need both.

If you only take one thing from this section: VCF Operations is the right tool for capacity and right-sizing decisions, and it is the wrong tool for answering "is this specific GPU about to fail." That answer lives in DCGM.

The utilization mirage

The metric everyone reaches for is GPU utilization. In DCGM that is DCGM_FI_DEV_GPU_UTIL, and in nvidia-smi it is the "GPU-Util" column. Here is the catch, straight from how NVIDIA defines it: that number is the percentage of the sampling window during which at least one kernel was running. It says a context was scheduled. It says nothing about how many of the streaming multiprocessors were busy or whether memory bandwidth was the bottleneck.

So a tiny kernel that touches one SM and leaves the other hundred-plus idle reports the same 100% as a fully saturated training step. I have seen teams scale out a cluster because the dashboard showed every GPU "maxed", when the real fix was a batch-size and data-loader change that took one GPU from 100% reported utilization and 9% real engine activity to genuine saturation.
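
To make that arithmetic concrete, here is a minimal toy model. It is a sketch of the definition, not of how the driver actually samples. It assumes a GPU with 132 SMs and a sampling window cut into 100 slices; both numbers and the names `busy_sms_per_slice`, `reported_utilization` and `mean_sm_activity` are illustrative.

```python
def reported_utilization(busy_sms_per_slice):
    """Percent of time slices in which at least one SM was running a kernel."""
    active = sum(1 for busy in busy_sms_per_slice if busy > 0)
    return 100.0 * active / len(busy_sms_per_slice)


def mean_sm_activity(busy_sms_per_slice, total_sms):
    """Percent of all SM-slices that were actually busy over the window."""
    return 100.0 * sum(busy_sms_per_slice) / (total_sms * len(busy_sms_per_slice))


TOTAL_SMS = 132  # assumption: illustrative SM count
SLICES = 100     # assumption: illustrative number of slices per window

tiny_kernel = [1] * SLICES             # one SM busy in every slice
saturated_step = [TOTAL_SMS] * SLICES  # every SM busy in every slice

for name, trace in (("tiny kernel", tiny_kernel), ("saturated step", saturated_step)):
    util = reported_utilization(trace)
    activity = mean_sm_activity(trace, TOTAL_SMS)
    print(f"{name}: utilization {util:.0f}%, SM activity {activity:.1f}%")
```

Both traces produce the same utilization figure, while the SM-activity figure separates them by two orders of magnitude. That gap is what the profiling counters expose.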

Figure: Same GPU_UTIL, very different GPU (why the headline number cannot tell these two apart)
  • GPU A, genuinely saturated: GPU_UTIL 99%, GR_ENGINE_ACTIVE 93%, DRAM_ACTIVE 81%. Verdict: real work, engine and memory both lit up. Scaling out is fair.
  • GPU B, starved pipeline: GPU_UTIL 98%, GR_ENGINE_ACTIVE 11%, DRAM_ACTIVE 8%. Verdict: input-bound. Fix the pipeline, do not buy more GPUs.
Both GPUs report ~100% utilization. Only the profiling counters tell you which one is actually working.

The fix is to stop treating utilization as a saturation metric and start reading the profiling counters that DCGM exposes. These work per-MIG-instance too, which plain GPU_UTIL does not.

# The metrics that actually describe saturation (DCGM field names)
DCGM_FI_PROF_GR_ENGINE_ACTIVE   # graphics/compute engine active ratio (MIG-aware)
DCGM_FI_PROF_SM_ACTIVE          # fraction of SMs with at least one warp resident
DCGM_FI_PROF_SM_OCCUPANCY       # fraction of warp slots actually filled
DCGM_FI_PROF_DRAM_ACTIVE        # memory bandwidth utilization (catches memory-bound)
DCGM_FI_DEV_FB_USED             # framebuffer / VRAM used (MiB)
DCGM_FI_DEV_FB_FREE             # framebuffer / VRAM free (MiB)

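# Live read inside a Deep Learning VM or VKS node
# Field IDs: 1001 GR_ENGINE_ACTIVE, 1002 SM_ACTIVE, 1003 SM_OCCUPANCY, 1005 DRAM_ACTIVE, 252 FB_USED
dcgmi dmon -e 1001,1002,1003,1005,252
# Per-second device stats: power/temp, utilization, clocks, violations, memory, ECC, PCIe throughput
nvidia-smi dmon -s pucvmet -d 1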

My take: GR_ENGINE_ACTIVE and DRAM_ACTIVE together tell you almost everything GPU_UTIL pretends to. If engine-active is low while utilization is high, you are input-bound or kernel-launch-bound, and adding hardware makes the bill worse without making the job faster. This is exactly the kind of judgement the VCF Operations capacity view cannot make for you, because it never sees these counters.
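
As a rough sketch of that judgement in code: the classifier below reads the three counters together and separates a busy-looking but starved GPU from one doing real work. The thresholds (`UTIL_BUSY_PCT`, `LOW_RATIO`) and the `GpuSample` type are assumptions for illustration and need tuning per workload. The profiling fields are treated as 0.0 to 1.0 ratios, so the percentages from the comparison above are divided by 100.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_util_pct: float      # DCGM_FI_DEV_GPU_UTIL, 0-100
    gr_engine_active: float  # DCGM_FI_PROF_GR_ENGINE_ACTIVE, ratio 0.0-1.0
    dram_active: float       # DCGM_FI_PROF_DRAM_ACTIVE, ratio 0.0-1.0


UTIL_BUSY_PCT = 90.0  # assumption: what counts as "pinned" on the dashboard
LOW_RATIO = 0.30      # assumption: below this, engine or memory is mostly idle


def classify(sample: GpuSample) -> str:
    """Label a sample using engine and memory activity, not utilization alone."""
    if sample.gpu_util_pct < UTIL_BUSY_PCT:
        return "not busy"
    if sample.gr_engine_active < LOW_RATIO and sample.dram_active < LOW_RATIO:
        return "input- or launch-bound"
    return "doing real work"


gpu_a = GpuSample(gpu_util_pct=99, gr_engine_active=0.93, dram_active=0.81)
gpu_b = GpuSample(gpu_util_pct=98, gr_engine_active=0.11, dram_active=0.08)

for name, sample in (("GPU A", gpu_a), ("GPU B", gpu_b)):
    print(name, "->", classify(sample))
```

The point of the design is that utilization only gates whether the question is worth asking. The verdict itself comes from the two profiling ratios.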


The signals the default dashboards never show you

The Private AI (GPU) dashboard in VCF Operations is built around utilization, memory, temperature and power: the things that matter for capacity and placement. What it does not surface is hardware health, and those are the signals that turn a quiet degradation into a 3am page. Four of them are worth wiring up before anything else.

  • XID errors (DCGM_FI_DEV_XID_ERRORS). These are the GPU's own fault codes. An XID 79 (GPU has fallen off the bus) or repeated XID 48 (double-bit ECC) means the device is already in trouble. They never reach vCenter as a GPU event.
  • ECC and row-remap (DCGM_FI_DEV_ROW_REMAP_PENDING, DCGM_FI_DEV_ROW_REMAP_FAILURE). On HBM-based parts a pending row remap is the early warning that a memory bank is going. Drain that host on your schedule, not when the job crashes.
  • vGPU license status (DCGM_FI_DEV_VGPU_LICENSE_STATUS). If a guest cannot reach the licensing service, NVIDIA throttles the vGPU. Throughput collapses, GPU_UTIL still looks fine, and nobody thinks to check licensing because the GPU is "busy." This is one of the most common silent slowdowns on vGPU estates.
  • Thermal and power throttle reasons (clocks-throttle reasons, DCGM_FI_DEV_POWER_VIOLATION). A GPU clocking down under a thermal cap looks fully utilized while quietly delivering less work each second.

If you want a refresher on the general VCF Operations monitoring model, I covered the broader picture in VCF Operations monitoring and observability. The point here is that GPU hardware health is the one area where you cannot rely on the infrastructure plane alone.
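
Here is a minimal sketch of pulling those health signals out of the exporter's Prometheus text output. The `SAMPLE` text is hand-written for illustration, not captured from a real exporter, and the label set is trimmed to `gpu` and `Hostname`. It also assumes a license-status value of 1 means licensed; confirm that against your DCGM build before relying on it.

```python
import re

# Hand-written sample in Prometheus exposition format (illustrative only).
SAMPLE = """\
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",Hostname="dlvm-01"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="dlvm-01"} 79
DCGM_FI_DEV_ROW_REMAP_PENDING{gpu="0",Hostname="dlvm-01"} 1
DCGM_FI_DEV_ROW_REMAP_PENDING{gpu="1",Hostname="dlvm-01"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",Hostname="dlvm-01"} 1
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="1",Hostname="dlvm-01"} 0
"""

LINE = re.compile(r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
GPU_LABEL = re.compile(r'gpu="([^"]*)"')
REMAP_FIELDS = ("DCGM_FI_DEV_ROW_REMAP_PENDING", "DCGM_FI_DEV_ROW_REMAP_FAILURE")
LICENSED = 1.0  # assumption: value the exporter reports for a licensed vGPU


def health_findings(text, licensed_value=LICENSED):
    """Return (gpu_id, finding) pairs for XID, row-remap and license problems."""
    findings = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # comment lines, blank lines, series without labels
        gpu = GPU_LABEL.search(match["labels"])
        gpu_id = gpu.group(1) if gpu else "?"
        name, value = match["name"], float(match["value"])
        if name == "DCGM_FI_DEV_XID_ERRORS" and value != 0:
            findings.append((gpu_id, f"XID {int(value)}"))
        elif name in REMAP_FIELDS and value > 0:
            findings.append((gpu_id, name))
        elif name == "DCGM_FI_DEV_VGPU_LICENSE_STATUS" and value != licensed_value:
            findings.append((gpu_id, "vGPU unlicensed"))
    return findings


for gpu_id, finding in health_findings(SAMPLE):
    print(f"gpu {gpu_id}: {finding}")
```

In practice a Prometheus alert rule does this job. The sketch is only meant to show how little logic sits between the raw exporter output and a useful health signal.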

# Confirm the exporter is up on a VKS cluster
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter
kubectl -n gpu-operator get servicemonitor

# vGPU view from the ESXi host (per-vGPU, license state, fb usage)
nvidia-smi vgpu -q

# Prometheus alert rule (rules-file YAML): pending row-remap on any GPU
# Confirm the field is enabled in your exporter's counter list first
#   - alert: GpuRowRemapPending
#     expr: DCGM_FI_DEV_ROW_REMAP_PENDING > 0
#     for: 5m
#     labels:
#       severity: critical

Symptom, cause, where to look

When a GPU workload misbehaves on Private AI, the diagnosis almost always comes down to picking the right plane and the right metric. This is the matrix I keep handy.

Symptom | Likely cause | Where to look
--- | --- | ---
GPU_UTIL pinned at 100%, throughput flat or dropping | Input-bound or kernel-launch-bound; SMs mostly idle | GR_ENGINE_ACTIVE, SM_ACTIVE, DRAM_ACTIVE in DCGM
Throughput halved overnight, no config change | vGPU license lapse, or thermal/power throttle | VGPU_LICENSE_STATUS, clocks-throttle reasons
CUDA job aborts intermittently, GPU "disappears" | Hardware fault, falling off the bus, double-bit ECC | XID_ERRORS, ROW_REMAP_PENDING in DCGM
OOM at model load despite "free" capacity on dashboard | Wrong vGPU profile, fragmented framebuffer | FB_USED / FB_FREE per vGPU; profile sizing
Endpoint slow but GPU metrics look healthy | Queueing in the serving layer, not the GPU | NIM / Triton queue depth and request latency

Figure: Triage flow when a workload is slow
  1. Is GR_ENGINE_ACTIVE high? If no: input/launch-bound. Fix the pipeline, not the hardware.
  2. If yes: is the vGPU license OK, and is the GPU throttled? If the license has lapsed, fix licensing. If both are clean, check the serving layer.
  3. Crashing too? Check XID_ERRORS and ROW_REMAP. Drain the host if it is faulting.
Start from engine-active, not utilization. The branch you take decides which plane you debug in.
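
The triage flow can be sketched as a small function. This is one reading of the flow, not a definitive runbook. The boolean inputs are hypothetical and would come from your own thresholds on GR_ENGINE_ACTIVE, the license field and the throttle reasons. The action for the throttled branch is taken from the symptom matrix, since the flow itself does not spell it out.

```python
def triage(*, engine_active_high, license_ok, throttled, crashing):
    """Return the ordered next steps for a slow GPU workload."""
    steps = []
    if not engine_active_high:
        steps.append("Input/launch-bound: fix the pipeline, not the hardware.")
    elif not license_ok:
        steps.append("vGPU license lapsed: fix licensing.")
    elif throttled:
        steps.append("Thermal/power throttle: check clocks-throttle reasons.")
    else:
        steps.append("GPU plane clean: check the serving layer (queue depth, latency).")
    if crashing:
        # Crashes are checked regardless of which branch was taken above.
        steps.append("Check XID_ERRORS and ROW_REMAP; drain the host if faulting.")
    return steps


for step in triage(engine_active_high=True, license_ok=False, throttled=False, crashing=True):
    print("-", step)
```

Taking the inputs as keyword-only arguments keeps call sites readable, which matters when the function is used from a runbook script at 2am.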
Disclaimer: Before changing alert thresholds, deploying or upgrading the dcgm-exporter, or draining a GPU host, validate exporter and driver interoperability against your vGPU/GPU Operator versions, confirm the metric field IDs exist in your DCGM build, and test alert rules in a non-production workload domain first. Draining a host evacuates running jobs, so coordinate with workload owners.

What I actually alert on

Dashboards are for humans looking; alerts are for problems finding you. After several Private AI builds, the short list that earns its place is narrower than most teams expect. Capacity and utilization belong on a dashboard you review weekly, not on a pager. The pager gets hardware health and the two failure modes that silently waste money.

  1. Any non-zero ROW_REMAP_PENDING or repeated XID_ERRORS on a host: critical, drain and inspect.
  2. vGPU reporting an unlicensed state for more than a few minutes: critical, throughput is already degraded.
  3. GR_ENGINE_ACTIVE low while a job claims the GPU for an extended window: a waste-of-capacity warning, route it to the workload owner, not ops.
  4. Framebuffer above roughly 90% on a vGPU profile: warning, you are one model load from an OOM.
  5. Sustained power or thermal violation: warning, you are paying for clocks you are not getting.
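
Two of the rules above depend on a condition holding over time rather than on a single sample. Here is a minimal sketch of that logic for the framebuffer rule. It approximates total framebuffer as FB_USED + FB_FREE, which ignores any reserved memory. The 90% threshold comes from the list; the five-sample window and the sample values are assumptions for illustration.

```python
def framebuffer_used_pct(fb_used_mib, fb_free_mib):
    """Approximate framebuffer pressure from DCGM_FI_DEV_FB_USED / FB_FREE (MiB)."""
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        raise ValueError("framebuffer total must be positive")
    return 100.0 * fb_used_mib / total


def sustained(flags, min_consecutive):
    """True once the condition has held for min_consecutive samples in a row."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return True
    return False


FB_WARN_PCT = 90.0  # from the alert list: roughly 90% on a vGPU profile
WINDOW = 5          # assumption: five consecutive scrapes before warning

# Illustrative (used MiB, free MiB) pairs, one per scrape.
samples = [(20000, 4000), (22000, 2000), (22500, 1500),
           (22800, 1200), (23000, 1000), (23100, 900)]

flags = [framebuffer_used_pct(used, free) > FB_WARN_PCT for used, free in samples]
print("warn:", sustained(flags, WINDOW))
```

This is the same idea as the `for:` clause in a Prometheus rule: one spike does not page anyone, a sustained condition does.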

Everything else, including raw GPU_UTIL, stays on the VCF Operations dashboard for capacity and chargeback conversations. If you are still deciding how to slice the hardware in the first place, the trade-offs between time-sliced vGPU, MIG and passthrough change which of these metrics even apply, and I worked through that in GPU partitioning for VMware Private AI.

What I'd Do

Use VCF Operations for what it is genuinely good at: GPU capacity, right-sizing, placement and showing each tenant what they consume. Do not ask it to tell you a GPU is failing, because that signal lives in DCGM, and DCGM is already running in your Deep Learning VMs and VKS clusters. Wire a Prometheus scrape and five alert rules against the exporter, lead every performance investigation with engine-active rather than utilization, and treat the vGPU license state as a first-class health signal. That combination catches the failures that the default dashboard, on its own, will quietly let through.

What is the most misleading GPU metric you have been burned by? I would like to hear which signal earned a permanent spot on your pager.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
