Dr. Pranay Jha


GPU Observability and Multi-Tenancy: DCGM, Honest Utilization, and Sharing (NVIDIA AI Series, Part 29)

Why GPU utilization lies, the DCGM profiling fields that tell the truth (SM and Tensor activity), dcgm-exporter into Prometheus, and choosing MIG vs time-slicing for multi-tenancy.

NVIDIA AI Series · Part 29 of 30

TL;DR

The GPU utilization number everyone watches is the metric that lies. It reports the share of the sampling window in which any kernel was running, not how much of the GPU was working, so a job can sit at 100 percent utilization with streaming-multiprocessor activity under 20 percent and the Tensor Cores nearly idle. The honest signals come from DCGM profiling fields: SM active, Tensor Core active, and memory bandwidth, exposed by dcgm-exporter into Prometheus and Grafana, and shipped automatically by the GPU Operator. For multi-tenancy, MIG gives hard, isolated partitions with their own metrics, while time-slicing and MPS share a GPU more loosely with no hard isolation. Stop trusting GPU utilization for capacity and chargeback; measure SM and Tensor activity instead, because the gap between utilization and real activity is usually where half your GPU budget is hiding.

Who this is for: AI-infrastructure architects and platform engineers who operate shared GPU clusters and have to answer for their utilization and cost. Prerequisites: GPU partitioning from Part 6 and the GPU Operator from Part 12. This part is the operations layer: how to see what your GPUs are really doing.

A finance team once told me their GPU cluster was fully utilized and they needed more capacity. Their dashboard showed 95 percent GPU utilization across the fleet. We turned on DCGM profiling and found the streaming multiprocessors were active about 18 percent of the time. They did not need more GPUs. They needed to stop trusting one number. GPU utilization is the most-watched and least-honest metric in AI infrastructure, and learning to look past it is the single biggest lever you have on cost.

The metric that lies

The number reported as GPU utilization, the one nvidia-smi prints and most dashboards graph, measures one thing: the fraction of the sampling window during which at least one kernel was executing on the GPU. It says nothing about how much of the GPU that kernel used. A tiny kernel that touches a handful of the GPU's streaming multiprocessors and leaves the rest idle still drives utilization to 100 percent, because something was running the whole time. Utilization is an occupancy-of-time measure, not an occupancy-of-hardware measure.

This is the difference between utilization and saturation. A GPU can be utilized (busy in time) while badly unsaturated (most of its compute idle). On a modern data-center GPU with over a hundred SMs, thousands of CUDA cores, and hundreds of Tensor Cores, the gap is enormous: workloads routinely show 100 percent utilization with SM activity under 20 percent and Tensor Core activity near zero. If you size capacity, plan purchases, or bill teams off utilization, you are making decisions off a number that cannot tell a saturated GPU from a barely-touched one.
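
To make the time-versus-hardware distinction concrete, here is a minimal Python sketch of a toy model. It is not how the driver or DCGM computes anything; the SM count, the timeline, and both function names are made up for illustration.

```python
# Toy model: each sample records how many SMs had work during that slice.
# TOTAL_SMS and the timeline are invented numbers, not measurements.
TOTAL_SMS = 100


def gpu_utilization(busy_sms_per_sample):
    """Fraction of samples in which at least one SM was busy (occupancy of time)."""
    busy_samples = sum(1 for n in busy_sms_per_sample if n > 0)
    return busy_samples / len(busy_sms_per_sample)


def sm_activity(busy_sms_per_sample, total_sms=TOTAL_SMS):
    """Average fraction of SMs that were busy (occupancy of hardware)."""
    return sum(busy_sms_per_sample) / (len(busy_sms_per_sample) * total_sms)


# A small kernel that keeps 18 of 100 SMs busy in every single sample.
timeline = [18] * 10

print(f"utilization: {gpu_utilization(timeline):.0%}")
print(f"SM activity: {sm_activity(timeline):.0%}")
```

In this toy timeline something is always running, so the time-based number pins at the top while the hardware-based number stays at the fraction of SMs actually in use.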

[Figure: One job, three very different numbers. Same workload, same instant; only the bottom two are honest. Bars: GPU utilization 100%, SM active 18%, Tensor active 5%. The top bar is what most dashboards show. The bottom two are what you are paying for.]
Values are illustrative of a common pattern. A job pinned at 100 percent utilization with single-digit Tensor activity is leaving most of an expensive GPU on the floor.

The metrics that tell the truth

DCGM, the NVIDIA Data Center GPU Manager, exposes a set of profiling fields that measure hardware occupancy directly. These are the numbers to build dashboards and alerts on.

SM and Tensor activity

DCGM_FI_PROF_SM_ACTIVE reports the fraction of time an SM had at least one warp assigned, averaged across all streaming multiprocessors, which is your real compute occupancy. DCGM_FI_PROF_PIPE_TENSOR_ACTIVE reports how busy the Tensor Cores were, which for an LLM workload is the number that actually tracks useful work, since training and inference run on the Tensor Core pipes. A healthy training job pushes Tensor activity high; a job with high utilization but low Tensor activity is usually bottlenecked on data loading, small batch sizes, or the wrong precision, all fixable once you can see them.

Memory and interconnect

Compute is only half the story. DCGM_FI_PROF_DRAM_ACTIVE shows HBM bandwidth use, which is the real bottleneck for memory-bound inference, and the NVLink and PCIe traffic fields show whether your interconnect is the limit. Read together, these tell you which resource is actually saturated, so you fix the right thing instead of guessing. A job at low SM activity and high DRAM activity is memory-bound and will not go faster on a bigger compute GPU; one at high SM and low DRAM is compute-bound and will.

What you want to know | Do not use | Use instead (DCGM field)
Is the GPU really busy? | GPU utilization | SM_ACTIVE
Doing useful AI math? | GPU utilization | PIPE_TENSOR_ACTIVE
Memory-bound? | Memory used (GB) | DRAM_ACTIVE
Interconnect-bound? | (usually not watched) | NVLINK / PCIE traffic
Chargeback basis | GPU utilization | SM_ACTIVE over time
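
Reading the fields together can be sketched as a small classifier. This is a rough illustration in Python, not NVIDIA guidance: the function name, the labels, and the `high` and `low` cut-offs are all my assumptions, and the right thresholds depend on your workloads.

```python
def classify_bottleneck(sm_active, tensor_active, dram_active, high=0.6, low=0.3):
    """Rough label for which resource looks saturated.

    All three inputs are 0..1 ratios. The `high` and `low` cut-offs are
    illustrative assumptions and should be tuned per workload.
    """
    if dram_active >= high and sm_active < low:
        return "memory-bound"
    if sm_active >= high and dram_active < low:
        if tensor_active < low:
            return "compute-bound, Tensor Cores mostly idle"
        return "compute-bound"
    if sm_active < low and dram_active < low:
        return "underused (check data loading, batch size, precision)"
    return "mixed"


# Low SM activity with high DRAM activity: a bigger compute GPU will not help.
print(classify_bottleneck(sm_active=0.20, tensor_active=0.05, dram_active=0.80))
# High SM activity with low DRAM activity: more compute will help.
print(classify_bottleneck(sm_active=0.90, tensor_active=0.70, dram_active=0.10))
```

The point of the sketch is the ordering of the checks, not the numbers: decide which resource is saturated first, then decide what to fix.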

DCGM and dcgm-exporter

DCGM is the management and telemetry layer for data-center GPUs: health checks, diagnostics, configuration, and the profiling fields above. The piece you wire into a monitoring stack is dcgm-exporter, which reads DCGM and publishes Prometheus metrics, including per-MIG-instance metrics when a GPU is partitioned. If you deployed the GPU Operator from earlier in this series, you already have dcgm-exporter running; the work is choosing the right fields and building dashboards and alerts on them rather than on the default utilization gauge.

# watch the honest fields live on a host
dcgmi dmon -e 1002,1004,1005
#   1002 = SM_ACTIVE, 1004 = TENSOR_ACTIVE, 1005 = DRAM_ACTIVE

# in Prometheus, average real compute occupancy across the fleet
avg(DCGM_FI_PROF_SM_ACTIVE)

# alert: GPUs that look busy but are not doing AI math
# note the units: DCGM_FI_DEV_GPU_UTIL is a percentage (0-100), PROF_* fields are ratios (0-1)
DCGM_FI_DEV_GPU_UTIL > 90 and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE < 0.2

# failure mode: some PROF_* fields need multiple profiling passes and
# cannot all be sampled at once on a given GPU; request too many and the
# exporter drops fields or multiplexes, so the values look noisy.

Confirm the exact field IDs, and which profiling fields can be collected together, against the DCGM documentation for your GPU generation and DCGM version, since the multiplexing limits differ by architecture.
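
If you want to sanity-check the same "busy but idle" condition outside Prometheus, a small Python sketch over the exporter's text output could look like the following. The sample text is invented (real dcgm-exporter lines carry more labels), the function names are mine, and I am assuming utilization arrives as a percentage and the profiling field as a 0 to 1 ratio, so check the units on your version.

```python
import re
from collections import defaultdict

# Invented sample in the Prometheus text format. Real exporter output has
# more labels per line (UUID, Hostname, pod, ...), which this parser ignores.
SAMPLE = """
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 98
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0"} 0.04
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 95
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1"} 0.61
"""

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
GPU_LABEL = re.compile(r'gpu="([^"]*)"')


def parse(text):
    """Return {gpu_id: {metric_name: value}} from exposition-format text."""
    out = defaultdict(dict)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skips blank lines and "# HELP" / "# TYPE" comments
        name, labels, value = match.groups()
        gpu = GPU_LABEL.search(labels)
        if gpu:
            out[gpu.group(1)][name] = float(value)
    return out


def busy_but_idle(metrics, util_pct=90.0, tensor_ratio=0.2):
    """GPU ids with high utilization (percent) but low Tensor activity (ratio)."""
    flagged = []
    for gpu, fields in sorted(metrics.items()):
        util = fields.get("DCGM_FI_DEV_GPU_UTIL")
        tensor = fields.get("DCGM_FI_PROF_PIPE_TENSOR_ACTIVE")
        if util is None or tensor is None:
            continue  # a field was not collected for this GPU
        if util > util_pct and tensor < tensor_ratio:
            flagged.append(gpu)
    return flagged


print(busy_but_idle(parse(SAMPLE)))
```

Skipping GPUs with a missing field is deliberate: when profiling fields are dropped or multiplexed, a silent default of zero would flag healthy GPUs.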

[Figure: The telemetry path. The GPU Operator ships the middle of this for you. GPU (hardware counters) -> DCGM (host engine) -> dcgm-exporter (per GPU + per MIG) -> Prometheus (scrape + store) -> Grafana (dashboards + alerts). Choose the SM and Tensor fields here, not the default GPU-util gauge.]
The plumbing is mostly free with the GPU Operator. The judgment is in which fields you alert on, and that is where most teams default to the wrong one.

Multi-tenancy: sharing GPUs honestly

The reason the honest metrics matter is that they expose the case for sharing. If most jobs use a fraction of a GPU, giving each its own whole GPU is waste. There are three ways to share, and they trade isolation against flexibility.

MIG: hard partitions

Multi-Instance GPU splits a supported GPU into as many as seven isolated instances, each with its own dedicated SMs, memory, and memory bandwidth, and each reporting its own DCGM metrics. The isolation is real at the hardware level, so one tenant cannot starve another, which makes MIG the right answer when you are sharing across teams or running untrusted workloads side by side. The trade-off is rigidity: instance sizes are fixed geometries you reconfigure deliberately, not on demand.

Time-slicing and MPS: soft sharing

Time-slicing lets multiple containers take turns on the whole GPU with no memory isolation, and MPS (the Multi-Process Service) lets multiple processes run concurrently on the GPU, again without hard isolation. Both raise occupancy for bursty or small workloads and both are dangerous for multi-tenant isolation, because one job can exhaust memory or compute and take the others down with it. Use them within a trusted team to pack more small jobs onto a GPU; do not use them as a security boundary between tenants.
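
The choice between the three modes reduces to a short decision rule. The Python sketch below only restates the guidance in this section; the function and parameter names are hypothetical, and real placement policy will have more inputs than three booleans.

```python
def choose_sharing_mode(crosses_team_boundary, untrusted_code, needs_concurrency):
    """Pick a GPU sharing mode from the isolation need first, packing second.

    Hypothetical helper: it encodes this article's rule of thumb, nothing more.
    """
    if crosses_team_boundary or untrusted_code:
        return "MIG"  # hardware isolation is the deciding requirement
    if needs_concurrency:
        return "MPS"  # processes run at the same time, one trusted team
    return "time-slicing"  # jobs take turns, one trusted team


print(choose_sharing_mode(crosses_team_boundary=True, untrusted_code=False, needs_concurrency=True))
print(choose_sharing_mode(crosses_team_boundary=False, untrusted_code=False, needs_concurrency=True))
print(choose_sharing_mode(crosses_team_boundary=False, untrusted_code=False, needs_concurrency=False))
```

Isolation is checked before anything else on purpose: no amount of packing efficiency justifies soft sharing across a trust boundary.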

[Figure: Hard partition vs soft share. Isolation is the deciding question. MIG: Tenant A and Tenant B each get dedicated SMs + memory, real isolation. Time-slicing / MPS: Jobs A, B, C share the whole GPU with no memory isolation; higher packing, but one job can hurt the others.]
MIG when tenants must not affect each other; time-slicing or MPS only inside a team that trusts its own jobs. The metrics in this part are how you decide which one a GPU needs.

Mode | Isolation | Per-tenant metrics | Use for
MIG | Hard (hardware) | Yes, per instance | Across teams / untrusted
Time-slicing | None | Coarse | Bursty small jobs, one team
MPS | None (concurrent) | Coarse | Concurrent small inference, one team

My take: chargeback should bill on SM activity over time, not GPU utilization. Billing on utilization rewards teams for holding a GPU at 100 percent while doing almost nothing, which is the opposite of the behavior you want. Bill on real occupancy and suddenly everyone has a reason to raise their Tensor activity, which is the same as getting more out of the cluster you already bought.
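
"SM activity over time" is just an integral of the SM_ACTIVE samples. A minimal Python sketch, assuming evenly spaced samples; the team names, the sample values, and the rate are all hypothetical.

```python
def sm_hours(samples, interval_s):
    """Integrate SM_ACTIVE samples (0..1) taken every interval_s seconds.

    Returns fully-occupied-GPU hours: 1.0 means one GPU at 100% SM activity
    for one hour.
    """
    return sum(samples) * interval_s / 3600.0


# Hypothetical data: two teams each hold one GPU for an hour, sampled every 60 s.
team_a = [0.18] * 60
team_b = [0.72] * 60
RATE_PER_SM_HOUR = 4.00  # made-up price, in whatever currency you bill in

for name, samples in (("team-a", team_a), ("team-b", team_b)):
    hours = sm_hours(samples, interval_s=60)
    print(f"{name}: {hours:.2f} SM-hours, charge {hours * RATE_PER_SM_HOUR:.2f}")
```

The sketch covers only the occupancy term. How you account for the reservation itself, since a held GPU blocks other tenants regardless of what runs on it, is a policy choice it leaves out.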

Worked example

A 64-GPU cluster reports 90 percent average GPU utilization, and the platform team is about to order 32 more GPUs. Before signing, they graph SM_ACTIVE and find the fleet averages about 20 percent real occupancy. The cluster is not full; it is full of jobs that each grip a whole GPU and use a fifth of it.

Two changes follow from the data: enable MIG so the many small inference jobs each get a right-sized instance instead of a whole GPU, and bill teams on SM_ACTIVE so the incentive flips. The effective capacity roughly triples without buying a single GPU, and the purchase is cancelled. None of that is visible on a utilization dashboard, which is exactly why the utilization dashboard was about to cost six figures of unnecessary hardware.
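
The arithmetic behind the example can be sketched in a few lines of Python. The cluster size, the 20 percent figure, and the planned purchase come from the example; the packing factor of three jobs per GPU is my assumption about how a roughly threefold gain could arise, not a measured result.

```python
gpus = 64
sm_active_avg = 0.20      # fleet-average SM_ACTIVE from the example
planned_purchase = 32     # GPUs the team was about to order

# Real work the fleet does today, expressed in fully busy GPUs.
busy_gpu_equivalents = gpus * sm_active_avg

# Assumption: each small job fits a MIG instance sized so that three jobs
# share one GPU, leaving headroom above the 20% average.
jobs_per_gpu_with_mig = 3

slots_today = gpus                               # one job per whole GPU
slots_if_purchased = gpus + planned_purchase     # still one job per GPU
slots_with_mig = gpus * jobs_per_gpu_with_mig    # same hardware, partitioned

print(f"busy GPU-equivalents today: {busy_gpu_equivalents:.1f} of {gpus}")
print(f"job slots: today={slots_today}, after purchase={slots_if_purchased}, "
      f"with MIG={slots_with_mig}")
```

Under that assumption, partitioning the existing 64 GPUs yields more job slots than buying 32 more and continuing to hand out whole GPUs.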

Beyond metrics: health and diagnostics

DCGM is not only a telemetry source; it also runs health checks and diagnostics, and in a large cluster that is what saves you from silent failure. Background health watches monitor for ECC memory errors, thermal throttling, NVLink errors, and PCIe replays, and an on-demand diagnostic (dcgmi diag, at run levels from a quick check to a long stress test) validates a GPU before a job lands on it. At thousands of GPUs a single bad card stalls an entire multi-node training run, so catching it before the job starts is worth far more than discovering it after an hour of wasted compute.

Wire the active health checks into the scheduler so a node that fails a check is cordoned automatically, and run the deeper diagnostic on a cadence and after every hardware event. The same dcgm-exporter that publishes occupancy also exposes health and XID error fields, so your alerting can fire on a degrading GPU instead of waiting for a job to crash on it. Treating GPU health as a monitored, automated property rather than something you check after an outage is the difference between a cluster that loses minutes to a bad card and one that loses days.
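
The cordon decision itself is simple once the health data is in hand. The Python sketch below shows only that decision; the `GpuHealth` record, its field names, and the zero-tolerance thresholds are hypothetical, and collecting the counts from DCGM and calling your scheduler are left out.

```python
from dataclasses import dataclass


@dataclass
class GpuHealth:
    """Hypothetical per-GPU health summary for one monitoring window."""
    node: str
    gpu: int
    xid_errors: int        # XID events seen in the window
    double_bit_ecc: int    # uncorrectable ECC errors seen in the window


def nodes_to_cordon(reports, max_xid=0, max_dbe=0):
    """Return sorted node names where any GPU exceeds the illustrative limits."""
    bad = set()
    for report in reports:
        if report.xid_errors > max_xid or report.double_bit_ecc > max_dbe:
            bad.add(report.node)
    return sorted(bad)


reports = [
    GpuHealth(node="node-a", gpu=0, xid_errors=0, double_bit_ecc=0),
    GpuHealth(node="node-b", gpu=0, xid_errors=0, double_bit_ecc=0),
    GpuHealth(node="node-b", gpu=1, xid_errors=2, double_bit_ecc=0),
]
print(nodes_to_cordon(reports))
```

One bad GPU cordons the whole node here, which matches how a multi-node training job experiences the failure: it cannot use the node's healthy GPUs either.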

What I would actually choose

My recommendation: build your GPU dashboards, alerts, and chargeback on DCGM profiling fields, with SM_ACTIVE and PIPE_TENSOR_ACTIVE as the headline numbers, and relegate GPU utilization to a secondary, clearly-labeled gauge. Why: those fields measure hardware occupancy and useful AI work, which is what capacity and cost decisions actually depend on. When utilization is still fine to use: as a quick liveness check that something is running, never as a measure of how much. What to validate first: which profiling fields your GPU generation can collect simultaneously, because the multiplexing limits will otherwise make your dashboards noisy and you will not trust them.

On sharing, choose by isolation need, not by convenience. Use MIG across teams and for anything untrusted; use time-slicing or MPS only inside a team that owns all the jobs on the GPU. The metrics tell you which jobs are candidates for sharing in the first place, so the observability decision and the multi-tenancy decision are really one decision made with the same data. For the VCF Operations view of these same signals, see the Private AI GPU monitoring walkthrough.

The Verdict

GPU utilization is the metric that lies, and DCGM is how you stop being lied to. SM activity and Tensor activity tell you what your GPUs are really doing, dcgm-exporter puts those fields in Prometheus and Grafana for free once the GPU Operator is running, and the same numbers tell you which workloads should share a GPU and whether they need MIG's hard isolation or can live with soft time-slicing. The highest-return move in GPU operations is to retire utilization as your primary signal and bill on real occupancy. If you do one thing after this part, graph SM_ACTIVE next to GPU utilization on your busiest cluster and look at the gap, because that gap is your budget.

That closes the technical arc of this series. Next, the finale: running the whole NVIDIA stack on-prem and on VMware Cloud Foundation, the cost recap, and my overall verdict on building an AI factory.




About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
