Dr. Pranay Jha


NVIDIA GPU Operator on Kubernetes: ClusterPolicy, Components, and Day-2 Ops (NVIDIA AI Series, Part 12)

The NVIDIA GPU Operator automates every software layer a GPU node needs in Kubernetes, from kernel driver to DCGM metrics, via a single ClusterPolicy CRD. Here is what it installs, how the reconciliation loop works, when to disable the driver component, and the failure modes that will catch you on first install.

NVIDIA AI Series · Part 12 of 30
TL;DR: The NVIDIA GPU Operator is a Kubernetes operator that manages every piece of NVIDIA software a GPU node needs, from the kernel-level driver to DCGM metrics, via a single ClusterPolicy custom resource. One helm install replaces five manual installation steps that are error-prone, liable to version mismatches, and impossible to audit at scale. The current release, as of June 2026, is v26.3.1. If your nodes run immutable OS images or pre-baked DGX Base OS, skip the driver component but keep everything else. If you manage more than three GPU nodes in Kubernetes, the Operator is not optional.
Who this is for: Platform engineers running GPU workloads on Kubernetes, whether self-managed (upstream k8s, RKE2, OpenShift) or on managed services with GPU node pools (GKE, AKS, EKS). You already know what a Helm chart is. You have at least one node with a data-center GPU (A100, H100, H200, B200, or another Ampere-, Hopper-, or Blackwell-class card). You want to know what the Operator actually installs, what the ClusterPolicy CRD controls, how day-2 upgrades work, and when to turn the driver component off.

Before the GPU Operator existed, onboarding a new GPU node into a Kubernetes cluster meant SSHing in, installing the NVIDIA driver at the right branch version, installing the Container Toolkit, configuring the runtime shim in containerd or CRI-O, deploying the device plugin DaemonSet, wiring up DCGM for Prometheus, and repeating that procedure identically for every node in the pool. One wrong driver branch and your CUDA workloads fail silently at kernel launch. One missed Container Toolkit config and the scheduler sees the GPU but the container cannot open it. Managing that by hand at ten or more nodes is a maintenance liability, not an architecture.

The GPU Operator changes the model. You declare what you want in a ClusterPolicy and the Operator reconciles every node to that state. This post covers what is under the hood, what the Operator does not cover, and the operational failure modes that will catch you if you skip the verification steps.

What the GPU Operator Actually Installs

The Operator is itself a small controller pod running in the gpu-operator namespace. It watches the ClusterPolicy custom resource and deploys a set of DaemonSets on GPU nodes, each managing one software component. The table below lists every component in v26.3.1 with its function and whether it is on by default.

Component | What It Does | On by Default
NVIDIA Driver Container | Builds and loads the kernel driver in a container, no host-level install required | Yes
NVIDIA Container Toolkit | Configures containerd / CRI-O to inject GPUs via CDI into workload containers | Yes
NVIDIA Device Plugin | Advertises nvidia.com/gpu (and MIG slice) resources to the Kubernetes scheduler | Yes
GPU Feature Discovery (GFD) | Labels nodes with GPU model, memory, driver version, CUDA compute capability | Yes
Node Feature Discovery (NFD) | Detects CPU, PCI, and kernel features; labels nodes for topology-aware scheduling | Yes (disable if NFD already running)
DCGM + DCGM Exporter | Collects GPU telemetry (utilization, memory, temperature, ECC, NVLink throughput) and exposes a Prometheus /metrics endpoint | Yes
MIG Manager | Watches the nvidia.com/mig.config node label and reconfigures MIG partitions on supported GPUs (A100, H100, H200) | Yes (active on MIG-capable nodes only)
Driver Manager (k8s-driver-manager) | Handles safe driver upgrades by draining pods and reloading the kernel module | Yes
Operator Validator | Runs a sequential validation chain (driver, toolkit, device plugin, CUDA) and marks nodes Ready | Yes
Node Status Exporter | Exports per-node operator state to Prometheus for observability of the Operator itself | Yes
[Diagram: GPU Operator component stack, v26.3.1, Kubernetes node view. Top: GPU Operator controller (watches the ClusterPolicy CRD, reconciles DaemonSets). Middle, DaemonSets per GPU node: Driver Container (kernel module load), Container Toolkit (CDI + runtime config), Device Plugin (nvidia.com/gpu resource), GFD + NFD (node labels), DCGM Exporter (Prometheus /metrics), MIG Manager (MIG geometry changes), Driver Manager (safe in-place upgrades), Validator (driver/toolkit/CUDA check). Bottom: host kernel + GPU hardware (H100 / H200 / B200 / A100), with nvidia.ko loaded by the driver container. All DaemonSets target nodes with label feature.node.kubernetes.io/pci-10de.present=true.]
Figure 1: GPU Operator component stack. The controller reconciles ClusterPolicy into DaemonSets; each DaemonSet manages one software layer on every GPU node.

The ClusterPolicy CRD: One Object to Rule the Cluster

The ClusterPolicy is a cluster-scoped custom resource. There is exactly one per cluster. It is the single configuration surface for the Operator: you change a field in ClusterPolicy and the Operator picks it up and reconciles all affected DaemonSets without any node-level SSH access.

Key top-level spec sections:

  • spec.driver — enable/disable driver install, set driver branch, kernel module type (open vs proprietary vs auto), pull secret, RDMA mode
  • spec.toolkit — CDI mode, NRI plugin, containerd/CRI-O config paths
  • spec.devicePlugin — configMap reference for time-slicing or MIG profiles
  • spec.dcgmExporter — enable/disable, metrics config, Prometheus service type
  • spec.mig — MIG strategy (single vs mixed)
  • spec.migManager — watches the MIG label and drains + reconfigures the node
  • spec.gfd — controls the node labels emitted (GPU model, memory, CUDA version)
  • spec.nodeStatusExporter — Prometheus endpoint for operator health
[Diagram: ClusterPolicy reconciliation loop. kubectl apply (ClusterPolicy change or helm upgrade) -> API server (watch event fires to the controller queue) -> controller Reconcile() (diff desired vs actual) -> DaemonSet update (creates / patches pods on all GPU nodes) -> validator chain on each node (driver OK -> toolkit OK -> device plugin OK -> CUDA OK) -> node status nvidia.com/gpu.present=true (GPU schedulable, DCGM metrics flowing, Prometheus scraping). The watch is continuous: any field change in ClusterPolicy restarts the loop within seconds.]
Figure 2: The reconciliation loop. A ClusterPolicy edit fires a watch event; the controller diffs and patches DaemonSets; the validator chain confirms each node before marking it schedulable.
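
To make the "change a field, let the controller reconcile" idea concrete, here is a minimal sketch that builds a JSON merge patch toggling one ClusterPolicy component. The helper name clusterpolicy_patch is hypothetical, and the resource name cluster-policy in the commented kubectl line is an assumption; confirm the real name with kubectl get clusterpolicy first.

```shell
# Hypothetical helper: build a JSON merge patch that toggles one
# ClusterPolicy component on or off.
# $1 = spec section (for example dcgmExporter), $2 = true or false
clusterpolicy_patch() {
  case "$2" in
    true|false) ;;
    *) echo "usage: clusterpolicy_patch <section> <true|false>" >&2; return 1 ;;
  esac
  printf '{"spec":{"%s":{"enabled":%s}}}\n' "$1" "$2"
}

# Print the patch so it can be reviewed before anything touches the cluster
clusterpolicy_patch dcgmExporter false

# Applying it needs cluster access, so it is left commented out here.
# "cluster-policy" is an assumed resource name.
# kubectl patch clusterpolicy cluster-policy --type merge \
#   -p "$(clusterpolicy_patch dcgmExporter false)"
```

On a Helm-managed install, a direct patch like this is best treated as a quick experiment: the next helm upgrade re-renders the ClusterPolicy from chart values, so permanent changes belong in the values file.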

Reading the ClusterPolicy After Install

After the Helm install completes, inspect what the Operator created:

# List the ClusterPolicy (there is exactly one)
kubectl get clusterpolicy -o wide

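# Describe the full spec
# (Helm installs normally name it cluster-policy; OpenShift OLM installs use gpu-cluster-policy.
#  Use the name printed by the previous command.)
kubectl describe clusterpolicy cluster-policy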

# Check the node labels GFD applied
kubectl get node <gpu-node-name> --show-labels | tr ',' '\n' | grep -E 'nvidia|feature'

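Representative output for the node labels command on a healthy H100 node (values are illustrative; the exact driver, CUDA, memory, and product strings depend on your driver branch and GPU SKU):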

nvidia.com/cuda.driver.major=570
nvidia.com/cuda.driver.minor=86
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=81920
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-H100-SXM5-80GB
feature.node.kubernetes.io/pci-10de.present=true
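
If you want to script that label check rather than read it by eye, a small filter is enough. The sketch below assumes one key=value label per line on stdin; the function name check_gpu_labels is hypothetical, and the two labels it requires are the ones this post treats as the readiness signals.

```shell
# Hypothetical helper: read "key=value" labels (one per line) on stdin and
# verify the two labels this post relies on. Prints each missing label and
# returns non-zero if any is missing.
check_gpu_labels() {
  labels=$(cat)
  missing=0
  for want in \
    'feature.node.kubernetes.io/pci-10de.present=true' \
    'nvidia.com/gpu.present=true'
  do
    # -x matches the whole line, -F treats the label as a fixed string
    if ! printf '%s\n' "$labels" | grep -qxF "$want"; then
      echo "missing: $want"
      missing=1
    fi
  done
  return "$missing"
}

# Example with sample labels: the NFD label is present, the GFD label is not
printf '%s\n' \
  'feature.node.kubernetes.io/pci-10de.present=true' \
  'nvidia.com/gpu.count=8' | check_gpu_labels || echo "node is not GPU-ready yet"

# Against a live node (requires kubectl access; the labels are the last column):
# kubectl get node <gpu-node-name> --show-labels --no-headers \
#   | awk '{print $NF}' | tr ',' '\n' | check_gpu_labels
```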

Installing the Operator: the Real Command

The minimal install for a standard bare-metal or passthrough-GPU cluster running containerd:

Worked Example

Standard install on a bare-metal Kubernetes cluster with containerd, H100 or B200 nodes, no pre-existing drivers. Every field below is confirmed from the v26.3.1 getting-started page.

# Step 1: add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update

# Step 2: create namespace and label it for privileged PSA
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# Step 3: install the Operator
# version=v26.3.1 is the current patch release as of June 2026
helm install --wait --generate-name \
  -n gpu-operator \
  nvidia/gpu-operator \
  --version=v26.3.1

A values override file for a production cluster where you want open kernel modules, MIG in mixed mode, and RDMA enabled:

# values-prod.yaml
driver:
  kernelModuleType: open      # recommended for H100/H200/B200 on driver 570+
  rdma:
    enabled: true             # enables nvidia-peermem for GPUDirect RDMA
mig:
  strategy: mixed             # exposes each MIG profile as its own resource type; profiles may differ within a node
migManager:
  enabled: true
dcgmExporter:
  enabled: true
  service:
    internalTrafficPolicy: Local  # avoids cross-node metric scraping

Pass the override file at install time:

helm install --wait --generate-name \
  -n gpu-operator \
  nvidia/gpu-operator \
  --version=v26.3.1 \
  -f values-prod.yaml

Check that all pods are running after install (should take 3-8 minutes on first run while the driver container compiles the kernel module):

kubectl get pods -n gpu-operator -o wide

# Expected healthy output (names are generated; count varies by node count)
NAME                                                   READY  STATUS    RESTARTS  AGE
gpu-feature-discovery-xxxxx                            1/1    Running   0         6m
gpu-operator-xxxxxxx-xxxx                              1/1    Running   0         8m
nvidia-container-toolkit-daemonset-xxxxx               1/1    Running   0         6m
nvidia-cuda-validator-xxxxx                            0/1    Completed 0         5m
nvidia-dcgm-exporter-xxxxx                             1/1    Running   0         6m
nvidia-device-plugin-daemonset-xxxxx                   1/1    Running   0         6m
nvidia-driver-daemonset-xxxxx                          1/1    Running   0         7m
nvidia-mig-manager-xxxxx                               1/1    Running   0         6m
nvidia-operator-validator-xxxxx                        1/1    Running   0         5m

Gotcha: The most common failure on a first install is the validator pod stuck in Init:0/4 for more than 15 minutes. Do not debug the validator itself; it only reports that something upstream failed. Run kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator and look at the init container statuses. The usual culprits are:

  • The nouveau open-source driver is still loaded in the host kernel. Check with lsmod | grep nouveau and blacklist it so the driver container can load nvidia.ko.
  • The container runtime is not yet configured for the nvidia runtime class, which causes "no runtime for nvidia is configured" during sandbox creation.
  • On NVSwitch-based systems (DGX H100, GB200 NVL72), nvidia-fabricmanager must be running on the host for the validator to pass.

None of these will self-resolve without action on your part.
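
The nouveau check is easy to automate as a host pre-flight. This is a minimal sketch: the function name nouveau_loaded is hypothetical, it reads lsmod output on stdin so it can be exercised without a GPU host, and the blacklist file shown in the comments is the conventional location rather than something the Operator requires.

```shell
# Hypothetical helper: read `lsmod` output on stdin and succeed (exit 0)
# only if a module named exactly "nouveau" is loaded.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit found ? 0 : 1 }'
}

# Run on the GPU host itself; skipped automatically where lsmod is unavailable
if command -v lsmod >/dev/null 2>&1 && lsmod | nouveau_loaded; then
  echo "nouveau is loaded: blacklist it and reboot before installing the Operator"
fi

# A commonly used blacklist file (adjust for your distribution):
#   /etc/modprobe.d/blacklist-nouveau.conf
#     blacklist nouveau
#     options nouveau modeset=0
# then regenerate the initramfs with your distribution's tool and reboot.
```

Matching on the first column avoids false positives from modules that merely list nouveau as a dependency in the "Used by" column.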

Day-2 Operations: Upgrades, MIG Changes, and Driver Rollouts

Driver Upgrades

Driver upgrades go through helm upgrade against the Operator Helm release, not through individual node changes. The Driver Manager drains the node, stops GPU-dependent pods, unloads the old kernel module, loads the new one, and reschedules pods. This is the correct flow because it ensures no GPU workloads are running when the driver is replaced.

# Upgrade the GPU Operator to a newer version
helm upgrade <release-name> nvidia/gpu-operator \
  -n gpu-operator \
  --version=<new-version> \
  -f values-prod.yaml

Get the release name with helm list -n gpu-operator. The upgrade will temporarily drain GPU nodes. Schedule this during a maintenance window for production clusters or use a node-by-node rolling update policy.
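
One way to get the node-by-node behavior is to set the driver upgrade policy explicitly on the same helm upgrade. The sketch below only prints the command so it can be reviewed first. The function name print_driver_upgrade is hypothetical, and the driver.upgradePolicy.* keys are written from memory of the chart's values, so treat them as assumptions and confirm them with helm show values nvidia/gpu-operator --version=<new-version> before running anything.

```shell
# Hypothetical helper: print (do not run) a helm upgrade that pins the driver
# upgrade policy to one node at a time.
# $1 = Helm release name, $2 = target chart version
print_driver_upgrade() {
  cat <<EOF
helm upgrade $1 nvidia/gpu-operator \\
  -n gpu-operator \\
  --version=$2 \\
  -f values-prod.yaml \\
  --set driver.upgradePolicy.autoUpgrade=true \\
  --set driver.upgradePolicy.maxParallelUpgrades=1
EOF
}

# Review the generated command, then paste it into a maintenance-window runbook
print_driver_upgrade "<release-name>" "<new-version>"
```

Printing instead of executing is deliberate: the upgrade drains GPU nodes, so the command is worth a second pair of eyes before it runs.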

MIG Reconfiguration Without SSH

Changing MIG geometry on a node is a one-label operation when the MIG Manager is running. The MIG Manager watches the nvidia.com/mig.config node label. When it changes, the MIG Manager stops GPU pods, applies the new partition configuration, and restarts them. No SSH, no manual MIG CLI commands.

# Change H100 node to 7x 1g.10gb MIG profile
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify the MIG configuration applied
kubectl get node <gpu-node> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
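
Because the reconfiguration is asynchronous, a script usually needs to wait for the state label to settle rather than read it once. Here is a minimal polling sketch. The function names are hypothetical, and the terminal values success and failed are assumptions about what the MIG Manager writes to the state label; check them against your Operator version.

```shell
# Read the MIG state label for a node (same jsonpath as the command above).
mig_state() {
  kubectl get node "$1" \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
}

# Hypothetical helper: poll until the state label reaches a terminal value.
# $1 = node name, $2 = max attempts (default 30), $3 = seconds between attempts (default 10)
# Returns 0 on success, 1 on failed, 2 on timeout.
wait_for_mig() {
  node=$1
  tries=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    state=$(mig_state "$node")
    case "$state" in
      success) echo "success"; return 0 ;;
      failed)  echo "failed";  return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timeout"
  return 2
}

# Usage on a live cluster (requires kubectl access):
# kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
# wait_for_mig <gpu-node> 60 10
```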
[Diagram: ClusterPolicy lifecycle states. States: NotReady (reconciling, DaemonSets deploying), Ready, Disabled (node excluded), Upgrading (driver drain + reload), Degraded (validator failed, component crash). Transition labels: install; validators pass; label node.deploy.operands=false; helm upgrade; upgrade error; upgrade completes / validators pass.]
Figure 3: ClusterPolicy lifecycle states. Ready is the target; Upgrading is transient; Degraded means a validator failed and needs manual investigation.

Node Readiness Sequence: What Happens After a Node Joins

When a new GPU node joins the cluster, the Operator detects the feature.node.kubernetes.io/pci-10de.present=true label (set by NFD) and begins deploying DaemonSet pods in a defined order. The validator enforces this order by running as an init-container chain: each stage must complete before the next starts. Understanding this sequence is what lets you debug a stuck install without guessing.

[Diagram: node readiness sequence after a new GPU node joins the cluster. 1. NFD detects the PCI-10de device and labels the node. 2. Driver builds and loads nvidia.ko. 3. Toolkit writes the CDI config and runtime shim. 4. Device Plugin registers GPUs with the kubelet. 5. DCGM telemetry agent starts scraping. 6. Node is labeled gpu.present=true and becomes schedulable. Common failure points: nouveau conflict or kernel mismatch; CRI-O/containerd socket path wrong; fabricmanager needed on NVSwitch nodes. Validator init containers enforce this order: each stage must complete before the next starts.]
Figure 4: Node readiness sequence. Each stage is gated by the validator. Common failure points are at driver load (nouveau conflict) and toolkit (wrong socket path). DCGM failures on NVSwitch systems need fabricmanager on the host.
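
Since the stages are gated in order, the first init container that is not ready tells you where to look. The sketch below finds it from a plain "name ready" listing. The function name first_unready is hypothetical, and the jsonpath in the commented usage assumes the validator pod reports its stages under initContainerStatuses.

```shell
# Hypothetical helper: read "<init-container-name> <ready>" lines on stdin
# (ready is true or false) and print the first container that is not ready.
# Prints nothing when every stage is ready.
first_unready() {
  awk '$2 != "true" { print $1; exit }'
}

# Example with placeholder stage names
printf '%s\n' 'stage-a true' 'stage-b false' 'stage-c false' | first_unready

# Against a live cluster (requires kubectl access):
# kubectl get pod -n gpu-operator -l app=nvidia-operator-validator \
#   -o jsonpath='{range .items[0].status.initContainerStatuses[*]}{.name}{" "}{.ready}{"\n"}{end}' \
#   | first_unready
```

Whatever name comes back, debug the component that stage validates (driver, toolkit, device plugin, or CUDA), not the validator pod itself.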

When NOT to Use the GPU Operator (or Which Parts to Disable)

The Operator is not the right answer in every situation. Three cases where you should disable specific components or skip the Operator entirely:

Scenario | What to Do | Why
Immutable OS / pre-baked images (NVIDIA DGX Base OS, Bottlerocket, Flatcar) | Install Operator with --set driver.enabled=false; keep Toolkit, Device Plugin, DCGM | The driver is bundled in the OS image; the Operator driver container cannot load a competing module anyway
Managed cloud node pools with GPU drivers baked in (GKE Accelerator nodes, AKS GPU SKUs with driver extension) | Set driver.enabled=false and toolkit.enabled=false; run Device Plugin + GFD + DCGM only | Cloud providers install drivers and the toolkit as a node extension; duplication causes conflicts
Very small cluster (<= 2 GPU nodes) where all config is static | Manual install is feasible but you still need the Device Plugin and DCGM; use the Operator anyway for the validation chain and the upgrade path you will eventually need | Small cluster today becomes 10 nodes after the first project succeeds; retrofitting the Operator later is harder than starting with it

In practice: If you are running GPU nodes on VMware Cloud Foundation with vGPU mediated device assignment, the Operator driver component does not apply: the vGPU driver is managed inside the guest VM by NVIDIA AI Enterprise. The Operator sits above that layer and handles the Kubernetes-facing side: device plugin, DCGM metrics, and feature labeling. For the full VCF deployment model with vGPU drivers and the NVAIE stack, see the Private AI GPU Operator post, which covers that exact setup from the platform side. This post stays on the bare-metal / passthrough-GPU path.

What the Operator Does Not Handle

Setting expectations here matters. The GPU Operator is specifically a node-level software provisioner. It does not:

  • Handle GPUDirect Storage without an additional step: the Operator supports GDS but it is off by default, so you need to enable driver.rdma and separately enable the GDS driver (gds.enabled in the Helm values).
  • Manage NVSwitch / fabric manager on multi-GPU tray systems like the DGX H100 or GB200 NVL72 — fabricmanager must be installed on the host before the Operator is deployed. Without it, multi-GPU workloads that use NVLink across GPUs will fail at runtime.
  • Handle network fabric provisioning — that is the job of the NVIDIA Network Operator, covered in Part 13 of this series. The two operators are independent and can coexist.
  • Schedule workloads — it only makes GPUs visible to the scheduler. Scheduling policies, resource quotas, and gang scheduling are your cluster admin responsibility; the GPU Operator does not include Volcano or Yunikorn.
  • Manage driver licensing on vGPU nodes — NVAIE license tokens are managed by the vGPU driver and the NVIDIA License System, not by the GPU Operator.

What to Validate Before Calling the Install Done

After all pods reach Running state, run these three checks before declaring the node production-ready:

# 1. GPU count visible to the scheduler
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# Expected: 8 (or however many physical GPUs the node has)

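# 2. Run a CUDA workload to confirm end-to-end path
# (recent kubectl releases no longer accept --limits on kubectl run,
#  so the GPU limit is passed through --overrides)
kubectl run cuda-test --image=nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04 \
  --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"cuda-test","image":"nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
# Wait for the pod to finish before reading its logs
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-test --timeout=180s
kubectl logs cuda-test
# Expected: nvidia-smi table showing driver version ~570.x and GPU name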

# 3. Confirm DCGM metrics endpoint is reachable
kubectl exec -n gpu-operator \
  $(kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o name | head -1) \
  -- curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL | head -5
# Expected: metric lines with instance labels matching your GPU node
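
To turn that raw metrics text into something you can compare against the expected GPU count, a small parser helps. This sketch assumes each sample line carries a gpu="N" label and ends with the numeric value, which matches the usual Prometheus exposition layout but should be checked against your exporter's output; the function name dcgm_util_by_gpu is hypothetical.

```shell
# Hypothetical helper: read Prometheus text from the DCGM exporter on stdin
# and print one "gpu=<index> util=<value>" line per GPU.
# Comment lines (# HELP / # TYPE) and other metrics are ignored.
dcgm_util_by_gpu() {
  grep '^DCGM_FI_DEV_GPU_UTIL{' \
    | sed -E 's/.*[{,]gpu="([^"]+)".*\} *([0-9.]+)$/gpu=\1 util=\2/'
}

# Example with two sample lines (UUIDs and hostnames are made up)
printf '%s\n' \
  'DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaa",device="nvidia0",Hostname="node1"} 37' \
  'DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbb",device="nvidia1",Hostname="node1"} 0' \
  | dcgm_util_by_gpu
```

Piping the curl output from check 3 into this function and counting the lines gives a quick cross-check against the allocatable GPU count from check 1.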

My Take

The GPU Operator is the correct default for any self-managed Kubernetes cluster with NVIDIA GPUs. The alternative — maintaining a custom Ansible playbook or Terraform provisioner that installs the driver at the right branch, configures containerd, and deploys the device plugin DaemonSet — creates a hidden dependency between your infrastructure code and the GPU driver version matrix. The Operator makes that dependency explicit and managed. The only argument against using it is on platforms where the OS is immutable and the driver is pre-baked, and in that case you disable the driver component and use everything else. That is a configuration decision, not a reason to avoid the Operator.

What to validate first before any production rollout: confirm the nouveau module is blacklisted on the host, confirm fabricmanager is running if you are using multi-GPU trays with NVSwitch, and, once NFD is running, confirm it detects your GPU (check for the feature.node.kubernetes.io/pci-10de.present=true label). The first two can be checked before you even run helm install, and together the three checks head off most first-install failures.

The Verdict

Use the GPU Operator on every self-managed Kubernetes cluster with NVIDIA GPUs. It turns a five-step manual process with a node-specific failure surface into a single declarative object. The ClusterPolicy CRD gives you a clear audit trail, a single place to change configuration, and the validation chain gives you a structured way to debug when something goes wrong. These are not minor conveniences at scale: they are the difference between a GPU pool you can operate and one that accumulates configuration drift.

When NOT to use the driver component: immutable OS, pre-baked cloud node pools, DGX Base OS. In those cases, disable driver.enabled and keep the rest.

What to validate first: nouveau blacklisted, NFD label present, fabricmanager running on NVSwitch nodes. Get those three right and the rest of the install is deterministic.

If you are building on VMware Cloud Foundation with NVIDIA AI Enterprise and vGPU, the GPU Operator works differently — the driver layer is handled in the guest VM. Read the Private AI series GPU Operator post for that deployment model.

Questions about how the accelerated data-center network integrates with this stack? That is covered next. The Network Operator that manages InfiniBand and RoCE fabric on Kubernetes is the subject of Part 13.

Disclaimer: The Helm commands and ClusterPolicy fields in this post are drawn from the NVIDIA GPU Operator v26.3.1 documentation as of June 2026. Version numbers, default values, and CLI flags change across releases. Always verify against the official getting-started page for the version you are deploying. Running helm upgrade on a production cluster drains GPU nodes and interrupts workloads; schedule it during a maintenance window and test in a non-production environment first.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
