helm install replaces five manual installation steps that are error-prone, version-mismatched, and impossible to audit at scale. The current release is v26.3.1. If your nodes run immutable OS images or pre-baked DGX Base OS, skip the driver component but keep everything else. If you manage more than three GPU nodes in Kubernetes, the Operator is not optional.

Before the GPU Operator existed, onboarding a new GPU node into a Kubernetes cluster meant SSHing in, installing the NVIDIA driver at the right branch version, installing the Container Toolkit, configuring the runtime shim in containerd or CRI-O, deploying the device plugin DaemonSet, wiring up DCGM for Prometheus, and repeating that procedure identically for every node in the pool. One wrong driver branch and your CUDA workloads fail silently at kernel launch. One missed Container Toolkit config and the scheduler sees the GPU but the container cannot open it. Managing that by hand at ten or more nodes is a maintenance liability, not an architecture.
The GPU Operator changes the model. You declare what you want in a ClusterPolicy and the Operator reconciles every node to that state. This post covers what is under the hood, what the Operator does not cover, and the operational failure modes that will catch you if you skip the verification steps.
What the GPU Operator Actually Installs
The Operator is itself a small controller pod running in the gpu-operator namespace. It watches the ClusterPolicy custom resource and deploys a set of DaemonSets on GPU nodes, each managing one software component. The table below lists every component in v26.3.1 with its function and whether it is on by default.
| Component | What It Does | On by Default |
|---|---|---|
| NVIDIA Driver Container | Builds and loads the kernel driver in a container, no host-level install required | Yes |
| NVIDIA Container Toolkit | Configures containerd / CRI-O to inject GPUs via CDI into workload containers | Yes |
| NVIDIA Device Plugin | Advertises nvidia.com/gpu (and MIG slice) resources to the Kubernetes scheduler | Yes |
| GPU Feature Discovery (GFD) | Labels nodes with GPU model, memory, driver version, CUDA compute capability | Yes |
| Node Feature Discovery (NFD) | Detects CPU, PCI, and kernel features; labels nodes for topology-aware scheduling | Yes (disable if NFD already running) |
| DCGM + DCGM Exporter | Collects GPU telemetry (utilization, memory, temperature, ECC, NVLink throughput) and exposes a Prometheus /metrics endpoint | Yes |
| MIG Manager | Watches the nvidia.com/mig.config node label and reconfigures MIG partitions on supported GPUs (A100, H100, H200) | Yes (active on MIG-capable nodes only) |
| Driver Manager (k8s-driver-manager) | Handles safe driver upgrades by draining pods and reloading the kernel module | Yes |
| Operator Validator | Runs a sequential validation chain (driver, toolkit, device plugin, CUDA) and marks nodes Ready | Yes |
| Node Status Exporter | Exports per-node operator state to Prometheus for observability of the Operator itself | Yes |
The ClusterPolicy CRD: One Object to Rule the Cluster
The ClusterPolicy is a cluster-scoped custom resource. There is exactly one per cluster. It is the single configuration surface for the Operator: you change a field in ClusterPolicy and the Operator picks it up and reconciles all affected DaemonSets without any node-level SSH access.
Key top-level spec sections:
- spec.driver: enable/disable driver install, set driver branch, kernel module type (open vs proprietary vs auto), pull secret, RDMA mode
- spec.toolkit: CDI mode, NRI plugin, containerd/CRI-O config paths
- spec.devicePlugin: configMap reference for time-slicing or MIG profiles
- spec.dcgmExporter: enable/disable, metrics config, Prometheus service type
- spec.mig: MIG strategy (single vs mixed)
- spec.migManager: watches the MIG label and drains + reconfigures the node
- spec.gfd: controls the node labels emitted (GPU model, memory, CUDA version)
- spec.nodeStatusExporter: Prometheus endpoint for operator health
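To see which of these sections are actually set on a live cluster, a small wrapper around kubectl can print a handful of fields in one go. This is a minimal sketch: the function name clusterpolicy_summary is hypothetical, and the field paths are assumed to mirror the Helm values of the same name, so confirm them against kubectl explain clusterpolicy.spec on your version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a few ClusterPolicy spec fields as key=value lines.
# Assumes exactly one ClusterPolicy exists, so .items[0] is the one we want.
# The field paths are assumptions based on the Helm values names.
clusterpolicy_summary() {
  local field
  for field in driver.enabled driver.kernelModuleType mig.strategy dcgmExporter.enabled; do
    printf '%s=%s\n' "$field" \
      "$(kubectl get clusterpolicy -o jsonpath="{.items[0].spec.${field}}")"
  done
}

# Usage (needs cluster access):
#   clusterpolicy_summary
```

A field that prints an empty value is either unset or does not exist under that path in your Operator version.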
Reading the ClusterPolicy After Install
After the Helm install completes, inspect what the Operator created:
# List the ClusterPolicy (there is exactly one)
kubectl get clusterpolicy -o wide
# Describe the full spec (no name needed; Helm installs typically name the object cluster-policy)
kubectl describe clusterpolicy
# Check the node labels GFD applied
kubectl get node <gpu-node-name> --show-labels | tr ',' '\n' | grep -E 'nvidia|feature'
Expected output for the node labels command on a healthy H100 node:
nvidia.com/cuda.driver.major=570
nvidia.com/cuda.driver.minor=86
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
nvidia.com/gpu.count=8
nvidia.com/gpu.memory=81920
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-H100-SXM5-80GB
feature.node.kubernetes.io/pci-10de.present=true
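Eyeballing label output gets tedious across a pool, so the same check can be scripted. The sketch below reads labels on stdin and reports which expected keys are absent. The function name missing_gpu_labels and the choice of four "required" keys are my own assumptions, picked from the expected output above, not an official list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads node labels (one per line) on stdin and prints
# every expected GPU label key that is absent. Returns non-zero if any is missing.
missing_gpu_labels() {
  local labels key missing=0
  labels="$(cat)"
  for key in \
    nvidia.com/gpu.present \
    nvidia.com/gpu.product \
    nvidia.com/gpu.count \
    feature.node.kubernetes.io/pci-10de.present
  do
    # -F: match the key literally, "=" anchors the end of the key name
    if ! grep -qF -- "${key}=" <<<"$labels"; then
      echo "missing: ${key}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (needs cluster access):
#   kubectl get node <gpu-node-name> --show-labels | tr ',' '\n' | missing_gpu_labels
```

No output and a zero exit status means all four keys were found.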
Installing the Operator: the Real Command
The minimal install for a standard bare-metal or passthrough-GPU cluster running containerd:
Worked Example
Standard install on a bare-metal Kubernetes cluster with containerd, H100 or B200 nodes, no pre-existing drivers. Every field below is confirmed from the v26.3.1 getting-started page.
# Step 1: add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
# Step 2: create namespace and label it for privileged PSA
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator \
pod-security.kubernetes.io/enforce=privileged
# Step 3: install the Operator
# version=v26.3.1 is the current patch release as of June 2026
helm install --wait --generate-name \
-n gpu-operator \
nvidia/gpu-operator \
--version=v26.3.1
A values override file for a production cluster where you want open kernel modules, MIG in mixed mode, and RDMA enabled:
# values-prod.yaml
driver:
  kernelModuleType: open # recommended for H100/H200/B200 on driver 570+
  rdma:
    enabled: true # enables nvidia-peermem for GPUDirect RDMA
mig:
  strategy: mixed # allows different MIG profiles on the same node
migManager:
  enabled: true
dcgmExporter:
  enabled: true
  service:
    internalTrafficPolicy: Local # avoids cross-node metric scraping
helm install --wait --generate-name \
-n gpu-operator \
nvidia/gpu-operator \
--version=v26.3.1 \
-f values-prod.yaml
Check that all pods are running after install (should take 3-8 minutes on first run while the driver container compiles the kernel module):
kubectl get pods -n gpu-operator -o wide
# Expected healthy output (names are generated; count varies by node count)
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-xxxxx 1/1 Running 0 6m
gpu-operator-xxxxxxx-xxxx 1/1 Running 0 8m
nvidia-container-toolkit-daemonset-xxxxx 1/1 Running 0 6m
nvidia-cuda-validator-xxxxx 0/1 Completed 0 5m
nvidia-dcgm-exporter-xxxxx 1/1 Running 0 6m
nvidia-device-plugin-daemonset-xxxxx 1/1 Running 0 6m
nvidia-driver-daemonset-xxxxx 1/1 Running 0 7m
nvidia-mig-manager-xxxxx 1/1 Running 0 6m
nvidia-operator-validator-xxxxx 1/1 Running 0 5m
If the nvidia-operator-validator pod sits in Init:0/4 for more than 15 minutes, do not debug the validator itself. It reports that something upstream failed. Run kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator and look at the init container statuses. The usual culprits are: (1) the nouveau open-source driver is still loaded in the host kernel; check with lsmod | grep nouveau and blacklist it before the driver container can load nvidia.ko; (2) the container runtime is not configured for the nvidia runtime class yet, causing "no runtime for nvidia is configured" in sandbox creation; (3) on NVSwitch-based systems (DGX H100, GB200 NVL72), the nvidia-fabricmanager must be running on the host for the validator to pass. None of these will self-resolve without action on your part.
Day-2 Operations: Upgrades, MIG Changes, and Driver Rollouts
Driver Upgrades
Driver upgrades go through helm upgrade against the Operator Helm release, not through individual node changes. The Driver Manager drains the node, stops GPU-dependent pods, unloads the old kernel module, loads the new one, and reschedules pods. This is the correct flow because it ensures no GPU workloads are running when the driver is replaced.
# Upgrade the GPU Operator to a newer version
helm upgrade <release-name> nvidia/gpu-operator \
-n gpu-operator \
--version=<new-version> \
-f values-prod.yaml
Get the release name with helm list -n gpu-operator. The upgrade will temporarily drain GPU nodes. Schedule this during a maintenance window for production clusters or use a node-by-node rolling update policy.
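Before picking the maintenance window, it helps to know exactly which pods hold GPUs and will be interrupted by the drain. A rough sketch, assuming GPU pods declare nvidia.com/gpu under resources.limits; the function name gpu_pods is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<namespace/pod><TAB><gpu-limit>" lines on stdin
# and prints only the pods whose GPU limit column is non-empty.
gpu_pods() {
  awk -F'\t' '$2 != "" { print $1 }'
}

# Usage (needs cluster access). The jsonpath emits one line per pod, with an
# empty second column for pods that request no GPU:
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}' \
#     | gpu_pods
```

Pods that consume MIG slices under the mixed strategy use different resource names, so this sketch would not list them.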
MIG Reconfiguration Without SSH
Changing MIG geometry on a node is a one-label operation when the MIG Manager is running. The MIG Manager watches the nvidia.com/mig.config node label. When it changes, the MIG Manager stops GPU pods, applies the new partition configuration, and restarts them. No SSH, no manual MIG CLI commands.
# Change H100 node to 7x 1g.10gb MIG profile
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
# Verify the MIG configuration applied
kubectl get node <gpu-node> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
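The state label does not flip instantly, so automation around a MIG change usually needs to poll it. Here is a minimal sketch of such a loop. The function name wait_for_mig, the default timeout, and the polling interval are my own choices; it assumes the MIG Manager reports success or failed in nvidia.com/mig.config.state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the mig.config.state label until it reports
# success or failed, or until the timeout (seconds) expires.
# Returns 0 on success, 1 on failed, 2 on timeout.
wait_for_mig() {
  local node="$1" timeout="${2:-600}" interval="${3:-10}" elapsed=0 state=""
  while [ "$elapsed" -lt "$timeout" ]; do
    state="$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
    case "$state" in
      success) echo "MIG config applied on $node"; return 0 ;;
      failed)  echo "MIG config failed on $node" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out waiting for MIG config on $node (last state: ${state:-unknown})" >&2
  return 2
}

# Usage (needs cluster access):
#   kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
#   wait_for_mig <gpu-node> 900
```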
Node Readiness Sequence: What Happens After a Node Joins
When a new GPU node joins the cluster, the Operator detects the feature.node.kubernetes.io/pci-10de.present=true label (set by NFD) and begins deploying DaemonSet pods in a defined order. The validator enforces this order by running as an init-container chain: each stage must complete before the next starts. Understanding this sequence is what lets you debug a stuck install without guessing.
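Because the chain is sequential, the first init container that is not ready tells you which stage is blocking. A small filter can pull that out of the pod status. This is a sketch: the function name first_blocked_stage is hypothetical, and the init container names you see will be whatever your Operator version uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<init-container-name><TAB><ready>" lines on stdin,
# in chain order, and prints the first stage whose ready flag is not "true".
# Exits non-zero when every stage is ready.
first_blocked_stage() {
  awk -F'\t' '$2 != "true" { print $1; found = 1; exit } END { exit !found }'
}

# Usage (needs cluster access):
#   kubectl get pod -n gpu-operator -l app=nvidia-operator-validator \
#     -o jsonpath='{range .items[0].status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}' \
#     | first_blocked_stage
```

Whatever name comes back is the component to investigate; everything after it in the chain is merely waiting.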
When NOT to Use the GPU Operator (or Which Parts to Disable)
The Operator is not the right answer in every situation. Three cases where you should disable specific components or skip the Operator entirely:
| Scenario | What to Do | Why |
|---|---|---|
| Immutable OS / pre-baked images (NVIDIA DGX Base OS, Bottlerocket, Flatcar) | Install Operator with --set driver.enabled=false; keep Toolkit, Device Plugin, DCGM | The driver is bundled in the OS image; the Operator driver container cannot load a competing module anyway |
| Managed cloud node pools with GPU drivers baked in (GKE Accelerator nodes, AKS GPU SKUs with driver extension) | Set driver.enabled=false and toolkit.enabled=false; run Device Plugin + GFD + DCGM only | Cloud providers install drivers and the toolkit as a node extension; duplication causes conflicts |
| Very small cluster (<= 2 GPU nodes) where all config is static | Manual install is feasible but you still need the Device Plugin and DCGM; use the Operator anyway for the validation chain and the upgrade path you will eventually need | Small cluster today becomes 10 nodes after the first project succeeds; retrofitting the Operator later is harder than starting with it |
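If you install across several environments, the table above can be encoded so the right flags are chosen by name instead of from memory. A minimal sketch: the scenario names and the function operator_flags are hypothetical, and the flags are the ones listed in the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a deployment scenario to the extra Helm flags
# from the table above. Scenario names are illustrative only.
operator_flags() {
  case "$1" in
    prebaked-driver) echo "--set driver.enabled=false" ;;
    managed-cloud)   echo "--set driver.enabled=false --set toolkit.enabled=false" ;;
    default)         echo "" ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# Usage (the command substitution is left unquoted on purpose so the flags
# split into separate arguments):
#   helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator \
#     --version=v26.3.1 $(operator_flags prebaked-driver)
```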
What the Operator Does Not Handle
Setting expectations here matters. The GPU Operator is specifically a node-level software provisioner. It does not:
- Handle GPUDirect Storage without an additional step: you need to enable driver.rdma and separately configure the GDS driver. The Operator supports it but it is off by default.
- Manage NVSwitch / fabric manager on multi-GPU tray systems like the DGX H100 or GB200 NVL72: fabricmanager must be installed on the host before the Operator is deployed. Without it, multi-GPU workloads that use NVLink across GPUs will fail at runtime.
- Handle network fabric provisioning — that is the job of the NVIDIA Network Operator, covered in Part 13 of this series. The two operators are independent and can coexist.
- Schedule workloads — it only makes GPUs visible to the scheduler. Scheduling policies, resource quotas, and gang scheduling are your cluster admin responsibility; the GPU Operator does not include Volcano or Yunikorn.
- Manage driver licensing on vGPU nodes — NVAIE license tokens are managed by the vGPU driver and the NVIDIA License System, not by the GPU Operator.
What to Validate Before Calling the Install Done
After all pods reach Running state, run these three checks before declaring the node production-ready:
# 1. GPU count visible to the scheduler
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# Expected: 8 (or however many physical GPUs the node has)
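# 2. Run a CUDA workload to confirm end-to-end path
# (recent kubectl releases no longer accept --limits on kubectl run, so the GPU request goes in --overrides)
kubectl run cuda-test --image=nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04 \
  --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"cuda-test","image":"nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
# Wait for the pod to finish before reading its output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-test --timeout=120s
kubectl logs cuda-test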
# Expected: nvidia-smi table showing driver version ~570.x and GPU name
# 3. Confirm DCGM metrics endpoint is reachable
kubectl exec -n gpu-operator \
$(kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o name | head -1) \
-- curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL | head -5
# Expected: metric lines with instance labels matching your GPU node
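The first check can be made self-verifying by comparing the allocatable count against the nvidia.com/gpu.count label that GFD applies, so nobody has to remember how many GPUs each node should have. This is a sketch under one assumption: the node is not MIG-partitioned, since MIG changes what both numbers mean. The function name check_gpu_count is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the GPUs the scheduler can allocate with the
# count GFD wrote into the node labels. Assumes a node without MIG partitions.
check_gpu_count() {
  local node="$1" allocatable labelled
  allocatable="$(kubectl get node "$node" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
  labelled="$(kubectl get node "$node" \
    -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.count}')"
  if [ -n "$allocatable" ] && [ "$allocatable" = "$labelled" ]; then
    echo "ok: $node advertises $allocatable GPUs"
  else
    echo "mismatch on $node: allocatable=${allocatable:-none} gpu.count=${labelled:-none}" >&2
    return 1
  fi
}

# Usage (needs cluster access):
#   check_gpu_count <gpu-node>
```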
My Take
The GPU Operator is the correct default for any self-managed Kubernetes cluster with NVIDIA GPUs. The alternative — maintaining a custom Ansible playbook or Terraform provisioner that installs the driver at the right branch, configures containerd, and deploys the device plugin DaemonSet — creates a hidden dependency between your infrastructure code and the GPU driver version matrix. The Operator makes that dependency explicit and managed. The only argument against using it is on platforms where the OS is immutable and the driver is pre-baked, and in that case you disable the driver component and use everything else. That is a configuration decision, not a reason to avoid the Operator.
What to validate first before any production rollout: confirm NFD detects your GPU (check for the feature.node.kubernetes.io/pci-10de.present=true label), confirm the nouveau module is blacklisted on the host, and confirm fabricmanager is running if you are using multi-GPU trays with NVSwitch. Those three checks eliminate 80% of first-install failures before you even run helm install.
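The nouveau check is the easiest of the three to script on the host. A minimal sketch that parses lsmod output; the function name nouveau_loaded is hypothetical, and the systemd unit name in the usage comment is an assumption based on the nvidia-fabricmanager name used earlier in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads lsmod output on stdin and succeeds (exit 0)
# only if a module named exactly "nouveau" is listed in the first column.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage on a GPU host:
#   if lsmod | nouveau_loaded; then
#     echo "nouveau is loaded: blacklist it before installing the Operator" >&2
#   fi
#   # On NVSwitch systems (unit name is an assumption):
#   systemctl is-active nvidia-fabricmanager
```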
The Verdict
Use the GPU Operator on every self-managed Kubernetes cluster with NVIDIA GPUs. It turns a five-step manual process with a node-specific failure surface into a single declarative object. The ClusterPolicy CRD gives you a clear audit trail, a single place to change configuration, and the validation chain gives you a structured way to debug when something goes wrong. These are not minor conveniences at scale: they are the difference between a GPU pool you can operate and one that accumulates configuration drift.
When NOT to use the driver component: immutable OS, pre-baked cloud node pools, DGX Base OS. In those cases, disable driver.enabled and keep the rest.
What to validate first: nouveau blacklisted, NFD label present, fabricmanager running on NVSwitch nodes. Get those three right and the rest of the install is deterministic.
If you are building on VMware Cloud Foundation with NVIDIA AI Enterprise and vGPU, the GPU Operator works differently — the driver layer is handled in the guest VM. Read the Private AI series GPU Operator post for that deployment model.
Questions about how the accelerated data-center network integrates with this stack? That is covered next. The Network Operator that manages InfiniBand and RoCE fabric on Kubernetes is the subject of Part 13.
helm upgrade on a production cluster drains GPU nodes and interrupts workloads; schedule it during a maintenance window and test in a non-production environment first.
References
- NVIDIA GPU Operator — About the Operator (docs.nvidia.com, June 2026)
- Installing the NVIDIA GPU Operator v26.3.1 (docs.nvidia.com, May 2026)
- GPU Operator with MIG — NVIDIA GPU Operator Documentation
- Troubleshooting the NVIDIA GPU Operator (docs.nvidia.com)
- DCGM Exporter — NVIDIA DCGM Documentation



