The first GPU host I inherited had three different driver versions installed from three different one-off installs, a CUDA Toolkit that predated the driver by eighteen months, and no Container Toolkit at all. Containers would start, but nvidia-smi inside the container reported the wrong driver version and half the cuBLAS calls segfaulted at runtime. Nobody had noticed because the workload produced numbers that looked plausible. The host stack is unglamorous plumbing, but it is where silent misconfigurations live. Get it right once, automate it, and everything above it works. Get it wrong and you spend days chasing phantom bugs in TensorRT or NIM that are actually a version mismatch three layers below.
The Three Layers and Why They Are Separate
Most teams install the driver, see CUDA listed in nvidia-smi, and assume they are done. That CUDA version in nvidia-smi is the maximum CUDA version the installed driver can support, not an installed toolkit. Confusion between those two is the single most common source of host-stack bugs I see in the field.
Layer 1: The GPU Driver
The GPU driver is a kernel module. Since driver r560, NVIDIA has made the open-source kernel module the default recommended installation for data-center GPUs. For Hopper (H100/H200) and Blackwell (B200/B300, GB200/GB300 NVL72), the open-source modules are not just recommended but required: the proprietary kernel module is unsupported on these architectures. The current data-center driver series is r595 (April 2026 release). The open modules are dual-licensed MIT/GPLv2 and published at github.com/NVIDIA/open-gpu-kernel-modules with every driver release.
There are two packaging flavors to know: nvidia-driver-<version> (meta-package, pulls in kernel modules + userspace) and nvidia-open (open kernel module variant). On Ubuntu 22.04/24.04 with a data-center GPU you want the -open variant of the current production branch from the CUDA repository, not the distro-packaged driver, which lags by months and often ships the proprietary module.
Layer 2: The CUDA Toolkit
The CUDA Toolkit (currently 13.3 as of June 2026, with 12.9.x as a widely deployed LTS-equivalent) gives you nvcc, the math libraries (cuBLAS, cuFFT, cuSPARSE, and others; cuDNN is a separate download), profiling tools (Nsight), and headers for host-side development. It is not required to run a containerized AI workload. If your team only runs NGC containers, you can skip the host-side Toolkit entirely and just install the driver plus the Container Toolkit. Install the Toolkit when your workflow includes host-side model compilation, custom CUDA kernels, or TensorRT-LLM engine builds outside of containers.
The critical compatibility rule: the installed driver must support a CUDA version greater than or equal to the CUDA Toolkit version. A driver at r595 supports up to CUDA 13.x [VERIFY exact ceiling]. You cannot run a CUDA 13.3 Toolkit on a driver that caps at CUDA 12.x. The nvidia-smi output field labeled "CUDA Version" is the driver-supported ceiling, not the installed toolkit version. Running nvcc --version shows the installed toolkit. Both must be checked; most teams only check one.
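The two readings can be compared mechanically. The following is a minimal sketch, assuming the output formats shown in this article; the function names are hypothetical helpers, not part of any NVIDIA tool, and the parsers only understand plain major.minor versions.

```shell
#!/usr/bin/env bash
# Sketch: compare the driver's CUDA ceiling with the installed toolkit version.
# All function names here are illustrative, not NVIDIA-provided.

# Reads nvidia-smi output on stdin and prints the "CUDA Version" value,
# e.g. "12.4" from a header line containing "CUDA Version: 12.4".
driver_cuda_ceiling() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Reads `nvcc --version` output on stdin and prints the release,
# e.g. "12.9" from "Cuda compilation tools, release 12.9, V12.9.x".
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p' | head -n 1
}

# Succeeds when version $1 <= version $2 (numeric major.minor compare).
cuda_version_le() {
  local a_major=${1%%.*} a_minor=${1#*.}
  local b_major=${2%%.*} b_minor=${2#*.}
  if (( 10#$a_major != 10#$b_major )); then
    (( 10#$a_major < 10#$b_major ))
  else
    (( 10#$a_minor <= 10#$b_minor ))
  fi
}

# Usage on a real host (assumes nvidia-smi and nvcc are on PATH):
#   ceiling=$(nvidia-smi | driver_cuda_ceiling)
#   toolkit=$(nvcc --version | toolkit_cuda_version)
#   cuda_version_le "$toolkit" "$ceiling" \
#     || echo "toolkit $toolkit exceeds driver ceiling $ceiling" >&2
```

Keeping the parsers separate from the comparison makes each piece easy to exercise with captured output from a known-good host.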
Layer 3: The NVIDIA Container Toolkit
The NVIDIA Container Toolkit (current: v1.19.0) is what makes docker run --gpus all work. It consists of nvidia-ctk (the CLI) and libnvidia-container (the runtime hook library). When a container starts, the toolkit injects the host driver’s userspace libraries (libcuda.so, libnvidia-ml.so, and friends) from the host filesystem into the container mount namespace. The container does not need its own driver or kernel module. It uses the host driver libs that were injected at startup. This is why the CUDA runtime inside the container must be compatible with the host driver: the image supplies the CUDA runtime, but the driver libraries it talks to always come from the host, whatever the base image was built against.
Version Compatibility: The Matrix That Bites You
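CUDA has two compatibility modes that determine whether a container image will work on a given host driver: minor-version compatibility (a container built against one release in a CUDA major family can run on a driver from that same family, subject to a minimum driver version) and forward compatibility (a compat package shipped in the image lets a newer CUDA runtime run on an older data-center driver). Understanding both is non-negotiable for production fleet management.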
| Component | Current Version (June 2026) | Compatibility Rule | Breaks When |
|---|---|---|---|
| GPU Driver (open KM) | r595 series | Sets max CUDA runtime version the host supports | Container built against CUDA newer than host driver ceiling |
| CUDA Toolkit (host) | 13.3 / 12.9.x | Must be <= driver-supported CUDA version | Toolkit newer than driver; or toolkit version mismatch with build system |
| NVIDIA Container Toolkit | v1.19.0 | Independent of CUDA version; depends on driver userspace libs present | nvidia-ctk not run post-driver-update; daemon not restarted |
| Container image CUDA runtime | Varies per NGC image | CUDA minor-version compat: same major, newer driver OK; forward compat for older hosts | Container expects CUDA 13.x; host driver only supports CUDA 12.x |
Install Order and What to Run
The install order matters because the NVIDIA Container Toolkit post-configuration step rewrites your container runtime config to point at the NVIDIA runtime. If the runtime config is patched before the driver is fully installed and the daemon restarted, the hooks do not fire correctly. The rule: driver first, reboot, Container Toolkit second, configure, restart daemon.
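The order above can be captured as a two-phase script. This is a sketch under stated assumptions: the CUDA network repository and the Container Toolkit apt repository are already configured on the host, nvidia-open and nvidia-container-toolkit are the package names in use, and the run wrapper and function names are illustrative. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the install order: driver, reboot, Container Toolkit, configure, restart.
# Assumes the NVIDIA apt repositories are already set up on this host.

DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only; set to 0 to execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Phase 1: install the open kernel module driver, then reboot so the
# freshly installed module is the one that gets loaded.
install_driver_phase() {
  run sudo apt-get update
  run sudo apt-get install -y nvidia-open
  run sudo reboot
}

# Phase 2 (run after the reboot): Container Toolkit, runtime config, daemon restart.
install_container_toolkit_phase() {
  run sudo apt-get install -y nvidia-container-toolkit
  run sudo nvidia-ctk runtime configure --runtime=docker
  run sudo systemctl restart docker
}
```

Calling install_driver_phase and, after the reboot, install_container_toolkit_phase with DRY_RUN=1 prints the plan for review; splitting the phases at the reboot keeps the script honest about the fact that it cannot continue across one.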
Gotcha
The NVIDIA package repository ships two driver meta-packages: nvidia-driver-<version> (proprietary kernel module) and nvidia-open-<version> (open kernel module). On a new Hopper or Blackwell system, if you install the non-open variant, the driver appears to load, nvidia-smi shows the GPU, but at runtime you will hit NVML errors or container launch failures because the proprietary module is not supported on those GPU architectures. The fix is to remove the proprietary package and install the open variant. Confirm with cat /proc/driver/nvidia/version and look for “Open” in the kernel module version string.
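The check described in this gotcha is easy to script. A minimal sketch, assuming the version string format quoted in this article ("Open Kernel Module" on the NVRM line); module_flavor is a hypothetical helper name, and the optional file argument exists only so the function can be exercised against saved copies of the file.

```shell
#!/usr/bin/env bash
# Sketch: classify the loaded NVIDIA kernel module as open or proprietary.

# module_flavor [FILE] prints "open", "proprietary", or "missing".
# FILE defaults to /proc/driver/nvidia/version.
module_flavor() {
  local version_file=${1:-/proc/driver/nvidia/version}
  if [ ! -r "$version_file" ]; then
    # No readable version file: the nvidia module is probably not loaded.
    echo "missing"
  elif grep -q '^NVRM version:.*Open Kernel Module' "$version_file"; then
    echo "open"
  else
    echo "proprietary"
  fi
}

# Usage on a host:
#   [ "$(module_flavor)" = "open" ] || echo "open kernel module not loaded" >&2
```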
The Operational Artifact: Verify, Configure, Run
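This is the exact sequence I run on every new GPU host before declaring it ready. All commands are Ubuntu 22.04 / 24.04. The expected output is shown inline. The failure mode to look for is a mismatch between the three version readings: the driver ceiling from nvidia-smi on the host, the toolkit version from nvcc, and what nvidia-smi reports inside a container.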
# ---- STEP 1: Verify the host driver is loaded ----
nvidia-smi
# Expected output (r595, Hopper H100 example):
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 595.xx.xx Driver Version: 595.xx.xx CUDA Version: 13.x [VERIFY] |
# |-------------------------------+----------------------+----------------------+ |
# | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | |
# | 0 NVIDIA H100 80GB HBM3 Off | 00000000:00:05.0 Off | 0 | |
# +-----------------------------------------------------------------------------------------+
#
# NOTE: “CUDA Version: 13.x” here is the DRIVER-SUPPORTED ceiling, NOT an installed toolkit.
# ---- STEP 2: Verify the installed CUDA Toolkit (if you installed it) ----
nvcc --version
# Expected:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2026 NVIDIA Corporation
# Built on ...
# Cuda compilation tools, release 12.9, V12.9.x (or 13.3 if you installed latest)
#
# If nvcc is not found: the Toolkit is NOT installed. That is fine if you only run containers.
# DO NOT confuse the nvcc version with the driver CUDA ceiling shown by nvidia-smi.
# ---- STEP 3: Configure the NVIDIA Container Toolkit for Docker ----
sudo nvidia-ctk runtime configure --runtime=docker
# Expected output:
# INFO[0000] Loading config from /etc/docker/daemon.json
# INFO[0000] Wrote updated config to /etc/docker/daemon.json
# INFO[0000] It is recommended that the Docker daemon be restarted.
# Restart Docker:
sudo systemctl restart docker
# ---- STEP 4: Smoke test - run nvidia-smi inside a container ----
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
# Expected output:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 595.xx.xx Driver Version: 595.xx.xx CUDA Version: 13.x [VERIFY] |
# |-------------------------------+----------------------+----------------------+ |
# | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | |
# | 0 NVIDIA H100 80GB HBM3 Off | 00000000:00:05.0 Off | 0 | |
# +-----------------------------------------------------------------------------------------+
#
# NOTE: the CUDA Version shown here is still the host driver ceiling, not the image's
# CUDA 12.6 runtime, because nvidia-smi and libcuda.so are injected from the host.
#
# Classic failure mode: docker: Error response from daemon: could not select device driver ""
# with capabilities: [[gpu]].
# This means either: (a) nvidia-ctk runtime configure was not run, or
# (b) Docker daemon was not restarted after running it.
# Check /etc/docker/daemon.json -- it must contain:
# {
# "runtimes": {
# "nvidia": {
# "path": "nvidia-container-runtime",
# "runtimeArgs": []
# }
# }
# }
# ---- STEP 5: Verify the kernel module flavor (open vs proprietary) ----
cat /proc/driver/nvidia/version
# Expected for open kernel module:
# NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 595.xx.xx ...
# GCC version: ...
# The word Open in the first line confirms you have the open-source kernel module.
Worked example
A team ran a TensorRT-LLM build container on a host with driver r550 (CUDA ceiling: 12.4). The build container was based on CUDA 12.6. nvidia-smi inside the container reported the host driver correctly. The build started, but linking failed with undefined symbol: __cudaRegisterFatBinaryEnd because the container tried to call a symbol only available in CUDA 12.6+ while the host injected the r550 userspace libraries. The fix: update the host driver to r560+ (CUDA ceiling 12.6+). The container image needed no changes. The lesson: always check the host driver CUDA ceiling against every container image you plan to run, not just the latest one.
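That lesson can be turned into a pre-flight check. The sketch below assumes the image tag encodes the CUDA version, as the nvcr.io/nvidia/cuda tags do, and that GNU sort with -V is available; the function names are hypothetical. It is deliberately conservative: it flags any image newer than the ceiling and does not model minor-version or forward compatibility.

```shell
#!/usr/bin/env bash
# Sketch: flag container images whose CUDA version exceeds the host driver ceiling.

# Prints the major.minor CUDA version encoded in an image tag,
# e.g. "12.6" for nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04.
# Prints nothing when the tag does not start with a version number.
image_cuda_version() {
  local tag=${1##*:}
  echo "$tag" | sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# check_image IMAGE CEILING prints "ok", "too-new", or "unknown".
check_image() {
  local image_version lowest
  image_version=$(image_cuda_version "$1")
  if [ -z "$image_version" ]; then
    echo "unknown"
    return
  fi
  # sort -V orders version strings; the image is fine when it sorts first (or ties).
  lowest=$(printf '%s\n' "$image_version" "$2" | sort -V | head -n 1)
  if [ "$lowest" = "$image_version" ]; then
    echo "ok"
  else
    echo "too-new"
  fi
}

# Usage with the worked example's numbers (r550 host, ceiling 12.4):
#   check_image nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 12.4
```

Running the check over every image in a deployment manifest before a driver rollout catches the case from the worked example without starting a single container.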
Open vs Proprietary Kernel Modules: The Real Difference
The naming causes confusion. The open vs proprietary distinction applies only to the kernel module (the .ko file that loads into the Linux kernel). The userspace drivers, CUDA runtime libraries, and everything you interact with in containers are the same regardless of which kernel module you install. The table below summarizes what actually differs.
| Attribute | Open Kernel Module | Proprietary Kernel Module |
|---|---|---|
| License | MIT / GPLv2 | NVIDIA proprietary |
| Required for Hopper / Blackwell | Yes (mandatory) | No (unsupported) |
| Works on Turing and Ampere | Yes (default since r560) | Yes (still packaged) |
| Source availability | github.com/NVIDIA/open-gpu-kernel-modules | Binary only |
| DKMS / UEFI Secure Boot | Easier to sign; distro MOK flow cleaner | Possible but more friction |
| Container / CUDA behavior | Identical to proprietary | Identical to open |
The practical takeaway: if you have any Hopper, Blackwell, or Grace Hopper systems in your fleet, standardize on the open kernel module across all GPU generations. Running two different kernel module types in a heterogeneous fleet makes your automation more complex for zero benefit.
The Container Toolkit Architecture: What Actually Happens
When you run docker run --gpus all, the flow is: Docker delegates to nvidia-container-runtime, which invokes the OCI runtime hook. The hook calls libnvidia-container, which reads the list of GPU devices assigned to the container from the OCI spec, then bind-mounts the device nodes and the required driver userspace libraries from the host filesystem into the container. The container sees GPUs because its filesystem has been temporarily augmented with the host driver stack.
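The first hop in that flow depends on Docker knowing about nvidia-container-runtime. A minimal sketch of a check for the daemon.json entry shown earlier follows; has_nvidia_runtime is a hypothetical helper, and the grep is a crude text match rather than a real JSON parse, so treat a pass as a hint and not as proof.

```shell
#!/usr/bin/env bash
# Sketch: check that a Docker daemon config registers the NVIDIA runtime.

# has_nvidia_runtime [FILE] succeeds when FILE (default /etc/docker/daemon.json)
# contains a "path" entry ending in nvidia-container-runtime.
has_nvidia_runtime() {
  local config=${1:-/etc/docker/daemon.json}
  [ -r "$config" ] || return 1
  grep -Eq '"path"[[:space:]]*:[[:space:]]*"[^"]*nvidia-container-runtime"' "$config"
}

# Usage on a host:
#   has_nvidia_runtime || echo "run: sudo nvidia-ctk runtime configure --runtime=docker" >&2
```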
CDI: The New Way to Enumerate Devices
Container Device Interface (CDI) is the newer, standardized alternative to the OCI hook approach. With CDI, device specs are generated once by nvidia-ctk cdi generate and written to /etc/cdi/. The container runtime reads the spec at startup instead of invoking a hook at runtime. CDI is now the recommended approach for Kubernetes environments (the GPU Operator uses it by default), and it is supported in Docker 25+. On a bare-metal host running Docker, the classic OCI hook approach still works fine. On any Kubernetes node managed by the GPU Operator, CDI is handled for you automatically.
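On a host where you manage CDI yourself, it helps to confirm that a spec actually exists before pointing a runtime at it. The sketch below assumes specs live in /etc/cdi as .yaml or .json files; cdi_spec_present is a hypothetical helper name. The commented commands show the generate-and-use flow and assume a Docker version with CDI support enabled.

```shell
#!/usr/bin/env bash
# Sketch: verify that at least one CDI spec file is present.

# cdi_spec_present [DIR] succeeds when DIR (default /etc/cdi)
# contains at least one .yaml or .json spec file.
cdi_spec_present() {
  local cdi_dir=${1:-/etc/cdi}
  local spec
  for spec in "$cdi_dir"/*.yaml "$cdi_dir"/*.json; do
    # An unmatched glob stays literal, so test that the path is a real file.
    [ -f "$spec" ] && return 0
  done
  return 1
}

# Typical flow on a bare-metal Docker host:
#   sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
#   nvidia-ctk cdi list
#   cdi_spec_present && docker run --rm --device nvidia.com/gpu=all \
#     nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
```

Because the spec is generated once rather than computed at container start, it is worth regenerating it after any driver change.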
Failure Modes Worth Knowing Before You Hit Them
These are the five failures I see most often, in rough order of frequency.
1. Container runtime not configured after CTK install. docker run --gpus all fails with could not select device driver. The fix is sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker. This is the most common failure and entirely avoidable.
2. Driver updated without a reboot. The package upgrade replaces the userspace libraries on disk, but the old kernel module stays loaded until the next reboot. Processes that started before the upgrade may keep working, but new ones, including container GPU calls, typically fail with Failed to initialize NVML: Driver/library version mismatch or return wrong values. Always reboot after a driver update. Blackwell systems will refuse to load the old proprietary module entirely, so this is a hard failure, not a silent one.
3. Container CUDA newer than host driver ceiling. The container starts but any CUDA API call fails. cudaGetDeviceCount() returns 0 or returns an error, typically cudaErrorInsufficientDriver ("CUDA driver version is insufficient for CUDA runtime version"). This is the version mismatch that the compatibility table above is designed to prevent. The forward compatibility package inside the container can bridge a one-major-version gap, but only if it is present in the image and the driver supports it.
4. Distro-packaged driver instead of CUDA repo driver. Ubuntu ships nvidia-driver-535 (or similar) in its main repos. This is consistently behind the CUDA repo driver by months and often installs the proprietary module. On an H100 or B200 you get a working nvidia-smi but NVML-based monitoring tools (including DCGM, covered in Part 29) report errors. Always install from the CUDA network repository.
5. Secure Boot conflict. On UEFI systems with Secure Boot enabled, unsigned kernel modules will not load. The open-source kernel module is easier to sign via DKMS and the Machine Owner Key (MOK) flow than the proprietary module, which is one practical reason to prefer it even on older GPU generations. If your team cannot disable Secure Boot for compliance reasons, plan the MOK enrollment into your automation from day one.
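For the Secure Boot case, provisioning automation can detect the state up front instead of discovering it when the module fails to load. A minimal sketch, assuming mokutil is installed and that mokutil --sb-state prints a line containing "SecureBoot enabled" or "SecureBoot disabled"; secure_boot_state is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: classify Secure Boot state from `mokutil --sb-state` output.

# Reads mokutil output on stdin; prints "enabled", "disabled", or "unknown".
secure_boot_state() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *"SecureBoot enabled"*)  echo "enabled";  return ;;
      *"SecureBoot disabled"*) echo "disabled"; return ;;
    esac
  done
  # No recognizable line, e.g. a non-UEFI system or unexpected output.
  echo "unknown"
}

# Usage on a host:
#   state=$(mokutil --sb-state 2>/dev/null | secure_boot_state)
#   [ "$state" = "enabled" ] && echo "plan MOK enrollment before installing the driver"
```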
On a manually managed host, hold the driver packages (for example with apt-mark hold) to prevent apt upgrade from pulling in a new driver mid-week and breaking a production workload. On Kubernetes, the GPU Operator (Part 12) manages the driver lifecycle for you, which removes most of this manual work at the cost of some Kubernetes overhead.
When the GPU Operator Changes the Equation
If you are running Kubernetes, the NVIDIA GPU Operator (Part 12 of this series) installs and manages the driver as a DaemonSet container, deploys the Container Toolkit via another DaemonSet, and handles the CDI spec generation on every node. In that model, you do NOT install the driver manually on the OS. You install only a clean OS (no GPU driver) and let the Operator handle everything. The host baseline described in this part applies to any non-Kubernetes GPU host: bare metal workstations, VMs, HPC nodes, or Kubernetes nodes where the Operator is not used.
The NVIDIA AI Guide has the full series map. Air-gapped environments and lifecycle management of the host stack without internet access are covered in Part 15.
Disclaimer
Installing or updating a GPU driver is a host-level change that requires a reboot and will interrupt all running GPU workloads on that node. In production environments, drain and cordon the node before starting, validate the new driver version in a staging environment first, and have a rollback path (previous driver package held in your repo). Never perform driver updates while a training job or inference service is running on the host.
My Take: The Verdict on a Clean Host Baseline
The right host baseline for a data-center GPU node in 2026 is this: open kernel module driver from the CUDA repo (current r595 series), CUDA Toolkit only if you need host-side builds, and NVIDIA Container Toolkit v1.19.0 configured for your container runtime. On Hopper and Blackwell, the open kernel module is not a preference, it is a requirement. On Turing and Ampere, standardize on it anyway so your fleet is consistent.
When NOT to follow this: if your organization mandates a specific OS image where the driver is baked in and versioned by a separate team (common in large enterprises or HPC centers), work with that team to verify the open module is included for Hopper/Blackwell and that the CUDA ceiling matches your container image requirements. Overriding a production OS image with a manual driver install is a support and lifecycle problem waiting to happen.
What to validate first: run the five commands in the artifact above on every new host before it joins any workload pool. Automate the checks in your provisioning pipeline. If /proc/driver/nvidia/version does not contain Open as a string on a Blackwell node, stop and fix it before the workload team touches the node.
The host stack is unglamorous. Nobody writes blog posts celebrating a clean nvidia-smi output. But a misconfigured host stack costs days of debugging time that should have been spent on the actual workload. Get it right, automate the verification, and move on to the interesting work.