Dr. Pranay Jha


NVIDIA Drivers, CUDA, and the Container Toolkit: Building a Clean GPU Host Baseline (NVIDIA AI Series, Part 11)

The GPU host stack has three distinct layers: the data-center driver (open kernel module now required for Grace Hopper and Blackwell), the CUDA Toolkit, and the NVIDIA Container Toolkit. Get the install order or versions wrong and containers either fail to start or silently use the wrong CUDA runtime. Here is the right sequence, the compatibility matrix, and the failure modes.

NVIDIA AI Series · Part 11 of 30
TL;DR: Three components form the GPU host stack: the data-center GPU driver (open-source kernel modules now default and required for Blackwell), the CUDA Toolkit (a separate install from the driver-embedded CUDA runtime), and the NVIDIA Container Toolkit (nvidia-ctk). Get the install order wrong or mismatch versions and your containers either fail to start or silently use the wrong CUDA runtime. This post shows the right order, the compatibility matrix, the exact verification commands, and the failure mode that catches every team at least once.
Who this is for: Platform engineers and AI infrastructure architects installing a bare-metal or VM GPU host for the first time, or standardizing a repeatable baseline across a fleet. You should be comfortable with Linux package management. Container-runtime familiarity (Docker or containerd) helps. This is the foundation that Part 12 (GPU Operator on Kubernetes) and Part 15 (air-gapped lifecycle) build on directly.

The first GPU host I inherited had three different driver versions installed from three different one-off installs, a CUDA Toolkit that predated the driver by eighteen months, and no Container Toolkit at all. Containers would start, but nvidia-smi inside the container reported the wrong driver version and half the cuBLAS calls segfaulted at runtime. Nobody had noticed because the workload produced numbers that looked plausible. The host stack is unglamorous plumbing, but it is where silent misconfigurations live. Get it right once, automate it, and everything above it works. Get it wrong and you spend days chasing phantom bugs in TensorRT or NIM that are actually a version mismatch three layers below.

The Three Layers and Why They Are Separate

Most teams install the driver, see CUDA listed in nvidia-smi, and assume they are done. That CUDA version in nvidia-smi is the maximum CUDA version the installed driver can support, not an installed toolkit. Confusion between those two is the single most common source of host-stack bugs I see in the field.

Figure 1: The Three-Layer Host Stack
Driver kernel module, CUDA Toolkit, and Container Toolkit are independent packages
[Figure: three stacked layers joined by arrows labeled "requires".
- GPU Driver (open kernel module): nvidia.ko + firmware; sets the max CUDA version; r595 series (April 2026)
- CUDA Toolkit (nvcc, libraries, headers): separate package, CUDA 13.3 / 12.9.x; installs to /usr/local/cuda; NOT the driver
- NVIDIA Container Toolkit v1.19.0 (nvidia-ctk): patches the container runtime config; injects driver libs into containers at run time]
The Container Toolkit does not bundle a CUDA runtime; it mounts the host driver libraries into the container. The CUDA Toolkit on the host is for host-side builds and tools only.

Layer 1: The GPU Driver

The GPU driver is a kernel module. Since driver r560, the open-source kernel module has been NVIDIA's default and recommended installation for data-center GPUs. For Grace Hopper (GH200) and Blackwell (B200/B300, GB200/GB300 NVL72), the open-source modules are not just recommended, they are required: the proprietary kernel module is unsupported on these platforms. For x86 Hopper systems (H100/H200), NVIDIA's published guidance recommends the open modules but still supports the proprietary one. The current data-center driver series is r595 (April 2026 release). The open modules are dual-licensed MIT/GPLv2 and published at github.com/NVIDIA/open-gpu-kernel-modules with every driver release.

There are two packaging flavors to know: nvidia-driver-<version> (meta-package, pulls in kernel modules + userspace) and nvidia-open (open kernel module variant). On Ubuntu 22.04/24.04 with a data-center GPU you want the -open variant of the current production branch from the CUDA repository, not the distro-packaged driver which lags by months and often ships the proprietary module.

Layer 2: The CUDA Toolkit

The CUDA Toolkit (currently 13.3 as of June 2026, with 12.9.x as a widely deployed LTS-equivalent) gives you nvcc, the math libraries (cuBLAS, cuFFT, cuSPARSE, cuSOLVER), profiling tools (Nsight), and headers for host-side development. cuDNN is distributed as a separate package and is not part of the Toolkit. The Toolkit is not required to run a containerized AI workload. If your team only runs NGC containers, you can skip the host-side Toolkit entirely and just install the driver plus the Container Toolkit. Install the Toolkit when your workflow includes host-side model compilation, custom CUDA kernels, or TensorRT-LLM engine builds outside of containers.

The critical compatibility rule: the installed driver must support a CUDA version greater than or equal to the CUDA Toolkit version. A driver at r595 supports CUDA 13.x (confirm the exact minor-version ceiling in the driver release notes). You cannot run applications built with a CUDA 13.3 Toolkit on a driver that caps at CUDA 12.x. The nvidia-smi output field labeled "CUDA Version" is the driver-supported ceiling, not the installed toolkit version. Running nvcc --version shows the installed toolkit. Both must be checked; most teams only check one.
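The two readings can be compared mechanically. What follows is a minimal sketch, assuming GNU sort -V is available and that the nvidia-smi banner and nvcc output keep the wording shown in this post. The function names are illustrative helpers, not part of any NVIDIA tool.

```shell
# Sketch: compare the driver's CUDA ceiling with the installed toolkit release.
# All function names here are hypothetical helpers.

# Read nvidia-smi output on stdin, print the "CUDA Version" ceiling (e.g. 12.4).
driver_cuda_ceiling() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `nvcc --version` output on stdin, print the toolkit release (e.g. 12.9).
toolkit_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when version $1 <= version $2 (ordering done by GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# $1 = driver ceiling, $2 = toolkit release. Non-zero exit on mismatch.
check_toolkit_vs_driver() {
  if version_le "$2" "$1"; then
    echo "OK: toolkit $2 <= driver ceiling $1"
  else
    echo "MISMATCH: toolkit $2 > driver ceiling $1"
    return 1
  fi
}

# Usage on a real host (not executed here):
#   check_toolkit_vs_driver "$(nvidia-smi | driver_cuda_ceiling)" \
#                           "$(nvcc --version | toolkit_release)"
```

The parsing is kept separate from the comparison so the comparison can be reused with versions obtained some other way, for example from a fleet inventory.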

Layer 3: The NVIDIA Container Toolkit

The NVIDIA Container Toolkit (current: v1.19.0) is what makes docker run --gpus all work. It consists of nvidia-ctk (the CLI) and libnvidia-container (the runtime hook library). When a container starts, the toolkit injects the host driver’s userspace libraries (libcuda.so, libnvidia-ml.so, and friends) from the host filesystem into the container mount namespace. The container does not need its own driver or kernel module. It uses the host driver libs that were injected at startup. This is why the CUDA runtime inside the container image must be compatible with the host driver: what the base image was built to expect does not change which driver libraries are injected.

Version Compatibility: The Matrix That Bites You

NVIDIA CUDA has two compatibility modes that determine whether a container image will work on a given host driver. Understanding both is non-negotiable for production fleet management.

| Component | Current Version (June 2026) | Compatibility Rule | Breaks When |
| --- | --- | --- | --- |
| GPU Driver (open KM) | r595 series | Sets max CUDA runtime version the host supports | Container built against CUDA newer than host driver ceiling |
| CUDA Toolkit (host) | 13.3 / 12.9.x | Must be <= driver-supported CUDA version | Toolkit newer than driver; or toolkit version mismatch with build system |
| NVIDIA Container Toolkit | v1.19.0 | Independent of CUDA version; depends on driver userspace libs present | nvidia-ctk not run post-driver-update; daemon not restarted |
| Container image CUDA runtime | Varies per NGC image | CUDA minor-version compat: same major, newer driver OK; forward compat for older hosts | Container expects CUDA 13.x; host driver only supports CUDA 12.x |
Figure 2: CUDA Compatibility Decision Tree
Determines whether a container image will run on your host driver without rebuilding
[Figure: decision tree.
1. Container CUDA <= host driver CUDA ceiling? Yes: go to 2. No: BREAKS, upgrade the host driver.
2. Same major version (e.g. both CUDA 12.x)? Yes: minor-version compat; works, some new APIs unavailable. No: forward compat package required in image.
3. Container starts, GPU accessible; verify with nvidia-smi.]
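Forward compatibility requires the cuda-compat package inside the container image; most NGC containers ship it. Minor-version compatibility (same major CUDA, newer minor in container than host) works, but API calls that need newer driver functionality return an error at runtime (for example cudaErrorCallRequiresNewerDriver), which goes unnoticed if the application does not check return codes.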
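The compatibility rules can be written down as a small classifier. This is a sketch of one reading of the table and caption above, assuming versions are plain major.minor strings and GNU sort -V is available. The function names and the labels it prints are made up for illustration.

```shell
# Sketch: classify an image's CUDA version against the host driver ceiling.
# Function names and the printed labels are illustrative only.

major_of() { printf '%s\n' "${1%%.*}"; }

# Succeed when version $1 <= version $2.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# $1 = CUDA version the image was built against (e.g. 12.6)
# $2 = host driver CUDA ceiling reported by nvidia-smi (e.g. 12.4)
cuda_compat_mode() {
  if version_le "$1" "$2"; then
    # Image is at or below the ceiling: nothing special is needed.
    echo "native"
  elif [ "$(major_of "$1")" = "$(major_of "$2")" ]; then
    # Same major, newer minor in the image: minor-version compatibility.
    echo "minor-version-compat"
  else
    # Newer major in the image: needs cuda-compat in the image or a newer driver.
    echo "forward-compat-or-driver-upgrade"
  fi
}

cuda_compat_mode 12.6 12.4   # prints: minor-version-compat
```

Treat anything other than "native" as a prompt to test the specific workload, since both compatibility modes carry restrictions that depend on the driver branch and the APIs the application uses.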

Install Order and What to Run

The install order matters because the NVIDIA Container Toolkit post-configuration step rewrites your container runtime config to point at the NVIDIA runtime. If the runtime config is patched before the driver is fully installed and the daemon restarted, the hooks do not fire correctly. The rule: driver first, reboot, Container Toolkit second, configure, restart daemon.

Figure 3: Install Sequence Flow
Ordered steps from bare metal to GPU-ready container host
[Figure: five ordered steps.
1. Add repo: cuda-keyring + apt-key
2. Install driver: nvidia-open-595 [VERIFY]
3. Reboot: load new kernel module, verify nvidia-smi
4. Install CTK: nvidia-container-toolkit v1.19.0
5. Configure: nvidia-ctk runtime configure --runtime=docker, then restart
Order: driver before Toolkit before container runtime config. Never configure the container runtime before the driver is loaded and verified.]
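Step 3 (reboot) is not optional. A driver update that skips a reboot leaves the old kernel module loaded while the new userspace libraries are already on disk. The usual symptom is nvidia-smi failing with "Failed to initialize NVML: Driver/library version mismatch". Containers started in this state get the new userspace libraries injected while the old kernel module is still running, which produces the same class of failure, sometimes less visibly.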
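The sequence can be captured as a two-phase script split at the reboot. This is a dry-run sketch for Ubuntu 24.04 on x86_64: it only prints the commands unless DRY_RUN=0 is exported. The keyring file name, the repository path, and the assumption that nvidia-container-toolkit is installable from the configured repositories should all be checked against the current NVIDIA installation guides before use.

```shell
# Sketch: install order as two phases around the reboot. Dry-run by default.
# Package names, keyring file name and repo path are assumptions to verify.

DRY_RUN="${DRY_RUN:-1}"
REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Phase 1: add the CUDA repository, install the open-module driver, reboot.
install_driver_phase() {
  local distro="${1:-ubuntu2404}" arch="${2:-x86_64}"
  local keyring="cuda-keyring_1.1-1_all.deb"
  run wget "$REPO_BASE/$distro/$arch/$keyring"
  run sudo dpkg -i "$keyring"
  run sudo apt-get update
  run sudo apt-get install -y nvidia-open
  run sudo reboot
}

# Phase 2 (after the reboot): verify the driver, then install and configure
# the Container Toolkit and restart Docker.
install_ctk_phase() {
  run nvidia-smi
  run sudo apt-get install -y nvidia-container-toolkit
  run sudo nvidia-ctk runtime configure --runtime=docker
  run sudo systemctl restart docker
}

# Usage: call install_driver_phase, let the host reboot, then call install_ctk_phase.
```

Keeping the phases as separate functions makes it harder to configure the runtime against a driver that has not yet been loaded and verified.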

Gotcha

The NVIDIA package repository ships two driver meta-packages: nvidia-driver-<version> (proprietary kernel module) and nvidia-open-<version> (open kernel module). On a new Grace Hopper or Blackwell system, if you install the non-open variant, the packages install cleanly but the GPU is not usable: depending on the platform you will see nvidia-smi report no devices, NVML errors, or container launch failures, because the proprietary module is not supported there. The kernel log (dmesg) usually states that the GPU requires the open kernel modules. The fix is to remove the proprietary package and install the open variant. Confirm with cat /proc/driver/nvidia/version and look for “Open” in the kernel module version string.
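That check is easy to script so a provisioning pipeline can fail fast. A minimal sketch, assuming the first line of /proc/driver/nvidia/version keeps the "Open Kernel Module" wording for the open flavor; the function name is a hypothetical helper.

```shell
# Sketch: classify the loaded NVIDIA kernel module flavor.
# Reads the content of /proc/driver/nvidia/version on stdin.
kernel_module_flavor() {
  local first_line=""
  IFS= read -r first_line || true   # tolerate a missing trailing newline
  case "$first_line" in
    *"Open Kernel Module"*) echo "open" ;;
    *"Kernel Module"*)      echo "proprietary" ;;
    *)                      echo "unknown" ;;
  esac
}

# Usage on a real host (not executed here):
#   flavor="$(kernel_module_flavor < /proc/driver/nvidia/version)"
#   [ "$flavor" = "open" ] || { echo "expected open module, got $flavor"; exit 1; }
```

The "unknown" branch covers the case where the module is not loaded at all and the file is empty or missing, which is worth treating as a failure in its own right.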

The Operational Artifact: Verify, Configure, Run

This is the exact sequence I run on every new GPU host before declaring it ready. All commands are Ubuntu 22.04 / 24.04. The expected output is shown inline. A mismatch between any of these three readings is the failure mode.


# ---- STEP 1: Verify the host driver is loaded ----
nvidia-smi

# Expected output (r595, Hopper H100 example):
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 595.xx.xx   Driver Version: 595.xx.xx   CUDA Version: 13.x [VERIFY]        |
# |-------------------------------+----------------------+----------------------+           |
# | GPU  Name        Persistence-M | Bus-Id Disp.A |  Volatile Uncorr. ECC |               |
# |   0  NVIDIA H100 80GB HBM3  Off | 00000000:00:05.0 Off | 0 |                          |
# +-----------------------------------------------------------------------------------------+
#
# NOTE: “CUDA Version: 13.x” here is the DRIVER-SUPPORTED ceiling, NOT an installed toolkit.

# ---- STEP 2: Verify the installed CUDA Toolkit (if you installed it) ----
nvcc --version

# Expected:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2026 NVIDIA Corporation
# Built on ...
# Cuda compilation tools, release 12.9, V12.9.x  (or 13.3 if you installed latest)
#
# If nvcc is not found: the Toolkit is NOT installed. That is fine if you only run containers.
# DO NOT confuse the nvcc version with the driver CUDA ceiling shown by nvidia-smi.

# ---- STEP 3: Configure the NVIDIA Container Toolkit for Docker ----
sudo nvidia-ctk runtime configure --runtime=docker

# Expected output:
# INFO[0000] Loading config from /etc/docker/daemon.json
# INFO[0000] Wrote updated config to /etc/docker/daemon.json
# INFO[0000] It is recommended that the Docker daemon be restarted.

# Restart Docker:
sudo systemctl restart docker

# ---- STEP 4: Smoke test - run nvidia-smi inside a container ----
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Expected output:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 595.xx.xx   Driver Version: 595.xx.xx   CUDA Version: 13.x [VERIFY]        |
# |-------------------------------+----------------------+----------------------+           |
# | GPU  Name        Persistence-M | Bus-Id Disp.A |  Volatile Uncorr. ECC |               |
# |   0  NVIDIA H100 80GB HBM3  Off | 00000000:00:05.0 Off | 0 |                          |
# +-----------------------------------------------------------------------------------------+
#
# NOTE: the CUDA Version shown inside the container is still the host driver ceiling,
# not the 12.6 from the image tag, because nvidia-smi runs against the injected host driver libraries.
#
# Classic failure mode: docker: Error response from daemon: could not select device driver ""
# with capabilities: [[gpu]].
# This means one of: (a) the nvidia-container-toolkit package is not installed,
# (b) nvidia-ctk runtime configure was not run, or
# (c) the Docker daemon was not restarted after (a) or (b).
# Check /etc/docker/daemon.json -- it must contain:
# {
#   "runtimes": {
#     "nvidia": {
#       "path": "nvidia-container-runtime",
#       "runtimeArgs": []
#     }
#   }
# }

# ---- STEP 5: Verify the kernel module flavor (open vs proprietary) ----
cat /proc/driver/nvidia/version

# Expected for open kernel module:
# NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  595.xx.xx  ...
# GCC version: ...
# The word Open in the first line confirms you have the open-source kernel module.

Worked example

A team ran a TensorRT-LLM build container on a host with driver r550 (CUDA ceiling: 12.4). The build container was based on CUDA 12.6. nvidia-smi inside the container reported the host driver correctly. The build started, but linking failed with undefined symbol: __cudaRegisterFatBinaryEnd because the container tried to call a symbol only available in CUDA 12.6+ while the host injected the r550 userspace libraries. The fix: update the host driver to r560+ (CUDA ceiling 12.6+). The container image needed no changes. The lesson: always check the host driver CUDA ceiling against every container image you plan to run, not just the latest one.
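Checking every planned image against the ceiling is simple to automate when the CUDA version is encoded in the tag. The sketch below assumes images from the nvidia/cuda repository, whose tags start with the CUDA version; NGC framework images are tagged by release and would need a different lookup. Function names and the example image list are illustrative.

```shell
# Sketch: audit a list of nvidia/cuda image references against a driver ceiling.
# Only handles tags that begin with the CUDA version; everything else is skipped.

version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print major.minor from e.g. nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
image_cuda_version() {
  case "$1" in
    */cuda:*) ;;
    *) return 1 ;;
  esac
  printf '%s\n' "${1##*:}" | sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# $1 = host driver CUDA ceiling, remaining args = image references.
# Returns non-zero if any image is newer than the ceiling.
audit_images() {
  local ceiling="$1" image ver status=0
  shift
  for image in "$@"; do
    if ! ver="$(image_cuda_version "$image")" || [ -z "$ver" ]; then
      echo "SKIP  $image (CUDA version not in tag)"
    elif version_le "$ver" "$ceiling"; then
      echo "OK    $image (CUDA $ver <= $ceiling)"
    else
      echo "CHECK $image (CUDA $ver > $ceiling)"
      status=1
    fi
  done
  return "$status"
}

# Example mirroring the case above (r550 host, ceiling 12.4):
#   audit_images 12.4 nvcr.io/nvidia/cuda:12.6.0-base-ubuntu22.04
```

A CHECK result does not prove the image will fail, since minor-version compatibility may apply, but it marks the images that need an explicit test before the host is put into service.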

Open vs Proprietary Kernel Modules: The Real Difference

The naming causes confusion. The open vs proprietary distinction applies only to the kernel module (the .ko file that loads into the Linux kernel). The userspace drivers, CUDA runtime libraries, and everything you interact with in containers are the same regardless of which kernel module you install. The differences that matter in practice are summarized below.

| Attribute | Open Kernel Module | Proprietary Kernel Module |
| --- | --- | --- |
| License | MIT / GPLv2 | NVIDIA proprietary |
| Required for Grace Hopper / Blackwell | Yes (mandatory) | No (unsupported) |
| Works on Turing, Ampere, and x86 Hopper | Yes (default since r560) | Yes (still packaged) |
| Source availability | github.com/NVIDIA/open-gpu-kernel-modules | Binary only |
| DKMS / UEFI Secure Boot | Easier to sign; distro MOK flow cleaner | Possible but more friction |
| Container / CUDA behavior | Identical to proprietary | Identical to open |

The practical takeaway: if you have any Hopper, Blackwell, or Grace Hopper systems in your fleet, standardize on the open kernel module across all GPU generations. Running two different kernel module types in a heterogeneous fleet makes your automation more complex for zero benefit.

The Container Toolkit Architecture: What Actually Happens

When you run docker run --gpus all, the flow is: Docker delegates to nvidia-container-runtime, which invokes the OCI runtime hook. The hook calls libnvidia-container, which reads the list of GPU devices assigned to the container from the OCI spec, then bind-mounts the device nodes and the required driver userspace libraries from the host filesystem into the container. The container sees GPUs because its filesystem has been temporarily augmented with the host driver stack.

Figure 4: Container Toolkit Runtime Flow
How driver libraries are injected into the container at startup
[Figure: runtime flow for docker run --rm --gpus all <image> nvidia-smi.
1. Docker daemon: the --gpus all flag routes to the nvidia runtime
2. nvidia-container-runtime (OCI hook): reads the OCI spec + env vars
3. libnvidia-container: bind-mounts /dev/nvidia*, injects libcuda.so etc. from the host (/usr/lib/nvidia/*.so)
4. Container namespace: sees GPU devices + host driver libs; no driver in the image required]
The container image does not need to contain driver libraries. The Container Toolkit mounts them from the host at startup. This means a host driver update reaches every container started afterwards with no container rebuild needed; containers that were already running keep the old mounts until they are restarted.
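One way to see the injection is to look at the container's mount table, where each injected library appears as its own bind mount. A minimal sketch, assuming the /proc/self/mounts format (mount point in the second field) and library file names beginning with libcuda or libnvidia-; the function name is hypothetical.

```shell
# Sketch: list mount points that look like injected NVIDIA driver libraries.
# Reads a mount table in /proc/self/mounts format on stdin.
injected_driver_libs() {
  awk '$2 ~ "/lib(cuda|nvidia-[^/]*)[.]so" { print $2 }'
}

# Usage on a real host (not executed here):
#   docker run --rm --gpus all <image> cat /proc/self/mounts | injected_driver_libs
```

If this prints nothing for a container started with --gpus all, the driver libraries were not injected and the cause is in the toolkit or runtime configuration rather than in the image.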

CDI: The New Way to Enumerate Devices

Container Device Interface (CDI) is the newer, standardized alternative to the OCI hook approach. With CDI, device specs are generated once by nvidia-ctk cdi generate and written to /etc/cdi/. The container runtime reads the spec at startup instead of invoking a hook at runtime. CDI is now the recommended approach for Kubernetes environments (the GPU Operator uses it by default), and it is supported in Docker 25+. On a bare-metal host running Docker, the classic OCI hook approach still works fine. On any Kubernetes node managed by the GPU Operator, CDI is handled for you automatically.
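For a bare-metal Docker host that wants to try CDI, the workflow is roughly: generate the spec, list the devices, then request a device by its CDI name. The commands in the comments below are a sketch; the output path under /etc/cdi/ is a conventional file name, and depending on the Docker version CDI may need to be enabled in the daemon configuration first. The helper function is hypothetical and assumes one device name per output line.

```shell
# Sketch: CDI workflow on a Docker host (commands shown as comments; they need
# root and a loaded driver, so they are not executed here).
#   sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
#   nvidia-ctk cdi list
#   docker run --rm --device nvidia.com/gpu=all <image> nvidia-smi

# Read `nvidia-ctk cdi list` output on stdin; succeed if device $1 is listed.
cdi_device_listed() {
  grep -qxF -- "$1"
}

# Usage on a real host:
#   nvidia-ctk cdi list | cdi_device_listed nvidia.com/gpu=all \
#     || echo "CDI spec missing or stale; regenerate it"
```

Because the spec is generated once rather than computed at container start, it should be regenerated after a driver update or a change in the GPU inventory.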

Failure Modes Worth Knowing Before You Hit Them

These are the five failures I see most often, in rough order of frequency.

1. Container runtime not configured after CTK install. docker run --gpus all fails with could not select device driver. First confirm the nvidia-container-toolkit package is actually installed, then run sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker. This is the most common failure and entirely avoidable.

2. Driver updated without a reboot. The new userspace libraries are on disk, but the old kernel module is still loaded. Host tools and container GPU calls then fail, typically with "Driver/library version mismatch", or return wrong values until the matching module is loaded. Always reboot after a driver update. Blackwell systems will refuse to load a proprietary module entirely, so there this is a hard failure, not a silent one.

3. Container CUDA newer than host driver ceiling. The container starts but CUDA API calls fail: cudaGetDeviceCount() returns an error such as cudaErrorInsufficientDriver ("CUDA driver version is insufficient for CUDA runtime version") instead of a usable device count. This is the version mismatch that the compatibility table above is designed to prevent. The forward compatibility package inside the container can bridge a one-major-version gap, but only if it is present in the image and the driver supports it.

4. Distro-packaged driver instead of CUDA repo driver. Ubuntu ships nvidia-driver-535 (or similar) in its own repositories. This is consistently behind the CUDA repo driver by months and often installs the proprietary module. On a Blackwell system such as a B200, where the proprietary module is unsupported, NVML-based monitoring tools (including DCGM, covered in Part 29) report errors. Always install from the CUDA network repository.

5. Secure Boot conflict. On UEFI systems with Secure Boot enabled, unsigned kernel modules will not load. The open-source kernel module is easier to sign via DKMS and the Machine Owner Key (MOK) flow than the proprietary module, which is one practical reason to prefer it even on older GPU generations. If your team cannot disable Secure Boot for compliance reasons, plan the MOK enrollment into your automation from day one.
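Several of these failures announce themselves with a recognizable message, so a first-pass triage can be a simple pattern match. This is a sketch only: the message fragments are ones commonly seen for these conditions, the labels are made up for illustration, and a match is a hint about where to look, not a diagnosis.

```shell
# Sketch: map a GPU-related error message to the most likely failure mode.
# $1 = the error text; prints an illustrative label and a short hint.
triage_gpu_error() {
  case "$1" in
    *"could not select device driver"*)
      echo "runtime-config: install/configure the Container Toolkit, restart Docker" ;;
    *"Driver/library version mismatch"*)
      echo "reboot-needed: kernel module and userspace libraries differ" ;;
    *"driver version is insufficient"*)
      echo "driver-ceiling: container CUDA is newer than the host driver supports" ;;
    *"Required key not available"*|*"Key was rejected by service"*)
      echo "secure-boot: kernel module is not signed with an enrolled key" ;;
    *)
      echo "unknown: check dmesg and the container runtime logs" ;;
  esac
}

triage_gpu_error "Failed to initialize NVML: Driver/library version mismatch"
# prints: reboot-needed: kernel module and userspace libraries differ
```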

In practice: In a fleet of more than a handful of GPU hosts, manual driver installs become the biggest source of version drift. The right answer is to build a golden OS image (packer, cloud-init, or Ansible) that installs from the CUDA repo at a pinned version, runs the verification commands above, and fails the build if any check fails. Version pinning also prevents unattended apt upgrade from pulling in a new driver mid-week and breaking a production workload. On Kubernetes, the GPU Operator (Part 12) manages the driver lifecycle for you, which removes most of this manual work at the cost of some Kubernetes overhead.
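Version pinning on Ubuntu can be done with an apt preferences stanza baked into the golden image. The sketch below only renders the stanza; the package glob, the priority value, and the target file name are assumptions to adapt to the package set your install actually pulls in.

```shell
# Sketch: render an apt preferences stanza that pins NVIDIA packages to one
# driver branch. The "*nvidia*" glob and priority 1001 are assumptions.
# $1 = driver branch, e.g. 595
render_driver_pin() {
  local branch="$1"
  cat <<EOF
Package: *nvidia*
Pin: version ${branch}.*
Pin-Priority: 1001
EOF
}

# Usage in an image build (not executed here; the file name is hypothetical):
#   render_driver_pin 595 | sudo tee /etc/apt/preferences.d/nvidia-driver-pin
# A simpler alternative for a single meta-package:
#   sudo apt-mark hold nvidia-open
```

A pin keeps the host on a branch while still allowing patch updates within it, whereas apt-mark hold freezes the exact installed version; which one fits depends on how much drift your change process tolerates.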

When the GPU Operator Changes the Equation

If you are running Kubernetes, the NVIDIA GPU Operator (Part 12 of this series) installs and manages the driver as a DaemonSet container, deploys the Container Toolkit via another DaemonSet, and handles the CDI spec generation on every node. In that model, you do NOT install the driver manually on the OS. You install only a clean OS (no GPU driver) and let the Operator handle everything. The host baseline described in this part applies to any non-Kubernetes GPU host: bare metal workstations, VMs, HPC nodes, or Kubernetes nodes where the Operator is not used.

The NVIDIA AI Guide has the full series map. Air-gapped environments and lifecycle management of the host stack without internet access are covered in Part 15.

Disclaimer

Installing or updating a GPU driver is a host-level change that requires a reboot and will interrupt all running GPU workloads on that node. In production environments, drain and cordon the node before starting, validate the new driver version in a staging environment first, and have a rollback path (previous driver package held in your repo). Never perform driver updates while a training job or inference service is running on the host.

My Take: The Verdict on a Clean Host Baseline

The right host baseline for a data-center GPU node in 2026 is this: open kernel module driver from the CUDA repo (current r595 series), CUDA Toolkit only if you need host-side builds, and NVIDIA Container Toolkit v1.19.0 configured for your container runtime. On Grace Hopper and Blackwell, the open kernel module is not a preference, it is a requirement. On Turing, Ampere, and x86 Hopper, standardize on it anyway so your fleet is consistent.

When NOT to follow this: if your organization mandates a specific OS image where the driver is baked in and versioned by a separate team (common in large enterprises or HPC centers), work with that team to verify the open module is included for Hopper/Blackwell and that the CUDA ceiling matches your container image requirements. Overriding a production OS image with a manual driver install is a support and lifecycle problem waiting to happen.

What to validate first: run the five commands in the artifact above on every new host before it joins any workload pool. Automate the checks in your provisioning pipeline. If /proc/driver/nvidia/version does not contain Open as a string on a Blackwell node, stop and fix it before the workload team touches the node.

The host stack is unglamorous. Nobody writes blog posts celebrating a clean nvidia-smi output. But a misconfigured host stack costs days of debugging time that should have been spent on the actual workload. Get it right, automate the verification, and move on to the interesting work.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
