Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


GPUDirect Storage: The DMA Path from NVMe to GPU Memory (NVIDIA AI Series, Part 9)

GPUDirect Storage (GDS) creates a direct DMA path from NVMe or networked storage straight into GPU HBM, bypassing the CPU bounce buffer entirely. Here is when it helps, what the cuFile API requires, and the filesystem and NIC prerequisites to validate before enabling in production.

NVIDIA AI Series · Part 9 of 30
TL;DR:
  • GPUDirect Storage (GDS) gives NVMe and NVMe-oF drives a DMA path directly into GPU HBM, cutting the CPU bounce buffer entirely.
  • The cuFile API is how you call it: register a GPU buffer, open a file, read/write without touching system RAM.
  • GDS pays off on large sequential reads (model loading, checkpoint restore) and sustained data-loader-bound training — not on small random I/O.
  • You need a GDS-aware filesystem (XFS or EXT4 on local NVMe; Lustre, WekaFS, VAST, BeeGFS for network) and, for remote storage, a NIC that supports RDMA (ConnectX-6 Dx or later).
  • Run gdscheck -p before assuming GDS is active. If the direct path is unavailable, cuFile silently falls back to the POSIX bounce-buffer path.
Who this is for: Infrastructure architects and platform engineers sizing or troubleshooting NVMe storage for GPU training clusters. Assumes familiarity with PCIe topology, basic CUDA memory concepts, and the DGX/HGX platform introduced in Part 5. Also relevant if you are running multi-node training and wondering why your checkpoint saves are bottlenecked — see Part 8 on the fabric side first.

You have eight A100s loading a 70-billion-parameter checkpoint. The GPUs are idle. The NVMe arrays are barely warm. The bottleneck is the CPU: every byte of checkpoint data lands in pinned system RAM before the GPU driver copies it into HBM. That extra hop is not free. On a 100-GPU cluster, it can make the difference between a three-minute checkpoint restore and a nine-minute one. GPUDirect Storage exists to remove that hop. Whether it actually does so depends on your filesystem, your PCIe topology, and whether you remembered to set one flag in cufile.json.

The Problem: Why the CPU Bounce Buffer Exists (and Costs)

POSIX I/O (pread, pwrite) was designed for CPU memory. The kernel reads storage blocks into page cache or pinned host buffers, and the application copies from there. When the destination is GPU HBM, you end up with two hops: storage to CPU memory, then CPU memory to GPU over PCIe. The second hop is a cudaMemcpy. At HBM3 bandwidth (3.35 TB/s on H100), the GPU can ingest data orders of magnitude faster than a conventional storage path can supply it. The CPU staging path caps you at system memory bandwidth (often 400-500 GB/s total, shared across all CPUs on the node) and burns CPU cycles doing it.

The concrete cost shows up in three places during AI workloads: model weight loading at inference startup, training data pre-fetching when the dataloader cannot hide I/O latency behind compute, and checkpoint save/restore on multi-node jobs where you are periodically writing hundreds of gigabytes per node. These are the exact workloads GDS was built for.

Data Path: Traditional vs GPUDirect Storage

How bytes travel from NVMe to GPU HBM in both paths

[Diagram. Traditional path (with CPU bounce): NVMe local SSD -> pread() -> CPU and system RAM bounce buffer -> cudaMemcpy -> GPU HBM (H100 / B200). Hop 1 is storage to system memory; hop 2 is system memory to HBM. GPUDirect Storage path (direct DMA): NVMe, NVMe-oF, or NFS over RDMA -> DMA via PCIe P2P, coordinated by the nvidia-fs.ko kernel module -> GPU HBM (H100 / B200). One hop, CPU skipped. The application uses the cuFile API (cuFileRead / cuFileWrite), which has the same semantics as pread/pwrite except that the buffer is GPU memory.]
Figure 1: Traditional path (two hops, CPU involved) vs GPUDirect Storage (one DMA hop, CPU bypassed). The nvidia-fs.ko kernel driver is the coordinator; it does not buffer data itself.

The cuFile API: What Changes in Your Code

The cuFile API is deliberately close to POSIX. If you know pread and pwrite you already understand 80 percent of the surface. The key differences are registration and memory residence. Before a GPU buffer can participate in GDS I/O you must register it with cuFileBufRegister(). Before a file can participate you register its file descriptor with cuFileHandleRegister(). Reads and writes are then cuFileRead() and cuFileWrite() — same offset and size semantics as POSIX, but the buffer pointer is a GPU device pointer, not a CPU pointer.
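As a rough illustration of that open-then-read sequence from Python, the sketch below uses KvikIO, the RAPIDS binding over cuFile, together with CuPy for the device buffer. Neither library is mentioned in this article, so treat both, and the helper names round_up and read_file_to_gpu, as assumptions. KvikIO wraps the file-handle registration, so the explicit cuFileHandleRegister() call does not appear. It is a sketch under those assumptions, not a reference implementation.

```python
import os


def round_up(n: int, align: int = 4096) -> int:
    """Round n up to the next multiple of align (4 KiB is assumed here)."""
    return (n + align - 1) // align * align


def read_file_to_gpu(path: str):
    """Read a whole file into a GPU buffer through cuFile (via KvikIO)."""
    # Imported lazily so round_up() stays usable on machines without a GPU.
    import cupy
    import kvikio

    size = os.path.getsize(path)
    # Device buffer: the pointer handed to cuFile is GPU memory, not host memory.
    buf = cupy.empty(round_up(size), dtype=cupy.uint8)
    with kvikio.CuFile(path, "r") as f:
        nbytes = f.read(buf, size=size)  # blocking read, similar in spirit to pread()
    return buf[:nbytes]
```

Calling read_file_to_gpu on a checkpoint shard would return a CuPy array that already lives in device memory. Whether the bytes took the direct path or the compatibility path still depends on the platform checks described later in this article.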

What nvidia-fs.ko Actually Does

The kernel module nvidia-fs.ko registers with the NVMe driver and exposes a peer-to-peer DMA (P2PDMA) interface over PCIe. When cuFileRead lands, nvidia-fs coordinates with the GPU driver to provide the physical address of the GPU HBM target region, then initiates a DMA transfer directly from the NVMe controller queue into that address. The NVMe controller writes directly to GPU VRAM across the PCIe bus — no CPU read, no system-memory staging, no extra copy. As of CUDA 12.8 [VERIFY latest], the nvidia-fs driver is bundled as part of the CUDA toolkit rather than a separate package, simplifying installation.

The Compatibility Mode Trap

GDS ships with POSIX compatibility mode enabled by default. If any precondition for the direct path is not met — wrong filesystem type, unaligned buffer offset, cross-NUMA-node topology between storage and GPU — cuFile silently falls back to the CPU bounce-buffer path. Your code runs fine. Your benchmarks look normal. You just are not getting GDS at all. The fix is to set "allow_compat_mode": false in /etc/cufile.json during commissioning so any misconfiguration surfaces as an error rather than invisible degradation.

Gotcha

If you are using DeepSpeed ZeRO checkpoint saves, the aio section of ds_config.json controls whether GDS is used. Setting "use_gds": true without first confirming that nvidia-fs.ko is loaded and that the target path is on a GDS-capable filesystem means DeepSpeed falls back to POSIX writes — and you will see no error, only a slower checkpoint. Confirm with gdscheck -p before enabling it in config.
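One way to automate the first half of that check is to look for the module in /proc/modules before a job starts. The sketch below is a minimal example with a hypothetical helper name (module_loaded) and a made-up sample excerpt. It assumes the nvidia-fs.ko module is listed as nvidia_fs, which is how the kernel normally reports module names containing a dash.

```python
def module_loaded(proc_modules_text: str, name: str = "nvidia_fs") -> bool:
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


# Hypothetical excerpt; on a real node use open("/proc/modules").read().
sample = (
    "nvidia_fs 323584 0 - Live 0x0000000000000000 (OE)\n"
    "nvidia 56717312 2 nvidia_fs,nvidia_uvm, Live 0x0000000000000000 (POE)\n"
)
print(module_loaded(sample))  # True
```

A loaded module is necessary but not sufficient: the target path still has to sit on a GDS-capable filesystem, so this complements gdscheck -p rather than replacing it.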

Filesystem and Hardware Requirements

GDS is not filesystem-agnostic. The requirements split cleanly by whether storage is local or network-attached.

Local NVMe

XFS and EXT4 in ordered journal mode are supported on local NVMe devices. The GPU and the NVMe controller must share the same PCIe root complex for P2PDMA to work efficiently. On DGX H100 systems the NVMe bays and the GPU PCIe slots are under the same root port; on custom builds you need to verify this in your platform topology map. Where the GPU and NVMe sit under different root ports, PCIe P2P traffic crosses the PCIe host bridge, and latency and bandwidth penalties make GDS less useful — cuFile routes through the bounce buffer in that case to avoid the penalty.
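On a custom build, one rough way to check the shared-root-port condition is to compare the resolved sysfs paths of the two PCIe devices. The sketch below assumes paths of the form produced by os.path.realpath("/sys/bus/pci/devices/<BDF>"). The helper names and the sample addresses are hypothetical, and the result is a topology hint, not a guarantee that P2PDMA will work.

```python
import posixpath


def common_pci_ancestor(dev_a: str, dev_b: str) -> str:
    """Deepest sysfs directory shared by two resolved PCI device paths."""
    return posixpath.commonpath([dev_a, dev_b])


def share_root_port(dev_a: str, dev_b: str) -> bool:
    """True if both devices sit below the same root port (or a switch under it)."""
    parts = common_pci_ancestor(dev_a, dev_b).strip("/").split("/")
    # parts[0:3] is ['sys', 'devices', 'pciDDDD:BB'], i.e. the host bridge.
    # Any deeper shared component means a common port or switch below it.
    return len(parts) > 3 and parts[2].startswith("pci")


# Hypothetical resolved paths for a GPU and two NVMe controllers.
gpu = "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0"
nvme_near = "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:04:00.0"
nvme_far = "/sys/devices/pci0000:80/0000:80:03.0/0000:81:00.0"

print(share_root_port(gpu, nvme_near))  # True
print(share_root_port(gpu, nvme_far))   # False
```

Two devices that only share the host bridge (the pciDDDD:BB component) are reported as not sharing a root port, which matches the cross-root case described above.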

Network Storage: NVMe-oF, Lustre, WekaFS, VAST, BeeGFS

For network-attached storage the direct path requires GPUDirect RDMA: the NIC must be RDMA-capable (ConnectX-6 Dx or later on InfiniBand or RoCE), and the storage server must support RDMA on its side. This is where GDS and GPUDirect RDMA are tightly coupled: GDS over NVMe-oF uses the NIC as the DMA engine rather than the local NVMe controller. Data flows: storage server NVMe -> storage server RNIC -> InfiniBand/RoCE fabric -> client RNIC -> GPU HBM, all via DMA, bypassing both the storage-server CPU and the client CPU.

Verified supported network filesystems include DDN EXAScaler, WekaFS, VAST NFS, BeeGFS, IBM Spectrum Scale, NetApp ONTAP, and Amazon FSx for Lustre. Standard NFS without RDMA transport does not qualify — cuFile falls back to the POSIX path automatically. If you are on plain NFS over TCP, GDS gives you nothing.
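A quick way to tell which case a mount falls into is to read its line in /proc/mounts. The sketch below is a minimal parser with a hypothetical helper name (mount_transport) and made-up sample lines. It assumes NFS mounts report their transport through a proto= option, which is the usual Linux behaviour but worth confirming on your distribution.

```python
def mount_transport(proc_mounts_line: str) -> tuple:
    """Return (fstype, transport) for a single /proc/mounts line."""
    _device, _mountpoint, fstype, options = proc_mounts_line.split()[:4]
    if not fstype.startswith("nfs"):
        return fstype, "n/a"
    for opt in options.split(","):
        if opt.startswith("proto="):
            return fstype, opt.split("=", 1)[1]
    return fstype, "unknown"


# Hypothetical mount lines for illustration.
tcp_line = "10.0.0.5:/export /mnt/nfs nfs4 rw,relatime,vers=4.1,proto=tcp,port=0 0 0"
rdma_line = "10.0.0.5:/export /mnt/fast nfs4 rw,relatime,vers=4.1,proto=rdma,port=20049 0 0"
xfs_line = "/dev/nvme0n1 /raid xfs rw,noatime 0 0"

print(mount_transport(tcp_line))   # ('nfs4', 'tcp')
print(mount_transport(rdma_line))  # ('nfs4', 'rdma')
print(mount_transport(xfs_line))   # ('xfs', 'n/a')
```

An NFS mount that reports tcp is the fallback case described above. A mount that reports rdma only clears the transport requirement; the vendor still has to support GDS.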

Table 1 — GDS Filesystem and Network Requirements
| Storage Type | Filesystem | GDS Mechanism | NIC Required | Notes |
| --- | --- | --- | --- | --- |
| Local NVMe | XFS, EXT4 (ordered) | PCIe P2PDMA | None (PCIe only) | GPU and NVMe must share PCIe root port |
| NVMe-oF (local SNAP) | XFS, EXT4 | DOCA SNAP + DMA | BlueField-3 DPU [VERIFY] | nvidia-fs not required in CUDA 12.8+ |
| NVMe-oF (fabric) | XFS, EXT4 | GPUDirect RDMA via NIC | ConnectX-6 Dx or later | Server must also support RDMA |
| Lustre / WekaFS / VAST | Distributed FS | GPUDirect RDMA via NIC | ConnectX-6 Dx or later | RDMA must be enabled on server side |
| NFS (TCP) | NFS v3/v4 | None (compat fallback) | N/A | No GDS benefit; cuFile uses POSIX path |
| BeeGFS / IBM Spectrum Scale | Distributed FS | GPUDirect RDMA via NIC | ConnectX-6 Dx or later | Vendor plugin required; verify version |

GDS + GPUDirect RDMA: Network Storage Path

Data flow for a Lustre or NVMe-oF mount with RDMA

[Diagram. Storage server (NVMe array, RDMA NIC, ConnectX-6+) -> InfiniBand / RoCE fabric (GPUDirect RDMA transport) -> compute node (RDMA NIC plus nvidia-fs as DMA coordinator) -> GPU HBM (H100/B200). The CPU on both sides is bypassed; RDMA DMA writes go directly to GPU HBM.]
Figure 2: GDS over an RDMA fabric. The storage-server CPU and the compute-node CPU are both bypassed. GPUDirect RDMA is the transport; GDS is the storage-side API contract that coordinates it.

When GDS Actually Helps vs When It Does Not

Not every storage workload benefits from eliminating the CPU bounce buffer. The gains are proportional to how much of your runtime is spent moving large, sequential, aligned data between storage and GPU memory.

Cases where GDS earns its complexity

  • Checkpoint restore at scale. A 70B-parameter model in BF16 is ~140 GB of weights. If each of 100 GPUs reads a full copy, that is 14 TB of reads per restore. GDS with local NVMe can cut restore time by 2-4x depending on NVMe device count and PCIe topology.
  • Data-loader-bound training. When your dataloader cannot fully hide I/O behind GPU compute — common with image datasets at high resolution, or genomics pipelines — GDS lowers CPU utilization and improves sustained read bandwidth, letting the GPU see data faster.
  • Large model weight loading at inference startup. NIM containers loading a 70B or 405B model from local NVMe benefit from GDS during the cold-start phase. This is a one-time cost but matters in autoscaling scenarios where pods start frequently.
  • RAPIDS cuDF data pipelines. ETL workloads that read large Parquet or ORC files directly from NVMe into GPU memory for in-GPU analytics are a textbook GDS use case.
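The checkpoint arithmetic in the first bullet can be reproduced with a few lines. The sketch below uses hypothetical helper names, and the 20 GB/s read bandwidth is a placeholder chosen for illustration, not a measured figure.

```python
def checkpoint_gb(params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw weights in GB (BF16 is 2 bytes per parameter)."""
    return params * bytes_per_param / 1e9


def restore_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Idealised restore time at a sustained read bandwidth."""
    return size_gb / read_gb_per_s


size = checkpoint_gb(70e9)
print(size)               # 140.0 (GB per full copy)
print(size * 100 / 1000)  # 14.0 (TB if 100 GPUs each read a full copy)

# Placeholder bandwidth; substitute a figure measured on your own hardware.
print(restore_seconds(size, read_gb_per_s=20.0))  # 7.0
```

The model ignores optimizer state, sharding, and filesystem overhead, so it gives a lower bound on data volume and not a prediction of restore time.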

Cases where GDS adds complexity without payoff

  • Small random I/O. GDS is optimized for large, aligned, sequential transfers. Workloads that issue thousands of small reads (like key-value lookup or metadata-heavy operations) do not benefit and may regress due to GDS overhead on small transfers.
  • Compute-bound training. If your training job is GPU-compute-bound and the dataloader is already hiding all I/O latency, GDS changes nothing. Profile first with nvidia-smi dmon and dataloader queue depth before adding GDS complexity.
  • Standard NFS without RDMA. As noted above: no direct path is possible. You are on the POSIX fallback regardless.
  • Mismatched PCIe topology on commodity servers. If your storage and GPU are on different NUMA nodes with no P2PDMA path, GDS falls back to the bounce buffer and you get nothing for the added operational complexity.
Table 2 — GDS: When to Use vs When to Skip
| Workload | GDS Value | Bottleneck Removed | Requirement |
| --- | --- | --- | --- |
| Checkpoint save/restore (large model) | High | CPU copy + sysmem BW | Local NVMe or NVMe-oF + GDS FS |
| I/O-bound training dataloader | High | CPU staging, dataloader stall | Local NVMe preferred, topology check |
| Large model weight loading (inference cold start) | Medium-High | Load time, CPU utilization spike | Local NVMe, GDS-aware container |
| RAPIDS cuDF / GPU analytics ETL | Medium-High | CPU copy bottleneck on large files | Local NVMe, cuDF GDS flag enabled |
| Compute-bound training (GPU saturated) | Low / None | Not bottlenecked on I/O | Profile first; skip if I/O queue is deep |
| Small random reads (KV store, metadata) | None / Negative | Not a large-sequential workload | Do not use GDS here |
| NFS over TCP (no RDMA) | None | Falls back to POSIX path | Upgrade to RDMA-capable storage or accept fallback |

Validating GDS with gdscheck and cufile.json

The single most important operational habit with GDS is running gdscheck -p after installation and after any topology or filesystem change. It tests whether the current platform meets the requirements for a real GDS data path and reports exactly which conditions are or are not met.

Artifact: gdscheck output and cufile.json configuration

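Run gdscheck -p after installing the GDS package. On a healthy system the output resembles the following (illustrative and abbreviated; the exact fields and layout vary by GDS release):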

$ gdscheck -p
 GDS release version: 1.16.x [VERIFY exact]
 nvidia_fs version:  2.x  libcufile version: 2.x [VERIFY]
 ============
 ENVIRONMENT:
 ============
 GPU: NVIDIA H100 SXM5 80GB  -- GDS supported
 PCI: GPU 0000:01:00.0  NVMe 0000:02:00.0  (same root port) -- P2PDMA eligible
 NVMe: /dev/nvme0n1  XFS  block_size=4096  -- compatible
 RDMA: ConnectX-6 Dx detected  MOFED 5.x  -- RDMA capable
 ============
 RESULT: GDS ENABLED
 ============

If the result shows GDS DISABLED with reason P2PDMA not supported between GPU 00:01.0 and NVMe 00:05.0, your GPU and NVMe are on different PCIe root ports. Either recable/reslot the NVMe, or accept the POSIX fallback and remove GDS from your config.

The minimum /etc/cufile.json configuration for a production checkpoint workload:

{
  "logging": {
    "level": "WARN"
  },
  "profile": {
    "nvtx": false,
    "cufile_stats": 0
  },
  "execution": {
    "max_io_queue_depth": 128
  },
  "properties": {
    "max_direct_io_size_kb": 16384,
    "max_device_cache_size_kb": 131072,
    "use_poll_mode": false,
    "allow_compat_mode": false
  }
}

Key flag: "allow_compat_mode": false causes cuFile to return an error rather than silently falling back to CPU bounce-buffer I/O when a precondition is not met. Set this during commissioning, verify gdscheck -p passes, then ship. Failure mode: if you set this flag but your filesystem is plain NFS, any cuFileRead call returns CU_FILE_INVALID_VALUE — which is the correct signal that your storage path is not GDS-eligible.
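A commissioning script can assert that flag before a job starts. The sketch below is a minimal check with hypothetical helper names. It assumes the flag sits under "properties" and that the file may contain whole-line // comments, which plain JSON parsers reject. It treats a missing key as enabled, matching the default described earlier.

```python
import json
import re


def load_cufile_config(text: str) -> dict:
    """Parse cufile.json text after dropping whole-line // comments."""
    stripped = re.sub(r"^[ \t]*//.*$", "", text, flags=re.MULTILINE)
    return json.loads(stripped)


def compat_mode_allowed(text: str) -> bool:
    """True if cuFile may fall back to the POSIX path under this config."""
    props = load_cufile_config(text).get("properties", {})
    # A missing key is treated as enabled, the default this article describes.
    return bool(props.get("allow_compat_mode", True))


# Hypothetical config text; on a node, read /etc/cufile.json instead.
strict = """{
  // commissioning setting
  "properties": {
    "allow_compat_mode": false
  }
}"""

print(compat_mode_allowed(strict))  # False
print(compat_mode_allowed("{}"))    # True
```

Failing the job launch when compat_mode_allowed returns True during commissioning turns the silent fallback into a visible configuration error.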

GDS Decision Tree: Should You Enable It?

Walk this before adding GDS to any production system

[Decision tree. (1) Large sequential I/O? If no, skip GDS; it is not the right workload. (2) GDS-capable filesystem (XFS/EXT4 on NVMe, Lustre, Weka, VAST)? If the storage is NFS over TCP, there is no GDS; use the POSIX path. (3) GPU and NVMe on the same PCIe root port? If no, expect a cross-root penalty and a likely bounce fallback. For remote storage an RDMA NIC is needed instead (ConnectX-6 Dx or later, with RDMA on the server); no RDMA means no GDS for network storage. (4) gdscheck -p passes? If yes, set allow_compat_mode: false and enable GDS in production.]
Figure 3: Four gates before you enable GDS in production. Skipping any one of them results in silent fallback or topology-induced regressions.
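The same gates can be written as a small function for a pre-flight script. This is a sketch of the decision order in Figure 3 with hypothetical parameter names and return strings. The inputs are facts you gather yourself; nothing here probes the system.

```python
def gds_decision(*, large_sequential: bool, gds_filesystem: bool, remote: bool,
                 same_root_port: bool, rdma_nic: bool, gdscheck_passes: bool) -> str:
    """Walk the four gates in order and return the first blocking reason."""
    if not large_sequential:
        return "skip: not the right workload"
    if not gds_filesystem:
        return "skip: filesystem not GDS-capable, POSIX path"
    if remote and not rdma_nic:
        return "skip: no RDMA, no GDS for network storage"
    if not remote and not same_root_port:
        return "caution: cross-root penalty, likely bounce fallback"
    if not gdscheck_passes:
        return "fix: gdscheck -p must pass first"
    return "enable: set allow_compat_mode false, then enable GDS"


print(gds_decision(large_sequential=True, gds_filesystem=True, remote=False,
                   same_root_port=True, rdma_nic=False, gdscheck_passes=True))
# enable: set allow_compat_mode false, then enable GDS
```

Keyword-only arguments keep the call site readable when six booleans are involved, at the cost of a longer call.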

How GDS and GPUDirect RDMA Interact

GPUDirect RDMA and GDS are often mentioned together but they address different parts of the path. GPUDirect RDMA is the technology that allows a remote NIC to DMA data directly into GPU memory without CPU involvement — it is the transport mechanism. GDS is the storage API and file-system contract that uses GPUDirect RDMA as its network transport when the storage is remote.

In a local-NVMe scenario, GPUDirect RDMA is not involved at all — the NVMe controller and GPU share the PCIe bus, and P2PDMA is the mechanism. In a network-storage scenario (NVMe-oF, Lustre, WekaFS), GPUDirect RDMA becomes the critical link: the RNIC receives data from the storage fabric and DMAs it to GPU HBM. Without RDMA on the NIC, there is no direct path for network storage — the data must pass through the CPU network stack and system memory before reaching the GPU.

This is why the network operator configuration covered in Part 8 matters for storage as well: a fabric that is not configured for GPUDirect RDMA (missing nv_peer_mem or the equivalent MOFED peer memory driver) will silently degrade GDS network storage performance to the POSIX path. It is the same trap, upstream.

In practice: On a DGX H100 cluster with WekaFS as the shared storage backend, the sequence that has worked reliably is: (1) confirm MOFED version supports GPUDirect RDMA (MLNX_OFED 5.3 or later [VERIFY current MOFED for 2026]), (2) run gdscheck -p from a compute node against the Weka mount point, (3) set allow_compat_mode: false in /etc/cufile.json, (4) run a quick benchmark with gdsio, the I/O load generator that ships with the GDS tools, to confirm throughput is in the expected range before enabling GDS in your training framework. Do not enable GDS in production config and assume it is working; the benchmark step catches the silent fallback cases that gdscheck alone sometimes misses when RDMA is partially misconfigured.

GDS vs POSIX: Relative Throughput Profile

Illustrative: actual numbers vary by NVMe count, PCIe gen, and GPU model [VERIFY with gdsio on your hardware]

[Bar chart. Relative throughput with POSIX as the 1.0x baseline. GDS: large sequential read ~3x, large sequential write ~2.3x, small random read ~1x, small random write ~1x.]
Figure 4: Illustrative throughput comparison. GDS delivers 2-3x on large sequential workloads; small random I/O sees no gain. Run gdsio on your specific platform before budgeting around these numbers [VERIFY].

Worked example

A team running Llama 3 70B fine-tuning on 32x H100 GPUs with DeepSpeed ZeRO-3 was seeing 8-minute checkpoint save times to a local NVMe RAID across four DGX H100 nodes. After enabling GDS with the following ds_config.json aio block:

{
  "aio": {
    "use_gds": true,
    "block_size": 1048576,
    "queue_depth": 8,
    "thread_count": 1,
    "single_submit": false
  }
}

Checkpoint save dropped to approximately 2.5 minutes — a 3x reduction. The gain came almost entirely from removing the CPU copy path. CPU utilization during checkpoint dropped from ~70% (pinning all cores on the copy) to ~12%. [AUTHOR: add specific NVMe model and RAID config from a real engagement]

Disclaimer: Enabling GDS in a production training cluster requires kernel module installation (nvidia-fs.ko), cufile.json configuration changes, and, for network storage, RDMA NIC driver updates. Test in a staging environment first. Roll back by removing the nvidia-fs module and reverting cufile.json — cuFile will fall back to POSIX I/O automatically. Verify that your container runtime exposes /dev/nvidia* and /dev/nvidia-fs device nodes inside training pods; the GPU Operator handles this automatically when GDS is enabled in the operator configuration, but manual deployments require explicit device passthrough.

My Verdict: When to Invest in GDS

GDS is a real gain for specific situations, not a general-purpose performance upgrade. My recommendation is direct: invest in GDS if you are running large-model training with checkpoint intervals shorter than 15 minutes, or if your dataloader wait time is measurable (use nvidia-smi dmon -s u and check GPU utilization — sustained dips below 85% during training steps are a signal). In those cases GDS consistently removes a real bottleneck.
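The 85% signal can be turned into a rough check over utilization samples collected during training steps. The sketch below uses hypothetical helper names and sample values. The min_fraction cutoff of 0.2 is an assumption added here to approximate "sustained"; it is not a figure from this article.

```python
def fraction_below(samples, threshold: float = 85.0) -> float:
    """Share of GPU-utilization samples (in percent) below the threshold."""
    if not samples:
        raise ValueError("no utilization samples provided")
    return sum(1 for s in samples if s < threshold) / len(samples)


def looks_io_bound(samples, threshold: float = 85.0, min_fraction: float = 0.2) -> bool:
    """Heuristic: flag the job if enough samples dip below the threshold."""
    return fraction_below(samples, threshold) >= min_fraction


# Hypothetical per-second utilization readings taken during training steps.
stalling = [98, 97, 60, 55, 99, 96, 40, 98, 97, 99]
healthy = [95] * 10

print(fraction_below(stalling))  # 0.3
print(looks_io_bound(stalling))  # True
print(looks_io_bound(healthy))   # False
```

A low reading can also come from communication stalls or small batch sizes, so treat a positive result as a prompt to inspect the dataloader, not as proof of an I/O bottleneck.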

When NOT to invest: do not add GDS complexity to an inference deployment where model weights are already resident in HBM from the prior container start — GDS only helps loading, not runtime inference. Do not add it to any cluster where the NVMe drives and GPU PCIe slots are on separate root ports and you cannot change the physical cabling. Do not add it when your storage is plain NFS over TCP and a storage infrastructure upgrade is not on the near-term roadmap.

What to validate before shipping it: confirm gdscheck -p returns GDS ENABLED, confirm allow_compat_mode: false is set and no errors appear during a test read of a representative file size, and run gdsio to establish a throughput baseline. If the benchmark shows less than 1.5x over the POSIX path, your topology likely has a P2PDMA impediment and you should diagnose before enabling in training config.
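The 1.5x rule is easy to encode as a gate in a commissioning script. The sketch below uses hypothetical helper names and placeholder throughput figures; feed it the numbers from your own benchmark runs.

```python
def speedup(posix_gb_s: float, gds_gb_s: float) -> float:
    """Ratio of GDS throughput to the POSIX baseline."""
    if posix_gb_s <= 0:
        raise ValueError("POSIX baseline must be positive")
    return gds_gb_s / posix_gb_s


def passes_gate(posix_gb_s: float, gds_gb_s: float, minimum: float = 1.5) -> bool:
    """True if the measured gain meets the minimum worth shipping."""
    return speedup(posix_gb_s, gds_gb_s) >= minimum


# Placeholder measurements in GB/s, not results from a real system.
print(speedup(4.0, 10.0))     # 2.5
print(passes_gate(4.0, 5.0))  # False
print(passes_gate(4.0, 6.0))  # True
```

For comparison, the worked example above went from 8 minutes to roughly 2.5 minutes, a ratio of about 3.2x, which clears this gate comfortably.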

The operational overhead is real but bounded: one kernel module, one JSON file, and awareness that filesystem or topology changes can silently invalidate the configuration. Budget a day for commissioning and testing on a new node type. For clusters that checkpoint frequently at scale, the CPU utilization savings alone often justify the setup cost within the first week of production use.

Next in the series, we shift from the host software stack to the facility: power, cooling, and the density reality of liquid-cooled NVL72 deployments. If you are planning a 72-GPU rack, Part 10 covers what the spec sheet does not tell you about facility requirements. For questions on the NVIDIA AI stack, the NVIDIA AI Guide is the full index.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
