- GPUDirect Storage (GDS) gives NVMe and NVMe-oF drives a DMA path directly into GPU HBM, cutting the CPU bounce buffer entirely.
- The cuFile API is how you call it: register a GPU buffer, open a file, read/write without touching system RAM.
- GDS pays off on large sequential reads (model loading, checkpoint restore) and sustained data-loader-bound training — not on small random I/O.
- You need a GDS-aware filesystem (XFS, EXT4 on local NVMe; Lustre, WekaFS, VAST, BeeGFS for network) and, for remote storage, a NIC that supports RDMA (ConnectX-6 or later).
- Run gdscheck -p before assuming GDS is active. When the direct path is unavailable, cuFile silently falls back to the POSIX bounce-buffer path.
You have eight A100s loading a 70-billion-parameter checkpoint. The GPUs are idle. The NVMe arrays are barely warm. The bottleneck is the CPU: every byte of checkpoint data lands in pinned system RAM before the GPU driver copies it into HBM. That extra hop is not free. On a 100-GPU cluster, it can make the difference between a three-minute checkpoint restore and a nine-minute one. GPUDirect Storage exists to remove that hop. Whether it actually does so depends on your filesystem, your PCIe topology, and whether you remembered to set one flag in cufile.json.
The Problem: Why the CPU Bounce Buffer Exists (and Costs)
POSIX I/O (pread, pwrite) was designed for CPU memory. The kernel reads storage blocks into page cache or pinned host buffers, and the application copies from there. When the destination is GPU HBM, you end up with two hops: storage to CPU memory, then CPU memory to GPU over PCIe. The second hop is a cudaMemcpy. At HBM3 bandwidth (3.35 TB/s on H100 SXM), the GPU can ingest data orders of magnitude faster than a conventional storage path can supply it. The CPU staging path caps you at system memory bandwidth, often 400-500 GB/s total and shared across all CPUs on the node, and burns CPU cycles doing it.
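To put rough numbers on the staging cost, here is a minimal Python sketch. It assumes each staged byte touches system memory twice (one DMA write into the pinned buffer, one DMA read out of it on the way to the GPU); a page-cache copy in between would double that. The function names and the example figures are illustrative, not measurements.

```python
def staging_memory_traffic_gbps(storage_gbps, touches_per_byte=2):
    """System-memory bandwidth consumed by a bounce-buffer path.

    touches_per_byte=2 is an assumption: one write into the pinned buffer
    plus one read out of it. Use 4 to model an extra page-cache copy.
    """
    if storage_gbps < 0 or touches_per_byte < 1:
        raise ValueError("storage_gbps must be >= 0 and touches_per_byte >= 1")
    return storage_gbps * touches_per_byte


def load_seconds(size_gb, effective_gbps):
    """Time to move size_gb at a sustained rate of effective_gbps."""
    if effective_gbps <= 0:
        raise ValueError("effective_gbps must be positive")
    return size_gb / effective_gbps


if __name__ == "__main__":
    # Hypothetical node: 8 GPUs, each fed at 10 GB/s from storage.
    aggregate = 8 * 10.0
    print("memory traffic (GB/s):", staging_memory_traffic_gbps(aggregate))
    print("seconds for 140 GB:", load_seconds(140.0, aggregate))
```

The point of the model is that the staged path spends a multiple of the storage throughput in memory bandwidth, while the direct path spends none of it.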
The concrete cost shows up in three places during AI workloads: model weight loading at inference startup, training data pre-fetching when the dataloader cannot hide I/O latency behind compute, and checkpoint save/restore on multi-node jobs where you are periodically writing hundreds of gigabytes per checkpoint. These are the exact workloads GDS was built for.
The cuFile API: What Changes in Your Code
The cuFile API is deliberately close to POSIX. If you know pread and pwrite you already understand 80 percent of the surface. The key differences are registration and memory residence. Before a GPU buffer can participate in GDS I/O you must register it with cuFileBufRegister(). Before a file can participate you register its file descriptor with cuFileHandleRegister(). Reads and writes are then cuFileRead() and cuFileWrite() — same offset and size semantics as POSIX, but the buffer pointer is a GPU device pointer, not a CPU pointer.
What nvidia-fs.ko Actually Does
The kernel module nvidia-fs.ko registers with the NVMe driver and exposes a peer-to-peer DMA (P2PDMA) interface over PCIe. When cuFileRead lands, nvidia-fs coordinates with the GPU driver to provide the physical address of the GPU HBM target region, then initiates a DMA transfer directly from the NVMe controller queue into that address. The NVMe controller writes directly to GPU VRAM across the PCIe bus — no CPU read, no system-memory staging, no extra copy. As of CUDA 12.8 [VERIFY latest], the nvidia-fs driver is bundled as part of the CUDA toolkit rather than a separate package, simplifying installation.
The Compatibility Mode Trap
GDS ships with POSIX compatibility mode enabled by default. If any precondition for the direct path is not met — wrong filesystem type, unaligned buffer offset, cross-NUMA-node topology between storage and GPU — cuFile silently falls back to the CPU bounce-buffer path. Your code runs fine. Your benchmarks look normal. You just are not getting GDS at all. The fix is to set "allow_compat_mode": false in /etc/cufile.json during commissioning so any misconfiguration surfaces as an error rather than invisible degradation.
Gotcha
If you are using DeepSpeed ZeRO checkpoint saves, the aio section of ds_config.json controls whether GDS is used. Setting "use_gds": true without first confirming that nvidia-fs.ko is loaded and that the target path is on a GDS-capable filesystem means DeepSpeed falls back to POSIX writes — and you will see no error, only a slower checkpoint. Confirm with gdscheck -p before enabling it in config.
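A pre-flight check for this gotcha can be sketched in a few lines of Python. It assumes the module shows up as nvidia_fs in /proc/modules (the name lsmod reports) and that the DeepSpeed config has already been parsed into a dict; the function names are hypothetical, and the check does not replace gdscheck -p, which also validates the filesystem.

```python
def nvidia_fs_loaded(proc_modules_text):
    """True if a module named nvidia_fs appears in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "nvidia_fs":
            return True
    return False


def gds_requested(ds_config):
    """True if the aio section of a DeepSpeed config asks for GDS."""
    return bool(ds_config.get("aio", {}).get("use_gds", False))


def precheck(ds_config, proc_modules_text):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if gds_requested(ds_config) and not nvidia_fs_loaded(proc_modules_text):
        problems.append("use_gds is true but nvidia_fs is not loaded")
    return problems
```

On a real node you would pass json.load() of ds_config.json and the contents of /proc/modules, and refuse to launch the job if the returned list is non-empty.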
Filesystem and Hardware Requirements
GDS is not filesystem-agnostic. The requirements split cleanly by whether storage is local or network-attached.
Local NVMe
XFS and EXT4 in ordered journal mode are supported on local NVMe devices. The GPU and the NVMe controller must share the same PCIe root complex for P2PDMA to work efficiently. On DGX H100 systems the NVMe bays and the GPU PCIe slots are under the same root port; on custom builds you need to verify this in your platform topology map. Where the GPU and NVMe sit under different root ports, PCIe P2P traffic crosses the PCIe host bridge, and latency and bandwidth penalties make GDS less useful — cuFile routes through the bounce buffer in that case to avoid the penalty.
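One way to check the topology on a custom build is to compare the resolved sysfs paths of the two PCI devices, since each path component below the host bridge is an upstream bridge or switch port. The sketch below works on path strings so it can be tried without hardware; the helper names are hypothetical, and "shares at least one bridge below the host root" is my simplification of the requirement, not NVIDIA's eligibility rule.

```python
import re

_ROOT = re.compile(r"^pci[0-9a-f]{4}:[0-9a-f]{2}$")
_BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def pci_chain(sysfs_realpath):
    """Keep only the PCI hops of a resolved sysfs device path."""
    parts = [p for p in sysfs_realpath.split("/") if p]
    return [p for p in parts if _ROOT.match(p) or _BDF.match(p)]


def shared_upstream(path_a, path_b):
    """Common leading hops of two devices (may be empty or just the root)."""
    common = []
    for a, b in zip(pci_chain(path_a), pci_chain(path_b)):
        if a != b:
            break
        common.append(a)
    return common


def share_bridge(path_a, path_b):
    """True if both devices sit under at least one common bridge."""
    return len(shared_upstream(path_a, path_b)) >= 2
```

On a live system the inputs would come from os.path.realpath("/sys/bus/pci/devices/<bdf>") for the GPU and the NVMe controller. Treat a False result as a prompt to inspect the platform topology map, not as a final verdict.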
Network Storage: NVMe-oF, Lustre, WekaFS, VAST, BeeGFS
For network-attached storage the direct path requires GPUDirect RDMA: the NIC must be RDMA-capable (ConnectX-6 Dx or later on InfiniBand or RoCE), and the storage server must support RDMA on its side. This is where GDS and GPUDirect RDMA are tightly coupled: GDS over NVMe-oF uses the NIC as the DMA engine rather than the local NVMe controller. Data flows: storage server NVMe -> storage server RNIC -> InfiniBand/RoCE fabric -> client RNIC -> GPU HBM, all via DMA, bypassing both the storage-server CPU and the client CPU.
Verified supported network filesystems include DDN EXAScaler, WekaFS, VAST NFS, BeeGFS, IBM Spectrum Scale, NetApp ONTAP, and Amazon FSx for Lustre. Standard NFS without RDMA transport does not qualify — cuFile falls back to the POSIX path automatically. If you are on plain NFS over TCP, GDS gives you nothing.
| Storage Type | Filesystem | GDS Mechanism | NIC Required | Notes |
|---|---|---|---|---|
| Local NVMe | XFS, EXT4 (ordered) | PCIe P2PDMA | None (PCIe only) | GPU and NVMe must share PCIe root port |
| NVMe-oF (local SNAP) | XFS, EXT4 | DOCA SNAP + DMA | BlueField-3 DPU [VERIFY] | nvidia-fs not required in CUDA 12.8+ |
| NVMe-oF (fabric) | XFS, EXT4 | GPUDirect RDMA via NIC | ConnectX-6 Dx or later | Server must also support RDMA |
| Lustre / WekaFS / VAST | Distributed FS | GPUDirect RDMA via NIC | ConnectX-6 Dx or later | RDMA must be enabled on server side |
| NFS (TCP) | NFS v3/v4 | None (compat fallback) | N/A | No GDS benefit; cuFile uses POSIX path |
| BeeGFS / IBM Spectrum Scale | Distributed FS | GPUDirect RDMA via NIC | ConnectX-6 Dx or later | Vendor plugin required; verify version |
When GDS Actually Helps vs When It Does Not
Not every storage workload benefits from eliminating the CPU bounce buffer. The gains are proportional to how much of your runtime is spent moving large, sequential, aligned data between storage and GPU memory.
Cases where GDS earns its complexity
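- Checkpoint restore at scale. A 70B-parameter model in BF16 is ~140 GB of weights alone; adding FP32 master weights and Adam optimizer state (roughly 12 more bytes per parameter) puts a full training checkpoint near 1 TB, sharded across the job's GPUs and rewritten every checkpoint cycle. GDS with local NVMe can cut restore time by 2-4x depending on NVMe device count and PCIe topology.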
- Data-loader-bound training. When your dataloader cannot fully hide I/O behind GPU compute — common with image datasets at high resolution, or genomics pipelines — GDS lowers CPU utilization and improves sustained read bandwidth, letting the GPU see data faster.
- Large model weight loading at inference startup. NIM containers loading a 70B or 405B model from local NVMe benefit from GDS during the cold-start phase. This is a one-time cost but matters in autoscaling scenarios where pods start frequently.
- RAPIDS cuDF data pipelines. ETL workloads that read large Parquet or ORC files directly from NVMe into GPU memory for in-GPU analytics are a textbook GDS use case.
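For the checkpoint case, the sizing arithmetic is worth doing explicitly before budgeting NVMe bandwidth. This is a minimal sketch: the 2 bytes per parameter for BF16 weights is standard, while the 12 extra bytes per parameter (FP32 master weights plus two Adam moments) is an assumption about a typical mixed-precision setup and will differ for other optimizers. The function names are illustrative.

```python
def checkpoint_bytes(params, weight_bytes=2, optimizer_bytes=0):
    """Total checkpoint size in bytes for a model with `params` parameters."""
    if params < 0 or weight_bytes < 0 or optimizer_bytes < 0:
        raise ValueError("inputs must be non-negative")
    return params * (weight_bytes + optimizer_bytes)


def per_gpu_shard_bytes(total_bytes, gpus):
    """Even share of a sharded checkpoint written by each GPU."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return total_bytes / gpus


if __name__ == "__main__":
    params = 70_000_000_000
    weights_only = checkpoint_bytes(params)
    full_state = checkpoint_bytes(params, optimizer_bytes=12)
    print("weights only (GB):", weights_only / 1e9)
    print("with optimizer state (GB):", full_state / 1e9)
    print("per-GPU shard at 100 GPUs (GB):", per_gpu_shard_bytes(full_state, 100) / 1e9)
```

Weights-only restores (inference cold start) and full training-state restores differ by roughly 7x under these assumptions, which changes how much a faster storage path is worth.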
Cases where GDS adds complexity without payoff
- Small random I/O. GDS is optimized for large, aligned, sequential transfers. Workloads that issue thousands of small reads (like key-value lookup or metadata-heavy operations) do not benefit and may regress due to GDS overhead on small transfers.
- Compute-bound training. If your training job is GPU-compute-bound and the dataloader is already hiding all I/O latency, GDS changes nothing. Profile first with nvidia-smi dmon and dataloader queue depth before adding GDS complexity.
- Standard NFS without RDMA. As noted above: no direct path is possible. You are on the POSIX fallback regardless.
- Mismatched PCIe topology on commodity servers. If your storage and GPU are on different NUMA nodes with no P2PDMA path, GDS falls back to the bounce buffer and you get nothing for the added operational complexity.
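Because alignment keeps coming up as a condition for the direct path, it helps to check requests mechanically. The sketch below assumes a 4 KiB granularity for file offset and transfer size; that value is my assumption for illustration, so confirm the figure for your GDS release and filesystem. The helper names are hypothetical.

```python
ALIGN = 4096  # assumed alignment granularity in bytes


def is_aligned(offset, size, align=ALIGN):
    """True if both the file offset and the transfer size are aligned."""
    return offset % align == 0 and size % align == 0


def aligned_span(offset, size, align=ALIGN):
    """Smallest aligned (offset, size) window that covers the request."""
    start = (offset // align) * align
    end = -(-(offset + size) // align) * align  # ceiling division
    return start, end - start
```

A loader that reads tensors at arbitrary byte offsets can call aligned_span, issue the larger aligned read, and slice out the bytes it wanted. Whether that beats many small unaligned reads is something to measure, not assume.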
| Workload | GDS Value | Bottleneck Removed | Requirement |
|---|---|---|---|
| Checkpoint save/restore (large model) | High | CPU copy + sysmem BW | Local NVMe or NVMe-oF + GDS FS |
| I/O-bound training dataloader | High | CPU staging, dataloader stall | Local NVMe preferred, topology check |
| Large model weight loading (inference cold start) | Medium-High | Load time, CPU utilization spike | Local NVMe, GDS-aware container |
| RAPIDS cuDF / GPU analytics ETL | Medium-High | CPU copy bottleneck on large files | Local NVMe, cuDF GDS flag enabled |
| Compute-bound training (GPU saturated) | Low / None | Not bottlenecked on I/O | Profile first; skip if I/O queue is deep |
| Small random reads (KV store, metadata) | None / Negative | Not a large-sequential workload | Do not use GDS here |
| NFS over TCP (no RDMA) | None | Falls back to POSIX path | Upgrade to RDMA-capable storage or accept fallback |
Validating GDS with gdscheck and cufile.json
The single most important operational habit with GDS is running gdscheck -p after installation and after any topology or filesystem change. It tests whether the current platform meets the requirements for a real GDS data path and reports exactly which conditions are or are not met.
Artifact: gdscheck output and cufile.json configuration
Run gdscheck -p after installing the GDS package. A healthy system outputs:
$ gdscheck -p
GDS release version: 1.16.x [VERIFY exact]
nvidia_fs version: 2.x libcufile version: 2.x [VERIFY]
============
ENVIRONMENT:
============
GPU: NVIDIA H100 SXM5 80GB -- GDS supported
PCI: GPU 0000:01:00.0 NVMe 0000:02:00.0 (same root port) -- P2PDMA eligible
NVMe: /dev/nvme0n1 XFS block_size=4096 -- compatible
RDMA: ConnectX-6 Dx detected MOFED 5.x -- RDMA capable
============
RESULT: GDS ENABLED
============
If the result shows GDS DISABLED with reason P2PDMA not supported between GPU 00:01.0 and NVMe 00:05.0, your GPU and NVMe are on different PCIe root ports. Either recable/reslot the NVMe, or accept the POSIX fallback and remove GDS from your config.
The minimum /etc/cufile.json configuration for a production checkpoint workload:
{
  "logging": {
    "level": "WARN"
  },
  "profile": {
    "nvtx": false,
    "cufile_stats": 0
  },
  "execution": {
    "max_io_queue_depth": 128
  },
  "properties": {
    "max_direct_io_size_kb": 16384,
    "max_device_cache_size_kb": 131072,
    "use_poll_mode": false,
    "allow_compat_mode": false
  }
}
Key flag: "allow_compat_mode": false causes cuFile to return an error rather than silently falling back to CPU bounce-buffer I/O when a precondition is not met. Set this during commissioning, verify gdscheck -p passes, then ship. Failure mode: if you set this flag but your filesystem is plain NFS, any cuFileRead call returns CU_FILE_INVALID_VALUE — which is the correct signal that your storage path is not GDS-eligible.
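A commissioning script can assert the flag instead of trusting that someone set it. This sketch assumes the flag lives at properties.allow_compat_mode, as in the configuration above, and that the file may contain // line comments, which as far as I know the stock cufile.json does and which Python's json module rejects. The function names are hypothetical.

```python
import json


def strip_line_comments(text):
    """Remove // comments that sit outside JSON string literals."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif line[i:i + 2] == "//":
                cut = i
                break
        cleaned.append(line[:cut])
    return "\n".join(cleaned)


def compat_mode_allowed(config_text):
    """Read properties.allow_compat_mode; a missing key counts as True."""
    config = json.loads(strip_line_comments(config_text))
    return bool(config.get("properties", {}).get("allow_compat_mode", True))
```

Treating a missing key as True mirrors the default described earlier: unless someone turned compatibility mode off, assume the silent fallback is still possible.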
How GDS and GPUDirect RDMA Interact
GPUDirect RDMA and GDS are often mentioned together but they address different parts of the path. GPUDirect RDMA is the technology that allows a remote NIC to DMA data directly into GPU memory without CPU involvement — it is the transport mechanism. GDS is the storage API and file-system contract that uses GPUDirect RDMA as its network transport when the storage is remote.
In a local-NVMe scenario, GPUDirect RDMA is not involved at all — the NVMe controller and GPU share the PCIe bus, and P2PDMA is the mechanism. In a network-storage scenario (NVMe-oF, Lustre, WekaFS), GPUDirect RDMA becomes the critical link: the RNIC receives data from the storage fabric and DMAs it to GPU HBM. Without RDMA on the NIC, there is no direct path for network storage — the data must pass through the CPU network stack and system memory before reaching the GPU.
This is why the network operator configuration covered in Part 8 matters for storage as well: a fabric that is not configured for GPUDirect RDMA (missing nvidia-peermem, the successor to nv_peer_mem, or the equivalent MOFED peer memory driver) will silently degrade GDS network storage performance to the POSIX path. It is the same trap, upstream.
gdscheck -p from a compute node against the Weka mount point, (3) set allow_compat_mode: false in /etc/cufile.json, (4) run a quick benchmark with gds_benchmark to confirm throughput is in the expected range before enabling GDS in your training framework. Do not enable GDS in production config and assume it is working — the benchmark step catches the silent fallback cases that gdscheck alone sometimes misses when RDMA is partially misconfigured.
gds_benchmark on your specific platform before budgeting around these numbers [VERIFY].
Worked example
A team running Llama 3 70B fine-tuning on 32x H100 GPUs with DeepSpeed ZeRO-3 was seeing 8-minute checkpoint save times to a local NVMe RAID across four DGX H100 nodes. After enabling GDS with the following ds_config.json aio block:
{
  "aio": {
    "use_gds": true,
    "block_size": 1048576,
    "queue_depth": 8,
    "thread_count": 1,
    "single_submit": false
  }
}
Checkpoint save dropped to approximately 2.5 minutes — a 3x reduction. The gain came almost entirely from removing the CPU copy path. CPU utilization during checkpoint dropped from ~70% (pinning all cores on the copy) to ~12%. [AUTHOR: add specific NVMe model and RAID config from a real engagement]
nvidia-fs.ko), cufile.json configuration changes, and, for network storage, RDMA NIC driver updates. Test in a staging environment first. Roll back by removing the nvidia-fs module and reverting cufile.json — cuFile will fall back to POSIX I/O automatically. Verify that your container runtime exposes /dev/nvidia* and /dev/nvidia-fs device nodes inside training pods; the GPU Operator handles this automatically when GDS is enabled in the operator configuration, but manual deployments require explicit device passthrough.
My Verdict: When to Invest in GDS
GDS is a real gain for specific situations, not a general-purpose performance upgrade. My recommendation is direct: invest in GDS if you are running large-model training with checkpoint intervals shorter than 15 minutes, or if your dataloader wait time is measurable (use nvidia-smi dmon -s u and check GPU utilization — sustained dips below 85% during training steps are a signal). In those cases GDS consistently removes a real bottleneck.
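The 85% rule of thumb is easy to automate against captured dmon output. The sketch below assumes the text format nvidia-smi dmon -s u prints: header lines starting with #, one of which names an sm column, followed by whitespace-separated rows. The sample in the usage note is illustrative, not captured from a real run, and the function names are hypothetical.

```python
def sm_utilization(dmon_text):
    """Extract the sm column (percent) from dmon-style text."""
    column = None
    values = []
    for line in dmon_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            names = stripped.lstrip("#").split()
            if "sm" in names:
                column = names.index("sm")
            continue
        fields = stripped.split()
        if column is not None and len(fields) > column and fields[column].isdigit():
            values.append(int(fields[column]))
    return values


def fraction_below(values, threshold=85):
    """Share of samples whose utilization is below the threshold."""
    if not values:
        raise ValueError("no samples")
    return sum(1 for v in values if v < threshold) / len(values)
```

Feed it a few minutes of output captured during steady-state training steps. Rows for all GPUs are pooled, so filter by the gpu column first if you want a per-device answer. A large fraction below the threshold suggests the dataloader is starving the GPU, though it can also indicate communication stalls, so confirm with dataloader timing before blaming storage.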
When NOT to invest: do not add GDS complexity to an inference deployment where model weights are already resident in HBM from the prior container start — GDS only helps loading, not runtime inference. Do not add it to any cluster where the NVMe drives and GPU PCIe slots are on separate root ports and you cannot change the physical cabling. Do not add it when your storage is plain NFS over TCP and a storage infrastructure upgrade is not on the near-term roadmap.
What to validate before shipping it: confirm gdscheck -p returns GDS ENABLED, confirm allow_compat_mode: false is set and no errors appear during a test read of a representative file size, and run gds_benchmark (in the stock GDS tools package the load generator is named gdsio) to establish a throughput baseline. If the benchmark shows less than 1.5x over the POSIX path, your topology likely has a P2PDMA impediment and you should diagnose before enabling in training config.
The operational overhead is real but bounded: one kernel module, one JSON file, and awareness that filesystem or topology changes can silently invalidate the configuration. Budget a day for commissioning and testing on a new node type. For clusters that checkpoint frequently at scale, the CPU utilization savings alone often justify the setup cost within the first week of production use.
Next in the series, we shift to the host software stack — power, cooling, and the density reality of liquid-cooled NVL72 deployments. If you are planning a 72-GPU rack, Part 10 covers what the spec sheet does not tell you about facility requirements. For questions on the NVIDIA AI stack, the NVIDIA AI Guide is the full index.