You deploy the GPU Operator, pods can see the GPUs, and then you try to run a multi-node training job. NCCL hangs at the AllReduce barrier. The NICs are visible to the OS, ibstat shows the ports, but the pod never gets an RDMA device. The problem is that Kubernetes does not know how to hand an InfiniBand Virtual Function to a pod. That is precisely what the NVIDIA Network Operator solves, and it does it without you manually managing MOFED across 32 nodes.
What the Network Operator Actually Does
The Network Operator is an open-source Kubernetes operator (GitHub: Mellanox/network-operator) that manages the lifecycle of every component in the accelerated-fabric stack on a Kubernetes node. Its control-plane object is a single CRD called NicClusterPolicy. You write one manifest; the operator reconciles MOFED driver DaemonSets, device plugin DaemonSets, Multus CNI, SR-IOV operator integration, and IB Kubernetes across every matching node.
As of v26.1.1 (April 2026) [VERIFY exact April release date], the operator ships these sub-components, each independently toggled in the NicClusterPolicy spec:
- ofedDriver (MOFED / DOCA-OFED): containerised kernel driver for ConnectX NICs, compiled and loaded per kernel version.
- rdmaSharedDevicePlugin: exposes RDMA devices as Kubernetes extended resources (rdma/rdma_shared_dev_a) so pods can request them via resources.limits.
- sriovDevicePlugin: works alongside the SR-IOV Network Operator to expose SR-IOV Virtual Functions as allocatable resources.
- secondaryNetwork: deploys Multus CNI (the meta-plugin), container-networking plugins (macvlan, ipvlan, host-device), and IPAM (whereabouts or nv-ipam).
- ibKubernetes: a daemon that reads pod network annotations and programs GUID assignments on InfiniBand subnet managers.
- nvIpam: NVIDIA IP Address Manager, a CNI IPAM plugin with pool-based allocation across nodes.
The MOFED Container and Why It Matters
MOFED (Mellanox OpenFabrics Enterprise Distribution), now also shipped as DOCA-OFED, is the kernel-level driver suite for ConnectX NICs. It provides the InfiniBand verbs stack, RoCE acceleration, and the peer-memory interface that the nvidia-peermem module (the successor to the older nv_peer_mem) plugs into. That module is what enables GPUDirect RDMA: the NIC can DMA directly into GPU memory without staging the data in host memory first.
Installing MOFED on bare metal used to mean: download a tarball, run an installer, compile against your kernel headers, and pray the kernel did not update. The Network Operator replaces that with a containerised driver DaemonSet. The ofedDriver init container compiles the driver against the host kernel at startup, loads it into the host kernel via hostPath, and registers a node label (network.operator.nvidia.com/mofed-driver-upgraded) when complete. The rest of the DaemonSets wait on that label before starting. If the kernel updates, the operator redoes the compile on the next node restart.
Practical gotcha: MOFED version pinning is non-negotiable. The ofedDriver.version field in NicClusterPolicy must match the version the GPU Operator was told to expect. When the two diverge, nvidia-peermem fails to load, GPUDirect RDMA silently falls back to staging transfers through host memory, and your NCCL bandwidth drops well below the 400 Gb/s an NDR port can carry. The symptom looks exactly like a slow network rather than a driver mismatch.
If you see WARN Connect to 0.0.0.0 failed or AllReduce stalls on multi-node jobs immediately after a node OS update, check lsmod | grep nvidia_peermem on each node before blaming the network. A kernel update that invalidated the compiled MOFED module is the most common cause. The operator should recompile automatically on pod restart; if it does not, check the ofedDriver DaemonSet pod logs for compile errors related to missing kernel headers.
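The lsmod check above can be scripted so it runs the same way on every node. The following is a minimal sketch, not an official tool: it parses lsmod text and reports which modules are missing. The list of required modules (mlx5_core, ib_core, nvidia, nvidia_peermem) is an assumption chosen for illustration; adjust it to your driver stack.

```python
def loaded_modules(lsmod_output: str) -> set[str]:
    """Return module names from `lsmod` text (first column, header skipped)."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def gpudirect_ready(lsmod_output: str) -> tuple[bool, list[str]]:
    """Check for the modules this sketch assumes GPUDirect RDMA needs.

    The required list is an illustrative assumption, not an NVIDIA-defined set.
    """
    required = ["mlx5_core", "ib_core", "nvidia", "nvidia_peermem"]
    present = loaded_modules(lsmod_output)
    missing = [name for name in required if name not in present]
    return (not missing, missing)
```

Feed it the output of lsmod captured from each node (for example via kubectl debug or ssh); a non-empty missing list that contains nvidia_peermem points at the driver mismatch rather than the fabric.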
RDMA Shared Device Plugin vs SR-IOV: Which Mode to Use
The Network Operator supports two distinct modes for exposing RDMA to pods, and choosing the wrong one wastes either performance or flexibility.
RDMA Shared Device Plugin
The shared device plugin exposes the physical NIC RDMA device as a countable Kubernetes resource (e.g., rdma/rdma_shared_dev_a). Multiple pods on the same node can share the underlying physical device. This is the simpler path: no SR-IOV hardware partitioning, no VF count configuration, no firmware settings. It works when your pods each need RDMA access but do not need guaranteed, isolated NIC bandwidth. A typical use case is a small research cluster where 2-3 distributed training pods share a ConnectX-7 port without strict bandwidth guarantees.
SR-IOV VF Mode
SR-IOV creates hardware-level Virtual Functions on the NIC. Each VF is a dedicated slice of the physical NIC with isolated queues, bandwidth, and RDMA context. The pod gets a VF as if it owned a physical NIC. This is the right choice for production multi-tenant clusters where workloads must not interfere with each other, or where you need deterministic latency on InfiniBand. The constraint: the physical NIC must support SR-IOV, the BIOS must have IOMMU enabled, and the number of VFs is set in firmware (common range: 2-127 per port; ConnectX-7 supports up to 127 [VERIFY exact max per port]). You burn VF count at deployment time. Running a 64-GPU node with 8 training pods each needing 2 VFs means you need 16 VFs configured before the Network Operator runs.
| Dimension | RDMA Shared Device Plugin | SR-IOV VF Mode |
|---|---|---|
| Hardware requirement | ConnectX NIC with RDMA support | ConnectX NIC + IOMMU + SR-IOV in firmware |
| Isolation | Shared; pods compete for NIC queues | Hardware-isolated per VF |
| Setup complexity | Low; one resource config in NicClusterPolicy | High; firmware VF count + SriovNetworkNodePolicy CRD |
| Bandwidth guarantee | None | Yes, per VF |
| Multi-tenant suitability | Research / single team | Production multi-tenant |
| GPUDirect RDMA | Works (shared path) | Works (preferred for deterministic perf) |
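Because SR-IOV VF counts are fixed before the operator runs, it helps to do the VF arithmetic up front. This sketch reproduces the 8 pods times 2 VFs calculation from the SR-IOV section; the function and field names are hypothetical, and the optional firmware limit is a parameter you supply, not a value this sketch knows.

```python
from typing import Optional


def check_vf_budget(pods_per_node: int, vfs_per_pod: int, configured_vfs: int,
                    firmware_max_vfs: Optional[int] = None) -> dict:
    """Compare the VFs a node's pods will request with the VFs set in firmware."""
    needed = pods_per_node * vfs_per_pod
    result = {
        "needed": needed,
        "configured": configured_vfs,
        "fits": needed <= configured_vfs,
        "spare": configured_vfs - needed,
    }
    if firmware_max_vfs is not None:
        # Only meaningful if you pass the real per-port limit for your NIC.
        result["within_firmware_limit"] = configured_vfs <= firmware_max_vfs
    return result
```

For the example in the text, check_vf_budget(8, 2, 16) reports that 16 VFs are needed and the budget fits with no spare capacity.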
Multus and Secondary Networks: How Pods Get an RDMA Interface
Kubernetes has one primary network interface per pod, managed by the cluster CNI (Calico, Cilium, Flannel, etc.). RDMA workloads need a second interface wired directly to the accelerated fabric. That is where Multus comes in.
Multus is a CNI meta-plugin: it calls your primary CNI as normal, then reads a pod annotation to attach additional network interfaces using any secondary CNI. The Network Operator installs Multus and the secondary CNI plugins as part of the secondaryNetwork spec in NicClusterPolicy. The secondary interface types supported are:
- Host Device Network: passes the entire physical NIC (or a VF) into the pod network namespace. Used for both SR-IOV VF attachment and RDMA shared mode.
- MacVLAN Network: layer-2 sub-interface sharing the parent NIC MAC space. Typically used for RoCE in Ethernet fabrics.
- IPoIB Network: IP over InfiniBand. Required for InfiniBand fabrics where you need TCP/IP alongside RDMA verbs.
The pod declares the secondary interface via a k8s.v1.cni.cncf.io/networks annotation referencing a NetworkAttachmentDefinition object. That NAD is created once per cluster and describes which CNI plugin to call and with what parameters. The RDMA resource request goes in resources.limits separately.
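To make the annotation format concrete, here is a small sketch of how the short form of k8s.v1.cni.cncf.io/networks can be read: a comma-separated list of entries shaped like [namespace/]name[@interface]. It is an illustration only and does not handle the JSON list form that Multus also accepts; NetworkRef and parse_networks_annotation are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkRef:
    name: str
    namespace: Optional[str] = None   # None means "the pod's own namespace"
    interface: Optional[str] = None   # None means Multus picks net1, net2, ...


def parse_networks_annotation(value: str) -> list[NetworkRef]:
    """Parse the short (non-JSON) form of the Multus networks annotation."""
    refs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        item, _, ifname = item.partition("@")
        namespace, sep, name = item.rpartition("/")
        refs.append(NetworkRef(name=name,
                               namespace=namespace if sep else None,
                               interface=ifname or None))
    return refs
```

An unqualified name such as rdma-net is looked up in the pod's namespace, which is why the NAD and the pod need to agree on where the definition lives.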
GPUDirect RDMA: Network Operator and GPU Operator Together
GPUDirect RDMA is the technology that lets a ConnectX NIC DMA data directly into or out of GPU HBM without involving the CPU. At the kernel level, this requires the nvidia-peermem module, which acts as the bridge between the NVIDIA GPU driver and the MOFED RDMA stack. Both drivers must be loaded and must agree on the ABI version.
The GPU Operator manages the GPU driver side. When you install it with --set driver.rdma.enabled=true --set driver.useOpenKernelModules=true, it will install and load nvidia-peermem after the MOFED module is present. The Network Operator manages MOFED. These two operators need to talk to each other only implicitly: the GPU Operator waits for a node label that confirms MOFED is loaded before starting the peermem DaemonSet. That label is set by the ofedDriver DaemonSet.
The deployment order in practice: install the Network Operator first so MOFED loads, then install (or restart) the GPU Operator. If you reverse this, the GPU Operator driver pod starts, finds no MOFED, skips peermem, and finishes. By the time MOFED loads later, peermem is absent. A driver pod restart fixes it, but it is an easy trap when you are doing a fresh cluster bring-up under time pressure.
In Practice
On a 16-node H100 cluster with ConnectX-7 and NDR InfiniBand, after a correct install sequence, kubectl describe node on a GPU+NIC node should show both nvidia.com/gpu: 8 and rdma/rdma_shared_dev_a: 1 (or an SR-IOV VF resource like nvidia.com/cx7_sriov_rdma: 8) under Allocatable. If only GPUs appear, the Network Operator DaemonSets have not completed. Check the ofedDriver pod first; it is almost always the blocker.
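The Allocatable check can be automated against the JSON that kubectl get node <nodename> -o json returns. This is a minimal sketch under the assumption that the RDMA resource is named rdma/rdma_shared_dev_a as in this article; pass a different name for SR-IOV resources. The verdict strings are made up for illustration.

```python
def check_allocatable(node: dict, rdma_resource: str = "rdma/rdma_shared_dev_a") -> dict:
    """Summarise GPU and RDMA allocatable counts from a decoded Node object."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Extended resources are integer quantities serialised as strings.
    gpus = int(allocatable.get("nvidia.com/gpu", "0"))
    rdma = int(allocatable.get(rdma_resource, "0"))
    if gpus > 0 and rdma > 0:
        verdict = "ready"
    elif gpus > 0:
        verdict = "gpu-only: check the ofedDriver and device plugin pods"
    else:
        verdict = "no GPUs advertised"
    return {"gpus": gpus, "rdma": rdma, "verdict": verdict}
```

Load the node JSON with json.loads and pass the resulting dict; a gpu-only verdict matches the situation described above where the Network Operator DaemonSets have not finished.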
Operational Artifact: NicClusterPolicy and Pod Annotation
A minimal but realistic NicClusterPolicy for an InfiniBand cluster using RDMA shared device plugin, with DOCA-OFED driver pinned to v25.7 [VERIFY exact shipping version string for 2026], looks like this:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "25.07-0.6.1.0-0" # [VERIFY exact tag]
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      safeLoad: true
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1 # [VERIFY]
    config: |
      {
        "configList": [{
          "resourceName": "rdma_shared_dev_a",
          "rdmaHcaMax": 63,
          "selectors": {
            "ifNames": ["ens1f0"]
          }
        }]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.5.0 # [VERIFY]
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.0.2 # [VERIFY]
    ipamPlugin:
      image: whereabouts
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.7.0 # [VERIFY]
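The embedded JSON in rdmaSharedDevicePlugin.config is the part of this manifest most likely to be wrong, because the interface name must exist on the node. The sketch below is a hypothetical pre-flight check, not the plugin's own validation: it parses the config string and compares the ifNames selector against interface names you collected from the node (for example from ip link). The rules it applies are assumptions chosen for this example.

```python
import json


def validate_rdma_config(config_text: str, node_ifnames: set[str]) -> list[str]:
    """Return human-readable problems found in a shared device plugin config."""
    problems = []
    entries = json.loads(config_text).get("configList", [])
    if not entries:
        problems.append("configList is empty or missing")
    for entry in entries:
        name = entry.get("resourceName", "")
        if not name:
            problems.append("entry without resourceName")
        hca_max = entry.get("rdmaHcaMax")
        # Assumption of this sketch: a usable entry has a positive integer here.
        if not isinstance(hca_max, int) or hca_max < 1:
            problems.append(f"{name}: rdmaHcaMax should be a positive integer")
        wanted = entry.get("selectors", {}).get("ifNames", [])
        if wanted and not any(ifname in node_ifnames for ifname in wanted):
            problems.append(f"{name}: none of {wanted} exist on this node")
    return problems
```

An empty list means the config is at least self-consistent for that node; a message about missing interfaces usually means the NIC is named differently than the policy expects.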
A NetworkAttachmentDefinition for the RDMA secondary network:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  namespace: gpu-workloads
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_dev_a
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "ens1f0"
  }'
The training pod requests the RDMA resource and attaches the secondary network:
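apiVersion: v1
kind: Pod
metadata:
  name: nccl-training-pod
  namespace: gpu-workloads # same namespace as the rdma-net NAD so the short name resolves
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/nemo:24.12 # [VERIFY latest NeMo tag]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"] # RDMA pins memory; commonly needed, confirm for your image
    resources:
      limits:
        nvidia.com/gpu: "8"
        rdma/rdma_shared_dev_a: "1"
    env:
    - name: NCCL_IB_HCA
      value: "mlx5_0"
    - name: NCCL_DEBUG
      value: "INFO"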
Expected result: pod starts with two network interfaces (eth0 cluster network, net1 RDMA network); kubectl exec -- ibstat inside the pod shows the ConnectX port in Active state; NCCL all-reduce bus bandwidth approaches the physical link speed.
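To judge whether all-reduce bandwidth "approaches the physical link speed", it helps to convert a measurement into bus bandwidth the way nccl-tests reports it: algorithm bandwidth multiplied by 2(n-1)/n for n ranks. The sketch below applies that formula; comparing the result to a single 400 Gb/s link is a simplification assumed here, since nodes with several NICs have more aggregate capacity.

```python
def allreduce_bus_bandwidth_gbps(message_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in Gb/s for an all-reduce, using the 2(n-1)/n correction."""
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    algorithm_bw = message_bytes / seconds              # bytes per second
    bus_bw = algorithm_bw * 2 * (ranks - 1) / ranks     # bytes per second on the wire
    return bus_bw * 8 / 1e9                             # convert to gigabits per second


def link_utilisation(bus_bw_gbps: float, link_gbps: float = 400.0) -> float:
    """Fraction of the assumed link rate that the measured bus bandwidth reaches."""
    return bus_bw_gbps / link_gbps
```

A result far below the link rate on a healthy fabric is the cue to go back to the peermem and MOFED checks before tuning NCCL.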
Common failure mode 1 (RDMA resource not advertised): pod scheduling fails with Insufficient rdma/rdma_shared_dev_a. Cause: either the ifNames selector in NicClusterPolicy does not match the actual interface name on the node (run ip link to verify), or the device plugin DaemonSet has not reconciled yet. Check kubectl get pods -n nvidia-network-operator for DaemonSet pods still in Init state.
Common failure mode 2 (MOFED mismatch): pod starts, NCCL reports WARN No interface found in cudaDevices. Check kubectl logs -n nvidia-network-operator ds/mofed-ubuntu22.04-ds -c mofed-container for version string. Cross-reference against what ofed_info reports on the host after the module loads. If they diverge, update ofedDriver.version in NicClusterPolicy and delete the stale DaemonSet pods to force a reconcile.
InfiniBand vs RoCE: What Changes in the Operator Config
The Network Operator handles both InfiniBand and RoCE, but the configuration paths diverge in two places.
For InfiniBand, you need ibKubernetes enabled in NicClusterPolicy. This daemon watches pod annotations and programs GUID assignments on the InfiniBand subnet manager (MLNX-OS or UFM) so that each pod port gets a unique GUID for routing. Without it, multiple pods on the same node share a GUID, causing routing ambiguity and dropped traffic. For IPoIB (TCP over InfiniBand), the secondaryNetwork must include the ipoib CNI type in the NAD config. NCCL uses RDMA verbs directly and does not need IPoIB, but some MPI runtimes fall back to TCP and require it.
For RoCE (RDMA over Converged Ethernet), the NIC must be in Ethernet mode and PFC (Priority Flow Control) plus ECN (Explicit Congestion Notification) must be configured on the switch fabric. The Network Operator does not configure the switch; that is still a day-0 network-team task. On the node side, you use the macvlan or host-device secondary CNI type rather than ipoib. NCCL still drives RoCE through its verbs transport, so NCCL_IB_HCA continues to select the device; what changes is that you typically also set NCCL_IB_GID_INDEX to the RoCE v2 GID and point NCCL_SOCKET_IFNAME at the interface used for bootstrap traffic. RoCE is the right choice for Spectrum-X Ethernet fabrics from Part 8 of this series.
| Config Element | InfiniBand | RoCE / Ethernet |
|---|---|---|
| ibKubernetes | Required (GUID management) | Not used |
| Secondary CNI type | ipoib or host-device | macvlan or host-device |
| Switch requirement | IB subnet manager (UFM / MLNX-OS) | PFC + ECN on Ethernet switches |
| NCCL env hint | NCCL_IB_HCA=mlx5_0 | NCCL_IB_HCA=mlx5_0 plus NCCL_IB_GID_INDEX (RoCE v2 GID); NCCL_SOCKET_IFNAME=net1 for bootstrap |
| Latency profile | Lower baseline (sub-1 microsecond) [VERIFY] | Slightly higher; switch-dependent |
| NVL72 / DGX platform default | NDR IB (400 Gb/s per port) | Spectrum-X (optional) |
When NOT to Deploy the Network Operator
The Network Operator adds real operational weight. Before you install it, answer these four questions honestly:
- Do your nodes have ConnectX NICs? If your GPU nodes use commodity Ethernet (Intel, Broadcom) with no RDMA support, the Network Operator cannot provision anything useful. MOFED will refuse to load.
- Do you run distributed training or inference across nodes? Single-node GPU workloads (even 8xH100) do not need RDMA; NVLink handles intra-node communication. The Network Operator is a multi-node tool.
- Is your Kubernetes version compatible? Kubernetes 1.28 or later is required for current Network Operator versions [VERIFY minimum K8s version for v26.x].
- Do you have SR-IOV in your BIOS and IOMMU enabled? Without this, you can use shared RDMA mode but not SR-IOV VFs, which limits multi-tenant isolation significantly.
If you answered no to either of the first two, skip it. A cluster running single-node fine-tuning jobs on H100s with standard CNI is simpler, faster to debug, and cheaper to operate. Add the Network Operator when the workloads demand it.
Day-2 Operations: Upgrades, Debugging, and Node Drain
The Network Operator includes a built-in upgrade controller for MOFED via the upgradePolicy block in NicClusterPolicy. With autoUpgrade: true and maxParallelUpgrades: 1, the operator cordons and drains one node at a time, unloads MOFED, loads the new version, and uncordons the node. This is the same rolling-drain pattern used for node OS updates.
One subtlety: safeLoad: true in the upgradePolicy tells the operator to cordon the node (and drain it, if drain is enabled) before the driver container loads MOFED, and to keep it cordoned until the driver is ready. Loading the driver resets the NIC, so any pod that already holds a secondary interface on that node would lose it. Without safeLoad, workloads can be scheduled onto a node while MOFED is still compiling and then have their RDMA interface pulled out from under them.
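The cost of a conservative maxParallelUpgrades value is wall-clock time, and it is worth estimating before a maintenance window. This sketch approximates the rolling upgrade as fixed-size batches, which is a simplification of how the controller actually schedules nodes; the per-node duration is a number you supply from your own measurements.

```python
def upgrade_batches(nodes: list[str], max_parallel: int) -> list[list[str]]:
    """Split nodes into consecutive groups of at most max_parallel."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [nodes[i:i + max_parallel] for i in range(0, len(nodes), max_parallel)]


def estimated_minutes(node_count: int, max_parallel: int, minutes_per_node: float) -> float:
    """Rough duration: number of batches times the time one node takes."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    batches = -(-node_count // max_parallel)  # ceiling division
    return batches * minutes_per_node
```

With 32 nodes, one node at a time, and a hypothetical 12 minutes per node, the estimate is 384 minutes; raising maxParallelUpgrades to 4 cuts that to 96 at the price of losing four nodes of capacity at once.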
For debugging, these are the four commands you run in order when RDMA resources do not appear in kubectl describe node:
- kubectl get pods -n nvidia-network-operator -o wide: confirms all DaemonSet pods are Running.
- kubectl logs -n nvidia-network-operator -l app=mofed-ubuntu22.04-ds -c mofed-container --tail=50: MOFED compile log; look for kernel header errors.
- kubectl logs -n nvidia-network-operator -l app=rdma-shared-dp-ds --tail=30: device plugin registration log; confirms resource name is published.
- kubectl get node <nodename> -o json | jq '.status.allocatable': confirms the RDMA resource appears and has a non-zero quantity.
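The first of those checks is easy to automate when you have many nodes. As a sketch, the function below takes the decoded output of kubectl get pods -n nvidia-network-operator -o json and lists every pod that is not Running with all containers ready. The function name is hypothetical, and only standard Pod status fields are read.

```python
def not_ready_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, phase) for pods that are not fully up."""
    blocked = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        containers = status.get("containerStatuses", [])
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if phase != "Running" or not all_ready:
            blocked.append((name, phase))
    return blocked
```

If the list contains a MOFED driver pod, start with its logs; the device plugin pods typically cannot make progress until the driver is loaded.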
Worked Example
A 32-node NDR InfiniBand cluster running B200 GPUs with ConnectX-7 in SR-IOV mode. Each node has 8 GPUs and 2 NIC ports. Target: 8 SR-IOV VFs per port, so 16 VFs per node for up to 16 training pod slots. Firmware configured with mstconfig -d <PCI_BDF> set SRIOV_EN=1 NUM_OF_VFS=16 [VERIFY mstconfig flag names]. NicClusterPolicy enables sriovDevicePlugin and secondaryNetwork with host-device CNI. Each training pod requests nvidia.com/cx7_sriov_rdma: 1 and annotates k8s.v1.cni.cncf.io/networks: sriov-rdma-net. Result after correct deploy: kubectl describe node shows nvidia.com/cx7_sriov_rdma: 16 allocatable per node. Eight 4-pod distributed training jobs can run in parallel across the cluster with full RDMA isolation between jobs.
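The capacity claim in this example can be checked with a few lines of arithmetic. The sketch below assumes that each training pod takes all 8 GPUs on its node, which the example does not state but which is what makes 32 nodes yield eight 4-pod jobs; change gpus_per_pod to model smaller pods. All names are hypothetical.

```python
def pod_slots_per_node(vfs_per_node: int, vfs_per_pod: int,
                       gpus_per_node: int, gpus_per_pod: int) -> int:
    """Pods that fit on one node, limited by whichever resource runs out first."""
    by_vf = vfs_per_node // vfs_per_pod
    by_gpu = gpus_per_node // gpus_per_pod
    return min(by_vf, by_gpu)


def max_parallel_jobs(nodes: int, slots_per_node: int, pods_per_job: int) -> int:
    """Whole jobs that can run at once across the cluster."""
    return (nodes * slots_per_node) // pods_per_job


# Worked example: 16 VFs and 8 GPUs per node, 1 VF per pod,
# and (assumed) 8 GPUs per pod, so GPUs are the limiting resource.
slots = pod_slots_per_node(vfs_per_node=16, vfs_per_pod=1, gpus_per_node=8, gpus_per_pod=8)
jobs = max_parallel_jobs(nodes=32, slots_per_node=slots, pods_per_job=4)
```

Under that assumption the VFs are over-provisioned relative to GPUs, which leaves headroom for pods that need RDMA without a full node of GPUs.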
The Verdict
Deploy the Network Operator when you have ConnectX NICs, InfiniBand or RoCE fabric, and multi-node GPU workloads. In that scenario it is not optional; managing MOFED and device plugins at scale without it is a maintenance trap that gets worse with every Kubernetes or kernel upgrade.
Use SR-IOV VF mode for production multi-tenant clusters where bandwidth isolation between jobs is a requirement. Use RDMA shared device plugin for research clusters with a single team and no need for strict guarantees. Do not mix the two modes on the same node without fully understanding which resource name maps to which path.
What to validate before going production: confirm MOFED version is pinned and matches the GPU Operator expectation; run ib_write_bw between two pods on different nodes and measure against your link speed; verify that a node drain plus MOFED upgrade completes cleanly without leaving orphaned VFs; and confirm that nvidia-peermem is loaded on every GPU node after the boot sequence.
The Network Operator does one thing well: it makes accelerated fabric on Kubernetes repeatable and version-controlled. The day-2 work is still there (switch config, firmware, BIOS), but at least the Kubernetes side stays in a declarative manifest. That is the right boundary for an operator to own.
For the VCF deployment angle on running this stack within VMware Cloud Foundation, see the Private AI Series where the NIC passthrough and SR-IOV considerations in a vSphere context are covered separately.
Questions about your specific cluster topology? Leave a comment below with your NIC model and workload type.
References
- NVIDIA Network Operator v26.1.1 Documentation
- GPUDirect RDMA and GPUDirect Storage – NVIDIA GPU Operator Docs
- NVIDIA Network Operator Deployment Guide with Kubernetes
- NVIDIA Network Operator GitHub (Mellanox/network-operator)
- Customization Options and CRDs – NVIDIA Network Operator



