Dr. Pranay Jha


NVIDIA Network Operator on Kubernetes: RDMA, SR-IOV, and the Accelerated Fabric (NVIDIA AI Series, Part 13)

The NVIDIA Network Operator provisions MOFED drivers, RDMA shared device plugin, SR-IOV VFs, and Multus secondary networks to Kubernetes pods. This is how GPUDirect RDMA actually works at scale on ConnectX-7 and NDR InfiniBand clusters.

NVIDIA AI Series · Part 13 of 30
TL;DR: The NVIDIA Network Operator is the Kubernetes controller that provisions the entire accelerated-fabric stack to pods: MOFED/DOCA drivers, RDMA shared device plugin, SR-IOV VFs, Multus secondary networks, and IB Kubernetes for GUID assignment. It pairs tightly with the GPU Operator so that GPUDirect RDMA works without manual kernel-module gymnastics. If your cluster has no RDMA-capable NICs or no distributed training/inference workloads, skip it entirely. If you do run H100/H200/B200 nodes with ConnectX-7 or NDR InfiniBand, the Network Operator is the only sane way to manage that fabric at scale. The hard part is not installing it; it is getting SR-IOV VF counts, MOFED versions, and secondary-network annotations right the first time.
Who this is for: Platform engineers and ML infrastructure architects who run or plan to run GPU clusters on Kubernetes with ConnectX NICs, InfiniBand, or RoCE fabrics. You should already be comfortable with Kubernetes operators, Helm, and basic networking concepts. This picks up directly from Part 12 (GPU Operator). If you have not deployed the GPU Operator first, do that before touching this.

You deploy the GPU Operator, pods can see the GPUs, and then you try to run a multi-node training job. NCCL hangs at the AllReduce barrier. The NICs are visible to the OS, ibstat shows the ports, but the pod never gets an RDMA device. The problem is that Kubernetes does not know how to hand an InfiniBand Virtual Function to a pod. That is precisely what the NVIDIA Network Operator solves, and it does it without you manually managing MOFED across 32 nodes.

What the Network Operator Actually Does

The Network Operator is an open-source Kubernetes operator (GitHub: Mellanox/network-operator) that manages the lifecycle of every component in the accelerated-fabric stack on a Kubernetes node. Its control-plane object is a single CRD called NicClusterPolicy. You write one manifest; the operator reconciles MOFED driver DaemonSets, device plugin DaemonSets, Multus CNI, SR-IOV operator integration, and IB Kubernetes across every matching node.

As of v26.1.1 (April 2026) [VERIFY exact April release date], the operator ships these sub-components, each independently toggled in the NicClusterPolicy spec:

  • ofedDriver (MOFED / DOCA-OFED): containerised kernel driver for ConnectX NICs, compiled and loaded per kernel version.
  • rdmaSharedDevicePlugin: exposes RDMA devices as Kubernetes extended resources (rdma/rdma_shared_dev_a) so pods can request them via resources.limits.
  • sriovDevicePlugin: works alongside the SR-IOV Network Operator to expose SR-IOV Virtual Functions as allocatable resources.
  • secondaryNetwork: deploys Multus CNI (the meta-plugin), container-networking plugins (macvlan, ipvlan, host-device), and IPAM (whereabouts or nv-ipam).
  • ibKubernetes: a daemon that reads pod network annotations and programs GUID assignments on InfiniBand subnet managers.
  • nvIpam: NVIDIA IP Address Manager, a CNI IPAM plugin with pool-based allocation across nodes.
[Diagram: NicClusterPolicy Component Stack. The Network Operator's NicClusterPolicy CRD fans out to ofedDriver (MOFED / DOCA kernel driver), rdmaSharedDevicePlugin (RDMA extended resources), sriovDevicePlugin (SR-IOV VF allocation), secondaryNetwork (Multus + CNI plugins), ibKubernetes (IB GUID assignment), nvIpam (IP pool per node), and whereabouts (cluster-wide IPAM). All of them sit on the NIC hardware (ConnectX-7 / NDR InfiniBand / Spectrum-X), whose RDMA / RoCE / IB substrate is exposed to pods via the device plugin.]
Figure 1: NicClusterPolicy drives six independent sub-component DaemonSets down to the NIC hardware layer.

The MOFED Container and Why It Matters

MOFED (Mellanox OpenFabrics Enterprise Distribution), now also shipped as DOCA-OFED, is the kernel-level driver suite for ConnectX NICs. It provides the InfiniBand verbs stack, RoCE acceleration, and the peer-memory interface that the nvidia-peermem module (the successor to nv_peer_mem) plugs into. That pairing is what enables GPUDirect RDMA: the NIC can DMA directly into GPU memory without staging the data through host memory.

Installing MOFED on bare metal used to mean: download a tarball, run an installer, compile against your kernel headers, pray the kernel did not update. The Network Operator replaces that with a containerised driver DaemonSet. The ofedDriver container compiles the driver against the host kernel at startup, loads it into the host kernel from a privileged container, and registers a node label (network.operator.nvidia.com/mofed-driver-upgraded) [VERIFY label name] when complete. The rest of the DaemonSets wait on that label before starting. If the kernel updates, the operator recompiles the driver on the next node restart.

Practical gotcha: MOFED version pinning is non-negotiable. The ofedDriver.version field in NicClusterPolicy must be one that is compatible with the GPU driver the GPU Operator deploys. When the two diverge, nvidia-peermem fails to load, GPUDirect RDMA silently falls back to staging transfers through host memory, and your NCCL bandwidth drops from the 400 Gb/s NDR line rate to whatever the CPU-mediated copy path can sustain. The symptom looks exactly like a slow network rather than a driver mismatch.

Gotcha: If you see NCCL WARN Connect to 0.0.0.0 failed or AllReduce stalls on multi-node jobs immediately after a node OS update, check lsmod | grep nvidia_peermem on each node before blaming the network. A kernel update that invalidated the compiled MOFED module is the most common cause. The operator should recompile automatically on pod restart; if it does not, check the ofedDriver DaemonSet pod logs for compile errors related to missing kernel headers.
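
The lsmod check above is easy to script when it has to be repeated on every node. A minimal sketch, assuming the lsmod output has already been captured as text (over SSH or from a debug pod); the function name and the default module list are illustrative choices, not part of any NVIDIA tooling.

```python
def missing_rdma_modules(lsmod_output, required=("nvidia_peermem", "mlx5_core", "ib_core")):
    """Return the required kernel modules that do not appear in `lsmod` output."""
    loaded = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if fields and fields[0] != "Module":
            loaded.add(fields[0])
    return [name for name in required if name not in loaded]


# Hypothetical capture from a node where peermem failed to load.
sample = """Module                  Size  Used by
mlx5_ib               466944  0
ib_core               442368  1 mlx5_ib
mlx5_core            2191360  1 mlx5_ib
nvidia              56717312  0
"""
print(missing_rdma_modules(sample))
```

An empty list means every module in the list is present; anything else names the module to chase in the driver pod logs.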

RDMA Shared Device Plugin vs SR-IOV: Which Mode to Use

The Network Operator supports two distinct modes for exposing RDMA to pods, and choosing the wrong one wastes either performance or flexibility.

RDMA Shared Device Plugin

The shared device plugin exposes the physical NIC RDMA device as a countable Kubernetes resource (e.g., rdma/rdma_shared_dev_a). Multiple pods on the same node can share the underlying physical device. This is the simpler path: no SR-IOV hardware partitioning, no VF count configuration, no firmware settings. It works when your pods each need RDMA access but do not need guaranteed, isolated NIC bandwidth. A typical use case is a small research cluster where 2-3 distributed training pods share a ConnectX-7 port without strict bandwidth guarantees.

SR-IOV VF Mode

SR-IOV creates hardware-level Virtual Functions on the NIC. Each VF is a dedicated slice of the physical NIC with isolated queues, bandwidth, and RDMA context. The pod gets a VF as if it owned a physical NIC. This is the right choice for production multi-tenant clusters where workloads must not interfere with each other, or where you need deterministic latency on InfiniBand. The constraint: the physical NIC must support SR-IOV, the BIOS must have IOMMU enabled, and the number of VFs is set in firmware (common range: 2-127 per port; ConnectX-7 supports up to 127 [VERIFY exact max per port]). You burn VF count at deployment time. Running an 8-GPU node with 8 training pods each needing 2 VFs means you need 16 VFs configured before the Network Operator runs.
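
The VF arithmetic above is worth checking before touching firmware. A small sketch, assuming VFs are spread evenly across ports; the function and parameter names are made up for illustration.

```python
def vf_budget(pods_per_node, vfs_per_pod, ports, vfs_per_port):
    """Compare the VFs that pods will request with the VFs configured on one node."""
    needed = pods_per_node * vfs_per_pod
    available = ports * vfs_per_port
    return {
        "needed": needed,
        "available": available,
        "shortfall": max(0, needed - available),
    }


# 8 pods x 2 VFs each, against 2 ports with 8 VFs configured per port.
print(vf_budget(pods_per_node=8, vfs_per_pod=2, ports=2, vfs_per_port=8))
```

A non-zero shortfall means some pods will stay Pending until the firmware VF count is raised.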

[Diagram: SR-IOV VF Allocation. The physical NIC is partitioned into VFs and each pod gets an isolated slice. The ConnectX-7 Physical Function (PF, 400 Gb/s NDR InfiniBand) exposes VF 0 to Pod A (training), VF 1 to Pod B (training), VF 2 to Pod C (inference), and VF 3..N up to a maximum of 127 VFs per port. The SR-IOV Network Device Plugin advertises the VFs as Kubernetes allocatable resources.]
Figure 2: SR-IOV partitions the physical NIC into isolated VFs. Each pod gets one VF as its dedicated RDMA device.

| Dimension | RDMA Shared Device Plugin | SR-IOV VF Mode |
| --- | --- | --- |
| Hardware requirement | ConnectX NIC with RDMA support | ConnectX NIC + IOMMU + SR-IOV in firmware |
| Isolation | Shared; pods compete for NIC queues | Hardware-isolated per VF |
| Setup complexity | Low; one resource config in NicClusterPolicy | High; firmware VF count + SriovNetworkNodePolicy CRD |
| Bandwidth guarantee | None | Yes, per VF |
| Multi-tenant suitability | Research / single team | Production multi-tenant |
| GPUDirect RDMA | Works (shared path) | Works (preferred for deterministic perf) |

Multus and Secondary Networks: How Pods Get an RDMA Interface

Kubernetes has one primary network interface per pod, managed by the cluster CNI (Calico, Cilium, Flannel, etc.). RDMA workloads need a second interface wired directly to the accelerated fabric. That is where Multus comes in.

Multus is a CNI meta-plugin: it calls your primary CNI as normal, then reads a pod annotation to attach additional network interfaces using any secondary CNI. The Network Operator installs Multus and the secondary CNI plugins as part of the secondaryNetwork spec in NicClusterPolicy. The secondary interface types supported are:

  • Host Device Network: passes the entire physical NIC (or a VF) into the pod network namespace. Used for both SR-IOV VF attachment and RDMA shared mode.
  • MacVLAN Network: layer-2 sub-interface sharing the parent NIC MAC space. Typically used for RoCE in Ethernet fabrics.
  • IPoIB Network: IP over InfiniBand. Required for InfiniBand fabrics where you need TCP/IP alongside RDMA verbs.

The pod declares the secondary interface via a k8s.v1.cni.cncf.io/networks annotation referencing a NetworkAttachmentDefinition object. That NAD is created once per cluster and describes which CNI plugin to call and with what parameters. The RDMA resource request goes in resources.limits separately.
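
To make the annotation format concrete, here is a minimal sketch of how the short form of k8s.v1.cni.cncf.io/networks breaks down into NAD references: a comma-separated list of [namespace/]name[@interface]. The parser is illustrative only; it does not handle the JSON-list form of the annotation, and a name without a namespace is resolved against the pod's own namespace.

```python
def parse_networks_annotation(value, pod_namespace="default"):
    """Parse the short form of the networks annotation into NAD references."""
    attachments = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Optional "@ifname" suffix requests a specific interface name in the pod.
        reference, _, ifname = item.partition("@")
        namespace, slash, name = reference.rpartition("/")
        if not slash:
            # No explicit namespace: the NAD is looked up in the pod's namespace.
            namespace, name = pod_namespace, reference
        attachments.append({"namespace": namespace, "name": name, "interface": ifname or None})
    return attachments


print(parse_networks_annotation("rdma-net", pod_namespace="gpu-workloads"))
print(parse_networks_annotation("gpu-workloads/rdma-net@net1"))
```

The namespace fallback is the detail that bites in practice: a pod in a different namespace from the NAD has to use the namespace/name form.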

[Diagram: Pod-to-RDMA-NIC Data Path. The training pod has two interfaces: eth0 on the primary CNI (Calico / Cilium / Flannel, cluster network) and net1, attached by the Multus meta-plugin to a ConnectX-7 SR-IOV VF or RDMA device. net1 connects to the NDR InfiniBand / RoCE fabric at 400 Gb/s per link, and GPU memory is reached via GPUDirect.]
Figure 3: Multus attaches two network interfaces to the pod. The primary interface (eth0) goes through the normal cluster CNI. The secondary interface (net1) bypasses the primary CNI and lands directly on the RDMA NIC VF.

GPUDirect RDMA: Network Operator and GPU Operator Together

GPUDirect RDMA is the technology that lets a ConnectX NIC DMA data directly into or out of GPU HBM without involving the CPU. At the kernel level, this requires the nvidia-peermem module, which acts as the bridge between the NVIDIA GPU driver and the MOFED RDMA stack. Both drivers must be loaded and must agree on the ABI version.

The GPU Operator manages the GPU driver side. When you install it with --set driver.rdma.enabled=true --set driver.useOpenKernelModules=true, it will install and load nvidia-peermem after the MOFED module is present. The Network Operator manages MOFED. These two operators need to talk to each other only implicitly: the GPU Operator waits for a node label that confirms MOFED is loaded before starting the peermem container in its driver pod. That label is set by the ofedDriver DaemonSet.

The deployment order in practice: install the Network Operator first so MOFED loads, then install (or restart) the GPU Operator. If you reverse this, the GPU Operator driver pod starts, finds no MOFED, skips peermem, and finishes. By the time MOFED loads later, peermem is absent. A driver pod restart fixes it, but it is an easy trap when you are doing a fresh cluster bring-up under time pressure.

In Practice

On a 16-node H100 cluster with ConnectX-7 and NDR InfiniBand, after a correct install sequence, kubectl describe node on a GPU+NIC node should show both nvidia.com/gpu: 8 and a non-zero rdma/rdma_shared_dev_a count under Allocatable (the shared device plugin advertises rdmaHcaMax units per node, so 63 with the policy shown in the next section), or an SR-IOV VF resource like nvidia.com/cx7_sriov_rdma: 8. If only GPUs appear, the Network Operator DaemonSets have not completed. Check the ofedDriver pod first; it is almost always the blocker.
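
The Allocatable check can be automated against the JSON from kubectl get node <nodename> -o json. A minimal sketch, assuming the node object has already been loaded into a dict; the helper name and the sample quantities are illustrative.

```python
def allocatable_counts(node, required=("nvidia.com/gpu", "rdma/rdma_shared_dev_a")):
    """Return {resource: quantity} for the required resources, 0 when absent."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Extended resources are reported as integer strings.
    return {name: int(allocatable.get(name, "0")) for name in required}


# Hypothetical node where only the GPU Operator has finished reconciling.
gpu_only_node = {"status": {"allocatable": {"cpu": "96", "nvidia.com/gpu": "8"}}}

counts = allocatable_counts(gpu_only_node)
missing = [name for name, quantity in counts.items() if quantity == 0]
print(counts, missing)
```

For SR-IOV mode, pass the VF resource name in required instead of the shared-device one.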

Operational Artifact: NicClusterPolicy and Pod Annotation

A minimal but realistic NicClusterPolicy for an InfiniBand cluster using RDMA shared device plugin, with DOCA-OFED driver pinned to v25.7 [VERIFY exact shipping version string for 2026], looks like this:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "25.07-0.6.1.0-0"   # [VERIFY exact tag]
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      safeLoad: true
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1   # [VERIFY]
    config: |
      {
        "configList": [{
          "resourceName": "rdma_shared_dev_a",
          "rdmaHcaMax": 63,
          "selectors": {
            "ifNames": ["ens1f0"]
          }
        }]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.5.0   # [VERIFY]
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.0.2   # [VERIFY]
    ipamPlugin:
      image: whereabouts
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.7.0   # [VERIFY]
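
Because the device plugin config is a JSON string embedded in YAML, a typo in it only surfaces at runtime. The sketch below is one way to sanity-check it before applying the policy, assuming you pass in the config text and the interface names reported by ip link on a target node. The function name and the checks are illustrative; the real plugin supports more selector types than ifNames.

```python
import json


def check_shared_dp_config(config_text, node_interfaces):
    """Return human-readable problems found in an RDMA shared device plugin config."""
    problems = []
    config = json.loads(config_text)  # raises ValueError on malformed JSON
    for entry in config.get("configList", []):
        name = entry.get("resourceName")
        if not name:
            problems.append("entry without resourceName")
            continue
        wanted = entry.get("selectors", {}).get("ifNames", [])
        # The resource stays at zero if no selected interface exists on the node.
        if wanted and not set(wanted) & set(node_interfaces):
            problems.append(f"{name}: none of {wanted} exist on this node")
    return problems


policy_config = """
{
  "configList": [{
    "resourceName": "rdma_shared_dev_a",
    "rdmaHcaMax": 63,
    "selectors": {"ifNames": ["ens1f0"]}
  }]
}
"""
# Hypothetical node where the port is named differently from the policy.
print(check_shared_dp_config(policy_config, ["lo", "eth0", "ens1f0np0"]))
```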

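A NetworkAttachmentDefinition for the RDMA secondary network follows. Note that the host-device CNI moves the named interface into the pod's network namespace, so only one pod per node can hold it at a time; when several pods must share one port, the usual pairing with the shared device plugin is a macvlan (Ethernet) or ipoib (InfiniBand) attachment instead.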

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  namespace: gpu-workloads
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_dev_a
spec:
  # Block scalar keeps the embedded JSON valid regardless of indentation.
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "ens1f0"
    }

The training pod requests the RDMA resource and attaches the secondary network:

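apiVersion: v1
kind: Pod
metadata:
  name: nccl-training-pod
  namespace: gpu-workloads   # must match the namespace of the rdma-net NAD
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/nemo:24.12   # [VERIFY latest NeMo tag]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]   # RDMA memory registration needs to pin memory
    resources:
      limits:
        nvidia.com/gpu: "8"
        rdma/rdma_shared_dev_a: "1"
    env:
    - name: NCCL_IB_HCA       # RDMA device NCCL should use
      value: "mlx5_0"
    - name: NCCL_DEBUG        # INFO logs show which transport NCCL selected
      value: "INFO"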

Expected result: pod starts with two network interfaces (eth0 cluster network, net1 RDMA network); kubectl exec -- ibstat inside the pod shows the ConnectX port in Active state; NCCL all-reduce bus bandwidth approaches the physical link speed.
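
To judge whether bus bandwidth "approaches the physical link speed", it helps to put both numbers in the same unit. The sketch below uses the all-reduce convention from the nccl-tests documentation (bus bandwidth = algorithm bandwidth x 2(n-1)/n). The message size, timing, and rank count are made-up inputs, and comparing against a single link only makes sense when each rank drives one NIC.

```python
def allreduce_bus_bandwidth_gbytes(message_bytes, seconds, ranks):
    """Bus bandwidth in GB/s, using the nccl-tests all-reduce convention."""
    algorithm_bw = message_bytes / seconds / 1e9
    return algorithm_bw * 2 * (ranks - 1) / ranks


def link_utilisation(bus_bw_gbytes, link_gbits):
    """Fraction of one link's line rate, converting Gb/s to GB/s."""
    return bus_bw_gbytes / (link_gbits / 8)


# Hypothetical measurement: 4 GB all-reduce across 16 ranks in 0.16 s on 400 Gb/s links.
bus_bw = allreduce_bus_bandwidth_gbytes(message_bytes=4e9, seconds=0.16, ranks=16)
print(f"bus bandwidth {bus_bw:.1f} GB/s, utilisation {link_utilisation(bus_bw, 400):.0%}")
```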

Common failure mode 1 (missing RDMA resource): pod scheduling fails with Insufficient rdma/rdma_shared_dev_a. Cause: either the ifNames selector in NicClusterPolicy does not match the actual interface name on the node (run ip link to verify), or the device plugin DaemonSet has not reconciled yet. Check kubectl get pods -n nvidia-network-operator for DaemonSets still in Init state.

Common failure mode 2 (MOFED mismatch): pod starts, NCCL reports WARN No interface found in cudaDevices. Check kubectl logs -n nvidia-network-operator ds/mofed-ubuntu22.04-ds -c mofed-container for version string. Cross-reference against what ofed_info reports on the host after the module loads. If they diverge, update ofedDriver.version in NicClusterPolicy and delete the stale DaemonSet pods to force a reconcile.

InfiniBand vs RoCE: What Changes in the Operator Config

The Network Operator handles both InfiniBand and RoCE, but the configuration paths diverge in two places.

For InfiniBand, you need ibKubernetes enabled in NicClusterPolicy. This daemon watches pod annotations and programs GUID assignments on the InfiniBand subnet manager (MLNX-OS or UFM) so that each pod port gets a unique GUID for routing. Without it, multiple pods on the same node share a GUID, causing routing ambiguity and dropped traffic. For IPoIB (IP over InfiniBand), the secondaryNetwork must include the ipoib CNI type in the NAD config. NCCL uses RDMA verbs directly and does not need IPoIB, but some MPI runtimes fall back to TCP and require it.

For RoCE (RDMA over Converged Ethernet), the NIC must be in Ethernet mode and PFC (Priority Flow Control) plus ECN (Explicit Congestion Notification) must be configured on the switch fabric. The Network Operator does not configure the switch; that is still a day-0 network-team task. On the node side, you use the macvlan or host-device secondary CNI type rather than ipoib. NCCL still selects the adapter through NCCL_IB_HCA, because RoCE uses the same verbs path; NCCL_SOCKET_IFNAME only names the interface for bootstrap and TCP fallback traffic, and RoCE v2 deployments typically also set NCCL_IB_GID_INDEX to pick the right GID. RoCE is the right choice for Spectrum-X Ethernet fabrics from Part 8 of this series.

| Config Element | InfiniBand | RoCE / Ethernet |
| --- | --- | --- |
| ibKubernetes | Required (GUID management) | Not used |
| Secondary CNI type | ipoib or host-device | macvlan or host-device |
| Switch requirement | IB subnet manager (UFM / MLNX-OS) | PFC + ECN on Ethernet switches |
| NCCL env hint | NCCL_IB_HCA=mlx5_0 | NCCL_IB_HCA=mlx5_0, plus NCCL_SOCKET_IFNAME=net1 for bootstrap |
| Latency profile | Lower baseline (sub-1 microsecond) [VERIFY] | Slightly higher; switch-dependent |
| NVL72 / DGX platform default | NDR IB (400 Gb/s per port) | Spectrum-X (optional) |

[Diagram: Multus Secondary Network Attach Flow, from pod annotation to the net1 interface in the pod namespace. The pod annotation (k8s.v1.cni.cncf.io/networks: rdma-net) is read by Multus CNI, which looks up the rdma-net NAD object in the namespace; the host-device CNI then moves the VF into the pod netns, and net1 appears in the pod as an RDMA-capable interface ready for NCCL IB verbs. Alongside, IPAM (whereabouts) assigns an IP to net1 across all nodes, and the SR-IOV device plugin allocates the RDMA or VF resource requested in resources.limits.]
Figure 4: The Multus secondary network attach flow. The pod annotation triggers Multus, which calls the secondary CNI; IPAM and the device plugin operate in parallel during pod admission.

When NOT to Deploy the Network Operator

The Network Operator adds real operational weight. Before you install it, answer these four questions honestly:

  • Do your nodes have ConnectX NICs? If your GPU nodes use commodity Ethernet (Intel, Broadcom) with no RDMA support, the Network Operator cannot provision anything useful. MOFED will refuse to load.
  • Do you run distributed training or inference across nodes? Single-node GPU workloads (even 8xH100) do not need RDMA; NVLink handles intra-node communication. The Network Operator is a multi-node tool.
  • Is your Kubernetes version compatible? Kubernetes 1.28 or later is required for current Network Operator versions [VERIFY minimum K8s version for v26.x].
  • Do you have SR-IOV in your BIOS and IOMMU enabled? Without this, you can use shared RDMA mode but not SR-IOV VFs, which limits multi-tenant isolation significantly.

If you answered no to either of the first two, skip it. A cluster running single-node fine-tuning jobs on H100s with standard CNI is simpler, faster to debug, and cheaper to operate. Add the Network Operator when the workloads demand it.
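
The four questions reduce to a short decision function. This is only a sketch of the reasoning above, with made-up parameter names and return strings; it is not a substitute for checking the support matrix of your target release.

```python
def recommend(has_connectx, multi_node_jobs, k8s_supported, sriov_and_iommu):
    """Map the four pre-install answers to a coarse recommendation."""
    if not (has_connectx and multi_node_jobs):
        # Nothing for the operator to provision, or no workload that needs it.
        return "skip the Network Operator"
    if not k8s_supported:
        return "upgrade Kubernetes first"
    if sriov_and_iommu:
        return "deploy: SR-IOV or shared RDMA mode"
    return "deploy: shared RDMA mode only"


print(recommend(has_connectx=True, multi_node_jobs=True, k8s_supported=True, sriov_and_iommu=False))
```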

My Take: In conversations with ML platform teams running training clusters in production, the Network Operator earns its complexity overhead when you hit two or more nodes per job and want InfiniBand running cleanly under Kubernetes. The alternative is manual MOFED installs, custom device plugin scripts, and a bespoke DaemonSet for every kernel update. That path decays fast. Where I have seen teams skip the Network Operator and regret it: when they scaled from 4 nodes to 32 nodes mid-quarter and had to retroactively wire RDMA into a cluster that was never designed for it. The operator is much easier to deploy on a fresh cluster than to bolt onto an existing one.

Day-2 Operations: Upgrades, Debugging, and Node Drain

The Network Operator includes a built-in upgrade controller for MOFED via the upgradePolicy block in NicClusterPolicy. With autoUpgrade: true and maxParallelUpgrades: 1, the operator drains one node at a time, unloads MOFED, loads the new version, and uncordons the node so it can take workloads again. This is the same rolling-drain pattern used for node OS updates.
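
To get a feel for what maxParallelUpgrades means for a maintenance window, the rollout can be modelled as successive waves of nodes. This is a simplified sketch: the real controller tracks per-node state rather than fixed waves, and the per-node duration is an assumption you would replace with your own measurement.

```python
def upgrade_waves(nodes, max_parallel=1):
    """Split nodes into successive waves of at most max_parallel nodes."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [nodes[i:i + max_parallel] for i in range(0, len(nodes), max_parallel)]


def estimated_minutes(node_count, max_parallel, minutes_per_node):
    """Rough wall-clock estimate: one per-node duration per wave."""
    waves = -(-node_count // max_parallel)  # ceiling division
    return waves * minutes_per_node


nodes = [f"gpu-node-{i:02d}" for i in range(32)]  # hypothetical node names
print(len(upgrade_waves(nodes, max_parallel=1)), estimated_minutes(32, 1, 15))
print(len(upgrade_waves(nodes, max_parallel=4)), estimated_minutes(32, 4, 15))
```

Raising the parallelism shortens the window but takes more GPU capacity offline at once, which is the trade-off to weigh against running jobs.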

One subtlety: safeLoad: true in the upgradePolicy tells the operator to cordon a node (and drain it, if drain is enabled) before the driver container loads the MOFED modules, then release it once the driver is up. This matters most after a node reboot, when workload pods could otherwise start before the driver has loaded. Without safeLoad, a driver reload can happen underneath pods that are actively using the NIC.

For debugging, these are the four commands you run in order when RDMA resources do not appear in kubectl describe node:

  1. kubectl get pods -n nvidia-network-operator -o wide: confirms all DaemonSet pods are Running.
  2. kubectl logs -n nvidia-network-operator -l app=mofed-ubuntu22.04-ds -c mofed-container --tail=50: MOFED compile log; look for kernel header errors.
  3. kubectl logs -n nvidia-network-operator -l app=rdma-shared-dp-ds --tail=30: device plugin registration log; confirms resource name is published.
  4. kubectl get node <nodename> -o json | jq '.status.allocatable': confirms the RDMA resource appears and has a non-zero quantity.
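
Step 1 can be made less error-prone on large clusters by filtering the JSON form of the same command (kubectl get pods -n nvidia-network-operator -o json). A minimal sketch, assuming the output has been loaded into a dict; the helper name and sample pod names are illustrative.

```python
def pods_not_ready(pod_list):
    """Return (name, phase) for pods that are not Running with all containers ready."""
    problems = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if phase != "Running" or not all_ready:
            problems.append((pod["metadata"]["name"], phase))
    return problems


# Hypothetical pod list: one healthy device plugin pod, one driver pod still initialising.
sample_pods = {"items": [
    {"metadata": {"name": "rdma-shared-dp-ds-abcde"},
     "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
    {"metadata": {"name": "mofed-ubuntu22.04-ds-fghij"},
     "status": {"phase": "Pending", "containerStatuses": []}},
]}
print(pods_not_ready(sample_pods))
```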

Worked Example

A 32-node NDR InfiniBand cluster running B200 GPUs with ConnectX-7 in SR-IOV mode. Each node has 8 GPUs and 2 NIC ports. Target: 8 SR-IOV VFs per port, so 16 VFs per node for up to 16 training pod slots. Firmware is configured on each port with mstconfig -d <PCI_BDF> set SRIOV_EN=1 NUM_OF_VFS=8 [VERIFY mstconfig flag names]; NUM_OF_VFS applies per physical function, so 8 on each of the two ports yields the 16 per node. NicClusterPolicy enables sriovDevicePlugin and secondaryNetwork with host-device CNI. Each training pod requests nvidia.com/cx7_sriov_rdma: 1 and annotates k8s.v1.cni.cncf.io/networks: sriov-rdma-net. Result after correct deploy: kubectl describe node shows nvidia.com/cx7_sriov_rdma: 16 allocatable per node. Eight 4-pod distributed training jobs (one 8-GPU pod per node) can run in parallel across the cluster with full RDMA isolation between jobs.
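
The capacity claim in this example can be reproduced with a few lines of arithmetic. The sketch assumes each training pod takes all 8 GPUs on its node, which is an assumption added here to make the numbers line up; the function and parameter names are illustrative.

```python
def concurrent_jobs(nodes, gpus_per_node, vfs_per_node, pods_per_job, gpus_per_pod, vfs_per_pod):
    """Number of jobs that fit at once, limited by GPUs or VFs, whichever runs out first."""
    pods_per_node = min(gpus_per_node // gpus_per_pod, vfs_per_node // vfs_per_pod)
    return (nodes * pods_per_node) // pods_per_job


# 32 nodes, 8 GPUs and 16 VFs per node, 4-pod jobs, each pod using 8 GPUs and 1 VF.
print(concurrent_jobs(nodes=32, gpus_per_node=8, vfs_per_node=16,
                      pods_per_job=4, gpus_per_pod=8, vfs_per_pod=1))
```

With whole-node pods the GPUs are the binding constraint and most VFs stay unused; the VF budget only starts to matter once pods request fewer GPUs each.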

The Verdict

Deploy the Network Operator when you have ConnectX NICs, InfiniBand or RoCE fabric, and multi-node GPU workloads. In that scenario it is not optional; managing MOFED and device plugins at scale without it is a maintenance trap that gets worse with every Kubernetes or kernel upgrade.

Use SR-IOV VF mode for production multi-tenant clusters where bandwidth isolation between jobs is a requirement. Use RDMA shared device plugin for research clusters with a single team and no need for strict guarantees. Do not mix the two modes on the same node without fully understanding which resource name maps to which path.

What to validate before going production: confirm MOFED version is pinned and matches the GPU Operator expectation; run ib_write_bw between two pods on different nodes and measure against your link speed; verify that a node drain plus MOFED upgrade completes cleanly without leaving orphaned VFs; and confirm that nvidia-peermem is loaded on every GPU node after the boot sequence.

The Network Operator does one thing well: it makes accelerated fabric on Kubernetes repeatable and version-controlled. The day-0 work is still there (switch config, firmware, BIOS), but at least the Kubernetes side stays in a declarative manifest. That is the right boundary for an operator to own.

For the VCF deployment angle on running this stack within VMware Cloud Foundation, see the Private AI Series where the NIC passthrough and SR-IOV considerations in a vSphere context are covered separately.

Questions about your specific cluster topology? Leave a comment below with your NIC model and workload type.

Disclaimer: The manifests and configuration snippets in this article are illustrative examples based on publicly available NVIDIA Network Operator documentation. CRD field names, image tags, and version strings change between releases. Always consult the official NVIDIA Network Operator documentation for your target version before applying any configuration to a production cluster. Test in a non-production environment first. Items marked [VERIFY] should be confirmed against the current release notes before use.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
