Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


NVIDIA NIM Operator on VMware Private AI: The Reference Architecture for Declarative Model Serving (Private AI Series, Part 25)

The NIM Operator is the Kubernetes-native control plane for model serving on VMware Private AI. Here is how its CRDs, caching and autoscaling actually fit together, and the vGPU constraint that bites multi-GPU models.

VMware Private AI Series · Part 25 of 30

TL;DR · Key Takeaways

  • The NIM Operator is the declarative control plane for inference on Private AI. You describe model serving in YAML; the operator reconciles pods, caches, scaling and routing.
  • Four custom resources do the work: NIMCache, NIMService, NIMPipeline and NIMBuild.
  • Private AI Services 2.1 ships GPU Operator 25.10.1 and NIM Operator 3.1.0 by default. Pin those versions, do not float to latest.
  • Pre-warm the cache with NIMCache or your first autoscale event waits on a multi-GB model pull. This is the single biggest day-1 mistake.
  • The constraint that bites: on vGPU, a model needing more than one GPU requires A100 or H100 with NVLink. L40S with vGPU cannot serve multi-GPU models at all.

Who this is for: architects and platform engineers running model serving on VMware Private AI Foundation with NVIDIA on VKS clusters.

Prerequisites: a GPU-enabled VKS cluster with the NVIDIA GPU Operator installed, an NGC API key, and a default storage class that supports ReadWriteMany or ReadWriteOnce PVCs.

Most teams meet the NIM Operator the same way: they deployed a single NIM container by hand from the NIM microservices walkthrough, it worked, and then they tried to run six models with autoscaling and rolling updates and the hand-built approach fell apart. The operator exists for exactly that second moment. It turns model serving into Kubernetes objects you can version, review and reconcile, instead of a pile of helm install commands nobody remembers. This post is the reference architecture: what the pieces are, how they connect, how they size, and where the sharp edges are.

Where the NIM Operator sits in the Private AI stack

The operator is not infrastructure. It is an application-layer controller that sits on top of everything you built in the deployment phase. Physical GPUs live on the ESXi hosts. The GPU Operator makes those GPUs schedulable inside a VKS cluster by managing the vGPU guest driver, the container toolkit and the device plugin. Only then does the NIM Operator have something to schedule against. Get the layering wrong and you will spend an afternoon debugging a NIMService that cannot find a GPU when the real problem is two layers down.

The serving stack, top to bottom. Each layer depends on the one below it being healthy first.

  • NIM Operator 3.1.0: NIMCache, NIMService, NIMPipeline, NIMBuild reconciled into pods and routes
  • NVIDIA GPU Operator 25.10.1: vGPU guest driver, container toolkit, device plugin, DCGM exporter
  • VKS cluster (worker VMs): GPU-backed worker nodes provisioned from the AI Kubernetes catalog item
  • vSphere Supervisor & namespaces: tenant isolation and resource quotas per AI project
  • ESXi GPU hosts: physical L40S / H100 / H200 GPUs in vGPU or passthrough mode

The NIM Operator is the top layer. Everything under it must be green before a NIMService can schedule.
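
The "debug from the bottom up" rule can be sketched in a few lines. This is only an illustration: the layer names mirror the stack above, while the `observed` status map is hypothetical and would in practice come from whatever health checks you already run at each layer.

```python
# Layers listed bottom-up, mirroring the serving stack described above.
STACK_BOTTOM_UP = [
    "ESXi GPU hosts",
    "vSphere Supervisor & namespaces",
    "VKS cluster (worker VMs)",
    "NVIDIA GPU Operator",
    "NIM Operator",
]


def first_unhealthy_layer(health):
    """Return the lowest layer that is not healthy, or None if all are green.

    `health` maps layer name -> bool. A layer missing from the map is
    treated as unhealthy, since an unchecked layer cannot be assumed green.
    """
    for layer in STACK_BOTTOM_UP:
        if not health.get(layer, False):
            return layer
    return None


# Hypothetical status: the NIMService is failing, but so is the GPU Operator.
observed = {
    "ESXi GPU hosts": True,
    "vSphere Supervisor & namespaces": True,
    "VKS cluster (worker VMs)": True,
    "NVIDIA GPU Operator": False,
    "NIM Operator": False,
}
print(first_unhealthy_layer(observed))  # NVIDIA GPU Operator
```

The point of walking the list bottom-up is that the first red layer is the one to fix; anything above it is a symptom until proven otherwise.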

The four custom resources, and what each one owns

The whole operator is four CRDs. Learn what each one owns and the rest of the design follows. The important relationship is that caching and serving are separate objects on purpose: you populate a cache once, then point many services at it without re-downloading anything.

Figure: Cache once, serve many. NGC / HF Hub model profiles -> NIMCache (pulls profiles to a PVC, filtered by engine + TP) -> NIMService A (llama-3.1-8b) and NIMService B (embedding NIM); NIMPipeline groups the services.

Why the split matters: a cold NIMService with no cache downloads its model on first start. For a 70B model that is tens of minutes, and it repeats on every scale-out event. NIMCache pays that cost once, on a PVC, so a new replica mounts the existing profile and is serving in under a minute. Filter the cache by engine (vLLM, SGLang or TensorRT-LLM) and tensor parallelism so you do not pull profiles you will never run. NIMBuild handles the case where you compile a TRT-LLM engine yourself.

NIMCache and NIMService are deliberately separate objects so one cache backs many services.

| Custom resource | What it owns | Key spec fields | Use it when |
|---|---|---|---|
| NIMCache | Model artifacts on a PVC, downloaded once | source.ngc / source.hf, model.engine, model.tensorParallelism, storage.pvc | Always, for any model you scale or restart |
| NIMService | A running inference endpoint and its pods | image, storage, scale.hpa, expose.router, affinity | Every model you serve |
| NIMPipeline | A group of NIMServices managed as one unit | services[] (each an embedded NIMService spec) | RAG or agent stacks with several models |
| NIMBuild | A locally compiled TensorRT-LLM engine | source profile, target GPU, output cache | You need an engine tuned to your exact GPU |
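
To put a rough number on the cold-start cost that NIMCache avoids, here is a back-of-envelope sketch. The inputs are assumptions, not measurements: 2 bytes per parameter (FP16/BF16 weights) and a sustained 1 Gbit/s pull rate. Real NIM profiles can be smaller (quantized) or larger (engine files), so treat the output as an order-of-magnitude check only.

```python
def weights_size_gb(params_billion, bytes_per_param=2.0):
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    2 bytes per parameter corresponds to FP16/BF16 weights (an assumption).
    """
    return params_billion * bytes_per_param


def download_minutes(size_gb, link_gbit_per_s):
    """Minutes needed to pull `size_gb` at a sustained `link_gbit_per_s`."""
    seconds = (size_gb * 8) / link_gbit_per_s
    return seconds / 60


size = weights_size_gb(70)              # 70B parameters at 2 bytes each
per_pull = download_minutes(size, 1.0)  # assumed 1 Gbit/s sustained
print(f"{size:.0f} GB, {per_pull:.1f} min per cold pull at 1 Gbit/s")
# 140 GB, 18.7 min per cold pull at 1 Gbit/s
```

Without a cache, every new replica pays that pull again. With a NIMCache on a PVC it is paid once.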

The serving half of a minimal cache-then-serve pair looks like this, preceded by the operator install. Note that the service references the cache by name, and that with autoscaling enabled you do not set spec.replicas at all: the HPA owns it.

# Install the operator (pin the version, do not float to latest)
helm install nim-operator nvidia/k8s-nim-operator \
  --version 3.1.0 -n nim-operator --create-namespace

# NIMService with the cache referenced and HPA owning replicas
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-8b
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.8.0"
  storage:
    nimCache:
      name: llama-3-1-8b-cache   # mount the warm cache, no re-download
  scale:
    enabled: true                # HPA on; do NOT also set spec.replicas
    hpa:
      minReplicas: 1
      maxReplicas: 4
  expose:
    router:
      gateway:                   # Gateway API routing, new default in 3.1
        name: inference-gw
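
The two rules called out above (reference a cache, and never set spec.replicas alongside the HPA) are easy to enforce in review with a small lint step. The sketch below is hypothetical tooling, not part of the operator. It works on a manifest already loaded into a Python dict and only uses the field names shown in the manifest above.

```python
def lint_nimservice(manifest):
    """Return a list of human-readable problems found in a NIMService dict."""
    problems = []
    spec = manifest.get("spec", {})
    scale = spec.get("scale", {})

    # Rule 1: with the HPA enabled, spec.replicas must not be set as well.
    if scale.get("enabled") and "replicas" in spec:
        problems.append("scale.enabled is true but spec.replicas is also set")

    # Rule 2: the service should mount a warm NIMCache by name.
    cache_name = spec.get("storage", {}).get("nimCache", {}).get("name")
    if not cache_name:
        problems.append("storage.nimCache.name is missing, model will be pulled on start")

    # Sanity check on the HPA bounds.
    hpa = scale.get("hpa", {})
    low, high = hpa.get("minReplicas"), hpa.get("maxReplicas")
    if low is not None and high is not None and low > high:
        problems.append("hpa.minReplicas is greater than hpa.maxReplicas")

    return problems


good = {
    "kind": "NIMService",
    "spec": {
        "storage": {"nimCache": {"name": "llama-3-1-8b-cache"}},
        "scale": {"enabled": True, "hpa": {"minReplicas": 1, "maxReplicas": 4}},
    },
}
bad = {
    "kind": "NIMService",
    "spec": {
        "replicas": 2,  # conflicts with the HPA
        "storage": {},  # no cache referenced
        "scale": {"enabled": True, "hpa": {"minReplicas": 1, "maxReplicas": 4}},
    },
}

print(lint_nimservice(good))  # []
for problem in lint_nimservice(bad):
    print("-", problem)
```

Run something like this in CI against the manifests in your repo and the two most common mistakes never reach the cluster.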

The request path and how autoscaling actually triggers

Autoscaling is where people expect magic and get confused. The operator wires a standard Kubernetes HPA to your NIMService, but the metric it scales on matters. CPU utilization is useless for inference: a GPU pod can be saturated at 12 percent CPU. You want to scale on GPU utilization from DCGM, or better, on NIM-specific request metrics like queue depth and time-to-first-token. Version 3.1 also moved routing to the Kubernetes Gateway API by default, with the old ingress field deprecated. If you templated against spec.expose.ingress, migrate it to spec.expose.router.ingress before you upgrade.
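
If you have many templated manifests, the spec.expose.ingress move can be scripted. The sketch below is a hypothetical helper that relocates the block on a manifest loaded as a Python dict. It assumes the inner fields carry over unchanged, which you should confirm against the CRD for your operator version. The `enabled` key in the example is a placeholder, not a documented field.

```python
import copy


def migrate_expose_ingress(manifest):
    """Return a copy with spec.expose.ingress moved to spec.expose.router.ingress.

    Manifests without the old field come back as an unchanged copy.
    Raises ValueError if both locations are set, because merging them
    silently could hide a conflict.
    """
    migrated = copy.deepcopy(manifest)
    expose = migrated.get("spec", {}).get("expose", {})
    if "ingress" not in expose:
        return migrated
    router = expose.setdefault("router", {})
    if "ingress" in router:
        raise ValueError("both spec.expose.ingress and spec.expose.router.ingress are set")
    router["ingress"] = expose.pop("ingress")
    return migrated


# Placeholder ingress content; real fields depend on your operator version.
old = {"spec": {"expose": {"ingress": {"enabled": True}}}}
new = migrate_expose_ingress(old)
print(new)  # {'spec': {'expose': {'router': {'ingress': {'enabled': True}}}}}
```

The function works on a deep copy so the original stays available for a diff before you commit the change.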

Figure: Request path and the scaling loop. Client -> Gateway API (route + TLS) -> NIM pod 1 / NIM pod 2. DCGM + NIM metrics feed the HPA (min 1 / max 4). Scale on GPU utilization or queue depth, never on CPU.
Metrics from DCGM and the NIM itself drive the HPA. CPU-based scaling will never fire correctly for inference.
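
It helps to see why the metric choice matters. The standard Kubernetes HPA computes desired replicas as ceil(currentReplicas × currentMetric / targetMetric), skips changes when the ratio is within a tolerance (0.1 by default), and clamps to min/max. The sketch below reproduces that arithmetic with the min 1 / max 4 bounds used earlier. The 60 percent target is an assumed value for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Replica count the HPA formula would ask for, clamped to the bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, no change
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Two replicas, GPU utilization at 90 percent against a 60 percent target.
print(desired_replicas(2, 90, 60))   # 3

# The same saturated pods seen through CPU: 12 percent against the same target.
print(desired_replicas(2, 12, 60))   # 1

# A large spike is capped at maxReplicas.
print(desired_replicas(4, 150, 60))  # 4
```

The middle case is the trap: on CPU the HPA would scale a saturated service in, not out.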

Design decisions that actually bite

Four choices separate a deployment that survives contact with production from one that pages you at 2am. Walk the decision tree below before you write a single manifest.

Four decisions before you deploy:

  • Will it scale or restart? Yes: NIMCache on a PVC. No: emptyDir is fine for a demo.
  • One model or many? Many small models: Multi-LLM NIM. One hot model: LLM-specific NIM.
  • Native or KServe? Already on KServe: use it. Greenfield: native is simpler.
  • Needs more than one GPU? On vGPU you need A100/H100 with NVLink. L40S + vGPU cannot do it.

The one that surprises people: tensor-parallel models split across GPUs over NVLink. In a vGPU VKS cluster, that path is only certified on A100 or H100 connected by NVLink or an NVLink Switch. Put a 70B model on L40S with vGPU and it simply will not start. The fix is a hardware decision made at procurement, not a YAML flag you can flip later. Size the GPU to the largest model you intend to run, with headroom. See the GPU selection and partitioning posts for the hardware view.

The multi-GPU vGPU constraint is a procurement decision, not a config flag. Plan for it early.
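
The vGPU rule is simple enough to encode as a pre-flight check in whatever tooling you use to plan capacity. This is a sketch of the rule exactly as stated in this post, nothing more. The function name is hypothetical, the certified set contains only the two GPUs named above, and passthrough mode is deliberately out of scope.

```python
# GPUs this post names as certified for multi-GPU models on vGPU (with NVLink).
NVLINK_VGPU_CERTIFIED = {"A100", "H100"}


def vgpu_can_serve(gpu_model, gpus_needed, has_nvlink):
    """Return (ok, reason) for serving a model on a vGPU-backed VKS cluster."""
    if gpus_needed <= 1:
        return True, "single-GPU model, no NVLink requirement"
    if gpu_model not in NVLINK_VGPU_CERTIFIED:
        return False, f"{gpu_model} is not listed for multi-GPU on vGPU"
    if not has_nvlink:
        return False, f"{gpu_model} needs NVLink or an NVLink Switch for multi-GPU"
    return True, f"{gpu_model} with NVLink can serve a {gpus_needed}-GPU model"


for gpu, needed, nvlink in [("L40S", 1, False), ("L40S", 2, False), ("H100", 4, True)]:
    ok, reason = vgpu_can_serve(gpu, needed, nvlink)
    print(gpu, needed, "OK" if ok else "BLOCKED", "-", reason)
```

Run it against the largest model on your roadmap before the purchase order goes out, not after.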

The storage choice is the one I see fumbled most. NIMService now supports emptyDir and hostPath in addition to PVCs, and the convenience is tempting. Resist it for anything real. emptyDir data is deleted along with the pod, so every replacement pod re-pulls the model, and hostPath ties a pod to one node and breaks the moment the scheduler places it elsewhere. Both are fine for a proof of concept and wrong for production. Use a NIMCache on a proper PVC and move on. For the design rationale behind the cluster shape this all runs on, the reference architecture and sizing post is the companion read.


Disclaimer: validate the operator and GPU Operator versions against the Private AI Services interoperability matrix before deploying. Pin chart versions, confirm your NGC entitlement covers the NIM containers you intend to run, pre-stage caches in a non-production namespace, and test a scale-out and a rolling upgrade before you point real traffic at it.

My take

If you are serving more than one or two models, the NIM Operator is not optional, it is the only sane way to run inference on Private AI. The hand-built path does not survive the first autoscaling requirement. That said, two features in the 3.x line are still maturing and I would not bet production on them yet: DRA-based GPU allocation is a technology preview, and the Dynamo and Kata sandbox integrations are explicitly experimental. Use the boring, stable core, NIMCache plus NIMService with HPA on GPU metrics, and treat the rest as a roadmap. The biggest operational win is cultural, not technical: once serving is YAML in a git repo, your AI platform finally gets reviewed, rolled back and audited like the rest of your infrastructure.

Next in the series we leave the serving layer and tackle the network underneath it. What is the one model you most wish the operator handled better today? Tell me in the comments.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
