NVIDIA NIM Operator on VMware Private AI: The Reference Architecture for Declarative Model Serving (Private AI Series, Part 25)

The NIM Operator is the Kubernetes-native control plane for model serving on VMware Private AI. Here is how its CRDs, caching and autoscaling actually fit together, and the vGPU constraint that bites multi-GPU models.

by Dr. Pranay Jha · June 17, 2026

VMware Private AI Series · Part 25 of 30

TL;DR · Key Takeaways

The NIM Operator is the declarative control plane for inference on Private AI. You describe model serving in YAML, and the operator reconciles pods, caches, scaling and routing.
Four custom resources do the work: NIMCache, NIMService, NIMPipeline and NIMBuild.
Private AI Services 2.1 ships GPU Operator 25.10.1 and NIM Operator 3.1.0 by default. Pin those versions; do not float to latest.
Pre-warm the cache with NIMCache or your first autoscale event waits on a multi-GB model pull. This is the single biggest day-1 mistake.
The constraint that bites: on vGPU, a model needing more than one GPU requires A100 or H100 with NVLink. L40S with vGPU cannot serve multi-GPU models at all.

Who this is for: architects and platform engineers running model serving on VMware Private AI Foundation with NVIDIA on VKS clusters. Prerequisites: a GPU-enabled VKS cluster with the NVIDIA GPU Operator installed, an NGC API key, and a default storage class that can provision PVCs for the model cache. Use ReadWriteMany if replicas on different nodes will share one cache; ReadWriteOnce is enough only when every pod that mounts the cache runs on the same node.

Most teams meet the NIM Operator the same way: they deployed a single NIM container by hand from the NIM microservices walkthrough, it worked, and then they tried to run six models with autoscaling and rolling updates and the hand-built approach fell apart. The operator exists for exactly that second moment. It turns model serving into Kubernetes objects you can version, review and reconcile, instead of a pile of helm install commands nobody remembers. This post is the reference architecture: what the pieces are, how they connect, how they size, and where the sharp edges are.

Where the NIM Operator sits in the Private AI stack

The operator is not infrastructure. It is an application-layer controller that sits on top of everything you built in the deployment phase. Physical GPUs live on the ESXi hosts. The GPU Operator makes those GPUs schedulable inside a VKS cluster by managing the vGPU guest driver, the container toolkit and the device plugin. Only then does the NIM Operator have something to schedule against. Get the layering wrong and you will spend an afternoon debugging a NIMService that cannot find a GPU when the real problem is two layers down.

The NIM Operator is the top layer. Everything under it must be green before a NIMService can schedule.
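
One way to confirm the lower layers are healthy before creating any NIM resources is a throwaway pod that requests a single GPU and runs nvidia-smi. This is a minimal sketch and not part of the operator itself: the pod name and the CUDA image tag are assumptions, so substitute an image your registry policy allows.

```yaml
# Smoke test for the layers below the NIM Operator.
# Pod name and image tag are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag; use one you have mirrored
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU resource advertised by the device plugin
```

If this pod stays Pending, or its logs show no GPU, the fault is in the GPU Operator or vGPU layer, and no NIMService change will fix it.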

The four custom resources, and what each one owns

The whole operator is four CRDs. Learn what each one owns and the rest of the design follows. The important relationship is that caching and serving are separate objects on purpose: you populate a cache once, then point many services at it without re-downloading anything.

NIMCache and NIMService are deliberately separate objects so one cache backs many services.

Custom resource	What it owns	Key spec fields	Use it when
NIMCache	Model artifacts on a PVC, downloaded once	source.ngc / source.hf, model.engine, model.tensorParallelism, storage.pvc	Always, for any model you scale or restart
NIMService	A running inference endpoint and its pods	image, storage, scale.hpa, expose.router, affinity	Every model you serve
NIMPipeline	A group of NIMServices managed as one unit	services[] (each an embedded NIMService spec)	RAG or agent stacks with several models
NIMBuild	A locally compiled TensorRT-LLM engine	source profile, target GPU, output cache	You need an engine tuned to your exact GPU

The serving half of a minimal cache-then-serve pair looks like this, preceded by the pinned operator install. Note that the service references the cache by name, and that with autoscaling enabled you do not set spec.replicas at all; the HPA owns it.

# Install the operator (pin the version, do not float to latest)
helm install nim-operator nvidia/k8s-nim-operator \
  --version 3.1.0 -n nim-operator --create-namespace

# NIMService with the cache referenced and HPA owning replicas
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-8b
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.8.0"
  storage:
    nimCache:
      name: llama-3-1-8b-cache   # mount the warm cache, no re-download
  scale:
    enabled: true                # HPA on; do NOT also set spec.replicas
    hpa:
      minReplicas: 1
      maxReplicas: 4
  expose:
    router:
      gateway:                   # Gateway API routing, new default in 3.1
        name: inference-gw
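
The manifest above covers only the serving side. A sketch of the NIMCache it references by name might look like the following. The secret names (ngc-secret, ngc-api-secret), the storage class name and the 50Gi size are assumptions for illustration, and the field layout follows the upstream NIM Operator samples, so verify it against the CRD installed by the chart version you pinned.

```yaml
# NIMCache backing the llama-3-1-8b NIMService (sketch; verify fields against your CRD version)
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-1-8b-cache           # must match spec.storage.nimCache.name in the NIMService
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.0   # same image and tag as the service
      pullSecret: ngc-secret          # assumed name: image pull secret for nvcr.io
      authSecret: ngc-api-secret      # assumed name: secret holding the NGC API key
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"        # single-GPU profile
  storage:
    pvc:
      create: true
      storageClass: vsan-default      # assumed storage class name
      size: 50Gi                      # assumed; size to the model profiles you cache
      volumeAccessMode: ReadWriteMany # lets replicas on different nodes share the cache
```

Apply the cache first and wait until kubectl get nimcache reports it as ready before creating the service, so the first pod mounts a warm PVC instead of triggering a download.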

The request path and how autoscaling actually triggers

Autoscaling is where people expect magic and get confused. The operator wires a standard Kubernetes HPA to your NIMService, but the metric it scales on matters. CPU utilization is useless for inference: a GPU pod can be saturated at 12 percent CPU. Scale on GPU utilization from DCGM or, better, on NIM-specific request metrics such as queue depth and time-to-first-token. Version 3.1 also moved routing to the Kubernetes Gateway API by default and deprecated the old ingress field. If you templated against spec.expose.ingress, migrate it to spec.expose.router.ingress before you upgrade.

Metrics from DCGM and the NIM itself drive the HPA. CPU-based scaling will never fire correctly for inference.
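
As a sketch of scaling on a request metric rather than CPU, the scale block of a NIMService can carry standard autoscaling/v2 metric entries. The fragment below assumes that a Prometheus adapter (or an equivalent custom-metrics adapter) already publishes a per-pod metric named num_requests_waiting through the custom metrics API. The metric name, the target value of 10 and the presence of the adapter are all assumptions to validate in your own cluster.

```yaml
# Fragment of a NIMService spec: HPA driven by queue depth instead of CPU.
# Assumes a custom-metrics adapter exposes num_requests_waiting per pod.
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: num_requests_waiting   # assumed metric name; confirm what your NIM exports
            target:
              type: AverageValue
              averageValue: "10"           # assumed threshold: average queued requests per pod
```

A DCGM GPU-utilization metric can go in the same slot if that is what your adapter exposes. Whichever metric you pick, confirm it is visible through the custom metrics API before relying on it, because an HPA that cannot read its metric will not scale.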

Design decisions that actually bite

Four choices separate a deployment that survives contact with production from one that pages you at 2am. Walk the decision tree below before you write a single manifest.

The multi-GPU vGPU constraint is a procurement decision, not a config flag. Plan for it early.
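
To make the constraint concrete, this is roughly what a multi-GPU request looks like in the manifests. It is a sketch under assumptions: only the two relevant fields are shown, and the value 2 is illustrative rather than a recommendation for any particular model.

```yaml
# Fragments only: the two fields that turn a deployment into a multi-GPU one.
# The value 2 is an illustrative assumption.

# In the NIMCache: cache a profile built for two-way tensor parallelism
spec:
  source:
    ngc:
      model:
        tensorParallelism: "2"
---
# In the matching NIMService: request two GPUs for each replica
spec:
  resources:
    limits:
      nvidia.com/gpu: 2
```

With autoscaling on, every additional replica claims another two GPUs, so maxReplicas has to be sized against the GPUs that can actually satisfy this request.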

The storage choice is the one I see fumbled most. NIMService now supports emptyDir and hostPath in addition to PVCs, and the convenience is tempting. Resist it for anything real. emptyDir data evaporates on pod restart, so every restart re-pulls the model, and hostPath ties a pod to one node and breaks the moment the scheduler moves it. Both are fine for a proof of concept and wrong for production. Use a NIMCache on a proper PVC and move on. For the design rationale behind the cluster shape this all runs on, the reference architecture and sizing post is the companion read.

Disclaimer: validate the operator and GPU Operator versions against the Private AI Services interoperability matrix before deploying. Pin chart versions, confirm your NGC entitlement covers the NIM containers you intend to run, pre-stage caches in a non-production namespace, and test a scale-out and a rolling upgrade before you point real traffic at it.

My take

If you are serving more than one or two models, the NIM Operator is not optional; it is the only sane way to run inference on Private AI. The hand-built path does not survive the first autoscaling requirement. That said, two features in the 3.x line are still maturing and I would not bet production on them yet: DRA-based GPU allocation is a technology preview, and the Dynamo and Kata sandbox integrations are explicitly experimental. Use the boring, stable core (NIMCache plus NIMService with HPA on GPU metrics) and treat the rest as a roadmap. The biggest operational win is cultural, not technical: once serving is YAML in a git repo, your AI platform finally gets reviewed, rolled back and audited like the rest of your infrastructure.

Next in the series we leave the serving layer and tackle the network underneath it. What is the one model you most wish the operator handled better today? Tell me in the comments.

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
