TL;DR · Key Takeaways
- The NIM Operator is the declarative control plane for inference on Private AI. You describe model serving in YAML, and the operator reconciles pods, caches, scaling and routing.
- Four custom resources do the work: NIMCache, NIMService, NIMPipeline and NIMBuild.
- Private AI Services 2.1 ships GPU Operator 25.10.1 and NIM Operator 3.1.0 by default. Pin those versions, do not float to latest.
- Pre-warm the cache with NIMCache or your first autoscale event waits on a multi-GB model pull. This is the single biggest day-1 mistake.
- The constraint that bites: on vGPU, a model needing more than one GPU requires A100 or H100 with NVLink. L40S with vGPU cannot serve multi-GPU models at all.
Most teams meet the NIM Operator the same way: they deployed a single NIM container by hand from the NIM microservices walkthrough, it worked, and then they tried to run six models with autoscaling and rolling updates and the hand-built approach fell apart. The operator exists for exactly that second moment. It turns model serving into Kubernetes objects you can version, review and reconcile, instead of a pile of helm install commands nobody remembers. This post is the reference architecture: what the pieces are, how they connect, how they size, and where the sharp edges are.
Where the NIM Operator sits in the Private AI stack
The operator is not infrastructure. It is an application-layer controller that sits on top of everything you built in the deployment phase. Physical GPUs live on the ESXi hosts. The GPU Operator makes those GPUs schedulable inside a VKS cluster by managing the vGPU guest driver, the container toolkit and the device plugin. Only then does the NIM Operator have something to schedule against. Get the layering wrong and you will spend an afternoon debugging a NIMService that cannot find a GPU when the real problem is two layers down.
The four custom resources, and what each one owns
The whole operator is four CRDs. Learn what each one owns and the rest of the design follows. The important relationship is that caching and serving are separate objects on purpose: you populate a cache once, then point many services at it without re-downloading anything.
| Custom resource | What it owns | Key spec fields | Use it when |
|---|---|---|---|
| NIMCache | Model artifacts on a PVC, downloaded once | source.ngc / source.hf, model.engine, model.tensorParallelism, storage.pvc | Always, for any model you scale or restart |
| NIMService | A running inference endpoint and its pods | image, storage, scale.hpa, expose.router, affinity | Every model you serve |
| NIMPipeline | A group of NIMServices managed as one unit | services[] (each an embedded NIMService spec) | RAG or agent stacks with several models |
| NIMBuild | A locally compiled TensorRT-LLM engine | source profile, target GPU, output cache | You need an engine tuned to your exact GPU |
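To make the NIMPipeline row concrete, here is a rough sketch of two services grouped into one pipeline. It assumes each entry under `services` carries a `name`, an `enabled` flag and an embedded NIMService `spec`; the pipeline name, the second service and its image are hypothetical placeholders, so check the field names against the CRD shipped with your operator version before relying on it.

```yaml
# Sketch only: field layout assumed from the services[] description in the table.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: rag-pipeline                 # hypothetical name
spec:
  services:
    - name: llama-3-1-8b
      enabled: true                  # assumed per-service toggle
      spec:                          # embedded NIMService spec
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: "1.8.0"
        storage:
          nimCache:
            name: llama-3-1-8b-cache
    - name: embedder                 # hypothetical second model in the stack
      enabled: true
      spec:
        image:
          repository: registry.example.com/nim/embedding-model   # placeholder image
          tag: "0.0.0"                                           # placeholder tag
        storage:
          nimCache:
            name: embedder-cache     # hypothetical cache name
```

The appeal of this shape is that a RAG stack is reviewed and rolled out as a single object rather than as several loosely related manifests.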
A minimal operator install and NIMService look like this. Note that the service references the cache by name, and that with autoscaling enabled you do not set spec.replicas at all, because the HPA owns it.

```shell
# Install the operator (pin the version, do not float to latest)
helm install nim-operator nvidia/k8s-nim-operator \
  --version 3.1.0 -n nim-operator --create-namespace
```

```yaml
# NIMService with the cache referenced and HPA owning replicas
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-8b
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.8.0"
  storage:
    nimCache:
      name: llama-3-1-8b-cache   # mount the warm cache, no re-download
  scale:
    enabled: true                # HPA on; do NOT also set spec.replicas
    hpa:
      minReplicas: 1
      maxReplicas: 4
  expose:
    router:
      gateway:                   # Gateway API routing, new default in 3.1
        name: inference-gw
```
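The service above mounts a cache called `llama-3-1-8b-cache`, so something has to create it first. A minimal sketch of the matching NIMCache follows, using the fields listed in the table. The secret names, storage class, PVC size and access mode are assumptions for illustration, not values from this environment.

```yaml
# Sketch only: pre-warm the model cache that the NIMService references by name.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-1-8b-cache
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.0
      pullSecret: ngc-secret          # assumed image pull secret name
      authSecret: ngc-api-secret      # assumed NGC API key secret name
      model:
        engine: tensorrt_llm          # restrict the download to one engine type
        tensorParallelism: "1"        # single-GPU profile
  storage:
    pvc:
      create: true
      storageClass: example-rwx-storage-class   # hypothetical storage class
      size: "50Gi"                              # assumed size; match your model
      volumeAccessMode: ReadWriteMany           # lets several replicas share the cache
```

Apply the cache and wait for it to report ready before creating the service, so that the first scale-out mounts a populated volume instead of starting a download.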
The request path and how autoscaling actually triggers
Autoscaling is where people expect magic and get confused. The operator wires a standard Kubernetes HPA to your NIMService, but the metric it scales on matters. CPU utilization is useless for inference: a GPU pod can be saturated at 12 percent CPU. You want to scale on GPU utilization from DCGM, or better, on NIM-specific request metrics like queue depth and time-to-first-token. Version 3.1 also moved routing to the Kubernetes Gateway API by default, with the old ingress field deprecated. If you templated against spec.expose.ingress, migrate it to spec.expose.router.ingress before you upgrade.
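As an illustration of scaling on a request metric instead of CPU, the fragment below sketches an HPA metric on queue depth. It assumes that `scale.hpa.metrics` accepts standard `autoscaling/v2` metric entries, that a custom metrics adapter (for example the Prometheus adapter) already publishes the metric to the Kubernetes metrics API, and that the metric is named `num_requests_waiting`. The metric name and the threshold are both assumptions to verify against your own NIM metrics endpoint.

```yaml
# Sketch only: fragment of a NIMService spec, scaling on queue depth rather than CPU.
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Object
          object:
            metric:
              name: num_requests_waiting   # assumed metric name for queue depth
            describedObject:
              apiVersion: v1
              kind: Service
              name: llama-3-1-8b           # the Service fronting this NIMService
            target:
              type: Value
              value: "10"                  # placeholder threshold; tune under load
```

Without a metrics adapter in the cluster the HPA has nothing to read and will not scale, which is usually the first thing to check when replicas stay pinned at the minimum.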
Design decisions that actually bite
Four choices separate a deployment that survives contact with production from one that pages you at 2am. Walk the decision tree below before you write a single manifest.
The storage choice is the one I see fumbled most. NIMService now supports emptyDir and hostPath in addition to PVCs, and the convenience is tempting. Resist it for anything real. emptyDir data evaporates on pod restart, so every restart re-pulls the model, and hostPath ties a pod to one node and breaks the moment the scheduler moves it. Both are fine for a proof of concept and wrong for production. Use a NIMCache on a proper PVC and move on. For the design rationale behind the cluster shape this all runs on, the reference architecture and sizing post is the companion read.
My take
If you are serving more than one or two models, the NIM Operator is not optional, it is the only sane way to run inference on Private AI. The hand-built path does not survive the first autoscaling requirement. That said, two features in the 3.x line are still maturing and I would not bet production on them yet: DRA-based GPU allocation is a technology preview, and the Dynamo and Kata sandbox integrations are explicitly experimental. Use the boring, stable core, NIMCache plus NIMService with HPA on GPU metrics, and treat the rest as a roadmap. The biggest operational win is cultural, not technical: once serving is YAML in a git repo, your AI platform finally gets reviewed, rolled back and audited like the rest of your infrastructure.
Next in the series we leave the serving layer and tackle the network underneath it. What is the one model you most wish the operator handled better today? Tell me in the comments.