Scale on gpu_cache_usage_perc or request-queue depth via Prometheus HPA or KEDA. Never scale on CPU utilization. Cold-start latency during scale-up is the real production risk; shared-storage caching is the fix. A blue-green rollout via replicated NIMService resources enables zero-downtime model swaps. My baseline: 70% KV-cache target, minReplicas=2, a 20-minute startup probe window, and KEDA over native HPA for multi-metric fanout.
A NIM container cold-starting on an unscheduled node while your SLA clock is ticking is not a theoretical problem. I have watched a Llama 3.1 70B NIM take 14 minutes to pull model weights over the network on scale-up, at which point the HPA had already fired another replica and the queue had blown past the latency budget. The NIM Operator exists to break that failure loop. This post is about how to deploy it correctly, wire the right autoscaling signals, and avoid the three production failure modes I see most often: scaling on the wrong metric, cold-starting without a cache, and zero-downtime rollout attempts that are not actually zero-downtime.
The NIM Operator: What It Actually Manages
The NIM Operator is a standard Kubernetes operator installed via Helm from the NGC catalog (nvidia/k8s-nim-operator). It introduces two CRDs: NIMCache and NIMService. Think of NIMCache as the model artifact manager and NIMService as the deployment lifecycle manager. They are separate objects on purpose: you cache once per model version, then deploy many service instances against that cache without each replica pulling weights independently.
The operator reconcile loop works like this. When you apply a NIMCache, the operator spins up a Job that pulls the model container from NGC and extracts the model weights to a PVC. Once the NIMCache status transitions to Ready, that PVC is the golden copy. Every NIMService pod that references this NIMCache mounts the PVC read-only, bypassing the network pull entirely. This is the architectural reason you should never run NIM in production without a NIMCache, even if it feels like an extra step during initial setup.
Model Profile Selection and the Profile Miss
Each NIMCache can contain multiple optimized model profiles, for example vllm-fp8-tp1-pp1 for a single H100 with FP8 or vllm-fp16-tp2-pp1 for tensor-parallel across two GPUs. You select a profile in the NIMService via spec.storage.nimCache.profile. The silent failure mode here: if the profile string you specify does not match a name in the cached manifest, the pod starts, loads a default profile (often a heavier FP16), and you consume more GPU memory than you planned for, which breaks your packing math. Always run a NIMService dry-run or inspect the NIMCache status to enumerate available profiles before committing to a value in production.
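One way to catch a profile miss before deploying is to compare the requested profile string against what the cache reports. The sketch below assumes the NIMCache object has already been dumped to a dict (for example from `kubectl get nimcache <name> -o json`) and that cached profile names sit under `status.profiles[].name`. That field path and the `check_profile` helper are assumptions for illustration, so verify them against your operator version.

```python
from typing import Any


def check_profile(nimcache: dict[str, Any], wanted: str) -> list[str]:
    """Return a list of problems; an empty list means the profile looks safe to use."""
    problems: list[str] = []
    status = nimcache.get("status", {})

    # The manifest comments in this post gate on status.state == Ready.
    if status.get("state") != "Ready":
        problems.append(f"cache state is {status.get('state')!r}, not 'Ready'")

    # ASSUMPTION: cached profiles are listed under status.profiles[].name.
    cached = [p.get("name") for p in status.get("profiles", [])]
    if wanted not in cached:
        problems.append(f"profile {wanted!r} not in cached profiles {cached}")
    return problems


# Hypothetical object, shaped like the JSON output of kubectl.
cache = {
    "status": {
        "state": "Ready",
        "profiles": [{"name": "vllm-fp8-tp1-pp1"}, {"name": "vllm-fp16-tp2-pp1"}],
    }
}
print(check_profile(cache, "vllm-fp8-tp1-pp1"))  # exact match: no problems
print(check_profile(cache, "vllm-fp8-tp1"))      # truncated name is flagged
```

Running a check like this in CI, before the NIMService manifest is applied, turns the silent fallback into a failed pipeline step.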
Helm vs. NIM Operator: Which Path to Take
You have two deployment paths. Raw Helm (the NIM LLM chart from NGC) gives you speed for a single-model proof of concept. The NIM Operator gives you lifecycle management, multi-model support, and integrated autoscaling for anything you intend to run past a sprint demo. The table below makes the trade-off concrete.
| Dimension | Raw Helm (NIM chart) | NIM Operator (CRDs) |
|---|---|---|
| Setup complexity | Low — one Helm install | Medium — GPU Operator + Operator install first |
| Model caching | Manual PVC + initContainers | NIMCache CRD handles it natively |
| Autoscaling wiring | Manual HPA + Prometheus Adapter | HPA spec in NIMService; KEDA alongside |
| Multi-model clusters | One chart per model, no shared lifecycle | Multiple NIMService objects, shared CRD schema |
| Rollout strategy | Helm upgrade (rolling; can stall on GPU alloc) | Blue-green via NIMService duplication + traffic switch |
| Air-gap / private registry | Requires manual image mirroring | NIMCache spec supports private registry overrides |
| GitOps readiness | HelmRelease in Flux/Argo | CRDs as first-class GitOps objects |
Autoscaling on the Right Signals
This is where most NIM deployments go wrong. CPU utilization is the wrong signal for an LLM inference workload. The GPU is doing all the compute, and the CPU is largely idle during forward passes; a CPU-based HPA will sit at 5% and never scale out even as your GPUs are saturated. The right signals are inference-specific metrics that NIM exposes on its Prometheus endpoint by default.
The three metrics that matter in production, in order of preference:
- gpu_cache_usage_perc — percentage of the GPU KV-cache that is occupied. This is the most direct measure of how saturated the model is. Scale out when this exceeds 70% on average; scale in below 30%.
- num_requests_running — the live concurrency count. Useful as a secondary trigger to catch burst traffic before the KV-cache saturates.
- time_to_first_token_seconds (P90) — latency-oriented scaling. If P90 TTFT exceeds your SLA, you need more capacity regardless of cache state. Wire this only if you have a hard latency contract and can accept the additional replica overhead it implies.
Native HPA reads these metrics through the Prometheus Adapter, which requires an APIService registration and tends to be brittle on cert rotation. KEDA is operationally simpler: deploy the KEDA controller, write a ScaledObject pointing at your Prometheus source, and KEDA manages the HPA object for you. You still get standard HPA behavior (stabilization windows, min/max replica bounds), but without the Adapter plumbing. The trade-off is an additional controller in your cluster; it is worth it for anything running more than two NIM models.
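Because KEDA hands the final decision to an HPA object, it helps to see how a threshold becomes a replica count. The sketch below follows the core calculation described in the Kubernetes HPA documentation, desired = ceil(current * metric / target), with a tolerance band around the target. The 0.1 tolerance is the usual controller default as far as I know; treat it, and the function name, as assumptions.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_r: int, max_r: int, tolerance: float = 0.1) -> int:
    """HPA-style replica calculation for a metric averaged across pods."""
    ratio = metric / target
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    return max(min_r, min(max_r, math.ceil(current * ratio)))


# 70% KV-cache target, bounds of 2 to 8 replicas as in this post.
print(desired_replicas(2, 90, 70, 2, 8))   # above target: scale out
print(desired_replicas(2, 72, 70, 2, 8))   # within tolerance: no change
print(desired_replicas(4, 20, 70, 2, 8))   # well below target: back to the floor
```

The practical consequence is that a small overshoot of the target does nothing; the average has to clear roughly 77% before a 70% target adds a replica.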
Cold-Start: The Scaling Tax Nobody Budgets For
Every new NIM pod must load the model into GPU VRAM before it can serve requests. For a 70B parameter model at FP8 (one byte per parameter), that is roughly 70 GB going from storage to GPU memory. Over NFS or a standard ReadWriteOnce PVC, that can run 8 to 14 minutes. The NIM Operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and periodSeconds of 10, giving new pods up to 20 minutes before the kubelet kills them. Do not shorten this window. A probe timeout that kills a loading pod restarts the scale-up process and can put you in a crash loop during a traffic surge.
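The probe arithmetic is worth writing down, because a shortened window usually comes from not multiplying it out. A minimal sketch, ignoring initialDelaySeconds and per-probe timeouts; the helper names are mine:

```python
def probe_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Longest time a pod may keep failing its startup probe before it is restarted."""
    return failure_threshold * period_seconds


def headroom_seconds(load_minutes: float, failure_threshold: int,
                     period_seconds: int) -> float:
    """Positive: the load finishes inside the window. Negative: killed mid-load."""
    return probe_window_seconds(failure_threshold, period_seconds) - load_minutes * 60


baseline = probe_window_seconds(120, 10)
print(baseline, baseline / 60)  # window in seconds and in minutes

# The 8 to 14 minute load times quoted above, against the baseline window.
for minutes in (8, 14):
    print(minutes, headroom_seconds(minutes, 120, 10))

# Hypothetical shortened probe: 6 failures x 10 s = 60 s window.
print(headroom_seconds(8, 6, 10))  # negative, so the pod never finishes loading
```

With the baseline values, even the slow 14-minute load leaves six minutes of headroom; a 60-second window leaves none at any realistic load time.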
The mitigation stack, in order of effectiveness: (1) Use a NIMCache with a high-throughput PVC, ideally a parallel file system such as GPFS/Weka that reads fast enough that storage is no longer the bottleneck in the weight load. (2) Set minReplicas: 2 so you always have warm capacity. (3) Set KEDA or HPA stabilization windows: scale-out should respond fast (30s), scale-in should be slow (10 to 15 minutes). Scaling in before a new pod is warm is another common way to create a capacity hole.
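To see why a slow scale-in window protects warm capacity, here is a small simulation of HPA-style scale-down stabilization, where the controller acts on the highest recommendation seen inside the window. It is a sketch: the window is counted in evaluation ticks rather than seconds, and the recommendation series is invented.

```python
from collections import deque


def stabilized(recommendations: list[int], window_ticks: int) -> list[int]:
    """Replica count per tick when scale-down uses the max recommendation in the window."""
    recent: deque[int] = deque(maxlen=window_ticks)
    applied = []
    for rec in recommendations:
        recent.append(rec)
        # Scale-up takes effect immediately; scale-down waits until every
        # recommendation in the window agrees on the lower value.
        applied.append(max(recent))
    return applied


# Hypothetical raw recommendations during bursty traffic.
raw = [2, 4, 4, 2, 2, 2, 4, 2]
print(stabilized(raw, window_ticks=1))  # no stabilization: replicas flap with traffic
print(stabilized(raw, window_ticks=3))  # short dips no longer release warm pods
```

With a 30-second evaluation interval, a 10-minute scale-in window corresponds to about 20 ticks, so a dip has to last the full 10 minutes before a warm pod is released.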
The Production Artifact: NIMService + KEDA ScaledObject
Below is a production-ready NIMService manifest paired with a KEDA ScaledObject keyed on gpu_cache_usage_perc. Read the inline comments carefully; the failure modes are annotated where they occur.
---
# NIMCache: run this FIRST and wait for status.state == Ready
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-1-70b-cache
  namespace: nim-prod
spec:
  model:
    engine: vllm
    repoName: meta/llama-3.1-70b-instruct
    # VERIFY: check NGC for latest tag before pinning
    tag: "1.3.0"
    precision: fp8
    # Select the single-GPU FP8 profile explicitly.
    # If this string mismatches the manifest, the pod
    # falls back to a heavier default -- check NIMCache status.
    profiles:
      - vllm-fp8-tp1-pp1
  storage:
    pvc:
      create: true
      storageClass: "fast-parallel-fs"
      size: "250Gi"
      accessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
---
# NIMService: deploy AFTER NIMCache status is Ready
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-70b
  namespace: nim-prod
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-70b-instruct
    # VERIFY: pin to same tag as NIMCache
    tag: "1.3.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - name: ngc-registry-secret
  storage:
    nimCache:
      name: llama-3-1-70b-cache
      profile: vllm-fp8-tp1-pp1
  replicas: 2  # minReplicas floor; KEDA will override
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: "80Gi"
  readinessProbe:
    httpGet:
      path: /v1/health/ready
      port: 8000
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 5
  # Startup probe: give the model up to 20 min to load into VRAM.
  # Do NOT shorten failureThreshold -- a killed loading pod
  # restarts the clock and creates a capacity hole under load.
  startupProbe:
    httpGet:
      path: /v1/health/ready
      port: 8000
    failureThreshold: 120
    periodSeconds: 10
  expose:
    service:
      type: ClusterIP
      port: 8000
  metrics:
    enabled: true  # enables /metrics on the same port
---
# KEDA ScaledObject: scale on gpu_cache_usage_perc
# WRONG signal: do NOT use cpu or memory here
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-3-1-70b-scaler
  namespace: nim-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    # NIMService creates a Deployment with this name
    name: llama-3-1-70b
  minReplicaCount: 2
  maxReplicaCount: 8
  pollingInterval: 30
  # cooldownPeriod only governs scale-to-zero in KEDA. With minReplicaCount: 2
  # the scale-in delay comes from the HPA behavior block below.
  cooldownPeriod: 600
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # 10 min: let new pods warm up before scaling in
          stabilizationWindowSeconds: 600
  triggers:
    - type: prometheus
      # Value, not the default AverageValue: the query already averages
      # across pods, so it must not be divided by replica count again.
      metricType: Value
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: nim_gpu_cache_usage_perc
        # Target 70% KV-cache utilization per replica.
        # Above this threshold KEDA fires a scale-out event.
        threshold: "70"
        query: >
          avg(gpu_cache_usage_perc{namespace="nim-prod",
          pod=~"llama-3-1-70b.*"}) * 100
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: nim_requests_running
        # Secondary trigger: scale if avg concurrency exceeds 40 req/replica.
        threshold: "40"
        query: >
          avg(num_requests_running{namespace="nim-prod",
          pod=~"llama-3-1-70b.*"})
Expected behavior: at steady state, two pods serve traffic. When gpu_cache_usage_perc averages above 70%, KEDA fires a scale-out within 30 seconds; the new pod mounts the NIMCache PVC and begins warming; the startup probe keeps it out of rotation until ready. Scale-in does not fire for 10 minutes after traffic drops, protecting against premature warm-pod termination. Note that KEDA's cooldownPeriod only governs scaling to zero; between minReplicaCount and maxReplicaCount, the scale-in delay comes from the HPA scaleDown stabilization window. Common failure mode: setting that scale-in delay too short (under 300s) and terminating a pod that is still loading, so the next traffic spike finds fewer warm replicas than expected.
Request Queue, Batching, and Why It Shapes Your Scaling Thresholds
NIM LLMs run continuous batching by default through vLLM. Requests arriving faster than the decode rate are queued in memory, not rejected. This is good for throughput but means num_requests_running can spike before gpu_cache_usage_perc catches up, because the queued requests have not yet been scheduled into the KV-cache. Setting both triggers in the ScaledObject above handles this: the concurrency trigger fires fast, the cache-usage trigger fires on sustained load.
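When a ScaledObject carries more than one trigger, the resulting HPA evaluates each metric separately and acts on the largest replica count. The sketch below applies that rule to the two thresholds used in this post (70% cache, 40 running requests per replica). It leaves out the HPA tolerance band for brevity, and the sample metric values are invented.

```python
import math


def replicas_for(current: int, metric: float, target: float) -> int:
    """Replica count one metric asks for, given its current average and target."""
    return math.ceil(current * metric / target)


def combined(current: int, cache_pct: float, running: float,
             cache_target: float = 70, running_target: float = 40,
             min_r: int = 2, max_r: int = 8) -> int:
    """Take the largest per-metric recommendation, then clamp to the replica bounds."""
    wanted = max(replicas_for(current, cache_pct, cache_target),
                 replicas_for(current, running, running_target))
    return max(min_r, min(max_r, wanted))


# Burst: concurrency jumps before the KV-cache fills, so the concurrency trigger leads.
print(combined(current=2, cache_pct=45, running=70))
# Sustained load: concurrency is moderate but the cache is nearly full.
print(combined(current=2, cache_pct=95, running=30))
```

In the burst case only the concurrency trigger asks for more replicas; in the sustained case only the cache trigger does. That is the division of labor the paragraph above describes.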
What breaks the batching model: very long prompts. A single 100K-token context request can occupy 60 to 80% of a GPU KV-cache by itself, triggering an autoscale event for what is effectively one user. If your workload mixes long-context and short-context requests, consider per-deployment request limits or a separate NIMService instance sized for long-context traffic with a higher cache threshold target (85 to 90%).
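A back-of-envelope calculation shows why one long request dominates the cache. KV-cache per token is 2 (keys and values) x layers x KV heads x head dimension x bytes per value. The shape below (80 layers, 8 KV heads, head dimension 128) is my reading of the published Llama 3.1 70B config, and the 48 GB KV budget is a made-up pool size for illustration, not a measurement.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV-cache bytes one token occupies across all layers (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# ASSUMPTION: Llama 3.1 70B shape, with a 16-bit KV cache (2 bytes per value).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

context_tokens = 100_000
request_bytes = per_token * context_tokens

# HYPOTHETICAL: total KV-cache pool available to the deployment.
kv_budget_bytes = 48e9
share = request_bytes / kv_budget_bytes

print(per_token)                 # bytes per token
print(request_bytes / 1e9)       # GB for the single 100K-token request
print(round(share * 100, 1))     # percent of the hypothetical KV pool
```

Under these assumptions a single request takes about two thirds of the pool, which is enough to cross a 70% target as soon as a handful of short requests arrive alongside it.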
Rollout Strategy: Blue-Green Over Rolling Update
A standard Kubernetes rolling update works reasonably well for stateless applications. For NIM, it creates a window where you have pods running two different model versions simultaneously, which is fine for patch releases but problematic for model version changes where output distribution differences matter. It also stalls if GPU capacity is tight, because the new pod cannot schedule until an old one is terminated, and terminating an old one under load drops throughput.
The pattern I use: deploy a second NIMService (llama-3-1-70b-green) referencing the new NIMCache profile, run both behind an NGINX Ingress or a Kubernetes Gateway API HTTPRoute with weighted traffic splitting, shift traffic 10% at a time while watching TTFT and error rate, then delete the blue NIMService. The GPU node capacity requirement doubles during the transition, but the rollout is controlled and reversible at any split percentage.
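The shift logic itself is simple enough to sketch. The function below advances the green weight by 10 points when the health signals look good and drops it to zero otherwise. The 800 ms TTFT budget matches the SLA used in the worked example later in this post; the 1% error ceiling and the roll-back-to-zero policy are illustrative assumptions, not something the operator does for you.

```python
def next_green_weight(current: int, ttft_p90_ms: float, error_rate: float,
                      ttft_budget_ms: float = 800.0, max_error_rate: float = 0.01,
                      step: int = 10) -> int:
    """Next traffic percentage for the green NIMService: advance if healthy, else roll back."""
    healthy = ttft_p90_ms <= ttft_budget_ms and error_rate <= max_error_rate
    if not healthy:
        return 0  # send all traffic back to blue
    return min(100, current + step)


print(next_green_weight(0, ttft_p90_ms=650, error_rate=0.001))   # first step
print(next_green_weight(90, ttft_p90_ms=650, error_rate=0.0))    # final step
print(next_green_weight(40, ttft_p90_ms=950, error_rate=0.0))    # TTFT breach: roll back
```

Whatever drives the weights (a script, Argo Rollouts, or a person with kubectl), the value returned here is what you would write into the weighted backends of the route.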
Multi-Model Packing: Sharing GPU Nodes Across NIMServices
If you run more than one NIM model on the same Kubernetes cluster, GPU node packing becomes a scheduling problem. Each NIMService pod requests a discrete number of GPUs via nvidia.com/gpu limits. Kubernetes does not fractionate GPU devices unless you use MIG or time-slicing (covered in the NVIDIA AI Guide). For NIM, the practical packing rule is: one model per GPU device for production LLMs at full precision. Smaller models (8B at FP8) can share a GPU with MIG partitioning, but this is only worth the operational complexity if you have high GPU cost pressure and predictably uneven traffic across models.
For the VCF-side deployment of NIM Operator with a Private Cloud AI reference architecture, see the Private AI NIM Operator post at nvidia-nim-operator-reference-architecture, which covers the VMware Cloud Foundation lens in detail. Here the focus stays on the NVIDIA stack mechanics.
Autoscaling Thresholds by Request Profile
| Workload Profile | KV-Cache Target | Concurrency Trigger | Scale-in Cooldown | Notes |
|---|---|---|---|---|
| Short-context chat (avg 512 tokens) | 70% | 40 req/replica | 600s | Fast decode; cache turns over quickly |
| Long-context RAG (avg 8K tokens) | 85% | 10 req/replica | 900s | Single request saturates; higher cache target ok |
| Batch summarization (offline) | 90% | None (cache only) | 300s | Throughput over latency; fill the cache |
| Mixed (chat + RAG) | 70% | 20 req/replica | 900s | Consider two separate NIMService deployments |
| Real-time coding assistant (<200 tokens) | 60% | 60 req/replica | 480s | TTFT latency critical; scale earlier |
Worked example
Target: short-context enterprise chat assistant, Llama 3.1 70B at FP8, H100 80GB nodes, SLA of 800ms P90 TTFT.
Starting config: 2 replicas (minReplicas=2), 1 H100 per pod, KV-cache target 70%, concurrency trigger 40 req/replica, cooldown 600s.
Traffic pattern: 80 concurrent users at peak, 12 during off-hours.
Scale-out: 80 users across two replicas is ~40 req/replica, right at the concurrency threshold, so any further burst trips the trigger and KEDA fires. The new pod mounts the NIMCache PVC (shared NFS-over-RDMA, weight load ~4 minutes). The startup probe holds the pod out of rotation during load. Effective capacity gap: 4 minutes. Mitigation: minReplicas=2 absorbs the first burst; the third pod is warm within the typical burst duration.
Observed throughput: 2 replicas at 70B FP8 on H100 deliver approximately 1,800 to 2,200 tokens/second aggregate decode at 40 concurrent users, with P90 TTFT around 650ms. Treat these figures as indicative only: they depend on prompt length distribution and vLLM version.
Scale-in: Off-hours at 12 users, KV-cache drops to 15%. KEDA waits 600s, then scales to minReplicas=2. Never goes below 2 warm pods regardless of traffic.
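The replica counts in this example follow from dividing concurrency by the per-replica trigger and clamping to the configured bounds. A quick sketch, using the 40 req/replica trigger and the 2 to 8 replica range from the ScaledObject; the function name is mine:

```python
import math


def replicas_for_users(users: int, per_replica: int = 40,
                       min_r: int = 2, max_r: int = 8) -> int:
    """Replicas needed to keep average concurrency at or below the trigger."""
    return max(min_r, min(max_r, math.ceil(users / per_replica)))


for users in (12, 80, 81, 400):
    print(users, replicas_for_users(users))
```

Peak traffic of 80 users sits exactly on the two-replica boundary, which is why a small burst above peak is what brings in the third pod, and why off-hours traffic never drops below the two-replica floor.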
In-Practice Observations
First: HPA wired to CPU utilization, which sits near idle while the GPUs saturate, so scale-out never fires. Fix: scale on gpu_cache_usage_perc. Second: NIMCache omitted because the team thought the model pull was a one-time cost. On the first scale-out event, 14 minutes of cold start under load. Fix: NIMCache with ReadWriteMany PVC is non-optional for production. Third: startup probe shortened to 60 seconds because 20 minutes sounded excessive. The pod gets killed at the 60-second mark, the kubelet restarts it, and the cycle repeats while traffic queues. Fix: leave failureThreshold=120 with periodSeconds=10 as the baseline and tune downward only if your storage is demonstrably faster.
The Verdict
The NIM Operator is the production baseline for NIM on Kubernetes. Not because it eliminates operational complexity, but because it wraps the complexity that is genuinely hard to get right (model caching, startup sequencing, HPA wiring) in a declarative interface that integrates cleanly with GitOps workflows. The CRD schema is stable enough to version-control alongside your application manifests.
When I would NOT use the NIM Operator: single-model experiments with a fixed replica count and no autoscaling requirement, or environments where you cannot install CRDs (some managed Kubernetes services with restricted API server access). In those cases, raw Helm with manual PVC management and a hand-written HPA is acceptable, but you are on your own for the startup probe configuration and cache invalidation on model updates.
What to validate before you call a NIM deployment production-ready: (1) NIMCache status shows Ready and the PVC is mounted read-only on all pods. (2) Startup probe failureThreshold is set to allow at least 15 minutes of model loading time. (3) Autoscaler is wired to gpu_cache_usage_perc or num_requests_running, not CPU. (4) The scale-in delay is at least 600 seconds; with KEDA this is the HPA scaleDown stabilization window, since cooldownPeriod only applies to scale-to-zero. (5) You have load-tested the scale-out path under synthetic burst traffic and confirmed new pods reach Ready without being killed mid-load.
For what comes next after you have NIM running and scaling correctly, Part 18 covers TensorRT-LLM optimization and quantization — the layer below NIM where you can squeeze additional throughput out of the same GPU budget before adding hardware.
If you are building this on VMware Cloud Foundation, the Private AI NIM Operator reference architecture at nvidia-nim-operator-reference-architecture covers the VCF-specific storage, networking, and supervisor cluster considerations that are out of scope for this post.
If you have a KV-cache saturation or cold-start problem that is not covered by the patterns above, drop a comment with your model size, storage type, and KEDA configuration. The details matter.



