Deploying and Autoscaling NIM in Production on Kubernetes (NVIDIA AI Series, Part 17)

How to deploy NVIDIA NIM in production using the NIM Operator and Helm, wire autoscaling on the right GPU and KV-cache signals instead of CPU, handle cold-start model load, and run blue-green rollouts without dropping throughput.

by

Dr. Pranay Jha

June 22, 2026

NVIDIA AI Series · Part 17 of 30

TL;DR: The NIM Operator is the right path for production Kubernetes deployments. Use a NIMCache to pre-stage model weights, then scale on gpu_cache_usage_perc or request-queue depth via a Prometheus-backed HPA or KEDA. Never scale on CPU utilization. Cold-start latency during scale-up is the real production risk; shared-storage caching is the fix. A blue-green rollout via duplicated NIMService resources avoids downtime during model swaps. My baseline: 70% KV-cache target, minReplicas=2, a 20-minute startup probe window, and KEDA over native HPA for multi-metric fanout.

Who this is for: Platform engineers and AI infrastructure architects who have NIMs running in a lab and now need to harden them for production traffic. You should already know what a NIM is (covered in Part 16). You need GPU Operator installed, Prometheus running in the cluster, and access to the NVIDIA NGC registry. Familiarity with Kubernetes HPA and PersistentVolumeClaims assumed.

A NIM container cold-starting on an unscheduled node while your SLA clock is ticking is not a theoretical problem. I have watched a Llama 3.1 70B NIM take 14 minutes to pull model weights over the network on scale-up, at which point the HPA had already fired another replica and the queue had blown past the latency budget. The NIM Operator exists to break that failure loop. This post is about how to deploy it correctly, wire the right autoscaling signals, and avoid the three production failure modes I see most often: scaling on the wrong metric, cold-starting without a cache, and zero-downtime rollout attempts that are not actually zero-downtime.

The NIM Operator: What It Actually Manages

The NIM Operator is a standard Kubernetes operator installed via Helm from the NGC catalog (nvidia/k8s-nim-operator). It introduces two CRDs: NIMCache and NIMService. Think of NIMCache as the model artifact manager and NIMService as the deployment lifecycle manager. They are separate objects on purpose: you cache once per model version, then deploy many service instances against that cache without each replica pulling weights independently.

The operator reconcile loop works like this. When you apply a NIMCache, the operator spins up a Job that pulls the model container from NGC and extracts the model weights to a PVC. Once the NIMCache status transitions to Ready, that PVC is the golden copy. Every NIMService pod that references this NIMCache mounts the PVC read-only, bypassing the network pull entirely. This is the architectural reason you should never run NIM in production without a NIMCache, even if it feels like an extra step during initial setup.

NIM Operator reconcile flow: NIMCache populates a shared PVC; NIMService pods mount it read-only; autoscaler drives replica count without re-pulling weights.

Model Profile Selection and the Profile Miss

Each NIMCache can contain multiple optimized model profiles, for example vllm-fp8-tp1-pp1 for a single H100 with FP8 or vllm-fp16-tp2-pp1 for tensor-parallel across two GPUs. You select a profile in the NIMService via spec.storage.nimCache.profile. The silent failure mode here: if the profile string you specify does not match a name in the cached manifest, the pod starts, loads a default profile (often a heavier FP16), and you consume more GPU memory than you planned for, which breaks your packing math. Always run a NIMService dry-run or inspect the NIMCache status to enumerate available profiles before committing to a value in production.
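To make the profile check concrete, here is a minimal Python sketch that refuses a profile string missing from the cached list and derives the GPU count from the name. It assumes profile names follow the engine-precision-tpN-ppN pattern of the two examples above; the helper names and the cached list are hypothetical, and in practice the list would come from the NIMCache status.

```python
import re
from typing import NamedTuple


class Profile(NamedTuple):
    engine: str
    precision: str
    tensor_parallel: int
    pipeline_parallel: int

    @property
    def gpus(self) -> int:
        # One GPU per tensor-parallel rank per pipeline stage.
        return self.tensor_parallel * self.pipeline_parallel


# Assumed naming convention, inferred from names like vllm-fp8-tp1-pp1.
_PATTERN = re.compile(
    r"^(?P<engine>[a-z0-9_]+)-(?P<precision>[a-z0-9]+)-tp(?P<tp>\d+)-pp(?P<pp>\d+)$"
)


def parse_profile(name: str) -> Profile:
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized profile name: {name!r}")
    return Profile(m["engine"], m["precision"], int(m["tp"]), int(m["pp"]))


def check_profile(requested: str, cached: list[str]) -> Profile:
    """Fail loudly instead of letting the pod fall back to a default profile."""
    if requested not in cached:
        raise LookupError(f"{requested!r} is not one of the cached profiles {cached}")
    return parse_profile(requested)


# Hypothetical list; read the real one from the NIMCache status.
cached = ["vllm-fp8-tp1-pp1", "vllm-fp16-tp2-pp1"]
print(check_profile("vllm-fp8-tp1-pp1", cached).gpus)   # 1
print(check_profile("vllm-fp16-tp2-pp1", cached).gpus)  # 2
```

Running a check like this in CI before the manifest is applied turns the silent fallback into a failed pipeline step.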

Helm vs. NIM Operator: Which Path to Take

You have two deployment paths. Raw Helm (the NIM LLM chart from NGC) gives you speed for a single-model proof of concept. The NIM Operator gives you lifecycle management, multi-model support, and integrated autoscaling for anything you intend to run past a sprint demo. The table below makes the trade-off concrete.

Dimension	Raw Helm (NIM chart)	NIM Operator (CRDs)
Setup complexity	Low — one Helm install	Medium — GPU Operator + Operator install first
Model caching	Manual PVC + initContainers	NIMCache CRD handles it natively
Autoscaling wiring	Manual HPA + Prometheus Adapter	HPA spec in NIMService; KEDA alongside
Multi-model clusters	One chart per model, no shared lifecycle	Multiple NIMService objects, shared CRD schema
Rollout strategy	Helm upgrade (rolling; can stall on GPU alloc)	Blue-green via NIMService duplication + traffic switch
Air-gap / private registry	Requires manual image mirroring	NIMCache spec supports private registry overrides
GitOps readiness	HelmRelease in Flux/Argo	CRDs as first-class GitOps objects

My take: Use raw Helm only for a single-model POC or a pipeline that is purely CI-driven and has no expectation of growth. For anything running against real users, the Operator overhead pays back in two sprints. The GitOps story is cleaner, the cache is built-in, and the startup probe configuration alone saves you from the silent failure where a pod is live-but-not-ready and your load balancer is routing cold requests to it.

Autoscaling on the Right Signals

This is where most NIM deployments go wrong. CPU utilization is the wrong signal for an LLM inference workload. The GPU is doing all the compute, and the CPU is largely idle during forward passes; a CPU-based HPA will sit at 5% and never scale out even as your GPUs are saturated. The right signals are inference-specific metrics that NIM exposes on its Prometheus endpoint by default.

The three metrics that matter in production, in order of preference:

gpu_cache_usage_perc: the share of the GPU KV-cache that is occupied, reported as a fraction where 1 means 100% (multiply by 100 in the query if you want thresholds in percent). This is the most direct measure of how saturated the model is. Scale out when it averages above 70%; scale in below 30%.
num_requests_running: the live concurrency count. Useful as a secondary trigger to catch burst traffic before the KV-cache saturates.
time_to_first_token_seconds (P90): latency-oriented scaling. If P90 TTFT exceeds your SLA, you need more capacity regardless of cache state. Wire this only if you have a hard latency contract and can accept the additional replica overhead it implies.

Autoscaling signal path: NIM Prometheus metrics flow through KEDA or Prometheus Adapter to the HPA. CPU-based HPA is the most common misconfiguration in GPU inference clusters.

Gotcha: The Prometheus Adapter approach (native HPA with custom metrics) requires a separate APIService registration and tends to be brittle on cert rotation. KEDA is operationally simpler: deploy the KEDA controller, write a ScaledObject pointing at your Prometheus source, and KEDA manages the HPA object for you. You still get standard HPA behavior (stabilization windows, min/max replica bounds), but without the Adapter plumbing. The trade-off is an additional controller in your cluster; it is worth it for anything running more than two NIM models.
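The reason CPU is a dead signal falls out of the ratio rule the Horizontal Pod Autoscaler documents: desired replicas are the current count scaled by metric over target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule to a per-replica averaged metric; the 60% CPU target is an assumed value for illustration, and min/max clamping is left out.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     tolerance: float = 0.1) -> int:
    """HPA ratio rule: ceil(current * metric / target), unchanged inside the tolerance band."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


# CPU-based: 5% utilization against an assumed 60% target never asks for more pods.
print(desired_replicas(current=2, metric=5, target=60))    # 1 (the HPA then clamps to minReplicas)

# KV-cache-based: 85% average occupancy against the 70% target.
print(desired_replicas(current=2, metric=85, target=70))   # 3
```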

Cold-Start: The Scaling Tax Nobody Budgets For

Every new NIM pod must load the model into GPU VRAM before it can serve requests. For a 70B parameter model at FP8 (one byte per parameter), that is roughly 70 GB going from storage to GPU memory. Over NFS or a standard ReadWriteOnce PVC, that can run 8 to 14 minutes. The NIM Operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and periodSeconds of 10, giving new pods up to 20 minutes (120 × 10 s = 1,200 s) before the kubelet kills them. Do not shorten this window. A probe timeout that kills a loading pod restarts the load from scratch and can put you in a crash loop during a traffic surge.
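A quick way to sanity-check a probe window is to compare it against the time the weights need to arrive. The Python sketch below does that arithmetic; the two storage throughput figures are hypothetical placeholders, so substitute rates measured on your own PVC.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes); ignores runtime overhead."""
    return params_billion * bytes_per_param


def load_minutes(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to stream the weights from storage at a sustained rate."""
    return size_gb / throughput_gb_per_s / 60


def probe_budget_minutes(failure_threshold: int, period_seconds: int) -> float:
    """Longest startup the probe tolerates before the kubelet restarts the pod."""
    return failure_threshold * period_seconds / 60


size = weights_gb(70, 1.0)              # 70B parameters at FP8, one byte each
budget = probe_budget_minutes(120, 10)  # 20 minutes

# Hypothetical sustained read rates in GB/s; measure your own storage.
for label, rate in [("slow NFS", 0.1), ("parallel file system", 2.0)]:
    minutes = load_minutes(size, rate)
    print(f"{label}: {minutes:.1f} min to load, fits probe budget: {minutes < budget}")
```

If the slowest realistic load time is not comfortably inside the budget, fix the storage path first rather than stretching the probe further.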

The mitigation stack, in order of effectiveness: (1) Use a NIMCache with a high-throughput PVC, ideally a parallel file system such as GPFS/Weka that can stream weights to each node at multiple GB/s. (2) Set minReplicas: 2 so you always have warm capacity. (3) Set KEDA or HPA stabilization windows: scale-out should respond fast (30s), scale-in should be slow (10 to 15 minutes). Scaling in before a new pod is warm is another common way to create a capacity hole.
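The asymmetry in point (3) is easier to see with a toy model of HPA scale-down stabilization, which acts on the highest recommendation seen within the trailing window. This is a simplified sketch: it counts the window in samples rather than seconds and ignores scale-up policies.

```python
from collections import deque
from typing import Iterable, Iterator


def stabilized_scale_down(recommendations: Iterable[int],
                          window_samples: int) -> Iterator[int]:
    """Yield the replica count the HPA would act on: the max over the trailing window."""
    recent = deque(maxlen=max(1, window_samples))
    for rec in recommendations:
        recent.append(rec)
        yield max(recent)


# Raw recommendations during a short burst (one sample per evaluation).
raw = [2, 4, 4, 2, 2, 2, 2, 2]

print(list(stabilized_scale_down(raw, window_samples=1)))  # [2, 4, 4, 2, 2, 2, 2, 2]
print(list(stabilized_scale_down(raw, window_samples=4)))  # [2, 4, 4, 4, 4, 4, 2, 2]
```

Scale-out still happens on the first high sample in both runs; only the drop back to two replicas is delayed, which is the behavior you want while a new pod is still loading weights.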

The Production Artifact: NIMService + KEDA ScaledObject

Below is a production-ready NIMService manifest paired with a KEDA ScaledObject keyed on gpu_cache_usage_perc. Read the inline comments carefully; the failure modes are annotated where they occur.

---
# NIMCache: run this FIRST and wait for status.state == Ready
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-1-70b-cache
  namespace: nim-prod
spec:
  model:
    engine: vllm
    repoName: meta/llama-3.1-70b-instruct
    # VERIFY: check NGC for latest tag before pinning
    tag: "1.3.0"
    precision: fp8
    # Select the single-GPU FP8 profile explicitly.
    # If this string mismatches the manifest, the pod
    # falls back to a heavier default -- check NIMCache status.
    profiles:
      - vllm-fp8-tp1-pp1
  storage:
    pvc:
      create: true
      storageClass: "fast-parallel-fs"
      size: "250Gi"
      accessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
---
# NIMService: deploy AFTER NIMCache status is Ready
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-70b
  namespace: nim-prod
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-70b-instruct
    # VERIFY: pin to same tag as NIMCache
    tag: "1.3.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - name: ngc-registry-secret
  storage:
    nimCache:
      name: llama-3-1-70b-cache
      profile: vllm-fp8-tp1-pp1
  replicas: 2          # minReplicas floor; KEDA will override
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: "80Gi"
  readinessProbe:
    httpGet:
      path: /v1/health/ready
      port: 8000
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 5
  # Startup probe: give the model up to 20 min to load into VRAM.
  # Do NOT shorten failureThreshold -- a killed loading pod
  # restarts the clock and creates a capacity hole under load.
  startupProbe:
    httpGet:
      path: /v1/health/ready
      port: 8000
    failureThreshold: 120
    periodSeconds: 10
  expose:
    service:
      type: ClusterIP
      port: 8000
  metrics:
    enabled: true        # enables /metrics on the same port
---
# KEDA ScaledObject: scale on gpu_cache_usage_perc
# WRONG signal: do NOT use cpu or memory here
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-3-1-70b-scaler
  namespace: nim-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    # NIMService creates a Deployment with this name
    name: llama-3-1-70b
  minReplicaCount: 2
  maxReplicaCount: 8
  pollingInterval: 30
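  # cooldownPeriod only governs scale-to-zero, so with minReplicaCount: 2
  # it never fires. The HPA scaleDown window below is what delays scale-in.
  cooldownPeriod: 600
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600   # 10 min before removing replicas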
  triggers:
    - type: prometheus
      # The query already averages per pod, so compare it to the threshold
      # directly (Value) instead of dividing by replica count again (AverageValue).
      metricType: Value
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: nim_gpu_cache_usage_perc
        # Target 70% KV-cache utilization per replica.
        # Above this threshold KEDA fires a scale-out event.
        threshold: "70"
        query: >
          avg(gpu_cache_usage_perc{namespace="nim-prod",
          pod=~"llama-3-1-70b.*"}) * 100
    - type: prometheus
      metricType: Value    # same reasoning: the query is a per-replica average
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: nim_requests_running
        # Secondary trigger: scale if avg concurrency exceeds 40 req/replica.
        threshold: "40"
        query: >
          avg(num_requests_running{namespace="nim-prod",
          pod=~"llama-3-1-70b.*"})

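Expected behavior: at steady state, two pods serve traffic. When gpu_cache_usage_perc averages above 70%, KEDA picks it up on the next 30-second poll and requests a scale-out; the new pod mounts the NIMCache PVC and begins warming; the startup probe keeps it out of rotation until ready. Scale-in should lag a traffic drop by about 10 minutes so warm pods are not terminated prematurely. One caveat: KEDA's cooldownPeriod applies only when scaling to zero, so with a minReplicaCount of 2 the scale-in delay actually comes from the HPA scaleDown stabilization window (300 seconds by default), which you set through advanced.horizontalPodAutoscalerConfig.behavior in the ScaledObject. Common failure mode: a scale-in window that is too short (under 300s) removes capacity while a replacement pod is still loading, and the next traffic spike finds fewer warm replicas than expected.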

Request Queue, Batching, and Why It Shapes Your Scaling Thresholds

NIM LLMs run continuous batching by default through vLLM. Requests arriving faster than the decode rate are queued in memory, not rejected. This is good for throughput but means request-level counters can spike before gpu_cache_usage_perc catches up, because the queued requests have not yet been scheduled into the KV-cache. Requests still waiting for a batch slot are reported by num_requests_waiting rather than num_requests_running, so add that metric if you want a pure queue-depth signal. Setting both triggers in the ScaledObject above handles the common case: the concurrency trigger fires fast, the cache-usage trigger fires on sustained load.

What breaks the batching model: very long prompts. A single 100K-token context request can occupy 60 to 80% of a GPU KV-cache by itself, triggering an autoscale event for what is effectively one user. If your workload mixes long-context and short-context requests, consider per-deployment request limits or a separate NIMService instance sized for long-context traffic with a higher cache threshold target (85 to 90%).
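The size of that single request can be estimated from the model architecture. The sketch below uses the published Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and assumes a 16-bit KV-cache; the 48 GB cache pool is a hypothetical figure chosen only to illustrate the ratio, since the real pool depends on your GPU count, profile, and memory settings.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Llama 3.1 70B: 80 layers, 8 KV heads, head dimension 128; 2 bytes per value (FP16/BF16).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
request_gb = per_token * 100_000 / 1e9   # one 100K-token context

pool_gb = 48.0  # hypothetical KV-cache pool size; replace with your own measurement

print(per_token)                                         # 327680 bytes per token
print(f"{request_gb:.1f} GB for one request")            # 32.8 GB
print(f"{request_gb / pool_gb:.0%} of a {pool_gb:.0f} GB pool")
```

An 8-bit KV-cache halves these numbers, which is one reason the right cache threshold differs between long-context and short-context deployments.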

Queue depth spikes before KV-cache saturates. Wiring both triggers in KEDA catches burst traffic early and sustained load late.

Rollout Strategy: Blue-Green Over Rolling Update

A standard Kubernetes rolling update works reasonably well for stateless applications. For NIM, it creates a window where you have pods running two different model versions simultaneously, which is fine for patch releases but problematic for model version changes where output distribution differences matter. It also stalls if GPU capacity is tight, because the new pod cannot schedule until an old one is terminated, and terminating an old one under load drops throughput.

The pattern I use: deploy a second NIMService (llama-3-1-70b-green) referencing the new NIMCache profile, run both behind an NGINX Ingress or a Kubernetes Gateway API HTTPRoute with weighted traffic splitting, shift traffic 10% at a time while watching TTFT and error rate, then delete the blue NIMService. The GPU node capacity requirement doubles during the transition, but the rollout is controlled and reversible at any split percentage.
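The shift-and-watch loop can be sketched as a small gate in Python. Everything named here is hypothetical: observe stands in for whatever queries Prometheus for green's P90 TTFT and error rate at a given split, and the actual weight change would be applied to your Ingress or HTTPRoute by separate tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Health:
    p90_ttft_ms: float
    error_rate: float


def shift_plan(step: int = 10) -> Iterator[tuple[int, int]]:
    """Yield (blue_weight, green_weight) pairs from the first step to full cutover."""
    for green in range(step, 101, step):
        yield 100 - green, green


def rollout(observe: Callable[[int], Health], max_ttft_ms: float,
            max_error_rate: float, step: int = 10) -> tuple[int, int]:
    """Advance the split while green stays healthy; return the final weights."""
    for _blue, green in shift_plan(step):
        health = observe(green)
        if health.p90_ttft_ms > max_ttft_ms or health.error_rate > max_error_rate:
            return 100, 0   # roll back: all traffic to blue
    return 0, 100           # green healthy at every step: cut over


# Hypothetical observer: green degrades once it carries 30% of traffic.
def degrading(green_weight: int) -> Health:
    return Health(p90_ttft_ms=900 if green_weight >= 30 else 650, error_rate=0.0)


print(rollout(degrading, max_ttft_ms=800, max_error_rate=0.01))  # (100, 0)
```

Keeping blue deployed until the loop returns (0, 100) is what makes the rollback a weight change rather than a redeploy.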

Blue-green NIM rollout: run two NIMService objects with weighted traffic. Validate green at 10%, shift incrementally, delete blue. Requires double GPU capacity during the transition window.

Multi-Model Packing: Sharing GPU Nodes Across NIMServices

If you run more than one NIM model on the same Kubernetes cluster, GPU node packing becomes a scheduling problem. Each NIMService pod requests a whole number of GPUs via nvidia.com/gpu limits. Kubernetes does not split a GPU device between pods unless you use MIG or time-slicing (covered in the NVIDIA AI Guide). For NIM, the practical packing rule is: one model per GPU device for production LLMs at full precision. Smaller models (8B at FP8) can share a GPU with MIG partitioning, but this is only worth the operational complexity if you have high GPU cost pressure and predictably uneven traffic across models.
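For capacity planning, a first-fit-decreasing estimate gives a rough node count for a set of whole-GPU pod requests. This is a planning sketch only, not how the Kubernetes scheduler places pods, and the pod names and 4-GPU node size are hypothetical.

```python
def pack(requests: dict[str, int], gpus_per_node: int) -> list[dict[str, int]]:
    """First-fit decreasing: place each pod on the first node with enough free GPUs."""
    nodes: list[dict[str, int]] = []
    free: list[int] = []
    # Largest requests first so multi-GPU pods are not stranded by fragmentation.
    for name, gpus in sorted(requests.items(), key=lambda item: -item[1]):
        if gpus > gpus_per_node:
            raise ValueError(f"{name} needs {gpus} GPUs; a pod cannot span nodes")
        for i, room in enumerate(free):
            if room >= gpus:
                nodes[i][name] = gpus
                free[i] -= gpus
                break
        else:
            # No existing node has room: open a new one.
            nodes.append({name: gpus})
            free.append(gpus_per_node - gpus)
    return nodes


# Hypothetical pods: name -> nvidia.com/gpu limit.
pods = {"chat-70b-0": 1, "chat-70b-1": 1, "rag-70b-tp2-0": 2, "code-8b-0": 1}
layout = pack(pods, gpus_per_node=4)
print(len(layout), layout)
```

Running the same estimate with every pod duplicated shows the headroom a blue-green transition needs.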

For the VCF-side deployment of NIM Operator with a Private Cloud AI reference architecture, see the Private AI NIM Operator post at nvidia-nim-operator-reference-architecture, which covers the VMware Cloud Foundation lens in detail. Here the focus stays on the NVIDIA stack mechanics.

Autoscaling Thresholds by Request Profile

Workload Profile	KV-Cache Target	Concurrency Trigger	Scale-in Cooldown	Notes
Short-context chat (avg 512 tokens)	70%	40 req/replica	600s	Fast decode; cache turns over quickly
Long-context RAG (avg 8K tokens)	85%	10 req/replica	900s	Single request saturates; higher cache target ok
Batch summarization (offline)	90%	None (cache only)	300s	Throughput over latency; fill the cache
Mixed (chat + RAG)	70%	20 req/replica	900s	Consider two separate NIMService deployments
Real-time coding assistant (<200 tokens)	60%	60 req/replica	480s	TTFT latency critical; scale earlier

Worked example

Target: short-context enterprise chat assistant, Llama 3.1 70B at FP8, H100 80GB nodes, SLA of 800ms P90 TTFT.

Starting config: 2 replicas (minReplicas=2), 1 H100 per pod, KV-cache target 70%, concurrency trigger 40 req/replica, cooldown 600s.

Traffic pattern: 80 concurrent users at peak, 12 during off-hours.

Scale-out: 80 users across 2 replicas is ~40 req/replica, right at the concurrency threshold, so any further burst makes KEDA fire. New pod mounts NIMCache PVC (shared NFS-over-RDMA, weight load ~4 minutes). Startup probe holds the pod out of rotation during load. Effective capacity gap: 4 minutes. Mitigation: minReplicas=2 absorbs the first burst; the third pod is warm within the typical burst duration.

Observed throughput: 2 replicas at 70B FP8 on H100 deliver approximately 1,800 to 2,200 tokens/second aggregate decode at 40 concurrent users, with P90 TTFT around 650ms [VERIFY: benchmark depends on prompt length distribution and vLLM version].

Scale-in: Off-hours at 12 users, KV-cache drops to 15%. The autoscaler waits out the 600s scale-in window, then returns to minReplicas=2. Never goes below 2 warm pods regardless of traffic.
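The replica counts in this example follow from simple concurrency sizing, sketched below in Python with the same numbers (40 requests per replica, a floor of 2, a ceiling of 8). The 100 and 400 user rows are extra illustrative inputs, not measurements.

```python
import math


def replicas_for(concurrent_users: int, per_replica: int,
                 min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep average concurrency at or below the per-replica target."""
    wanted = math.ceil(concurrent_users / per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for users in (12, 80, 100, 400):
    print(users, replicas_for(users, per_replica=40, min_replicas=2, max_replicas=8))
# 12 -> 2 (floor), 80 -> 2, 100 -> 3, 400 -> 8 (ceiling)
```

The 400-user row is the one to watch: once the ceiling clamps the result, queue depth and TTFT absorb the overflow instead of new replicas.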

In-Practice Observations

In practice: these are the three failures I see most in production NIM deployments.

First: autoscaling is configured on CPU utilization because it was the default and nobody changed it. The cluster looks fine, GPUs are at 95%, and the HPA is reporting 4% CPU and doing nothing. Fix: delete the CPU-based HPA, deploy the Prometheus Adapter or KEDA, and wire gpu_cache_usage_perc.

Second: NIMCache is omitted because the team thought the model pull was a one-time cost. The first scale-out event then means 14 minutes of cold start under load. Fix: NIMCache with a ReadWriteMany PVC is non-optional for production.

Third: the startup probe is shortened to 60 seconds because 20 minutes sounded excessive. The pod gets killed at the 60-second mark, the kubelet restarts it, and the cycle repeats while traffic queues. Fix: leave failureThreshold=120 with periodSeconds=10 as the baseline and tune downward only if your storage is demonstrably faster.

The Verdict

The NIM Operator is the production baseline for NIM on Kubernetes. Not because it eliminates operational complexity, but because it wraps the complexity that is genuinely hard to get right (model caching, startup sequencing, HPA wiring) in a declarative interface that integrates cleanly with GitOps workflows. The CRD schema is stable enough to version-control alongside your application manifests.

When I would NOT use the NIM Operator: single-model experiments with a fixed replica count and no autoscaling requirement, or environments where you cannot install CRDs (some managed Kubernetes services with restricted API server access). In those cases, raw Helm with manual PVC management and a hand-written HPA is acceptable, but you are on your own for the startup probe configuration and cache invalidation on model updates.

What to validate before you call a NIM deployment production-ready: (1) NIMCache status shows Ready and the PVC is mounted read-only on all pods. (2) Startup probe failureThreshold is set to allow at least 15 minutes of model loading time. (3) Autoscaler is wired to gpu_cache_usage_perc or num_requests_running, not CPU. (4) The scale-in delay is at least 600 seconds; with KEDA and a minReplicaCount above zero, that is the HPA scaleDown stabilization window, not cooldownPeriod. (5) You have load-tested the scale-out path under synthetic burst traffic and confirmed new pods reach Ready without being killed mid-load.

For what comes next after you have NIM running and scaling correctly, Part 18 covers TensorRT-LLM optimization and quantization — the layer below NIM where you can squeeze additional throughput out of the same GPU budget before adding hardware.

If you are building this on VMware Cloud Foundation, the Private AI NIM Operator reference architecture at nvidia-nim-operator-reference-architecture covers the VCF-specific storage, networking, and supervisor cluster considerations that are out of scope for this post.

If you have a KV-cache saturation or cold-start problem that is not covered by the patterns above, drop a comment with your model size, storage type, and KEDA configuration. The details matter.

Disclaimer: The manifests and configuration values in this post are reference baselines derived from NVIDIA documentation and field experience. Test all configurations in a non-production environment before applying to live inference clusters. Verify NIM container tags, NIMCache profile strings, and KEDA trigger metric names against current NVIDIA documentation before deployment, as these can change across operator versions. Items marked [VERIFY] require confirmation against your specific NIM version and hardware configuration.

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
