TL;DR · Key Takeaways
- NVIDIA NIM is the model-serving layer of Private AI: prebuilt inference microservices that expose an OpenAI-compatible endpoint, running as pods on a VKS cluster on top of your GPU workload domain.
- The NIM Operator (v3.1.0) manages the lifecycle through four custom resources: NIMCache, NIMService, NIMPipeline and NIM Build. Get the NIMCache right and most bring-up pain disappears.
- Pre-cache every model to a shared ReadWriteMany PVC before you create the NIMService, or you will hit 20 minute pod startups and a fresh download on every restart.
- GPU placement is the design decision that bites: a multi-GPU model needs NVLink-connected A100 or H100 GPUs, and L40S with vGPU cannot run multi-GPU models at all.
- For a single production model, pin an LLM-specific container to one TensorRT-LLM profile. Keep Multi-LLM NIM for dev and consolidation, not latency-sensitive production.
Most Private AI deployments stall in the same place. The GPUs are installed, the workload domain is healthy, the GPU Operator reports every driver green, and then someone tries to actually serve a model and burns three days on pod restarts, half-downloaded weights, and an endpoint that returns 503 under the lightest load. The platform was fine. The serving layer was the afterthought. NVIDIA NIM is that serving layer, and on Private AI it is run by the NIM Operator. This part is about designing it on purpose instead of discovering its sharp edges in production.
If the platform underneath is not built yet, start with preparing the GPU workload domain and installing the GPU Operator and vGPU drivers. Everything here assumes that layer already exists and focuses on what sits on top of it.
Where NIM Sits in the Private AI Stack
A NIM (NVIDIA Inference Microservice) is a container, part of NVIDIA AI Enterprise, that bundles an optimized inference engine with a model and exposes a standard REST and gRPC API. The point of NIM is that you do not hand-build a serving stack. You pull a container, give it a GPU, and you get an OpenAI-compatible endpoint that a RAG pipeline or Agent Builder can call without caring what runs underneath.
On Private AI, NIMs do not run as standalone vSphere VMs. They run as Kubernetes pods inside a VKS (vSphere Kubernetes Service) cluster, and the VKS worker VMs attach to physical GPUs on the ESXi hosts through vGPU or passthrough. VCF Private AI Services sits a layer above and consumes the endpoints, but the NIM is the actual engine. Understanding that boundary matters: when an endpoint is slow or unstable, the fix is almost always in the NIM and its GPU placement, not in the services consuming it.
The NIM Operator and Its Four Custom Resources
You can run a NIM as a raw Kubernetes Deployment. Do not. On a platform you intend to operate, use the NIM Operator, which is the Kubernetes controller that manages the NIM lifecycle: caching, deployment, health probes, service exposure, scaling, and GPU scheduling. As of v3.1.0 it exposes four custom resources, and the design of your serving layer is really the design of how you use them.
- NIMCache pulls model artifacts once and stores them on a persistent volume, so the weights are downloaded during cache population and reused across pod restarts and scale events. Sources can be NVIDIA NGC, the NeMo Data Store, or the Hugging Face Hub.
- NIMService deploys and runs the inference pod: image, GPU resources, startup and readiness probes, the Kubernetes Service, and autoscaling.
- NIMPipeline groups related services (for example an embedding NIM plus a reranking NIM) so they are managed as one unit.
- NIM Build, added in v3.0.0, builds and caches TensorRT-LLM engines from buildable profiles when you want an engine optimized for your exact GPU rather than a generic prebuilt one.
The order is the whole game. Create the NIMCache first and let it finish populating, then create the NIMService that references it. Skip the cache and the NIMService still works, but it downloads weights on first start, which is why the operator sets a 20 minute startup probe timeout for un-cached pods. That timeout hides the problem rather than solving it: every restart, every scale-up, and every reschedule pays the download cost again.
A minimal pair of manifests for an 8B model looks like this. The NIMCache pulls and pins a TensorRT-LLM profile to a shared volume, and the NIMService consumes it with autoscaling enabled.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
name: llama-3-1-8b-instruct
spec:
source:
ngc:
modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
pullSecret: ngc-secret
authSecret: ngc-api-secret
model:
engine: tensorrt_llm
tensorParallelism: "1"
storage:
pvc:
create: true
storageClass: vsan-default
size: "50Gi"
volumeAccessMode: ReadWriteMany
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
name: llama-3-1-8b-instruct
spec:
image:
repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
tag: latest
pullSecrets:
- ngc-secret
authSecret: ngc-api-secret
storage:
nimCache:
name: llama-3-1-8b-instruct
resources:
limits:
nvidia.com/gpu: 1
scale:
enabled: true
hpa:
minReplicas: 1
maxReplicas: 4
expose:
service:
type: ClusterIP
port: 8000
One subtlety that catches people: when scale.enabled is true, you cannot also set spec.replicas. The HPA owns the replica count through minReplicas and maxReplicas. And if you are on v3.1.0, note that spec.expose.ingress is deprecated in favor of spec.expose.router, which now supports the Kubernetes Gateway API alongside traditional ingress.
GPU Placement: The Sizing That Actually Matters
The hardest part of designing a NIM endpoint is not the YAML. It is deciding how the model maps onto GPUs, because that decision is constrained by the GPU you chose back in the vGPU, MIG and passthrough discussion and the topology you set in the reference architecture. Three patterns cover almost every case.
Here is the design trap that costs people a week. A multi-GPU model needs the GPUs to talk to each other at full bandwidth, which means NVLink. With vGPU software on vSphere, a model that needs more than one GPU requires A100 or H100 GPUs connected by NVLink or an NVLink Switch. L40S GPUs with vGPU software do not support multi-GPU models at all. If you sized a fleet of L40S boxes expecting to run a 70B model across two cards per VM, that plan does not work with vGPU. Your options become passthrough, NVLink-equipped H100 hosts, or a smaller model. Validate this against the model you actually intend to serve before you buy hardware.
The NIM Operator 3.x line also added Kubernetes Dynamic Resource Allocation (DRA) for GPU allocation, using the NVIDIA DRA driver (v25.12.0 in the v3.1.0 release, on the resource.k8s.io/v1 API). DRA gives you device classes and resource-based requests instead of the blunt nvidia.com/gpu: 1 count. It is the better long-term model, but treat it as the newer path: validate it on your VKS and driver versions before standardizing on it.
| Model size | GPU and profile | Engine | Cache PVC | Placement note |
|---|---|---|---|---|
| Up to 8B | 1x L40S 48GB or 1 vGPU profile | TensorRT-LLM | ~50 GB | Single GPU, single pod |
| 13B to 34B | 1x H100 80GB | TensorRT-LLM or vLLM | ~80 to 120 GB | Single host; fits one large card |
| 70B | 2 to 4x H100 NVLink | TensorRT-LLM, TP 2 to 4 | ~150 GB | NVLink required; not L40S+vGPU |
| 100B and above | 8x H100 or H200 across nodes | vLLM or TensorRT-LLM | 400 GB and up | Multi-node LeaderWorkerSets |
These are planning starting points, not guarantees. Cache sizes vary with quantization and the number of profiles you keep, and the GPU memory you actually need depends on context length and batch size. Size the PVC generously: running out of cache space mid-download is a slow, confusing failure.
Container Type and Engine: Choosing by Design
NIM Operator 3.0.0 added the Multi-LLM NIM container, which serves a range of models from one container and can pull them from NGC, the NeMo Data Store, or Hugging Face. It is genuinely useful, and it is also over-applied. The newer feature is not automatically the right default.
My take: for a single model serving production traffic, use the LLM-specific container with one pinned TensorRT-LLM profile. You get predictable memory, predictable latency, and an artifact you can pre-cache cleanly. Reach for Multi-LLM NIM when you are consolidating a dev environment, A/B testing models, or running a long tail of low-traffic models where the operational savings of one container outweigh the per-model tuning you give up. On engines, TensorRT-LLM is the right call when the hardware is fixed and latency is the priority; vLLM and SGLang earn their place when you need a model that does not have a prebuilt TensorRT-LLM profile yet, or when you value flexibility over the last few percent of throughput.
What Breaks in the Field
A handful of failures account for most NIM support cases. None of them are exotic, and all of them are avoidable with a little design.
- Endless pod startups. An un-cached NIMService downloads weights on first start and leans on a 20 minute startup probe. Pre-cache with NIMCache and the pod starts in seconds. This is the single highest-value habit on the platform.
- Caching fails on big models. For models around 49 billion parameters and larger, the cache job can hit a
Too many open fileserror during download. The fix is to raise the open-file limit on the container runtime of the VKS worker. - Multi-GPU on the wrong card. L40S with vGPU cannot run a multi-GPU model. You need NVLink-connected A100 or H100, or passthrough.
- Image pull failures. The NGC pull secret and the NGC API key are two separate secrets (
ngc-secretfor the registry,ngc-api-secretfor model access). Miss one and you get a pull back-off that looks like a network problem.
The open-files fix, applied to the containerd runtime on the VKS worker node, looks like this:
sudo mkdir -p /etc/systemd/system/containerd.service.d
echo "[Service]" | sudo tee /etc/systemd/system/containerd.service.d/override.conf
echo "LimitNOFILE=65536" | sudo tee -a /etc/systemd/system/containerd.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart containerd
sudo systemctl restart kubelet
What I’d Do
For the large majority of Private AI deployments, the right design is boring and reliable: serve models through the NIM Operator on a VKS cluster, always populate a NIMCache on shared ReadWriteMany storage before deploying the NIMService, pin an LLM-specific container to a single TensorRT-LLM profile, and let the HPA handle load between sensible min and max replicas. Choose your GPU placement from the smallest pattern the model fits, and validate the NVLink requirement against your actual hardware before you commit. Save Multi-LLM NIM, DRA, and multi-node serving for the cases that genuinely need them, after you have proven them on your versions. The teams that struggle with Private AI are almost never the ones who got the model wrong. They are the ones who treated serving as a deploy step instead of a design.
What is the first model you plan to put behind a NIM endpoint, and does your current GPU layout actually support it? That single question surfaces most of the design work this part is about.
References
- NVIDIA NIM Operator documentation
- NVIDIA NIM Operator release notes (v3.1.0)
- Deploying VCF Private AI Services: Supervisor architectures with and without NSX
- VMware Private AI Foundation with NVIDIA 9.1 (Broadcom TechDocs)
« Previous: Part 10 | VMware Private AI Complete Guide | Next: Part 12 »



