NVIDIA NIM Microservices on VMware Private AI: The Model-Serving Layer Explained (Private AI Series, Part 11)

NVIDIA NIM is the model-serving layer of VMware Private AI. A reference-architecture look at the NIM Operator, NIMCache and NIMService, GPU placement, and the design choices that decide whether your endpoints survive production.

by

Dr. Pranay Jha

June 15, 2026

No comments

12 minutes

Read Time

VMware Private AI Series · Part 11 of 24

TL;DR · Key Takeaways

NVIDIA NIM is the model-serving layer of Private AI: prebuilt inference microservices that expose an OpenAI-compatible endpoint, running as pods on a VKS cluster on top of your GPU workload domain.
The NIM Operator (v3.1.0) manages the lifecycle through four custom resources: NIMCache, NIMService, NIMPipeline and NIM Build. Get the NIMCache right and most bring-up pain disappears.
Pre-cache every model to a shared ReadWriteMany PVC before you create the NIMService, or you will hit 20 minute pod startups and a fresh download on every restart.
GPU placement is the design decision that bites: a multi-GPU model needs NVLink-connected A100 or H100 GPUs, and L40S with vGPU cannot run multi-GPU models at all.
For a single production model, pin an LLM-specific container to one TensorRT-LLM profile. Keep Multi-LLM NIM for dev and consolidation, not latency-sensitive production.

Who this is for: VCF architects and platform teams designing the inference layer of VMware Private AI Foundation. Prerequisites: a GPU-enabled VI workload domain, the NVIDIA GPU Operator running on a VKS cluster, and an NGC API key with an NVIDIA AI Enterprise entitlement.

Most Private AI deployments stall in the same place. The GPUs are installed, the workload domain is healthy, the GPU Operator reports every driver green, and then someone tries to actually serve a model and burns three days on pod restarts, half-downloaded weights, and an endpoint that returns 503 under the lightest load. The platform was fine. The serving layer was the afterthought. NVIDIA NIM is that serving layer, and on Private AI it is run by the NIM Operator. This part is about designing it on purpose instead of discovering its sharp edges in production.

If the platform underneath is not built yet, start with preparing the GPU workload domain and installing the GPU Operator and vGPU drivers. Everything here assumes that layer already exists and focuses on what sits on top of it.

Where NIM Sits in the Private AI Stack

A NIM (NVIDIA Inference Microservice) is a container, part of NVIDIA AI Enterprise, that bundles an optimized inference engine with a model and exposes a standard REST and gRPC API. The point of NIM is that you do not hand-build a serving stack. You pull a container, give it a GPU, and you get an OpenAI-compatible endpoint that a RAG pipeline or Agent Builder can call without caring what runs underneath.

On Private AI, NIMs do not run as standalone vSphere VMs. They run as Kubernetes pods inside a VKS (vSphere Kubernetes Service) cluster, and the VKS worker VMs attach to physical GPUs on the ESXi hosts through vGPU or passthrough. VCF Private AI Services sits a layer above and consumes the endpoints, but the NIM is the actual engine. Understanding that boundary matters: when an endpoint is slow or unstable, the fix is almost always in the NIM and its GPU placement, not in the services consuming it.

The serving stack on Private AI. The two red-outlined layers are what the NIM Operator owns.

The NIM Operator and Its Four Custom Resources

You can run a NIM as a raw Kubernetes Deployment. Do not. On a platform you intend to operate, use the NIM Operator, which is the Kubernetes controller that manages the NIM lifecycle: caching, deployment, health probes, service exposure, scaling, and GPU scheduling. As of v3.1.0 it exposes four custom resources, and the design of your serving layer is really the design of how you use them.

NIMCache pulls model artifacts once and stores them on a persistent volume, so the weights are downloaded during cache population and reused across pod restarts and scale events. Sources can be NVIDIA NGC, the NeMo Data Store, or the Hugging Face Hub.
NIMService deploys and runs the inference pod: image, GPU resources, startup and readiness probes, the Kubernetes Service, and autoscaling.
NIMPipeline groups related services (for example an embedding NIM plus a reranking NIM) so they are managed as one unit.
NIM Build, added in v3.0.0, builds and caches TensorRT-LLM engines from buildable profiles when you want an engine optimized for your exact GPU rather than a generic prebuilt one.

The order is the whole game. Create the NIMCache first and let it finish populating, then create the NIMService that references it. Skip the cache and the NIMService still works, but it downloads weights on first start, which is why the operator sets a 20 minute startup probe timeout for un-cached pods. That timeout hides the problem rather than solving it: every restart, every scale-up, and every reschedule pays the download cost again.

Cache first, serve second. The arrow direction is also the order you should apply the manifests.

A minimal pair of manifests for an 8B model looks like this. The NIMCache pulls and pins a TensorRT-LLM profile to a shared volume, and the NIMService consumes it with autoscaling enabled.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-1-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: vsan-default
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: latest
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: llama-3-1-8b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
  expose:
    service:
      type: ClusterIP
      port: 8000

One subtlety that catches people: when scale.enabled is true, you cannot also set spec.replicas. The HPA owns the replica count through minReplicas and maxReplicas. And if you are on v3.1.0, note that spec.expose.ingress is deprecated in favor of spec.expose.router, which now supports the Kubernetes Gateway API alongside traditional ingress.

GPU Placement: The Sizing That Actually Matters

The hardest part of designing a NIM endpoint is not the YAML. It is deciding how the model maps onto GPUs, because that decision is constrained by the GPU you chose back in the vGPU, MIG and passthrough discussion and the topology you set in the reference architecture. Three patterns cover almost every case.

Pick the smallest pattern the model fits in. Each step right adds cost and operational complexity.

Here is the design trap that costs people a week. A multi-GPU model needs the GPUs to talk to each other at full bandwidth, which means NVLink. With vGPU software on vSphere, a model that needs more than one GPU requires A100 or H100 GPUs connected by NVLink or an NVLink Switch. L40S GPUs with vGPU software do not support multi-GPU models at all. If you sized a fleet of L40S boxes expecting to run a 70B model across two cards per VM, that plan does not work with vGPU. Your options become passthrough, NVLink-equipped H100 hosts, or a smaller model. Validate this against the model you actually intend to serve before you buy hardware.

The NIM Operator 3.x line also added Kubernetes Dynamic Resource Allocation (DRA) for GPU allocation, using the NVIDIA DRA driver (v25.12.0 in the v3.1.0 release, on the resource.k8s.io/v1 API). DRA gives you device classes and resource-based requests instead of the blunt nvidia.com/gpu: 1 count. It is the better long-term model, but treat it as the newer path: validate it on your VKS and driver versions before standardizing on it.

Model size	GPU and profile	Engine	Cache PVC	Placement note
Up to 8B	1x L40S 48GB or 1 vGPU profile	TensorRT-LLM	~50 GB	Single GPU, single pod
13B to 34B	1x H100 80GB	TensorRT-LLM or vLLM	~80 to 120 GB	Single host; fits one large card
70B	2 to 4x H100 NVLink	TensorRT-LLM, TP 2 to 4	~150 GB	NVLink required; not L40S+vGPU
100B and above	8x H100 or H200 across nodes	vLLM or TensorRT-LLM	400 GB and up	Multi-node LeaderWorkerSets

These are planning starting points, not guarantees. Cache sizes vary with quantization and the number of profiles you keep, and the GPU memory you actually need depends on context length and batch size. Size the PVC generously: running out of cache space mid-download is a slow, confusing failure.

Container Type and Engine: Choosing by Design

NIM Operator 3.0.0 added the Multi-LLM NIM container, which serves a range of models from one container and can pull them from NGC, the NeMo Data Store, or Hugging Face. It is genuinely useful, and it is also over-applied. The newer feature is not automatically the right default.

Container type first, then GPU placement, then engine. Most production endpoints land in the left branch.

My take: for a single model serving production traffic, use the LLM-specific container with one pinned TensorRT-LLM profile. You get predictable memory, predictable latency, and an artifact you can pre-cache cleanly. Reach for Multi-LLM NIM when you are consolidating a dev environment, A/B testing models, or running a long tail of low-traffic models where the operational savings of one container outweigh the per-model tuning you give up. On engines, TensorRT-LLM is the right call when the hardware is fixed and latency is the priority; vLLM and SGLang earn their place when you need a model that does not have a prebuilt TensorRT-LLM profile yet, or when you value flexibility over the last few percent of throughput.

What Breaks in the Field

A handful of failures account for most NIM support cases. None of them are exotic, and all of them are avoidable with a little design.

Endless pod startups. An un-cached NIMService downloads weights on first start and leans on a 20 minute startup probe. Pre-cache with NIMCache and the pod starts in seconds. This is the single highest-value habit on the platform.
Caching fails on big models. For models around 49 billion parameters and larger, the cache job can hit a Too many open files error during download. The fix is to raise the open-file limit on the container runtime of the VKS worker.
Multi-GPU on the wrong card. L40S with vGPU cannot run a multi-GPU model. You need NVLink-connected A100 or H100, or passthrough.
Image pull failures. The NGC pull secret and the NGC API key are two separate secrets (ngc-secret for the registry, ngc-api-secret for model access). Miss one and you get a pull back-off that looks like a network problem.

The open-files fix, applied to the containerd runtime on the VKS worker node, looks like this:

sudo mkdir -p /etc/systemd/system/containerd.service.d
echo "[Service]" | sudo tee /etc/systemd/system/containerd.service.d/override.conf
echo "LimitNOFILE=65536" | sudo tee -a /etc/systemd/system/containerd.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart containerd
sudo systemctl restart kubelet

Disclaimer: Changing the container runtime config on a VKS worker is a node-level change. Validate it against your VKS version, confirm GPU Operator and driver interoperability, test on a non-production cluster, and roll it out through your node lifecycle process rather than editing live workers by hand. Confirm model licensing and your NVIDIA AI Enterprise entitlement before pulling images.

What I’d Do

For the large majority of Private AI deployments, the right design is boring and reliable: serve models through the NIM Operator on a VKS cluster, always populate a NIMCache on shared ReadWriteMany storage before deploying the NIMService, pin an LLM-specific container to a single TensorRT-LLM profile, and let the HPA handle load between sensible min and max replicas. Choose your GPU placement from the smallest pattern the model fits, and validate the NVLink requirement against your actual hardware before you commit. Save Multi-LLM NIM, DRA, and multi-node serving for the cases that genuinely need them, after you have proven them on your versions. The teams that struggle with Private AI are almost never the ones who got the model wrong. They are the ones who treated serving as a deploy step instead of a design.

What is the first model you plan to put behind a NIM endpoint, and does your current GPU layout actually support it? That single question surfaces most of the design work this part is about.

References

VMware Private AI Series · Part 11 of 30
« Previous: Part 10 | VMware Private AI Complete Guide | Next: Part 12 »

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

See author's posts

Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Tags: Model Serving, nim, NIM Operator, PAIF, Private AI Series, VKS, VMware Private AI

June 17, 2026

Dr. Pranay Jha