Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

FORMERLY
VMware Insight & Cloud Pathshala
What began over a decade ago as a passion for sharing knowledge has evolved into a unified platform for Enterprise AI, VMware, Cloud Architecture, Research, and Modern Infrastructure.
, ,

NVIDIA NIM Microservices on VMware Private AI: The Model-Serving Layer Explained (Private AI Series, Part 11)

NVIDIA NIM is the model-serving layer of VMware Private AI. A reference-architecture look at the NIM Operator, NIMCache and NIMService, GPU placement, and the design choices that decide whether your endpoints survive production.

VMware Private AI Series · Part 11 of 24

TL;DR · Key Takeaways

  • NVIDIA NIM is the model-serving layer of Private AI: prebuilt inference microservices that expose an OpenAI-compatible endpoint, running as pods on a VKS cluster on top of your GPU workload domain.
  • The NIM Operator (v3.1.0) manages the lifecycle through four custom resources: NIMCache, NIMService, NIMPipeline and NIM Build. Get the NIMCache right and most bring-up pain disappears.
  • Pre-cache every model to a shared ReadWriteMany PVC before you create the NIMService, or you will hit 20 minute pod startups and a fresh download on every restart.
  • GPU placement is the design decision that bites: a multi-GPU model needs NVLink-connected A100 or H100 GPUs, and L40S with vGPU cannot run multi-GPU models at all.
  • For a single production model, pin an LLM-specific container to one TensorRT-LLM profile. Keep Multi-LLM NIM for dev and consolidation, not latency-sensitive production.
Who this is for: VCF architects and platform teams designing the inference layer of VMware Private AI Foundation.  Prerequisites: a GPU-enabled VI workload domain, the NVIDIA GPU Operator running on a VKS cluster, and an NGC API key with an NVIDIA AI Enterprise entitlement.

Most Private AI deployments stall in the same place. The GPUs are installed, the workload domain is healthy, the GPU Operator reports every driver green, and then someone tries to actually serve a model and burns three days on pod restarts, half-downloaded weights, and an endpoint that returns 503 under the lightest load. The platform was fine. The serving layer was the afterthought. NVIDIA NIM is that serving layer, and on Private AI it is run by the NIM Operator. This part is about designing it on purpose instead of discovering its sharp edges in production.

If the platform underneath is not built yet, start with preparing the GPU workload domain and installing the GPU Operator and vGPU drivers. Everything here assumes that layer already exists and focuses on what sits on top of it.

Where NIM Sits in the Private AI Stack

A NIM (NVIDIA Inference Microservice) is a container, part of NVIDIA AI Enterprise, that bundles an optimized inference engine with a model and exposes a standard REST and gRPC API. The point of NIM is that you do not hand-build a serving stack. You pull a container, give it a GPU, and you get an OpenAI-compatible endpoint that a RAG pipeline or Agent Builder can call without caring what runs underneath.

On Private AI, NIMs do not run as standalone vSphere VMs. They run as Kubernetes pods inside a VKS (vSphere Kubernetes Service) cluster, and the VKS worker VMs attach to physical GPUs on the ESXi hosts through vGPU or passthrough. VCF Private AI Services sits a layer above and consumes the endpoints, but the NIM is the actual engine. Understanding that boundary matters: when an endpoint is slow or unstable, the fix is almost always in the NIM and its GPU placement, not in the services consuming it.

The Private AI Serving Stack Where NIM and the NIM Operator fit, top to bottom ConsumersRAG pipelines, Agent Builder, apps calling the API NIM model endpointOpenAI-compatible REST / gRPC API on port 8000 NIM microservice podEngine inside the container: TensorRT-LLM, vLLM or SGLang NIM OperatorControllers for NIMCache, NIMService, NIMPipeline, NIM Build NVIDIA GPU OperatorvGPU guest driver, container toolkit, device plugin VKS worker VMsGuest OS and containerd, scheduled by the Supervisor ESXi hosts + physical GPUL40S, H100 or H200 exposed via vGPU or passthrough
The serving stack on Private AI. The two red-outlined layers are what the NIM Operator owns.

The NIM Operator and Its Four Custom Resources

You can run a NIM as a raw Kubernetes Deployment. Do not. On a platform you intend to operate, use the NIM Operator, which is the Kubernetes controller that manages the NIM lifecycle: caching, deployment, health probes, service exposure, scaling, and GPU scheduling. As of v3.1.0 it exposes four custom resources, and the design of your serving layer is really the design of how you use them.

  • NIMCache pulls model artifacts once and stores them on a persistent volume, so the weights are downloaded during cache population and reused across pod restarts and scale events. Sources can be NVIDIA NGC, the NeMo Data Store, or the Hugging Face Hub.
  • NIMService deploys and runs the inference pod: image, GPU resources, startup and readiness probes, the Kubernetes Service, and autoscaling.
  • NIMPipeline groups related services (for example an embedding NIM plus a reranking NIM) so they are managed as one unit.
  • NIM Build, added in v3.0.0, builds and caches TensorRT-LLM engines from buildable profiles when you want an engine optimized for your exact GPU rather than a generic prebuilt one.

The order is the whole game. Create the NIMCache first and let it finish populating, then create the NIMService that references it. Skip the cache and the NIMService still works, but it downloads weights on first start, which is why the operator sets a 20 minute startup probe timeout for un-cached pods. That timeout hides the problem rather than solving it: every restart, every scale-up, and every reschedule pays the download cost again.

NIM Bring-Up Sequence Populate the cache, then deploy the service RegistryNGC / HF /NeMo Data Store NIMCachepopulate PVC(ReadWriteMany) NIMServicepod + probes+ GPU schedule EndpointService Client In practiceAn empty cache means a full model download on every restart and scale-up. Populate NIMCache once, and the NIMService starts in seconds instead of minutes.
Cache first, serve second. The arrow direction is also the order you should apply the manifests.

A minimal pair of manifests for an 8B model looks like this. The NIMCache pulls and pins a TensorRT-LLM profile to a shared volume, and the NIMService consumes it with autoscaling enabled.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3-1-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: vsan-default
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-1-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: latest
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: llama-3-1-8b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
  expose:
    service:
      type: ClusterIP
      port: 8000

One subtlety that catches people: when scale.enabled is true, you cannot also set spec.replicas. The HPA owns the replica count through minReplicas and maxReplicas. And if you are on v3.1.0, note that spec.expose.ingress is deprecated in favor of spec.expose.router, which now supports the Kubernetes Gateway API alongside traditional ingress.


GPU Placement: The Sizing That Actually Matters

The hardest part of designing a NIM endpoint is not the YAML. It is deciding how the model maps onto GPUs, because that decision is constrained by the GPU you chose back in the vGPU, MIG and passthrough discussion and the topology you set in the reference architecture. Three patterns cover almost every case.

Three GPU Placement Patterns A. Single GPU 1 VKS worker VM NIM pod 1 GPU Models up to ~8BL40S or 1 vGPU profiletensorParallelism: 1 B. Multi-GPU, one host 1 VKS worker VM NIM pod Models 13B to 70B2 to 4 H100 with NVLinktensorParallelism: 2 to 4 C. Multi-node worker 1leader worker 2worker Models 100B+LeaderWorkerSets8+ GPUs across nodes
Pick the smallest pattern the model fits in. Each step right adds cost and operational complexity.

Here is the design trap that costs people a week. A multi-GPU model needs the GPUs to talk to each other at full bandwidth, which means NVLink. With vGPU software on vSphere, a model that needs more than one GPU requires A100 or H100 GPUs connected by NVLink or an NVLink Switch. L40S GPUs with vGPU software do not support multi-GPU models at all. If you sized a fleet of L40S boxes expecting to run a 70B model across two cards per VM, that plan does not work with vGPU. Your options become passthrough, NVLink-equipped H100 hosts, or a smaller model. Validate this against the model you actually intend to serve before you buy hardware.

The NIM Operator 3.x line also added Kubernetes Dynamic Resource Allocation (DRA) for GPU allocation, using the NVIDIA DRA driver (v25.12.0 in the v3.1.0 release, on the resource.k8s.io/v1 API). DRA gives you device classes and resource-based requests instead of the blunt nvidia.com/gpu: 1 count. It is the better long-term model, but treat it as the newer path: validate it on your VKS and driver versions before standardizing on it.

Model sizeGPU and profileEngineCache PVCPlacement note
Up to 8B1x L40S 48GB or 1 vGPU profileTensorRT-LLM~50 GBSingle GPU, single pod
13B to 34B1x H100 80GBTensorRT-LLM or vLLM~80 to 120 GBSingle host; fits one large card
70B2 to 4x H100 NVLinkTensorRT-LLM, TP 2 to 4~150 GBNVLink required; not L40S+vGPU
100B and above8x H100 or H200 across nodesvLLM or TensorRT-LLM400 GB and upMulti-node LeaderWorkerSets

These are planning starting points, not guarantees. Cache sizes vary with quantization and the number of profiles you keep, and the GPU memory you actually need depends on context length and batch size. Size the PVC generously: running out of cache space mid-download is a slow, confusing failure.

Container Type and Engine: Choosing by Design

NIM Operator 3.0.0 added the Multi-LLM NIM container, which serves a range of models from one container and can pull them from NGC, the NeMo Data Store, or Hugging Face. It is genuinely useful, and it is also over-applied. The newer feature is not automatically the right default.

Choosing Container Type and Engine One model in production,latency-sensitive? yes no / many LLM-specific NIMpin one TensorRT-LLM profile Multi-LLM NIMvLLM, on-demand caching Then size the placement Fits one GPU?single pod, TP 1 Needs 2 to 4 GPUs?NVLink, tensor parallel Bigger than one host?multi-node LWS Engine: TensorRT-LLM for pinned hardware and lowest latency; vLLM or SGLang for newer models and flexibility.
Container type first, then GPU placement, then engine. Most production endpoints land in the left branch.

My take: for a single model serving production traffic, use the LLM-specific container with one pinned TensorRT-LLM profile. You get predictable memory, predictable latency, and an artifact you can pre-cache cleanly. Reach for Multi-LLM NIM when you are consolidating a dev environment, A/B testing models, or running a long tail of low-traffic models where the operational savings of one container outweigh the per-model tuning you give up. On engines, TensorRT-LLM is the right call when the hardware is fixed and latency is the priority; vLLM and SGLang earn their place when you need a model that does not have a prebuilt TensorRT-LLM profile yet, or when you value flexibility over the last few percent of throughput.

What Breaks in the Field

A handful of failures account for most NIM support cases. None of them are exotic, and all of them are avoidable with a little design.

  • Endless pod startups. An un-cached NIMService downloads weights on first start and leans on a 20 minute startup probe. Pre-cache with NIMCache and the pod starts in seconds. This is the single highest-value habit on the platform.
  • Caching fails on big models. For models around 49 billion parameters and larger, the cache job can hit a Too many open files error during download. The fix is to raise the open-file limit on the container runtime of the VKS worker.
  • Multi-GPU on the wrong card. L40S with vGPU cannot run a multi-GPU model. You need NVLink-connected A100 or H100, or passthrough.
  • Image pull failures. The NGC pull secret and the NGC API key are two separate secrets (ngc-secret for the registry, ngc-api-secret for model access). Miss one and you get a pull back-off that looks like a network problem.

The open-files fix, applied to the containerd runtime on the VKS worker node, looks like this:

sudo mkdir -p /etc/systemd/system/containerd.service.d
echo "[Service]" | sudo tee /etc/systemd/system/containerd.service.d/override.conf
echo "LimitNOFILE=65536" | sudo tee -a /etc/systemd/system/containerd.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart containerd
sudo systemctl restart kubelet
Disclaimer: Changing the container runtime config on a VKS worker is a node-level change. Validate it against your VKS version, confirm GPU Operator and driver interoperability, test on a non-production cluster, and roll it out through your node lifecycle process rather than editing live workers by hand. Confirm model licensing and your NVIDIA AI Enterprise entitlement before pulling images.

What I’d Do

For the large majority of Private AI deployments, the right design is boring and reliable: serve models through the NIM Operator on a VKS cluster, always populate a NIMCache on shared ReadWriteMany storage before deploying the NIMService, pin an LLM-specific container to a single TensorRT-LLM profile, and let the HPA handle load between sensible min and max replicas. Choose your GPU placement from the smallest pattern the model fits, and validate the NVLink requirement against your actual hardware before you commit. Save Multi-LLM NIM, DRA, and multi-node serving for the cases that genuinely need them, after you have proven them on your versions. The teams that struggle with Private AI are almost never the ones who got the model wrong. They are the ones who treated serving as a deploy step instead of a design.

What is the first model you plan to put behind a NIM endpoint, and does your current GPU layout actually support it? That single question surfaces most of the design work this part is about.

References


VMware Private AI Series · Part 11 of 30
« Previous: Part 10  |  VMware Private AI Complete Guide  |  Next: Part 12 »

About The Author


Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

Architect’s Toolkit

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

Discover more from Dr. Pranay Jha

Subscribe now to keep reading and get access to the full archive.

Continue reading