Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

FORMERLY
VMware Insight & Cloud Pathshala
What began over a decade ago as a passion for sharing knowledge has evolved into a unified platform for Enterprise AI, VMware, Cloud Architecture, Research, and Modern Infrastructure.
, ,

VMware Private AI Services: Deploying Models with the Model Store and Model Runtime (Private AI Series, Part 12)

A hands-on runbook for Private AI Services 2.1: stand up a Harbor model gallery, validate and push models with the vcf pais CLI, then serve them as endpoints through Model Runtime and the ML API Gateway.

VMware Private AI Series · Part 12 of 24

TL;DR · Key Takeaways

  • Private AI Services 2.1 splits model delivery into two jobs: the Model Store (a Harbor model gallery) holds versioned model artifacts, and Model Runtime serves them as API endpoints behind the ML API Gateway.
  • The gallery is plain Harbor and OCI. Your existing project RBAC, replication, and image scanning apply directly. There is no bespoke registry to learn.
  • You push and pull with the vcf pais CLI from a Deep Learning VM. Validate the model first: hash check, malware scan, serialization-attack scan. That validation is the entire reason a private gallery exists.
  • Model Runtime runs completion models on vLLM (GPU), embedding models on Infinity (GPU or CPU), and either on llama.cpp (CPU, GGUF only). Pick the engine by model type and where the silicon is.
  • Apps target a stable routing name like openai/gpt-oss-20b over the OpenAI API. Ops swaps the model underneath without touching the app. That decoupling is the point.
Who this is for: MLOps engineers and platform architects standing up model-as-a-service on VCF.  Prerequisites: a VCF 9 GPU workload domain with Private AI Services 2.1 activated in a namespace, a private Harbor registry, and at least one Deep Learning VM you can SSH into.

By Part 11 you had NIM microservices serving the hero models you care most about. So why does VMware Private AI Foundation ship a second serving path at all? Because not every model you run is a packaged NVIDIA microservice. You will have open-weights models you fine-tuned yourself, a pile of embedding models for retrieval, and small CPU-friendly models that should never touch a GPU. Private AI Services 2.1 gives those models a home: a place to store them, and a runtime to serve them. This post is the runbook for both.

What the Model Store and Model Runtime actually do

Private AI Services is installed as a package on top of the Private AI Foundation core, separately from the platform itself, and activated per namespace. It bundles six modules: the Model Store (a model gallery in Harbor), Model Runtime, Data Indexing and Retrieval (the pgvector knowledge base from Part 13), the MCP Servers and Tool Gallery, Agent Builder, and Observability through Prometheus and Grafana. This post is about the first two, because they are the load-bearing pieces every other module sits on. No stored model, no endpoint. No endpoint, no agent.

Where Model Store and Model Runtime sit Private AI Services 2.1 modules layered on the Foundation core Agent Builder · Generative AI apps consume endpoints over the OpenAI API Model Store Harbor model gallery (OCI) Model Runtime vLLM / Infinity / llama.cpp + gateway Data Indexing pgvector knowledge bases MCP Tool Gallery Observability Prometheus, Grafana, traces Private AI Foundation core Deep Learning VMs · GPU Operator · vGPU drivers · NIM microservices VCF 9 platform vSphere Supervisor · vSAN · NSX · NVIDIA GPUs Red blocks are the two modules covered in this runbook. Everything above them depends on a stored, served model.
Model Store and Model Runtime are the foundation the rest of Private AI Services builds on.

Step 1: Stand up the model gallery in Harbor

A model gallery is not a new product. It is a Harbor project with user access configured, and it stores models as OCI artifacts. That single fact is worth internalising, because it means the model gallery inherits everything Harbor already does: project-level RBAC, Trivy scanning, replication between sites, and quota. The mapping is clean. A gallery is a Harbor project. A model is an OCI repository. A revision is an immutable OCI artifact identified by content digest. A file inside the model is an OCI layer or blob.

How a model maps to Harbor and OCI Gallery to project, model to repository, revision to immutable digest Model gallery = Harbor project (RBAC, quota) Model = OCI repository, e.g. baai/bge-small-en-v1.5 Revision = immutable OCI artifact identified by content digest Why immutability matters Same data pushed twice = one revision. Changed data = a new digest. You can never silently mutate a model behind a tag. Integrity comes from the registry, not trust. Gotcha: the latest tag Unlike a container ecosystem, latest is NOT the default tag on pull. Scripts that omit a tag will not get newest. Tag deliberately (approved, prod) and pull by tag or digest, always.
The model gallery is OCI all the way down, which is exactly why it is trustworthy.

You drive the gallery with the vcf pais CLI, a plug-in for the VCF Consumption CLI. It is already embedded in Deep Learning VM images, which is the easiest place to work from. On any other machine, install it explicitly.

# Install and verify the pais plug-in for the VCF Consumption CLI
vcf plugin install pais
vcf plugin list

# Trust the Harbor issuing CA on the Deep Learning VM, then restart docker
sudo cp my-harbor-issuing-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
sudo systemctl restart docker

# Authenticate to the registry
docker login -u my_harbor_user my-harbor-repo.mycompany.com

Step 2: Validate before you publish

This is the step teams skip, and skipping it throws away the only reason to run a private gallery in the first place. A model file is executable data. A pickle or unsafe serialization format can run arbitrary code the moment it loads. Pulling a model straight from a public hub into a production endpoint is the model-serving equivalent of running a container image you never scanned. Do the validation on a Deep Learning VM acting as a sandbox, then promote only what passes.

Disclaimer: Pushing and serving models is a production change. Validate the model integrity by hash, scan for malware and serialization attacks, confirm inference works on a Triton sandbox, check that your inference engine actually supports the model and file format, and verify GPU capacity in the target namespace before you promote anything to a shared gallery.

Download the model into the sandbox VM (from NGC, Hugging Face, or your own hub), validate it, then push. If you pull from Hugging Face, use the huggingface-cli with the --local-dir flag so the files land without symlinks, which the pais CLI needs.

# Push a validated model into the gallery, tagged deliberately
cd ./baai/bge-small-en-v1.5
vcf pais models push 
  --modelName baai/bge-small-en-v1.5 
  --modelStore my-harbor-repo.mycompany.com/dev-models 
  --tag approved

# Confirm it landed
vcf pais models list --modelStore my-harbor-repo.mycompany.com/dev-models

# Later, on the serving side, pull by the same explicit tag
vcf pais models pull 
  --modelName baai/bge-small-en-v1.5 
  --modelStore my-harbor-repo.mycompany.com/dev-models 
  --tag approved

Every push to the same modelName with changed data creates a new revision with its own digest, so a gallery doubles as your model version history. Tag the promotion stages you actually use (approved, prod) and lean on Harbor project access to keep dev galleries separate from production ones.


Step 3: Serve it as an endpoint with Model Runtime

A model endpoint is an API endpoint for inference, running in a container on a preferably GPU-enabled VM and deployed as a Kubernetes resource in your namespace. Model Runtime gives you three open-source inference engines, and the right one is not a preference, it is dictated by the model type and where the compute is. Completion models on GPU run on vLLM. Embedding models run on Infinity, on GPU or CPU. Anything on CPU-only can run on llama.cpp, but only if the model is in GGUF format.

Which inference engine? Driven by model type and available silicon, not preference Completion or embedding? completion embedding GPU available? Infinity GPU or CPU yes no vLLM GPU completion llama.cpp CPU, GGUF only CPU-only fallback llama.cpp, GGUF Practical rule: do not burn a vGPU profile on a small embedding model. Infinity on CPU is plenty.
Engine selection is mechanical once you know the model type and where the GPU is.
EngineModel typeRuns onFile format
vLLMCompletionGPUStandard weights
InfinityEmbeddingGPU or CPUStandard weights
llama.cppCompletion or embeddingCPUGGUF required

Deploy the endpoint two ways. The Model Runtime UI inside VCF Automation is the click path for one-off endpoints. For anything you want repeatable, deploy with kubectl so the endpoint is a declarative Kubernetes object you can put in Git. Either way you set three things that matter: the model URL pointing at your gallery revision, the model type and engine, and the routing name. The routing name follows a provider_name/model_id convention, for example openai/gpt-oss-20b or google/gemma-3-1b. That name is the contract your applications bind to.

Step 4: Consume it as a service

Endpoints sit behind the ML API Gateway and speak the OpenAI API. This is the piece that turns a model into model-as-a-service. Applications and agents call a stable routing name over a standard OpenAI-compatible API. The platform team can update model versions or horizontally scale replicas up and down behind that name, and the application never notices. The model becomes an operational concern, decoupled from the app lifecycle. That is the design win, and it is why you bothered with a gallery and a runtime instead of just running vLLM in a pod by hand.

From raw weights to a callable service DL VM download + validate Model Store vcf pais push, tagged revision Model Runtime vLLM endpoint on K8s ML API Gateway routing name, OpenAI API App / Agent calls stable routing name Ops swaps the model or scales replicas behind the gateway. The routing name the app binds to never changes.
The gateway and the OpenAI API are what make this model-as-a-service rather than a hand-run pod.

Model Runtime or NIM? When to use which

You now have two serving paths, and the temptation is to standardise on one. Resist it. NIM, covered in Part 11 on NVIDIA NIM microservices, is the right call for your hero models: the handful of large language models where you want NVIDIA’s tuned throughput, TensorRT-LLM optimisation, and vendor support. Model Runtime is the right call for the long tail, the open-weights models you fine-tuned yourself, the swarm of embedding models retrieval needs, and the small CPU-friendly models that should never occupy a GPU. My take: route your top one or two models through NIM and everything else through Model Runtime. Trying to force every embedding model into a NIM, or every fine-tuned variant into a hand-built container, is how platforms get brittle. Use the gallery as the common substrate underneath both, because both pull from it.

For the architecture context that frames both serving paths, see Part 7 on the Private AI reference architecture and sizing, and for the sandbox where validation happens, see Part 10 on Deep Learning VMs.

What I’d Do

Treat the gallery as the source of truth on day one, before you deploy a single endpoint. Get tag discipline and validation right while the catalogue is small, because retrofitting governance onto a sprawling set of untracked models is miserable. Stand up one dev gallery and one prod gallery, wire Harbor scanning, and make the validate-then-push flow the only way models enter. Then let Model Runtime and NIM both draw from it. Which model are you serving by hand today that should really be a tagged revision behind a routing name?

References


VMware Private AI Series · Part 12 of 30
« Previous: Part 11  |  VMware Private AI Complete Guide  |  Next: Part 13 »

About The Author


Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

Architect’s Toolkit

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

Discover more from Dr. Pranay Jha

Subscribe now to keep reading and get access to the full archive.

Continue reading