TL;DR · Key Takeaways
- Private AI Services 2.1 splits model delivery into two jobs: the Model Store (a Harbor model gallery) holds versioned model artifacts, and Model Runtime serves them as API endpoints behind the ML API Gateway.
- The gallery is plain Harbor and OCI. Your existing project RBAC, replication, and image scanning apply directly. There is no bespoke registry to learn.
- You push and pull with the
vcf paisCLI from a Deep Learning VM. Validate the model first: hash check, malware scan, serialization-attack scan. That validation is the entire reason a private gallery exists. - Model Runtime runs completion models on vLLM (GPU), embedding models on Infinity (GPU or CPU), and either on llama.cpp (CPU, GGUF only). Pick the engine by model type and where the silicon is.
- Apps target a stable routing name like
openai/gpt-oss-20bover the OpenAI API. Ops swaps the model underneath without touching the app. That decoupling is the point.
By Part 11 you had NIM microservices serving the hero models you care most about. So why does VMware Private AI Foundation ship a second serving path at all? Because not every model you run is a packaged NVIDIA microservice. You will have open-weights models you fine-tuned yourself, a pile of embedding models for retrieval, and small CPU-friendly models that should never touch a GPU. Private AI Services 2.1 gives those models a home: a place to store them, and a runtime to serve them. This post is the runbook for both.
What the Model Store and Model Runtime actually do
Private AI Services is installed as a package on top of the Private AI Foundation core, separately from the platform itself, and activated per namespace. It bundles six modules: the Model Store (a model gallery in Harbor), Model Runtime, Data Indexing and Retrieval (the pgvector knowledge base from Part 13), the MCP Servers and Tool Gallery, Agent Builder, and Observability through Prometheus and Grafana. This post is about the first two, because they are the load-bearing pieces every other module sits on. No stored model, no endpoint. No endpoint, no agent.
Step 1: Stand up the model gallery in Harbor
A model gallery is not a new product. It is a Harbor project with user access configured, and it stores models as OCI artifacts. That single fact is worth internalising, because it means the model gallery inherits everything Harbor already does: project-level RBAC, Trivy scanning, replication between sites, and quota. The mapping is clean. A gallery is a Harbor project. A model is an OCI repository. A revision is an immutable OCI artifact identified by content digest. A file inside the model is an OCI layer or blob.
You drive the gallery with the vcf pais CLI, a plug-in for the VCF Consumption CLI. It is already embedded in Deep Learning VM images, which is the easiest place to work from. On any other machine, install it explicitly.
# Install and verify the pais plug-in for the VCF Consumption CLI
vcf plugin install pais
vcf plugin list
# Trust the Harbor issuing CA on the Deep Learning VM, then restart docker
sudo cp my-harbor-issuing-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
sudo systemctl restart docker
# Authenticate to the registry
docker login -u my_harbor_user my-harbor-repo.mycompany.com
Step 2: Validate before you publish
This is the step teams skip, and skipping it throws away the only reason to run a private gallery in the first place. A model file is executable data. A pickle or unsafe serialization format can run arbitrary code the moment it loads. Pulling a model straight from a public hub into a production endpoint is the model-serving equivalent of running a container image you never scanned. Do the validation on a Deep Learning VM acting as a sandbox, then promote only what passes.
Download the model into the sandbox VM (from NGC, Hugging Face, or your own hub), validate it, then push. If you pull from Hugging Face, use the huggingface-cli with the --local-dir flag so the files land without symlinks, which the pais CLI needs.
# Push a validated model into the gallery, tagged deliberately
cd ./baai/bge-small-en-v1.5
vcf pais models push
--modelName baai/bge-small-en-v1.5
--modelStore my-harbor-repo.mycompany.com/dev-models
--tag approved
# Confirm it landed
vcf pais models list --modelStore my-harbor-repo.mycompany.com/dev-models
# Later, on the serving side, pull by the same explicit tag
vcf pais models pull
--modelName baai/bge-small-en-v1.5
--modelStore my-harbor-repo.mycompany.com/dev-models
--tag approved
Every push to the same modelName with changed data creates a new revision with its own digest, so a gallery doubles as your model version history. Tag the promotion stages you actually use (approved, prod) and lean on Harbor project access to keep dev galleries separate from production ones.
Step 3: Serve it as an endpoint with Model Runtime
A model endpoint is an API endpoint for inference, running in a container on a preferably GPU-enabled VM and deployed as a Kubernetes resource in your namespace. Model Runtime gives you three open-source inference engines, and the right one is not a preference, it is dictated by the model type and where the compute is. Completion models on GPU run on vLLM. Embedding models run on Infinity, on GPU or CPU. Anything on CPU-only can run on llama.cpp, but only if the model is in GGUF format.
| Engine | Model type | Runs on | File format |
|---|---|---|---|
| vLLM | Completion | GPU | Standard weights |
| Infinity | Embedding | GPU or CPU | Standard weights |
| llama.cpp | Completion or embedding | CPU | GGUF required |
Deploy the endpoint two ways. The Model Runtime UI inside VCF Automation is the click path for one-off endpoints. For anything you want repeatable, deploy with kubectl so the endpoint is a declarative Kubernetes object you can put in Git. Either way you set three things that matter: the model URL pointing at your gallery revision, the model type and engine, and the routing name. The routing name follows a provider_name/model_id convention, for example openai/gpt-oss-20b or google/gemma-3-1b. That name is the contract your applications bind to.
Step 4: Consume it as a service
Endpoints sit behind the ML API Gateway and speak the OpenAI API. This is the piece that turns a model into model-as-a-service. Applications and agents call a stable routing name over a standard OpenAI-compatible API. The platform team can update model versions or horizontally scale replicas up and down behind that name, and the application never notices. The model becomes an operational concern, decoupled from the app lifecycle. That is the design win, and it is why you bothered with a gallery and a runtime instead of just running vLLM in a pod by hand.
Model Runtime or NIM? When to use which
You now have two serving paths, and the temptation is to standardise on one. Resist it. NIM, covered in Part 11 on NVIDIA NIM microservices, is the right call for your hero models: the handful of large language models where you want NVIDIA’s tuned throughput, TensorRT-LLM optimisation, and vendor support. Model Runtime is the right call for the long tail, the open-weights models you fine-tuned yourself, the swarm of embedding models retrieval needs, and the small CPU-friendly models that should never occupy a GPU. My take: route your top one or two models through NIM and everything else through Model Runtime. Trying to force every embedding model into a NIM, or every fine-tuned variant into a hand-built container, is how platforms get brittle. Use the gallery as the common substrate underneath both, because both pull from it.
For the architecture context that frames both serving paths, see Part 7 on the Private AI reference architecture and sizing, and for the sandbox where validation happens, see Part 10 on Deep Learning VMs.
What I’d Do
Treat the gallery as the source of truth on day one, before you deploy a single endpoint. Get tag discipline and validation right while the catalogue is small, because retrofitting governance onto a sprawling set of untracked models is miserable. Stand up one dev gallery and one prod gallery, wire Harbor scanning, and make the validate-then-push flow the only way models enter. Then let Model Runtime and NIM both draw from it. Which model are you serving by hand today that should really be a tagged revision behind a routing name?
References
- Broadcom TechDocs: Delivering Generative AI Applications by Using Private AI Services (2.1)
- Broadcom TechDocs: Storing ML Models (the model gallery and vcf pais CLI)
- Broadcom TechDocs: Running Completion or Embedding Models by Using Model Endpoints
- VMware Cloud Foundation Blog: Activate VCF Private AI Services
« Previous: Part 11 | VMware Private AI Complete Guide | Next: Part 13 »








