VMware Private AI Foundation Architecture and Components, Layer by Layer (Private AI Series, Part 2)

VMware Private AI Foundation is five layers, not one box. A reference walk through every PAIF component on VCF 9.1: the GPU platform, DSM with pgvector, Private AI Services, and the single-zone limit you have to design around.

by

Dr. Pranay Jha

June 15, 2026

No comments

11 minutes

Read Time

VMware Private AI Series · Part 2 of 24

TL;DR · Key Takeaways

PAIF is not an appliance you install on VCF. It is GPU and AI capability woven into VCF Operations, VCF Automation, and Data Services Manager, with Private AI Services running on top.
Think in five planes: infrastructure (VCF plus GPU hosts), the GPU platform (Supervisor, VKS, GPU Operator, NVIDIA AI Enterprise), data (DSM with PostgreSQL 16.8 and pgvector 0.8.0), Private AI Services (Model Gallery, Model Runtime, Data Indexing and Retrieval, Agent Builder), and the control plane.
Model Gallery stores artifacts in Harbor; Model Runtime serves them through vLLM and Infinity behind OpenAI-compatible endpoints.
The vector database is an external PostgreSQL with pgvector, provisioned and lifecycle-managed by Data Services Manager, not by Private AI Services itself.
The sharpest limit: Model Runtime and the VKS control plane deploy as single instances per zone, so a model endpoint has no cross-zone fault tolerance. Design for it now.

Who this is for: VMware architects and consultants scoping a PAIF design on VCF 9.0 or 9.1. Prerequisites: comfort with VCF workload domains, vSphere Supervisor and VKS, and basic GPU virtualization concepts.

Most PAIF designs fail their first review for the same reason. The team draws one big box labeled Private AI, connects it to a GPU host, and calls it an architecture. Then a reviewer asks three questions. Where does the vector database actually live? Who owns model versioning and rollback? What happens to a running inference endpoint when an availability zone goes dark? The single box has no answer. VMware Private AI Foundation with NVIDIA (PAIF) is a layered system built into VMware Cloud Foundation, not a monolith you bolt on. If you cannot name the layer that owns each component, you cannot size it, secure it, or troubleshoot it at 2am.

This is Part 2. Part 1 covered what PAIF is and why it exists. Here we open the box and walk every layer, component by component, with the design decisions and the one limitation that catches most teams.

PAIF is five planes, not one product

PAIF 9.1 (released 12 May 2026, tracking VCF 9.1) does not ship as a separate appliance. It adds GPU and AI-aware capability into three places you already run: VCF Operations, VCF Automation, and VMware Data Services Manager. On top of that substrate sits Private AI Services, the part most people picture when they say the AI. The cleanest mental model is five planes, each owning a distinct job.

The five planes of a PAIF deployment, top to bottom.

Read it top down. The consumption plane is how a team actually requests capacity and calls a model. The services plane is Private AI Services. The data plane is the vector store and its lifecycle. The GPU platform plane is Kubernetes plus the NVIDIA software that turns a card into a schedulable resource. The infrastructure plane is plain VCF. Every troubleshooting session you ever run starts by deciding which plane the symptom belongs to, so commit this to memory before anything else.

Infrastructure and the GPU platform

At the bottom is ordinary VCF 9.1: vSphere for compute, vSAN for storage, NSX for the network, deployed as a workload domain. Nothing here is AI-specific, and that is the point. PAIF inherits VCF HA, DRS, lifecycle, and the security model rather than reinventing them.

The interesting work begins one layer up, where a GPU stops being a PCI device and becomes something Kubernetes can schedule. Three sharing modes exist, and choosing among them is the first real design decision:

vGPU time-slicing with C-series profiles when you want density and many small inference or notebook workloads sharing a card.
MIG (Multi-Instance GPU) on A100, H100, or H200 class cards when you need hard, partitioned isolation between tenants on the same physical GPU.
DirectPath / passthrough when a single VM needs the entire card. VCF 9.1 adds DirectPath Enablement for exclusive GPU access and Enhanced DirectPath I/O for NVIDIA ConnectX-6 DX, ConnectX-7, and BlueField-3 NICs, with GPUDirect RDMA for multi-node training fabrics.

GPU hosts feed the Supervisor, which runs the VKS clusters where AI workloads actually execute.

The NVIDIA software that makes this work is delivered through NVIDIA AI Enterprise (NVAIE), the licensed and supported runtime stack. Inside Kubernetes, the GPU Operator automates the host driver, container toolkit, and device plugin so you are not hand-installing drivers on every node. vSphere Supervisor provides the control plane, and VKS clusters are where GPU workloads and the AI services run. In the field, the most common bring-up failure is a version mismatch between the vGPU host driver and the guest driver the GPU Operator installs. Pin both to the versions in the NVAIE release you are licensed for, and validate before you scale out. I cover the exact failure modes in the vGPU mistakes that break PAIF post.

The data plane: where your vectors actually live

This is the layer most architecture diagrams get wrong. Private AI Services does not run its own database. The vector store is an external PostgreSQL instance with the pgvector extension, provisioned and lifecycle-managed by VMware Data Services Manager (DSM), not by the AI services. Private AI Services 9.1 targets PostgreSQL 16.8 with pgvector 0.8.0 specifically. Treat those as fixed coordinates in your design, not some Postgres you will sort out later.

Why this seam matters: your RAG corpus, the embeddings that represent your proprietary data, lives in a database your platform team already knows how to back up, patch, and secure through DSM. It also means vector DB sizing, IOPS, and HA are a DSM conversation, separate from GPU sizing. Teams routinely under-size this and are surprised when retrieval latency, not inference, becomes the bottleneck under load. pgvector index choice and the database disk tier matter as much here as the GPU does.

Private AI Services: the four modules

Private AI Services is the layer that turns models and data into something an application can call. It has four modules, and they map cleanly onto the lifecycle of a generative AI request.

Model Gallery (the Model Store): uses the Harbor container registry as the central repository for model artifacts. This is your governance point: model versions, provenance, and the image source for air-gapped sites.
Model Runtime: consumes models from Harbor and serves them through inference engines (vLLM and Infinity), exposing both completion and embedding models behind OpenAI-compatible API endpoints. Because the API is OpenAI-compatible, code written against the OpenAI SDK points at your private endpoint with only a base-URL change.
Data Indexing and Retrieval: connects to the external pgvector database, ingests from Google Drive, Confluence, SharePoint, and S3, splits documents into configurable chunks, runs them through an embedding model, and stores the vectors. This is the RAG engine.
Agent Builder: composes the above into agentic workflows over your own corpus.

A single retrieval-augmented request crosses retrieval, the vector store, and the runtime.

That flow is exactly why the layer model matters for troubleshooting. A bad answer could be a retrieval problem (Data Indexing), an endpoint problem (Model Runtime), or a data freshness problem (the indexing schedule), and those live in different modules with different logs.

A word on Agent Builder, since agentic AI is the loudest hype of 2026. It is genuinely useful for retrieval-augmented assistants over your own corpus, and considerably less mature for autonomous, multi-step agents that take actions on their own. Scope your first project to grounded question-answering, prove the retrieval quality, and add agency later. Do not let a slide deck talk you into autonomous agents on day one.

Control and consumption: how people actually use it

Two VCF components make the platform self-service rather than a ticket queue. VCF Automation exposes a catalog from which a team requests either a Deep Learning VM (a preconfigured, GPU-enabled image with the AI frameworks baked in) or a GPU-enabled VKS cluster, sized to the workload. VCF Operations provides the GPU and AI-aware monitoring, capacity, and cost views, so you can see GPU utilization and chargeback instead of flying blind. This is where PAIF earns its keep over a pile of bare GPU servers: the GPU becomes a governed, measured, requestable resource with a lifecycle, not a pet. For the broader VCF 9 deployment path that sits under all of this, see the PAIF deployment walkthrough on VCF 9.

Component reference, at a glance

Component	Plane / owner	Job	Key spec
VCF (vSphere, vSAN, NSX)	Infrastructure	Compute, storage, network substrate	VCF 9.1 + PAIF add-on
vGPU + NVIDIA AI Enterprise	GPU platform	GPU virtualization and supported AI runtime	C-series vGPU, MIG, DirectPath
GPU Operator	GPU platform (K8s)	Automates driver, toolkit, device plugin	Runs inside VKS clusters
vSphere Supervisor + VKS	GPU platform	Kubernetes substrate for services and workloads	VCF 9.x
Data Services Manager	Data	Provisions and manages the vector database	PostgreSQL 16.8, pgvector 0.8.0
Private AI Services	Services	Model Gallery, Model Runtime, Data Indexing and Retrieval, Agent Builder	vLLM and Infinity engines
Harbor	Services	OCI registry for model artifacts and air-gapped images	Model Gallery backend
VCF Automation	Control	Self-service catalog: DLVMs, GPU VKS clusters	VCF 9.1
VCF Operations	Control	GPU and AI monitoring, capacity, cost	VCF 9.1

The one limitation to design around

Here is the part the marketing will not lead with. In the current architecture, the Model Runtime and the VKS cluster control plane deploy as single instances rather than as a distributed, cross-zone cluster. The direct consequence: replicas of the same model endpoint must live in the same availability zone, and a model endpoint has no cross-zone fault tolerance. If the zone hosting an endpoint fails, vSphere HA will restart the VMs, but you do not get active-active inference across zones out of the box.

Endpoint placement decision under the single-zone constraint.

So my verdict for anyone designing PAIF for production inference on 9.1: do not assume the platform hands you highly available endpoints. It gives you infrastructure HA (an HA restart) and a single-zone service. If your SLA needs continuous availability through a zone failure, put a load balancer in front of two endpoints in two zones and own the failover at the application layer. Validate three assumptions before you commit a design: your licensed NVAIE version and its driver pairing, your DSM and PostgreSQL sizing for the vector store, and whether your availability target is compatible with single-zone endpoints. Get those three right and the rest of the build is mechanical.

What I’d Do

If I were scoping this for a client tomorrow, I would draw exactly the five planes above on the whiteboard, assign an owner to each, and refuse to move forward until someone could answer the vector-database and zone-failure questions. The architecture is sound and production-ready for most inference workloads, but its value shows up only when you respect the layer boundaries instead of collapsing them into one box. Part 3 takes the next step into licensing: the PAIF add-on, NVAIE, and what actually lands on the quote.

Which plane are you least sure about in your own design? That is usually the one to prototype first.

References

VMware Private AI Series · Part 2 of 30
« Previous: Part 1 | VMware Private AI Complete Guide | Next: Part 3 »

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

See author's posts

Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Tags: NVIDIA AI Enterprise, PAIF, Private AI Series, Private AI Services, Reference Architecture, VCF 9.1, VMware Private AI

June 17, 2026

Dr. Pranay Jha