TL;DR · Key Takeaways
- PAIF is not an appliance you install on VCF. It is GPU and AI capability woven into VCF Operations, VCF Automation, and Data Services Manager, with Private AI Services running on top.
- Think in five planes: infrastructure (VCF plus GPU hosts), the GPU platform (Supervisor, VKS, GPU Operator, NVIDIA AI Enterprise), data (DSM with PostgreSQL 16.8 and pgvector 0.8.0), Private AI Services (Model Gallery, Model Runtime, Data Indexing and Retrieval, Agent Builder), and the control plane.
- Model Gallery stores artifacts in Harbor; Model Runtime serves them through vLLM and Infinity behind OpenAI-compatible endpoints.
- The vector database is an external PostgreSQL with pgvector, provisioned and lifecycle-managed by Data Services Manager, not by Private AI Services itself.
- The sharpest limit: Model Runtime and the VKS control plane deploy as single instances per zone, so a model endpoint has no cross-zone fault tolerance. Design for it now.
Most PAIF designs fail their first review for the same reason. The team draws one big box labeled Private AI, connects it to a GPU host, and calls it an architecture. Then a reviewer asks three questions. Where does the vector database actually live? Who owns model versioning and rollback? What happens to a running inference endpoint when an availability zone goes dark? The single box has no answer. VMware Private AI Foundation with NVIDIA (PAIF) is a layered system built into VMware Cloud Foundation, not a monolith you bolt on. If you cannot name the layer that owns each component, you cannot size it, secure it, or troubleshoot it at 2am.
This is Part 2. Part 1 covered what PAIF is and why it exists. Here we open the box and walk every layer, component by component, with the design decisions and the one limitation that catches most teams.
PAIF is five planes, not one product
PAIF 9.1 (released 12 May 2026, tracking VCF 9.1) does not ship as a separate appliance. It adds GPU and AI-aware capability into three places you already run: VCF Operations, VCF Automation, and VMware Data Services Manager. On top of that substrate sits Private AI Services, the part most people picture when they say the AI. The cleanest mental model is five planes, each owning a distinct job.
Read it top down. The consumption plane is how a team actually requests capacity and calls a model. The services plane is Private AI Services. The data plane is the vector store and its lifecycle. The GPU platform plane is Kubernetes plus the NVIDIA software that turns a card into a schedulable resource. The infrastructure plane is plain VCF. Every troubleshooting session you ever run starts by deciding which plane the symptom belongs to, so commit this to memory before anything else.
Infrastructure and the GPU platform
At the bottom is ordinary VCF 9.1: vSphere for compute, vSAN for storage, NSX for the network, deployed as a workload domain. Nothing here is AI-specific, and that is the point. PAIF inherits VCF HA, DRS, lifecycle, and the security model rather than reinventing them.
The interesting work begins one layer up, where a GPU stops being a PCI device and becomes something Kubernetes can schedule. Three sharing modes exist, and choosing among them is the first real design decision:
- vGPU time-slicing with C-series profiles when you want density and many small inference or notebook workloads sharing a card.
- MIG (Multi-Instance GPU) on A100, H100, or H200 class cards when you need hard, partitioned isolation between tenants on the same physical GPU.
- DirectPath / passthrough when a single VM needs the entire card. VCF 9.1 adds DirectPath Enablement for exclusive GPU access and Enhanced DirectPath I/O for NVIDIA ConnectX-6 DX, ConnectX-7, and BlueField-3 NICs, with GPUDirect RDMA for multi-node training fabrics.
The NVIDIA software that makes this work is delivered through NVIDIA AI Enterprise (NVAIE), the licensed and supported runtime stack. Inside Kubernetes, the GPU Operator automates the host driver, container toolkit, and device plugin so you are not hand-installing drivers on every node. vSphere Supervisor provides the control plane, and VKS clusters are where GPU workloads and the AI services run. In the field, the most common bring-up failure is a version mismatch between the vGPU host driver and the guest driver the GPU Operator installs. Pin both to the versions in the NVAIE release you are licensed for, and validate before you scale out. I cover the exact failure modes in the vGPU mistakes that break PAIF post.
The data plane: where your vectors actually live
This is the layer most architecture diagrams get wrong. Private AI Services does not run its own database. The vector store is an external PostgreSQL instance with the pgvector extension, provisioned and lifecycle-managed by VMware Data Services Manager (DSM), not by the AI services. Private AI Services 9.1 targets PostgreSQL 16.8 with pgvector 0.8.0 specifically. Treat those as fixed coordinates in your design, not some Postgres you will sort out later.
Why this seam matters: your RAG corpus, the embeddings that represent your proprietary data, lives in a database your platform team already knows how to back up, patch, and secure through DSM. It also means vector DB sizing, IOPS, and HA are a DSM conversation, separate from GPU sizing. Teams routinely under-size this and are surprised when retrieval latency, not inference, becomes the bottleneck under load. pgvector index choice and the database disk tier matter as much here as the GPU does.
Private AI Services: the four modules
Private AI Services is the layer that turns models and data into something an application can call. It has four modules, and they map cleanly onto the lifecycle of a generative AI request.
- Model Gallery (the Model Store): uses the Harbor container registry as the central repository for model artifacts. This is your governance point: model versions, provenance, and the image source for air-gapped sites.
- Model Runtime: consumes models from Harbor and serves them through inference engines (vLLM and Infinity), exposing both completion and embedding models behind OpenAI-compatible API endpoints. Because the API is OpenAI-compatible, code written against the OpenAI SDK points at your private endpoint with only a base-URL change.
- Data Indexing and Retrieval: connects to the external pgvector database, ingests from Google Drive, Confluence, SharePoint, and S3, splits documents into configurable chunks, runs them through an embedding model, and stores the vectors. This is the RAG engine.
- Agent Builder: composes the above into agentic workflows over your own corpus.
That flow is exactly why the layer model matters for troubleshooting. A bad answer could be a retrieval problem (Data Indexing), an endpoint problem (Model Runtime), or a data freshness problem (the indexing schedule), and those live in different modules with different logs.
A word on Agent Builder, since agentic AI is the loudest hype of 2026. It is genuinely useful for retrieval-augmented assistants over your own corpus, and considerably less mature for autonomous, multi-step agents that take actions on their own. Scope your first project to grounded question-answering, prove the retrieval quality, and add agency later. Do not let a slide deck talk you into autonomous agents on day one.
Control and consumption: how people actually use it
Two VCF components make the platform self-service rather than a ticket queue. VCF Automation exposes a catalog from which a team requests either a Deep Learning VM (a preconfigured, GPU-enabled image with the AI frameworks baked in) or a GPU-enabled VKS cluster, sized to the workload. VCF Operations provides the GPU and AI-aware monitoring, capacity, and cost views, so you can see GPU utilization and chargeback instead of flying blind. This is where PAIF earns its keep over a pile of bare GPU servers: the GPU becomes a governed, measured, requestable resource with a lifecycle, not a pet. For the broader VCF 9 deployment path that sits under all of this, see the PAIF deployment walkthrough on VCF 9.
Component reference, at a glance
| Component | Plane / owner | Job | Key spec |
|---|---|---|---|
| VCF (vSphere, vSAN, NSX) | Infrastructure | Compute, storage, network substrate | VCF 9.1 + PAIF add-on |
| vGPU + NVIDIA AI Enterprise | GPU platform | GPU virtualization and supported AI runtime | C-series vGPU, MIG, DirectPath |
| GPU Operator | GPU platform (K8s) | Automates driver, toolkit, device plugin | Runs inside VKS clusters |
| vSphere Supervisor + VKS | GPU platform | Kubernetes substrate for services and workloads | VCF 9.x |
| Data Services Manager | Data | Provisions and manages the vector database | PostgreSQL 16.8, pgvector 0.8.0 |
| Private AI Services | Services | Model Gallery, Model Runtime, Data Indexing and Retrieval, Agent Builder | vLLM and Infinity engines |
| Harbor | Services | OCI registry for model artifacts and air-gapped images | Model Gallery backend |
| VCF Automation | Control | Self-service catalog: DLVMs, GPU VKS clusters | VCF 9.1 |
| VCF Operations | Control | GPU and AI monitoring, capacity, cost | VCF 9.1 |
The one limitation to design around
Here is the part the marketing will not lead with. In the current architecture, the Model Runtime and the VKS cluster control plane deploy as single instances rather than as a distributed, cross-zone cluster. The direct consequence: replicas of the same model endpoint must live in the same availability zone, and a model endpoint has no cross-zone fault tolerance. If the zone hosting an endpoint fails, vSphere HA will restart the VMs, but you do not get active-active inference across zones out of the box.
So my verdict for anyone designing PAIF for production inference on 9.1: do not assume the platform hands you highly available endpoints. It gives you infrastructure HA (an HA restart) and a single-zone service. If your SLA needs continuous availability through a zone failure, put a load balancer in front of two endpoints in two zones and own the failover at the application layer. Validate three assumptions before you commit a design: your licensed NVAIE version and its driver pairing, your DSM and PostgreSQL sizing for the vector store, and whether your availability target is compatible with single-zone endpoints. Get those three right and the rest of the build is mechanical.
What I’d Do
If I were scoping this for a client tomorrow, I would draw exactly the five planes above on the whiteboard, assign an owner to each, and refuse to move forward until someone could answer the vector-database and zone-failure questions. The architecture is sound and production-ready for most inference workloads, but its value shows up only when you respect the layer boundaries instead of collapsing them into one box. Part 3 takes the next step into licensing: the PAIF add-on, NVAIE, and what actually lands on the quote.
Which plane are you least sure about in your own design? That is usually the one to prototype first.
References
- VMware Private AI Foundation with NVIDIA 9.1 (Broadcom TechDocs)
- Private AI Services detailed design (Broadcom TechDocs)
- VCF 9.1: The Secure, Cost-Effective Private Cloud Platform for Production AI (VCF blog)
« Previous: Part 1 | VMware Private AI Complete Guide | Next: Part 3 »



