Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

FORMERLY
VMware Insight & Cloud Pathshala
What began over a decade ago as a passion for sharing knowledge has evolved into a unified platform for Enterprise AI, VMware, Cloud Architecture, Research, and Modern Infrastructure.
, ,

VMware Private AI Foundation Architecture and Components, Layer by Layer (Private AI Series, Part 2)

VMware Private AI Foundation is five layers, not one box. A reference walk through every PAIF component on VCF 9.1: the GPU platform, DSM with pgvector, Private AI Services, and the single-zone limit you have to design around.

VMware Private AI Series · Part 2 of 24

TL;DR · Key Takeaways

  • PAIF is not an appliance you install on VCF. It is GPU and AI capability woven into VCF Operations, VCF Automation, and Data Services Manager, with Private AI Services running on top.
  • Think in five planes: infrastructure (VCF plus GPU hosts), the GPU platform (Supervisor, VKS, GPU Operator, NVIDIA AI Enterprise), data (DSM with PostgreSQL 16.8 and pgvector 0.8.0), Private AI Services (Model Gallery, Model Runtime, Data Indexing and Retrieval, Agent Builder), and the control plane.
  • Model Gallery stores artifacts in Harbor; Model Runtime serves them through vLLM and Infinity behind OpenAI-compatible endpoints.
  • The vector database is an external PostgreSQL with pgvector, provisioned and lifecycle-managed by Data Services Manager, not by Private AI Services itself.
  • The sharpest limit: Model Runtime and the VKS control plane deploy as single instances per zone, so a model endpoint has no cross-zone fault tolerance. Design for it now.
Who this is for: VMware architects and consultants scoping a PAIF design on VCF 9.0 or 9.1.  Prerequisites: comfort with VCF workload domains, vSphere Supervisor and VKS, and basic GPU virtualization concepts.

Most PAIF designs fail their first review for the same reason. The team draws one big box labeled Private AI, connects it to a GPU host, and calls it an architecture. Then a reviewer asks three questions. Where does the vector database actually live? Who owns model versioning and rollback? What happens to a running inference endpoint when an availability zone goes dark? The single box has no answer. VMware Private AI Foundation with NVIDIA (PAIF) is a layered system built into VMware Cloud Foundation, not a monolith you bolt on. If you cannot name the layer that owns each component, you cannot size it, secure it, or troubleshoot it at 2am.

This is Part 2. Part 1 covered what PAIF is and why it exists. Here we open the box and walk every layer, component by component, with the design decisions and the one limitation that catches most teams.

PAIF is five planes, not one product

PAIF 9.1 (released 12 May 2026, tracking VCF 9.1) does not ship as a separate appliance. It adds GPU and AI-aware capability into three places you already run: VCF Operations, VCF Automation, and VMware Data Services Manager. On top of that substrate sits Private AI Services, the part most people picture when they say the AI. The cleanest mental model is five planes, each owning a distinct job.

The PAIF stack: five planes Each plane owns one job. Troubleshooting starts by naming the plane. Consumption plane VCF Automation self-service catalog · VCF Operations (GPU monitoring, capacity, cost) Services plane Private AI Services: Model Gallery · Model Runtime · Data Indexing and Retrieval · Agent Builder · Harbor Data plane Data Services Manager: PostgreSQL 16.8 + pgvector 0.8.0 vector store GPU platform plane vSphere Supervisor · VKS clusters · GPU Operator · NVIDIA AI Enterprise Infrastructure plane VCF 9.1: vSphere · vSAN · NSX · GPU-equipped ESXi hosts
The five planes of a PAIF deployment, top to bottom.

Read it top down. The consumption plane is how a team actually requests capacity and calls a model. The services plane is Private AI Services. The data plane is the vector store and its lifecycle. The GPU platform plane is Kubernetes plus the NVIDIA software that turns a card into a schedulable resource. The infrastructure plane is plain VCF. Every troubleshooting session you ever run starts by deciding which plane the symptom belongs to, so commit this to memory before anything else.

Infrastructure and the GPU platform

At the bottom is ordinary VCF 9.1: vSphere for compute, vSAN for storage, NSX for the network, deployed as a workload domain. Nothing here is AI-specific, and that is the point. PAIF inherits VCF HA, DRS, lifecycle, and the security model rather than reinventing them.

The interesting work begins one layer up, where a GPU stops being a PCI device and becomes something Kubernetes can schedule. Three sharing modes exist, and choosing among them is the first real design decision:

  • vGPU time-slicing with C-series profiles when you want density and many small inference or notebook workloads sharing a card.
  • MIG (Multi-Instance GPU) on A100, H100, or H200 class cards when you need hard, partitioned isolation between tenants on the same physical GPU.
  • DirectPath / passthrough when a single VM needs the entire card. VCF 9.1 adds DirectPath Enablement for exclusive GPU access and Enhanced DirectPath I/O for NVIDIA ConnectX-6 DX, ConnectX-7, and BlueField-3 NICs, with GPUDirect RDMA for multi-node training fabrics.
From silicon to a scheduled GPU How a physical card becomes a Kubernetes-schedulable resource GPU host cluster ESXi host + vGPUNVIDIA GPU, host driver ESXi host + vGPUMIG / DirectPath capable ESXi host + vGPUvSAN, NSX attached vSphereSupervisor Kubernetes controlplane on vSphere VKS cluster GPU Operatordriver, toolkit, device plugin NVIDIA AI Enterprise runtimelicensed, supported stack Model Runtime podsvLLM / Infinity serving Endpoint replicas live in one zone
GPU hosts feed the Supervisor, which runs the VKS clusters where AI workloads actually execute.

The NVIDIA software that makes this work is delivered through NVIDIA AI Enterprise (NVAIE), the licensed and supported runtime stack. Inside Kubernetes, the GPU Operator automates the host driver, container toolkit, and device plugin so you are not hand-installing drivers on every node. vSphere Supervisor provides the control plane, and VKS clusters are where GPU workloads and the AI services run. In the field, the most common bring-up failure is a version mismatch between the vGPU host driver and the guest driver the GPU Operator installs. Pin both to the versions in the NVAIE release you are licensed for, and validate before you scale out. I cover the exact failure modes in the vGPU mistakes that break PAIF post.


The data plane: where your vectors actually live

This is the layer most architecture diagrams get wrong. Private AI Services does not run its own database. The vector store is an external PostgreSQL instance with the pgvector extension, provisioned and lifecycle-managed by VMware Data Services Manager (DSM), not by the AI services. Private AI Services 9.1 targets PostgreSQL 16.8 with pgvector 0.8.0 specifically. Treat those as fixed coordinates in your design, not some Postgres you will sort out later.

Why this seam matters: your RAG corpus, the embeddings that represent your proprietary data, lives in a database your platform team already knows how to back up, patch, and secure through DSM. It also means vector DB sizing, IOPS, and HA are a DSM conversation, separate from GPU sizing. Teams routinely under-size this and are surprised when retrieval latency, not inference, becomes the bottleneck under load. pgvector index choice and the database disk tier matter as much here as the GPU does.

Private AI Services: the four modules

Private AI Services is the layer that turns models and data into something an application can call. It has four modules, and they map cleanly onto the lifecycle of a generative AI request.

  • Model Gallery (the Model Store): uses the Harbor container registry as the central repository for model artifacts. This is your governance point: model versions, provenance, and the image source for air-gapped sites.
  • Model Runtime: consumes models from Harbor and serves them through inference engines (vLLM and Infinity), exposing both completion and embedding models behind OpenAI-compatible API endpoints. Because the API is OpenAI-compatible, code written against the OpenAI SDK points at your private endpoint with only a base-URL change.
  • Data Indexing and Retrieval: connects to the external pgvector database, ingests from Google Drive, Confluence, SharePoint, and S3, splits documents into configurable chunks, runs them through an embedding model, and stores the vectors. This is the RAG engine.
  • Agent Builder: composes the above into agentic workflows over your own corpus.
A grounded answer, step by step One RAG request touches three modules and two data stores 1 App (OpenAI SDK)asks a question 2 Model EndpointOpenAI-compatible API 3 Data Indexing and Retrievalembeds query, searches pgvector storetop-k context returned 4 Model Runtime (vLLM)generates with context 5 Grounded answerreturned to the app
A single retrieval-augmented request crosses retrieval, the vector store, and the runtime.

That flow is exactly why the layer model matters for troubleshooting. A bad answer could be a retrieval problem (Data Indexing), an endpoint problem (Model Runtime), or a data freshness problem (the indexing schedule), and those live in different modules with different logs.

A word on Agent Builder, since agentic AI is the loudest hype of 2026. It is genuinely useful for retrieval-augmented assistants over your own corpus, and considerably less mature for autonomous, multi-step agents that take actions on their own. Scope your first project to grounded question-answering, prove the retrieval quality, and add agency later. Do not let a slide deck talk you into autonomous agents on day one.

Control and consumption: how people actually use it

Two VCF components make the platform self-service rather than a ticket queue. VCF Automation exposes a catalog from which a team requests either a Deep Learning VM (a preconfigured, GPU-enabled image with the AI frameworks baked in) or a GPU-enabled VKS cluster, sized to the workload. VCF Operations provides the GPU and AI-aware monitoring, capacity, and cost views, so you can see GPU utilization and chargeback instead of flying blind. This is where PAIF earns its keep over a pile of bare GPU servers: the GPU becomes a governed, measured, requestable resource with a lifecycle, not a pet. For the broader VCF 9 deployment path that sits under all of this, see the PAIF deployment walkthrough on VCF 9.

Component reference, at a glance

ComponentPlane / ownerJobKey spec
VCF (vSphere, vSAN, NSX)InfrastructureCompute, storage, network substrateVCF 9.1 + PAIF add-on
vGPU + NVIDIA AI EnterpriseGPU platformGPU virtualization and supported AI runtimeC-series vGPU, MIG, DirectPath
GPU OperatorGPU platform (K8s)Automates driver, toolkit, device pluginRuns inside VKS clusters
vSphere Supervisor + VKSGPU platformKubernetes substrate for services and workloadsVCF 9.x
Data Services ManagerDataProvisions and manages the vector databasePostgreSQL 16.8, pgvector 0.8.0
Private AI ServicesServicesModel Gallery, Model Runtime, Data Indexing and Retrieval, Agent BuildervLLM and Infinity engines
HarborServicesOCI registry for model artifacts and air-gapped imagesModel Gallery backend
VCF AutomationControlSelf-service catalog: DLVMs, GPU VKS clustersVCF 9.1
VCF OperationsControlGPU and AI monitoring, capacity, costVCF 9.1

The one limitation to design around

Here is the part the marketing will not lead with. In the current architecture, the Model Runtime and the VKS cluster control plane deploy as single instances rather than as a distributed, cross-zone cluster. The direct consequence: replicas of the same model endpoint must live in the same availability zone, and a model endpoint has no cross-zone fault tolerance. If the zone hosting an endpoint fails, vSphere HA will restart the VMs, but you do not get active-active inference across zones out of the box.

Placing a model endpoint The platform gives single-zone services. Plan availability yourself. Must survive a full zone failurewith no downtime? No Yes Single-zone endpoint One Model Runtime per zone Rely on vSphere HA restart Simplest, lowest cost Accept a short recovery gap Two zones + LB Endpoint in each of two zones Load balancer in front Failover at the app layer You own the HA design
Endpoint placement decision under the single-zone constraint.

So my verdict for anyone designing PAIF for production inference on 9.1: do not assume the platform hands you highly available endpoints. It gives you infrastructure HA (an HA restart) and a single-zone service. If your SLA needs continuous availability through a zone failure, put a load balancer in front of two endpoints in two zones and own the failover at the application layer. Validate three assumptions before you commit a design: your licensed NVAIE version and its driver pairing, your DSM and PostgreSQL sizing for the vector store, and whether your availability target is compatible with single-zone endpoints. Get those three right and the rest of the build is mechanical.

What I’d Do

If I were scoping this for a client tomorrow, I would draw exactly the five planes above on the whiteboard, assign an owner to each, and refuse to move forward until someone could answer the vector-database and zone-failure questions. The architecture is sound and production-ready for most inference workloads, but its value shows up only when you respect the layer boundaries instead of collapsing them into one box. Part 3 takes the next step into licensing: the PAIF add-on, NVAIE, and what actually lands on the quote.

Which plane are you least sure about in your own design? That is usually the one to prototype first.


References

VMware Private AI Series · Part 2 of 30
« Previous: Part 1  |  VMware Private AI Complete Guide  |  Next: Part 3 »

About The Author


Discover more from Dr. Pranay Jha

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

Architect’s Toolkit

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.

Discover more from Dr. Pranay Jha

Subscribe now to keep reading and get access to the full archive.

Continue reading