VMware Private AI Reference Architecture and Sizing: A Practical Blueprint (Private AI Series, Part 7)

How to size a VMware Private AI Foundation build the right way: two-domain design, choosing the deployment model, and working from workload back to GPU hosts and BOM on VCF 9.1.

By Dr. Pranay Jha

June 15, 2026

Updated for VCF 9.1. Reviewed against VMware Private AI Foundation on VCF 9.1 and Private AI Services 2.1. Version numbers and product names are current as of 2026.

VMware Private AI Series · Part 7 of 24

TL;DR · Key Takeaways

A Private AI Foundation reference design is two domains, not one: a standard management domain and a GPU-enabled VI workload domain with vSphere Supervisor. Size them separately.
GPU memory is the headline sizing axis, but vGPU profile homogeneity, the NSX Edge large-node requirement, and DSM and vector-DB overhead are what actually break a build plan.
Pick the deployment model first (Private AI Services, VKS clusters, or Deep Learning VMs), because it drives the vGPU vs DirectPath I/O decision and therefore the host BOM.
Start from concurrency and model memory footprint, then work back to GPUs per host and host count. Do not start from a GPU SKU you already bought.
Leave real headroom: N+1 at the GPU-host level is not optional once Supervisor and NIM workloads are pinned to specific hosts.

Who this is for: Architects and platform leads sizing a first Private AI Foundation build on VCF 9.0 or 9.1. Prerequisites: You have read Parts 4 to 6 on planning, GPU selection, and partitioning, and you know your candidate models and rough concurrency.

Most Private AI sizing conversations start at the wrong end. Someone has already been quoted a pair of H100 servers, and the question becomes “how do we fit our models onto these.” That is backwards. A reference architecture exists so the hardware falls out of the workload, not the other way around. Get the order wrong and you end up with eight-way H100 hosts running a 7B inference endpoint at four percent GPU utilization, while the team that actually needed memory bandwidth waits another quarter for budget.

This part lays out the reference design for VMware Private AI Foundation with NVIDIA (PAIF) the way I walk a customer through it in a design workshop: domains first, then the deployment model, then GPUs and profiles, then host count with headroom. The product names and versions here track VCF 9.0 and 9.1 as of mid-2026.

The two-domain shape of a PAIF design

PAIF is not a separate platform. It is an add-on that sits on top of VMware Cloud Foundation and extends VCF Operations, VCF Automation, and VMware Data Services Manager (DSM) with GPU and AI-aware capabilities. That single fact shapes the whole topology. You are designing a normal VCF instance and then bolting GPU-specific behaviour onto a workload domain inside it.

The clean reference shape is two domains. The management domain is ordinary: vCenter, NSX Manager, the VCF management appliances, SDDC lifecycle, and now the Private AI control surfaces. It carries no GPUs. Then one or more VI workload domains hold the GPU-enabled ESXi hosts, an NSX Edge cluster, and vSphere Supervisor, which is what turns the cluster into a platform for running AI workloads as VMs or as Kubernetes workloads. People sometimes try to collapse this into a single cluster to save hosts. Resist that on anything heading for production. The Supervisor control plane and the NSX Edge nodes have their own placement and availability needs, and you do not want them competing with a NIM pod for a GPU host during a failover.

Figure: The reference shape: a GPU-free management domain feeding a GPU-enabled workload domain with Supervisor and an Edge cluster.
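One way to keep the two-domain rule honest is to write the design down as data and check it before anyone requests hardware. The sketch below is a minimal illustration in Python, not a VCF API: the `Domain` class, its fields, and the domain names are all hypothetical and exist only to encode the rules described above.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    """Hypothetical description of one VCF domain in a design document."""
    name: str
    role: str                      # "management" or "workload"
    gpu_hosts: int = 0
    has_supervisor: bool = False
    has_edge_cluster: bool = False


def check_two_domain_shape(domains):
    """Return a list of design problems; an empty list means the shape holds."""
    problems = []
    mgmt = [d for d in domains if d.role == "management"]
    work = [d for d in domains if d.role == "workload"]

    if len(mgmt) != 1:
        problems.append("expected exactly one management domain")
    for d in mgmt:
        if d.gpu_hosts > 0:
            problems.append(f"{d.name}: management domain should carry no GPUs")
    if not work:
        problems.append("expected at least one GPU-enabled workload domain")
    for d in work:
        if d.gpu_hosts == 0:
            problems.append(f"{d.name}: workload domain has no GPU hosts")
        if not d.has_supervisor:
            problems.append(f"{d.name}: vSphere Supervisor is missing")
        if not d.has_edge_cluster:
            problems.append(f"{d.name}: NSX Edge cluster is missing")
    return problems


# Example design with made-up names and host counts.
design = [
    Domain("mgmt-01", "management"),
    Domain("ai-wld-01", "workload", gpu_hosts=4,
           has_supervisor=True, has_edge_cluster=True),
]
print(check_two_domain_shape(design))  # an empty list for this design
```

A collapsed single-cluster design, expressed as one management domain that also carries GPU hosts, fails this check twice: once for GPUs in the management domain and once for the missing workload domain.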

Pick the deployment model before you pick the GPU

PAIF gives you three ways to run workloads, and the one you choose changes the host BOM. Decide this first.

Private AI Services is the LLM-centric path: a Model Store and Model Runtime that serve model endpoints, with the vector database and RAG plumbing managed for you. In VCF 9.1, organization administrators enable and manage Private AI Services inside namespaces straight from the VCF Automation UI. This is the right default when the use case is conversational AI, RAG, indexing and retrieval, or content generation, and when you want consumers to get an endpoint rather than a cluster.

VKS clusters give a team a GPU-accelerated Kubernetes cluster. The validated recommendation is to deploy them through the VCF Automation catalog, which pre-installs the NVIDIA GPU Operator and a NIM template so an ML engineer never touches a Helm chart from NGC. Drop to manual kubectl deployment only when you genuinely need infrastructure-as-code, custom cluster topology, or the NVIDIA Network Operator for distributed inference and training across hosts. Starting in VCF 9.1, both the catalog and the manual path support vGPU and DirectPath I/O, so you no longer give up self-service to get passthrough.

Deep Learning VMs are the simplest unit: a prebuilt image with the CUDA stack, frameworks, and drivers, handed to a data scientist who wants a GPU box and root. Good for experimentation and bespoke training. Weak for multi-tenant self-service, because you are back to managing VMs by hand.

Figure: Workload shape decides the model; the model narrows the sensible GPU virtualization mode.
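The decision described above can be written as a small lookup, which is handy in a workshop because it forces the requirements to be stated before the hardware. This is a sketch under assumptions: the boolean inputs and both function names are invented for illustration, and the order of the checks (endpoint first, then Kubernetes, then a root VM) is my simplification rather than anything the product enforces.

```python
def choose_deployment_model(wants_endpoint, wants_kubernetes, wants_root_vm):
    """Map workload shape to one of the three PAIF deployment models.

    The three boolean inputs are a simplification invented for this sketch.
    """
    if wants_endpoint:
        return "Private AI Services"
    if wants_kubernetes:
        return "VKS cluster"
    if wants_root_vm:
        return "Deep Learning VM"
    raise ValueError("workload shape not described; revisit requirements")


def choose_vks_path(needs_iac=False, needs_custom_topology=False,
                    needs_network_operator=False):
    """Catalog is the default; manual kubectl only for a concrete reason."""
    if needs_iac or needs_custom_topology or needs_network_operator:
        return "manual kubectl"
    return "VCF Automation catalog"


print(choose_deployment_model(True, False, False))
print(choose_vks_path(needs_network_operator=True))
```

The second function mirrors the three reasons given for dropping to manual deployment: infrastructure-as-code, custom cluster topology, or the NVIDIA Network Operator.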

Sizing from the workload back to the host

The sizing mistake I see most often is treating GPU memory as the only number that matters. It is the first gate, not the last. At two bytes per parameter, a 7B model in FP16 wants roughly 14 GB just for weights (an 8B model is closer to 16 GB), and you then add KV cache that grows with context length and concurrency. That is why two users hitting a long-context endpoint can blow a budget that looked fine on paper. Work the problem in this order: model memory footprint, then per-endpoint concurrency and the KV cache it implies, then the vGPU profile that covers both, then how many of those profiles a physical GPU holds, then hosts.
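The first two steps of that order reduce to simple arithmetic. The sketch below is a rough planning estimate, not a substitute for the PAIF Sizing Guide or a benchmark: the function names are hypothetical, the model shape (32 layers, 32 KV heads, head dimension 128) is an assumed 7B-class architecture without grouped-query attention, and runtime overhead such as activations and framework buffers is deliberately left out.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Weight memory in decimal GB; 2 bytes per parameter corresponds to FP16."""
    return params_billion * bytes_per_param  # (1e9 params * bytes) / 1e9


def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, concurrency,
                bytes_per_value=2):
    """KV cache for `concurrency` sequences, each filled to `context_tokens`.

    The leading factor of 2 covers the separate key and value tensors.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * concurrency / 1e9


# Assumed 7B-class shape: 32 layers, 32 KV heads, head dimension 128.
w = weights_gb(7)
kv = kv_cache_gb(layers=32, kv_heads=32, head_dim=128,
                 context_tokens=4096, concurrency=2)
print(f"weights {w:.1f} GB, KV cache {kv:.1f} GB, total {w + kv:.1f} GB")
# prints: weights 14.0 GB, KV cache 4.3 GB, total 18.3 GB
```

Under these assumptions, two users at a 4,096-token context add about 4.3 GB on top of 14 GB of weights, which is already past a 16 GB profile before any runtime overhead is counted.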

Profile homogeneity is the constraint people forget. On a time-sliced GPU, the default is that every vGPU carved from one physical card uses the same profile. An L40S with 48 GB hands you six L40S-8Q instances when they are homogeneous, but only four if you allow mixed profiles on the same card. So you cannot quietly put one big endpoint and several small ones on the same GPU without paying for it in density. Plan one profile per card, and if you need mixed sizes, spread them across cards or reach for MIG on the GPUs that support it. That single rule changes host counts more than any benchmark number.

Figure: Homogeneous profiles maximise density; mixing profiles or skipping host overhead quietly costs you instances.
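The last two steps, profiles per card and then hosts, can be sketched the same way. Treat this as an illustration only: the function names, the figure of 40 endpoints, and the two GPUs per host are hypothetical, and dividing frame buffer by profile size is a simplification that happens to match the six L40S-8Q instances quoted above. The authoritative per-profile maximums come from the NVIDIA vGPU documentation for your release.

```python
import math


def instances_per_gpu(gpu_mem_gb, profile_mem_gb):
    """Homogeneous time-sliced vGPUs that fit on one physical card."""
    return gpu_mem_gb // profile_mem_gb


def gpu_hosts_needed(endpoints, gpu_mem_gb, profile_mem_gb, gpus_per_host,
                     spare_hosts=1):
    """Hosts required for `endpoints` same-profile vGPUs, plus spare hosts."""
    per_host = instances_per_gpu(gpu_mem_gb, profile_mem_gb) * gpus_per_host
    if per_host == 0:
        raise ValueError("profile does not fit on this GPU")
    return math.ceil(endpoints / per_host) + spare_hosts


print(instances_per_gpu(48, 8))                       # 6, the L40S-8Q case
print(gpu_hosts_needed(40, 48, 8, gpus_per_host=2))   # 5: four hosts plus N+1
```

The `spare_hosts=1` default encodes the N+1 rule at the GPU-host level, so the headroom is part of the calculation rather than something added to the quote afterwards.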

The table below is the kind of starting matrix I hand a customer. Treat the numbers as planning anchors, not guarantees: validate against the current PAIF Sizing Guide and your own benchmark, because KV cache and quantization move these figures around.

Workload tier	Sensible GPU	Mode	Endpoints / GPU	Design note
Small LLM (up to 8B), RAG	L40S 48 GB	vGPU time-slice	Up to 6x 8Q (homogeneous)	Best density per dollar for inference
Mid LLM (13B to 34B)	H100 80 GB	MIG or full vGPU	1 to 2 per GPU	MIG gives hard isolation for tenants
Large LLM (70B+)	H200 141 GB / Blackwell	DirectPath I/O, multi-GPU	Model spans GPUs	Needs NVLink and the Network Operator
Distributed training	H100 / H200 multi-host	DirectPath I/O	Whole GPUs	RDMA fabric is the real bottleneck
Mixed dev / experimentation	L40S / A100	vGPU on DL VMs	Varies	Cap profiles so one user cannot starve others

The data path you have to size for, not just the GPUs

A reference architecture that stops at GPU hosts is half a design. The request path through a Private AI Services endpoint touches the Model Runtime, the served model on the GPU, and for RAG a vector database, which on PAIF is PostgreSQL with pgvector provisioned through Data Services Manager. Teams routinely under-size that vector tier. Embeddings are not free, the database wants real CPU, memory, and fast storage, and at retrieval-heavy scale it competes for the same vSAN you sized for VM disks. Put the DSM-managed Postgres on its own resourcing plan, not as an afterthought riding on whatever is left.

Figure: A RAG endpoint is a chain. Size the vector database tier deliberately, not as leftovers.
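A first-pass storage number for the vector tier is easy to produce and makes the "leftovers" problem visible. The sketch below assumes embeddings are stored in a pgvector `vector` column, which the pgvector documentation sizes at 4 bytes per dimension plus 8 bytes per vector. The function name, the corpus size, the 1024-dimension embedding, and the `index_overhead_factor` are hypothetical planning inputs, and the result excludes the chunk text, metadata, and write-ahead log.

```python
def pgvector_raw_gb(num_chunks, dimensions, index_overhead_factor=1.0):
    """Storage for embeddings in a pgvector `vector` column, in decimal GB.

    pgvector documents 4 * dimensions + 8 bytes per vector. The
    index_overhead_factor is a planning assumption for this sketch, not a
    measured value: 1.0 means vectors only, 2.0 means the index doubles it.
    """
    bytes_per_vector = 4 * dimensions + 8
    return num_chunks * bytes_per_vector * index_overhead_factor / 1e9


# Hypothetical corpus: 10 million chunks embedded at 1024 dimensions.
vectors_only = pgvector_raw_gb(10_000_000, 1024)
with_index = pgvector_raw_gb(10_000_000, 1024, index_overhead_factor=2.0)
print(f"{vectors_only:.1f} GB of vectors, {with_index:.1f} GB with index allowance")
```

With these made-up inputs the vectors alone come to roughly 41 GB, which is a useful reminder that the database also needs memory sized to keep the working set and index hot, not just disk.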

Two more placement rules belong in any reference build. First, the NSX Edge cluster in the workload domain must use large-sized Edge nodes, because Supervisor will not deploy against smaller ones. I have watched a bring-up stall for a day on exactly this. Second, account for software interlock. Private AI Services 2.1 ships against NVIDIA GPU Operator 25.10.1, with NVIDIA AI Enterprise (NVAIE) 7 vGPU drivers and NIM Operator 3.1.0 in the current stacks. There is a live known issue in that combination where ModelRuntime pods that request GPUs can fail to start, so your design should pin and test the exact operator and driver versions rather than letting a cluster pull “latest.” Version drift between the host vGPU driver and the GPU Operator is the single most common bring-up failure I see, and it is entirely avoidable with a fixed BOM.
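A fixed software BOM is only useful if something compares it against what is actually deployed. The sketch below shows the comparison step in isolation. The version strings are the ones quoted above for Private AI Services 2.1; the dictionary keys, the `bom_drift` function, and the `observed` values are hypothetical, and how you collect the observed versions from a cluster is left out on purpose.

```python
PINNED_BOM = {
    # Versions quoted in this article for Private AI Services 2.1.
    "gpu_operator": "25.10.1",
    "nim_operator": "3.1.0",
}


def bom_drift(observed, pinned=PINNED_BOM):
    """Return {component: (pinned, observed)} for every mismatch.

    A missing component or a floating tag such as "latest" counts as drift,
    because neither matches the pinned version string.
    """
    drift = {}
    for component, want in pinned.items():
        have = observed.get(component, "")
        if have != want:
            drift[component] = (want, have)
    return drift


# Hypothetical values, as if collected from a cluster inventory.
observed = {"gpu_operator": "latest", "nim_operator": "3.1.0"}
print(bom_drift(observed))
```

Running a check like this as a gate before bring-up, and again after any lifecycle operation, turns version drift from a debugging exercise into a failed pre-check.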

Disclaimer: Sizing figures here are planning anchors. Before you commit a purchase or a build, validate the GPU and server BOM against the VMware Compatibility Guide, confirm host vGPU driver and GPU Operator interoperability for your exact PAIF release, check the current PAIF Sizing Guide, and benchmark with your own models and concurrency. Carry N+1 GPU-host headroom from day one.

My take on the trade-offs

Three positions I will defend. First, default to L40S for inference and RAG, and stop reaching for H100 reflexively. Most enterprise inference is memory-bandwidth and density bound, not raw FLOPS bound, and an L40S host running six homogeneous endpoints will beat a half-idle H100 host on cost per served request every time. Buy H100 and H200 where the model genuinely will not fit or where you are training, not because the slide deck said so.

Second, use the VCF Automation catalog path unless you have a concrete reason not to. The infrastructure-as-code purists will push for manual kubectl on principle. Unless you actually need the Network Operator for distributed work or a custom topology, manual deployment buys you error-prone Helm wrangling and a versioning burden for no real benefit, and VCF 9.1 removed the old excuse by giving the catalog DirectPath I/O too.

Third, the agentic-AI sizing hype is getting ahead of reality. Plenty of designs are now being padded with GPU capacity for “agents” that are really a few orchestrated LLM calls. Size for the inference you can measure today, leave headroom in hosts and power, and add GPUs when a real workload shows up. Speculative capacity sitting idle is the most expensive line item in a private AI build.

If you are still settling the GPU and partitioning choices that feed this design, work back through choosing the right GPU for VMware Private AI and GPU partitioning by design, and confirm host readiness against the planning and prerequisites guide. The full roadmap lives on the VMware Private AI complete guide.

What I’d Do

Lock the two-domain shape, choose the deployment model from the workload, then size GPU memory and concurrency before you touch a server quote. Hold one profile per card, reserve host overhead and an N+1 GPU host, give the vector tier its own plan, and pin the operator and driver versions. Do that and the BOM writes itself, defensibly. Part 8 moves from paper to the rack: preparing the VCF workload domain and the GPU hosts for bring-up. What is the first sizing assumption you plan to challenge on your build?

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
