TL;DR · Key Takeaways
- A Private AI Foundation reference design is two domains, not one: a standard management domain and a GPU-enabled VI workload domain with vSphere Supervisor. Size them separately.
- GPU memory is the headline sizing axis, but vGPU profile homogeneity, the NSX Edge large-node requirement, and DSM and vector-DB overhead are what actually break a build plan.
- Pick the deployment model first (Private AI Services, VKS clusters, or Deep Learning VMs), because it drives the vGPU vs DirectPath I/O decision and therefore the host BOM.
- Start from concurrency and model memory footprint, then work back to GPUs per host and host count. Do not start from a GPU SKU you already bought.
- Leave real headroom: N+1 at the GPU-host level is not optional once Supervisor and NIM workloads are pinned to specific hosts.
Most Private AI sizing conversations start at the wrong end. Someone has already been quoted a pair of H100 servers, and the question becomes “how do we fit our models onto these.” That is backwards. A reference architecture exists so the hardware falls out of the workload, not the other way around. Get the order wrong and you end up with eight-way H100 hosts running a 7B inference endpoint at four percent GPU utilization, while the team that actually needed memory bandwidth waits another quarter for budget.
This part lays out the reference design for VMware Private AI Foundation with NVIDIA (PAIF) the way I walk a customer through it in a design workshop: domains first, then the deployment model, then GPUs and profiles, then host count with headroom. The product names and versions here track VCF 9.0 and 9.1 as of mid-2026.
The two-domain shape of a PAIF design
PAIF is not a separate platform. It is an add-on that sits on top of VMware Cloud Foundation and extends VCF Operations, VCF Automation, and VMware Data Services Manager (DSM) with GPU and AI-aware capabilities. That single fact shapes the whole topology. You are designing a normal VCF instance and then bolting GPU-specific behaviour onto a workload domain inside it.
The clean reference shape is two domains. The management domain is ordinary: vCenter, NSX Manager, the VCF management appliances, SDDC lifecycle, and now the Private AI control surfaces. It carries no GPUs. Then one or more VI workload domains hold the GPU-enabled ESXi hosts, an NSX Edge cluster, and vSphere Supervisor, which is what turns the cluster into a platform for running AI workloads as VMs or as Kubernetes workloads. People sometimes try to collapse this into a single cluster to save hosts. Resist that on anything heading for production. The Supervisor control plane and the NSX Edge nodes have their own placement and availability needs, and you do not want them competing with a NIM pod for a GPU host during a failover.
Pick the deployment model before you pick the GPU
PAIF gives you three ways to run workloads, and the one you choose changes the host BOM. Decide this first.
Private AI Services is the LLM-centric path: a Model Store and Model Runtime that serve model endpoints, with the vector database and RAG plumbing managed for you. In VCF 9.1, organization administrators enable and manage Private AI Services inside namespaces straight from the VCF Automation UI. This is the right default when the use case is conversational AI, RAG, indexing and retrieval, or content generation, and when you want consumers to get an endpoint rather than a cluster.
VKS clusters give a team a GPU-accelerated Kubernetes cluster. The validated recommendation is to deploy them through the VCF Automation catalog, which pre-installs the NVIDIA GPU Operator and a NIM template so an ML engineer never touches a Helm chart from NGC. Drop to manual kubectl deployment only when you genuinely need infrastructure-as-code, custom cluster topology, or the NVIDIA Network Operator for distributed inference and training across hosts. Starting in VCF 9.1, both the catalog and the manual path support vGPU and DirectPath I/O, so you no longer give up self-service to get passthrough.
Deep Learning VMs are the simplest unit: a prebuilt image with the CUDA stack, frameworks, and drivers, handed to a data scientist who wants a GPU box and root. Good for experimentation and bespoke training. Weak for multi-tenant self-service, because you are back to managing VMs by hand.
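The choice above can be written down as a small decision function. This is a minimal sketch of the guidance in this section, not a VMware tool: the `Workload` fields and the `pick_deployment_model` name are hypothetical, and the four yes/no questions are my simplification of a real workshop conversation.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All four flags are illustrative assumptions, not product settings.
    wants_endpoint_only: bool     # consumers get a model endpoint, not a cluster
    needs_kubernetes: bool        # team wants its own GPU-accelerated cluster
    needs_network_operator: bool  # distributed inference or training across hosts
    needs_custom_topology: bool   # infrastructure-as-code or bespoke cluster layout


def pick_deployment_model(w: Workload) -> str:
    """Map a workload description to one of the three PAIF deployment models."""
    if w.wants_endpoint_only:
        return "Private AI Services"
    if w.needs_kubernetes:
        # Drop to the manual path only for a concrete reason.
        if w.needs_network_operator or w.needs_custom_topology:
            return "VKS cluster (manual kubectl)"
        return "VKS cluster (VCF Automation catalog)"
    # A data scientist who wants a GPU box and root.
    return "Deep Learning VM"


rag_team = Workload(True, False, False, False)
training_team = Workload(False, True, True, False)
print(pick_deployment_model(rag_team))
print(pick_deployment_model(training_team))
```

The ordering is deliberate: the endpoint question comes first because it is the default for LLM use cases, and the Deep Learning VM is the fallback rather than the starting point.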
Sizing from the workload back to the host
The sizing mistake I see most often is treating GPU memory as the only number that matters. It is the first gate, not the last. A 7B model in FP16 wants roughly 14 to 16 GB just for weights, and you then add KV cache that grows with context length and concurrency. That is why two users hitting a long-context endpoint can blow a budget that looked fine on paper. Work the problem in this order: model memory footprint, then per-endpoint concurrency and the KV cache it implies, then the vGPU profile that covers both, then how many of those profiles a physical GPU holds, then hosts.
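The first two steps of that order can be roughed out in a few lines. This is a back-of-envelope sketch, not the PAIF Sizing Guide method: the function names are mine, the model shape (32 layers, 32 KV heads, head dimension 128) is an assumed 7B-class configuration, and it ignores runtime overhead, activation memory, and quantization.

```python
GB = 1e9  # decimal gigabytes, to match how GPU memory is quoted


def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights alone (FP16 = 2 bytes per parameter)."""
    return params_billions * 1e9 * bytes_per_param / GB


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_seqs: int,
                bytes_per_value: float = 2.0) -> float:
    """KV cache: one key and one value vector per layer, per KV head, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * concurrent_seqs / GB


# Assumed 7B-class shape; check the real model card before using these numbers.
weights = weights_gb(7)
kv = kv_cache_gb(32, 32, 128, context_tokens=32_768, concurrent_seqs=2)
print(f"weights {weights:.1f} GB + KV cache {kv:.1f} GB = {weights + kv:.1f} GB")
```

Under those assumptions, two concurrent 32K-token sequences need about 34 GB of KV cache on top of 14 GB of weights, which already fills a 48 GB card before any runtime overhead. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache, which is one reason the real figure has to come from the model you actually serve.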
Profile homogeneity is the constraint people forget. On a time-sliced GPU, every vGPU carved from one physical card should use the same profile if you want full density. An L40S with 48 GB hands you six L40S-8Q instances when they are homogeneous, but only four if you try to mix profiles on the same card. So you cannot quietly put one big endpoint and several small ones on the same GPU without paying for it in capacity. Plan one profile per card, and if you need mixed sizes, spread them across cards or reach for MIG on the GPUs that support it. That single rule changes host counts more than any benchmark number.
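To see how much the one-profile-per-card rule moves the count, here is a minimal sketch that packs a demand list onto cards with exactly one profile each. The function names are hypothetical, profile sizes are treated as plain frame-buffer gigabytes, and the reduced instance count you get when mixing profiles on a card is deliberately not modeled.

```python
def vgpus_per_card(card_gb: int, profile_gb: int) -> int:
    """How many same-sized vGPUs fit on one physical card."""
    return card_gb // profile_gb


def cards_needed(demand: dict, card_gb: int) -> int:
    """Cards required when each card carries exactly one profile.

    demand maps profile size in GB to the number of vGPUs wanted.
    """
    total = 0
    for profile_gb, count in demand.items():
        per_card = vgpus_per_card(card_gb, profile_gb)
        if per_card == 0:
            raise ValueError(f"{profile_gb} GB profile does not fit a {card_gb} GB card")
        total += -(-count // per_card)  # ceiling division
    return total


# Seven 8 GB endpoints plus one 24 GB endpoint on 48 GB cards.
demand = {8: 7, 24: 1}
naive = -(-sum(size * n for size, n in demand.items()) // 48)
print("cards by raw memory:", naive)
print("cards with one profile per card:", cards_needed(demand, 48))
```

In this example raw memory arithmetic says two cards, while one profile per card needs three: two for the 8 GB endpoints and a third, half empty, for the single 24 GB endpoint.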
The table below is the kind of starting matrix I hand a customer. Treat the numbers as planning anchors, not guarantees: validate against the current PAIF Sizing Guide and your own benchmark, because KV cache and quantization move these figures around.
| Workload tier | Sensible GPU | Mode | Endpoints / GPU | Design note |
|---|---|---|---|---|
| Small LLM (up to 8B), RAG | L40S 48 GB | vGPU time-slice | Up to 6x 8Q (homogeneous) | Best density per dollar for inference |
| Mid LLM (13B to 34B) | H100 80 GB | MIG or full vGPU | 1 to 2 per GPU | MIG gives hard isolation for tenants |
| Large LLM (70B+) | H200 141 GB / Blackwell | DirectPath I/O, multi-GPU | Model spans GPUs | Needs NVLink and the Network Operator |
| Distributed training | H100 / H200 multi-host | DirectPath I/O | Whole GPUs | RDMA fabric is the real bottleneck |
| Mixed dev / experimentation | L40S / A100 | vGPU on DL VMs | Varies | Cap profiles so one user cannot starve others |
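To turn one row of the matrix into a host count, work from endpoints to GPUs to hosts and then add the spare. This is a minimal sketch with hypothetical names and example inputs; it assumes every endpoint in the tier uses the same profile and that one spare host is enough headroom for the cluster size in question.

```python
import math


def gpu_hosts(endpoints: int, endpoints_per_gpu: int,
              gpus_per_host: int, spare_hosts: int = 1) -> int:
    """Hosts needed for a single workload tier, including N+1 headroom."""
    gpus = math.ceil(endpoints / endpoints_per_gpu)
    hosts = math.ceil(gpus / gpus_per_host)
    return hosts + spare_hosts


# Example inputs only: 40 small endpoints, 6 per GPU, 4 GPUs per host.
print(gpu_hosts(endpoints=40, endpoints_per_gpu=6, gpus_per_host=4))
```

Run it per tier rather than on a blended total, because tiers that need different GPUs or modes do not share hosts.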
The data path you have to size for, not just the GPUs
A reference architecture that stops at GPU hosts is half a design. The request path through a Private AI Services endpoint touches the Model Runtime, the served model on the GPU, and for RAG a vector database, which on PAIF is PostgreSQL with pgvector provisioned through Data Services Manager. Teams routinely under-size that vector tier. Embeddings are not free, the database wants real CPU, memory, and fast storage, and at retrieval-heavy scale it competes for the same vSAN you sized for VM disks. Put the DSM-managed Postgres on its own resourcing plan, not as an afterthought riding on whatever is left.
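A first-pass storage figure for the vector tier is simple arithmetic. The sketch below assumes pgvector's single-precision `vector` type at 4 bytes per dimension plus 8 bytes per vector, which is how I read the pgvector documentation; the `planning_factor` default is an arbitrary placeholder for indexes, row overhead, and growth, not a measured value, and both function names are mine.

```python
GB = 1e9  # decimal gigabytes


def raw_vector_gb(num_chunks: int, dimensions: int) -> float:
    """Raw embedding storage, assuming 4 bytes per dimension plus 8 bytes per vector."""
    bytes_per_vector = 4 * dimensions + 8
    return num_chunks * bytes_per_vector / GB


def planned_vector_gb(num_chunks: int, dimensions: int,
                      planning_factor: float = 2.0) -> float:
    """Raw storage scaled by a planning factor (placeholder; replace with a measured value)."""
    return raw_vector_gb(num_chunks, dimensions) * planning_factor


# Example inputs only: 10 million chunks embedded at 1024 dimensions.
print(f"raw: {raw_vector_gb(10_000_000, 1024):.2f} GB")
print(f"planned: {planned_vector_gb(10_000_000, 1024):.2f} GB")
```

This covers the embeddings only. The chunk text, metadata columns, and WAL sit on top of it, which is the part that usually gets left off the plan.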
Two more placement rules belong in any reference build. First, the NSX Edge cluster in the workload domain must use large-sized Edge nodes, because Supervisor will not deploy against smaller ones. I have watched a bring-up stall for a day on exactly this. Second, account for software interlock. Private AI Services 2.1 ships against NVIDIA GPU Operator 25.10.1, with NVAIE 7 vGPU drivers and NIM Operator 3.1.0 in the current stacks. There is a live known issue in that combination where ModelRuntime pods that request GPUs can fail to start, so your design should pin and test the exact operator and driver versions rather than letting a cluster pull “latest.” Version drift between the host vGPU driver and the GPU Operator is the single most common bring-up failure I see, and it is entirely avoidable with a fixed BOM.
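Pinning is easier to enforce when the BOM is data that something can check. Here is a minimal sketch of a drift check using the versions named above; the component keys and the `bom_drift` function are hypothetical, and how you collect the observed versions from a cluster is left out on purpose.

```python
from typing import Dict, Optional, Tuple

# Versions taken from the stack described above; the key names are illustrative.
PINNED_BOM: Dict[str, str] = {
    "gpu_operator": "25.10.1",
    "nim_operator": "3.1.0",
    "nvaie_vgpu_driver_branch": "7",
}


def bom_drift(observed: Dict[str, str],
              pinned: Dict[str, str] = PINNED_BOM
              ) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {component: (pinned, observed)} for every mismatch or missing entry."""
    drift = {}
    for component, want in pinned.items():
        got = observed.get(component)
        if got != want:
            drift[component] = (want, got)
    return drift


# Example: one component pulled "latest", one not reported at all.
observed = {"gpu_operator": "25.10.1", "nim_operator": "latest"}
for component, (want, got) in bom_drift(observed).items():
    print(f"{component}: pinned {want}, observed {got}")
```

An empty result means the cluster matches the BOM. Anything else should block bring-up until it is explained.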
My take on the trade-offs
Three positions I will defend. First, default to L40S for inference and RAG, and stop reaching for H100 reflexively. Most enterprise inference is density bound, not raw FLOPS bound, and an L40S host running six homogeneous endpoints will usually beat a half-idle H100 host on cost per served request. Buy H100 and H200 where the model genuinely will not fit, where you need the memory bandwidth, or where you are training, not because the slide deck said so.
Second, use the VCF Automation catalog path unless you have a concrete reason not to. The infrastructure-as-code purists will push for manual kubectl on principle. Unless you actually need the Network Operator for distributed work or a custom topology, manual deployment buys you error-prone Helm wrangling and a versioning burden for no real benefit, and VCF 9.1 removed the old excuse by giving the catalog DirectPath I/O too.
Third, the agentic-AI sizing hype is getting ahead of reality. Plenty of designs are now being padded with GPU capacity for “agents” that are really a few orchestrated LLM calls. Size for the inference you can measure today, leave headroom in hosts and power, and add GPUs when a real workload shows up. Speculative capacity sitting idle is the most expensive line item in a private AI build.
If you are still settling the GPU and partitioning choices that feed this design, work back through choosing the right GPU for VMware Private AI and GPU partitioning by design, and confirm host readiness against the planning and prerequisites guide. The full roadmap lives on the VMware Private AI complete guide.
What I’d Do
Lock the two-domain shape, choose the deployment model from the workload, then size GPU memory and concurrency before you touch a server quote. Hold one profile per card, reserve host overhead and an N+1 GPU host, give the vector tier its own plan, and pin the operator and driver versions. Do that and the BOM writes itself, defensibly. Part 8 moves from paper to the rack: preparing the VCF workload domain and the GPU hosts for bring-up. What is the first sizing assumption you plan to challenge on your build?
References
- VMware Private AI Foundation with NVIDIA 9.1, Broadcom TechDocs
- AI and ML Platform Based on VMware Kubernetes Services Model, VCF 9.1 Design Library
- VCF 9.1: The Secure, Cost-Effective Private Cloud Platform for Production AI, VCF Blog
- Sizing AI Workloads on VMware Cloud Foundation (PAIF Sizing Guide)