GPU Partitioning for VMware Private AI: Choosing Between vGPU, MIG and Passthrough (Private AI Series, Part 6)

Time-sliced vGPU, MIG-backed vGPU, GPU passthrough and the new ESXi 9 Update 1 hybrid mode each fit different Private AI workloads. Here is how to design the split, with a capability matrix and a reference topology.

by Dr. Pranay Jha · June 15, 2026


VMware Private AI Series · Part 6 of 24

TL;DR · Key Takeaways

There are now four GPU sharing modes on VCF 9, not three: full time-slicing, MIG-backed vGPU, the ESXi 9 Update 1 time-sliced MIG-backed hybrid, and full passthrough (DirectPath I/O).
Two hard constraints decide most designs: NVIDIA NIM cannot run on MIG, and NVLink peer-to-peer only works on full-frame-buffer time-sliced vGPU. Both push you toward time-slicing.
MIG buys hardware isolation, fault containment, and roughly 33% more inference throughput per GPU under load, at the cost of NVLink and a hard cap of 7 instances.
The pragmatic pattern is two pools: a time-sliced cluster for NIM inference and a MIG cluster for isolated training and fine-tuning. ESXi 9 Update 1 lets you fold both onto one GPU.
Read the vGPU profile name: a middle digit (h100xm-1-10c) means MIG-backed; no middle digit (h100xm-20c) means time-sliced.

Who this is for: VCF architects and platform teams designing GPU capacity for Private AI Foundation. Prerequisites: a sized GPU host BOM (Part 5), a VCF 9 workload domain plan, and an NVIDIA AI Enterprise entitlement.

Two architects bring me the same H100 hosts and ask the same question, and they want opposite answers. One is standing up a RAG chatbot on NIM and needs density and NVLink. The other is running isolated fine-tuning jobs for three business units that are not allowed to see each other’s data. Same silicon, same VCF 9 workload domain, and the correct partitioning mode is different for each. Picking it wrong is not a tuning detail you fix later. It dictates which workloads will even run, whether your NVSwitch fabric is reachable, and how big the blast radius is when one tenant’s CUDA process dies.

This part is the design treatment of GPU partitioning under VMware Private AI Foundation with NVIDIA. It follows on from choosing the right GPU and the broader planning and prerequisites work. The hardware is bought; now you decide how to slice it.

Four ways to carve up one GPU

Time-slicing is the original mechanism. The physical GPU is not partitioned; every vGPU on it shares the same engines and the same VRAM pool, and they take turns. While one vGPU runs, it owns the engines for its slice, then yields. You get maximum density and full flexibility, and you keep NVLink peer-to-peer for full-frame-buffer profiles. What you do not get is a wall between tenants. A process that over-allocates memory can trigger out-of-memory errors for its neighbours, and a fatal fault on one vGPU can reset the whole physical GPU and take every co-resident VM down with it.

MIG (Multi-Instance GPU) is the opposite philosophy. On Ampere and newer data center GPUs, the device is physically divided into up to seven instances, each with its own streaming multiprocessors, memory controllers, L2 cache banks, and DRAM. This is silicon partitioning, not a scheduler. Instances run in parallel with zero contention, and a crash in one is contained to that instance. The price is rigidity: a fixed maximum of seven instances in fixed profile shapes, no NVLink peer-to-peer, and you cannot resize a live instance without destroying it and its workload.

ESXi 9 Update 1 (build 9.0.1.0, shipped September 2025) adds a third option that did not exist before: time-sliced MIG-backed vGPU. It puts MIG walls between groups and time-shares within a single MIG instance, giving you a middle tier of isolation plus density on the same card. The fourth option is the bluntest: full GPU passthrough via DirectPath I/O, where one VM owns the entire physical device. Highest raw performance, lowest density, and you give up vMotion, snapshots, and the operational sugar that makes the rest of the platform pleasant to run.

The same physical GPU under each of the four sharing modes available on VCF 9.

The capability matrix

Before the design decisions, line the four modes up against the dimensions that actually drive an architecture review. Note the throughput figures: NVIDIA’s own A100 inference data puts MIG at roughly 1.00 requests per second per GPU against 0.76 for time-slicing under concurrent load, about a 33% gap, because MIG’s dedicated memory controllers remove the contention that time-slicing suffers. Time-slicing wins on single-task latency for bursty work, but that edge collapses the moment the GPU is busy.

Dimension	Time-slicing	MIG-backed	Hybrid	Passthrough
Isolation	Soft, shared pool	Hardware, per instance	Hardware between slices	Dedicated GPU
Max density	Highest	Up to 7	Medium	1 VM
NVLink P2P	Full-FB profiles only	Disabled	Disabled	Yes (non-NVSwitch)
NIM compatible	Yes	No	Time-sliced portion only	Yes
Fault blast radius	Whole physical GPU	One instance	One slice	One VM
Inference throughput	~0.76 req/s (A100)	~1.00 req/s (+33%)	Between the two	Bare-GPU
Min ESXi	All	9 GA (vGPU 19.0+)	9 Update 1	All
Best for	NIM inference, dev/test	Isolated training, multi-tenant	Group isolation + sharing	Max-throughput single workload
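The throughput row is simple arithmetic, and it translates directly into GPU counts. The sketch below is a minimal worked example that assumes the two rounded A100 figures from the matrix (1.00 and 0.76 requests per second per GPU) hold under your own load. The function names and the 10 req/s target are illustrative only and do not come from NVIDIA or VMware tooling. With the rounded inputs the gain works out to about 32%, in line with the roughly 33% quoted.

```python
import math

# Rounded per-GPU inference figures quoted in the matrix (A100, concurrent load).
MIG_REQ_PER_SEC = 1.00
TIME_SLICED_REQ_PER_SEC = 0.76


def relative_gain(faster: float, slower: float) -> float:
    """Fractional throughput gain of `faster` over `slower`."""
    return (faster - slower) / slower


def gpus_needed(target_req_per_sec: float, per_gpu_req_per_sec: float) -> int:
    """Whole GPUs required to sustain a target request rate."""
    return math.ceil(target_req_per_sec / per_gpu_req_per_sec)


if __name__ == "__main__":
    gain = relative_gain(MIG_REQ_PER_SEC, TIME_SLICED_REQ_PER_SEC)
    print(f"MIG gain over time-slicing: {gain:.1%}")

    target = 10.0  # hypothetical aggregate load, requests per second
    print("GPUs needed on MIG:", gpus_needed(target, MIG_REQ_PER_SEC))
    print("GPUs needed time-sliced:", gpus_needed(target, TIME_SLICED_REQ_PER_SEC))
```

Treat the result as a first-pass sizing input, not a benchmark: the per-GPU figures are A100 numbers, and an H100 or a different model mix will shift both of them.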

The two constraints that decide most designs

Most of the partitioning debate evaporates once you apply two non-negotiable rules. They are the first things I check on a whiteboard, before anyone argues about density.

NIM cannot run on MIG. NVIDIA NIM microservices, the primary inference runtime in Private AI Foundation, are not supported on MIG sharing mode as of VCF 9.0. If NIM serves your RAG pipelines, chatbots, or agentic workloads, those GPUs must be in time-slicing mode. Full stop. This single rule pushes the entire inference path onto time-sliced vGPU, which is why so many real Private AI clusters end up time-sliced even though MIG looks tidier on paper.

NVLink peer-to-peer needs full-frame-buffer time-slicing. Any workload that spans multiple physical GPUs over NVLink, large model training or multi-GPU inference, only gets P2P on time-sliced vGPUs that hold the entire frame buffer. MIG disables NVLink P2P under every ESXi version, with no exception. And there is a sharper edge for HGX-class hosts: NVSwitch fabric is not supported for vGPU on vSphere at all, and passthrough is not supported on NVSwitch systems either. On an 8-way SXM5 HGX box under ESXi, full-frame-buffer time-sliced vGPU is effectively your only way to reach the 900 GB/s mesh you paid for. Put MIG on that box and you are renting a Ferrari to sit in traffic.

My take: if a vendor slide tells you MIG is the default best practice for Private AI, push back. It is the right default only for the training and isolation path. The inference path that most enterprises actually deploy first is a time-slicing story.
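The two rules above amount to an elimination filter over the sharing modes. The sketch below is one way to write that filter down, assuming only what this article states: NIM is unsupported on MIG, NVLink P2P survives only on full-frame-buffer time-sliced vGPU or on passthrough, and passthrough is unavailable on NVSwitch systems. The mode labels and the `eligible_modes` function are hypothetical names for illustration. The hybrid mode is left out on purpose, because the matrix qualifies its NIM support rather than giving a flat yes or no.

```python
# Mode labels are illustrative, not product identifiers.
FULL_FB = "time-sliced (full frame buffer)"
FRACTIONAL = "time-sliced (fractional)"
MIG = "MIG-backed"
PASSTHROUGH = "passthrough"

ALL_MODES = [FULL_FB, FRACTIONAL, MIG, PASSTHROUGH]


def eligible_modes(needs_nim: bool,
                   needs_nvlink_p2p: bool,
                   nvswitch_host: bool = False) -> list[str]:
    """Return the modes left standing after applying the hard constraints."""
    modes = list(ALL_MODES)

    if nvswitch_host:
        # Per the article, passthrough is not supported on NVSwitch systems.
        modes.remove(PASSTHROUGH)

    if needs_nim:
        # NIM is not supported on MIG sharing mode.
        modes = [m for m in modes if m != MIG]

    if needs_nvlink_p2p:
        # Only full-frame-buffer time-slicing and passthrough keep NVLink P2P.
        modes = [m for m in modes if m in (FULL_FB, PASSTHROUGH)]

    return modes


if __name__ == "__main__":
    # RAG chatbot on NIM, multi-GPU over NVLink, on an HGX-class host.
    print(eligible_modes(needs_nim=True, needs_nvlink_p2p=True, nvswitch_host=True))
    # Isolated fine-tuning that needs neither NIM nor NVLink.
    print(eligible_modes(needs_nim=False, needs_nvlink_p2p=False))
```

Running the filter first, before any density discussion, mirrors the whiteboard order described above: the constraints remove options, and only the survivors are worth comparing on density or isolation.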

Reading the vGPU profile name

The profile name is the fastest way to tell what you are actually allocating, and people get burned because they skim it. The middle digit is the tell.

Profile naming: the middle segment is the difference between a hardware slice and a time-share.

Here is how the options lay out on a single H100 80GB. Time-sliced fractional profiles divide the frame buffer; MIG profiles map to fixed instance shapes. Remember NVLink P2P only survives on the full-frame-buffer time-sliced profile.

Profile	Mode	Frame buffer	Per H100 80GB	NVLink P2P
nvidia_h100xm-80c	Time-sliced, full FB	80 GB	1	Yes
nvidia_h100xm-20c	Time-sliced	20 GB	4	No (fractional)
nvidia_h100xm-3-40c	MIG-backed (3g.40gb)	40 GB	2	No
nvidia_h100xm-1-10c	MIG-backed (1g.10gb)	10 GB	7	No
DirectPath I/O	Passthrough	80 GB	1 VM	Yes (non-NVSwitch)
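The naming rule is mechanical enough to check in a script before a profile is assigned. The sketch below is a minimal parser that assumes profile names follow only the two shapes shown in the table (`model-FBc` and `model-SLICES-FBc`, with an optional `nvidia_` prefix). `VgpuProfile`, `parse_profile`, and `nvlink_p2p_eligible` are hypothetical helper names, not part of any NVIDIA or VMware API, and other suffix variants that exist in the wild are not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# model, optional MIG slice count (the "middle digit"), frame buffer in GB, series letter
_PROFILE_RE = re.compile(
    r"^(?:nvidia_)?(?P<model>[a-z0-9]+)-(?:(?P<slices>\d+)-)?(?P<fb>\d+)(?P<series>[a-z])$"
)


@dataclass(frozen=True)
class VgpuProfile:
    model: str
    mig_slices: Optional[int]  # None means no middle segment, i.e. time-sliced
    frame_buffer_gb: int
    series: str

    @property
    def mode(self) -> str:
        return "MIG-backed" if self.mig_slices is not None else "time-sliced"


def parse_profile(name: str) -> VgpuProfile:
    """Split a vGPU profile name into its segments; raise on unknown shapes."""
    match = _PROFILE_RE.match(name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised vGPU profile name: {name!r}")
    slices = match.group("slices")
    return VgpuProfile(
        model=match.group("model"),
        mig_slices=int(slices) if slices is not None else None,
        frame_buffer_gb=int(match.group("fb")),
        series=match.group("series"),
    )


def nvlink_p2p_eligible(profile: VgpuProfile, gpu_memory_gb: int) -> bool:
    """Per the table: only a time-sliced profile holding the full frame buffer keeps P2P."""
    return profile.mig_slices is None and profile.frame_buffer_gb == gpu_memory_gb


if __name__ == "__main__":
    for name in ("nvidia_h100xm-80c", "nvidia_h100xm-20c",
                 "nvidia_h100xm-3-40c", "nvidia_h100xm-1-10c"):
        p = parse_profile(name)
        print(name, p.mode, p.frame_buffer_gb, nvlink_p2p_eligible(p, 80))
```

A check like this fits naturally into a provisioning pipeline, where a mistyped or skimmed profile name would otherwise surface only after the VM is powered on.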

A reference topology for mixed workloads

Most enterprises do not run a single workload type, so the clean design is two GPU pools inside one VCF 9 workload domain: a time-sliced pool for NIM inference and self-service endpoints, and a MIG pool for isolated training and fine-tuning. Private AI Services and VCF Automation sit above both and place workloads onto the right pool. On ESXi 9 Update 1 you can collapse this onto fewer hosts using the hybrid mode, but keeping the pools as distinct clusters keeps capacity planning and lifecycle simple, and I usually recommend that for the first deployment.

Reference split: NIM inference on a time-sliced pool, training and fine-tuning on a MIG pool.

The decision path

Work the questions in order. Each answer gates the next, and the first few often settle the design before you ever get to a density argument.

Five gating questions, in order. The first yes usually fixes the mode.

Disclaimer: Enabling MIG with nvidia-smi -mig 1, changing vGPU mode, or reconfiguring profiles is a host-level change that destroys running instances. Validate your GPU and ESXi build against the NVIDIA AI Enterprise and Private AI Foundation interop matrix, confirm the vGPU host driver VIB matches the guest driver, drain and back up workloads, and test the mode change on a non-production host before rolling it across the cluster.

Operational details that bite later

Three things consistently surprise teams after the design is signed off. First, DRS automation must be set to Partially Automated or Manual for any cluster running vGPU VMs. Fully Automated DRS cannot reason about GPU profile type, MIG layout, or NVLink topology, so placement stays an operator decision. ESXi 9 does help here with two configurable device selection policies: a performance policy that spreads virtual functions (VFs) across physical devices, and a consolidation policy that packs them together to preserve free GPUs. It also adds DirectPath Profile Pools, which give a cluster-wide view of consumed and remaining vGPU capacity.
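The difference between the performance and consolidation policies is easiest to see on a toy host. The sketch below is a deliberately simplified model, not ESXi's actual placement logic: it assumes every GPU offers the same number of slots and that the only input is how many are already in use. The `pick_gpu` function, the policy strings, and the GPU names are all hypothetical.

```python
from typing import Optional


def pick_gpu(used: dict[str, int], slots_per_gpu: int, policy: str) -> Optional[str]:
    """Choose a physical GPU for the next vGPU under a toy selection policy.

    used          -- slots already consumed on each GPU
    slots_per_gpu -- capacity of every GPU (assumed uniform)
    policy        -- "performance" (spread) or "consolidation" (pack)
    """
    # Only GPUs with at least one free slot are candidates; sort for stable tie-breaks.
    candidates = sorted(g for g, n in used.items() if n < slots_per_gpu)
    if not candidates:
        return None

    if policy == "performance":
        # Spread: place on the least-loaded device.
        return min(candidates, key=lambda g: used[g])
    if policy == "consolidation":
        # Pack: fill the busiest device that still has room, keeping idle GPUs free.
        return max(candidates, key=lambda g: used[g])
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    host = {"gpu0": 2, "gpu1": 0, "gpu2": 4}  # gpu2 is already full
    print("performance   ->", pick_gpu(host, 4, "performance"))
    print("consolidation ->", pick_gpu(host, 4, "consolidation"))
```

In this model the consolidation policy leaves `gpu1` untouched, which is the point of packing: a fully free GPU stays available for a later full-frame-buffer or large MIG profile.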

Second, GPU Reservations, the feature you would want for guaranteeing slots to mission-critical AI workloads, shipped as a tech preview in VCF 9.0. It is not GA. If your SLA model depends on reserved GPU capacity, design around that gap rather than assuming the feature.

Third, observability. NVIDIA DCGM is not supported inside vGPU guest VMs or on hosts running the NVIDIA AI Enterprise vGPU Manager, so SM utilization, tensor core activity, and per-process metrics are not visible through standard DCGM tooling in a vGPU deployment. VCF Operations gives you the hypervisor view (profile allocation, frame buffer, DirectPath Profile Pool capacity), but plan for a workload-layer overlay if you need deep per-job GPU metrics. Most of the early production pain I see traces back to vGPU mode and profile mismatches, which is the territory the common GPU and vGPU mistakes post walks through in detail.

What I’d Do

For a first Private AI build, design two pools and keep them clean: a time-sliced vGPU cluster for the NIM inference path, where density and NVLink matter and MIG is not even an option, and a MIG-backed cluster for isolated fine-tuning and any regulated multi-tenant training, where hardware isolation and fault containment earn their keep. Reach for the ESXi 9 Update 1 hybrid mode only when you have a genuine need to combine inter-tenant isolation with intra-tenant sharing on scarce hardware, and only on hosts that are already running 9.0.1.0 or later. Keep passthrough for the rare dedicated, maximum-throughput workload that can live without vMotion. Verdict: do not let the tidier-looking MIG diagram drive the inference cluster. NIM compatibility and NVLink decide that pool, and they decide it in favour of time-slicing.

Which pool is harder to size in your environment, the inference side or the training side? That is usually where the interesting trade-offs live.


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
