Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


GPU Partitioning for VMware Private AI: Choosing Between vGPU, MIG and Passthrough (Private AI Series, Part 6)

Time-sliced vGPU, MIG-backed vGPU, GPU passthrough and the new ESXi 9 Update 1 hybrid mode each fit different Private AI workloads. Here is how to design the split, with a capability matrix and a reference topology.

VMware Private AI Series · Part 6 of 24

TL;DR · Key Takeaways

  • There are now four GPU sharing modes on VCF 9, not three: full time-slicing, MIG-backed vGPU, the ESXi 9 Update 1 time-sliced MIG-backed hybrid, and full passthrough (DirectPath I/O).
  • Two hard constraints decide most designs: NVIDIA NIM cannot run on MIG, and NVLink peer-to-peer only works on full-frame-buffer time-sliced vGPU. Both push you toward time-slicing.
  • MIG buys hardware isolation, fault containment, and roughly 33% more inference throughput per GPU under load, at the cost of NVLink and a hard cap of 7 instances.
  • The pragmatic pattern is two pools: a time-sliced cluster for NIM inference and a MIG cluster for isolated training and fine-tuning. ESXi 9 Update 1 lets you fold both onto one GPU.
  • Read the vGPU profile name: a middle digit (h100xm-1-10c) means MIG-backed; no middle digit (h100xm-20c) means time-sliced.
Who this is for: VCF architects and platform teams designing GPU capacity for Private AI Foundation.

Prerequisites: a sized GPU host BOM (Part 5), a VCF 9 workload domain plan, and an NVIDIA AI Enterprise entitlement.

Two architects bring me the same H100 hosts and ask the same question, and they want opposite answers. One is standing up a RAG chatbot on NIM and needs density and NVLink. The other is running isolated fine-tuning jobs for three business units that are not allowed to see each other’s data. Same silicon, same VCF 9 workload domain, and the correct partitioning mode is different for each. Picking it wrong is not a tuning detail you fix later. It dictates which workloads will even run, whether your NVSwitch fabric is reachable, and how big the blast radius is when one tenant’s CUDA process dies.

This part is the design treatment of GPU partitioning under VMware Private AI Foundation with NVIDIA. It follows on from choosing the right GPU and the broader planning and prerequisites work. The hardware is bought; now you decide how to slice it.

Four ways to carve up one GPU

Time-slicing is the original mechanism. The physical GPU is not partitioned; every vGPU on it shares the same engines and the same VRAM pool, and they take turns. While one vGPU runs, it owns the engines for its slice, then yields. You get maximum density and full flexibility, and you keep NVLink peer-to-peer for full-frame-buffer profiles. What you do not get is a wall between tenants. A process that over-allocates memory can trigger out-of-memory errors for its neighbours, and a fatal fault on one vGPU can reset the whole physical GPU and take every co-resident VM down with it.

MIG (Multi-Instance GPU) is the opposite philosophy. On Ampere and newer data center GPUs, the device is physically divided into up to seven instances, each with its own streaming multiprocessors, memory controllers, L2 cache banks, and DRAM. This is silicon partitioning, not a scheduler. Instances run in parallel with zero contention, and a crash in one is contained to that instance. The price is rigidity: a fixed maximum of seven instances in fixed profile shapes, no NVLink peer-to-peer, and you cannot resize a live instance without destroying it and its workload.

ESXi 9 Update 1 (build 9.0.1.0, shipped September 2025) adds a third option that did not exist before: time-sliced MIG-backed vGPU. It puts MIG walls between groups and time-shares within a single MIG instance, giving you a middle tier of isolation plus density on the same card. The fourth option is the bluntest: full GPU passthrough via DirectPath I/O, where one VM owns the entire physical device. Highest raw performance, lowest density, and you give up vMotion, snapshots, and the operational sugar that makes the rest of the platform pleasant to run.

[Figure: Four ways to carve up one physical GPU. Same H100, four very different isolation and density outcomes. (1) Full time-slicing: VM A, VM B and VM C share one VRAM pool and run in turns; max density, NVLink kept, NIM OK, soft isolation. (2) MIG-backed vGPU: up to 7 hardware instances; no NVLink, no NIM. (3) Time-sliced MIG-backed (ESXi 9 U1+): MIG walls between groups, time-share inside. (4) Passthrough (DirectPath I/O): one VM owns the whole GPU; max raw performance, NVLink kept (non-NVSwitch), no sharing, no vMotion.]
The same physical GPU under each of the four sharing modes available on VCF 9.

The capability matrix

Before the design decisions, line the four modes up against the dimensions that actually drive an architecture review. Note the throughput figures: NVIDIA’s own A100 inference data puts MIG at roughly 1.00 requests per second per GPU against 0.76 for time-slicing under concurrent load, about a 33% gap, because MIG’s dedicated memory controllers remove the contention that time-slicing suffers. Time-slicing wins on single-task latency for bursty work, but that edge collapses the moment the GPU is busy.

| Dimension | Time-slicing | MIG-backed | Hybrid | Passthrough |
|---|---|---|---|---|
| Isolation | Soft, shared pool | Hardware, per instance | Hardware between slices | Dedicated GPU |
| Max density | Highest | Up to 7 | Medium | 1 VM |
| NVLink P2P | Full-FB profiles only | Disabled | Disabled | Yes (non-NVSwitch) |
| NIM compatible | Yes | No | Time-sliced portion only | Yes |
| Fault blast radius | Whole physical GPU | One instance | One slice | One VM |
| Inference throughput | ~0.76 req/s (A100) | ~1.00 req/s (+33%) | Between the two | Bare-GPU |
| Min ESXi | All | 9 GA (vGPU 19.0+) | 9 Update 1 | All |
| Best for | NIM inference, dev/test | Isolated training, multi-tenant | Group isolation + sharing | Max-throughput single workload |
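To make the matrix easier to query during a design review, here is a minimal Python sketch that encodes a few of its rows and filters the modes by requirement. The dictionary keys, the `candidates` function and its parameters are illustrative names, not part of any VMware or NVIDIA tooling. The sketch also makes one conservative assumption: the hybrid mode's "time-sliced portion only" NIM entry is treated as not NIM-ready.

```python
# Rows copied from the capability matrix above; key names are illustrative.
MATRIX = {
    "time-slicing": {"isolation": "soft", "nvlink_p2p": "full-fb only", "nim": "yes"},
    "mig-backed": {"isolation": "hardware", "nvlink_p2p": "disabled", "nim": "no"},
    "hybrid": {"isolation": "hardware", "nvlink_p2p": "disabled", "nim": "partial"},
    "passthrough": {"isolation": "dedicated", "nvlink_p2p": "non-nvswitch", "nim": "yes"},
}


def candidates(need_nim=False, need_nvlink=False, need_hw_isolation=False):
    """Return the sharing modes that satisfy every stated requirement."""
    result = []
    for mode, caps in MATRIX.items():
        # Assumption: "partial" NIM support is treated as not good enough.
        if need_nim and caps["nim"] != "yes":
            continue
        if need_nvlink and caps["nvlink_p2p"] == "disabled":
            continue
        if need_hw_isolation and caps["isolation"] == "soft":
            continue
        result.append(mode)
    return result


if __name__ == "__main__":
    print(candidates(need_nim=True, need_nvlink=True))
    print(candidates(need_hw_isolation=True))
```

Asking for NIM plus NVLink leaves only time-slicing and passthrough, which is the same conclusion the two hard constraints reach in prose.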

The two constraints that decide most designs

Most of the partitioning debate evaporates once you apply two non-negotiable rules. They are the first things I check on a whiteboard, before anyone argues about density.

NIM cannot run on MIG. NVIDIA NIM microservices, the primary inference runtime in Private AI Foundation, are not supported on MIG sharing mode as of VCF 9.0. If NIM serves your RAG pipelines, chatbots, or agentic workloads, those GPUs must be in time-slicing mode. Full stop. This single rule pushes the entire inference path onto time-sliced vGPU, which is why so many real Private AI clusters end up time-sliced even though MIG looks tidier on paper.

NVLink peer-to-peer needs full-frame-buffer time-slicing. Any workload that spans multiple physical GPUs over NVLink (large model training or multi-GPU inference) only gets P2P on time-sliced vGPUs that hold the entire frame buffer. MIG disables NVLink P2P under every ESXi version, with no exception. And there is a sharper edge for HGX-class hosts: NVSwitch fabric is not supported for vGPU on vSphere at all, and passthrough is not supported on NVSwitch systems either. On an 8-way SXM5 HGX box under ESXi, full-frame-buffer time-sliced vGPU is effectively your only way to reach the 900 GB/s mesh you paid for. Put MIG on that box and you are renting a Ferrari to sit in traffic.

My take: if a vendor slide tells you MIG is the default best practice for Private AI, push back. It is the right default only for the training and isolation path. The inference path that most enterprises actually deploy first is a time-slicing story.

Reading the vGPU profile name

The profile name is the fastest way to tell what you are actually allocating, and people get burned because they skim it. The middle digit is the tell.

[Figure: Decode the profile name. The middle digit separates MIG-backed from time-sliced. nvidia_h100xm-20c: no middle digit, so time-sliced; 20c = 20 GB frame buffer, C-series compute. nvidia_h100xm-1-10c: middle digit present, so MIG-backed; 1 GPU-instance slice, 10 GB frame buffer.]
Profile naming: the middle segment is the difference between a hardware slice and a time-share.
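The naming rule above is mechanical enough to check in code. The following is a minimal Python sketch that decodes a profile name into its mode, slice count and frame buffer size. The regular expression, the `VgpuProfile` class and the `parse_profile` function are my own illustrative names, and the pattern only covers the naming shapes shown in this article; profile names in a real environment may carry variations it does not handle.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: [nvidia_]<board>-[<slices>-]<frame buffer GB><series>
_PROFILE = re.compile(
    r"^(?:nvidia_)?(?P<board>[a-z0-9]+)-"
    r"(?:(?P<slices>\d+)-)?"
    r"(?P<fb_gb>\d+)(?P<series>[a-z]+)$"
)


@dataclass(frozen=True)
class VgpuProfile:
    board: str
    mode: str                  # "mig-backed" or "time-sliced"
    gpu_slices: Optional[int]  # middle digit, present only for MIG-backed
    frame_buffer_gb: int
    series: str


def parse_profile(name: str) -> VgpuProfile:
    """Decode a vGPU profile name using the middle-digit rule."""
    match = _PROFILE.match(name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised vGPU profile name: {name!r}")
    slices = match.group("slices")
    return VgpuProfile(
        board=match.group("board"),
        mode="mig-backed" if slices is not None else "time-sliced",
        gpu_slices=int(slices) if slices is not None else None,
        frame_buffer_gb=int(match.group("fb_gb")),
        series=match.group("series"),
    )


if __name__ == "__main__":
    for name in ("nvidia_h100xm-20c", "nvidia_h100xm-1-10c"):
        print(name, "->", parse_profile(name))
```

A check like this is cheap to run against a VM inventory export before a change window, so that a MIG-backed profile does not end up on a cluster meant for NIM.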

Here is how the options lay out on a single H100 80GB. Time-sliced fractional profiles divide the frame buffer; MIG profiles map to fixed instance shapes. Remember NVLink P2P only survives on the full-frame-buffer time-sliced profile.

| Profile | Mode | Frame buffer | Per H100 80GB | NVLink P2P |
|---|---|---|---|---|
| nvidia_h100xm-80c | Time-sliced, full FB | 80 GB | 1 | Yes |
| nvidia_h100xm-20c | Time-sliced | 20 GB | 4 | No (fractional) |
| nvidia_h100xm-3-40c | MIG-backed (3g.40gb) | 40 GB | 2 | No |
| nvidia_h100xm-1-10c | MIG-backed (1g.10gb) | 10 GB | 7 | No |
| DirectPath I/O | Passthrough | 80 GB | 1 VM | Yes (non-NVSwitch) |
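The "Per H100 80GB" column follows from simple arithmetic, sketched below in Python. Time-sliced profiles divide the frame buffer. For MIG, the sketch assumes an H100 80GB exposes 7 compute slices and 8 memory slices of 10 GB each, and takes whichever runs out first. Those slice counts, the `H100_80GB` dictionary and the `vgpus_per_gpu` function are assumptions for illustration; the sketch ignores MIG placement rules and mixed-profile layouts, so treat the NVIDIA documentation as the authority for real sizing.

```python
# Assumed H100 80GB geometry (illustrative, verify against NVIDIA docs).
H100_80GB = {"frame_buffer_gb": 80, "compute_slices": 7, "memory_slices": 8}


def vgpus_per_gpu(fb_gb, gpu_slices=None, gpu=H100_80GB):
    """Estimate how many identical vGPUs of one profile fit on one GPU.

    fb_gb      -- frame buffer of the profile in GB (the "20" in 20c)
    gpu_slices -- middle digit of a MIG-backed profile, None if time-sliced
    """
    if gpu_slices is None:
        # Time-sliced: the frame buffer is simply divided.
        return gpu["frame_buffer_gb"] // fb_gb

    # MIG-backed: limited by compute slices and by memory slices.
    gb_per_memory_slice = gpu["frame_buffer_gb"] // gpu["memory_slices"]
    memory_slices_needed = -(-fb_gb // gb_per_memory_slice)  # ceiling division
    by_memory = gpu["memory_slices"] // memory_slices_needed
    by_compute = gpu["compute_slices"] // gpu_slices
    return min(by_compute, by_memory)


if __name__ == "__main__":
    print("80c   ->", vgpus_per_gpu(80))
    print("20c   ->", vgpus_per_gpu(20))
    print("3-40c ->", vgpus_per_gpu(40, gpu_slices=3))
    print("1-10c ->", vgpus_per_gpu(10, gpu_slices=1))
```

Under these assumptions the 1-10c profile is capped by compute slices rather than memory, which is why the table shows 7 rather than 8.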

A reference topology for mixed workloads

Most enterprises do not run a single workload type, so the clean design is two GPU pools inside one VCF 9 workload domain: a time-sliced pool for NIM inference and self-service endpoints, and a MIG pool for isolated training and fine-tuning. Private AI Services and VCF Automation sit above both and place workloads onto the right pool. On ESXi 9 Update 1 you can collapse this onto fewer hosts using the hybrid mode, but keeping the pools as distinct clusters keeps capacity planning and lifecycle simple, and I usually recommend that for the first deployment.

[Figure: Two GPU pools, one workload domain. Time-sliced inference and MIG training, placed by Private AI Services. Inside the VCF 9 workload domain, VCF Automation and Private AI Services (Model Runtime, Model Store, self-service catalog) sit above two clusters. Time-sliced vGPU cluster: NIM inference, RAG, agents, dev/test; GPU hosts with shared VRAM, NVLink and NIM endpoints. MIG cluster: isolated training, fine-tuning, multi-tenant; GPU hosts with hardware-isolated slices and per-tenant boundaries.]
Reference split: NIM inference on a time-sliced pool, training and fine-tuning on a MIG pool.

The decision path

Work the questions in order. Each answer gates the next, and the first few often settle the design before you ever get to a density argument.

Partitioning decision path (answer top to bottom; the first yes often ends the debate):

  1. GPU is Turing or Volta (T4, V100)? No MIG silicon, ever. YES -> Time-slicing only; refresh to Ampere+ if MIG is required.
  2. Need NVLink across multiple GPUs (multi-GPU training or inference)? YES -> Full-FB time-slicing; MIG disables NVLink P2P.
  3. Is NIM your primary inference runtime (RAG, chatbot, agents)? YES -> Time-slicing; NIM is not supported on MIG.
  4. Strict multi-tenant isolation needed (regulated or cross-BU data boundaries)? YES -> MIG-backed vGPU; hardware isolation, fault containment.
  5. On ESXi 9 U1 and need group share + isolation (share inside a tenant, wall between tenants)? YES -> Hybrid mode; time-sliced MIG-backed vGPU.
  DEFAULT (single max-throughput workload, dedicated): Passthrough via DirectPath I/O, accepting no vMotion and no sharing.
Five gating questions, in order. The first yes usually fixes the mode.
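The five gating questions can be written down as a short function, which is a handy way to make a design review's assumptions explicit. This is a minimal Python sketch that follows the questions literally, in order, with the first yes winning. The `Requirements` fields and the `choose_mode` function are hypothetical names for illustration, not an API from VCF or NVIDIA.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    gpu_supports_mig: bool = True          # False for Turing or Volta (T4, V100)
    needs_multi_gpu_nvlink: bool = False   # multi-GPU training or inference
    nim_is_primary_runtime: bool = False   # RAG, chatbot, agents
    strict_tenant_isolation: bool = False  # regulated or cross-BU boundaries
    esxi_9_u1_or_later: bool = False
    group_share_plus_isolation: bool = False  # share inside, wall between


def choose_mode(req: Requirements) -> str:
    """Walk the gating questions top to bottom; the first yes decides."""
    if not req.gpu_supports_mig:
        return "time-slicing"
    if req.needs_multi_gpu_nvlink:
        return "full-FB time-slicing"
    if req.nim_is_primary_runtime:
        return "time-slicing"
    if req.strict_tenant_isolation:
        return "MIG-backed vGPU"
    if req.esxi_9_u1_or_later and req.group_share_plus_isolation:
        return "hybrid (time-sliced MIG-backed vGPU)"
    # Default: single dedicated max-throughput workload.
    return "passthrough (DirectPath I/O)"


if __name__ == "__main__":
    rag_chatbot = Requirements(nim_is_primary_runtime=True)
    fine_tuning = Requirements(strict_tenant_isolation=True)
    print("RAG chatbot on NIM:", choose_mode(rag_chatbot))
    print("Isolated fine-tuning:", choose_mode(fine_tuning))
```

The two architects from the introduction land on different modes with the same hardware, which is the point of working the questions in order rather than starting from density.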
Disclaimer: Enabling MIG with nvidia-smi -mig 1, changing vGPU mode, or reconfiguring profiles is a host-level change that destroys running instances. Validate your GPU and ESXi build against the NVIDIA AI Enterprise and Private AI Foundation interop matrix, confirm the vGPU host driver VIB matches the guest driver, drain and back up workloads, and test the mode change on a non-production host before rolling it across the cluster.

Operational details that bite later

Three things consistently surprise teams after the design is signed off. First, DRS automation must be set to Partially Automated or Manual for any cluster running vGPU VMs. Fully Automated DRS cannot reason about GPU profile type, MIG layout, or NVLink topology, so placement stays an operator decision. ESXi 9 does help here in two ways. Configurable device selection policies let you choose between a performance policy, which spreads virtual functions (VFs) across physical devices, and a consolidation policy, which packs them to preserve free GPUs. DirectPath Profile Pools add a cluster-wide view of consumed and remaining vGPU capacity.

Second, GPU Reservations, the feature you would want for guaranteeing slots to mission-critical AI workloads, shipped as a tech preview in VCF 9.0. It is not GA. If your SLA model depends on reserved GPU capacity, design around that gap rather than assuming the feature.

Third, observability. NVIDIA DCGM is not supported inside vGPU guest VMs or on hosts running the NVIDIA AI Enterprise vGPU Manager, so SM utilization, tensor core activity, and per-process metrics are not visible through standard DCGM tooling in a vGPU deployment. VCF Operations gives you the hypervisor view (profile allocation, frame buffer, DirectPath Profile Pool capacity), but plan for a workload-layer overlay if you need deep per-job GPU metrics. Most of the early production pain I see traces back to vGPU mode and profile mismatches, which is the territory the common GPU and vGPU mistakes post walks through in detail.
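The difference between the two ESXi 9 device selection policies from the first point is easier to see with a toy model. The Python sketch below picks a physical GPU for the next vGPU under a spread ("performance") or pack ("consolidation") rule. It is purely illustrative: the `pick_gpu` function, the policy strings and the free-slot dictionary are my own simplification and do not represent how ESXi actually implements placement.

```python
def pick_gpu(free_slots, policy):
    """Pick a GPU for the next vGPU in a toy placement model.

    free_slots -- dict mapping GPU id to the number of free vGPU slots
    policy     -- "performance" (spread) or "consolidation" (pack)
    Returns the chosen GPU id, or None if nothing has a free slot.
    """
    usable = sorted(gpu for gpu, free in free_slots.items() if free > 0)
    if not usable:
        return None
    if policy == "performance":
        # Spread: land on the emptiest GPU to reduce contention.
        return max(usable, key=lambda gpu: free_slots[gpu])
    if policy == "consolidation":
        # Pack: fill the busiest GPU that still has room, keeping others free.
        return min(usable, key=lambda gpu: free_slots[gpu])
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    host = {"gpu0": 4, "gpu1": 2, "gpu2": 4}
    print("performance   ->", pick_gpu(host, "performance"))
    print("consolidation ->", pick_gpu(host, "consolidation"))
```

In this model the consolidation rule keeps whole GPUs untouched for as long as possible, which matters when a later workload needs a full-frame-buffer profile.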

What I’d Do

For a first Private AI build, design two pools and keep them clean: a time-sliced vGPU cluster for the NIM inference path, where density and NVLink matter and MIG is not even an option, and a MIG-backed cluster for isolated fine-tuning and any regulated multi-tenant training, where hardware isolation and fault containment earn their keep. Reach for the ESXi 9 Update 1 hybrid mode only when you have a genuine need to combine inter-tenant isolation with intra-tenant sharing on scarce hardware, and only after you have it running on 9.0.1.0 or later. Keep passthrough for the rare dedicated, maximum-throughput workload that can live without vMotion. Verdict: do not let the tidier-looking MIG diagram drive the inference cluster. NIM compatibility and NVLink decide that pool, and they decide it in favour of time-slicing.

Which pool is harder to size in your environment, the inference side or the training side? That is usually where the interesting trade-offs live.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.