Dr. Pranay Jha


GPU Partitioning on NVIDIA Data-Center GPUs: MIG vs vGPU vs Time-Slicing vs Passthrough (NVIDIA AI Series, Part 6)

Four ways to partition an NVIDIA H100, H200, or B200 GPU: MIG, vGPU, CUDA time-slicing, and full passthrough. This post covers the isolation guarantees, profile geometry, Kubernetes GPU Operator configuration, and a sizing worked example to help you pick the right mode for your cluster.

NVIDIA AI Series · Part 6 of 30
TL;DR: There are four ways to share an NVIDIA data-center GPU: MIG, vGPU, CUDA time-slicing, and full passthrough. MIG is the only one that gives you genuine hardware isolation with per-instance memory bandwidth and fault containment. Use it whenever multi-tenant workloads must not interfere with each other. vGPU adds a hypervisor layer on top and suits VM-based platforms. Time-slicing is free and simple but gives zero isolation. Passthrough is for single-tenant, maximum-performance work. The verdict: default to MIG on H100/H200/B200 for shared inference and dev clusters; use passthrough for large training runs; see the Private AI GPU partitioning post if your platform is VMware Cloud Foundation.
Who this is for: Platform engineers and AI infrastructure architects who need to decide how to share NVIDIA H100, H200, or B200 GPUs across multiple tenants or workloads. You should already know what a GPU is and have a passing familiarity with Kubernetes. This post covers the partitioning mechanisms, not CUDA programming or model training specifics.

You buy an H100 SXM5 to run inference for ten internal teams. On day one, team A submits a Llama 3 request that spikes to 95 percent SM utilization. Teams B through J get throttled to nothing. The GPU is not "shared"; it is timeshared by whoever happens to be loudest. This is the default CUDA scheduling behavior and it is a production problem the moment you have more than one tenant on a node. Picking the right partitioning strategy before you deploy is the decision that determines whether your multi-tenant cluster stays predictable or turns into a noisy-neighbor complaint queue.

The Four Partitioning Modes

Before comparing the modes, it helps to be precise about what "isolation" means in this context. There are three independent dimensions: compute isolation (SMs cannot be stolen by a neighbor), memory isolation (each partition has exclusive, bounded HBM bandwidth), and fault isolation (an error in one partition does not tear down another). The four modes each deliver different subsets of these guarantees.

Figure: GPU Partitioning Modes: Isolation Spectrum, from no isolation (time-slicing) to full hardware partition (MIG). Panels, weakest to strongest isolation:
Time-Slicing: compute shared; memory shared; fault isolation none; license free; overhead lowest; no QoS.
Passthrough: compute full; memory full; fault isolation VM-level; license NVAIE; one tenant per GPU; no sharing.
vGPU: compute scheduled; memory partitioned; fault isolation partial; license NVAIE; VM hypervisor; scheduler QoS.
MIG: compute hardware-isolated; memory hardware-isolated; faults contained; license NVAIE; Ampere and later only; full QoS.
Caption: Four partitioning modes on NVIDIA data-center GPUs. Only MIG provides hardware isolation in all three dimensions.

CUDA Time-Slicing

The CUDA scheduler context-switches between processes on a fixed time slice, on the order of milliseconds by default. No hardware boundary separates tenants. If one process is running a large kernel, the scheduler will preempt it and hand the SMs to the next process, but the preemption is coarse and the shared L2 cache and HBM bus are never partitioned. The latency impact from context switching is real: switching saves and restores SM register state, which takes time proportional to the register footprint of the outgoing kernel.

In Kubernetes, you enable time-slicing via a ConfigMap consumed by the GPU Device Plugin. The replicas field tells the plugin how many logical GPU resources to advertise per physical GPU. Setting replicas: 4 on an H100 makes Kubernetes think it has four GPUs; four pods get scheduled. The GPU itself receives all four contexts and time-slices among them. There is no memory cap per pod.
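A minimal sketch of the ConfigMap described above. It assumes the GPU Operator runs in the gpu-operator namespace and that the cluster policy is named cluster-policy; the ConfigMap name time-slicing-config and the data key any are illustrative choices, not values from this post. The kubectl lines are left commented so the snippet only writes a file.

```shell
# Write a time-slicing config that advertises 4 replicas per physical GPU.
# ConfigMap name and the "any" key are illustrative assumptions.
cat > time-slicing-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

# Apply on a real cluster (commented so the sketch runs anywhere):
# kubectl apply -f time-slicing-config.yaml
# kubectl patch clusterpolicies.nvidia.com/cluster-policy \
#   -n gpu-operator --type merge \
#   -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```

Note that replicas only changes what the scheduler sees. Each of the four pods can still allocate the full HBM, so an out-of-memory error in one pod can be caused by another.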

Full Passthrough

A single VM or container gets the entire GPU. On a bare-metal Kubernetes node this is the default: one pod with nvidia.com/gpu: 1 gets a full H100. In a virtualized environment, GPU passthrough assigns the PCIe device directly to one VM via VFIO/IOMMU. The VM sees the raw hardware with no mediated layer. Performance is close to bare metal. The downside is obvious: you cannot split the GPU across tenants. If your H100 SXM5 costs on the order of $30,000 per card and your workload only needs 20 GB of HBM, you are leaving 60 GB idle.

NVIDIA vGPU

vGPU is a software mediation layer that runs in the hypervisor. The physical GPU is split into virtual GPU instances using a driver-level scheduler. Each VM sees a virtual GPU with a fixed memory allocation and a scheduler policy. Three scheduler modes exist: best-effort, equal-share, and fixed-share. Fixed-share gives each vGPU a guaranteed time budget per second, which is the closest vGPU gets to QoS without MIG hardware backing. Memory allocation is enforced; one vGPU cannot read another's framebuffer. Fault isolation is partial: a driver fault in one VM can trigger a GPU reset that affects all VMs on the card.

vGPU requires an NVIDIA AI Enterprise license and a supported hypervisor (VMware vSphere, Red Hat KVM, Nutanix AHV). If your platform is vSphere, this is the natural path. The Private AI series covers the VCF-specific deployment in detail at GPU partitioning on VMware Private AI.

Multi-Instance GPU (MIG)

MIG partitions a single GPU into up to seven independent GPU Instances (GIs) at the hardware level using GPC slices and memory channel groupings. Each GI gets its own SMs, L2 cache slice, HBM channel allocation, and copy engines. There is no shared bus between instances that an attacker or a faulty kernel can exploit. Within each GI, you create a Compute Instance (CI) that maps SMs to a schedulable CUDA context. A MIG device is the pair (GI, CI) that a container or VM sees as a GPU.

MIG is available on A100, A30, H100, H200, B200, and selected Blackwell professional GPUs. It requires enabling MIG mode at the driver level. On Ampere GPUs this triggers a brief GPU reset; NVIDIA's MIG User Guide notes that Hopper and later no longer need the reset. You cannot mix MIG and non-MIG workloads on the same physical GPU simultaneously.

MIG Profile Geometry on H100, H200, and B200

Profile names follow a consistent convention: Ng.Mgb where N is the number of GPC slices and M is the HBM allocation. An H100 SXM5 80 GB card supports seven 1g.10gb slices (7 x 10 GB = 70 GB usable, the remaining HBM reserved for system overhead). You cannot mix arbitrary profiles on one GPU; NVIDIA publishes exact valid combinations in the MIG User Guide.

Figure: MIG Profile Geometry: H100 vs H200 vs B200. Hardware slice layout across three data-center GPU generations, each shown at maximum density (7 instances, 1/7 of the SMs per instance):
H100 80 GB (SXM5 / PCIe): 7 x 1g.10gb
H200 141 GB (SXM5): 7 x 1g.18gb
B200 180 GB (SXM6): 7 x 1g.23gb
Caption: Maximum-density MIG configuration (all 1g slices) on H100 80 GB, H200 141 GB, and B200 180 GB. Source: NVIDIA MIG User Guide r610.

The following table shows the complete profile set for H100 80 GB SXM5, which is the most common production variant as of mid-2026. Each profile is supported on its own, up to the listed maximum instance count; a combination of profiles is valid only if it fits within the GPU's GPC and memory budget.

| Profile | Memory | SM Fraction | Max Instances | Typical Use |
|---|---|---|---|---|
| 1g.10gb | 10 GB | 1/7 | 7 | Small LLM inference, embedding models |
| 1g.10gb+me | 10 GB | 1/7 | 1 | Video transcoding alongside inference |
| 1g.20gb | 20 GB | 1/7 | 4 | Medium models requiring more HBM |
| 2g.20gb | 20 GB | 2/7 | 3 | 7B param models in INT8 |
| 3g.40gb | 40 GB | 3/7 | 2 | 13B-34B param models in FP8 |
| 4g.40gb | 40 GB | 4/7 | 1 | Large single-tenant workload on half the card |
| 7g.80gb | 80 GB | 7/7 | 1 | Full GPU in MIG mode (single tenant) |
Gotcha: The 1g.20gb profile allocates 2/8 of the HBM channels but only 1/7 of the SMs. It is memory-heavy relative to compute. Workloads that are SM-bound will not benefit from the extra HBM. Use 2g.20gb if you need the matching compute fraction alongside the memory. The profile names do not hint at this asymmetry.
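To make the asymmetry visible, here is a small sketch that splits a profile name into its compute and memory shares. It assumes an H100 80 GB (7 compute slices, 8 memory slices of 10 GB each); mig_profile_shares is a hypothetical helper, not an NVIDIA tool, and the divisor would differ on H200 or B200.

```shell
# Print the compute and memory share of a MIG profile name such as "1g.20gb".
# Assumes an H100 80 GB: 7 compute slices and 8 memory slices of 10 GB each.
mig_profile_shares() {
  profile="${1%%+*}"          # drop suffixes such as "+me"
  compute="${profile%%g.*}"   # "1g.20gb" -> "1"
  mem_gb="${profile#*g.}"     # "1g.20gb" -> "20gb"
  mem_gb="${mem_gb%gb}"       # "20gb"    -> "20"
  mem_slices=$(( mem_gb / 10 ))
  echo "${profile}: ${compute}/7 SMs, ${mem_slices}/8 memory"
}

mig_profile_shares 1g.20gb   # prints "1g.20gb: 1/7 SMs, 2/8 memory"
mig_profile_shares 2g.20gb   # prints "2g.20gb: 2/7 SMs, 2/8 memory"
```

Both profiles hold the same amount of HBM, but 2g.20gb has twice the SMs behind it.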

Comparison: What Each Mode Actually Delivers

| Criterion | Time-Slicing | Passthrough | vGPU | MIG |
|---|---|---|---|---|
| Memory isolation | None | Full (1 tenant) | Software-enforced | Hardware-enforced |
| Compute QoS | None | N/A | Scheduler policy | Hardware partition |
| Fault isolation | None | VM-level | Partial | Hardware boundary |
| Hardware required | Any NVIDIA GPU | Any with IOMMU | vGPU-capable | Ampere or later |
| License needed | None | NVAIE (for virt) | NVAIE | NVAIE (for support) |
| Reconfigure without reboot | Yes | N/A | Yes (profile change) | Yes (MIG Mgr drains pods) |
| Kubernetes scheduling | Replicas in DevicePlugin | nvidia.com/gpu: 1 | KubeVirt or vSphere | nvidia.com/mig-Ng.Mgb |

MIG and the Kubernetes GPU Operator

The NVIDIA GPU Operator v26.3.2 (released June 2026) deploys a MIG Manager DaemonSet on every MIG-capable node. MIG Manager watches the nvidia.com/mig.config node label and drives the full reconfiguration cycle: drain GPU pods, stop DCGM Exporter, call mig-parted to apply the new geometry, and restart everything. No manual nvidia-smi invocations required on a properly configured cluster.

Figure: GPU Operator MIG Reconfiguration Flow (what happens when you relabel a Kubernetes node):
Admin labels the node (mig.config=all-1g.10gb) -> MIG Manager detects the label change -> Drain pods (evict GPU workloads) -> Stop DCGM and the device plugin -> mig-parted applies the new geometry -> Restart and label (state: success, pods resume)
Caption: MIG Manager drives the full reconfiguration cycle via a Kubernetes node label. Source: docs.nvidia.com GPU Operator docs, June 2026.

Two MIG strategies exist: single (all GPUs on the node run the same profile) and mixed (different profiles on different GPUs, or a mixed profile like all-balanced on one GPU). Mixed strategy is more operationally complex because each pod must request the specific MIG resource type via nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, and so on. Kubernetes does not auto-bin-pack across MIG slice types without a custom scheduler.
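As a sketch of what a pod looks like under the mixed strategy, the manifest below requests one 1g.10gb slice by its resource name. The pod name embed-worker and the CUDA image tag are assumptions for illustration; substitute an image you already use. The kubectl line is commented so the snippet only writes a file.

```shell
# Write a pod manifest that requests a specific MIG slice type (mixed strategy).
# Pod name and image tag are illustrative assumptions.
cat > mig-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: embed-worker
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
EOF

# On a real cluster (commented so the sketch runs anywhere):
# kubectl apply -f mig-pod.yaml
```

Under the single strategy the same pod would request nvidia.com/gpu: 1 instead, and the slice type would be implied by the node's profile.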

Starting with GPU Operator v26.3.0, MIG Manager auto-generates the ConfigMap for each node by querying the hardware via NVML at startup. You no longer need to maintain a hand-crafted default-mig-parted-config ConfigMap for standard profiles. Custom profiles still require a user-provided ConfigMap.

Operational Artifact: MIG Creation via nvidia-smi and GPU Operator

The two paths to configure MIG are: directly via nvidia-smi on bare metal (or inside the driver container), and via the GPU Operator node label on Kubernetes. Both are shown below.

Worked Example

Scenario: bare-metal H100 80 GB SXM5 node. Goal: create 7 x 1g.10gb slices, verify them, then reconfigure to 2 x 3g.40gb for a larger tenant.

# Step 1: enable MIG mode on GPU 0 (run as root; requires no active GPU processes)
sudo nvidia-smi -i 0 -mig 1

# Expected output:
# Enabled MIG Mode for GPU 00000000:01:00.0
# All done.

# Step 2: create 7 GPU Instances using the 1g.10gb profile (profile ID 19 on H100 80GB)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -i 0

# Expected output:
# Successfully created GPU instance ID  1 on GPU  0 ...
# (one line per instance, 7 total; instance IDs vary)

# Step 3: create the default Compute Instance on every GPU Instance
sudo nvidia-smi mig -cci -i 0

# Step 4: verify
nvidia-smi -L

# Expected output:
# GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-xxxxx)
#   MIG 1g.10gb     Device  0: (UUID: MIG-xxxxx)
#   MIG 1g.10gb     Device  1: (UUID: MIG-xxxxx)
#   ...  (7 total)

# Step 5: reconfigure to 2 x 3g.40gb. Destroy Compute Instances first, then
# GPU Instances, then create the new geometry (-C also creates the default CIs)
sudo nvidia-smi mig -dci -i 0
sudo nvidia-smi mig -dgi -i 0
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C -i 0

# FAILURE MODE: if any GPU process (DCGM, driver health check) is still
# attached when you destroy instances in Step 5, nvidia-smi returns:
# Unable to destroy GPU instance: In use by another client
# Fix: stop DCGM (systemctl stop nvidia-dcgm) then retry.

To do the same via GPU Operator on Kubernetes (on a node named gpu-node-01):

# Add the NVIDIA Helm repository (skip if it is already configured)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Set MIG strategy to single during install
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v26.3.2 \
  --set mig.strategy=single

# Label the node to trigger MIG Manager reconfiguration
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Monitor reconfiguration progress
kubectl get node gpu-node-01 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# Transitions: pending -> success (or failed)

# Verify 7 slices are advertised (the single strategy exposes MIG devices as nvidia.com/gpu)
kubectl get node gpu-node-01 -o json | jq '.status.capacity'
# Expected with one H100 in the node: "nvidia.com/gpu": "7"

# Reconfigure to 2 x 3g.40gb for a heavier tenant
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-3g.40gb --overwrite
# MIG Manager drains pods, reconfigures, and relabels state=success

Expected result after all-3g.40gb (single strategy, one H100 in the node): the node labels show nvidia.com/gpu.count: 2 and nvidia.com/gpu.slices.gi: 3. Pods requesting nvidia.com/gpu: 1 will bind to one 3g.40gb instance.
Failure mode: if mig.config.state stays at pending for more than two minutes, check the MIG Manager logs: kubectl logs -n gpu-operator -l app=nvidia-mig-manager. A common cause is a zombie DCGM process holding the GPU context.

Disclaimer

MIG reconfiguration is a disruptive operation in production. It evicts all running GPU pods on the target node. In a Kubernetes cluster with live inference traffic, cordon the node and drain GPU workloads before relabeling. On some cloud providers, enabling MIG mode requires a node reboot; set migManager.env[0].name=WITH_REBOOT together with migManager.env[0].value=true in the Helm values. Test reconfiguration procedures in a staging cluster before running them against production nodes.
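One way to rehearse the cordon, drain, and relabel sequence is a dry-run wrapper that prints each command instead of executing it. This is a sketch only: run and reconfigure_mig are hypothetical helpers, the drain flags are a common choice rather than a requirement, and a whole-node drain also evicts non-GPU pods.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default here).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Cordon, drain, relabel, and uncordon one node. Hypothetical helper.
reconfigure_mig() {
  node="$1"; config="$2"
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run kubectl label nodes "$node" "nvidia.com/mig.config=${config}" --overwrite
  # On a live cluster, wait for mig.config.state=success before uncordoning.
  run kubectl uncordon "$node"
}

reconfigure_mig gpu-node-01 all-3g.40gb
```

Set DRY_RUN=0 only after the printed sequence matches what you expect for the staging node.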

Sizing a Shared Inference Cluster: A Decision Walk

The right profile depends on the models you are serving and how many teams need concurrent access. Here is how I work through it for a typical enterprise cluster with a mix of embedding, chat, and code completion services.

Figure: MIG Profile Selection Decision Tree, for shared inference on H100 80 GB. Root question: how much HBM does the model need?
under 10 GB -> 1g.10gb (7 tenants/GPU; embedding, small LLM)
10-20 GB -> SM-bound? (check sm_active in DCGM metrics) -> no: 1g.20gb (4 tenants/GPU, memory-heavy); yes: 2g.20gb (3 tenants/GPU, balanced)
20-40 GB -> 3g.40gb (2 tenants/GPU; 13B-34B FP8)
over 40 GB -> Passthrough (1 tenant/GPU; 70B+ models)
Caption: Profile selection decision tree for shared inference on H100 80 GB. SM utilization data from DCGM drives the compute-bound branch.
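The decision tree can be sketched as a small function. pick_mig_profile is a hypothetical helper; how the exact boundaries at 10, 20, and 40 GB are treated is an assumption, since the tree only names the ranges.

```shell
# Map a model's HBM need (whole GB) and an SM-bound flag (yes/no) to a profile.
# Boundary handling at exactly 10, 20, and 40 GB is an assumption.
pick_mig_profile() {
  mem_gb="$1"; sm_bound="${2:-no}"
  if [ "$mem_gb" -lt 10 ]; then
    echo "1g.10gb"
  elif [ "$mem_gb" -le 20 ]; then
    if [ "$sm_bound" = "yes" ]; then echo "2g.20gb"; else echo "1g.20gb"; fi
  elif [ "$mem_gb" -le 40 ]; then
    echo "3g.40gb"
  else
    echo "passthrough"
  fi
}

pick_mig_profile 8        # prints "1g.10gb"
pick_mig_profile 12 yes   # prints "2g.20gb"
pick_mig_profile 12 no    # prints "1g.20gb"
pick_mig_profile 80       # prints "passthrough"
```

Feed it the measured memory footprint under load, not the parameter-count estimate.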

Worked Sizing Example: Three-Team Shared Cluster

Four H100 SXM5 80 GB GPUs in a single node. Three teams: Team A runs embedding inference (BGE-M3, fits in 8 GB), Team B runs Llama 3 8B FP8 (needs ~12 GB), Team C runs Llama 3 70B (needs all 80 GB for single-GPU inference).

Suggested layout: GPU 0 set to 7 x 1g.10gb (7 slices for Team A, each embedding replica in one 1g.10gb instance). GPU 1 and GPU 2 set to 3 x 2g.20gb each (6 slices total for Team B Llama 8B replicas). GPU 3 left in non-MIG, full-GPU mode for Team C's 70B model. Because the built-in all-* configs apply one profile to every GPU on the node, a per-GPU layout like this needs the GPU Operator mixed strategy plus a custom mig-parted config with per-device entries. The alternative is to split the GPUs across separate nodes, one profile per node.
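The per-GPU layout above could be expressed as a custom mig-parted config along these lines. This is a sketch under assumptions: the ConfigMap name custom-mig-config and the config name three-team-layout are made up for illustration, the GPU indices assume the four cards enumerate as 0 to 3, and the operator is assumed to run with the mixed strategy in the gpu-operator namespace.

```shell
# Write a custom MIG Manager config with one entry per GPU group.
# ConfigMap name and "three-team-layout" are illustrative assumptions.
cat > custom-mig-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      three-team-layout:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
        - devices: [1, 2]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [3]
          mig-enabled: false
EOF

# Slices this layout yields: 7 for Team A, 2 x 3 for Team B, 1 full GPU for Team C.
team_a=7; team_b=$(( 2 * 3 )); team_c=1
echo "Team A: ${team_a} x 1g.10gb, Team B: ${team_b} x 2g.20gb, Team C: ${team_c} full GPU"

# On a real cluster (commented so the sketch runs anywhere):
# kubectl apply -f custom-mig-config.yaml
# kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
#   -p '{"spec":{"migManager":{"config":{"name":"custom-mig-config"}}}}'
# kubectl label nodes gpu-node-01 nvidia.com/mig.config=three-team-layout --overwrite
```

With the mixed strategy, Team A pods request nvidia.com/mig-1g.10gb, Team B pods request nvidia.com/mig-2g.20gb, and Team C requests nvidia.com/gpu.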

What to validate before going live: watch the DCGM fields DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_DEV_FB_USED per MIG instance for 24 hours under representative load. (DCGM_FI_DEV_GPU_UTIL is not reported per MIG instance, so use the profiling metric for SM activity.) If any 2g.20gb slice is consistently above 85 percent SM utilization, the model is outrunning the slice; move it to a 3g.40gb profile or add a second replica on a different slice. If HBM usage is consistently under 50 percent on a 1g.10gb slice, you may be able to double the replicas with time-slicing inside that MIG instance.
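The two thresholds above can be turned into a quick check once you have averaged readings per slice. slice_advice is a hypothetical helper; it takes both values as percentages and applies the 85 and 50 percent rules to any slice, which generalizes slightly beyond the two profiles named in the text.

```shell
# Suggest an action for one MIG slice from two averaged readings.
#   $1 = SM activity as a percentage (0-100)
#   $2 = HBM used, as a percentage of the slice's memory
# Thresholds (85 and 50) come from the validation guidance above.
slice_advice() {
  awk -v sm="$1" -v mem="$2" 'BEGIN {
    if (sm > 85)       print "scale-up: move to a larger profile or add a replica"
    else if (mem < 50) print "headroom: consider time-slicing more replicas in this slice"
    else               print "ok"
  }'
}

slice_advice 90 60   # prints the scale-up line
slice_advice 40 30   # prints the headroom line
slice_advice 60 70   # prints "ok"
```

The SM check runs first on purpose: a slice that is compute-saturated should not receive more replicas even if its memory is half empty.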

MIG-Backed vGPU: The Hybrid Mode

On a platform running KubeVirt or VMware vSphere, you can combine both technologies. A MIG slice becomes the hardware boundary and vGPU runs inside it, adding the VM abstraction. Benchmarks from NVIDIA's own testing (cited in the AI Enterprise release 8 docs) show MIG-backed vGPU delivers around 20 percent higher throughput than time-sliced vGPU at comparable profile sizes, because contention for the HBM bus is eliminated between MIG instances.

When is this worth the added complexity? When you need both strong isolation (MIG) and VM-level portability (vGPU live migration, NVIDIA-managed driver lifecycle inside the VM). The VCF deployment path for this combination is covered in the Private AI GPU partitioning post. On a pure Kubernetes stack, plain MIG is simpler and sufficient.

In practice: The most common mistake I see on new H100 clusters is using time-slicing as the default because it requires no licensing and no profile planning. The teams report good GPU utilization numbers in dashboards. What the dashboards do not show is the tail latency distribution: P99 latency on Team A's embedding API spikes 4-8x when Team B's training job lands on the same GPU. With MIG, P99 latency is bounded within the hardware slice. The difference shows up in SLA reviews, not in average utilization graphs.

When MIG Is the Wrong Choice

MIG is not always the answer. Three scenarios where I specifically do not use it:

Large model training. Tensor parallelism and pipeline parallelism for 70B+ models need the full NVLink bisection bandwidth across all GPUs in a node. Enabling MIG mode disables NVLink-connected multi-GPU for the partitioned card. If you need NVLink for model-parallel training, keep those GPUs in passthrough mode. See Part 7 on NVLink and NVSwitch for the scale-up topology.

Workloads that need dynamic GPU sizing. MIG geometry is static until you trigger a reconfiguration. A workload that needs 10 GB in off-peak hours but 70 GB during peak cannot grow its MIG slice without draining and reconfiguring. Time-slicing or full passthrough with Kubernetes resource requests handles bursty, variable workloads better.

GPUs older than Ampere. MIG does not exist on V100, T4, or older cards. On those, the sharing options are time-slicing or, on a supported hypervisor, vGPU.

My take: The step to complete before committing to a MIG profile is measuring actual model memory usage with nvidia-smi --query-compute-apps=pid,used_memory --format=csv under real production load, not just model parameter count. FP8 quantized models and KV cache together can push usage well above the naive parameter-count estimate. Size the MIG slice for the P95 memory footprint, not the theoretical minimum.
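A minimal sketch of the P95 sizing step, assuming one inference process per GPU so that each nvidia-smi sample is a single number in MiB. p95_mib is a hypothetical helper using the nearest-rank method; the sampling loop is commented out so the snippet runs without a GPU, and the ten sample values are made up for illustration.

```shell
# Nearest-rank P95 of memory samples (MiB), one value per line on stdin.
p95_mib() {
  sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * N) in integer math
      print v[rank]
    }'
}

# Collect samples on a GPU host (commented so the sketch runs anywhere):
# for i in $(seq 1 600); do
#   nvidia-smi --query-compute-apps=used_memory --format=csv,noheader,nounits
#   sleep 60
# done > samples.txt
# p95_mib < samples.txt

# Illustrative values only: nine steady readings and one KV-cache spike.
printf '%s\n' 7000 7100 7200 7300 7400 7500 7600 7700 7800 9500 | p95_mib   # prints 9500
```

In this made-up series the mean is about 7.6 GB, which would suggest a comfortable fit in 1g.10gb, while the P95 sits close to the slice limit.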

The Verdict

For any H100, H200, or B200 node that serves more than one tenant or more than one workload type simultaneously, MIG is the correct default. It is the only mechanism that gives you hard memory and compute boundaries backed by hardware, not software promises. The operational cost is real: profile planning requires knowing your model memory footprints, and reconfiguration disrupts running pods. The alternative is noisy-neighbor interference that shows up in P99 latency, not in average utilization dashboards, and that is a far worse problem to debug in production.

Use passthrough for large model training (70B+ requiring NVLink), for workloads with highly variable GPU memory demands, and for legacy GPU generations (pre-Ampere). Use vGPU when your platform is hypervisor-based and you need VM-level lifecycle management. Use time-slicing only for dev environments where SLA guarantees do not exist and profile planning overhead is not justified.

One thing to validate before any production deployment: confirm that your NVIDIA AI Enterprise license covers the GPU Operator version you are running. MIG itself does not require a license for bare-metal use, but the GPU Operator enterprise support, DCGM integration, and vGPU features are all NVAIE-licensed. The licensing model is covered in detail in the NVIDIA AI Guide pillar.

If you are running this on VMware Cloud Foundation and need the hypervisor-side configuration, the Private AI series covers that ground at GPU partitioning on VMware Private AI. The next post in this series moves to the interconnect: Part 7 covers NVLink and NVSwitch and what happens when your model outgrows a single GPU.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
