You buy an H100 SXM5 to run inference for ten internal teams. On day one, team A submits a Llama 3 request that spikes to 95 percent SM utilization. Teams B through J get throttled to nothing. The GPU is not "shared"; it is timeshared by whoever happens to be loudest. This is the default CUDA scheduling behavior and it is a production problem the moment you have more than one tenant on a node. Picking the right partitioning strategy before you deploy is the decision that determines whether your multi-tenant cluster stays predictable or turns into a noisy-neighbor complaint queue.
The Four Partitioning Modes
Before comparing the modes, it helps to be precise about what "isolation" means in this context. There are three independent dimensions: compute isolation (SMs cannot be stolen by a neighbor), memory isolation (each partition has exclusive, bounded HBM bandwidth), and fault isolation (an error in one partition does not tear down another). The four modes each deliver different subsets of these guarantees.
CUDA Time-Slicing
The CUDA scheduler context-switches between processes at a fixed interval, typically every millisecond. No hardware boundary separates tenants. If one process is running a large kernel, the scheduler will preempt it and hand the SM to the next process, but the preemption is coarse and the shared L2 cache and HBM bus are never partitioned. The latency impact from context switching is real: switching saves and restores SM register state, which takes time proportional to the register footprint of the outgoing kernel.
In Kubernetes, you enable time-slicing via a ConfigMap consumed by the GPU Device Plugin. The replicas field tells the plugin how many logical GPU resources to advertise per physical GPU. Setting replicas: 4 on an H100 makes Kubernetes think it has four GPUs; four pods get scheduled. The GPU itself receives all four contexts and time-slices among them. There is no memory cap per pod.
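As a rough sketch of what that ConfigMap can look like, the helper below prints one for a chosen replica count. The ConfigMap name `time-slicing-config`, the `gpu-operator` namespace, and the `any` config key are assumptions picked for illustration, not values taken from this article; check them against the device plugin version you run.

```shell
# Print a time-slicing ConfigMap for the NVIDIA device plugin.
# Assumed for illustration: ConfigMap name "time-slicing-config",
# namespace "gpu-operator", and config key "any".
render_time_slicing_config() {
  local replicas="${1:-4}"   # logical GPUs advertised per physical GPU
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: ${replicas}
EOF
}
```

On a cluster you would pipe the output into `kubectl apply -f -` and then point the device plugin at the ConfigMap, for example through the Operator's `devicePlugin.config.name` setting. Nothing in this config caps memory per pod, which is exactly the limitation described above.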
Full Passthrough
A single VM or container gets the entire GPU. On a bare-metal Kubernetes node this is the default: one pod with nvidia.com/gpu: 1 gets a full H100. In a virtualized environment, GPU passthrough assigns the PCIe device directly to one VM via VFIO/IOMMU. The VM sees the raw hardware with no mediated layer. Performance is close to bare metal. The downside is obvious: you cannot split the GPU across tenants. If your H100 SXM5 costs roughly $30,000 per card and your workload only needs 20 GB of HBM, you are leaving 60 GB idle.
NVIDIA vGPU
vGPU is a software mediation layer that runs in the hypervisor. The physical GPU is split into virtual GPU instances using a driver-level scheduler. Each VM sees a virtual GPU with a fixed memory allocation and a scheduler policy. Three scheduler modes exist: best-effort, equal-share, and fixed-share. Fixed-share gives each vGPU a guaranteed time budget per second, which is the closest vGPU gets to QoS without MIG hardware backing. Memory allocation is enforced; one vGPU cannot read another's framebuffer. Fault isolation is partial: a driver fault in one VM can trigger a GPU reset that affects all VMs on the card.
vGPU requires an NVIDIA AI Enterprise license and a supported hypervisor (VMware vSphere, Red Hat KVM, Nutanix AHV). If your platform is vSphere, this is the natural path. The Private AI series covers the VCF-specific deployment in detail at GPU partitioning on VMware Private AI.
Multi-Instance GPU (MIG)
MIG partitions a single GPU into up to seven independent GPU Instances (GIs) at the hardware level using GPC slices and memory channel groupings. Each GI gets its own SMs, L2 cache slice, HBM channel allocation, and copy engines. There is no shared bus between instances that an attacker or a faulty kernel can exploit. Within each GI, you create a Compute Instance (CI) that maps SMs to a schedulable CUDA context. A MIG device is the pair (GI, CI) that a container or VM sees as a GPU.
MIG is available on A100, A30, H100, H200, B200, and selected Blackwell professional GPUs. MIG mode must be enabled at the driver level, which triggers a brief GPU reset. You cannot mix MIG and non-MIG workloads on the same physical GPU simultaneously.
MIG Profile Geometry on H100, H200, and B200
Profile names follow a consistent convention: Ng.Mgb where N is the number of GPC slices and M is the HBM allocation. An H100 SXM5 80 GB card supports seven 1g.10gb slices (7 x 10 GB = 70 GB usable, the remaining HBM reserved for system overhead). You cannot mix arbitrary profiles on one GPU; NVIDIA publishes exact valid combinations in the MIG User Guide.
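The naming convention is regular enough to parse mechanically. Here is a small sketch that splits a profile name into its compute-slice count and memory size; the function name `parse_mig_profile` is my own, and it treats anything after a `+` (such as `+me`) as a suffix to ignore.

```shell
# Split a MIG profile name ("3g.40gb", "1g.10gb+me") into its two numbers.
# Prints "<compute_slices> <memory_gb>"; returns 1 on a malformed name.
parse_mig_profile() {
  local name slices mem
  name="${1%%+*}"        # drop an optional suffix such as "+me"
  slices="${name%%g.*}"  # digits before "g."
  mem="${name#*g.}"      # text after "g."
  mem="${mem%gb}"        # strip the trailing "gb"
  case "$slices" in ''|*[!0-9]*) echo "invalid profile: $1" >&2; return 1 ;; esac
  case "$mem"    in ''|*[!0-9]*) echo "invalid profile: $1" >&2; return 1 ;; esac
  echo "$slices $mem"
}

parse_mig_profile 3g.40gb      # prints: 3 40
parse_mig_profile 1g.10gb+me   # prints: 1 10
```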
The following table shows the complete profile set for H100 80 GB SXM5, which is the most common production variant as of mid-2026. Each row is a supported profile, and "Max Instances" is the most copies of that profile one card can hold; any combination that would exceed the GPC or memory budget is rejected.
| Profile | Memory | SM Fraction | Max Instances | Typical Use |
|---|---|---|---|---|
| 1g.10gb | 10 GB | 1/7 | 7 | Small LLM inference, embedding models |
| 1g.10gb+me | 10 GB | 1/7 | 1 | Video transcoding alongside inference |
| 1g.20gb | 20 GB | 1/7 | 4 | Medium models requiring more HBM |
| 2g.20gb | 20 GB | 2/7 | 3 | 7B param models in INT8 |
| 3g.40gb | 40 GB | 3/7 | 2 | 13B-34B param models in FP8 |
| 4g.40gb | 40 GB | 4/7 | 1 | Large single-tenant workload on half the card |
| 7g.80gb | 80 GB | 7/7 | 1 | Full GPU in MIG mode (single tenant) |
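Before reaching for the MIG User Guide, a quick budget check catches the obviously impossible layouts. The sketch below sums compute slices and memory for a list of profiles against the 7-slice, 80 GB budget of the H100 80 GB card in the table. It is a necessary condition only: real placement rules are stricter, so a layout that passes here can still be rejected by the driver. The function name `fits_h100_80gb` is hypothetical, and it expects well-formed `Ng.Mgb` names.

```shell
# Necessary-but-not-sufficient check: do the requested profiles fit within
# 7 compute slices and 80 GB on an H100 80 GB card?
# Prints the totals and returns 0 if both budgets hold, 1 otherwise.
fits_h100_80gb() {
  local total_slices=0 total_mem=0 p base slices mem
  for p in "$@"; do
    base="${p%%+*}"            # ignore a "+me" style suffix
    slices="${base%%g.*}"      # compute slices, e.g. 3 from 3g.40gb
    mem="${base#*g.}"          # memory part, e.g. 40gb
    mem="${mem%gb}"
    total_slices=$((total_slices + slices))
    total_mem=$((total_mem + mem))
  done
  echo "compute=${total_slices}/7 memory=${total_mem}/80GB"
  [ "$total_slices" -le 7 ] && [ "$total_mem" -le 80 ]
}
```

For example, `fits_h100_80gb 3g.40gb 3g.40gb` reports 6 of 7 slices and 80 of 80 GB and succeeds, while `fits_h100_80gb 4g.40gb 4g.40gb` fails on compute because it would need 8 slices.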
The 1g.20gb profile allocates 2/8 of the HBM channels but only 1/7 of the SMs. It is memory-heavy relative to compute. Workloads that are SM-bound will not benefit from the extra HBM. Use 2g.20gb if you need the matching compute fraction alongside the memory. The profile names do not hint at this asymmetry.
Comparison: What Each Mode Actually Delivers
| Criterion | Time-Slicing | Passthrough | vGPU | MIG |
|---|---|---|---|---|
| Memory isolation | None | Full (1 tenant) | Software-enforced | Hardware-enforced |
| Compute QoS | None | N/A | Scheduler policy | Hardware partition |
| Fault isolation | None | VM-level | Partial | Hardware boundary |
| Hardware required | Any NVIDIA GPU | Any with IOMMU | vGPU-capable | Ampere or later |
| License needed | None | NVAIE (for virt) | NVAIE | NVAIE (for support) |
| Reconfigure without reboot | Yes | N/A | Yes (profile change) | Yes (MIG Mgr drains pods) |
| Kubernetes scheduling | Replicas in DevicePlugin | nvidia.com/gpu: 1 | KubeVirt or vSphere | nvidia.com/mig-Ng.Mgb |
MIG and the Kubernetes GPU Operator
The NVIDIA GPU Operator v26.3.2 (released June 2026) deploys a MIG Manager DaemonSet on every MIG-capable node. MIG Manager watches the nvidia.com/mig.config node label and drives the full reconfiguration cycle: drain GPU pods, stop DCGM Exporter, call mig-parted to apply the new geometry, and restart everything. No manual nvidia-smi invocations required on a properly configured cluster.
Two MIG strategies exist: single (all GPUs on the node run the same profile) and mixed (different profiles on different GPUs, or a mixed profile like all-balanced on one GPU). Mixed strategy is more operationally complex because each pod must request the specific MIG resource type via nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, and so on. Kubernetes does not auto-bin-pack across MIG slice types without a custom scheduler.
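To make the mixed-strategy request concrete, here is a sketch that prints a minimal pod manifest asking for one slice of a given profile. The pod name, the container image tag, and the `nvidia-smi -L` command are placeholders I chose for a smoke test; only the `nvidia.com/mig-<profile>` resource name comes from the text above.

```shell
# Print a pod manifest that requests one MIG slice under the mixed strategy.
# Assumed for illustration: pod name "mig-smoke-test" and the CUDA base image tag.
render_mig_pod() {
  local profile="${1:-1g.10gb}"   # e.g. 1g.10gb, 2g.20gb, 3g.40gb
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-${profile}: 1
EOF
}
```

Piping `render_mig_pod 2g.20gb` into `kubectl apply -f -` should leave the pod Pending if no node advertises that slice type, which is a quick way to confirm the scheduler is matching on the exact profile.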
Starting with GPU Operator v26.3.0, MIG Manager auto-generates the ConfigMap for each node by querying the hardware via NVML at startup. You no longer need to maintain a hand-crafted default-mig-parted-config ConfigMap for standard profiles. Custom profiles still require a user-provided ConfigMap.
Operational Artifact: MIG Creation via nvidia-smi and GPU Operator
The two paths to configure MIG are: directly via nvidia-smi on bare metal (or inside the driver container), and via the GPU Operator node label on Kubernetes. Both are shown below.
Worked Example
Scenario: bare-metal H100 80 GB SXM5 node. Goal: create 7 x 1g.10gb slices, verify them, then reconfigure to 2 x 3g.40gb for a larger tenant.
# Step 1: enable MIG mode on GPU 0 (requires no active GPU processes)
nvidia-smi -i 0 -mig 1
# Expected output:
# Enabled MIG Mode for GPU 00000000:01:00.0
# All done.

# Step 2: create 7 GPU Instances using the 1g.10gb profile (profile ID 19 on H100 80GB)
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -i 0
# Expected output:
# Successfully created GPU instance ID 1 on GPU 0 ...
# (repeated 7 times)

# Step 3: create one Compute Instance per GPU Instance
nvidia-smi mig -cci -i 0

# Step 4: verify
nvidia-smi -L
# Expected output:
# GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-xxxxx)
#   MIG 1g.10gb Device 0: (UUID: MIG-xxxxx)
#   MIG 1g.10gb Device 1: (UUID: MIG-xxxxx)
#   ... (7 total)

# FAILURE MODE: if any GPU process (DCGM, driver health check) is still
# attached, nvidia-smi returns:
# Unable to destroy GPU instance: In use by another client
# Fix: stop DCGM (systemctl stop nvidia-dcgm) then retry.
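The scenario also calls for reconfiguring to 2 x 3g.40gb. On bare metal that means destroying the existing Compute Instances and GPU Instances before creating the new ones. The sketch below wraps those steps in a dry-run helper so you can review the exact commands first; the `run` and `reconfigure_to_3g40gb` names and the `DRY_RUN` variable are my own conventions, and the sketch assumes GPU 0 is already in MIG mode with no workloads attached.

```shell
# Reconfigure one GPU from its current MIG layout to 2 x 3g.40gb.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

reconfigure_to_3g40gb() {
  local gpu="${1:-0}"
  run nvidia-smi mig -dci -i "$gpu"                      # destroy all Compute Instances
  run nvidia-smi mig -dgi -i "$gpu"                      # destroy all GPU Instances
  run nvidia-smi mig -cgi 3g.40gb,3g.40gb -C -i "$gpu"   # create 2 GIs plus default CIs
  run nvidia-smi -L                                      # list the resulting MIG devices
}
```

Calling `reconfigure_to_3g40gb 0` with the default settings prints the four commands prefixed by `+`. The destroy steps are where an "In use by another client" error typically shows up if something still holds the GPU.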
To do the same via GPU Operator on Kubernetes (on a node named gpu-node-01):
# Set MIG strategy to single during install
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v26.3.2 \
--set mig.strategy=single
# Label the node to trigger MIG Manager reconfiguration
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
# Monitor reconfiguration progress
kubectl get node gpu-node-01 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# Transitions: pending -> success
# Verify 7 slices are advertised
kubectl get node gpu-node-01 -o json | jq '.status.capacity'
# Expected: "nvidia.com/gpu": "7" (with mig.strategy=single, MIG slices are advertised as nvidia.com/gpu)
# Reconfigure to 2 x 3g.40gb for a heavier tenant
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-3g.40gb --overwrite
# MIG Manager drains pods, reconfigures, and relabels state=success
Expected result after all-3g.40gb: nvidia.com/gpu.count: 2, nvidia.com/gpu.slices.gi: 3 on the node labels. Pods requesting nvidia.com/gpu: 1 will bind to one 3g.40gb instance.
Failure mode: if mig.config.state stays at pending more than two minutes, check MIG Manager logs: kubectl logs -n gpu-operator -l app=nvidia-mig-manager. Common cause is a zombie DCGM process holding the GPU context.
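Rather than eyeballing the label, you can poll it with a timeout. This is a minimal sketch built around the same jsonpath query used above; the function name `wait_for_mig_state`, the 120-second default (the two-minute rule of thumb), and the 5-second poll interval are my own choices.

```shell
# Poll the MIG Manager state label until it reports success, fails, or times out.
# Returns 0 on success, 1 on a reported failure, 2 on timeout.
wait_for_mig_state() {
  local node="$1" timeout="${2:-120}" interval="${3:-5}" elapsed=0 state=""
  while [ "$elapsed" -le "$timeout" ]; do
    state="$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
    case "$state" in
      success) echo "success after ${elapsed}s"; return 0 ;;
      failed)  echo "MIG Manager reported failed" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "still '${state}' after ${timeout}s; check the MIG Manager logs" >&2
  return 2
}
```

A typical call is `wait_for_mig_state gpu-node-01 || kubectl logs -n gpu-operator -l app=nvidia-mig-manager`, which dumps the logs only when the reconfiguration does not finish cleanly.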
Disclaimer
MIG reconfiguration is a disruptive operation in production. It evicts all running GPU pods on the target node. In a Kubernetes cluster with live inference traffic, cordon the node and drain GPU workloads before relabeling. On some cloud providers, enabling MIG mode requires a node reboot; in that case set the WITH_REBOOT environment variable on MIG Manager through the Helm values (migManager.env[0].name=WITH_REBOOT together with migManager.env[0].value=true). Test reconfiguration procedures in a staging cluster before running them against production nodes.
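One way to script the cordon, drain, relabel sequence is sketched below. Note the assumption: it uses a plain `kubectl drain`, which evicts every evictable pod on the node, not only GPU workloads, so treat it as a blunt starting point. The function name `safe_mig_relabel` is hypothetical.

```shell
# Cordon and drain a node, then apply a new MIG config label.
# Stops at the first step that fails so the label is never applied to a busy node.
safe_mig_relabel() {
  local node="$1" config="$2"
  if [ -z "$node" ] || [ -z "$config" ]; then
    echo "usage: safe_mig_relabel <node> <mig-config>" >&2
    return 1
  fi
  kubectl cordon "$node" &&
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
  kubectl label node "$node" "nvidia.com/mig.config=${config}" --overwrite
}
```

After the state label reports success, run `kubectl uncordon` on the node to bring it back into scheduling; the sketch leaves that as a deliberate manual step.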
Sizing a Shared Inference Cluster: A Decision Walk
The right profile depends on the models you are serving and how many teams need concurrent access. Here is how I work through it for a typical enterprise cluster with a mix of embedding, chat, and code completion services.
Worked Sizing Example: Three-Team Shared Cluster
Four H100 SXM5 80 GB GPUs in a single DGX H100 node. Three teams: Team A runs embedding inference (BGE-M3, fits in 8 GB), Team B runs Llama 3 8B FP8 (needs ~12 GB), Team C runs Llama 3 70B (needs all 80 GB for single-GPU inference).
Suggested layout: GPU 0 set to all-1g.10gb (7 slices for Team A, each embedding replica in one 1g.10gb instance). GPU 1 and GPU 2 set to all-2g.20gb (3 slices each, 6 total for Team B Llama 8B replicas). GPU 3 left in non-MIG passthrough mode for Team C's 70B model. Because this puts different profiles on different GPUs of the same node, use the GPU Operator mixed strategy with a custom MIG config, or dedicate a separate node to each profile.
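A per-GPU layout like this one is not covered by the stock all-* configs, so it needs a custom mig-parted config. The sketch below prints one for the four-GPU layout described here. The config name `three-team-layout` is hypothetical, and the way it is wired in (a ConfigMap that MIG Manager is pointed at through the Operator's `migManager.config.name` value) is my understanding of the Operator docs rather than something this article specifies.

```shell
# Print a custom mig-parted config for the three-team layout:
# GPU 0 -> 7 x 1g.10gb, GPUs 1-2 -> 3 x 2g.20gb each, GPU 3 -> MIG disabled.
# "three-team-layout" is a made-up config name used for illustration.
render_three_team_mig_config() {
  cat <<'EOF'
version: v1
mig-configs:
  three-team-layout:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
    - devices: [1, 2]
      mig-enabled: true
      mig-devices:
        "2g.20gb": 3
    - devices: [3]
      mig-enabled: false
EOF
}
```

Once the config is available to MIG Manager, labeling the node with `nvidia.com/mig.config=three-team-layout` would trigger the same reconfiguration cycle as the built-in profiles. Under the mixed strategy, Team A pods then request `nvidia.com/mig-1g.10gb`, Team B pods request `nvidia.com/mig-2g.20gb`, and Team C requests `nvidia.com/gpu`.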
What to validate before going live: collect per-instance SM activity and DCGM_FI_DEV_FB_USED for 24 hours under representative load. DCGM_FI_DEV_GPU_UTIL is generally not reported for individual MIG instances, so use a profiling field such as DCGM_FI_PROF_SM_ACTIVE for the compute side. If any 2g.20gb slice is consistently above 85 percent SM utilization, the model is outrunning the slice; move it to a 3g.40gb profile or add a second replica on a different slice. If HBM usage is consistently under 50 percent on a 1g.10gb slice, you may be able to double the replicas with time-slicing inside that MIG instance.
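Those two thresholds are easy to turn into a check you can run over exported metrics. The sketch below applies the 85 percent SM and 50 percent memory rules to one slice; it generalizes them to any profile, which is my simplification, and the function name `slice_recommendation` and the integer-percent inputs are assumptions about how you export the numbers.

```shell
# Apply the 85% SM / 50% memory rules of thumb to one MIG slice.
# Arguments: SM utilization (integer percent), framebuffer used (MB), framebuffer total (MB).
slice_recommendation() {
  local sm_util="$1" fb_used_mb="$2" fb_total_mb="$3"
  local mem_pct=$(( fb_used_mb * 100 / fb_total_mb ))   # integer percent, rounded down
  if [ "$sm_util" -gt 85 ]; then
    echo "upsize: SM ${sm_util}% > 85%; use a larger profile or add a replica"
  elif [ "$mem_pct" -lt 50 ]; then
    echo "headroom: memory ${mem_pct}% < 50%; more replicas may fit"
  else
    echo "ok: SM ${sm_util}%, memory ${mem_pct}%"
  fi
}

slice_recommendation 60 15000 20480   # prints: ok: SM 60%, memory 73%
```

Feed it sustained values (for example a 24-hour P95), not instantaneous samples, or a single burst will trigger an upsize recommendation.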
MIG-Backed vGPU: The Hybrid Mode
On a platform running KubeVirt or VMware vSphere, you can combine both technologies. A MIG slice becomes the hardware boundary and vGPU runs inside it, adding the VM abstraction. Benchmarks from NVIDIA's own testing (cited in the AI Enterprise release 8 docs) show MIG-backed vGPU delivers around 20 percent higher throughput than time-sliced vGPU at comparable profile sizes, because contention for the HBM bus is eliminated between MIG instances.
When is this worth the added complexity? When you need both strong isolation (MIG) and VM-level portability (vGPU live migration, NVIDIA-managed driver lifecycle inside the VM). The VCF deployment path for this combination is covered in the Private AI GPU partitioning post. On a pure Kubernetes stack, plain MIG is simpler and sufficient.
When MIG Is the Wrong Choice
MIG is not always the answer. Three scenarios where I specifically do not use it:
Large model training. Tensor parallelism and pipeline parallelism for 70B+ models need the full NVLink bisection bandwidth across all GPUs in a node. Enabling MIG mode disables NVLink-connected multi-GPU for the partitioned card. If you need NVLink for model-parallel training, keep those GPUs in passthrough mode. See Part 7 on NVLink and NVSwitch for the scale-up topology.
Workloads that need dynamic GPU sizing. MIG geometry is static until you trigger a reconfiguration. A workload that needs 10 GB in off-peak hours but 70 GB during peak cannot grow its MIG slice without draining and reconfiguring. Time-slicing or full passthrough with Kubernetes resource requests handles bursty, variable workloads better.
GPUs older than Ampere. MIG does not exist on V100, T4, or older cards. On those, your sharing options are time-slicing or, on a supported hypervisor, vGPU.
Measure actual GPU memory with nvidia-smi --query-compute-apps=pid,used_memory --format=csv under real production load; do not size from model parameter count alone. Model weights, even when FP8-quantized, plus the KV cache can push usage well above the naive parameter-count estimate. Size the MIG slice for the P95 memory footprint, not the theoretical minimum.
The Verdict
For any H100, H200, or B200 node that serves more than one tenant or more than one workload type simultaneously, MIG is the correct default. It is the only mechanism that gives you hard memory and compute boundaries backed by hardware, not software promises. The operational cost is real: profile planning requires knowing your model memory footprints, and reconfiguration disrupts running pods. The alternative is noisy-neighbor interference that shows up in P99 latency, not in average utilization dashboards, and that is a far worse problem to debug in production.
Use passthrough for large model training (70B+ requiring NVLink), for workloads with highly variable GPU memory demands, and for legacy GPU generations (pre-Ampere). Use vGPU when your platform is hypervisor-based and you need VM-level lifecycle management. Use time-slicing only for dev environments where SLA guarantees do not exist and profile planning overhead is not justified.
One thing to validate before any production deployment: confirm that your NVIDIA AI Enterprise license covers the GPU Operator version you are running. MIG itself does not require a license for bare-metal use, but the GPU Operator enterprise support, DCGM integration, and vGPU features are all NVAIE-licensed. The licensing model is covered in detail in the NVIDIA AI Guide pillar.
If you are running this on VMware Cloud Foundation and need the hypervisor-side configuration, the Private AI series covers that ground at GPU partitioning on VMware Private AI. The next post in this series moves to the interconnect: Part 7 covers NVLink and NVSwitch and what happens when your model outgrows a single GPU.
References
- NVIDIA MIG User Guide r610: Supported MIG Profiles (H100, H200, B200)
- NVIDIA GPU Operator with MIG Documentation (updated June 2026)
- NVIDIA AI Enterprise Release 8: MIG-Backed vGPU Features
- NVIDIA GPU Operator: Time-Slicing GPUs in Kubernetes



