TL;DR · Key Takeaways
- VCF 9.1 went GA on 27 May 2026. The 9.0 to 9.1 move is orchestrated through VCF Operations and the new Management Services layer, not the old SDDC Manager bundle workflow you remember from 5.x.
- The single thing that breaks GPU clusters on upgrade is a vGPU host VIB and guest driver branch mismatch. Stage the matching guest driver and GPU Operator version before you remediate ESXi hosts, never after.
- Management domain first, GPU workload domains as a deliberate Day-N step. Do not let a maintenance window remediate GPU hosts while a model is still serving traffic.
- vMotion of vGPU VMs needs a target host with the same profile and free framebuffer. A GPU-saturated cluster cannot evacuate a host, and the remediation stalls there.
- Run the VCF 9.1 Upgrade Planning Tool first. It catches interop gaps the generic checklist misses.
Most VCF upgrades are boring, and boring is the goal. A Private AI upgrade is not, because you are moving two coupled platforms at once rather than one: the VCF data plane underneath and the NVIDIA AI Enterprise stack riding on top of it. The host vGPU manager, the guest drivers inside your Deep Learning VMs, the GPU Operator reconciling on VKS, the NIM Operator, Private AI Services, and the pgvector instance on Data Services Manager all have to land on versions that agree with each other. Get the order wrong and the upgrade itself succeeds while every GPU in the estate quietly goes dark.
This is the last part of the series, so it does double duty: a practical 9.0 to 9.1 runbook, and a closing verdict on the platform after twenty-four parts of building it out. Start at the pillar if you landed here cold: the VMware Private AI complete guide.
What actually changes from 9.0 to 9.1
9.1 is an in-place evolution of 9.0, not the architectural rebuild that 5.2.x to 9.x was. The lifecycle now runs through VCF Operations and the consolidated Management Services layer, with a separate VCF License Appliance you deploy or update as part of the pass. On the Private AI side, Private AI Services gains a reworked Model Endpoint UI and UI-driven self-service enablement, so the model-serving control plane gets friendlier without changing the underlying Model Store and Model Runtime contract from Part 12. None of that is the risk. The risk lives entirely in the GPU enablement layer, which the VCF upgrade does not manage for you.
Here is the mental model that keeps people out of trouble: VCF upgrades the host. NVIDIA AI Enterprise owns everything above the host vGPU manager VIB, and the upgrade tooling will happily move the host VIB out from under a guest driver that no longer matches it. That single seam is where the field failures cluster.
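That seam can be checked mechanically before any host is touched. The following is a minimal sketch, assuming the leading number of an NVIDIA driver version identifies its branch and that host and guest need to agree on it. The authoritative pairing is NVIDIA's support matrix for your release. The function names and the version strings are made up for illustration.

```shell
# driver_branch VERSION
# Prints the first dot-separated field of a driver version string.
# Assumption: that field is the driver branch.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# branches_match HOST_VERSION GUEST_VERSION
# Returns 0 when both versions share a branch, non-zero otherwise.
branches_match() {
    [ "$(driver_branch "$1")" = "$(driver_branch "$2")" ]
}

# Example with placeholder version strings (not real releases)
if branches_match "570.1.2" "570.3.4"; then
    echo "paired"
else
    echo "MISMATCH: stage the guest driver before remediating hosts"
fi
```

Running the same comparison against the branch the 9.1 host VIB will bring in, rather than the one currently installed, is what tells you whether staging is still outstanding.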
Step 1: Plan and pre-check before you touch anything
Run the VCF 9.1 Upgrade Planning Tool against your actual inventory first. It produces a tailored path and flags interop gaps that a generic checklist never will. Then capture, in writing, the current driver branch for every GPU workload domain. This is the input that decides whether step 3 is safe.
- Record the running vGPU manager VIB branch per GPU host: `esxcli software vib list | grep -i nvidia`
- Record the guest driver branch inside a representative DLVM and the GPU Operator driver version on each VKS cluster.
- Confirm the 9.1 NVIDIA AI Enterprise entitlement and that your NLS license server is reachable from the GPU hosts and guests.
- Snapshot or back up the DSM pgvector instances and export the Model Store inventory so a failed runtime upgrade does not cost you the vector data behind RAG.
- Run the VCF upgrade prechecks until they are clean. Do not start with a yellow precheck and a plan to fix it later.
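To get the driver record "in writing", the VIB listing can be reduced to one line per host. This is a sketch only: the sample rows below are placeholders rather than real `esxcli` output, the host name is hypothetical, and the filter assumes the name and version are the first two columns and that NVIDIA rows contain the string "nvidia" in some case, as the `grep -i nvidia` above does.

```shell
# nvidia_vib_inventory HOSTNAME
# Reads `esxcli software vib list` output on stdin and prints
# "host,vib-name,version" for every row that mentions NVIDIA.
nvidia_vib_inventory() {
    awk -v host="$1" 'tolower($0) ~ /nvidia/ { print host "," $1 "," $2 }'
}

# Placeholder listing used only to demonstrate the filter
sample_output() {
cat <<'EOF'
Name                 Version        Vendor  Acceptance Level  Install Date
-------------------  -------------  ------  ----------------  ------------
example-gpu-driver   570.1.2-1OEM   NVIDIA  VMwareAccepted    2026-01-01
example-other-vib    1.0.0-1        VMW     VMwareCertified   2026-01-01
EOF
}

sample_output | nvidia_vib_inventory esx-gpu-01
```

On a real host you would pipe the live command output into the function and append the result to a file you keep alongside the upgrade plan.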
If you run an isolated estate, the bundle staging is its own project. The mirroring and bootstrap problem is covered in the air-gapped Private AI Foundation walkthrough, and the same depot discipline applies to 9.1 upgrade bundles.
Step 2: Upgrade the management plane and management domain
This is the standard, well-trodden part. VCF Operations and the Management Services layer go first, including the License Appliance. Then the management domain in the fixed component order. None of your GPU workloads live here, so this phase carries normal VCF risk, not AI-specific risk. Let it complete and settle before going near a GPU host.
| Order | Component | Why it sits here | GPU risk |
|---|---|---|---|
| 1 | VCF Operations + License Appliance | Drives the rest of the lifecycle in 9.1 | None |
| 2 | SDDC / fleet lifecycle services | Coordinates domain remediation | None |
| 3 | NSX Manager (mgmt) | Manager before host bits and Edges | None |
| 4 | vCenter (mgmt) | Inventory and compatibility baseline | None |
| 5 | ESXi (mgmt) + NSX host bits | Host bits now ride with the ESX upgrade | None |
| 6 | GPU workload domain (Day-N) | Optional and deliberate after mgmt is done | HIGH |
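The ordering in the table can be turned into a simple gate for a runbook script. This is a sketch under assumptions: the component identifiers are shorthand invented here, not names the VCF tooling uses, and completion is tracked by whatever you pass in by hand.

```shell
# Upgrade order from the table above (identifiers are hypothetical shorthand)
UPGRADE_ORDER="vcf-operations fleet-lifecycle nsx-manager vcenter esxi-mgmt gpu-workload-domain"

# may_start COMPONENT [COMPLETED...]
# Returns 0 only if every component that precedes COMPONENT is listed as completed.
may_start() {
    target=$1
    shift
    completed=" $* "
    for step in $UPGRADE_ORDER; do
        [ "$step" = "$target" ] && return 0
        case "$completed" in
            *" $step "*) ;;   # prerequisite is done, keep walking the order
            *) echo "blocked: $step is not done yet" >&2; return 1 ;;
        esac
    done
    echo "unknown component: $target" >&2
    return 2
}

may_start gpu-workload-domain vcf-operations fleet-lifecycle nsx-manager vcenter esxi-mgmt \
    && echo "GPU domain may proceed"
```

The point of the gate is the last row: the GPU workload domain stays blocked until every management step has been recorded as finished.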
Step 3: The GPU workload domain, where care actually matters
Workload domain upgrades are a Day-N procedure, performed after the management domain is on 9.1. For a GPU domain, treat that flexibility as a gift, not a chore. You get to drain, stage, and validate on your own clock instead of inside the management window. The order inside this step is the whole ballgame.
The procedure that does not break GPUs:
- Stage the matching guest driver and GPU Operator version first. Update the GPU Operator driver to the branch that pairs with the incoming host VIB, and rebuild or update DLVM guest drivers to the same branch, before any host is remediated.
- Drain GPU workloads off the first host. Cordon the VKS node and let pods reschedule, or scale model endpoints to zero. vMotion vGPU VMs to a host with the same profile and free framebuffer. If nothing can take them, pause serving deliberately rather than forcing it.
- Put the host in maintenance mode and remediate. ESXi and the new host vGPU manager VIB land together.
- Verify the host sees GPUs before moving on: `nvidia-smi` on the host, profiles intact, no Xid errors in the logs.
- Repeat host by host. Resist the urge to remediate the whole cluster in one rolling pass while a model is still serving from it.
```shell
# confirm the host VIB branch after remediation
esxcli software vib list | grep -i nvidia

# host-side GPU sanity (run on the ESXi host)
nvidia-smi

# cordon a VKS node before draining for host maintenance
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# after upgrade, confirm the GPU Operator driver daemonset is Ready
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=40
```
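Whether a host can be drained at all depends on the vMotion rule from the drain step: same profile, enough free framebuffer. A rough way to reason about it before entering maintenance mode is sketched below. It assumes you have already collected a per-host inventory of profile and free framebuffer in MB by some other means; the host names, profile names, and numbers are placeholders, and real vGPU placement has further constraints this simplification ignores.

```shell
# find_vmotion_target PROFILE NEEDED_MB
# Reads "host profile free_mb" lines on stdin and prints the first host
# with the same profile and enough free framebuffer. Exits 1 if none fits.
find_vmotion_target() {
    awk -v profile="$1" -v need="$2" '
        $2 == profile && $3 + 0 >= need + 0 { print $1; found = 1; exit }
        END { exit (found ? 0 : 1) }
    '
}

# Placeholder inventory: host, vGPU profile, free framebuffer in MB
inventory() {
cat <<'EOF'
esx-gpu-02 example-20c 0
esx-gpu-03 example-40c 20480
esx-gpu-04 example-40c 40960
EOF
}

if target=$(inventory | find_vmotion_target example-40c 40960); then
    echo "vMotion target: $target"
else
    echo "no capacity: pause serving deliberately instead of forcing the drain"
fi
```

If the function finds nothing, that is the GPU-saturated case: remediation would stall on that host, so scale endpoints down first.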
Step 4: Reconcile the AI stack and validate
With hosts on 9.1 and drivers paired, bring the rest of the stack back into agreement. Do not let Helm auto-upgrade the GPU Operator or NIM Operator to whatever is latest. Pin them to the versions your 9.1 BOM blesses, then reconcile in this order, validating at each layer rather than at the end.
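One way to make pinning hard to skip is to generate the Helm command from explicit version variables and refuse to proceed when one is empty. The sketch below only prints the command for review. The release names, namespaces, and chart references are assumptions about how the operators were installed in your clusters, the versions are placeholders to be replaced with what your 9.1 BOM specifies, and values handling is deliberately left out.

```shell
# helm_pin_cmd RELEASE CHART NAMESPACE VERSION
# Prints a version-pinned `helm upgrade` command; fails if VERSION is empty.
helm_pin_cmd() {
    if [ -z "$4" ]; then
        echo "refusing to upgrade $1 without a pinned version" >&2
        return 1
    fi
    printf 'helm upgrade %s %s --namespace %s --version %s\n' "$1" "$2" "$3" "$4"
}

# Placeholders: take the real values from your BOM, not from "latest"
GPU_OPERATOR_VERSION="vX.Y.Z"
NIM_OPERATOR_VERSION="vX.Y.Z"

helm_pin_cmd gpu-operator nvidia/gpu-operator gpu-operator "$GPU_OPERATOR_VERSION"
helm_pin_cmd nim-operator nvidia/k8s-nim-operator nim-operator "$NIM_OPERATOR_VERSION"
```

Printing rather than executing keeps a human in the loop: the command that runs is the command that was reviewed.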
For the deeper signals to watch while you validate, the monitoring approach in GPU monitoring with VCF Operations tells you whether endpoints are genuinely healthy or just returning 200 with degraded throughput. And if a host comes back without GPUs, or pods crash-loop on the driver, walk the failure map in the Private AI troubleshooting guide before you start reinstalling things at random.
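The "returning 200 with degraded throughput" case is easy to encode as a check if you captured a throughput baseline before the upgrade. This is a sketch, assuming you measure tokens per second yourself and treat anything under 80% of the pre-upgrade baseline as degraded. Both the metric and the threshold are illustrative choices, not values from the monitoring guide.

```shell
# endpoint_verdict HTTP_STATUS MEASURED_TPS BASELINE_TPS [MIN_RATIO]
# Prints "down", "degraded", or "healthy".
# MIN_RATIO defaults to 0.8 (an assumed threshold, tune it for your workload).
endpoint_verdict() {
    awk -v status="$1" -v tps="$2" -v base="$3" -v min="${4:-0.8}" 'BEGIN {
        if (status != 200)        { print "down";     exit }
        if (tps + 0 < base * min) { print "degraded"; exit }
        print "healthy"
    }'
}

# Placeholder measurements
endpoint_verdict 200 45 100
endpoint_verdict 200 95 100
```

A "degraded" verdict after remediation is a hint to look at driver pairing and GPU visibility on the host before blaming the model.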
The series, in one verdict
Twenty-four parts in, the honest assessment is this. VMware Private AI Foundation with NVIDIA is the most coherent way to run private GenAI on infrastructure you already own and govern, and the gap between it and stitching together raw Kubernetes plus NIM plus a vector database yourself is real. You get sizing guidance, a model serving control plane, self-service, and a monitoring story that actually understands GPUs. That is not nothing, and for a regulated enterprise that cannot send data to a hyperscaler, it is close to the only sane option.
It is also not magic. The platform abstracts the data plane beautifully and stops at the host vGPU VIB, which means the operational burden of driver and operator interop never goes away. The agentic AI features are earlier than the marketing implies, and Agent Builder is a starting point, not a finished product. The teams that succeed treat this as VMware infrastructure with a demanding NVIDIA tenant on top, version-pin everything, and validate interop on every change. The teams that struggle expect it to behave like a SaaS endpoint. It does not.
My Take
Build it if you have a genuine data-residency or governance reason to keep inference in-house and the operational maturity to own a GPU enablement layer. If neither is true, a managed endpoint will be cheaper and calmer. Private AI Foundation rewards discipline and punishes shortcuts, which is exactly what you want from infrastructure that serves models to your business.
What’s Next
That closes the build-out. From here it is operations: keep your interop matrix current, rehearse the next upgrade on a spare workload domain, and revisit your sizing as model footprints change. If you have run a 9.0 to 9.1 Private AI upgrade already, what bit you that this runbook would have caught, and what did it miss? That feedback is what sharpens the next pass.
References
- Upgrading to VMware Cloud Foundation 9.1, Broadcom TechDocs
- VMware Private AI Foundation with NVIDIA 9.1, Broadcom TechDocs
- Announcing the VCF 9.1 Upgrade Planning Tool, VCF Blog
- Upgrade Sequence and Related Issues for VCF and vSphere Foundation 9.1
- Installing and Configuring NVIDIA AI Enterprise Host Software, NVIDIA Docs