Dr. Pranay Jha


VMware Private AI Foundation Upgrade: Moving from VCF 9.0 to 9.1 Without Breaking Your GPUs (Private AI Series, Part 24)

A practical 9.0 to 9.1 upgrade runbook for VMware Private AI Foundation, plus a closing verdict on the platform after 24 parts. The order of operations, the vGPU driver branch trap, and host-by-host GPU domain remediation.

VMware Private AI Series · Part 24 of 24

TL;DR · Key Takeaways

  • VCF 9.1 went GA on 27 May 2026. The 9.0 to 9.1 move is orchestrated through VCF Operations and the new Management Services layer, not the old SDDC Manager bundle workflow you remember from 5.x.
  • The single thing that breaks GPU clusters on upgrade is a vGPU host VIB and guest driver branch mismatch. Stage the matching guest driver and GPU Operator version before you remediate ESXi hosts, never after.
  • Management domain first, GPU workload domains as a deliberate Day-N step. Do not let a maintenance window remediate GPU hosts while a model is still serving traffic.
  • vMotion of vGPU VMs needs a target host with the same profile and free framebuffer. A GPU-saturated cluster cannot evacuate a host, and the remediation stalls there.
  • Run the VCF 9.1 Upgrade Planning Tool first. It catches interop gaps the generic checklist misses.

Who this is for: VCF architects and admins running VMware Private AI Foundation with NVIDIA on VCF 9.0 who need to reach 9.1.

Prerequisites: a healthy 9.0 fleet, current backups, the 9.1 BOM, and a documented vGPU driver branch for every GPU workload domain.

Most VCF upgrades are boring, and boring is the goal. A Private AI upgrade is not, because you are moving two coupled platforms at once, not one: the VCF data plane underneath and the NVIDIA AI Enterprise stack riding on top of it. The host vGPU manager, the guest drivers inside your Deep Learning VMs, the GPU Operator reconciling on VKS, the NIM Operator, Private AI Services, and the pgvector instance on Data Services Manager all have to land on versions that agree with each other. Get the order wrong and the upgrade itself succeeds while every GPU in the estate quietly goes dark.

This is the last part of the series, so it does double duty: a practical 9.0 to 9.1 runbook, and a closing verdict on the platform after twenty-four parts of building it out. Start at the pillar if you landed here cold: the VMware Private AI complete guide.

Figure: The 9.0 to 9.1 Order of Operations. Each tier must be healthy before the next one starts.

  1. VCF Operations and Management Services: License appliance, fleet lifecycle, and the new 9.1 services cluster come first.
  2. Management Domain: NSX Manager, then vCenter, then ESXi (NSX host bits bundled with ESX).
  3. GPU Workload Domain (Day-N, deliberate): Drain GPU workloads, stage matching guest driver, then remediate hosts.
  4. AI Stack Reconcile: GPU Operator, NIM Operator, Private AI Services, DSM pgvector, DLVMs.
  5. Validate and Resume Serving: nvidia-smi on hosts, model endpoints return 200, retrieval quality unchanged.

The orchestration is top-down. Skipping a tier or reordering the GPU domain is how estates break.

What actually changes from 9.0 to 9.1

9.1 is an in-place evolution of 9.0, not the architectural rebuild that 5.2.x to 9.x was. The lifecycle now runs through VCF Operations and the consolidated Management Services layer, with a separate VCF License Appliance you deploy or update as part of the pass. On the Private AI side, Private AI Services gains a reworked Model Endpoint UI and UI-driven self-service enablement, so the model-serving control plane gets friendlier without changing the underlying Model Store and Model Runtime contract from Part 12. None of that is the risk. The risk lives entirely in the GPU enablement layer, which the VCF upgrade does not manage for you.

Here is the mental model that keeps people out of trouble: VCF upgrades the host. NVIDIA AI Enterprise owns everything above the host vGPU manager VIB, and the upgrade tooling will happily move the host VIB out from under a guest driver that no longer matches it. That single seam is where the field failures cluster.
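
The seam can be turned into a gate you run before any host is remediated. What follows is a minimal sketch, not a definitive compatibility check: it assumes the branch is simply the number before the first dot in a driver version, and that a matching branch number means a valid pairing. The function names and version strings are illustrative, and the real authority on what pairs with what is your 9.1 NVIDIA AI Enterprise BOM.

```shell
# Sketch only. Assumption: the "branch" is the text before the first dot,
# and equal branches mean a compatible pairing. The BOM is the authority.

# Print the branch of a version string such as 580.65.06.
branch_of() {
  printf '%s\n' "${1%%.*}"
}

# Succeed only when the host VIB version and guest driver version share a branch.
branches_match() {
  [ "$(branch_of "$1")" = "$(branch_of "$2")" ]
}

# Example with illustrative version strings (not taken from any BOM).
if branches_match "580.65.06" "570.124.03"; then
  echo "paired: safe to remediate"
else
  echo "mismatch: stage the guest driver first"
fi
```

With the illustrative values above the check reports a mismatch, which is the signal to stop and stage the guest driver before touching the host.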

Figure: The Branch Seam That Bites. Host vGPU manager VIB and guest driver must share a compatible branch.

  • CORRECT, staged together: guest driver (DLVM / GPU Operator) on branch 580.x, staged before host touch; host vGPU manager VIB (ESXi) on branch 580.x, from the 9.1 BOM. Result: vGPU active on guest boot.
  • WRONG, host moved first: guest driver (stale) on branch 570.x, never updated; host vGPU manager VIB (new) on branch 580.x, remediated by ESXi upgrade. Result: vGPU disabled, VM boots without GPU.

Branch numbers are illustrative. Validate the exact pairing in your 9.1 NVIDIA AI Enterprise BOM.

Disclaimer: This is a production change. Validate the full 9.1 BOM and the NVIDIA AI Enterprise interop matrix, confirm the host VIB and guest driver branch pairing, back up vCenter, NSX, SDDC inventory and your DSM databases, run the upgrade prechecks, and rehearse on a non-production workload domain before you touch the estate.

Step 1: Plan and pre-check before you touch anything

Run the VCF 9.1 Upgrade Planning Tool against your actual inventory first. It produces a tailored path and flags interop gaps that a generic checklist never will. Then capture, in writing, the current driver branch for every GPU workload domain. This is the input that decides whether step 3 is safe.

  1. Record the running vGPU manager VIB branch per GPU host: esxcli software vib list | grep -i nvidia
  2. Record the guest driver branch inside a representative DLVM and the GPU Operator driver version on each VKS cluster.
  3. Confirm the 9.1 NVIDIA AI Enterprise entitlement and that your NLS license server is reachable from the GPU hosts and guests.
  4. Snapshot or back up the DSM pgvector instances and export the Model Store inventory so a failed runtime upgrade does not cost you the vector data behind RAG.
  5. Run the VCF upgrade prechecks until they are clean. Do not start with a yellow precheck and a plan to fix it later.
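
Recording the branch per host by eye does not scale past a handful of hosts, so the first item in the list above is worth scripting. The sketch below is one possible filter, not a supported tool. It assumes the VIB name is the first column and the version the second column of the `esxcli software vib list` output, and that NVIDIA VIB names contain "nvd" or "nvidia". Verify both assumptions against a real host before trusting the output.

```shell
# Sketch: read `esxcli software vib list` output on stdin and print
# "<vib-name> <branch>" for every NVIDIA VIB found.
# Assumptions: column 1 is the VIB name, column 2 is the version, and
# NVIDIA VIB names contain "nvd" or "nvidia" (any case).
nvidia_vib_branches() {
  awk 'tolower($1) ~ /nvd|nvidia/ { split($2, v, "."); print $1, v[1] }'
}

# Hypothetical usage from an admin workstation (host name is a placeholder):
#   ssh root@<gpu-host> 'esxcli software vib list' | nvidia_vib_branches
```

Appending the result to a dated file per workload domain gives you the written record that the rest of the runbook depends on.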

If you run an isolated estate, the bundle staging is its own project. The mirroring and bootstrap problem is covered in the air-gapped Private AI Foundation walkthrough, and the same depot discipline applies to 9.1 upgrade bundles.

Step 2: Upgrade the management plane and management domain

This is the standard, well-trodden part. VCF Operations and the Management Services layer go first, including the License Appliance. Then the management domain in the fixed component order. None of your GPU workloads live here, so this phase carries normal VCF risk, not AI-specific risk. Let it complete and settle before going near a GPU host.

| Order | Component | Why it sits here | GPU risk |
|---|---|---|---|
| 1 | VCF Operations + License Appliance | Drives the rest of the lifecycle in 9.1 | None |
| 2 | SDDC / fleet lifecycle services | Coordinates domain remediation | None |
| 3 | NSX Manager (mgmt) | Manager before host bits and Edges | None |
| 4 | vCenter (mgmt) | Inventory and compatibility baseline | None |
| 5 | ESXi (mgmt) + NSX host bits | Host bits now ride with the ESX upgrade | None |
| 6 | GPU workload domain (Day-N) | Optional and deliberate after mgmt is done | HIGH |

Step 3: The GPU workload domain, where care actually matters

Workload domain upgrades are a Day-N procedure, performed after the management domain is on 9.1. For a GPU domain, treat that flexibility as a gift, not a chore. You get to drain, stage, and validate on your own clock instead of inside the management window. The order inside this step is the whole ballgame.

Figure: Can You Actually Evacuate the Host? A GPU-saturated cluster cannot drain, and remediation stalls there.

Decision: is there a spare host with a matching profile and free framebuffer?

  • YES: vMotion vGPU VMs off, enter maintenance mode, remediate.
  • NO: schedule a serving pause. Cordon the node, scale endpoints to zero, power off DLVMs, then enter maintenance mode.

If DRS cannot place a vGPU VM anywhere, the host never enters maintenance mode and the upgrade hangs.
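
The evacuation question can be answered on paper before the maintenance window opens. The sketch below is a deliberately simplified model: it assumes a hypothetical inventory format of one candidate host per line ("host, vGPU profile, free framebuffer in GB"), and it ignores DRS rules, GPU model, and vGPU placement policy, all of which affect real placement. Host names and the profile string are placeholders.

```shell
# Sketch: decide whether a vGPU VM has somewhere to go.
# Input on stdin: "<host> <vgpu-profile> <free-framebuffer-GB>" per line.
# Assumption: this inventory format is hypothetical and you build it yourself.
# Prints the first host that fits and returns 0, or returns 1 when none does.
find_evacuation_target() {
  awk -v profile="$1" -v need="$2" '
    $2 == profile && $3 + 0 >= need + 0 { print $1; found = 1; exit }
    END { exit found ? 0 : 1 }
  '
}

# Example with placeholder hosts: a VM on profile l40s-24q needing 24 GB.
if target=$(printf '%s\n' 'esx-gpu-02 l40s-24q 8' 'esx-gpu-03 l40s-24q 24' |
            find_evacuation_target l40s-24q 24); then
  echo "vMotion target available: $target"
else
  echo "no target: schedule a serving pause"
fi
```

If the check finds no target for any VM on the host, plan the serving pause up front rather than discovering the stall during remediation.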

The procedure that does not break GPUs:

  1. Stage the matching guest driver and GPU Operator version first. Update the GPU Operator driver to the branch that pairs with the incoming host VIB, and rebuild or update DLVM guest drivers to the same branch, before any host is remediated.
  2. Drain GPU workloads off the first host. Cordon the VKS node and let pods reschedule, or scale model endpoints to zero. vMotion vGPU VMs to a host with the same profile and free framebuffer. If nothing can take them, pause serving deliberately rather than forcing it.
  3. Put the host in maintenance mode and remediate. ESXi and the new host vGPU manager VIB land together.
  4. Verify the host sees GPUs before moving on: nvidia-smi on the host, profiles intact, no Xid errors in the logs.
  5. Repeat host by host. Resist the urge to remediate the whole cluster in one rolling pass while a model is still serving from it.
# confirm the host VIB branch after remediation
esxcli software vib list | grep -i nvidia

# host-side GPU sanity (run on the ESXi host)
nvidia-smi

# cordon a VKS node before draining for host maintenance
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# after upgrade, confirm the GPU Operator driver daemonset is Ready
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=40
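
Reading the pod list by eye is fine for one cluster, but a host-by-host pass benefits from a check that can fail loudly. The sketch below filters the pod listing and returns non-zero unless every driver daemonset pod is Running with all containers ready. It assumes the default `kubectl get pods` column layout (NAME, READY, STATUS) and pod names that begin with `nvidia-driver-daemonset`; both are assumptions to confirm against your own clusters.

```shell
# Sketch: read `kubectl get pods -n gpu-operator` output on stdin and fail
# unless every driver daemonset pod is Running with all containers ready.
# Assumptions: default column layout and the "nvidia-driver-daemonset" prefix.
# Seeing no driver pod at all is treated as a failure.
driver_pods_ready() {
  awk '
    $1 ~ /^nvidia-driver-daemonset/ {
      seen = 1
      split($2, r, "/")
      if ($3 != "Running" || r[1] != r[2]) { print "not ready: " $1; bad = 1 }
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Hypothetical usage after a host comes out of maintenance mode:
#   kubectl get pods -n gpu-operator | driver_pods_ready && echo "driver layer green"
```

Gating the move to the next host on this check keeps a bad driver pairing from spreading across the whole cluster.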

Step 4: Reconcile the AI stack and validate

With hosts on 9.1 and drivers paired, bring the rest of the stack back into agreement. Do not let Helm auto-upgrade the GPU Operator or NIM Operator to whatever is latest. Pin them to the versions your 9.1 BOM blesses, then reconcile in this order, validating at each layer rather than at the end.
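
One way to make the pinning rule hard to skip is to wrap the Helm call so that it refuses to run without an explicit version. The sketch below only prints the command it would run. The release, chart, and namespace names in the usage comment are illustrative assumptions, and the version value must come from your 9.1 BOM; nothing here knows what that version is.

```shell
# Sketch: build a Helm upgrade command that always carries an explicit chart version.
# Assumption: release, chart, and namespace names are whatever your deployment uses.
# The function only prints the command; it does not run Helm.
pinned_helm_upgrade() (
  release=${1:-} chart=${2:-} namespace=${3:-} version=${4:-}
  case "$version" in
    ""|latest)
      echo "refusing: no pinned version for $release" >&2
      exit 1
      ;;
  esac
  printf 'helm upgrade %s %s --namespace %s --version %s\n' \
    "$release" "$chart" "$namespace" "$version"
)

# Hypothetical usage (names are placeholders, version comes from the BOM):
#   pinned_helm_upgrade gpu-operator nvidia/gpu-operator gpu-operator "<version-from-BOM>"
```

Pass your usual values files alongside the printed command; the point of the wrapper is only that an unpinned upgrade never gets typed by accident.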

Figure: Validate Bottom-Up. Each layer must be green before you test the one above it. The stack, from top to bottom:

  • Model endpoints return 200, RAG retrieval quality unchanged
  • Private AI Services: Model Store + Model Runtime, new endpoint UI
  • DSM pgvector reachable, embeddings intact
  • NIM microservices healthy, NIM Operator reconciled
  • GPU Operator driver Ready, host vGPU VIB paired

The foundation layer is the GPU Operator. If it is not green, nothing above it is trustworthy.
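
The bottom-up rule maps naturally onto a loop that stops at the first layer that is not green. The sketch below shows only the control flow. The `check_*` function names are hypothetical, and their bodies are stubs that always succeed; replace each with a real check for that layer in your environment.

```shell
# Sketch: run layer checks from the foundation upward, stop at the first failure.
# Assumption: each check_<layer> function returns 0 when that layer is green.
# The stubs below are placeholders and always succeed.
check_gpu_operator()        { true; }  # e.g. driver daemonset pods Ready
check_nim()                 { true; }  # e.g. NIM Operator reconciled
check_dsm_pgvector()        { true; }  # e.g. database reachable, embeddings intact
check_private_ai_services() { true; }  # e.g. Model Store and Model Runtime up
check_model_endpoints()     { true; }  # e.g. endpoints return 200

validate_bottom_up() {
  for layer in gpu_operator nim dsm_pgvector private_ai_services model_endpoints; do
    if "check_$layer"; then
      echo "green: $layer"
    else
      echo "stop: $layer is not green, do not test the layers above it"
      return 1
    fi
  done
}

validate_bottom_up
```

Stopping early is the design choice that matters: a passing endpoint test on top of a failing driver layer tells you nothing useful.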

For the deeper signals to watch while you validate, the monitoring approach in GPU monitoring with VCF Operations tells you whether endpoints are genuinely healthy or just returning 200 with degraded throughput. And if a host comes back without GPUs, or pods crash-loop on the driver, walk the failure map in the Private AI troubleshooting guide before you start reinstalling things at random.


The series, in one verdict

Twenty-four parts in, the honest assessment is this. VMware Private AI Foundation with NVIDIA is the most coherent way to run private GenAI on infrastructure you already own and govern, and the gap between it and stitching together raw Kubernetes plus NIM plus a vector database yourself is real. You get sizing guidance, a model serving control plane, self-service, and a monitoring story that actually understands GPUs. That is not nothing, and for a regulated enterprise that cannot send data to a hyperscaler, it is close to the only sane option.

It is also not magic. The platform abstracts the data plane beautifully and stops at the host vGPU VIB, which means the operational burden of driver and operator interop never goes away. The agentic AI features are earlier than the marketing implies, and Agent Builder is a starting point, not a finished product. The teams that succeed treat this as VMware infrastructure with a demanding NVIDIA tenant on top, version-pin everything, and validate interop on every change. The teams that struggle expect it to behave like a SaaS endpoint. It does not.

My Take

Build it if you have a genuine data-residency or governance reason to keep inference in-house and the operational maturity to own a GPU enablement layer. If neither is true, a managed endpoint will be cheaper and calmer. Private AI Foundation rewards discipline and punishes shortcuts, which is exactly what you want from infrastructure that serves models to your business.

What’s Next

That closes the build-out. From here it is operations: keep your interop matrix current, rehearse the next upgrade on a spare workload domain, and revisit your sizing as model footprints change. If you have run a 9.0 to 9.1 Private AI upgrade already, what bit you that this runbook would have caught, and what did it miss? That feedback is what sharpens the next pass.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
