Dr. Pranay Jha


How to Upgrade VMware Cloud Foundation to 9.1: The Full-Stack Runbook (VCF 9 Series, Part 23)

VCF Operations now drives lifecycle, so the 9.1 upgrade starts there, not at SDDC Manager. Here is the full-stack sequence, the new Management Services CIDR and License Appliance prep, and the three failures that bite.

VCF 9 Series · Part 23 of 36

TL;DR · Key Takeaways

  • In VCF 9.x, VCF Operations is the lifecycle control point. The upgrade starts there, not at SDDC Manager. Get that order wrong and you stall halfway through.
  • 9.1 retires the standalone 9.0 Fleet Management Appliance and the Identity Broker, replacing them with a unified VCF Management Services cluster that needs its own fresh CIDR block (a /28 minimum, /27 if you have room) and DNS records.
  • A dedicated License Appliance is now a hard requirement, with its own IP and DNS record. License assignment failures after upgrade are one of the most common 9.1 stumbles (see KB 424533).
  • The strict sequence is: VCF Operations, depot, binaries, Management Services, dependency components (Avi/SRM), SDDC Manager, plan domain upgrade, prechecks, NSX, vCenter, ESX.
  • Patching (async bundles) and version upgrades use the same engine but do not carry the same risk. Treat patches as routine hygiene, not as optional work.
Who this is for: VCF administrators and architects running a 9.0.x or 5.2.x fleet and planning the move to 9.1, plus anyone owning ongoing patch cadence.

Prerequisites: a healthy management domain, external SFTP backups, depot/Broadcom Support entitlement, change-window authority, and free management-VLAN address space.

Here is the mistake I see teams make on the very first 9.1 attempt: they open SDDC Manager, look for the upgrade bundle, and find nothing useful. That is not a bug. In the 9.x model the lifecycle brain moved to VCF Operations, and 9.1 pushes that further by deleting the standalone Fleet Management Appliance entirely. If your mental model still has SDDC Manager driving the upgrade, you will fight the platform the whole way. This runbook walks the full-stack sequence end to end, calls out the three places people actually get stuck, and separates real version upgrades from routine patching.

Why the order is the whole game

VCF upgrades are not a set of independent component updates you can run in any order. Lifecycle, operations, SDDC Manager, NSX, vCenter and ESX have hard dependencies on each other, and 9.1 adds a new one: the unified VCF Management Services cluster has to own inventory before anything else moves. The official guidance is blunt about it. Broadcom states that upgrading to 9.1 “requires a strict component upgrade sequence,” and an incorrect sequence “results in errors.” That is not boilerplate. The most common self-inflicted failure I see is starting the SDDC Manager or domain upgrade before VCF Operations and Management Services are healthy, then spending the rest of the window unwinding a half-migrated fleet.

The sequence below is the one that holds up in practice. The diagram shows the dependency flow so you can brief a change board on it in one glance.

VCF 9.1 Upgrade Sequence (run top to bottom; each step gates the next)

  1. Upgrade VCF Operations to 9.1 (PAK). Fleet management migrates; the old appliance is powered off.
  2. Configure the online depot and download binaries.
  3. Deploy VCF Management Services. New /28 or /27 CIDR plus DNS; License Appliance.
  4. Dependency components: Avi, SRM, replication.
  5. Upgrade SDDC Manager to 9.1.
  6. Plan the domain upgrade and run prechecks.
  7. NSX, then vCenter, then ESX, in that order; vCenter needs a temporary IP.
  8. Validate from multiple panes, then re-backup.

Do before the window:

  • Carve a /28 (or /27) on the management VLAN
  • Create forward + reverse DNS for the new service FQDNs + License Appliance
  • Confirm SFTP backups are recent
  • Download all binaries up front
  • Stage a temporary IP for vCenter
  • Verify depot/entitlement access
  • Check the interop matrix for Avi/SRM

Never proceed if:

  • Prechecks are red
  • The Management Services task is not marked Successful
  • Backups are stale or unverified

VCF 9.1 upgrade sequence and the prep that has to happen before the change window opens.
Disclaimer: This is a production-change procedure. Validate the target 9.1 Bill of Materials and the interoperability matrix for every component (including Avi, SRM and vSphere Replication), confirm recent successful SFTP backups, take supported snapshots, run all prechecks, and test the path in a lab or non-production domain first. Understand your rollback boundaries before you touch anything.
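
The gating in the sequence above can be sketched in a few lines of Python. This is an illustration for briefing purposes, not a Broadcom tool: the stage names and the `next_allowed_stage` helper are my own shorthand for the order listed in this article.

```python
# Illustrative only: stage names are shorthand for the sequence in this article.
UPGRADE_SEQUENCE = [
    "vcf_operations",
    "depot_and_binaries",
    "management_services",
    "dependency_components",  # Avi, SRM, vSphere Replication
    "sddc_manager",
    "plan_and_prechecks",
    "nsx",
    "vcenter",
    "esx",
    "validate_and_backup",
]


def next_allowed_stage(completed):
    """Return the only stage that may run next, or None when all are done.

    `completed` is the list of stages already finished, in the order they ran.
    Raises ValueError if that history skips or reorders a stage.
    """
    completed = list(completed)
    expected_prefix = UPGRADE_SEQUENCE[:len(completed)]
    if completed != expected_prefix:
        raise ValueError(f"out-of-order history: {completed!r}")
    if len(completed) == len(UPGRADE_SEQUENCE):
        return None
    return UPGRADE_SEQUENCE[len(completed)]
```

The point of the strict prefix check is that there is exactly one legal next step at any time, which is the property the change board needs to understand.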

Step 0: The prep that actually saves the weekend

Three new-in-9.1 items cause most of the avoidable pain, so handle them before the change window, not during it.

  1. Carve a fresh CIDR block for VCF Management Services. You cannot reuse your 9.0.x IPs. While the new cluster comes up, the old 9.0.x components are still online orchestrating the upgrade, so reused IPs collide mid-flight. The wizard wants an IP pool in CIDR format, not a range, so you need at least a /28 (14 usable addresses, of which 12 must be free). Allocate a /27 if you can, for future scale-out.
  2. Stand up the License Appliance. 9.1 makes a dedicated license server a hard requirement, with its own IP and DNS record. Get it routed and resolvable before you start.
  3. Prepare DNS up front. Create forward and reverse records for the new service FQDNs (service-runtime, fleet-lifecycle, the license server, and so on) and validate resolution. The Management Services precheck fails on DNS or reachability errors, and that is not a fun thing to debug at 2 AM.
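
The CIDR arithmetic in item 1 is easy to sanity-check before you file the IPAM request. Here is a minimal sketch using Python's standard `ipaddress` module; the function names are mine, and the example addresses are placeholders rather than values from any real environment.

```python
import ipaddress


def usable_hosts(cidr):
    """Count usable host addresses in an IPv4 block (network and broadcast excluded)."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - 2, 0)


def colliding_ips(cidr, existing_ips):
    """Return the already-assigned addresses that fall inside the proposed block."""
    network = ipaddress.ip_network(cidr, strict=True)
    return [ip for ip in existing_ips if ipaddress.ip_address(ip) in network]
```

`usable_hosts("10.0.0.32/28")` returns 14 and `usable_hosts("10.0.0.32/27")` returns 30. Feed `colliding_ips` the addresses of your current 9.0.x components; anything it returns is a collision waiting to happen mid-upgrade. `strict=True` also rejects a block whose base address is not aligned to the prefix, which catches the most common typo.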

Then the usual hygiene: confirm a recent SDDC Manager backup on external SFTP, confirm backups for managed components, download every binary ahead of time, and stage a free temporary IP on the management subnet for the vCenter upgrade. If you ran the older 5.2.x to 9.x jump, much of this rhymes with the pitfalls in migrating older VCF to 9 and the things that bite you.
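
"Recent" deserves a number. A small sketch of a freshness gate follows; the 24-hour default is my assumption, not a Broadcom requirement, and how you obtain the last-backup timestamp (SFTP listing, task history, and so on) is left to you.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(last_backup, now=None, max_age=timedelta(hours=24)):
    """Return True if the last successful backup is no older than max_age.

    Both timestamps must be timezone-aware. The 24-hour default is an
    assumed policy for this sketch; set it to match your change process.
    """
    now = now or datetime.now(timezone.utc)
    if last_backup > now:
        raise ValueError("backup timestamp is in the future; check clocks and NTP")
    return now - last_backup <= max_age
```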

Step 1: Upgrade VCF Operations first

VCF Operations is the control point, so it goes first. The upgrade is a PAK file uploaded through the VCF Operations administrator interface, not a bundle from SDDC Manager.

# VCF Operations admin UI, primary node
Admin interface  ->  Software Update  ->  Install a Software Update
  1. Upload the VCF Operations 9.1 upgrade PAK
  2. Accept EULA and release info
  3. Start install (the admin UI restarts; you will be logged out)
  4. Log back in, watch Software Update until all nodes show ONLINE

# Fleet management migration prompt appears mid-upgrade:
  - provide the root password of the OLD 9.0 fleet mgmt appliance
  - the cloud proxy upgrades with the cluster
  - the old fleet management appliance is decommissioned + powered off

Do not move on while the cluster flickers between offline and going-online. Wait for every node and the cluster itself to report online. This is the step that quietly replaces the 9.0 Fleet Management Appliance and the Identity Broker with the unified Management Services model, so let it finish cleanly.
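
If you script the wait, the trick is to require several consecutive all-online polls so a brief flicker does not count as done. The sketch below assumes you supply your own `get_node_states` callable that returns a `{node: state}` dictionary; I am deliberately not showing a VCF Operations API call here, and the polling defaults are arbitrary.

```python
import time


def wait_until_stable_online(get_node_states, required_stable_polls=3,
                             poll_interval=30.0, timeout=3600.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Block until every node reports ONLINE for several consecutive polls.

    get_node_states is a caller-supplied callable returning {node_name: state}.
    Returns True once stable, False if the timeout expires first.
    """
    deadline = clock() + timeout
    stable_polls = 0
    while True:
        states = get_node_states()
        if states and all(state == "ONLINE" for state in states.values()):
            stable_polls += 1
            if stable_polls >= required_stable_polls:
                return True
        else:
            stable_polls = 0  # any flicker resets the count
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.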

Step 2: Depot, binaries, then Management Services

With Operations on 9.1, point it at the depot and pull everything you need before deploying the new service layer.

# Online depot
Build -> Depot Settings -> Configure (online depot)
  enter the activation code from the business services console
  verify the connection shows ACTIVE   # if not, fix this before anything else

# Pull binaries
Build -> Lifecycle -> (select VCF instance) -> Binary Management
  select the 9.1 bundles -> Download   # do this BEFORE the window

# Deploy the new control plane
Build -> Lifecycle -> "Deploy Management Services" banner
  CIDR block, e.g. 10.0.0.32/27   (gateway + VLAN)
  map FQDNs (service-runtime, fleet-lifecycle, ...) to IPs in that block
  run Management Services Precheck  ->  must be clean
  Deploy and wait for SUCCESSFUL
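
Before you press the precheck button, you can run the same kind of DNS check yourself. The sketch below takes the planned FQDN-to-IP map, confirms each IP sits inside the block, and verifies forward and reverse resolution with the standard `socket` module. The `check_dns` name and the injectable resolvers are my own; this does not replicate what the Management Services precheck actually tests.

```python
import ipaddress
import socket


def check_dns(fqdn_to_ip, cidr,
              forward=socket.gethostbyname,
              reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Return a list of problems for a planned FQDN-to-IP map (empty means clean)."""
    network = ipaddress.ip_network(cidr, strict=True)
    problems = []
    for fqdn, planned_ip in fqdn_to_ip.items():
        if ipaddress.ip_address(planned_ip) not in network:
            problems.append(f"{fqdn}: {planned_ip} is outside {network}")
        try:
            resolved = forward(fqdn)
            if resolved != planned_ip:
                problems.append(
                    f"{fqdn}: forward lookup gave {resolved}, planned {planned_ip}")
        except OSError as exc:  # gaierror and herror are OSError subclasses
            problems.append(f"{fqdn}: forward lookup failed ({exc})")
        try:
            name = reverse(planned_ip)
            if name.rstrip(".").lower() != fqdn.rstrip(".").lower():
                problems.append(
                    f"{planned_ip}: reverse lookup gave {name}, expected {fqdn}")
        except OSError as exc:
            problems.append(f"{planned_ip}: reverse lookup failed ({exc})")
    return problems
```

Run it from a machine that uses the same DNS servers as the management network, otherwise a clean result proves very little.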

SDDC Manager will refuse to proceed until Management Services is up and owns the inventory. That is by design. Once the task shows successful, lifecycle control has transferred to the new cluster and you can move to dependency components. For the bigger picture on how this fleet-level control plane is meant to operate day to day, see the VCF 9 fleet lifecycle management reference architecture.

Step 3: Dependency components before SDDC Manager

If Avi Load Balancer is deployed, upgrade it before SDDC Manager. Confirm the controller version against the 9.1 compatibility list, run Avi prechecks, verify controller cluster, service engine and virtual service health, then back up per the supported Avi procedure. Load balancer health problems are exactly the kind of thing that turns a clean platform upgrade into an incident bridge. If SRM or vSphere Replication is present, validate the supported path and confirm pairings, protection groups, recovery plans and replications are healthy first. These components sit slightly outside the normal SDDC Manager path, which is precisely why people forget them.

Step 4: SDDC Manager, domain plan, prechecks

Only now do you touch SDDC Manager, and you do it from VCF Operations.

# SDDC Manager
Build -> Lifecycle -> (VCF instance) -> SDDC Manager Updates
  download the 9.1 update -> Update Now

# Plan the management domain upgrade
Build -> Lifecycle -> (management domain) -> Upgrades -> Plan Domain Upgrade
  pick the VCF target version (+ custom component targets if needed)
  choose the upgrade scenario for vCenter and NSX Manager
  review -> submit

# Prechecks - non-negotiable
Run prechecks BEFORE NSX/vCenter/ESX. Validate:
  cluster + host health, vSAN health, NSX manager/edge health,
  vCenter services, certificate validity, DNS/NTP, backups, capacity

Do not proceed on red prechecks, and read the warnings too. Some “warnings” are future outages wearing polite clothing. At plan time you also choose optimized versus sequential; in production I lean sequential unless the window and environment genuinely justify parallel change.
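
The rule "no red, and no unread warnings" is simple enough to write down as a gate. This is a sketch under my own assumptions: the `PASS`/`WARNING`/`FAIL` labels and the `acknowledged` flag are invented for illustration and are not the field names VCF uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecheckResult:
    name: str
    status: str                 # "PASS", "WARNING" or "FAIL" (labels assumed for this sketch)
    acknowledged: bool = False  # a human has read and accepted this warning


def may_proceed(results):
    """Return (ok, blockers).

    Any non-PASS result blocks, except a WARNING that someone has explicitly
    acknowledged. An empty result list also blocks: prechecks never ran.
    """
    if not results:
        return False, ["no precheck results"]
    blockers = []
    for result in results:
        if result.status == "PASS":
            continue
        if result.status == "WARNING" and result.acknowledged:
            continue
        blockers.append(result.name)
    return not blockers, blockers
```

Unknown status strings land in the blocker list on purpose; a check you cannot interpret is not a check you passed.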

Step 5: NSX, then vCenter, then ESX

Drive each from Build -> Lifecycle -> Upgrades, in this order:

  1. NSX first. Click Upgrade Now, then validate management plane, control plane, transport nodes, edges and routing both before and after.
  2. vCenter next. This one needs the temporary network details you staged earlier, because the upgrade uses a temporary network configuration during the switchover. Take the snapshot, choose the backup option, configure the temp IP/subnet/gateway, then schedule.
  3. ESX last. Import the 9.1 image into Image Management, assign it to the target clusters, and roll out. In production, prefer a batched cluster or host selection over upgrading everything at once, and watch DRS, maintenance-mode transitions, workload evacuation and any vSAN resync as hosts cycle.
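
For the batched ESX rollout, it helps to write the batches down before the window rather than picking hosts on the fly. A minimal sketch follows, assuming you choose the batch size yourself; the helper and the host names are hypothetical, and the actual selection still happens in the Lifecycle UI.

```python
def plan_batches(targets, batch_size):
    """Split an ordered list of clusters or hosts into consecutive batches.

    The last batch may be smaller. batch_size is a judgment call based on
    spare capacity; nothing here checks DRS or vSAN headroom for you.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    targets = list(targets)
    return [targets[i:i + batch_size] for i in range(0, len(targets), batch_size)]
```

Pause between batches long enough to confirm evacuation and resync have settled before starting the next one.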

After everything completes, validate from more than one pane. Check VCF Operations, SDDC Manager, the lifecycle version view, NSX Manager, vCenter and the host level. Only then remove temporary snapshots and take fresh backups of the upgraded components. Certificate, identity and backup problems love to surface right after an upgrade, so if something looks off there, the fixes in VCF 9 certificate, identity and backup failures and how to fix them are the right next stop.
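
"Validate from more than one pane" boils down to comparing what each view reports against the target Bill of Materials. Here is a small sketch of that comparison; the data shapes are my assumption, and you would fill them in by hand or from whatever export you trust.

```python
def find_version_mismatches(reported, expected):
    """Compare what each pane reports against the target BOM.

    reported: {pane_name: {component: version_string}}
    expected: {component: version_string}
    Returns a sorted list of (pane, component, seen, wanted) tuples.
    Components missing from `expected` are ignored.
    """
    mismatches = []
    for pane, components in reported.items():
        for component, seen in components.items():
            wanted = expected.get(component)
            if wanted is not None and seen != wanted:
                mismatches.append((pane, component, seen, wanted))
    return sorted(mismatches)
```

Two panes disagreeing about the same component is the interesting case: it usually means an inventory sync has not caught up, not that the upgrade failed.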


Patching is not the same as upgrading

Two things confuse people here. First, the major-version move itself can have a gate: getting from some 9.0.x builds onto the 9.1 line may require applying a specific patch through the Admin Portal first, outside the normal Lifecycle UI. Check the release notes for your exact source build before you assume the standard flow applies. Second, async patches (security and bug-fix bundles released between full versions) run through the same Operations-driven lifecycle engine, but they are lower risk and should be routine. The teams that get burned are the ones treating every patch like a major upgrade and therefore deferring all of them, then carrying a stack of known CVEs for months.
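
One way to keep the two categories apart in a change calendar is to classify each planned move by its version numbers. The rule below is my own simplification, not Broadcom's definition: it assumes dotted numeric version strings and treats a change in the first two fields as a version upgrade and anything else as a patch. It knows nothing about the patch gate described above, so the release notes still win.

```python
def classify_change(source_version, target_version):
    """Label a planned move as "version_upgrade" or "patch".

    Assumed rule for this sketch: if the first two numeric fields differ
    (for example 9.0 to 9.1), it is a version upgrade; otherwise a patch.
    """
    def parse(version):
        return tuple(int(part) for part in version.split("."))

    source, target = parse(source_version), parse(target_version)
    if target <= source:
        raise ValueError("target version must be newer than source version")
    return "version_upgrade" if target[:2] != source[:2] else "patch"
```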

My take

Patch on a predictable cadence, keep full version upgrades to planned windows, and never let “we will do it with the next big upgrade” become your patch strategy.

Patching is not the same as upgrading: treat them differently, or you carry CVEs for months.

  • Async patches (routine): security + bug-fix bundles; lower risk, same engine; predictable cadence.
  • Version upgrades (planned): major version moves; strict component sequence; planned change windows.

Patch on a cadence; reserve the strict sequence for major version upgrades.

Three failures that bite, and what to do

License assignment failure after upgrade is common enough that Broadcom has a dedicated KB (424533); it usually traces back to the License Appliance not being deployed, resolvable, or reachable, so verify that appliance first. Binaries not appearing in VCF Operations almost always means the depot connection is not actually active or the entitlement is wrong, so re-check Depot Settings before blaming the bundle. And the “Import VCF Operations in Fleet Lifecycle” step can fail on a certificate SAN issue with VCF Operations for Networks, which is a certificate problem to fix rather than an upgrade to retry. The pattern across all three: the failure shows up at upgrade time but the root cause is prep work (DNS, depot, certificates) that was skipped earlier.
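
For the certificate SAN case, you can check coverage yourself before the window. The sketch below works on a certificate dictionary in the shape returned by Python's `ssl.SSLSocket.getpeercert()`; how you fetch that dictionary is out of scope here, and the `san_covers` helper with its simple wildcard handling is mine, not the validation logic VCF applies.

```python
def san_covers(cert, fqdn):
    """Check whether a certificate's DNS SAN entries cover fqdn.

    cert is a dict in the shape returned by ssl.SSLSocket.getpeercert().
    Supports exact matches and a single leftmost wildcard label.
    """
    wanted = fqdn.rstrip(".").lower()
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        pattern = value.rstrip(".").lower()
        if pattern == wanted:
            return True
        if pattern.startswith("*."):
            # "*.example.com" matches "a.example.com" but not "a.b.example.com"
            head, _, tail = wanted.partition(".")
            if head and tail == pattern[2:]:
                return True
    return False
```

If this returns False for the FQDN the import step connects to, reissue the certificate with the right SAN rather than retrying the task.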

Three upgrade failures and their real cause (each shows up at upgrade time; the root cause is prep that was skipped):

  • License assignment fails after upgrade. Root cause: License Appliance not deployed, resolvable or reachable (KB 424533).
  • Binaries not appearing in VCF Operations. Root cause: depot connection not active, or wrong entitlement.
  • “Import VCF Operations” step fails. Root cause: certificate SAN issue on VCF Operations for Networks.

The pattern: DNS, depot and certificates done up front prevent all three.

Fix these in prep, not at 2 AM during the window.

What I’d Do

Treat 9.1 as an architecture change, not a click-update. Spend a full prep cycle on the CIDR block, the License Appliance and DNS, rehearse the VCF Operations to Management Services hand-off in a non-production domain, and only then schedule the production window. Run the sequence top to bottom, refuse to advance on red prechecks, and validate from multiple panes before you call it done. Do that and the upgrade is uneventful, which is exactly what you want from infrastructure. What is the one component in your stack you always forget to check on the interoperability matrix? That is usually the one that bites.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
