TL;DR · Key Takeaways
- In NSX 9 the upgrade is no longer a standalone task. It folds into the VCF lifecycle and runs orchestrated from SDDC Manager. The NSX Upgrade Coordinator still exists, but it is the manual fallback, not the default path.
- For a full VCF major or minor upgrade the component order is fixed: SDDC Manager, then NSX, then vCenter, then ESX, then the NSX host finalize step. Pure patches can go in any order. Get the order wrong on a major and you will hit interop validation failures.
- NSX VIBs ship pre-packaged with ESX in VCF 9, so the host portion of an NSX upgrade rides along with the ESX upgrade rather than being a separate VIB push.
- Host upgrades run in maintenance mode (DRS evacuates the host) or in-place (no VM evacuation, no power-off). In-place is faster but narrows your rollback options. Pick deliberately.
- Edge clusters upgrade serially to preserve north-south forwarding. If your Tier-0 is active-active ECMP across only two Edges, you lose half your north-south capacity while each node upgrades. Size for it.
The most expensive NSX upgrade I have watched go sideways did not fail on a host or an Edge. It failed in the planning meeting, three weeks earlier, when someone said “NSX is just one more tile in SDDC Manager now, we will click it after lunch.” That sentence is half true, and the half that is false is where the outage lives. NSX 9 really did move the upgrade into the VCF lifecycle. It did not remove the data-plane reality that you are rolling code through every host and every Edge that carries production traffic.
The mental model shift: NSX upgrades are a VCF operation now
In the NSX-T 3.x world you logged into NSX Manager, opened the Upgrade Coordinator, uploaded an MUB bundle, and drove the whole thing yourself. That model still physically exists in NSX 9, but Broadcom’s supported and recommended path in VCF 9 is to run the upgrade through SDDC Manager. SDDC Manager pulls the validated bundle, checks interoperability against the rest of the fleet, and orchestrates NSX as one step in a larger sequence. When the upgrade runs from SDDC Manager, the Upgrade Coordinator in NSX Manager reflects the same progress. They are two windows onto one job, not two separate jobs.
This matters because the most common planning mistake is treating NSX as an island. It is not. In VCF 9 the NSX VIBs are pre-packaged with ESX, so the host-level NSX bits are upgraded as part of the ESX upgrade rather than as a separate push you schedule independently. The upshot: you cannot fully “finish” NSX before you touch ESX, because part of NSX lives inside the ESX image. There is a distinct NSX finalize step that lands after the ESX hosts are done.
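The fixed ordering is easy to encode as a guard in a runbook script. Below is a minimal sketch, assuming the major/minor sequence described in this article; the step labels (`sddc-manager`, `nsx-finalize`, and so on) and the `check_order` function are hypothetical names for illustration, not anything SDDC Manager exposes.

```sh
#!/bin/sh
# Expected sequence for a VCF major/minor upgrade, as described in this article.
# The short labels are hypothetical names chosen for this sketch.
EXPECTED_ORDER="sddc-manager nsx vcenter esx nsx-finalize"

# check_order STEP...
# Prints "ok" and returns 0 when the planned steps match the expected sequence.
# Otherwise prints the first mismatch and returns 1.
check_order() {
    step=1
    for want in $EXPECTED_ORDER; do
        if [ "$#" -eq 0 ]; then
            echo "step $step: expected $want, plan ended early"
            return 1
        fi
        if [ "$1" != "$want" ]; then
            echo "step $step: expected $want, got $1"
            return 1
        fi
        shift
        step=$((step + 1))
    done
    if [ "$#" -gt 0 ]; then
        echo "step $step: unexpected extra step $1"
        return 1
    fi
    echo "ok"
}
```

Calling `check_order sddc-manager vcenter nsx esx nsx-finalize` reports `step 2: expected nsx, got vcenter`. It is a paper check on your change plan only; SDDC Manager still does the real interoperability validation.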
Inside NSX: the sequence the Upgrade Coordinator enforces
Whether SDDC Manager drives it or you run it manually, the internal NSX upgrade order is the same and it is not negotiable. The Upgrade Coordinator upgrades itself first, then the Edge transport nodes, then the host transport nodes, and the NSX Manager cluster last. There is a reason the Managers go last: you want the newest control and management plane validating against an already-upgraded data plane, not the reverse. If you have ever wondered why the UI nags you to finish Edges before it will let you move on, that is why.
When to let SDDC Manager drive, and when not to
My default is to let SDDC Manager orchestrate. You get interoperability validation against vCenter and ESX for free, the bundle is the validated one, and the whole thing is auditable in one place. I reach for the standalone Upgrade Coordinator path in NSX Manager only in narrow cases: a support engagement where Broadcom asks for it, a brownfield NSX that is not yet fully under VCF lifecycle, or a targeted NSX-only fix where dragging the entire fleet into a maintenance window is the wrong trade. For most production VCF estates, hand-driving the Upgrade Coordinator when SDDC Manager would do it for you just adds a place to make a mistake.
Pre-upgrade: what I check before I touch anything
NSX 9.0.1 sharpened the pre-upgrade checks and made them faster and more focused, which is welcome. But the Upgrade Coordinator’s automated checks are necessary, not sufficient. The first thing I check is not a checkbox in the UI, it is whether the team has a current, restorable NSX backup with a passphrase someone can actually produce. I covered why that passphrase is the quiet failure point in NSX 9 backup and restore. If you cannot restore, you do not have a rollback. You have a hope.
From the NSX CLI I confirm the management cluster is genuinely stable before anything starts. A cluster showing one node DEGRADED will fail the upgrade at an awkward moment, so I want it green first.
# On each NSX Manager node, confirm cluster health is STABLE
get cluster status
# Check the upgrade state once a bundle is staged
get upgrade status
# Confirm transport-node connectivity before you roll hosts
get transport-nodes status
# From an ESX host: confirm the running NSX VIBs (host portion rides ESX in VCF 9)
esxcli software vib list | grep -i nsx
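If you capture the `get cluster status` output to a file, a small filter can turn "I want it green first" into a pass/fail gate. This is a sketch only. It assumes the health lines read like `Overall Status: STABLE` and `Group Status: DEGRADED`; check that against the output of your own build and adjust the pattern. The function name `cluster_is_stable` is hypothetical.

```sh
#!/bin/sh
# cluster_is_stable: reads saved `get cluster status` output on stdin.
# ASSUMPTION: health lines look like "Overall Status: STABLE" or
# "Group Status: DEGRADED". Adjust the pattern to match your real output.
# Returns 0 only if at least one status line was found and all are STABLE.
cluster_is_stable() {
    awk '
        /^[ \t]*(Overall|Group) Status:/ {
            seen++
            if ($NF != "STABLE") bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}
```

Usage would be along the lines of `cluster_is_stable < cluster-status.txt || echo "hold the change"`. The function fails closed: input with no recognizable status lines counts as not stable, so an empty or truncated capture cannot pass the gate by accident.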
Because NSX 9 exposes only the Policy API (the Manager API for logical networking and security was removed), any scripted pre-flight you wrote against the old Manager API needs to target the Policy tree. A quick read against the Policy API confirms the Coordinator is reachable and tells you the current state programmatically, which is handy if you are gating a change-management pipeline on it.
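# Read the Upgrade Coordinator status. Note this endpoint sits under /api/v1,
# not /policy/api/v1. -u 'admin' prompts for the password; -k skips TLS
# certificate verification, so drop it if the Manager presents a trusted cert.
curl -k -u 'admin' \
  https://NSX-MGR/api/v1/upgrade/status-summary
# Validate intent realization is clean before you start.
# This endpoint is typically queried per object: append an intent_path
# query parameter for the Policy object you want to check.
curl -k -u 'admin' \
  https://NSX-MGR/policy/api/v1/infra/realized-state/status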
| Pre-upgrade check | Why it bites if skipped | Where |
|---|---|---|
| Restorable backup + known passphrase | No real rollback without it | SFTP target |
| Cluster status STABLE | Degraded node fails the run mid-flight | get cluster status |
| Clean realized-state (no errors) | Unrealized intent gets carried into the new version | Policy API |
| DRS set to fully automated | Maintenance-mode host upgrades stall waiting on manual vMotion | vCenter |
| Interop validated for the target BOM | Wrong order on a major release fails validation | SDDC Manager |
Upgrading the hosts: maintenance mode vs in-place
This is the decision that shapes your maintenance window more than any other, and the one people make on autopilot. Host transport nodes upgrade in one of two modes, and they behave very differently.
Maintenance mode
The Coordinator asks vCenter to put the host into maintenance mode. With DRS fully automated, vSphere evacuates the running VMs to other hosts, the host upgrades clean, then it exits maintenance mode and rebalances. This is the conservative path. Nothing runs on a host while its NSX and ESX bits change, so if something goes wrong the blast radius is an empty host. The cost is time and capacity: you need somewhere for those VMs to go, and a serial walk through a large cluster is not quick.
In-place
In-place upgrades the host without entering maintenance mode and without powering off tenant VMs. It is faster and it does not demand spare capacity to evacuate onto, which is exactly why it is tempting on dense clusters. The trade is that the workloads stay live on a host whose data-plane code is changing, and your options if a host wedges mid-upgrade are narrower. I use in-place when capacity is genuinely tight or the cluster is too large to drain serially in the window, and I accept that I am trading rollback comfort for speed. On anything carrying sensitive production with headroom to evacuate, I default to maintenance mode.
| Dimension | Maintenance mode | In-place |
|---|---|---|
| VM disruption | vMotion to other hosts | None, VMs stay running |
| Spare capacity needed | Yes, room to evacuate | No |
| Speed per host | Slower (drain + rebalance) | Faster |
| Rollback comfort if a host wedges | Higher, host is empty | Lower, workloads are live |
| Where I use it | Production with headroom (my default) | Dense clusters, tight windows |
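The per-cluster decision can be written down as a rule of thumb. The sketch below encodes the reasoning in this section, nothing more. It assumes hosts are drained one at a time and that you supply your own measured minutes-per-host estimate. The function name `pick_host_mode` and its arguments are hypothetical.

```sh
#!/bin/sh
# pick_host_mode HOSTS MINUTES_PER_HOST WINDOW_MINUTES HAS_HEADROOM
# HAS_HEADROOM is "yes" when the cluster can absorb one evacuated host.
# ASSUMPTION: hosts upgrade serially, so total time = HOSTS * MINUTES_PER_HOST.
pick_host_mode() {
    hosts=$1
    per_host=$2
    window=$3
    headroom=$4
    serial_minutes=$((hosts * per_host))
    if [ "$headroom" != "yes" ]; then
        echo "in-place (no capacity to evacuate)"
    elif [ "$serial_minutes" -gt "$window" ]; then
        echo "in-place (serial drain needs ${serial_minutes} min, window is ${window} min)"
    else
        echo "maintenance-mode"
    fi
}
```

For example, `pick_host_mode 8 45 480 yes` picks maintenance mode, because 8 hosts at 45 minutes each fit in an 8-hour window. `pick_host_mode 32 45 480 yes` falls back to in-place because a serial drain would need 1440 minutes. Treat the answer as a starting point for the conversation, not a substitute for judgment about how sensitive the workloads are.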
Upgrading Edges without dropping north-south
Edge transport nodes carry your north-south traffic and your stateful services, so this is the stage where a sloppy plan shows up as a real outage. The Coordinator upgrades Edge nodes serially within an Edge cluster. For an auto-deployed Edge the workflow is mechanical: the bundle is prepared, the node is placed into maintenance mode, the new OS is downloaded, installed, and then the OS switch happens. One node at a time, the rest of the cluster keeps forwarding.
The trap is capacity, not mechanics. While one Edge is in maintenance mode, its share of the forwarding capacity is gone. If your Tier-0 runs active-active ECMP across two Edge nodes, taking one down for upgrade halves your north-south throughput for the duration of that node’s upgrade. On a quiet Sunday that is fine. During a busy window it is a brownout. This is the same Edge-sizing math from host and transport-node prep, just felt in reverse: you sized for N+1 so that losing one Edge to an upgrade does not hurt. If you sized for exactly two and called it redundant, the upgrade is when you find out it was not.
Worked example
Say your north-south peak is 18 Gbps and each large-form-factor Edge VM gives you roughly 12 Gbps of usable forwarding. Two Edges in active-active ECMP look fine on paper: 24 Gbps of capacity against an 18 Gbps peak, so 6 Gbps of headroom. Now upgrade one. You are down to a single Edge at 12 Gbps against an 18 Gbps peak, a 6 Gbps shortfall, for as long as that node is out of the forwarding path. A three-node cluster (36 Gbps total) drops to 24 Gbps during a node upgrade, still above peak. The lesson is blunt: if upgrades cannot run inside a true off-peak window, your Edge cluster needs N+1, not N. Size the cluster so the upgrade is boring.
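The same arithmetic as a reusable check, for plugging in your own numbers. This is a minimal sketch that assumes active-active ECMP spreads load evenly across nodes and that you work in whole Gbps. The function name `edge_upgrade_margin` is hypothetical.

```sh
#!/bin/sh
# edge_upgrade_margin NODES GBPS_PER_NODE PEAK_GBPS
# Prints the capacity left with one Edge node out for upgrade, and the margin
# against peak. A negative margin is a shortfall (a brownout at peak).
# ASSUMPTION: even ECMP load spread, integer Gbps values.
edge_upgrade_margin() {
    nodes=$1
    per_node=$2
    peak=$3
    remaining=$(( (nodes - 1) * per_node ))
    margin=$(( remaining - peak ))
    echo "remaining=${remaining} margin=${margin}"
}
```

With the figures above, `edge_upgrade_margin 2 12 18` prints `remaining=12 margin=-6` and `edge_upgrade_margin 3 12 18` prints `remaining=24 margin=6`.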
Failure modes that actually bite
Most NSX upgrade incidents are not exotic. They cluster around a handful of repeatable causes, and almost all of them are decided before the upgrade starts. Here are the ones I see most often on engagements, with the cause and the fix.
| Symptom | Cause | Fix |
|---|---|---|
| Host upgrade stalls waiting to enter maintenance mode | DRS not fully automated, or no capacity to evacuate | Set DRS to fully automated, free capacity, or switch that cluster to in-place |
| Edge upgrade hangs at OS switch | Edge resource contention or a node already unhealthy pre-upgrade | Confirm Edge health and resource reservations before starting, not after |
| Pre-checks flag unrealized intent | A config error sitting in the Policy tree before you ever upgraded | Fix realized-state errors first, do not carry them across versions |
| Validation fails on a major-version run | Component order wrong, or target BOM not interoperable | Follow the SDDC Manager order; let it validate the BOM for you |
| North-south brownout during Edge stage | Edge cluster sized N, not N+1 | Add an Edge, or move the window to true off-peak |
What I’d Do
Run NSX upgrades through SDDC Manager unless you have a specific reason not to, and treat the standalone Upgrade Coordinator as the exception. Before the window, get the realized state clean, confirm a restorable backup, and decide maintenance-mode versus in-place per cluster rather than as a blanket setting. Size your Edge cluster so losing one node to a serial upgrade does not show up in a dashboard. The integrated lifecycle in VCF 9 genuinely makes NSX upgrades less hand-driven than the NSX-T 3.x days. It does not make the data plane care any less about losing an Edge or a host. Plan for the data plane and the orchestration takes care of itself. If you want the wider VCF-level view of how this fits the full-stack upgrade, my VCF 9.1 upgrade and patching runbook walks the whole sequence. What is the first thing you will check before your next NSX window?
References
- Broadcom TechDocs: Upgrading NSX to Version 9 (VCF 9)
- Broadcom KB 390634: Update Sequence for VCF 9.0 and Compatible VMware Products
- Broadcom TechDocs: NSX 9.0.1 Release Notes
- Broadcom TechDocs: VCF 9.1 Patch Releases 9.1.0.x



