Dr. Pranay Jha


NSX 9 Upgrades and Lifecycle: VCF Integrated LCM, Upgrade Coordinator and the Order That Matters (NSX Series, Part 20)

In NSX 9, upgrades fold into the VCF lifecycle and run from SDDC Manager, with Upgrade Coordinator as the manual fallback. Here is the order that matters, how host and Edge upgrades really behave, and the failure modes that catch teams mid-window.

NSX Series · Part 20 of 30

TL;DR · Key Takeaways

  • In NSX 9 the upgrade is no longer a standalone task. It folds into the VCF lifecycle and runs orchestrated from SDDC Manager. The NSX Upgrade Coordinator still exists, but it is the manual fallback, not the default path.
  • For a full VCF major or minor upgrade the component order is fixed: SDDC Manager, then NSX, then vCenter, then ESX, then the NSX host finalize step. Pure patches can go in any order. Get the order wrong on a major and you will hit interop validation failures.
  • NSX VIBs ship pre-packaged with ESX in VCF 9, so the host portion of an NSX upgrade rides along with the ESX upgrade rather than being a separate VIB push.
  • Host upgrades run in maintenance mode (DRS evacuates the host) or in-place (no VM evacuation, no power-off). In-place is faster but narrows your rollback options. Pick deliberately.
  • Edge clusters upgrade serially to preserve north-south forwarding. If your Tier-0 is active-active ECMP across only two Edges, you lose half your throughput for the duration. Size for it.

Who this is for: VCF and NSX administrators, network and platform architects, and consultants planning or running an NSX 9 upgrade inside VCF 9.

Prerequisites: a working NSX 9 deployment on VCF 9, a recent NSX backup, and access to SDDC Manager and the NSX Manager UI. Familiarity with transport nodes and Edge clusters from earlier Parts helps.

The most expensive NSX upgrade I have watched go sideways did not fail on a host or an Edge. It failed in the planning meeting, three weeks earlier, when someone said “NSX is just one more tile in SDDC Manager now, we will click it after lunch.” That sentence is half true, and the half that is false is where the outage lives. NSX 9 really did move the upgrade into the VCF lifecycle. It did not remove the data-plane reality that you are rolling code through every host and every Edge that carries production traffic.

The mental model shift: NSX upgrades are a VCF operation now

In the NSX-T 3.x world you logged into NSX Manager, opened the Upgrade Coordinator, uploaded the .mub upgrade bundle, and drove the whole thing yourself. That model still exists in NSX 9, but Broadcom’s supported and recommended path in VCF 9 is to run the upgrade through SDDC Manager. SDDC Manager pulls the validated bundle, checks interoperability against the rest of the fleet, and orchestrates NSX as one step in a larger sequence. When the upgrade runs from SDDC Manager, the Upgrade Coordinator in NSX Manager reflects the same progress. They are two windows onto one job, not two separate jobs.

This matters because the most common planning mistake is treating NSX as an island. It is not. In VCF 9 the NSX VIBs are pre-packaged with ESX, so the host-level NSX bits are upgraded as part of the ESX upgrade rather than as a separate push you schedule independently. The upshot: you cannot fully “finish” NSX before you touch ESX, because part of NSX lives inside the ESX image. There is a distinct NSX finalize step that lands after the ESX hosts are done.

VCF 9 core upgrade order (order matters for major and minor releases; patches can apply in any order):
  1. SDDC Manager (orchestrator)
  2. NSX (Manager + Edge)
  3. vCenter
  4. ESX (NSX VIBs ride along)
  5. NSX finalize (after hosts)
SDDC Manager drives the whole chain. You approve and monitor; you do not hand-walk each component.
Diagram 1: The fixed component order for a VCF 9 major or minor upgrade. NSX sits second, but its host portion finishes only after ESX.
In practice: the order in Diagram 1 only binds for major and minor version jumps. Patch releases (for example a 9.1.0.x cumulative patch) can be applied in any order. Teams burn hours rebuilding a strict sequence for a patch that did not need one. Read the release notes and check whether you are doing a version jump or a patch before you build the runbook.
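
The rule in that callout can be written down as a small runbook check. This is a minimal sketch under stated assumptions: the component labels, the `release_type` values, and the `validate_runbook` function are hypothetical names chosen for illustration, not anything SDDC Manager exposes.

```python
# Hypothetical runbook linter: names and labels are illustrative, not a VCF API.
FIXED_ORDER = ["sddc-manager", "nsx", "vcenter", "esx", "nsx-finalize"]


def validate_runbook(steps, release_type):
    """Return a list of problems; an empty list means the order is acceptable.

    release_type is "major", "minor", or "patch". Only major and minor
    releases are held to FIXED_ORDER; patches may apply in any order.
    """
    unknown = [s for s in steps if s not in FIXED_ORDER]
    if unknown:
        return [f"unknown component: {s}" for s in unknown]
    if release_type == "patch":
        return []
    if release_type not in ("major", "minor"):
        return [f"unknown release type: {release_type}"]

    problems = [f"missing component: {s}" for s in FIXED_ORDER if s not in steps]
    # Compare the planned steps with the fixed order restricted to those steps.
    expected = [s for s in FIXED_ORDER if s in steps]
    if list(steps) != expected:
        problems.append(f"order is {list(steps)}, expected {expected}")
    return problems


if __name__ == "__main__":
    # vCenter planned ahead of NSX on a major release: flagged.
    print(validate_runbook(
        ["sddc-manager", "vcenter", "nsx", "esx", "nsx-finalize"], "major"))
    # Any order is accepted for a patch.
    print(validate_runbook(["nsx", "esx"], "patch"))
```

A check like this only catches a mis-ordered plan on paper. The interoperability validation that SDDC Manager runs is still the authority on whether the target BOM is acceptable.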

Inside NSX: the sequence the Upgrade Coordinator enforces

Whether SDDC Manager drives it or you run it manually, the internal NSX upgrade order is the same and it is not negotiable. The Upgrade Coordinator upgrades itself first, then the Edge transport nodes, then the host transport nodes, and the NSX Manager cluster last. There is a reason the Managers go last: you want the newest control and management plane validating against an already-upgraded data plane, not the reverse. If you have ever wondered why the UI nags you to finish Edges before it will let you move on, that is why.

When to let SDDC Manager drive, and when not to

My default is to let SDDC Manager orchestrate. You get interoperability validation against vCenter and ESX for free, the bundle is the validated one, and the whole thing is auditable in one place. I reach for the standalone Upgrade Coordinator path in NSX Manager only in narrow cases: a support engagement where Broadcom asks for it, a brownfield NSX that is not yet fully under VCF lifecycle, or a targeted NSX-only fix where dragging the entire fleet into a maintenance window is the wrong trade. For most production VCF estates, hand-driving the Upgrade Coordinator when SDDC Manager would do it for you just adds a place to make a mistake.

The NSX-internal upgrade sequence (fixed order; the Managers upgrade last, against an already-new data plane):
  1. Upgrade Coordinator (upgrades itself first)
  2. Edge nodes (serial, per Edge cluster)
  3. Host nodes (ride the ESX image)
  4. NSX Manager cluster (last, all three nodes)
Gate at each stage: the Coordinator will not advance to hosts until Edges report healthy, and will not advance to the Managers until hosts are done. Forcing past a failed stage is how a clean upgrade becomes an outage.
Diagram 2: The Upgrade Coordinator gates each stage. Edges, then hosts, then Managers, never the other way around.
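
The gating behaviour is easy to reason about as a loop that refuses to continue past an unhealthy stage. The sketch below is a toy model only, assuming a hypothetical `stage_results` mapping that you fill in by hand; it is not how the Upgrade Coordinator is implemented.

```python
# Illustrative model of stage gating; not the Upgrade Coordinator's real code.
STAGES = ["upgrade-coordinator", "edges", "hosts", "managers"]


def run_upgrade(stage_results):
    """Walk STAGES in order and stop at the first stage that is not healthy.

    stage_results maps a stage name to True (healthy after upgrade) or
    False (failed). A stage missing from the mapping counts as failed.
    Returns (completed_stages, blocked_stage_or_None).
    """
    completed = []
    for stage in STAGES:
        if not stage_results.get(stage, False):
            # The gate holds: later stages are never attempted.
            return completed, stage
        completed.append(stage)
    return completed, None


if __name__ == "__main__":
    # Edges fail, so hosts and Managers are never touched.
    print(run_upgrade({"upgrade-coordinator": True, "edges": False,
                       "hosts": True, "managers": True}))
```

The useful property is that a failure at the Edge stage leaves hosts and Managers on the old version, which is a far easier state to recover from than a half-forced run.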

Pre-upgrade: what I check before I touch anything

NSX 9.0.1 sharpened the pre-upgrade checks and made them faster and more focused, which is welcome. But the Upgrade Coordinator’s automated checks are necessary, not sufficient. The first thing I check is not a checkbox in the UI, it is whether the team has a current, restorable NSX backup with a passphrase someone can actually produce. I covered why that passphrase is the quiet failure point in NSX 9 backup and restore. If you cannot restore, you do not have a rollback. You have a hope.

From the NSX CLI I confirm the management cluster is genuinely stable before anything starts. A cluster showing one node DEGRADED will fail the upgrade at an awkward moment, so I want it green first.

# On each NSX Manager node, confirm cluster health is STABLE
get cluster status

# Check the upgrade state once a bundle is staged
get upgrade status

# Confirm transport-node connectivity before you roll hosts
get transport-nodes status

# From an ESX host: confirm the running NSX VIBs (host portion rides ESX in VCF 9)
esxcli software vib list | grep -i nsx

Because NSX 9 exposes only the Policy API (the Manager API for logical networking and security was removed), any scripted pre-flight you wrote against the old Manager API needs to target the Policy tree. A quick read against the Policy API confirms the Coordinator is reachable and tells you the current state programmatically, which is handy if you are gating a change-management pipeline on it.

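# Read the Upgrade Coordinator status summary.
# Upgrade endpoints sit under /api/v1; logical networking and security are Policy-only in NSX 9.
curl -k -u 'admin' \
  https://NSX-MGR/api/v1/upgrade/status-summary

# Validate intent realization is clean before you start.
# This endpoint expects an intent_path query parameter; POLICY-PATH is a placeholder for the object you are checking.
curl -k -u 'admin' \
  'https://NSX-MGR/policy/api/v1/infra/realized-state/status?intent_path=POLICY-PATH'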

| Pre-upgrade check | Why it bites if skipped | Where |
| --- | --- | --- |
| Restorable backup + known passphrase | No real rollback without it | SFTP target |
| Cluster status STABLE | Degraded node fails the run mid-flight | get cluster status |
| Clean realized-state (no errors) | Unrealized intent gets carried into the new version | Policy API |
| DRS set to fully automated | Maintenance-mode host upgrades stall waiting on manual vMotion | vCenter |
| Interop validated for the target BOM | Wrong order on a major release fails validation | SDDC Manager |
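
If you gate a change-management pipeline on these checks, the go/no-go logic can stay very small. The following is a sketch only: the `Check` class and `preflight_gate` function are hypothetical, and the pass/fail values are assumed to come from whatever you already use to collect each result (CLI output, API reads, or a human sign-off).

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    passed: bool
    blocking: bool = True  # every row in the table above is treated as blocking


def preflight_gate(checks):
    """Return (go, failures): go is False if any blocking check failed."""
    failures = [c.name for c in checks if not c.passed]
    go = not any(c.blocking and not c.passed for c in checks)
    return go, failures


if __name__ == "__main__":
    results = [
        Check("Restorable backup + known passphrase", True),
        Check("Cluster status STABLE", False),
        Check("Clean realized-state", True),
        Check("DRS fully automated", True),
        Check("Interop validated for target BOM", True),
    ]
    go, failures = preflight_gate(results)
    print("GO" if go else "NO-GO", failures)
```

Keeping the gate as a pure function over collected results means the same logic works whether the inputs are scripted or ticked off by hand in the change record.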

Upgrading the hosts: maintenance mode vs in-place

This is the decision that shapes your maintenance window more than any other, and the one people make on autopilot. Host transport nodes upgrade in one of two modes, and they behave very differently.

Maintenance mode

The Coordinator asks vCenter to put the host into maintenance mode. With DRS fully automated, vSphere evacuates the running VMs to other hosts, the host upgrades clean, then it exits maintenance mode and rebalances. This is the conservative path. Nothing runs on a host while its NSX and ESX bits change, so if something goes wrong the blast radius is an empty host. The cost is time and capacity: you need somewhere for those VMs to go, and a serial walk through a large cluster is not quick.

In-place

In-place upgrades the host without entering maintenance mode and without powering off tenant VMs. It is faster and it does not demand spare capacity to evacuate onto, which is exactly why it is tempting on dense clusters. The trade is that the workloads stay live on a host whose data-plane code is changing, and your options if a host wedges mid-upgrade are narrower. I use in-place when capacity is genuinely tight or the cluster is too large to drain serially in the window, and I accept that I am trading rollback comfort for speed. On anything carrying sensitive production with headroom to evacuate, I default to maintenance mode.

Host upgrade: pick the mode on purpose. Capacity and risk tolerance decide the path, not habit.
Spare capacity to evacuate?
  • Yes: Maintenance mode. DRS evacuates VMs, host drains, upgrades empty, rejoins. Safer, slower, needs headroom.
  • No / tight: In-place. No evacuation, no power-off, VMs stay live during upgrade. Faster, denser, narrower rollback.
Diagram 3: The host-mode decision. If you have the headroom and the time, evacuate. If you do not, go in-place with eyes open.

| Dimension | Maintenance mode | In-place |
| --- | --- | --- |
| VM disruption | vMotion to other hosts | None, VMs stay running |
| Spare capacity needed | Yes, room to evacuate | No |
| Speed per host | Slower (drain + rebalance) | Faster |
| Rollback comfort if a host wedges | Higher, host is empty | Lower, workloads are live |
| My default | Production with headroom | Dense clusters, tight windows |
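
The capacity half of that decision reduces to one inequality: can the remaining hosts carry the current load while one host is drained? Here is a minimal sketch of that check, assuming identical hosts and a single constraining resource. The function names are hypothetical, and the model ignores HA admission control, affinity rules, and the length of your window, all of which can change the answer in a real cluster.

```python
def can_evacuate_one_host(host_count, per_host_capacity, used_capacity):
    """True if the remaining hosts can carry the current load with one host drained.

    Capacity units are whatever you measure the constraining resource in
    (for example GB of RAM); all hosts are assumed identical.
    """
    if host_count < 2:
        return False
    return used_capacity <= (host_count - 1) * per_host_capacity


def pick_host_mode(host_count, per_host_capacity, used_capacity):
    """Return "maintenance" when there is headroom to drain a host, else "in-place"."""
    if can_evacuate_one_host(host_count, per_host_capacity, used_capacity):
        return "maintenance"
    return "in-place"


if __name__ == "__main__":
    # 8 hosts x 512 GB, 3000 GB in use: 7 hosts give 3584 GB, so draining works.
    print(pick_host_mode(8, 512, 3000))
    # Same cluster at 3800 GB in use: one drained host no longer fits.
    print(pick_host_mode(8, 512, 3800))
```

Treat the output as a starting position per cluster, not a verdict. Sensitive production with headroom still defaults to maintenance mode for the rollback comfort, whatever the arithmetic allows.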

Upgrading Edges without dropping north-south

Edge transport nodes carry your north-south traffic and your stateful services, so this is the stage where a sloppy plan shows up as a real outage. The Coordinator upgrades Edge nodes serially within an Edge cluster. For an auto-deployed Edge the workflow is mechanical: the bundle is prepared, the node is placed into maintenance mode, the new OS is downloaded and installed, and then the OS switch happens. One node upgrades at a time while the rest of the cluster keeps forwarding.

The trap is capacity, not mechanics. While one Edge is in maintenance mode, its share of the forwarding capacity is gone. If your Tier-0 runs active-active ECMP across two Edge nodes, taking one down for upgrade halves your north-south throughput for the duration of that node’s upgrade. On a quiet Sunday that is fine. During a busy window it is a brownout. This is the same Edge-sizing math from host and transport-node prep, just felt in reverse: you sized for N+1 so that losing one Edge to an upgrade does not hurt. If you sized for exactly two and called it redundant, the upgrade is when you find out it was not.

Serial Edge upgrade and the capacity dip. One node drains at a time; the cluster keeps forwarding on whatever is left.
  • Edge-1: UPGRADING (maintenance mode, 0% of traffic)
  • Edge-2: FORWARDING (active)
  • Edge-3: FORWARDING (active)
The math: 3-node ECMP loses ~33% and 2-node ECMP loses ~50% of capacity for the duration of each node’s upgrade window.
Diagram 3b: Serial Edge upgrade. The fewer nodes in the cluster, the bigger the per-node capacity hit while one drains.

Worked example

Say your north-south peak is 18 Gbps and each large-form-factor Edge VM gives you roughly 12 Gbps of usable forwarding. Two Edges in active-active ECMP look fine at idle: 24 Gbps of capacity against an 18 Gbps peak, so 6 Gbps of headroom. Now upgrade one. You are down to a single Edge at 12 Gbps against an 18 Gbps peak, a 6 Gbps shortfall, for the length of that node’s OS switch. A three-node cluster (36 Gbps total) drops to 24 Gbps during a node upgrade, still above peak. The lesson is blunt: if upgrades cannot run inside a true off-peak window, your Edge cluster needs N+1, not N. Size the cluster so the upgrade is boring.
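
The same arithmetic is worth keeping as a reusable check so you can rerun it with your own numbers. This is a sketch under the assumptions of the worked example: identical Edges, traffic spread evenly under active-active ECMP, and per-Edge throughput that you have measured yourself. Both function names are hypothetical.

```python
def edge_capacity_during_upgrade(edge_count, per_edge_gbps, peak_gbps):
    """Return (remaining_gbps, shortfall_gbps) while one Edge is in maintenance mode.

    Assumes identical Edges sharing traffic evenly under active-active ECMP.
    """
    remaining = max(edge_count - 1, 0) * per_edge_gbps
    shortfall = max(peak_gbps - remaining, 0)
    return remaining, shortfall


def edges_needed(per_edge_gbps, peak_gbps):
    """Smallest cluster that still covers peak with one Edge out (N+1)."""
    n = -(-peak_gbps // per_edge_gbps)  # ceiling division: Edges needed for peak
    return n + 1


if __name__ == "__main__":
    print(edge_capacity_during_upgrade(2, 12, 18))  # (12, 6): 6 Gbps short
    print(edge_capacity_during_upgrade(3, 12, 18))  # (24, 0): still above peak
    print(edges_needed(12, 18))                     # 3
```

The per-Edge figure is the input most likely to be wrong, since usable throughput depends on form factor, services enabled, and packet mix. Measure it before you trust the result.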


Failure modes that actually bite

Most NSX upgrade incidents are not exotic. They cluster around a handful of repeatable causes, and almost all of them are decided before the upgrade starts. Here are the ones I see most often on engagements, with the cause and the fix.

| Symptom | Cause | Fix |
| --- | --- | --- |
| Host upgrade stalls waiting to enter maintenance mode | DRS not fully automated, or no capacity to evacuate | Set DRS to fully automated, free capacity, or switch that cluster to in-place |
| Edge upgrade hangs at OS switch | Edge resource contention or a node already unhealthy pre-upgrade | Confirm Edge health and resource reservations before starting, not after |
| Pre-checks flag unrealized intent | A config error sitting in the Policy tree before you ever upgraded | Fix realized-state errors first, do not carry them across versions |
| Validation fails on a major-version run | Component order wrong, or target BOM not interoperable | Follow the SDDC Manager order; let it validate the BOM for you |
| North-south brownout during Edge stage | Edge cluster sized N, not N+1 | Add an Edge, or move the window to true off-peak |

My take: the single best predictor of a clean NSX upgrade is not the version you are going to. It is whether the realized state was clean before you started. An upgrade does not fix a broken config, it carries it forward and sometimes surfaces it as a failure at the worst moment. Spend the hour clearing realized-state errors first.
Disclaimer: NSX upgrades are production-change procedures. Validate your target BOM against the Broadcom interoperability matrix, confirm a restorable NSX backup with a known passphrase, run the pre-upgrade checks, and rehearse in a non-production environment before you touch production. Patch levels move; re-verify the exact current NSX 9.x version in your VCF release notes before scheduling.

What I’d Do

Run NSX upgrades through SDDC Manager unless you have a specific reason not to, and treat the standalone Upgrade Coordinator as the exception. Before the window, get the realized state clean, confirm a restorable backup, and decide maintenance-mode versus in-place per cluster rather than as a blanket setting. Size your Edge cluster so losing one node to a serial upgrade does not show up in a dashboard. The integrated lifecycle in VCF 9 genuinely makes NSX upgrades less hand-driven than the NSX-T 3.x days. It does not make the data plane care any less about losing an Edge or a host. Plan for the data plane and the orchestration takes care of itself. If you want the wider VCF-level view of how this fits the full-stack upgrade, my VCF 9.1 upgrade and patching runbook walks the whole sequence. What is the first thing you will check before your next NSX window?


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
