Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


NSX 9 Backup and Restore: Config Protection, SFTP Targets and the Passphrase Trap (NSX Series, Part 19)

How NSX 9 backup and restore actually works in VCF 9: the three backup types, SFTP target design, the exact-version restore requirement, and the passphrase that quietly decides whether your backup is worth anything.

NSX Series · Part 19 of 30

TL;DR · Key Takeaways

  • NSX 9 backs up over SFTP only. There is no FTP, no NFS, no object store. Plan one hardened SFTP target and protect it like a tier-0 asset.
  • A backup is three things, not one: cluster config, per-node config, and an inventory backup that re-syncs the fabric on restore. You need all three current.
  • Restore is version and build exact. You deploy a fresh Manager OVA that matches the build in the backup filename, with the same IP or FQDN, then restore onto it.
  • The passphrase is the whole game. Lose it and every backup you hold is unusable. It belongs in your password vault, never in the runbook stored on the cluster you are trying to recover.
  • In VCF 9, NSX restore is still driven from the NSX UI, not SDDC Manager. VCF discovers the restored appliance afterward and you finish with a Sync Inventory.
Who this is for: NSX and VCF administrators, network and security architects, and anyone who owns the recovery runbook for a production NSX 9 fabric.

Prerequisites: A running NSX 9.0.x Manager cluster (standalone or inside VCF 9), an SFTP server you control, and familiarity with deploying the NSX Manager OVA.

The worst NSX outage I have walked into was not a routing loop or a bad firewall rule. It was a corrupted single-node Manager and a team that had configured backups, checked the green tick, and never once written down the passphrase. The backups existed. They were recent. They were also completely useless, because nobody could decrypt them. That is the quiet failure mode of NSX 9 backup and restore: the part that bites you is almost never the backup job itself.

NSX 9 in VCF 9 changed a few things here worth getting straight before you build the runbook. The Manager API is gone, so backup content is Policy intent plus inventory state. Backups still go over SFTP and nothing else. And restore is fussier than people expect: it wants the same software build, the same address, and the same passphrase, or it does not start. This post is the operational design I give clients for getting all three right, plus the config-protection discipline that makes a restore something you reach for on purpose rather than in a panic.

What an NSX 9 backup actually contains

People say “the NSX backup” as if it is one file. It is not. NSX 9 takes three distinct backup types on every run, and understanding the split is what lets you reason about what a restore will and will not bring back.

| Backup type | What it holds | Why it matters on restore |
| --- | --- | --- |
| Cluster backup | The desired-state config of the whole Manager cluster: Policy intent, segments, gateways, DFW, profiles, groups. | This is the configuration you get back. Lose it and you rebuild the fabric by hand. |
| Node backup | Per-appliance settings for each Manager node: certificates, appliance-local config, identity. | Lets a replacement node come up as the original rather than a blank appliance. |
| Inventory backup | A snapshot of the realized fabric: transport nodes, host TNs, Edge nodes and their state. | On restore it re-syncs Managers against what the hosts and Edges actually report, fixing drift between intent and reality. |

The inventory backup is the one people forget exists, and it is the one doing the clever work. When you restore, the cluster config tells NSX what you wanted. The inventory backup, plus the live heartbeat from the hosts and Edges, tells NSX what is actually out there. Restore reconciles the two. That is why a restored Manager can rejoin a fabric whose data plane never stopped forwarding: the hosts kept switching and firewalling on their last-known rules the whole time the Manager was down.

[Figure: Where NSX 9 backups go. Three backup types, one SFTP target, encrypted with your passphrase. A three-node NSX Manager cluster (the VIP or each node runs the job) sends the cluster backup, node backup, and inventory backup (hosts + Edges) over SFTP on port 22 to a hardened SFTP target. The target sits outside the NSX/VCF blast radius, uses an absolute directory path that is not root, holds files named with the NSX version and build, is encrypted by a passphrase kept off-box, and has retention and monitoring on the share.]
Diagram: every backup type lands on one SFTP target, encrypted by a passphrase that must live somewhere the failure cannot reach.

Designing the SFTP target

NSX 9 supports exactly one backup transport: SFTP over SSH. That constraint is the whole design conversation. There is no S3 bucket option, no NFS export, no built-in second copy. Whatever you point NSX at is your single recovery dependency, so it has to survive whatever takes out NSX.

Put it outside the blast radius

The first thing I check on an engagement is where the SFTP server actually lives. If your NSX backup target is a VM running on the same vSphere cluster, behind the same NSX overlay that the backups are meant to protect, you have built a circular dependency. The day you need the backup is the day the platform hosting it is down. Park the SFTP target on infrastructure that does not depend on the fabric you are backing up: a separate management cluster, a physical box, or an appliance on a routed VLAN that does not traverse the NSX overlay. The directory path must already exist and cannot be the root directory, and if the server is Windows you specify the path with forward slashes even though Windows uses backslashes locally.
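
The path rules above are easy to lint before you ever paste the value into NSX. This is a minimal sketch, not anything NSX ships: the function name `check_backup_directory` is hypothetical, the "starts with a slash" test assumes a Unix-style SFTP server, and it cannot confirm the directory exists because that needs a live SFTP session.

```python
def check_backup_directory(path: str) -> list[str]:
    """Return the problems found in an SFTP backup directory path (empty list = none)."""
    problems = []

    # Backslashes are rejected even when the SFTP server runs on Windows.
    if "\\" in path:
        problems.append("use forward slashes, even for a Windows SFTP server")

    normalized = path.replace("\\", "/")

    # Assumption: an absolute path starts with "/" (Unix-style server).
    if not normalized.startswith("/"):
        problems.append("path must be absolute")

    # "/" (or an empty string) means the root directory, which is not allowed.
    if normalized.strip("/") == "":
        problems.append("path cannot be the root directory")

    return problems


print(check_backup_directory("/srv/nsx-backups"))  # no problems
print(check_backup_directory("/"))                 # root is rejected
```

I would run a check like this in whatever pipeline or form captures the backup settings, so a bad path fails review rather than failing the first scheduled job.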

Authentication and the fingerprint

You authenticate with either an SSH private key or a username and password, and NSX will ask you to confirm the server’s SSH fingerprint when you configure the target. Use key-based auth where you can. The fingerprint check is not a formality: if it ever changes unexpectedly, your backup job fails closed, which is the correct behavior but also a silent way to stop having backups if nobody is watching the alarms. This is exactly why NSX 9 monitoring and alarms should include the backup-failure alarm in the set you actually route to a human.

In practice: the most common backup gap I find is not a missing job, it is a job that silently stopped weeks ago after an SFTP credential rotation or a host key change, with nobody subscribed to the failure alarm. A backup you are not monitoring is a backup you do not have.
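
One way to catch the silently stopped job is an independent freshness check that runs outside NSX. The sketch below assumes you already have the timestamps of the backups on the target (for example from a directory listing); the function name `backup_alert` and the two-hour tolerance are my assumptions, chosen to sit above an hourly schedule.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_alert(timestamps: list[datetime], now: datetime,
                 max_age: timedelta = timedelta(hours=2)) -> Optional[str]:
    """Return an alert message if the newest backup is missing or too old, else None."""
    if not timestamps:
        return "CRITICAL: no backups found on the target"

    age = now - max(timestamps)  # age of the most recent backup
    if age > max_age:
        return f"CRITICAL: newest backup is {age} old"
    return None


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical clock
print(backup_alert([now - timedelta(weeks=3)], now))     # stale job is flagged
print(backup_alert([now - timedelta(minutes=40)], now))  # healthy: None
```

This does not replace the NSX backup-failure alarm. It is a second opinion from the target's side, which still works when the alarm itself is the thing nobody is watching.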

Scheduling and the change-window habit

NSX 9 can run automatic backups on a schedule (weekly or daily at a set time) or on an interval. Inside VCF, SDDC Manager pre-configures the NSX Local Managers to back up roughly hourly by default, which is a sensible baseline RPO for inventory. But interval backups are not the discipline that saves you during change work. The habit that matters is taking a manual backup immediately before any meaningful change: a DFW publish, a Tier-0 edit, a transport zone change. Restore is point-in-time and all-or-nothing, so a backup taken sixty seconds before the change is the clean rollback point you will wish you had.

[Figure: Config protection is a habit, not a schedule. Take a manual backup before the change, not just on the hour. Timeline: 09:00 scheduled backup, 10:00 scheduled backup, 10:42 manual backup before a DFW publish, 10:45 bad change. A schedule-only rollback loses 45 minutes of work; the manual backup loses 3 minutes.]
Diagram: the hourly schedule sets your floor, but a manual pre-change backup is what gives you a clean rollback point.
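
The arithmetic behind that timeline is just the gap between the bad change and the backup you roll back to. A small sketch using the times from the diagram; the helper name `minutes_lost` is hypothetical.

```python
from datetime import datetime


def minutes_lost(last_backup: str, bad_change: str) -> int:
    """Minutes of config work lost when rolling back from bad_change to last_backup (HH:MM)."""
    fmt = "%H:%M"
    delta = datetime.strptime(bad_change, fmt) - datetime.strptime(last_backup, fmt)
    return int(delta.total_seconds() // 60)


print(minutes_lost("10:00", "10:45"))  # schedule-only rollback: 45
print(minutes_lost("10:42", "10:45"))  # manual pre-change backup: 3
```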

The restore requirements that actually stop you

Restore fails before it starts far more often than it fails midway. Almost always it is one of three preconditions: the wrong build, the wrong address, or the missing passphrase. Get these three lined up, power off any surviving nodes, and the actual restore is mostly waiting.

| Requirement | The rule | What goes wrong |
| --- | --- | --- |
| Exact build | Deploy the Manager OVA whose version and build match the backup. The filename carries both. | A close-but-newer OVA refuses the restore. You need the archived OVA, not the latest one. |
| Same identity | If the backup lists an IP, redeploy with that IP and do not publish FQDN. If it lists an FQDN, set and publish that lowercase FQDN. | An address mismatch makes the restore workflow refuse to proceed. |
| Passphrase | The passphrase set at backup time is required to decrypt and restore. | Forget it and no backup can ever be restored. There is no recovery path. |
| Cluster powered off | Any surviving cluster nodes must be powered off before you start. | A live old node fighting a restored node corrupts cluster state. |
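
The requirements in the table above translate naturally into a pre-flight check you can attach to the runbook. This is a sketch of the idea only: `RestorePlan` and `preflight` are hypothetical names, the values are typed in by the operator rather than read from NSX, and the build strings in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class RestorePlan:
    backup_build: str            # version and build read from the backup filename
    ova_build: str               # version and build of the OVA about to be deployed
    backup_identity: str         # IP or lowercase FQDN recorded in the backup
    new_node_identity: str       # IP or FQDN planned for the fresh node
    passphrase_available: bool   # someone has actually retrieved it from the vault
    old_nodes_powered_off: bool  # no surviving Manager node is still running


def preflight(plan: RestorePlan) -> list[str]:
    """Return the blockers that would stop a restore; an empty list means proceed."""
    blockers = []
    if plan.ova_build != plan.backup_build:
        blockers.append("OVA build does not match the backup build")
    # Exact, case-sensitive comparison: an FQDN must match in lowercase.
    if plan.new_node_identity != plan.backup_identity:
        blockers.append("node IP/FQDN does not match the backup")
    if not plan.passphrase_available:
        blockers.append("passphrase is not available")
    if not plan.old_nodes_powered_off:
        blockers.append("surviving Manager nodes are still powered on")
    return blockers


plan = RestorePlan("9.0.0.0.1111", "9.0.1.0.2222",  # made-up build strings
                   "10.0.0.10", "10.0.0.10", True, True)
print(preflight(plan))  # flags the build mismatch only
```

The point of writing it down this way is that every blocker is checked before the OVA is deployed, not discovered one at a time by the restore wizard.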

My take

Treat the matching OVA as part of the backup, not a thing you fetch later. When you upgrade NSX, archive the exact installer build that matches your current backups in the same protected location as the backups themselves. The day you need to restore is not the day to discover that Broadcom’s portal has moved the old build behind a newer release.

The restore workflow, end to end

Here is the sequence I follow for a Manager cluster recovery in VCF 9. The data plane keeps forwarding throughout, because hosts and Edges run on their last realized config while the Manager is gone. You are recovering the management and control plane, not the traffic.

[Figure: NSX 9 restore pipeline, driven from the NSX UI; VCF discovers the result afterward. (1) Verify backup and passphrase; confirm the build in the filename. (2) Power off old nodes to avoid split-brain. (3) Deploy the matching OVA with the same IP/FQDN, DNS, and NTP. (4) Point at SFTP, pick a timestamp, start the restore. (5) Re-add the other nodes to rebuild the three-node cluster. (6) Sync Inventory in VCF; verify CM and Edges are green.]
Diagram: restore one node from backup, then re-add the rest. NSX rebuilds the cluster around the restored config.

Step by step

First, verify the backup exists on the SFTP target and that you hold the passphrase. Confirm the version and build embedded in the backup filename, because that is the OVA you need. Power off any Manager nodes that are still alive. Deploy a single fresh Manager appliance from the OVA that matches that build, using identical IP, FQDN, DNS, NTP, gateway and admin credentials to the original. Once it boots, log in, go to System and the Backup and Restore area under Lifecycle Management, configure the SFTP repository (server, directory path, port, protocol, credentials), let NSX enumerate the available backups, select the right timestamp, and start the restore.

The Manager restarts during the restore and the progress UI is, frankly, ugly. You will pass through stages like “Restoring Database” and “Updating NSX Manager fabric module,” and you may hit a transient page showing a raw policy.common.error.message while the appliance is still coming up. That ugly intermediate error is part of the normal flow on a single-node restore; refresh and wait rather than aborting. The workflow restores one node first; once that node is healthy, it prompts you to add the remaining nodes and re-form the three-node cluster.

# Confirm Manager cluster health before and after restore (NSX CLI)
get cluster status
get cluster status verbose

# Check the management/control services on the node
get service install-upgrade
get managers

# Review cluster membership and node details after restore;
# confirm the compute manager (vCenter) registration separately in the UI
get cluster config

# Policy API: list the active backup config (Policy API only in NSX 9)
# -u admin prompts for the password; -k skips certificate verification, so keep it to the lab
curl -k -u admin -X GET \
  https://nsx-mgr.example.local/policy/api/v1/cluster/backups/config
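
If you would rather script that last call than run curl by hand, the same request can be built with the Python standard library. This is a sketch under assumptions: the helper name `backup_config_request` is mine, the hostname and credentials are placeholders, and the snippet only builds the request rather than sending it. I have deliberately not parsed the response, because the field names should come from the API reference for your build.

```python
import base64
import urllib.request


def backup_config_request(manager: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET for the backup config endpoint used in the curl example."""
    url = f"https://{manager}/policy/api/v1/cluster/backups/config"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",  # same basic auth curl sends with -u
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


req = backup_config_request("nsx-mgr.example.local", "admin", "placeholder-password")
print(req.get_method(), req.full_url)
```

To send it you would pass `req` to `urllib.request.urlopen` together with an `ssl` context that trusts the Manager certificate, which is the safer equivalent of curl's `-k`.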

Why VCF does not drive this

A fair question in a VCF 9 environment: why am I doing this from the NSX UI instead of SDDC Manager? Because in VCF 9.0, NSX Manager restore is still performed directly from NSX. VCF configures the backup target and discovers the restored appliance after the fact, but it does not orchestrate the NSX restore itself. After the restore completes, you go to VCF Operations, confirm the VCF integration shows the domain green and collecting, and run a Sync Inventory on the instance so VCF and NSX agree on reality. That last sync is the step people skip, and then wonder why VCF Operations shows stale NSX state for a day. The deployment side of this, including how the cluster is built in the first place, is covered in NSX 9 Manager deployment and cluster bring-up.

Worked example

Single-node management-domain Manager, hourly automatic backups, 09:00 and 10:00 jobs succeeded. A storage change corrupts the appliance at 10:50. You hold the passphrase and the matching 9.0.x OVA build. Recovery is: power off the dead node, deploy the matching OVA with the same IP at roughly 11:00, restore the 10:00 backup, ride out the database and fabric-sync stages, and the cluster is healthy again inside the hour. Your data loss is the 50 minutes of config changes after 10:00 (here, none). Now flip one variable: no passphrase. Same hardware, same backups, same OVA, and the recovery time is effectively infinite, because the backup will not decrypt. That single secret is the difference between a one-hour restore and a full rebuild.
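
The worked example reduces to a few lines of logic: pick the newest backup taken before the failure, and treat a missing passphrase as having no backup at all. A sketch with the times from the example; `pick_restore_point` is a hypothetical helper and the calendar date is arbitrary.

```python
from datetime import datetime
from typing import Optional


def pick_restore_point(backups: list[datetime], failure: datetime,
                       have_passphrase: bool) -> Optional[datetime]:
    """Newest backup taken at or before the failure, or None if nothing is restorable."""
    if not have_passphrase:
        return None  # encrypted backups are unusable without the passphrase
    usable = [b for b in backups if b <= failure]
    return max(usable) if usable else None


day = datetime(2025, 1, 1)  # arbitrary date for illustration
backups = [day.replace(hour=9), day.replace(hour=10)]
failure = day.replace(hour=10, minute=50)

point = pick_restore_point(backups, failure, have_passphrase=True)
print(point.strftime("%H:%M"))                          # 10:00
print(int((failure - point).total_seconds() // 60))     # 50 minutes of exposure
print(pick_restore_point(backups, failure, have_passphrase=False))  # None
```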

Federation, and the things backup does not cover

If you run NSX Federation, the Global Manager has its own backup, separate from each Local Manager, and you protect both. A Global Manager backup carries the stretched, cross-site intent; the Local Manager backups carry the site-local realized config. Restoring one without thinking about the other leaves you with a coherent half of the picture. In a multi-site design, write the restore order into the runbook rather than improvising it during an outage.

It is also worth being clear about what NSX backup is not. It is not object-level. You cannot restore a single deleted segment or one firewall section from a backup while leaving everything else untouched; restore rolls the entire Manager state back to a point in time. For surgical recovery of a single object, your real tool is change discipline and the broader NSX operational practices around Policy intent, ideally version-controlled if you drive NSX through the Policy API and Terraform. NSX backup is a platform-recovery tool, not an undo button for a single bad rule.

In practice: a backup you have never restored is a hypothesis, not a recovery plan. Schedule a restore rehearsal into a lab or an isolated test instance at least once a release cycle. The first time you run the workflow should not be the day production is down.

Disclaimer: restore is a production-impacting operation. Validate the VCF 9 BOM and the exact NSX build before you begin, archive the matching OVA, confirm the passphrase is retrievable, back up current state where possible, and rehearse the workflow in a lab before you ever need it in anger. Treat any restore as a change with full pre-checks, not an emergency improvisation.

What I’d Do

Stand up one hardened SFTP target that does not depend on the fabric it backs up. Turn on automatic backups, route the backup-failure alarm to a human, and add a pre-change manual backup to your change template. Store the passphrase in your password vault and the matching OVA build alongside your backups. Then, once a release cycle, actually restore into a lab and time it. Do those five things and NSX recovery stops being the thing you fear and becomes a routine you have already proven. Skip the passphrase custody and the version archive, and everything else you did was theatre.

Next in the series: Part 20 covers NSX 9 upgrades and lifecycle, where backup stops being a recovery tool and becomes a mandatory pre-flight step folded into the VCF integrated upgrade.


References

VMware NSX 9.0 Administration Guide: Backing Up and Restoring the NSX Manager
NSX 9.0 Administration Guide: Configure Backups
VCF 9 Fleet Management: Configure SFTP Backup Target in VCF Operations


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
