- VCF Automation is protected by file-based backups configured and run through VCF Operations Fleet Management, written to a fleet-level SFTP target.
- Back up VCF Automation and the Identity Broker together. A restored Automation instance nobody can authenticate to is not a recovery.
- There are three different recoveries, not one: roll back a config mistake, restore a corrupted component from backup, or fail the whole instance over for site loss. Each uses a different tool.
- A pre-upgrade snapshot is a rollback for the upgrade window, not a backup and not DR. Take it, then delete it after, because lingering snapshots degrade performance.
- Upgrade order is fixed: the lifecycle layer first, then VCF Automation and the other management components. In VCF 9.1 the standalone Fleet Management appliance becomes the Fleet Lifecycle and SDDC Lifecycle services.
Who this is for: Cloud and platform admins who own the VCF Automation instance and have to answer for its recovery point, recovery time, and upgrade safety.
Prerequisites: A VCF Automation 9.x instance on the management domain, VCF Operations Fleet Management access with an Administrator role, and a reachable SFTP backup target. The deployment context from the Fleet Management Part helps.
An untested backup is a hope, not a recovery plan. I have lost count of the environments where a backup schedule was green for a year and the first real restore attempt failed because the SFTP target had filled up, or because someone backed up VCF Automation and never the Identity Broker, and ended up with a perfectly restored instance that no one could log into. Backup, disaster recovery, and upgrade are the three operations where VCF Automation either proves it is production-grade or quietly is not, and they are the easiest three to defer until the day you need them.
This Part is the operations runbook for the instance itself, not the workloads it provisions. I am writing against VCF 9.1, the current release, and I will flag the one architecture change from 9.0 that rewrites every upgrade runbook. VCF Automation is deployed and lifecycle-managed through Fleet Management, and that same layer owns its backup, restore and upgrade.
What you are actually protecting
VCF Automation is not a single appliance. It is a clustered set of virtual machines on the management domain, and it depends on the VCF Identity Broker for authentication. Protect the cluster and forget the Identity Broker and you have protected a locked building while losing the keys. The configuration inside VCF Automation (your organizations, projects, templates, catalog and policies) is the part with real business value, and it is the slowest to rebuild by hand. Backups capture all of it; the question is whether you can put it back.
Backup: file-based, through Fleet Management
You configure a backup schedule per component in Fleet Management, covering VCF Operations Fleet Management itself, VCF Automation, and the Identity Broker. On-demand backups are available alongside the schedule for the moment before a risky change. Backups are file-based and land on a fleet-level SFTP server you configure once, with a predictable path layout so you can find a specific backup later. Restore prerequisites are easy to overlook: the Automation cluster VMs must be running, the SFTP server must be reachable, and you need at least one recent backup that actually completed.
# Backup path layout on the SFTP target
vcf/backups/<cluster-name>/<version>/<component-name>/<timestamp>/
vcf/backups/<cluster-name>/<version>/<component-name>/<timestamp>/<backup-file.tgz>
# Restore prerequisites (all must be true)
# 1. VCF Automation cluster VMs are running
# 2. The fleet-level SFTP server is reachable
# 3. At least one recent, completed backup exists
# Where: VCF Operations -> Fleet Management -> Lifecycle
# -> VCF Management -> Components -> VCF Automation
# -> Backup & Restore
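The path layout above is regular enough to parse, which helps when a runbook or monitoring script needs to find a specific backup. The following is a minimal sketch, not part of any VMware tooling: the function and type names are hypothetical, and the timestamp is kept as an opaque string because its exact format is not stated here.

```python
from typing import NamedTuple


class BackupLocation(NamedTuple):
    cluster_name: str
    version: str
    component_name: str
    timestamp: str  # opaque string; the on-disk timestamp format is not assumed


def parse_backup_path(path: str) -> BackupLocation:
    """Split vcf/backups/<cluster>/<version>/<component>/<timestamp>/[file]."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 6 or parts[0] != "vcf" or parts[1] != "backups":
        raise ValueError(f"not a recognised backup path: {path!r}")
    return BackupLocation(*parts[2:6])
```

The same function accepts both the directory form and the form ending in the backup file, since it only reads the four segments after `vcf/backups/`.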
Back up VCF Automation on its own and you can restore an instance that holds all your orgs and catalog and refuses every login, because authentication lives in the Identity Broker. Schedule both components, keep their backups time-aligned, and restore them as a set. The first time you discover the dependency should not be during an outage.
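One way to make "time-aligned" checkable is to look for the newest pair of backups, one per component, that completed within a tolerance of each other. This is a sketch under assumptions: the function name and the one-hour default tolerance are illustrative choices, and the inputs are completion times you have already collected from the SFTP target or the backup history.

```python
from datetime import datetime, timedelta


def latest_aligned_pair(automation, identity_broker, tolerance=timedelta(hours=1)):
    """Return the newest (automation, identity_broker) backup times that fall
    within `tolerance` of each other, or None if no such pair exists."""
    best = None
    for a in automation:
        for b in identity_broker:
            if abs(a - b) <= tolerance:
                candidate = (a, b)
                # Prefer the pair whose older member is the most recent.
                if best is None or min(candidate) > min(best):
                    best = candidate
    return best
```

If the function returns `None`, or a pair much older than the latest VCF Automation backup, the two schedules have drifted apart and the set you would restore is older than the dashboard suggests.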
Three recoveries, three tools
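The word recovery hides three very different events, and conflating them is how teams reach for the wrong tool under pressure. A bad config change is not a disaster; a corrupted component is not a site loss. Match the event to the tool before you need to. The table splits restore-from-backup into two rows, because bringing back a single failed component and bringing back a lost instance are different procedures.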
| Event | Tool | What it brings back |
|---|---|---|
| Bad config change | File-based restore, or redeploy config as code | Orgs, projects, catalog, policies |
| Corrupted / failed component | Fleet Management component restore | The component and its data |
| Full instance loss | Instance recovery from file-based backup | The whole Automation instance |
| Site loss | Site protection / DR failover | The management apps at the recovery site |
For the most common event, a change you regret, the fastest recovery is often not a restore at all. If you manage organizations, projects and catalog with the vmware/vcfa Terraform provider, reapplying a known-good state can be quicker and cleaner than a full file-based restore, and it gives you a versioned history of what the config should be. Treat config-as-code as recovery insurance that complements backups, not a replacement for them, because it does not capture component-internal data.
Disaster recovery for the management apps
Site loss is a different problem from a corrupted component, and it has its own answer. VCF Automation is one of the SDDC management applications covered by site protection and disaster recovery for VCF, which orchestrates a planned migration or failover of the management plane to a recovery site. This is not the same as a file-based restore into the same instance; it is moving the management applications, VCF Automation included, to surviving infrastructure. If your organization has a real site-loss requirement, the recovery point and recovery time for VCF Automation are part of that DR design, not an afterthought you bolt on with backups alone.
Upgrade: order, snapshots and the 9.1 shift
Upgrades are orchestrated through Fleet Management lifecycle, and the order is not negotiable. You upgrade the lifecycle layer first, because it coordinates everything else, and only then the other management components: VCF Operations, VCF Automation, the Identity Broker, VCF Operations for logs and VCF Operations for networks, in no strict order among themselves. Upgrade VCF Automation before the lifecycle layer that drives it and you are working against the tool that is supposed to manage the change.
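The ordering rule is simple enough to lint in a change plan before anyone opens the console. The sketch below assumes a plan is just an ordered list of component labels; the labels and the function name are hypothetical and do not correspond to any Fleet Management API.

```python
def validate_upgrade_plan(plan, lifecycle="lifecycle"):
    """Return a list of problems with an ordered upgrade plan.

    The only ordering rule enforced is that the lifecycle layer comes first;
    the remaining components may appear in any order.
    """
    problems = []
    if lifecycle not in plan:
        problems.append(f"{lifecycle!r} is missing from the plan")
    elif plan.index(lifecycle) != 0:
        early = plan[:plan.index(lifecycle)]
        problems.append(f"{', '.join(early)} scheduled before {lifecycle!r}")
    if len(set(plan)) != len(plan):
        problems.append("a component appears more than once")
    return problems
```

An empty list means the plan respects the rule. Anything else is worth resolving before the maintenance window starts.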
Snapshot is a rollback, not a backup
Take a snapshot of the appliance before you patch, so you have a rollback point if the upgrade misbehaves. Then delete it once the upgrade is confirmed, because lingering snapshots degrade performance and quietly become their own incident. A snapshot is a short-lived safety net for the upgrade window. It is not a backup, it is not DR, and it does not belong in your retention story.
My take on the 9.1 change: VCF 9.1 replaces the standalone Fleet Management appliance from 9.0 with two services, Fleet Lifecycle and SDDC Lifecycle, running natively in VCF Management Services. If your runbooks assume a separate appliance to snapshot and upgrade first, rewrite them. The principle holds, the lifecycle layer still goes first, but the thing you are upgrading is now a set of services, not an OVA.
Suppose the business asks for a 24-hour recovery point on the platform config. A daily backup of VCF Automation and the Identity Broker, time-aligned, meets it on paper. To make it real: keep at least 7 daily backups so a corruption noticed on day three is still recoverable, verify the SFTP target has headroom for that retention, and run a restore drill into an isolated environment once a quarter.
The drill is the part that counts. A schedule that has never been restored is a 24-hour recovery point in theory and an unknown in practice. One tested restore per quarter converts hope into a number you can put in front of an auditor.
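The paper side of that example can be checked mechanically, which leaves the drill to prove the rest. The following sketch assumes you can list the completion times of successful backups for one component; the function name and finding messages are illustrative. It checks three things: enough copies retained, a fresh enough latest backup, and no gap between consecutive backups longer than the recovery point.

```python
from datetime import datetime, timedelta


def check_recovery_point(completed, now, rpo=timedelta(hours=24), min_copies=7):
    """Return findings for a list of completed-backup times; empty means OK."""
    completed = sorted(completed)
    findings = []
    if len(completed) < min_copies:
        findings.append(f"only {len(completed)} backups retained, want {min_copies}")
    if not completed or now - completed[-1] > rpo:
        findings.append("latest backup is older than the recovery point")
    for earlier, later in zip(completed, completed[1:]):
        if later - earlier > rpo:
            findings.append(f"gap of {later - earlier} after {earlier.isoformat()}")
    return findings
```

A clean result only says the schedule meets the target on paper. It says nothing about whether those files restore.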
Disclaimer: Restore and upgrade are production-changing operations on the management plane. Run restore drills in an isolated environment, never test a restore over a healthy production instance, take and later remove a pre-upgrade snapshot, and follow the version-specific upgrade sequence for your exact release before you begin.
The SFTP target nobody watches
The backup target is the quiet failure point. It is configured once, it works, and then nobody looks at it again until a restore needs a backup that was never written. The two ways it betrays you are predictable. It fills up, and new backups silently fail or rotate out the older ones you were counting on for a longer retention window. Or its credentials or host key change during some unrelated maintenance, and the scheduled job starts failing while the dashboard still shows the last good run from before the change.
Treat the target as a monitored dependency, not a set-and-forget setting. Alert on a missing or failed backup the same day it happens, not at quarter end. Size the storage for your full retention with headroom, because a 7-day retention that only fits 4 days of backups is a 4-day retention you do not know about. And confirm the target lives off the appliance and ideally off the same failure domain, since a backup on the thing you are trying to recover is not a backup at all.
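The sizing arithmetic is worth writing down, because the gap between configured and effective retention is easy to miss. This is a back-of-the-envelope sketch with assumed inputs: a roughly constant backup size per run, a 25 percent headroom default chosen for illustration, and a single component (sum the sizes if several components share the target).

```python
import math


def required_capacity_gb(backup_size_gb, retention_days, backups_per_day=1, headroom=0.25):
    """Storage needed to hold the full retention window plus headroom."""
    return backup_size_gb * retention_days * backups_per_day * (1 + headroom)


def effective_retention_days(capacity_gb, backup_size_gb, backups_per_day=1):
    """Whole days of backups that actually fit on the target."""
    return math.floor(capacity_gb / (backup_size_gb * backups_per_day))
```

With a hypothetical 50 GB daily backup, a 200 GB target holds `effective_retention_days(200, 50)` = 4 days, while a 7-day retention with headroom needs `required_capacity_gb(50, 7)` = 437.5 GB.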
Write the recovery runbook before you need it
Backups are a capability; recovery is a procedure, and the procedure is what fails under pressure. Write the runbook while everything is calm, and make it specific to your environment, not a paraphrase of the docs. It should name the exact backup path layout for your instance, the order to restore VCF Automation and the Identity Broker, who has the Administrator role in Fleet Management to run it, and how you verify success, which is a real user signing in and seeing their catalog, not a green task in the console.
The quarterly restore drill is what keeps that runbook honest. Each drill surfaces the small things that turn a 30-minute recovery into a half-day outage: an expired credential, a missing prerequisite, a step that assumed a version you have since upgraded past. Fix them during the drill, update the runbook, and the next real incident becomes a procedure you have already rehearsed rather than a problem you are solving for the first time with people watching.
The Bottom Line
Schedule file-based backups of VCF Automation and the Identity Broker together, send them to an SFTP target off the appliance, and prove the whole thing with a restore drill every quarter, because the only backup that counts is the one you have restored. Keep config-as-code alongside it so the common case, a change you regret, has a fast and versioned path back. For upgrades, take a snapshot, upgrade the lifecycle layer, then the other components, and delete the snapshot once the upgrade is confirmed. I would not rely on snapshots as backups, I would not back up Automation without the Identity Broker, and I would not write a 9.x upgrade runbook without accounting for the 9.1 move from a Fleet Management appliance to the Fleet Lifecycle and SDDC Lifecycle services. Validate the SFTP target capacity and one real restore before you call any of this done.
Book one restore drill into an isolated environment this quarter. If your orgs, catalog and logins all come back and people can sign in, your recovery plan is real. If anything is missing, you found it on your schedule instead of during an outage. The final Part wraps the series with troubleshooting, hardening and a verdict.
References
- Broadcom TechDocs — Backup and restore of VMware Cloud Foundation
- Broadcom TechDocs — Restore VCF Automation from a file-based backup
- Broadcom TechDocs — Planned recovery of VCF Automation (site protection / DR)
- Broadcom TechDocs — Upgrade the VCF management components in a fleet



