Dr. Pranay Jha


Backup, DR and Upgrade of the VCF Automation Instance (VCF Automation 9 Series, Part 29)

An untested backup is a hope, not a recovery plan. Here is how to protect, recover and upgrade the VCF Automation instance in VCF 9: file-based backups through Fleet Management, the Identity Broker trap, the three recovery scenarios, and the upgrade order that actually matters.

VCF Automation 9 Series · Part 29 of 41
TL;DR · Key Takeaways
  • VCF Automation is protected by file-based backups configured and run through VCF Operations Fleet Management, written to a fleet-level SFTP target.
  • Back up VCF Automation and the Identity Broker together. A restored Automation instance nobody can authenticate to is not a recovery.
  • There are three different recoveries, not one: roll back a config mistake, restore a corrupted component from backup, or fail the whole instance over for site loss. Each uses a different tool.
  • A pre-upgrade snapshot is a rollback for the upgrade window, not a backup and not DR. Take it, then delete it once the upgrade is confirmed, because lingering snapshots degrade performance.
  • Upgrade order is fixed: the lifecycle layer first, then VCF Automation and the other management components. In VCF 9.1 the standalone Fleet Management appliance becomes the Fleet Lifecycle and SDDC Lifecycle services.

Who this is for: Cloud and platform admins who own the VCF Automation instance and have to answer for its recovery point, recovery time, and upgrade safety.

Prerequisites: A VCF Automation 9.x instance on the management domain, VCF Operations Fleet Management access with an Administrator role, and a reachable SFTP backup target. The deployment context from the Fleet Management Part helps.

An untested backup is a hope, not a recovery plan. I have lost count of the environments where a backup schedule was green for a year and the first real restore attempt failed because the SFTP target had filled up, or because someone backed up VCF Automation and never the Identity Broker, and ended up with a perfectly restored instance that no one could log into. Backup, disaster recovery, and upgrade are the three operations where VCF Automation either proves it is production-grade or quietly is not, and they are the easiest three to defer until the day you need them.

This Part is the operations runbook for the instance itself, not the workloads it provisions. I am writing against VCF 9.1, the current release, and I will flag the one architecture change from 9.0 that rewrites every upgrade runbook. VCF Automation is deployed and lifecycle-managed through Fleet Management, and that same layer owns its backup, restore and upgrade.

What you are actually protecting

VCF Automation is not a single appliance. It is a clustered set of virtual machines on the management domain, and it depends on the VCF Identity Broker for authentication. Protect the cluster and forget the Identity Broker and you have protected a locked building while losing the keys. The configuration inside VCF Automation, your organizations, projects, templates, catalog and policies, is the part with real business value, and it is the slowest to rebuild by hand. Backups capture all of it; the question is whether you can put it back.

[Figure: What a backup has to capture. Fleet Management schedules and runs file-based backups of three things (the VCF Automation cluster, the Identity Broker, and the config: orgs, catalog, policies) and writes them to an SFTP backup target off the appliance.]
Three things to protect, one layer that backs them up, one target that must be off the appliance and tested.

Backup: file-based, through Fleet Management

You configure a backup schedule per component in Fleet Management, covering VCF Operations Fleet Management itself, VCF Automation, and the Identity Broker. On-demand backups are available alongside the schedule for the moment before a risky change. Backups are file-based and land on a fleet-level SFTP server you configure once, with a predictable path layout so you can find a specific backup later. Restore prerequisites are easy to overlook: the Automation cluster VMs must be running, the SFTP server must be reachable, and you need at least one recent backup that actually completed.

# Backup path layout on the SFTP target
vcf/backups/<cluster-name>/<version>/<component-name>/<timestamp>/
vcf/backups/<cluster-name>/<version>/<component-name>/<timestamp>/<backup-file.tgz>

# Restore prerequisites (all must be true)
#   1. VCF Automation cluster VMs are running
#   2. The fleet-level SFTP server is reachable
#   3. At least one recent, completed backup exists

# Where: VCF Operations -> Fleet Management -> Lifecycle
#        -> VCF Management -> Components -> VCF Automation
#        -> Backup & Restore
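If you script anything around the target, such as a retention report or a drill checklist, the layout above is regular enough to parse. Here is a minimal Python sketch, assuming paths are relative to the SFTP root and treating the timestamp as an opaque string, because its exact format is not something I would hard-code. `BackupLocation` and `parse_backup_path` are illustrative names of my own, not part of any VCF tooling.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class BackupLocation:
    cluster_name: str
    version: str
    component_name: str
    timestamp: str  # opaque string: the timestamp format is deliberately not assumed
    backup_file: Optional[str] = None  # None when the path points at the directory


def parse_backup_path(path: str) -> BackupLocation:
    """Split a path that follows the vcf/backups/... layout into its fields."""
    # Strip leading/trailing slashes so "/vcf/backups/.../" parses the same way.
    parts = PurePosixPath(path.strip("/")).parts
    if len(parts) not in (6, 7) or parts[:2] != ("vcf", "backups"):
        raise ValueError(f"not a recognised backup path: {path!r}")
    cluster, version, component, timestamp = parts[2:6]
    backup_file = parts[6] if len(parts) == 7 else None
    return BackupLocation(cluster, version, component, timestamp, backup_file)
```

Calling `parse_backup_path("vcf/backups/mgmt-cluster/9.1.0/vcf-automation/20250101T010000/backup.tgz")` returns a `BackupLocation` whose `component_name` is `"vcf-automation"`. The cluster, version, component and timestamp values in that call are made up for illustration; use whatever your own target actually shows.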

Gotcha · The Identity Broker trap

Back up VCF Automation on its own and you can restore an instance that holds all your orgs and catalog and refuses every login, because authentication lives in the Identity Broker. Schedule both components, keep their backups time-aligned, and restore them as a set. The first time you discover the dependency should not be during an outage.
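"Time-aligned" is worth turning into a check you can run against the backup history. The sketch below is one way to do it in Python, assuming you have already collected the completion times of each component's backups as `datetime` values. The 30-minute tolerance is my own placeholder, not a documented threshold, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def aligned_pairs(automation_times, broker_times, tolerance=timedelta(minutes=30)):
    """Pair each Automation backup with the closest Identity Broker backup.

    Automation backups with no Identity Broker backup inside the tolerance
    are left out, because they cannot be restored as a set.
    """
    pairs = []
    for a in sorted(automation_times):
        candidates = [b for b in broker_times if abs(b - a) <= tolerance]
        if candidates:
            pairs.append((a, min(candidates, key=lambda b: abs(b - a))))
    return pairs


def latest_restorable_set(automation_times, broker_times, tolerance=timedelta(minutes=30)):
    """Return the newest (automation, broker) pair, or None if no pair lines up."""
    pairs = aligned_pairs(automation_times, broker_times, tolerance)
    return pairs[-1] if pairs else None
```

If `latest_restorable_set` returns a pair that is older than your newest Automation backup, the Identity Broker schedule has drifted or stopped, and your real recovery point is the older pair.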

Three recoveries, three tools

The word recovery hides three very different events, and conflating them is how teams reach for the wrong tool under pressure. A bad config change is not a disaster; a corrupted component is not a site loss. Match the event to the tool before you need to.

| Event | Tool | What it brings back |
|---|---|---|
| Bad config change | File-based restore, or redeploy config as code | Orgs, projects, catalog, policies |
| Corrupted / failed component | Fleet Management component restore | The component and its data |
| Full instance loss | Instance recovery from file-based backup | The whole Automation instance |
| Site loss | Site protection / DR failover | The management apps at the recovery site |
[Figure: Which recovery for which event. Match the blast radius to the tool, and scope what went wrong first. Bad change: restore backup or redeploy config-as-code. Component / instance loss: file-based restore via Fleet Management. Site loss: fail over with site protection / DR.]
Config-as-code from the Terraform Part is a fast path back for the most common event: a change you regret.
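The event-to-tool mapping is small enough to keep in the runbook as data, so nobody has to reason about it under pressure. This is a minimal Python sketch of that idea, with the tool names copied from the table above. The `classify` helper and its three boolean inputs are my own simplification of "scope it first", not a VCF feature.

```python
from enum import Enum


class Event(Enum):
    BAD_CONFIG_CHANGE = "bad config change"
    COMPONENT_FAILURE = "corrupted / failed component"
    FULL_INSTANCE_LOSS = "full instance loss"
    SITE_LOSS = "site loss"


RECOVERY_TOOL = {
    Event.BAD_CONFIG_CHANGE: "File-based restore, or redeploy config as code",
    Event.COMPONENT_FAILURE: "Fleet Management component restore",
    Event.FULL_INSTANCE_LOSS: "Instance recovery from file-based backup",
    Event.SITE_LOSS: "Site protection / DR failover",
}


def classify(site_lost: bool, instance_lost: bool, component_failed: bool) -> Event:
    """Pick the event with the largest blast radius that applies."""
    if site_lost:
        return Event.SITE_LOSS
    if instance_lost:
        return Event.FULL_INSTANCE_LOSS
    if component_failed:
        return Event.COMPONENT_FAILURE
    # Nothing is down, so the problem is a change someone regrets.
    return Event.BAD_CONFIG_CHANGE


def recovery_tool(event: Event) -> str:
    """Look up the tool that matches the event."""
    return RECOVERY_TOOL[event]
```

The ordering inside `classify` is the point: check the widest failure first, so a site loss is never treated as a component restore just because a component also looks unhealthy.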

For the most common event, a change you regret, the fastest recovery is often not a restore at all. If you manage organizations, projects and catalog with the vmware/vcfa Terraform provider, reapplying a known-good state can be quicker and cleaner than a full file-based restore, and it gives you a versioned history of what the config should be. Treat config-as-code as recovery insurance that complements backups, not a replacement for them, because it does not capture component-internal data.

Disaster recovery for the management apps

Site loss is a different problem from a corrupted component, and it has its own answer. VCF Automation is one of the SDDC management applications covered by site protection and disaster recovery for VCF, which orchestrates a planned migration or failover of the management plane to a recovery site. This is not the same as a file-based restore into the same instance; it is moving the management applications, VCF Automation included, to surviving infrastructure. If your organization has a real site-loss requirement, the recovery point and recovery time for VCF Automation are part of that DR design, not an afterthought you bolt on with backups alone.

Upgrade: order, snapshots and the 9.1 shift

Upgrades are orchestrated through Fleet Management lifecycle, and the order is not negotiable. You upgrade the lifecycle layer first, because it coordinates everything else, and only then the other management components: VCF Operations, VCF Automation, the Identity Broker, and the Operations for Logs and Networks pieces, in no strict order among themselves. Upgrade VCF Automation before the lifecycle layer that drives it and you are working against the tool that is supposed to manage the change.

Snapshot is a rollback, not a backup

Take a snapshot of the appliance before you patch, so you have a rollback point if the upgrade misbehaves. Then delete it once the upgrade is confirmed, because lingering snapshots degrade performance and quietly become their own incident. A snapshot is a short-lived safety net for the upgrade window. It is not a backup, it is not DR, and it does not belong in your retention story.

[Figure: The upgrade order that matters. Lifecycle layer first, then the rest, with a snapshot around it. Take a snapshot as the rollback point; step 1, upgrade the lifecycle layer (Fleet / SDDC); step 2, upgrade Operations, Automation, Identity Broker and Logs; then delete the snapshot after confirming.]
Snapshot, upgrade the lifecycle layer, then the components, then remove the snapshot. Do not leave it behind.
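A written upgrade plan can be linted against that order before the change window opens. The sketch below is a hedged illustration in Python: the step labels are strings I made up for the example, and the check only encodes the rules stated in this Part (snapshot first, lifecycle layer before any component, snapshot deleted last). It does not talk to Fleet Management.

```python
# Hypothetical step labels for a runbook; rename them to match your own plan.
SNAPSHOT = "take snapshot"
LIFECYCLE = "lifecycle layer"
DELETE_SNAPSHOT = "delete snapshot"
COMPONENTS = {
    "VCF Operations",
    "VCF Automation",
    "Identity Broker",
    "Operations for Logs",
    "Operations for Networks",
}


def validate_upgrade_plan(steps):
    """Return a list of problems; an empty list means the plan follows the order."""
    problems = []
    if not steps or steps[0] != SNAPSHOT:
        problems.append("plan should start by taking a snapshot")
    if not steps or steps[-1] != DELETE_SNAPSHOT:
        problems.append("plan should end by deleting the snapshot")
    if LIFECYCLE not in steps:
        problems.append("lifecycle layer upgrade is missing")
    else:
        lifecycle_index = steps.index(LIFECYCLE)
        for index, step in enumerate(steps):
            # Components may come in any order, but never before the lifecycle layer.
            if step in COMPONENTS and index < lifecycle_index:
                problems.append(f"{step} is upgraded before the lifecycle layer")
    return problems
```

A plan of `["take snapshot", "VCF Automation", "lifecycle layer", "delete snapshot"]` comes back with one problem naming VCF Automation, which is exactly the mistake this section warns about.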

My take on the 9.1 change: VCF 9.1 replaces the standalone Fleet Management appliance from 9.0 with two services, Fleet Lifecycle and SDDC Lifecycle, running natively in VCF Management Services. If your runbooks assume a separate appliance to snapshot and upgrade first, rewrite them. The principle holds, the lifecycle layer still goes first, but the thing you are upgrading is now a set of services, not an OVA.

Worked example · A recovery point you can defend

Suppose the business asks for a 24-hour recovery point on the platform config. A daily backup of VCF Automation and the Identity Broker, time-aligned, meets it on paper. To make it real: keep at least 7 daily backups so a corruption noticed on day three is still recoverable, verify the SFTP target has headroom for that retention, and run a restore drill into an isolated environment once a quarter.

The drill is the part that counts. A schedule that has never been restored is a 24-hour recovery point in theory and an unknown in practice. One tested restore per quarter converts hope into a number you can put in front of an auditor.
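The paper side of that worked example can be checked mechanically. Here is a small Python sketch, assuming you can list the completion times of successful backups; the function name is hypothetical, and the defaults mirror the 24-hour recovery point and 7 daily copies used above. It cannot tell you whether a restore works, which is still the drill's job.

```python
from datetime import datetime, timedelta


def recovery_point_problems(completed, now, rpo=timedelta(hours=24), min_copies=7):
    """Return reasons the backup history misses the target; empty means it passes."""
    times = sorted(completed)
    if not times:
        return ["no completed backups"]
    problems = []
    if len(times) < min_copies:
        problems.append(f"only {len(times)} backups kept, need {min_copies}")
    # Every gap, including the one from the newest backup to now, must fit the RPO.
    checkpoints = times + [now]
    worst_gap = max(later - earlier for earlier, later in zip(checkpoints, checkpoints[1:]))
    if worst_gap > rpo:
        problems.append(f"largest gap between backups is {worst_gap}, RPO is {rpo}")
    return problems
```

In practice a daily job rarely finishes at the same second each day, so you would likely pass an `rpo` with a little slack, or schedule backups slightly more often than the recovery point you promised.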

Disclaimer: Restore and upgrade are production-changing operations on the management plane. Run restore drills in an isolated environment, never test a restore over a healthy production instance, take and later remove a pre-upgrade snapshot, and follow the version-specific upgrade sequence for your exact release before you begin.

The SFTP target nobody watches

The backup target is the quiet failure point. It is configured once, it works, and then nobody looks at it again until a restore needs a backup that was never written. The two ways it betrays you are predictable. It fills up, and new backups silently fail or rotate out the older ones you were counting on for a longer retention window. Or its credentials or host key change during some unrelated maintenance, and the scheduled job starts failing while the dashboard still shows the last good run from before the change.

Treat the target as a monitored dependency, not a set-and-forget setting. Alert on a missing or failed backup the same day it happens, not at quarter end. Size the storage for your full retention with headroom, because a 7-day retention that only fits 4 days of backups is a 4-day retention you do not know about. And confirm the target lives off the appliance and ideally off the same failure domain, since a backup on the thing you are trying to recover is not a backup at all.
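The sizing arithmetic is simple enough to write down once and rerun whenever backup sizes grow. A minimal Python sketch follows, assuming one backup set per day of a roughly constant size. The 25 percent headroom is my own placeholder, not a recommendation from the product documentation, and the sizes in the usage note are invented.

```python
def effective_retention_days(capacity_gb, backup_set_gb, backups_per_day=1):
    """How many whole days of backups actually fit on the target."""
    if backup_set_gb <= 0 or backups_per_day <= 0:
        raise ValueError("backup size and frequency must be positive")
    return int(capacity_gb // (backup_set_gb * backups_per_day))


def required_capacity_gb(retention_days, backup_set_gb, backups_per_day=1, headroom=0.25):
    """Capacity needed to hold the full retention window plus headroom."""
    return retention_days * backups_per_day * backup_set_gb * (1 + headroom)
```

With a hypothetical 50 GB backup set and a 200 GB target, `effective_retention_days(200, 50)` gives 4, which is the 7-day policy that is really a 4-day retention. `required_capacity_gb(7, 50)` gives 437.5, the size that same policy would need with the assumed headroom.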

Write the recovery runbook before you need it

Backups are a capability; recovery is a procedure, and the procedure is what fails under pressure. Write the runbook while everything is calm, and make it specific to your environment, not a paraphrase of the docs. It should name the exact backup path layout for your instance, the order to restore VCF Automation and the Identity Broker, who has the Administrator role in Fleet Management to run it, and how you verify success, which is a real user signing in and seeing their catalog, not a green task in the console.

The quarterly restore drill is what keeps that runbook honest. Each drill surfaces the small things that turn a 30-minute recovery into a half-day outage: an expired credential, a missing prerequisite, a step that assumed a version you have since upgraded past. Fix them on the drill, update the runbook, and the next real incident becomes a procedure you have already rehearsed rather than a problem you are solving for the first time with people watching.


The Bottom Line

Schedule file-based backups of VCF Automation and the Identity Broker together, send them to an SFTP target off the appliance, and prove the whole thing with a restore drill every quarter, because the only backup that counts is the one you have restored. Keep config-as-code alongside it so the common case, a change you regret, has a fast and versioned path back. For upgrades, snapshot, upgrade the lifecycle layer first, then the components, and delete the snapshot afterward. I would not rely on snapshots as backups, I would not back up Automation without the Identity Broker, and I would not write a 9.x upgrade runbook without accounting for the 9.1 move from a Fleet Management appliance to the Fleet Lifecycle and SDDC Lifecycle services. Validate the SFTP target capacity and one real restore before you call any of this done.

Book one restore drill into an isolated environment this quarter. If your orgs, catalog and logins all come back and people can sign in, your recovery plan is real. If anything is missing, you found it on your schedule instead of during an outage. The final Part wraps the series with troubleshooting, hardening and a verdict.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
