Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Stretched Cluster vs Site Recovery in VCF 9: Choosing Your DR Architecture (VCF 9 Series, Part 28)

A vSAN stretched cluster is not a disaster recovery plan. Here is how stretched clustering and Site Recovery differ in VCF 9, what changed in the tooling, and when to use each (or both).

VCF 9 Series · Part 28 of 36
Who this is for: VCF architects and infrastructure leads designing business continuity for VCF 9.
Prerequisites: a working knowledge of vSAN, vSphere HA, and your RPO/RTO targets per application tier.

Walk into enough design workshops and you hear the same sentence: "We run a stretched cluster, so disaster recovery is covered." It is not. A stretched cluster is an availability architecture. Site Recovery is a recovery architecture. They answer different failure modes, and in VCF 9 the distinction matters more than ever, because the tooling on both sides changed names and defaults, and a third capability arrived that neither one historically covered.

Two different problems wearing one label

Availability is about surviving a host or site failure without anyone filing a ticket. Disaster recovery is about bringing workloads back after something larger: a full site loss, a regional event, an operator mistake, or an attack, with a defined recovery point and a runbook you can actually rehearse. A vSAN stretched cluster is excellent at the first and largely blind to the second. The reason is mechanical. A stretched cluster writes synchronously to both sites, so whatever you commit, good data or a ransomware payload, lands in both copies at once. There is no point in time to roll back to, because both copies are always identical and always current.

Site Recovery flips that. It keeps an asynchronous, slightly older copy at a separate site, orchestrates a structured failover, and lets you test the whole thing in an isolated bubble before you ever need it. The cost is a non-zero recovery point and a manual or planned cutover instead of an automatic restart. Same goal, business continuity, two very different shapes.

[Figure: Two architectures for one goal: synchronous metro mirror versus asynchronous regional recovery. Left panel, vSAN Stretched Cluster (HA): Site A (active) and Site B (active) linked synchronously at <=5 ms, with a witness at a third site; automatic HA restart, but corruption is mirrored too. Right panel, Site Recovery (DR): production site replicating asynchronously (1-5 min) to a recovery region (second VCF instance); orchestrated, testable, tolerates distance.]
A stretched cluster is high availability; Site Recovery is the actual disaster-recovery plan.

vSAN stretched cluster: synchronous metro availability

A vSAN stretched cluster spans two active data sites plus a witness at a third location. vSAN keeps one full copy of each object at the preferred site and one full copy at the secondary site, with a small witness component holding the quorum vote. Writes are acknowledged only after both sites commit, which is what gives you the near-zero recovery point. If a whole site goes down, vSphere HA restarts the affected VMs at the surviving site immediately, because a complete copy already lives there. No replication catch-up, no rebuild wait.

The constraint that decides feasibility is latency. The two data sites must sit within 5 ms round-trip time of each other, which in practice means metro distance, not cross-country. The witness tolerates much higher latency, up to roughly 200 ms RTT, so it can live in a small third site or a cloud edge. VCF 9.0 also made stretched-cluster site maintenance far less painful, reducing what used to be a careful manual dance to a single UI action or API call.
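The latency budget above is easy to turn into a pre-design check. The following is a minimal sketch, assuming you have already collected RTT samples yourself (for example from a long-running ping between sites). The `LatencyMeasurements` shape and the function names are hypothetical, and the 5 ms and 200 ms limits are simply the figures quoted in this article; confirm them against current vSAN guidance.

```python
from dataclasses import dataclass

# Limits quoted in the article; treat them as assumptions to verify.
MAX_DATA_SITE_RTT_MS = 5.0
MAX_WITNESS_RTT_MS = 200.0


@dataclass
class LatencyMeasurements:
    """Measured round-trip times in milliseconds (hypothetical input shape)."""
    site_a_to_site_b: list[float]
    site_a_to_witness: list[float]
    site_b_to_witness: list[float]


def worst_case(samples: list[float]) -> float:
    """Budget against the worst observed sample, not the average."""
    if not samples:
        raise ValueError("need at least one RTT sample")
    return max(samples)


def stretched_cluster_feasible(m: LatencyMeasurements) -> tuple[bool, list[str]]:
    """Return (feasible, problems) for a stretched-cluster latency budget."""
    problems = []
    inter_site = worst_case(m.site_a_to_site_b)
    if inter_site > MAX_DATA_SITE_RTT_MS:
        problems.append(
            f"data sites: {inter_site:.1f} ms exceeds {MAX_DATA_SITE_RTT_MS} ms"
        )
    witness_links = (
        ("site A to witness", m.site_a_to_witness),
        ("site B to witness", m.site_b_to_witness),
    )
    for name, samples in witness_links:
        rtt = worst_case(samples)
        if rtt > MAX_WITNESS_RTT_MS:
            problems.append(f"{name}: {rtt:.1f} ms exceeds {MAX_WITNESS_RTT_MS} ms")
    return (not problems, problems)
```

Using the worst sample rather than the mean is a deliberate choice here: synchronous writes feel every latency spike, so a link that averages 3 ms but peaks at 9 ms should fail this check.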

The mistake I see most often is capacity planning. For either site to survive the loss of the other, each site must be able to run the entire workload on its own. If you let utilization drift past roughly half of usable capacity, a site failure leaves you unable to power everything back on, and the elegant automatic failover quietly turns into a partial outage. Plan the sizing for single-site survival from day one, and revisit it as the cluster grows. For the underlying storage decisions that feed this, see vSAN ESA vs OSA in VCF 9, and for how the cluster sits inside the broader topology, the VCF 9 reference architecture.
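The "roughly half" rule is simple arithmetic, and a rough sketch makes it concrete. The function below is hypothetical and deliberately simplified: it treats capacity as a single number in consistent units (GHz, GiB of RAM, or usable TiB) and ignores HA admission control overhead, vSAN slack space, and per-VM placement constraints, all of which tighten the real limit.

```python
def single_site_survival(site_a_capacity: float,
                         site_b_capacity: float,
                         total_demand: float) -> dict:
    """Check whether the smaller site alone could host the full workload.

    All three arguments must use the same unit. This is an illustrative
    model, not a sizing tool.
    """
    if min(site_a_capacity, site_b_capacity) <= 0:
        raise ValueError("site capacities must be positive")
    # Plan against the smaller site: that is the worst survivor.
    survivor = min(site_a_capacity, site_b_capacity)
    combined = site_a_capacity + site_b_capacity
    return {
        "cluster_utilization": total_demand / combined,
        "survivor_utilization": total_demand / survivor,
        "survives_site_loss": total_demand <= survivor,
    }


# Two equal sites of 1000 units each.
healthy = single_site_survival(1000, 1000, 900)     # 45% of the cluster
drifted = single_site_survival(1000, 1000, 1200)    # 60% of the cluster
print(healthy["survives_site_loss"], drifted["survives_site_loss"])  # True False
```

Note that asymmetric sites shift the threshold: with 1200 units at one site and 800 at the other, the safe ceiling is 800 units of demand, which is 40% of the combined cluster, not 50%.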

The hard limit: a stretched cluster does nothing for logical corruption. Delete a database, encrypt a file server, push a bad change, and the synchronous mirror dutifully replicates the damage to the other site in milliseconds. It is high availability, not recoverability.

Site Recovery: asynchronous, orchestrated, testable DR

The async side of the house is where the names shifted in VCF 9, so it is worth getting the terms straight. The product is now VMware Live Recovery, and the individual appliances for Live Site Recovery, vSAN Data Protection, and vSphere Replication have been consolidated into a single Live Recovery appliance. Live Site Recovery 9.0.3 is the release that adds VCF 9.0 compatibility. In VCF 9.1 the capability is folded into the platform as VCF Protection and Recovery.

The bigger change is under the hood. Enhanced vSphere Replication is now the default and only supported site-to-site replication mode going forward; legacy vSphere Replication configurations are not supported after vSphere Replication 9.0.2.2. Enhanced mode moves the replication data path directly between the ESX hosts that own the storage rather than funneling it through the replication appliance, auto load-balances replicas across target hosts every 30 minutes, and requires network encryption. Enhanced mode also adds a new outbound firewall requirement on the hosts: port 32032 for the hbr-agent service. Plan for that port before you cut over, because it is the most common reason an otherwise correct Enhanced configuration refuses to pair.
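A quick reachability probe for that port can save a failed pairing attempt. This is a minimal sketch using only the Python standard library. It assumes the port is TCP (the text above names only the port number) and it tests the path from wherever the script runs, which is not necessarily the path the source ESX hosts will use, so treat a pass as a hint rather than proof. The function names are hypothetical.

```python
import socket

# Port named in the article for the hbr-agent service; TCP is an assumption.
HBR_AGENT_PORT = 32032


def tcp_port_reachable(host: str, port: int = HBR_AGENT_PORT,
                       timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_hosts(hosts: list[str], port: int = HBR_AGENT_PORT,
                      timeout: float = 3.0) -> list[str]:
    """Return the subset of hosts where the port could not be reached."""
    return [h for h in hosts if not tcp_port_reachable(h, port, timeout)]
```

Run it from a machine on the same network segment as the source hosts, against the target-site host addresses, and chase any name it returns with the firewall team before the cutover window.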

On recovery point, watch the licensing detail. With a legacy SRM license key the minimum RPO stays at 5 minutes; once a VMware Live Recovery subscription is applied, you can drive it down to 1 minute. The real value of Site Recovery, though, is not the RPO number. It is the recovery plan: an ordered, scripted failover with priority groups, IP customization, and dependency sequencing, plus non-disruptive test recovery into an isolated network so you can prove the runbook works without touching production. A stretched cluster cannot give you that rehearsal. Because it is asynchronous, Site Recovery also tolerates real distance, so your recovery site can sit in another region entirely. That usually means a second VCF instance; the parallel-instance pattern covers how to stand one up. And do not let replication replace backups; the failure modes in VCF 9 backup failures still apply.
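To make "priority groups plus dependency sequencing" concrete, here is a rough model of how a power-on order could be derived. It is not how Live Site Recovery implements recovery plans; `ProtectedVM` and `power_on_order` are hypothetical names, and the sketch only illustrates the ordering idea using Python's standard `graphlib` module.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ProtectedVM:
    name: str
    priority: int                                  # 1 = earliest group
    depends_on: list[str] = field(default_factory=list)


def power_on_order(vms: list[ProtectedVM]) -> list[str]:
    """Order VMs by priority group, then by dependencies inside each group."""
    by_name = {vm.name: vm for vm in vms}
    order: list[str] = []
    for priority in sorted({vm.priority for vm in vms}):
        graph: dict[str, list[str]] = {}
        for vm in (v for v in vms if v.priority == priority):
            same_group_deps = []
            for dep in vm.depends_on:
                dep_priority = by_name[dep].priority
                if dep_priority > priority:
                    # A dependency in a later group can never be satisfied.
                    raise ValueError(
                        f"{vm.name} depends on {dep}, which starts later"
                    )
                if dep_priority == priority:
                    same_group_deps.append(dep)
            graph[vm.name] = same_group_deps
        # static_order() yields each node after all of its predecessors
        # and raises graphlib.CycleError on circular dependencies.
        order.extend(TopologicalSorter(graph).static_order())
    return order


plan = [
    ProtectedVM("web", priority=2, depends_on=["app"]),
    ProtectedVM("app", priority=2, depends_on=["db"]),
    ProtectedVM("db", priority=1),
    ProtectedVM("reporting", priority=3),
]
print(power_on_order(plan))  # ['db', 'app', 'web', 'reporting']
```

The useful part is the validation: a dependency that points at a later priority group, or a cycle, is exactly the kind of runbook bug a test recovery is meant to catch before a real failover does.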

The decision matrix

| Dimension | vSAN Stretched Cluster | Site Recovery (async) |
| --- | --- | --- |
| Protection mechanism | Synchronous mirror across two active sites | Asynchronous replication to a recovery site |
| Typical RPO | Near zero, no data loss | 1 min (VLR subscription) to 5 min (legacy SRM key) |
| Typical RTO | Automatic vSphere HA restart, seconds to minutes | Orchestrated recovery plan, minutes |
| Distance and latency | Up to 5 ms RTT between data sites (metro) | Latency tolerant, regional or cross-country |
| Extra infrastructure | vSAN witness appliance at a third site | Second VCF instance plus Live Recovery appliance |
| Failover style | Automatic, no orchestration | Scripted recovery plans with priority groups |
| Non-disruptive test | No isolated test recovery | Yes, test in an isolated bubble network |
| Logical corruption / ransomware | No protection, damage is mirrored instantly | Only with Protection and Recovery cyber recovery |
| Licensing | Included in vSAN / VCF entitlement | VMware Live Recovery subscription |

The gap both options leave open: cyber recovery

Here is the uncomfortable part. Neither a stretched cluster nor plain async replication saves you from ransomware. The stretched cluster mirrors the encryption instantly. Async replication faithfully ships the encrypted blocks to the recovery site within the RPO window, so unless you have immutable, point-in-time restore points and somewhere clean to validate them, your second site just inherits the infection.
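The difference between "latest replica" and "point-in-time restore points" fits in a few lines. This sketch assumes you know (or can estimate) when the infection started, which in practice is the hard part, and that a restore point is just a timestamp. The function name is hypothetical, and a candidate it returns still needs validation somewhere clean before you trust it.

```python
from datetime import datetime, timedelta
from typing import Optional


def latest_clean_restore_point(restore_points: list[datetime],
                               infection_time: datetime) -> Optional[datetime]:
    """Return the newest restore point taken strictly before infection_time.

    Returns None when every retained point postdates the infection, which
    is the situation a latest-copy-only replica is always in.
    """
    candidates = [p for p in restore_points if p < infection_time]
    return max(candidates, default=None)


detected = datetime(2025, 1, 1, 12, 0)
infected = detected - timedelta(hours=3)   # dwell time before detection

# Replication that keeps only the current copy: nothing clean to return to.
only_latest = [detected - timedelta(minutes=5)]
print(latest_clean_restore_point(only_latest, infected))  # None

# Retained point-in-time history: the 6-hour-old point predates the infection.
history = [detected - timedelta(hours=h) for h in (1, 6, 24)]
print(latest_clean_restore_point(history, infected))      # 2025-01-01 06:00:00
```

The second case recovers, but at a cost of six hours of data rather than the five-minute RPO on the datasheet. That gap between replication RPO and cyber-recovery RPO is worth stating explicitly in the design.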

VCF 9.1 closes this with integrated cyber recovery built on vSAN snapshots. The workflow runs through an Isolated Recovery Environment: suspect VMs are restored into an isolated state, their snapshots scanned and cleaned by built-in AI/ML-powered EDR (with CrowdStrike Falcon integration available), promoted to a clean replica once verified, recovered to resume operations, then reprotected and failed back. Paired with the VMware Advanced Cyber Compliance add-on, you can build a fully customer-owned on-premises clean room instead of renting a cloud recovery site, and new Cyber Recovery Ready Nodes with high-capacity QLC drives bring the cost of that clean room down. Treat this as a third leg of the design, not an afterthought, because it is the only one of the three that assumes the adversary is already inside.
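The workflow above is a strict sequence with one gate: nothing is promoted until it has been verified clean. A small state model makes that gate explicit. This is purely illustrative, with hypothetical names; it does not reflect any VCF API, only the order of steps described in the paragraph.

```python
from enum import Enum


class Stage(Enum):
    SUSPECT = "suspect VM identified"
    ISOLATED = "restored into Isolated Recovery Environment"
    SCANNED = "snapshots scanned and cleaned"
    PROMOTED = "promoted to clean replica"
    RECOVERED = "recovered, operations resumed"
    REPROTECTED = "reprotected"
    FAILED_BACK = "failed back"


# Enum iteration follows definition order, which is the workflow order.
WORKFLOW = list(Stage)


def next_stage(current: Stage, verified_clean: bool = False) -> Stage:
    """Advance one step, refusing to promote anything not verified clean."""
    index = WORKFLOW.index(current)
    if index == len(WORKFLOW) - 1:
        raise ValueError("workflow already complete")
    following = WORKFLOW[index + 1]
    if following is Stage.PROMOTED and not verified_clean:
        raise RuntimeError("refusing to promote an unverified snapshot")
    return following
```

What happens when a scan fails is not covered in the description above, so the sketch simply refuses to advance; in a real runbook that branch (fall back to an older snapshot, escalate, or discard) deserves its own documented decision.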

Disclaimer: Before committing a DR design, validate the target BOM against the interoperability matrix, confirm site latency budgets with real measurements (not vendor brochures), size each stretched-cluster site for single-site survival, back up independently of replication, and rehearse a full recovery-plan test before you rely on it in production.
[Figure: The third leg: cyber recovery, the only leg that assumes the adversary is already inside. Steps: 1. Restore to Isolated Recovery Environment; 2. Scan and clean with built-in EDR; 3. Promote clean replica; 4. Recover and reprotect. Neither a stretched cluster nor async replication saves you from ransomware; both faithfully copy the damage.]
VCF 9.1 cyber recovery runs through an Isolated Recovery Environment with snapshot-based clean restore.

What I’d Do

Use a stretched cluster when you have two data centers inside a 5 ms RTT envelope and tier-1 applications that cannot tolerate data loss or a manual restart. Use Site Recovery when you need genuine geographic separation and a testable, orchestrated runbook, which is most real DR programs. The strongest designs compose all three: a metro stretched cluster for intra-region resilience, async Site Recovery to a distant region for true disaster recovery, and integrated cyber recovery for the threat the first two cannot touch.
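Those rules of thumb can be written down as a small decision helper, which is handy as a workshop checklist. This is my shorthand for the guidance above and nothing more: the function name, parameters, and leg labels are hypothetical, and the 5 ms figure is the limit quoted earlier in this article.

```python
from typing import Optional

DATA_SITE_RTT_LIMIT_MS = 5.0  # metro limit quoted in the article


def recommend_legs(inter_site_rtt_ms: Optional[float],
                   zero_data_loss_tier1: bool,
                   needs_regional_separation: bool,
                   ransomware_in_scope: bool) -> dict:
    """Map requirements to the three legs discussed in this article.

    inter_site_rtt_ms is the measured RTT between two candidate data
    sites, or None if there is no second site at metro distance.
    """
    legs: list[str] = []
    notes: list[str] = []
    if zero_data_loss_tier1:
        within_metro = (inter_site_rtt_ms is not None
                        and inter_site_rtt_ms <= DATA_SITE_RTT_LIMIT_MS)
        if within_metro:
            legs.append("stretched cluster")
        else:
            notes.append("zero data loss requested, but no site pair within 5 ms RTT")
    if needs_regional_separation:
        legs.append("site recovery")
    if ransomware_in_scope:
        legs.append("cyber recovery")
    if "site recovery" not in legs:
        notes.append("without Site Recovery there is no regional DR leg")
    return {"legs": legs, "notes": notes}
```

A call such as `recommend_legs(3.2, True, True, True)` returns all three legs with no notes, which matches the composed design described above.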

If someone forces a single answer to the question "What is your DR?", it is Site Recovery, not the stretched cluster. A stretched cluster cannot survive the loss of the region, and it cannot survive ransomware. It is the best high-availability tool VCF gives you, and it is not a disaster recovery plan. Which leg of this is weakest in your current design: the distance, the test discipline, or the cyber recovery?


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
