Dr. Pranay Jha

VCF 9 Stretched Clusters and Multi-Site Design: Topology, Witness and the Latency Budget (VCF 9 Series, Part 30)

A reference design for stretching VMware Cloud Foundation 9 across two sites: vSAN site awareness, witness placement, the 5 ms latency budget, storage policy, and when a single stretched instance beats multi-instance federation.

VCF 9 Series · Part 30 of 36

TL;DR · Key Takeaways

  • A VCF 9 stretched cluster is one logical cluster across two data sites plus a third witness site. It is a single VCF instance with one vCenter and one set of NSX managers, not two of each.
  • The hard number that governs the whole design is 5 ms RTT between the two data sites. Cross it and synchronous mirroring punishes every write.
  • VCF 9 finally makes a vSphere compute cluster site aware so it can mount a stretched vSAN storage cluster correctly. Disaggregated stretched topologies now behave like a stretched HCI cluster.
  • Set site disaster tolerance (dual-site mirroring) and local FTT as two separate policy decisions. Conflating them is the most common capacity mistake.
  • If your sites are far apart or links are unreliable, do not stretch. Run two instances and use multi-instance federation instead.

Who this is for: Architects and senior admins designing a metro or dual-site VCF 9 deployment.

Prerequisites: A validated inter-site link, a third witness location, and a sizing target for the management and workload domains you intend to stretch.

Most stretched cluster projects do not fail in the architecture review. They fail six months later, on the day the inter-site link degrades from 4 ms to 9 ms because someone re-routed a fibre path, and suddenly every VM with synchronous mirroring is crawling. Stretched clustering in VCF 9 is genuinely good now, better than the bolt-on Metro Storage Cluster era it replaces. But it is also unforgiving of a sloppy network design. This part lays out the reference topology, the numbers that actually constrain it, and the decision of when not to stretch at all.

Two multi-site models, and they are not interchangeable

VCF 9 gives you two fundamentally different ways to span more than one location, and choosing the wrong one is the root of most multi-site pain.

A stretched cluster is one VCF instance. A single vCenter, one SDDC Manager, one set of NSX managers, and one vSAN datastore are spread across two data sites that behave as active-active availability zones. vSphere HA restarts a VM at the surviving site automatically if a site is lost. It looks and operates like a single cluster that happens to have hardware in two buildings. This is the model when your two sites are close, well connected, and you want transparent recovery with no failover runbook to execute.

Multi-instance federation is the opposite. Each region is its own complete VCF instance with its own SDDC Manager, vCenter, and NSX, and there is no shared management plane. You manage them together through fleet management for consistent policy and visibility, but recovery between instances is a deliberate, orchestrated action, not an automatic HA event. This is the model for distant regions, high-latency links, or any time you want hard fault isolation between sites.

The dividing line is latency and link quality, not distance on a map. If you can hold the inter-site budget below 5 ms RTT reliably, stretching is on the table. If you cannot, the decision is already made for you.

Figure: Two multi-site models compared. Stretched cluster (one instance): one vCenter, SDDC Manager, and NSX across two sites; automatic vSphere HA failover; inter-site latency ≤ 5 ms RTT. Multi-instance federation: a separate full VCF instance per region; orchestrated, deliberate recovery; latency tolerant, with fault isolation.
Stretch for transparent HA across a metro link; federate for distance and hard fault isolation.

The stretched topology

A stretched vSAN cluster always uses three fault domains: the preferred site, the secondary site, and the witness. Data lives at the two data sites; the witness holds only metadata and component votes so the cluster can break a tie and avoid split-brain when the inter-site link drops. The witness is a lightweight virtual appliance at a third location, and it never holds VM data.

Figure: VCF 9 stretched cluster topology, one VCF instance across two data sites with a witness at a third location. Site A (preferred, fault domain 1) and Site B (secondary, fault domain 2) each run ESXi hosts with stretched vSAN and hold a full mirror copy of the data; NSX edges are active at Site A and standby at Site B. The inter-site link is ≤ 5 ms RTT at 10-25 Gbps. The witness site runs the vSAN witness appliance (metadata and votes only) over a link of ≤ 200 ms RTT and about 100 Mbps.
Three fault domains: two active data sites under a tight latency budget, plus a metadata-only witness that tolerates much higher latency.
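
To see why the third fault domain matters, here is a minimal sketch of a majority-vote check. It is a deliberately simplified model that assumes one vote per fault domain; real vSAN tracks votes per object component. The function name `has_quorum` and the fault domain labels are illustrative, not part of any VMware API.

```python
# Simplified model: one vote per fault domain (an assumption; real vSAN
# assigns votes per object component).
FAULT_DOMAINS = {"preferred": 1, "secondary": 1, "witness": 1}


def has_quorum(reachable: set[str], votes: dict[str, int] = FAULT_DOMAINS) -> bool:
    """Return True if the reachable fault domains hold a strict majority of votes."""
    total = sum(votes.values())
    held = sum(v for name, v in votes.items() if name in reachable)
    return held * 2 > total


# Inter-site link drops: each data site sees itself plus, at most, the witness.
print(has_quorum({"preferred", "witness"}))  # True: this partition stays online
print(has_quorum({"secondary"}))             # False: the isolated site loses the tie

# With only two voters, a split leaves neither side with a majority.
two_site_only = {"preferred": 1, "secondary": 1}
print(has_quorum({"preferred"}, two_site_only))  # False
```

The last case is the split-brain risk the witness exists to remove: without a third voter, neither site can prove it is the survivor.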

The improvement that matters most in VCF 9 is site awareness for disaggregated storage. With vSAN storage clusters (formerly vSAN Max), you can separate compute from storage. Before VCF 9, a vSphere cluster mounting a stretched vSAN storage datastore had no concept of fault domains, so it could not pick the local data path and could not restart VMs correctly after a site failure. VCF 9 introduces the “vSAN compute cluster,” a vSphere cluster with a thin vSAN layer that is itself site aware. You name a fault domain per site, assign hosts, then couple those fault domains to the storage cluster’s fault domains. Now the VM reads and writes over the optimal local path, and HA restarts it on the surviving site if a site goes down. Note one detail teams miss: the stretched vSAN storage cluster needs the witness, but the stretched vSphere compute cluster mounting it does not need its own separate witness.


The latency and bandwidth budget

This is the part of the design you cannot negotiate away, so put real numbers on the table early. The two data sites carry synchronous mirrored writes, which means write acknowledgement waits for both sites. The witness link, by contrast, carries only metadata and tolerates far more latency.

| Link / Component | Requirement | Why it matters |
| --- | --- | --- |
| Inter-site data latency | ≤ 5 ms RTT | Every write is mirrored synchronously; breach adds latency to all I/O and risks resync storms |
| Inter-site bandwidth | 10 Gbps minimum, 25 Gbps recommended | Carries mirrored writes plus vMotion plus rebuild and resync traffic |
| Witness latency | ≤ 200 ms RTT | Witness stores metadata and votes only, so it can sit far away and even in the cloud |
| Witness bandwidth | ~100 Mbps (sizing dependent) | Light traffic; scales with component count, not capacity |
| MTU | 9000 (jumbo) end to end | Recommended for vSAN performance; must be consistent across the whole path |
| Witness appliance (9.1) | Medium 4 vCPU / 16 GB, Large 4 vCPU / 32 GB, XL 8 vCPU / 64 GB | Pick by total component count, not raw TB |
| Site disaster tolerance | Dual-site mirroring | One full copy at each site; this is what doubles raw capacity |
| Local protection | FTT=1, RAID-5 or RAID-1 per site | Survives a host failure within a site without a cross-site rebuild |
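
The effect of the inter-site RTT on synchronous writes can be sketched with a rough back-of-the-envelope model. This is an illustration under stated assumptions, not a vSAN performance formula: it assumes the local and remote commits run in parallel, that the remote copy pays one full round trip, and it ignores queuing and protocol overhead. The function and parameter names are hypothetical.

```python
def mirrored_write_latency_ms(local_commit_ms: float,
                              remote_commit_ms: float,
                              inter_site_rtt_ms: float) -> float:
    """Estimate write acknowledgement time when both site copies must commit.

    The write is acknowledged only when the slower of the two paths finishes,
    and the remote path always includes the inter-site round trip.
    """
    remote_path_ms = inter_site_rtt_ms + remote_commit_ms
    return max(local_commit_ms, remote_path_ms)


# Same hosts, same disks, only the link changes.
print(mirrored_write_latency_ms(0.5, 0.5, 4.0))  # 4.5
print(mirrored_write_latency_ms(0.5, 0.5, 9.0))  # 9.5
```

Under this model the link, not the storage hardware, sets the floor on every write, which is why a fibre re-route shows up immediately as application latency.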

A practical warning on the 5 ms figure: design to it as a ceiling you stay well under, not a target you skate against. Inter-site paths drift. A link that measures 3 ms at install can creep upward after a carrier re-route or a maintenance window, and vSAN does not care about your excuses when write latency climbs. Put active monitoring on the inter-site RTT and alert below the limit, not at it. The witness can sit much further out, which is why a cloud-hosted or remote-office witness is a perfectly reasonable choice when you genuinely have only two facilities.
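
Alerting below the limit rather than at it could look something like the sketch below. The 80% warning threshold and the choice of the 95th percentile are assumptions made for illustration, and `classify_rtt` is a hypothetical helper; how you collect the RTT samples depends on your own monitoring tooling.

```python
from statistics import quantiles

RTT_LIMIT_MS = 5.0   # inter-site ceiling from the budget table
WARN_FRACTION = 0.8  # assumption: start warning at 80% of the ceiling


def classify_rtt(samples_ms: list[float],
                 limit_ms: float = RTT_LIMIT_MS,
                 warn_fraction: float = WARN_FRACTION) -> str:
    """Classify a window of RTT samples as 'ok', 'warn', or 'breach'.

    Uses the 95th percentile rather than the mean so that spikes under
    load are not averaged away.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(samples_ms, n=20, method="inclusive")[-1]
    if p95 > limit_ms:
        return "breach"
    if p95 >= limit_ms * warn_fraction:
        return "warn"
    return "ok"


print(classify_rtt([3.0] * 10))  # ok
print(classify_rtt([4.2] * 10))  # warn
print(classify_rtt([9.0] * 10))  # breach
```

The useful state is "warn": it gives you time to chase the carrier before writes are already suffering.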

Storage policy: two decisions, not one

The single biggest capacity surprise in stretched designs comes from treating protection as one knob. It is two. Site disaster tolerance decides how data is mirrored across sites, and the standard choice is dual-site mirroring: one complete copy of the object at each data site. That alone means your usable capacity math starts from a 2x raw footprint before you have protected anything within a site.

Local failures to tolerate is the separate, second knob that protects each site copy against host or disk loss inside that site. FTT=1 with RAID-5 erasure coding inside each site is the common production choice on ESA hardware, because it keeps a host failure local rather than forcing a cross-site rebuild over your precious inter-site bandwidth. If you set only site mirroring and leave local FTT at zero, a single host failure at one site degrades that site’s copy and any further fault gets ugly fast. Enable Auto-Policy Management on the vSAN storage cluster so the default policy is generated to match the actual topology and host count, which takes the guesswork out of getting these two settings coherent.

Figure: Storage protection as two decisions. Site disaster tolerance: dual-site mirroring, one full copy per site, 2x raw before local protection. Local failures to tolerate: FTT=1 with RAID-5 inside each site, which survives a host loss locally with no cross-site rebuild. Set both, and enable Auto-Policy Management so the policy matches the actual topology.
Mirror across sites for disaster tolerance, then protect each copy locally so a host failure stays local.
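
The two knobs multiply, which is easiest to see with the arithmetic written out. The sketch below is a rough sizing aid, not a vSAN sizer: the local overhead factors are commonly quoted values and are assumptions here (the RAID-5 factor in particular depends on the vSAN architecture and host count), and it ignores slack space, rebuild reserve, and data reduction. All names are hypothetical.

```python
# Local protection overhead factors. Assumed values for illustration;
# confirm them for your vSAN architecture and per-site host count.
LOCAL_OVERHEAD = {
    "none": 1.0,        # local FTT=0
    "raid1_ftt1": 2.0,  # two full copies inside each site
    "raid5_2p1": 1.5,   # 2 data + 1 parity
    "raid5_4p1": 1.25,  # 4 data + 1 parity
}
SITE_COPIES = 2  # dual-site mirroring: one full copy at each data site


def raw_capacity_tb(usable_tb: float, local_scheme: str) -> dict[str, float]:
    """Raw capacity needed per site and in total for a given usable target."""
    per_site = usable_tb * LOCAL_OVERHEAD[local_scheme]
    return {"per_site_tb": per_site, "total_tb": per_site * SITE_COPIES}


print(raw_capacity_tb(100, "none"))        # {'per_site_tb': 100.0, 'total_tb': 200.0}
print(raw_capacity_tb(100, "raid5_4p1"))   # {'per_site_tb': 125.0, 'total_tb': 250.0}
print(raw_capacity_tb(100, "raid1_ftt1"))  # {'per_site_tb': 200.0, 'total_tb': 400.0}
```

For the same 100 TB usable target, the local scheme alone moves the total raw footprint between 200 TB and 400 TB, which is why the two decisions should be made and costed separately.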

Compute, DRS and NSX across the sites

Stretching the storage is only half the design. Without DRS rules, VMs drift to whichever site has spare capacity, and on a site failure you can end up restarting far more than you planned. Create host groups per site, VM groups, and should-rules that keep each workload affined to its primary site under normal operation. Use should-rules, not must-rules, so HA can still restart a VM at the opposite site when its home site is gone. And size each site to run the combined load alone. A stretched cluster that cannot absorb the full workload on one site is a stretched cluster that fails its only real test.
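
The "size each site to run the combined load alone" rule is simple enough to check mechanically. The following is a minimal sketch using hypothetical names and only CPU and memory; holding back one spare host per site for a local failure is an assumption added for illustration, not a requirement stated above.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    hosts: int
    ghz_per_host: float
    gb_ram_per_host: float


def site_can_run_alone(site: Site, demand_ghz: float, demand_gb: float,
                       spare_hosts: int = 1) -> bool:
    """True if the site can carry the combined load with spare_hosts held back."""
    usable_hosts = site.hosts - spare_hosts
    if usable_hosts <= 0:
        return False
    return (usable_hosts * site.ghz_per_host >= demand_ghz
            and usable_hosts * site.gb_ram_per_host >= demand_gb)


# Combined demand of both sites' workloads (illustrative numbers).
site_a = Site("site-a", hosts=8, ghz_per_host=60.0, gb_ram_per_host=1024.0)
site_b = Site("site-b", hosts=6, ghz_per_host=60.0, gb_ram_per_host=1024.0)
for site in (site_a, site_b):
    print(site.name, site_can_run_alone(site, demand_ghz=400.0, demand_gb=6000.0))
```

In this example the smaller site fails the check on CPU, so a loss of the larger site would leave workloads that HA cannot restart.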

On the network side, a stretched single instance keeps one set of NSX managers, with edge clusters typically active at the preferred site and standby at the secondary. The overlay spans both sites, so an east-west segment and its distributed firewall rules follow the VM wherever HA places it, which is exactly the transparent behaviour you stretch for. Get the underlay right first: the inter-site link must carry vSAN, vMotion, and NSX overlay traffic together, so the bandwidth table above is a floor, not a goal. Many of the same fault lines from single-site work apply here, only amplified, so it is worth revisiting the VCF 9 network design mistakes before you commit the physical layout.

A design question worth settling early: do you stretch the management domain too, or only workload domains? Stretching the management domain protects SDDC Manager, vCenter, and the NSX managers against a full site loss, which is the cleaner outcome. It also costs you more hosts and tightens the same latency budget on the components you least want to lose. For many designs, stretching the workload domains while keeping a well-protected management domain at the primary site, backed by tested restore, is the more honest trade. Decide it deliberately rather than letting the installer defaults decide for you.

Disclaimer: Stretching a cluster is a production-impacting change. Validate the target BOM against the compatibility guide, confirm inter-site latency and bandwidth under real load, back up management components, run the vSAN and NSX prechecks, and test a full site-failure scenario in a non-production instance before you rely on it.

When NOT to stretch

Stretching is the wrong answer more often than vendors imply. Do not stretch if you cannot guarantee the inter-site budget under peak load, not just at 3am. Do not stretch across an unreliable or shared link where a brief partition will trigger needless failovers. Do not stretch when your sites are genuinely far apart, because synchronous mirroring over high latency is not a performance problem you can tune away, it is physics. And do not stretch when what you actually want is fault isolation, where a problem at one site must never be able to affect the other.
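
The four conditions above can be written down as a simple pre-design checklist. This is a sketch with hypothetical field names, and the inputs are judgments you have to supply from real measurements; the code only makes the decision explicit and repeatable.

```python
from dataclasses import dataclass


@dataclass
class SitePair:
    peak_rtt_ms: float           # measured under peak load, not at idle
    link_dedicated: bool         # False for a shared link
    link_reliable: bool          # False if brief partitions are expected
    needs_fault_isolation: bool  # True if one site must never affect the other


def stretch_blockers(pair: SitePair, rtt_limit_ms: float = 5.0) -> list[str]:
    """Return the reasons not to stretch; an empty list means stretching is on the table."""
    blockers = []
    if pair.peak_rtt_ms > rtt_limit_ms:
        blockers.append("inter-site RTT exceeds the budget under peak load")
    if not pair.link_dedicated or not pair.link_reliable:
        blockers.append("link is shared or unreliable")
    if pair.needs_fault_isolation:
        blockers.append("hard fault isolation is required")
    return blockers


metro = SitePair(peak_rtt_ms=2.5, link_dedicated=True,
                 link_reliable=True, needs_fault_isolation=False)
regional = SitePair(peak_rtt_ms=18.0, link_dedicated=False,
                    link_reliable=True, needs_fault_isolation=True)
print(stretch_blockers(metro))     # []
print(stretch_blockers(regional))  # three blockers: federate instead
```

Any non-empty result points toward two instances under multi-instance federation rather than a stretched cluster.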

In those cases run two independent VCF instances under multi-instance federation, and recover between them with an orchestrated DR product rather than HA. If you are weighing automatic stretched recovery against an orchestrated runbook, the trade-offs are covered directly in Stretched Cluster vs Site Recovery in VCF 9. For the broader sizing and topology context this design sits inside, see the VCF 9 reference architecture.

What I’d Do

For two metro sites with a clean, dedicated, monitored link comfortably inside 5 ms, stretch it. VCF 9 has closed the old gaps, disaggregated storage is finally site aware, and the operational payoff of transparent HA recovery is real. Stretch the workload domains, set dual-site mirroring with FTT=1 RAID-5 locally, size each site for the full load, and put a cloud or remote witness at the third point. But the moment the inter-site link is shared, unreliable, or pushing the latency ceiling, stop. Run two instances and federate. A stretched cluster on a marginal link is not high availability, it is a shared failure domain wearing a costume. Which way is your next site pair leaning, and what is the real measured RTT between them?



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
