TL;DR · Key Takeaways
- A VCF 9 stretched cluster is one logical cluster across two data sites plus a third witness site. It is a single VCF instance with one vCenter and one set of NSX managers, not two of each.
- The hard number that governs the whole design is 5 ms RTT between the two data sites. Cross it and synchronous mirroring punishes every write.
- VCF 9 finally makes a vSphere compute cluster site aware so it can mount a stretched vSAN storage cluster correctly. Disaggregated stretched topologies now behave like a stretched HCI cluster.
- Set site disaster tolerance (dual-site mirroring) and local FTT as two separate policy decisions. Conflating them is the most common capacity mistake.
- If your sites are far apart or links are unreliable, do not stretch. Run two instances and use multi-instance federation instead.
Most stretched cluster projects do not fail in the architecture review. They fail six months later, on the day the inter-site link degrades from 4 ms to 9 ms because someone re-routed a fibre path, and suddenly every VM with synchronous mirroring is crawling. Stretched clustering in VCF 9 is genuinely good now, better than the bolt-on Metro Storage Cluster era it replaces. But it is also unforgiving of a sloppy network design. This part lays out the reference topology, the numbers that actually constrain it, and the decision of when not to stretch at all.
Two multi-site models, and they are not interchangeable
VCF 9 gives you two fundamentally different ways to span more than one location, and choosing the wrong one is the root of most multi-site pain.
A stretched cluster is one VCF instance. A single vCenter, one SDDC Manager, one set of NSX managers, and one vSAN datastore are spread across two data sites that behave as active-active availability zones. vSphere HA restarts a VM at the surviving site automatically if a site is lost. It looks and operates like a single cluster that happens to have hardware in two buildings. This is the model when your two sites are close, well connected, and you want transparent recovery with no failover runbook to execute.
Multi-instance federation is the opposite. Each region is its own complete VCF instance with its own SDDC Manager, vCenter, and NSX, and there is no shared management plane. You manage them together through fleet management for consistent policy and visibility, but recovery between instances is a deliberate, orchestrated action, not an automatic HA event. This is the model for distant regions, high-latency links, or any time you want hard fault isolation between sites.
The dividing line is latency and link quality, not distance on a map. If you can reliably hold the inter-site RTT at or below 5 ms, and in practice well under it, stretching is on the table. If you cannot, the decision is already made for you.
The stretched topology
A stretched vSAN cluster always uses three fault domains: the preferred site, the secondary site, and the witness. Data lives at the two data sites; the witness holds only metadata and component votes so the cluster can break a tie and avoid split-brain when the inter-site link drops. The witness is a lightweight virtual appliance at a third location, and it never holds VM data.
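The tie-break role of the witness can be illustrated with a deliberately simplified model. The sketch below assumes one vote per fault domain and a strict-majority rule; real vSAN assigns votes per object component, so the function name and the voting scheme here are illustrative assumptions, not the product's algorithm.

```python
from typing import Set

# Simplified assumption: each fault domain carries exactly one vote.
# Real vSAN distributes votes per object component.
FAULT_DOMAINS = {"preferred", "secondary", "witness"}


def has_quorum(reachable: Set[str]) -> bool:
    """Return True if the mutually reachable fault domains form a strict majority."""
    unknown = reachable - FAULT_DOMAINS
    if unknown:
        raise ValueError(f"unknown fault domains: {sorted(unknown)}")
    return len(reachable) > len(FAULT_DOMAINS) / 2


# Inter-site link drops, but the witness still reaches the preferred site.
print(has_quorum({"preferred", "witness"}))  # True
# The secondary site is isolated from both other fault domains.
print(has_quorum({"secondary"}))             # False
```

In this toy model the isolated site loses quorum, which is the property that prevents both data sites from accepting writes independently.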
The improvement that matters most in VCF 9 is site awareness for disaggregated storage. With vSAN storage clusters (formerly vSAN Max), you can separate compute from storage. Before VCF 9, a vSphere cluster mounting a stretched vSAN storage datastore had no concept of fault domains, so it could not pick the local data path and could not restart VMs correctly after a site failure. VCF 9 introduces the “vSAN compute cluster,” a vSphere cluster with a thin vSAN layer that is itself site aware. You name a fault domain per site, assign hosts, then couple those fault domains to the storage cluster’s fault domains. Now the VM reads and writes over the optimal local path, and HA restarts it on the surviving site if a site goes down. Note one detail teams miss: the stretched vSAN storage cluster needs the witness, but the stretched vSphere compute cluster mounting it does not need its own separate witness.
The latency and bandwidth budget
This is the part of the design you cannot negotiate away, so put real numbers on the table early. The two data sites carry synchronous mirrored writes, which means write acknowledgement waits for both sites. The witness link, by contrast, carries only metadata and tolerates far more latency.
| Link / Component | Requirement | Why it matters |
|---|---|---|
| Inter-site data latency | ≤ 5 ms RTT | Every write is mirrored synchronously; breach adds latency to all I/O and risks resync storms |
| Inter-site bandwidth | 10 Gbps minimum, 25 Gbps recommended | Carries mirrored writes plus vMotion plus rebuild and resync traffic |
| Witness latency | ≤ 200 ms RTT | Witness stores metadata and votes only, so it can sit far away and even in the cloud |
| Witness bandwidth | ~100 Mbps (sizing dependent) | Light traffic; scales with component count, not capacity |
| MTU | 9000 (jumbo) end to end | Recommended for vSAN performance; must be consistent across the whole path |
| Witness appliance (9.1) | Medium 4 vCPU / 16 GB, Large 4 / 32, XL 8 / 64 | Pick by total component count, not raw TB |
| Site disaster tolerance | Dual-site mirroring | One full copy at each site; this is what doubles raw capacity |
| Local protection | FTT=1, RAID-5 or RAID-1 per site | Survives a host failure within a site without a cross-site rebuild |
A practical warning on the 5 ms figure: design to it as a ceiling you stay well under, not a target you skate against. Inter-site paths drift. A link that measures 3 ms at install can creep upward after a carrier re-route or a maintenance window, and vSAN does not care about your excuses when write latency climbs. Put active monitoring on the inter-site RTT and alert below the limit, not at it. The witness can sit much further out, which is why a cloud-hosted or remote-office witness is a perfectly reasonable choice when you genuinely have only two facilities.
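One way to alert below the limit rather than at it is to classify a window of RTT samples against two thresholds. This is a minimal sketch: the 5 ms ceiling comes from the table above, while the 3.5 ms warning level, the use of a 95th percentile, and the function name are assumptions to tune for your own link. How the samples are collected is left out.

```python
from typing import Sequence

RTT_LIMIT_MS = 5.0  # inter-site ceiling from the latency table
RTT_WARN_MS = 3.5   # hypothetical early-warning level, not a vendor figure


def classify_rtt(samples_ms: Sequence[float]) -> str:
    """Classify a window of RTT samples by their nearest-rank 95th percentile."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile using integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    if p95 > RTT_LIMIT_MS:
        return "breach"
    if p95 > RTT_WARN_MS:
        return "warn"
    return "ok"


print(classify_rtt([2.0] * 10))              # ok
print(classify_rtt([4.0] * 10))              # warn
print(classify_rtt([3.0] * 8 + [9.0] * 2))   # breach
```

Using a high percentile rather than the mean keeps a short burst of slow samples from hiding behind an otherwise healthy average.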
Storage policy: two decisions, not one
The single biggest capacity surprise in stretched designs comes from treating protection as one knob. It is two. Site disaster tolerance decides how data is mirrored across sites, and the standard choice is dual-site mirroring: one complete copy of the object at each data site. That alone means your usable capacity math starts from a 2x raw footprint before you have protected anything within a site.
Local failures to tolerate is the separate, second knob that protects each site copy against host or disk loss inside that site. FTT=1 with RAID-5 erasure coding inside each site is the common production choice on ESA hardware, because it keeps a host failure local rather than forcing a cross-site rebuild over your precious inter-site bandwidth. If you set only site mirroring and leave local FTT at zero, a single host failure at one site degrades that site’s copy and any further fault gets ugly fast. Enable Auto-Policy Management on the vSAN storage cluster so the default policy is generated to match the actual topology and host count, which takes the guesswork out of getting these two settings coherent.
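The two knobs multiply, which is easiest to see as arithmetic. The sketch below treats dual-site mirroring as a factor of 2 and the local scheme as (data + parity) / data. The stripe widths in the examples (4+1 and 2+1 for RAID-5) are assumptions to check against your own cluster, and the calculation ignores slack space, metadata overhead, and any compression savings.

```python
def raw_multiplier(data_fragments: int, parity_fragments: int, sites: int = 2) -> float:
    """Raw capacity consumed per unit of usable data.

    sites=2 models dual-site mirroring (one full copy per data site).
    For local RAID-1 with FTT=1, pass data_fragments=1, parity_fragments=1.
    """
    if data_fragments < 1 or parity_fragments < 0 or sites < 1:
        raise ValueError("invalid layout")
    local = (data_fragments + parity_fragments) / data_fragments
    return sites * local


def usable_tb(raw_tb_total: float, data_fragments: int, parity_fragments: int) -> float:
    """Usable TB from the raw TB summed across both data sites."""
    return raw_tb_total / raw_multiplier(data_fragments, parity_fragments)


print(raw_multiplier(1, 1))   # 4.0  (mirroring + local RAID-1)
print(raw_multiplier(4, 1))   # 2.5  (mirroring + local RAID-5, assumed 4+1)
print(raw_multiplier(2, 1))   # 3.0  (mirroring + local RAID-5, assumed 2+1)
print(usable_tb(400, 4, 1))   # 160.0
```

Under these assumptions, 400 TB of raw capacity across both sites yields roughly 160 TB usable with local RAID-5 at 4+1, and only 100 TB with local RAID-1.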
Compute, DRS and NSX across the sites
Stretching the storage is only half the design. Without DRS rules, VMs drift to whichever site has spare capacity, and on a site failure you can end up restarting far more than you planned. Create host groups per site, VM groups, and should-rules that keep each workload affined to its primary site under normal operation. Use should-rules, not must-rules, so HA can still restart a VM at the opposite site when its home site is gone. And size each site to run the combined load alone. A stretched cluster that cannot absorb the full workload on one site is a stretched cluster that fails its only real test.
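The "each site runs the combined load alone" rule is a simple check worth automating. The sketch below is a hypothetical helper: the site names, the single abstract capacity unit (for example GB of RAM or GHz of CPU), and the optional headroom fraction are all assumptions, and a real sizing exercise would run it per resource.

```python
from typing import Dict


def sites_can_absorb_full_load(
    site_capacity: Dict[str, float],
    site_demand: Dict[str, float],
    headroom: float = 0.0,
) -> Dict[str, bool]:
    """For each site, report whether it can run the combined demand of all sites.

    headroom is the fraction of capacity held back (e.g. 0.1 reserves 10%).
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    total_demand = sum(site_demand.values())
    return {
        site: capacity * (1.0 - headroom) >= total_demand
        for site, capacity in site_capacity.items()
    }


capacity = {"site-a": 1000.0, "site-b": 800.0}
demand = {"site-a": 450.0, "site-b": 400.0}
print(sites_can_absorb_full_load(capacity, demand))
# {'site-a': True, 'site-b': False}
```

In this example site-b cannot carry the full 850 units, so a loss of site-a would leave some VMs unable to restart.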
On the network side, a stretched single instance keeps one set of NSX managers, with edge clusters typically active at the preferred site and standby at the secondary. The overlay spans both sites, so an east-west segment and its distributed firewall rules follow the VM wherever HA places it, which is exactly the transparent behaviour you stretch for. Get the underlay right first: the inter-site link must carry vSAN, vMotion, and NSX overlay traffic together, so the bandwidth table above is a floor, not a goal. Many of the same fault lines from single-site work apply here, only amplified, so it is worth revisiting the VCF 9 network design mistakes before you commit the physical layout.
A design question worth settling early: do you stretch the management domain too, or only workload domains? Stretching the management domain protects SDDC Manager, vCenter, and the NSX managers against a full site loss, which is the cleaner outcome. It also costs you more hosts and tightens the same latency budget on the components you least want to lose. For many designs, stretching the workload domains while keeping a well-protected management domain at the primary site, backed by tested restore, is the more honest trade. Decide it deliberately rather than letting the installer defaults decide for you.
When NOT to stretch
Stretching is the wrong answer more often than vendors imply. Do not stretch if you cannot guarantee the inter-site budget under peak load, rather than only at 3 a.m. when the link is quiet. Do not stretch across an unreliable or shared link where a brief partition will trigger needless failovers. Do not stretch when your sites are genuinely far apart, because synchronous mirroring over high latency is not a performance problem you can tune away; it is physics. And do not stretch when what you actually want is fault isolation, where a problem at one site must never be able to affect the other.
In those cases run two independent VCF instances under multi-instance federation, and recover between them with an orchestrated DR product rather than HA. If you are weighing automatic stretched recovery against an orchestrated runbook, the trade-offs are covered directly in Stretched Cluster vs Site Recovery in VCF 9. For the broader sizing and topology context this design sits inside, see the VCF 9 reference architecture.
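The criteria in this section can be written down as an explicit checklist so the decision is recorded rather than assumed. The sketch below is a hypothetical helper: the parameter names and the returned labels are illustrative, and the only hard number in it is the 5 ms inter-site ceiling used throughout this article.

```python
from typing import List, Tuple

RTT_LIMIT_MS = 5.0  # inter-site ceiling for synchronous mirroring


def multi_site_model(
    peak_rtt_ms: float,
    link_dedicated: bool,
    link_reliable: bool,
    needs_fault_isolation: bool,
) -> Tuple[str, List[str]]:
    """Return ("stretch" or "federate", reasons that argue against stretching)."""
    reasons = []
    if peak_rtt_ms > RTT_LIMIT_MS:
        reasons.append("peak RTT exceeds the inter-site budget")
    if not link_dedicated:
        reasons.append("inter-site link is shared")
    if not link_reliable:
        reasons.append("inter-site link is unreliable")
    if needs_fault_isolation:
        reasons.append("hard fault isolation between sites is required")
    return ("federate" if reasons else "stretch", reasons)


print(multi_site_model(3.0, True, True, False))
# ('stretch', [])
print(multi_site_model(9.0, True, True, False))
# ('federate', ['peak RTT exceeds the inter-site budget'])
```

Feed it the RTT measured under peak load, not the figure from a quiet window, since that is the number the design has to survive.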
What I’d Do
For two metro sites with a clean, dedicated, monitored link comfortably inside 5 ms, stretch it. VCF 9 has closed the old gaps, disaggregated storage is finally site aware, and the operational payoff of transparent HA recovery is real. Stretch the workload domains, set dual-site mirroring with FTT=1 RAID-5 locally, size each site for the full load, and put a cloud or remote witness at the third point. But the moment the inter-site link is shared, unreliable, or pushing the latency ceiling, stop. Run two instances and federate. A stretched cluster on a marginal link is not high availability; it is a shared failure domain wearing a costume. Which way is your next site pair leaning, and what is the real measured RTT between them?
References
- VCF Blog: Stretched Topologies using vSAN Storage Clusters in VCF 9.0
- Broadcom TechDocs: Stretching vSAN Clusters in VMware Cloud Foundation
- Broadcom TechDocs: vSAN Bandwidth and Latency Requirements



