Dr. Pranay Jha

VCF 9 Storage Design: 6 vSAN ESA Pitfalls That Wreck Capacity and Performance (VCF 9 Series, Part 6)

vSAN ESA is the default storage layer in VMware Cloud Foundation 9, and most storage problems trace back to design decisions made before the first host is racked. Here are six pitfalls that quietly wreck capacity and performance, with the fix for each.

VCF 9 Series · Part 6 of 36

TL;DR · Key Takeaways

  • vSAN ESA is the default storage layer in VCF 9, and it has strict hardware rules: certified NVMe TLC devices, with no RAID controllers, tri-mode controllers, or HBAs in the path.
  • Default to RAID-5/6 erasure coding, not RAID-1 mirroring. On ESA, erasure coding matches or beats mirror performance while using far less capacity.
  • Size on usable capacity, not raw: budget for FTT overhead plus operations and host-rebuild slack, and do not pre-spend deduplication or compression ratios.
  • Keep hosts uniform, give vSAN a 25GbE or faster network, and scale out by adding hosts rather than stuffing disks into a few.
  • Let VCF 9.1 Automated Storage Policy Management pick the optimal fault tolerance and erasure coding for your cluster size.

Who this is for: VCF architects and storage admins designing the vSAN layer for a new VCF 9 deployment.

Prerequisites: a working grasp of vSAN concepts (storage policies, FTT) and access to the vSAN ReadyNode Sizer and the VMware Compatibility Guide.

Storage is where VMware Cloud Foundation deployments quietly go wrong. The cluster comes up healthy, workloads run, and then six months later capacity is tighter than the spreadsheet promised, a rebuild stalls during a host failure, or latency spikes under load. Almost every one of those problems traces back to a design decision made before the first host was racked. In VCF 9, vSAN Express Storage Architecture (ESA) is the default and recommended storage layer, and it changes several rules that experienced OSA admins still apply out of habit. This post walks through six storage design pitfalls we see most often, and the fix for each.

1. Treating ESA like OSA on the hardware side

ESA is not OSA with faster disks. There are no disk groups and no separate cache tier: every claimed device contributes to both capacity and performance in a single storage pool per host. That design only works on the hardware it was built for. ESA requires certified NVMe TLC flash devices, and it is not supported behind RAID controllers, tri-mode controllers, or HBAs. Plugging ESA-class NVMe into a server with a RAID controller in the path is one of the most common ways a build fails certification.

  • Build from a vSAN ESA ReadyNode, or validate every component against the VMware Compatibility Guide before you buy.
  • Confirm the devices are on the ESA-specific HCL, not just the general vSAN list.
  • Verify the controller path is direct-attached NVMe with no RAID or tri-mode layer in between.

Verify what each host actually presents before you enable the cluster:

# List the devices claimed into this host's vSAN ESA storage pool
esxcli vsan storagepool list

# Confirm the architecture the host is running
esxcli vsan cluster get

# Check the storage controller in the path
esxcli storage core adapter list

2. Defaulting to RAID-1 mirroring

On OSA, RAID-1 mirroring was the go-to for performance and RAID-5/6 was the capacity-saving compromise you reached for reluctantly. ESA flips that trade-off. Because of its log-structured design, ESA delivers erasure-coding performance that is equal to or better than mirroring, so there is no longer a performance reason to default to RAID-1. The capacity difference is large: at FTT=1, a mirror consumes roughly 2x the size of the data in raw capacity, while ESA RAID-5 consumes about 1.5x in its 2+1 layout on smaller clusters and about 1.25x in its 4+1 layout on clusters of six or more hosts. Choosing mirroring out of habit can silently cost you a quarter to more than a third of the raw capacity that data consumes.

  • Use RAID-5/6 erasure coding as the default for general workloads on ESA.
  • Reserve RAID-1 for the rare cases where a specific workload genuinely needs it.
  • On VCF 9.1, enable Automated Storage Policy Management so vSAN applies the highest fault tolerance and optimal erasure coding for the cluster size automatically.
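
To make the capacity gap concrete, here is a minimal Python sketch that compares raw consumption per policy. The stripe layouts in SCHEMES (2+1 and 4+1 for RAID-5, 4+2 for RAID-6) are assumptions based on how ESA erasure coding is commonly described, and the function names are hypothetical; treat the numbers as planning arithmetic, not as output from vSAN.

```python
from fractions import Fraction

# Assumed layouts as (data fragments, redundancy fragments). Illustrative only.
SCHEMES = {
    "RAID-1 FTT=1": (1, 1),  # one full extra copy of the data
    "RAID-5 2+1": (2, 1),    # assumed layout for smaller clusters
    "RAID-5 4+1": (4, 1),    # assumed layout for clusters of six or more hosts
    "RAID-6 4+2": (4, 2),    # FTT=2 erasure coding
}


def raw_multiplier(data: int, redundancy: int) -> Fraction:
    """Raw capacity consumed per unit of stored data."""
    return Fraction(data + redundancy, data)


def raw_needed_tib(data_tib: float, scheme: str) -> float:
    """Raw TiB consumed to store data_tib under the named scheme."""
    data, redundancy = SCHEMES[scheme]
    return float(data_tib * raw_multiplier(data, redundancy))


for name in SCHEMES:
    print(f"{name}: {raw_needed_tib(100, name):.1f} TiB raw for 100 TiB of data")
```

Under these assumptions, 100 TiB of data needs 200 TiB raw when mirrored and 125 TiB raw with a 4+1 stripe, which is the gap the policy choice controls.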

3. Sizing on raw capacity and pre-spending data reduction

Raw capacity is not usable capacity. After you account for the FTT overhead from your storage policy, you still need to reserve slack for vSAN itself: an operations reserve for internal tasks and a host-rebuild reserve so the cluster can re-protect data when a host fails. A cluster sized to the edge of raw capacity cannot complete a rebuild, which turns a single host failure into a capacity emergency. The second half of this trap is pre-spending data reduction: global deduplication can reach up to 8x and the new VCF 9.1 compression improves ratios further, but those numbers are workload-dependent and must never be baked into your primary capacity plan.

  • Drive sizing with the vSAN ReadyNode Sizer, ideally fed with real usage data rather than guesses.
  • Plan to usable capacity after FTT, operations reserve, and host-rebuild reserve, and leave headroom on top.
  • Treat deduplication and compression as a bonus that lowers cost, not as capacity you can commit to in advance.
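
The raw-to-usable arithmetic can be sketched in a few lines of Python. This is a simplified model, not the vSAN ReadyNode Sizer: the function name is hypothetical, the 10 percent operations reserve is a placeholder value, and the host-rebuild reserve is modeled as one full host of raw capacity. Data reduction is deliberately left out, in line with the advice above.

```python
def usable_capacity_tib(
    hosts: int,
    raw_per_host_tib: float,
    ftt_multiplier: float,
    ops_reserve: float = 0.10,  # placeholder fraction, not a vSAN default
    rebuild_hosts: int = 1,     # hosts' worth of capacity held back for rebuilds
) -> float:
    """Rough usable capacity after rebuild reserve, operations reserve, and policy overhead."""
    if hosts <= rebuild_hosts:
        raise ValueError("cluster needs more hosts than the rebuild reserve covers")
    if ftt_multiplier < 1 or not 0 <= ops_reserve < 1:
        raise ValueError("ftt_multiplier must be >= 1 and ops_reserve in [0, 1)")
    raw_total = hosts * raw_per_host_tib
    after_rebuild = raw_total - rebuild_hosts * raw_per_host_tib
    after_ops = after_rebuild * (1 - ops_reserve)
    return after_ops / ftt_multiplier


# Example: 8 hosts with 50 TiB raw each, policy overhead of 1.25x
raw = 8 * 50
usable = usable_capacity_tib(hosts=8, raw_per_host_tib=50, ftt_multiplier=1.25)
print(f"raw: {raw} TiB, usable: {usable:.0f} TiB ({usable / raw:.0%} of raw)")
```

In this example roughly 63 percent of raw capacity is left as plannable space before any headroom is added, which is why a spreadsheet built on raw numbers runs out early.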

For how usable capacity ties back to what you are entitled to consume, see the vSAN TiB entitlement discussion in VCF 9 Licensing Explained, and review your inputs against the readiness checklist in VCF 9 Planning and Prerequisites.

4. Mixing host configurations in one cluster

vSAN works best when every host in a cluster has a similar or identical configuration, especially storage. Mixed device counts or capacities create an unbalanced datastore where some hosts fill up faster than others, components cluster on the larger nodes, and rebuild and rebalance behavior becomes unpredictable. The convenience of adding whatever hardware is on hand costs you in lopsided utilization and harder troubleshooting later.

  • Standardize on a single host specification per cluster: same device class, count, and capacity.
  • When you must introduce a newer node spec, build a new uniform cluster rather than diluting an existing one.
  • Keep CPU and memory aligned too, so storage policy outcomes and DRS behavior stay predictable.
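
A uniformity check is easy to automate against whatever inventory you already keep. The sketch below is a hypothetical helper, not a vSAN or VCF API: HostSpec and its fields are illustrative names, and the baseline is simply the most common spec in the list.

```python
from collections import Counter
from typing import NamedTuple


class HostSpec(NamedTuple):
    name: str
    device_count: int
    device_tib: float
    cpu_cores: int
    memory_gib: int


def non_uniform_hosts(hosts: list[HostSpec]) -> list[str]:
    """Return names of hosts whose spec differs from the most common spec."""
    if not hosts:
        return []
    # Compare everything except the name (field 0).
    specs = Counter(h[1:] for h in hosts)
    baseline, _count = specs.most_common(1)[0]
    return [h.name for h in hosts if h[1:] != baseline]


cluster = [
    HostSpec("esx01", 8, 7.68, 64, 1024),
    HostSpec("esx02", 8, 7.68, 64, 1024),
    HostSpec("esx03", 8, 7.68, 64, 1024),
    HostSpec("esx04", 6, 15.36, 64, 1024),  # different device count and size
]
print(non_uniform_hosts(cluster))
```

Running a check like this against the bill of materials before ordering is cheaper than discovering the outlier after the datastore is unbalanced.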

5. Under-provisioning the vSAN network

vSAN is a distributed storage system, so the network is part of the storage subsystem, not an afterthought. ESA moves more data across the fabric and expects a fast, low-latency network: plan for 25GbE or faster, with redundant uplinks and jumbo frames configured consistently end to end. A 10GbE network carried over from an older cluster, or an MTU mismatch somewhere in the path, shows up as latency and rebuild slowness that looks like a storage problem but is really a network one.

  • Provision 25GbE or faster for ESA, with redundancy at the NIC and switch layers.
  • Set MTU 9000 consistently across vmknics, switches, and uplinks, then verify that full-size frames actually pass end to end.
  • Separate or prioritize vSAN traffic with Network I/O Control so it is not starved by vMotion or workload traffic.
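
One way to validate the MTU is a don't-fragment ping at the largest payload the MTU allows. The Python sketch below only computes that payload and builds the vmkping command line to run on a host. The interface name vmk2 and the peer address are placeholders, and the helper names are hypothetical.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(vmk: str, peer_ip: str, mtu: int = 9000) -> str:
    """Build a vmkping call with the don't-fragment bit set (-d) and a full-size payload (-s)."""
    return f"vmkping -I {vmk} -d -s {max_ping_payload(mtu)} {peer_ip}"


# Placeholder vSAN vmknic and peer address; substitute your own.
print(vmkping_command("vmk2", "192.168.50.12"))
```

If the 8972-byte ping fails while a default-size ping succeeds, some hop in the path is not passing jumbo frames, even if every interface you checked reports MTU 9000.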

Network design and storage design are tightly coupled, so pair this with the fabric guidance in VCF 9 Network Design: 7 Mistakes That Break Your Deployment.

6. Scaling up disks instead of scaling out hosts

When capacity runs low, the tempting move is to add disks to the existing hosts. vSAN prefers the opposite: scaling out by adding hosts is the recommended approach over adding or replacing devices in existing nodes. More hosts mean more failure domains, more aggregate performance, and a larger pool to absorb a rebuild. Concentrating capacity on a few dense hosts increases the blast radius of a single host failure and can leave too little headroom to re-protect data. If you genuinely need to separate storage growth from compute, VCF 9 supports disaggregation through vSAN storage clusters rather than over-stuffing HCI nodes.

  • Grow capacity by adding hosts first; add devices to existing hosts only when it keeps the cluster uniform.
  • Use a vSAN storage cluster when storage and compute need to scale independently.
  • Keep enough hosts that losing one still leaves room to rebuild within policy.
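
The blast-radius argument can be checked with simple arithmetic. The Python sketch below compares two clusters with the same total raw capacity and asks whether the data still fits on the surviving hosts after one host is lost. The function names and the 80 percent fill ceiling are illustrative assumptions, not vSAN thresholds.

```python
def host_loss_fraction(hosts: int) -> float:
    """Share of cluster raw capacity that disappears when one host fails."""
    return 1 / hosts


def can_rebuild_after_host_loss(
    hosts: int,
    raw_per_host_tib: float,
    used_raw_tib: float,
    max_fill: float = 0.80,  # illustrative ceiling for how full survivors may get
) -> bool:
    """True if used_raw_tib fits on the surviving hosts without exceeding max_fill."""
    surviving_raw = (hosts - 1) * raw_per_host_tib
    return used_raw_tib <= surviving_raw * max_fill


# Same 400 TiB raw and 250 TiB consumed, built two different ways.
for hosts, per_host in [(4, 100), (8, 50)]:
    ok = can_rebuild_after_host_loss(hosts, per_host, used_raw_tib=250)
    print(f"{hosts} hosts x {per_host} TiB: loses {host_loss_fraction(hosts):.1%} "
          f"per host failure, rebuild fits: {ok}")
```

With these inputs the four dense hosts cannot re-protect after a failure while the eight smaller hosts can, even though both designs have identical raw capacity on paper.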

Disclaimer: Before changing a production vSAN cluster, validate hardware against the ESA Compatibility Guide and your target BOM, confirm interoperability across the VCF BOM, back up your workloads, run vSAN health and Skyline pre-checks, and test policy or scaling changes in a non-production cluster first.

Final Thoughts

None of these pitfalls are exotic. They are the result of carrying OSA-era habits into an ESA-first platform, sizing optimistically, and treating the network as separate from storage. Get the hardware certified, default to erasure coding, size on usable capacity with real slack, keep hosts uniform, give vSAN the network it needs, and scale out rather than up. Do that, and the storage layer becomes the part of your VCF 9 deployment you stop worrying about, which is exactly where it should be.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
