TL;DR · Key Takeaways
- Performance tuning and efficiency optimization are two different projects that fight over the same cluster. Treating them as one is how you end up with a fast platform you cannot afford, or a cheap one that misses SLAs.
- Run efficiency first: Reclaim idle VMs, snapshots and orphaned disks, then Rightsize. You cannot tune a cluster you cannot see clearly through the noise of dead capacity.
- Memory tiering is the rare lever that helps density and TCO at once, and in VCF 9.1 it no longer needs a host reboot to enable.
- Do not point aggressive rightsizing at latency-sensitive VMs. That is where the two goals collide hardest.
- vSAN ESA changes the old storage trade-off: RAID-5/6 erasure coding now gives RAID-1 class performance, so capacity efficiency stops costing you IOPS.
Optimization gets sold as one tidy project. It is two, and they want opposite things. Performance tuning spends resources to make workloads faster. Efficiency optimization takes resources away to make the platform cheaper. Run them as a single pass and you spend a week tightening reservations on the VMs your AI forecaster is about to flag as oversized, then undo it. The skill in VCF 9 is not knowing the levers. It is knowing which goal each lever serves and the order to pull them in.
Two goals, one cluster, opposite pull
Performance work answers “is this workload fast enough,” and its instinct is to add headroom: bigger NUMA-aligned VMs, reservations, RAID-1, dedicated networks. Efficiency work answers “are we paying for capacity nobody uses,” and its instinct is to claw headroom back: rightsize the oversized, reclaim the idle, push density up with memory tiering. Both are correct. They just cannot both win on the same VM at the same time.
The trap I see most often on customer fleets is teams optimizing for cost on infrastructure they have not measured, or chasing latency on a cluster that is half full of dead capacity. So before any tuning, get the picture clean. If your VCF Operations monitoring is not yet trustworthy, fix that first. Everything below assumes you can actually see contention and waste, not guess at them.
The performance levers worth your time
vSphere in VCF 9 shipped real scheduler changes, not cosmetic ones. The NUMA scheduler now behaves more like DRS, using a fairness model with resource pools and min/max shares plus a multi-resource goodness model that factors in page-migration costs. Topology-aware scheduling applies chip-aware logic to NUMA placement, which matters on high-core-count CPUs where a badly placed wide VM pays a remote-memory tax on every access. For the busiest, widest VMs this is the single highest-yield thing to get right, and it costs nothing but attention to sizing.
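To make the "attention to sizing" concrete, here is a minimal sketch of a NUMA fit check. It is plain arithmetic, not a vSphere API: the `HostTopology` record, the function names, and the example host (2 nodes, 32 cores and 512 GB each) are all assumptions for illustration, and the real scheduler weighs far more than this.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HostTopology:
    """Hypothetical host description; not a vSphere API object."""
    numa_nodes: int
    cores_per_node: int
    memory_gb_per_node: int


def numa_nodes_needed(vcpus: int, memory_gb: int, host: HostTopology) -> int:
    """Return the fewest NUMA nodes that can hold the VM's vCPUs and memory."""
    by_cpu = math.ceil(vcpus / host.cores_per_node)
    by_mem = math.ceil(memory_gb / host.memory_gb_per_node)
    return max(by_cpu, by_mem, 1)


def sizing_note(vcpus: int, memory_gb: int, host: HostTopology) -> str:
    """Summarize how a proposed VM size maps onto the host's NUMA nodes."""
    nodes = numa_nodes_needed(vcpus, memory_gb, host)
    if nodes > host.numa_nodes:
        return "does not fit on this host"
    if nodes == 1:
        return "fits in one NUMA node"
    if vcpus % nodes != 0:
        # An uneven split leaves one node carrying more vCPUs than another.
        return f"spans {nodes} nodes unevenly; consider a vCPU count divisible by {nodes}"
    return f"spans {nodes} nodes evenly"


# Assumed example host: 2 NUMA nodes, 32 cores and 512 GB per node.
host = HostTopology(numa_nodes=2, cores_per_node=32, memory_gb_per_node=512)
print(sizing_note(16, 128, host))
print(sizing_note(33, 128, host))
```

The useful signal is the middle case: a VM that is one vCPU over a node boundary pays the remote-memory cost without gaining much, so it is usually worth sizing it down into one node or up to an even split.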
DRS parallel vMotion removes the old sequential bottleneck in cluster balancing by letting multiple migrations run at once, so a rebalance after maintenance settles in minutes rather than dragging. On storage, vSAN ESA quietly killed the classic performance-versus-capacity argument: RAID-5/6 erasure coding now delivers performance equal to or better than RAID-1 mirroring, so the old reflex of mirroring your hot tier for IOPS is usually wasted capacity. The log-structured file system ingests writes, coalesces them and acknowledges from a durable log to keep latency low. For stretched topologies, keep inter-site vSAN traffic on a dedicated segment with jumbo frames at MTU 9000, which is one of the few raw-network knobs that still moves the needle.
The efficiency levers worth your time
VCF Operations is where most of the cost actually hides. Rightsizing uses AI-driven demand forecasting to flag VMs that are oversized or undersized, and in 9.1 it can return several recommendations for a single VM in one pass, whatever its state: idle, powered off, or oversized. Reclaim cleans out idle VMs, old snapshots, orphaned disks and powered-off VMs, and the Reclamation Dashboard gives you a running total of recovered capacity so the savings are defensible to whoever signs the budget. This pairs naturally with the discipline in VCF 9 capacity and cost management.
On the platform side, memory tiering is the standout. The NVMe second tier intelligently offloads roughly 20 to 25 percent of memory accesses to flash, raising VM density and lowering server TCO without a noticeable hit to responsiveness for most workloads. In 9.1 you can enable it without the host reboot earlier versions required, and the UI now surfaces eligible clusters plus device wear so you can plan replacement instead of getting surprised. Storage efficiency improved too: vSAN ESA compression is cluster-based and always-on, using the ZSTD algorithm in 9.1 for better ratios than LZ4; global deduplication arrived; and Auto-RAID picks the resilience scheme for you: FTT=2 with RAID-6 on six or more hosts, and FTT=1 with RAID-5 on three to five.
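The Auto-RAID rule above is simple enough to write down, and doing so makes the capacity argument easy to check. This is a sketch of the rule as stated in this article, not of the product's internal logic. The function names are hypothetical, and the 4+2 stripe width used in the example is an assumption for illustration, so confirm the actual layout for your cluster before planning against it.

```python
def auto_raid_policy(host_count: int) -> tuple[int, str]:
    """Map host count to (FTT, RAID level) as described in the text above."""
    if host_count >= 6:
        return 2, "RAID-6"
    if 3 <= host_count <= 5:
        return 1, "RAID-5"
    raise ValueError("the text above does not cover clusters under three hosts")


def raw_per_usable(data_blocks: int, parity_blocks: int) -> float:
    """Raw capacity consumed per unit of usable data for an erasure-coded stripe."""
    return (data_blocks + parity_blocks) / data_blocks


def mirror_raw_per_usable(ftt: int) -> float:
    """RAID-1 keeps ftt + 1 full copies of the data."""
    return float(ftt + 1)


ftt, raid = auto_raid_policy(8)
# Assumed stripe layout for the comparison: 4 data + 2 parity blocks.
erasure = raw_per_usable(data_blocks=4, parity_blocks=2)
mirror = mirror_raw_per_usable(ftt)
print(f"{raid} FTT={ftt}: {erasure:.2f}x raw vs {mirror:.2f}x raw for RAID-1")
```

Under those assumptions, tolerating two failures costs 1.5x raw capacity with erasure coding against 3x with mirroring, which is why mirroring the hot tier out of habit gets expensive quickly.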
Where the two goals collide
Most levers lean one way. A few serve both, and a few will actively hurt the other goal if you push them blind. This is the matrix I keep in front of me when planning an optimization cycle.
| Lever | Performance impact | Efficiency / cost impact | When to favour it |
|---|---|---|---|
| NUMA / topology-aware sizing | High gain on wide VMs | Neutral | Always, for tier-1 workloads |
| Memory tiering (NVMe) | Slight, acceptable for most | Strong: higher density, lower TCO | General-purpose clusters |
| vSAN ESA RAID-5/6 + compression | RAID-1 class, effectively neutral | Strong capacity savings | Default for almost everything |
| Rightsizing (downsize oversized) | Risk on bursty / latency-sensitive | Strong cost reduction | Steady-state, non-critical VMs |
| Reclaim idle / snapshots / orphans | Neutral to positive (less contention) | Pure savings, low risk | First, every cycle |
| Reservations on hot VMs | Protects performance | Hurts density / consolidation | Sparingly, only proven hot VMs |
The two rows that cause real arguments are rightsizing and reservations. Aggressive rightsizing on a bursty database looks like free savings in the forecast and shows up as a latency incident three weeks later when month-end hits. Reservations are the mirror image: they feel responsible but they fragment your capacity and quietly cap how dense the cluster can run. Both are right in their lane and wrong everywhere else.
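One way to keep the rightsizing row in its lane is a guard that checks each downsize recommendation against observed peak demand before anything is applied. The sketch below is an assumption-heavy illustration: the `Recommendation` record, its field names, and the 20 percent headroom default are hypothetical and do not reflect a VCF Operations schema or export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record; field names are illustrative only."""
    vm_name: str
    current_vcpus: int
    recommended_vcpus: int
    peak_vcpus_used: float  # highest demand over a window that includes month-end
    latency_sensitive: bool


def review_downsize(rec: Recommendation, headroom: float = 0.2) -> str:
    """Return 'apply', 'manual-review', or 'skip' for one recommendation."""
    if rec.recommended_vcpus >= rec.current_vcpus:
        return "skip"  # not a downsize, so this guard has nothing to say
    if rec.latency_sensitive:
        return "manual-review"
    # The new size must cover the observed peak plus the chosen headroom.
    if rec.recommended_vcpus < rec.peak_vcpus_used * (1 + headroom):
        return "manual-review"
    return "apply"


recs = [
    Recommendation("batch-01", 8, 4, 2.5, False),
    Recommendation("db-01", 16, 8, 12.0, False),
    Recommendation("trading-01", 8, 4, 1.0, True),
]
for rec in recs:
    print(rec.vm_name, review_downsize(rec))
```

The important choice is the measurement window for `peak_vcpus_used`: if it does not include the last month-end or equivalent burst, the guard will approve exactly the downsizes that cause the incident described above.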
The order I actually run them
- Reclaim first. Idle VMs, stale snapshots, orphaned disks and powered-off VMs. It is the lowest-risk capacity you will ever recover and it cleans the signal for everything after.
- Rightsize the safe majority. Accept downsizing on steady-state, non-critical VMs. Exclude latency-sensitive and bursty workloads from automatic action and review those by hand.
- Set storage and density defaults. RAID-5/6 with compression as the standard policy, Auto-RAID for resilience, memory tiering on general-purpose clusters.
- Then tune performance where it counts. NUMA-aligned sizing and selective reservations on the genuinely hot tier-1 VMs, validated against your VCF 9 reference architecture sizing.
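The ordering above can be sketched as a small triage pass over an inventory. Everything here is hypothetical: the `VM` record and its flags stand in for whatever tagging or export you already have, and step 3 is absent because storage and density defaults are set per cluster rather than per VM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical inventory record; the flags would come from your own tagging."""
    name: str
    idle: bool = False
    oversized: bool = False
    sensitive: bool = False  # latency-sensitive or bursty
    tier1_hot: bool = False


def triage(vm: VM) -> str:
    """Assign a VM to the earliest step in the cycle that should handle it."""
    if vm.idle:
        return "1-reclaim"
    if vm.oversized:
        return "2-review-by-hand" if vm.sensitive else "2-rightsize"
    if vm.tier1_hot:
        return "4-tune"
    return "no-action"


def plan(vms: list[VM]) -> list[tuple[str, str]]:
    """Return (step, vm name) pairs sorted so earlier steps come first."""
    return sorted((triage(vm), vm.name) for vm in vms)


inventory = [
    VM("db", oversized=True, sensitive=True),
    VM("old", idle=True),
    VM("web", oversized=True),
    VM("erp", tier1_hot=True),
]
for step, name in plan(inventory):
    print(step, name)
```

Checking `idle` before anything else mirrors the point of the list: a VM that should be reclaimed never reaches rightsizing or tuning, so no effort is spent on capacity that is about to disappear.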
The Bottom Line
If you only have budget for one direction this quarter, do efficiency first. Reclaim and rightsizing pay back immediately and at low risk, and a clean cluster is a prerequisite for honest performance work anyway. Save real performance tuning for the small set of workloads where users feel the difference, and protect those with NUMA-aware sizing and targeted reservations rather than blanket headroom. The one lever I would turn on across the board is memory tiering: in VCF 9.1 it buys density and TCO without a reboot and without hurting most workloads, which is as close as this stack gets to a free win. Verdict: efficiency before performance, always, with memory tiering as the bridge between them. What does your current cycle lead with, reclaim or tuning?
References
- Broadcom TechDocs: How to Optimize Capacity and Improve Performance in VCF Operations
- VCF Blog: Configuring Memory Tiering in VCF 9.1
- VCF Blog: Auto-RAID in vSAN for VCF 9.1
- VMware: Performance Recommendations for vSAN ESA



