VCF 9 Performance Tuning vs Cost Optimization: Where to Spend Your Effort (VCF 9 Series, Part 35)

Performance tuning and cost optimization in VCF 9 pull in opposite directions. Here is which levers help which goal, where they collide, and the order I run them in on real clusters.

by

Dr. Pranay Jha

June 14, 2026


VCF 9 Series · Part 35 of 36

TL;DR · Key Takeaways

Performance tuning and efficiency optimization are two different projects that fight over the same cluster. Treating them as one is how you end up with a fast platform you cannot afford, or a cheap one that misses SLAs.
Run efficiency first: Reclaim idle VMs, snapshots and orphaned disks, then Rightsize. You cannot tune a cluster you cannot see clearly through the noise of dead capacity.
Memory tiering is the rare lever that helps density and TCO at once, and in VCF 9.1 it no longer needs a host reboot to enable.
Do not point aggressive rightsizing at latency-sensitive VMs. That is where the two goals collide hardest.
vSAN ESA changes the old storage trade-off: RAID-5/6 erasure coding now gives RAID-1 class performance, so capacity efficiency stops costing you IOPS.

Who this is for: VCF architects and operations leads running a live 9.0 or 9.1 fleet who are being asked to cut cost and protect performance at the same time. Prerequisites: a bedded-in VCF Operations deployment with at least a few weeks of metrics, and change control for any policy or sizing change.

Optimization gets sold as one tidy project. It is two, and they want opposite things. Performance tuning spends resources to make workloads faster. Efficiency optimization takes resources away to make the platform cheaper. Run them as a single pass and you spend a week tightening reservations on the VMs your AI forecaster is about to flag as oversized, then undo it. The skill in VCF 9 is not knowing the levers. It is knowing which goal each lever serves and the order to pull them in.

Two goals, one cluster, opposite pull

Performance work answers “is this workload fast enough,” and its instinct is to add headroom: bigger NUMA-aligned VMs, reservations, RAID-1, dedicated networks. Efficiency work answers “are we paying for capacity nobody uses,” and its instinct is to claw headroom back: rightsize the oversized, reclaim the idle, push density up with memory tiering. Both are correct. They just cannot both win on the same VM at the same time.

The trap I see most often on customer fleets is teams optimizing for cost on infrastructure they have not measured, or chasing latency on a cluster that is half full of dead capacity. So before any tuning, get the picture clean. If your VCF Operations monitoring is not yet trustworthy, fix that first. Everything below assumes you can actually see contention and waste, not guess at them.

Measure first; do not chase latency on a cluster half full of dead capacity.

The performance levers worth your time

vSphere in VCF 9 shipped real scheduler changes, not cosmetic ones. The NUMA scheduler now behaves more like DRS, using a fairness model with resource pools and min/max shares plus a multi-resource goodness model that factors in page-migration costs. Topology-aware scheduling applies chip-aware logic to NUMA placement, which matters on high-core-count CPUs where a badly placed wide VM pays a remote-memory tax on every access. For the busiest, widest VMs this is the single highest-yield thing to get right, and it costs nothing but attention to sizing.
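To make the remote-memory point concrete, here is a minimal sizing sketch. It assumes a deliberately simple model: a VM stays on one NUMA node only if both its vCPUs and its memory fit inside that node. The function name, VM names and host figures are hypothetical, and the real scheduler weighs far more than this.

```python
import math

def numa_nodes_spanned(vcpus, mem_gib, cores_per_node, mem_gib_per_node):
    """Estimate how many NUMA nodes a VM must span (simplified model)."""
    by_cpu = math.ceil(vcpus / cores_per_node)
    by_mem = math.ceil(mem_gib / mem_gib_per_node)
    # The VM spans as many nodes as its tighter resource demands.
    return max(by_cpu, by_mem)

# Hypothetical host: 32 cores and 512 GiB per NUMA node.
vms = [("db01", 16, 128), ("db02", 48, 128), ("cache01", 8, 600)]
for name, vcpus, mem_gib in vms:
    nodes = numa_nodes_spanned(vcpus, mem_gib, cores_per_node=32, mem_gib_per_node=512)
    print(name, "fits one node" if nodes == 1 else f"spans {nodes} nodes")
```

Note that the third VM spans nodes because of memory, not vCPU count, which is the case that sizing reviews tend to miss.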

DRS parallel vMotion removes the old sequential bottleneck in cluster balancing by letting multiple migrations run at once, so a rebalance after maintenance settles in minutes rather than dragging. On storage, vSAN ESA quietly killed the classic performance-versus-capacity argument: RAID-5/6 erasure coding now delivers performance equal to or better than RAID-1 mirroring, so the old reflex of mirroring your hot tier for IOPS is usually wasted capacity. The log-structured file system ingests writes, coalesces them and acknowledges from a durable log to keep latency low. For stretched topologies, keep inter-site vSAN traffic on a dedicated segment with jumbo frames at MTU 9000, which is one of the few raw-network knobs that still moves the needle.
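The capacity side of that argument is plain arithmetic. The sketch below compares raw capacity needed for the same usable data under mirroring and erasure coding. The 4+1 and 4+2 stripe layouts are assumptions chosen for illustration, so check the layout your cluster size actually uses, and note that the numbers ignore compression, metadata and slack space.

```python
def raw_for_mirror(usable_tib, ftt):
    """RAID-1 keeps ftt + 1 full copies of the data."""
    return usable_tib * (ftt + 1)

def raw_for_erasure(usable_tib, data, parity):
    """Erasure coding adds `parity` fragments for every `data` fragments."""
    return usable_tib * (data + parity) / data

usable = 100  # TiB of usable data, an illustrative figure
print("RAID-1, FTT=1:", raw_for_mirror(usable, 1), "TiB raw")
print("RAID-5 (4+1): ", raw_for_erasure(usable, 4, 1), "TiB raw")
print("RAID-6 (4+2): ", raw_for_erasure(usable, 4, 2), "TiB raw")
```

Under these assumptions, RAID-6 tolerates two failures with less raw capacity than a mirror needs to tolerate one.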

The efficiency levers worth your time

VCF Operations is where you find most of the hidden cost. Rightsizing uses AI-driven demand forecasting to flag VMs that are oversized or undersized, and in 9.1 it can return several recommendations for a single VM in one pass, whatever its state: idle, powered off, or oversized. Reclaim cleans out idle VMs, old snapshots, orphaned disks and powered-off VMs, and the Reclamation Dashboard keeps a running total of recovered capacity so the savings are defensible to whoever signs the budget. This pairs naturally with the discipline in VCF 9 capacity and cost management.
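If you want the same running total outside the dashboard, for a budget slide say, a per-category roll-up is enough. This is a sketch over hypothetical rows; the category labels, object names and sizes are invented for illustration and do not reflect any VCF Operations export format.

```python
from collections import defaultdict

# Hypothetical rows: (category, object name, reclaimable GiB).
candidates = [
    ("idle_vm", "test-web-03", 120.0),
    ("snapshot", "erp-db-01@pre-patch", 340.5),
    ("orphaned_disk", "old-build.vmdk", 80.0),
    ("powered_off_vm", "poc-legacy-02", 200.0),
    ("snapshot", "crm-app-02@upgrade", 59.5),
]

def reclaim_totals(rows):
    """Sum reclaimable capacity per category."""
    totals = defaultdict(float)
    for category, _name, gib in rows:
        totals[category] += gib
    return dict(totals)

totals = reclaim_totals(candidates)
# Largest category first, so the biggest win leads the report.
for category, gib in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:15s} {gib:8.1f} GiB")
print(f"{'total':15s} {sum(totals.values()):8.1f} GiB")
```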

On the platform side, memory tiering is the standout. The NVMe second tier offloads roughly 20 to 25 percent of memory accesses to flash, raising VM density and lowering server TCO without a noticeable hit to responsiveness for most workloads. In 9.1 you can enable it without the host reboot that earlier versions required, and the UI now surfaces eligible clusters plus device wear so you can plan replacement instead of getting surprised. Storage efficiency improved too. In 9.1, vSAN ESA compression is cluster-based and always on, using ZSTD for better ratios than LZ4. Global deduplication arrived. And Auto-RAID picks the resilience scheme for you: FTT=2 with RAID-6 on six or more hosts, FTT=1 with RAID-5 on three to five.
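The Auto-RAID rule as described here reduces to a small lookup on host count. The function below is only a restatement of that rule for planning spreadsheets or pre-checks. Its name is hypothetical, and it returns nothing for clusters under three hosts because the rule above does not cover them.

```python
def auto_raid_policy(host_count):
    """Return (ftt, raid_level) per the rule described above, else None."""
    if host_count >= 6:
        return (2, "RAID-6")
    if 3 <= host_count <= 5:
        return (1, "RAID-5")
    return None  # Below three hosts: not covered by the rule as stated.

for hosts in (2, 4, 6, 12):
    print(hosts, "hosts ->", auto_raid_policy(hosts))
```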

Where the two goals collide

Most levers lean one way. A few serve both, and a few will actively hurt the other goal if you push them blind. This is the matrix I keep in front of me when planning an optimization cycle.

Lever	Performance impact	Efficiency / cost impact	When to favour it
NUMA / topology-aware sizing	High gain on wide VMs	Neutral	Always, for tier-1 workloads
Memory tiering (NVMe)	Slight, acceptable for most	Strong: higher density, lower TCO	General-purpose clusters
vSAN ESA RAID-5/6 + compression	RAID-1 class, effectively neutral	Strong capacity savings	Default for almost everything
Rightsizing (downsize oversized)	Risk on bursty / latency-sensitive	Strong cost reduction	Steady-state, non-critical VMs
Reclaim idle / snapshots / orphans	Neutral to positive (less contention)	Pure savings, low risk	First, every cycle
Reservations on hot VMs	Protects performance	Hurts density / consolidation	Sparingly, only proven hot VMs

The two rows that cause real arguments are rightsizing and reservations. Aggressive rightsizing on a bursty database looks like free savings in the forecast and shows up as a latency incident three weeks later when month-end hits. Reservations are the mirror image: they feel responsible but they fragment your capacity and quietly cap how dense the cluster can run. Both are right in their lane and wrong everywhere else.
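One way to keep a bursty database out of the automatic lane is to screen on peak-to-mean CPU demand before accepting a recommendation. The sketch below is an assumption-laden illustration, not how the VCF Operations forecaster works: the threshold of 3.0, the lane names and the sample data are all hypothetical.

```python
from statistics import mean

def peak_to_mean(samples):
    """Ratio of the highest sample to the average; 1.0 means perfectly flat."""
    avg = mean(samples)
    if avg == 0:
        return 1.0  # Flat zero usage: nothing bursty to protect.
    return max(samples) / avg

def rightsizing_lane(cpu_samples, latency_sensitive, burst_threshold=3.0):
    """Route a VM to automatic downsizing or to manual review."""
    if latency_sensitive or peak_to_mean(cpu_samples) >= burst_threshold:
        return "manual-review"
    return "auto-eligible"

steady = [20, 22, 19, 21, 20, 18]   # CPU % over the sampling window
month_end = [5, 6, 5, 4, 5, 95]     # quiet, then one heavy period

print(rightsizing_lane(steady, latency_sensitive=False))
print(rightsizing_lane(month_end, latency_sensitive=False))
```

Both sample sets average 20 percent, which is why an average alone cannot separate them. The screen only works if the sampling window actually contains the peak, so a month-end workload needs more than a month of history.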

Disclaimer: Rightsizing, storage policy and memory tiering changes alter live workload behaviour. Validate against your supported BOM, apply changes to a non-critical cluster first, take backups, and stage downsizing in steps with a rollback plan rather than accepting every recommendation in bulk.

The order I actually run them

1. Reclaim first. Idle VMs, stale snapshots, orphaned disks and powered-off VMs. It is the lowest-risk capacity you will ever recover and it cleans the signal for everything after.
2. Rightsize the safe majority. Accept downsizing on steady-state, non-critical VMs. Exclude latency-sensitive and bursty workloads from automatic action and review those by hand.
3. Set storage and density defaults. RAID-5/6 with compression as the standard policy, Auto-RAID for resilience, memory tiering on general-purpose clusters.
4. Then tune performance where it counts. NUMA-aligned sizing and selective reservations on the genuinely hot tier-1 VMs, validated against your VCF 9 reference architecture sizing.
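The four steps above can be enforced on a change backlog with a simple sort key, so tuning work never jumps ahead of reclaim. This is a sketch only: the phase labels and backlog entries are hypothetical, and a real queue would live in your change-control tool.

```python
PHASES = ["reclaim", "rightsize", "defaults", "tune"]

# Hypothetical change backlog: (phase, description).
backlog = [
    ("tune", "NUMA-align erp-db-01"),
    ("rightsize", "Downsize batch-worker pool"),
    ("reclaim", "Delete stale snapshots"),
    ("defaults", "Apply RAID-5/6 + compression policy"),
    ("reclaim", "Remove orphaned disks"),
]

def order_backlog(items):
    """Sort changes by phase; sorted() is stable, so order within a phase is kept."""
    rank = {phase: i for i, phase in enumerate(PHASES)}
    return sorted(items, key=lambda item: rank[item[0]])

ordered = order_backlog(backlog)
for phase, description in ordered:
    print(f"{phase:10s} {description}")
```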

Reclaim, then rightsize, then set defaults, and save real tuning for the workloads users feel.

The Bottom Line

If you only have budget for one direction this quarter, do efficiency first. Reclaim and rightsizing pay back immediately and at low risk, and a clean cluster is a prerequisite for honest performance work anyway. Save real performance tuning for the small set of workloads where users feel the difference, and protect those with NUMA-aware sizing and targeted reservations rather than blanket headroom. The one lever I would turn on across the board is memory tiering: in VCF 9.1 it buys density and TCO without a reboot and without hurting most workloads, which is as close as this stack gets to a free win. Verdict: efficiency before performance, always, with memory tiering as the bridge between them. What does your current cycle lead with, reclaim or tuning?


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.


Tags: Cost Optimization, Performance Tuning, Rightsizing, VCF 9 Series, VCF9, vSAN ESA
