Dr. Pranay Jha


VCF 9 Performance Tuning vs Cost Optimization: Where to Spend Your Effort (VCF 9 Series, Part 35)

Performance tuning and cost optimization in VCF 9 pull in opposite directions. Here is which levers help which goal, where they collide, and the order I run them in on real clusters.

VCF 9 Series · Part 35 of 36

TL;DR · Key Takeaways

  • Performance tuning and efficiency optimization are two different projects that fight over the same cluster. Treating them as one is how you end up with a fast platform you cannot afford, or a cheap one that misses SLAs.
  • Run efficiency first: Reclaim idle VMs, snapshots and orphaned disks, then Rightsize. You cannot tune a cluster you cannot see clearly through the noise of dead capacity.
  • Memory tiering is the rare lever that helps density and TCO at once, and in VCF 9.1 it no longer needs a host reboot to enable.
  • Do not point aggressive rightsizing at latency-sensitive VMs. That is where the two goals collide hardest.
  • vSAN ESA changes the old storage trade-off: RAID-5/6 erasure coding now gives RAID-1 class performance, so capacity efficiency stops costing you IOPS.

Who this is for: VCF architects and operations leads running a live 9.0 or 9.1 fleet who are being asked to cut cost and protect performance at the same time.

Prerequisites: a bedded-in VCF Operations deployment with at least a few weeks of metrics, and change control for any policy or sizing change.

Optimization gets sold as one tidy project. It is two, and they want opposite things. Performance tuning spends resources to make workloads faster. Efficiency optimization takes resources away to make the platform cheaper. Run them as a single pass and you spend a week tightening reservations on the VMs your AI forecaster is about to flag as oversized, then undo it. The skill in VCF 9 is not knowing the levers. It is knowing which goal each lever serves and the order to pull them in.

Two goals, one cluster, opposite pull

Performance work answers “is this workload fast enough,” and its instinct is to add headroom: bigger NUMA-aligned VMs, reservations, RAID-1, dedicated networks. Efficiency work answers “are we paying for capacity nobody uses,” and its instinct is to claw headroom back: rightsize the oversized, reclaim the idle, push density up with memory tiering. Both are correct. They just cannot both win on the same VM at the same time.

The trap I see most often on customer fleets is teams optimizing for cost on infrastructure they have not measured, or chasing latency on a cluster that is half full of dead capacity. So before any tuning, get the picture clean. If your VCF Operations monitoring is not yet trustworthy, fix that first. Everything below assumes you can actually see contention and waste, not guess at them.

[Figure: Two goals, one cluster, opposite pull. Both are correct; they cannot both win on the same VM at once. Performance (add headroom): bigger NUMA-aligned VMs; reservations on hot VMs; RAID-1, dedicated networks. Efficiency (claw it back): rightsize the oversized; reclaim idle + snapshots; memory tiering, higher density.]
Measure first; do not chase latency on a cluster half full of dead capacity.

The performance levers worth your time

vSphere in VCF 9 shipped real scheduler changes, not cosmetic ones. The NUMA scheduler now behaves more like DRS, using a fairness model with resource pools and min/max shares plus a multi-resource goodness model that factors in page-migration costs. Topology-aware scheduling applies chip-aware logic to NUMA placement, which matters on high-core-count CPUs where a badly placed wide VM pays a remote-memory tax on every access. For the busiest, widest VMs this is the single highest-yield thing to get right, and it costs nothing but attention to sizing.
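
To make the sizing point concrete, here is a minimal Python sketch that checks whether a VM fits inside one NUMA node or has to span several. The host figures, the `NumaNode` type and both helper functions are illustrative assumptions; nothing here is read from vSphere or a VCF API, and the real scheduler weighs far more than these two numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    cores: int      # physical cores in one NUMA node (assumed value)
    memory_gb: int  # memory local to that node (assumed value)


def fits_single_numa_node(vcpus: int, memory_gb: int, node: NumaNode) -> bool:
    """True when both the vCPU and memory ask fit inside one node."""
    return vcpus <= node.cores and memory_gb <= node.memory_gb


def nodes_spanned(vcpus: int, memory_gb: int, node: NumaNode) -> int:
    """Smallest node count that covers both the vCPU and the memory ask."""
    by_cpu = -(-vcpus // node.cores)          # ceiling division
    by_mem = -(-memory_gb // node.memory_gb)  # ceiling division
    return max(by_cpu, by_mem)


# Hypothetical host: 32 cores and 512 GB per NUMA node.
node = NumaNode(cores=32, memory_gb=512)
print(fits_single_numa_node(24, 384, node))  # True: stays on local memory
print(nodes_spanned(48, 384, node))          # 2: a wide VM that spans nodes
```

A VM that reports more than one node is a candidate for the sizing attention described above: either trim it to fit one node or size it so it divides evenly across the nodes it needs.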

DRS parallel vMotion removes the old sequential bottleneck in cluster balancing by letting multiple migrations run at once, so a rebalance after maintenance settles in minutes rather than dragging. On storage, vSAN ESA quietly killed the classic performance-versus-capacity argument: RAID-5/6 erasure coding now delivers performance equal to or better than RAID-1 mirroring, so the old reflex of mirroring your hot tier for IOPS is usually wasted capacity. The log-structured file system ingests writes, coalesces them and acknowledges from a durable log to keep latency low. For stretched topologies, keep inter-site vSAN traffic on a dedicated segment with jumbo frames at MTU 9000, which is one of the few raw-network knobs that still moves the needle.
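
The capacity side of that argument is simple arithmetic. The sketch below assumes the commonly cited layouts (two full copies for a RAID-1 FTT=1 mirror, 2+1 and 4+1 stripes for RAID-5, 4+2 for RAID-6) and ignores compression, deduplication and slack space. Treat the stripe widths as assumptions to check against your own sizing, not as figures taken from this article.

```python
def mirror_overhead(copies: int) -> float:
    """Raw capacity consumed per usable unit when data is fully mirrored."""
    return float(copies)


def erasure_overhead(data: int, parity: int) -> float:
    """Raw capacity consumed per usable unit for a data+parity stripe."""
    return (data + parity) / data


# Assumed layouts for illustration only.
POLICIES = {
    "RAID-1 FTT=1": mirror_overhead(2),
    "RAID-5 2+1": erasure_overhead(2, 1),
    "RAID-5 4+1": erasure_overhead(4, 1),
    "RAID-6 4+2": erasure_overhead(4, 2),
}


def raw_tb(usable_tb: float, policy: str) -> float:
    """Raw TB needed to store `usable_tb` under the named policy."""
    return usable_tb * POLICIES[policy]


for name in POLICIES:
    print(f"{name:<14} {raw_tb(100, name):6.1f} TB raw per 100 TB usable")
```

Under these assumptions, 100 TB of usable data costs 200 TB raw when mirrored and 125 TB raw on a 4+1 stripe. If erasure coding no longer costs IOPS, that 75 TB difference is what mirroring the hot tier gives away.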

The efficiency levers worth your time

VCF Operations is where most of the cost actually hides. Rightsizing uses AI-driven demand forecasting to flag VMs that are oversized or undersized, and in 9.1 it can return several recommendations for a single VM in one pass, whatever its state: idle, powered off, or oversized. Reclaim cleans out idle VMs, old snapshots, orphaned disks and powered-off VMs, and the Reclamation Dashboard gives you a running total of recovered capacity so the savings are defensible to whoever signs the budget. This pairs naturally with the discipline in VCF 9 capacity and cost management.
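
If you want to sanity-check the running total outside the dashboard, a few lines of Python over an exported candidate list are enough. This is a sketch under stated assumptions: the `Candidate` record, its `kind` labels and the sample rows are hypothetical, and it does not call any VCF Operations API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str       # hypothetical labels: idle_vm, snapshot, orphaned_disk, powered_off_vm
    disk_gb: float  # capacity that would be freed if the item were removed


def reclaim_totals(candidates: list[Candidate]) -> dict[str, float]:
    """Sum reclaimable capacity per category."""
    totals: dict[str, float] = defaultdict(float)
    for c in candidates:
        totals[c.kind] += c.disk_gb
    return dict(totals)


# Made-up sample rows standing in for an exported report.
sample = [
    Candidate("build-agent-07", "idle_vm", 120.0),
    Candidate("erp-db pre-upgrade", "snapshot", 340.0),
    Candidate("old-app_1.vmdk", "orphaned_disk", 80.0),
    Candidate("test-web-02", "powered_off_vm", 60.0),
    Candidate("erp-app pre-patch", "snapshot", 45.0),
]

totals = reclaim_totals(sample)
for kind, gb in sorted(totals.items()):
    print(f"{kind:<16} {gb:8.1f} GB")
print(f"{'total':<16} {sum(totals.values()):8.1f} GB")
```

Keeping the breakdown by category makes the number easier to defend, because each line maps to a specific kind of cleanup.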

On the platform side, memory tiering is the standout. The NVMe second tier intelligently offloads roughly 20 to 25 percent of memory accesses to flash, raising VM density and lowering server TCO without a noticeable hit to responsiveness for most workloads. In 9.1 you can enable it without the host reboot earlier versions required, and the UI now surfaces eligible clusters plus device wear so you can plan replacement instead of getting surprised. Storage efficiency improved too. vSAN ESA compression is cluster-based and always-on, using the ZSTD algorithm in 9.1 for better ratios than LZ4. Global deduplication arrived, and Auto-RAID picks the resilience scheme for you: FTT=2 with RAID-6 on six or more hosts, and FTT=1 with RAID-5 on three to five.
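
The Auto-RAID rule as stated above is small enough to write down. This Python sketch only encodes the two host-count bands quoted in this article; the function name is hypothetical, and because the text does not say what happens below three hosts, the sketch raises an error rather than guessing.

```python
def auto_raid_policy(host_count: int) -> tuple[int, str]:
    """Return (FTT, RAID level) for a cluster, per the rule quoted in the text."""
    if host_count >= 6:
        return 2, "RAID-6"
    if 3 <= host_count <= 5:
        return 1, "RAID-5"
    # Not covered by the rule as quoted; do not invent a behaviour.
    raise ValueError(f"no rule stated for {host_count} hosts")


for hosts in (3, 5, 6, 12):
    ftt, raid = auto_raid_policy(hosts)
    print(f"{hosts:>2} hosts -> FTT={ftt}, {raid}")
```

It is useful mainly as a planning check: a cluster that grows from five hosts to six crosses a band, so expect the resilience scheme and its capacity overhead to change with it.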


Where the two goals collide

Most levers lean one way. A few serve both, and a few will actively hurt the other goal if you push them blind. This is the matrix I keep in front of me when planning an optimization cycle.

| Lever | Performance impact | Efficiency / cost impact | When to favour it |
|---|---|---|---|
| NUMA / topology-aware sizing | High gain on wide VMs | Neutral | Always, for tier-1 workloads |
| Memory tiering (NVMe) | Slight, acceptable for most | Strong: higher density, lower TCO | General-purpose clusters |
| vSAN ESA RAID-5/6 + compression | RAID-1 class, effectively neutral | Strong capacity savings | Default for almost everything |
| Rightsizing (downsize oversized) | Risk on bursty / latency-sensitive | Strong cost reduction | Steady-state, non-critical VMs |
| Reclaim idle / snapshots / orphans | Neutral to positive (less contention) | Pure savings, low risk | First, every cycle |
| Reservations on hot VMs | Protects performance | Hurts density / consolidation | Sparingly, only proven hot VMs |

The two rows that cause real arguments are rightsizing and reservations. Aggressive rightsizing on a bursty database looks like free savings in the forecast and shows up as a latency incident three weeks later when month-end hits. Reservations are the mirror image: they feel responsible but they fragment your capacity and quietly cap how dense the cluster can run. Both are right in their lane and wrong everywhere else.
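
One way to keep rightsizing in its lane is to triage VMs before accepting any recommendation. The sketch below is a simple heuristic of my own for illustration, not how VCF Operations forecasts demand: it uses a peak-to-mean ratio as a stand-in for burstiness, and the 2.0 threshold, the function names and the sample data are all assumptions you would tune.

```python
from statistics import mean


def peak_to_mean(samples: list[float]) -> float:
    """Ratio of the highest sample to the average; needs a non-empty list."""
    avg = mean(samples)
    return max(samples) / avg if avg > 0 else 0.0


def rightsizing_lane(cpu_samples: list[float], latency_sensitive: bool,
                     burst_threshold: float = 2.0) -> str:
    """Route a VM to automatic downsizing or to manual review."""
    if latency_sensitive:
        return "manual-review"
    if peak_to_mean(cpu_samples) >= burst_threshold:
        return "manual-review"  # bursty: the average hides the peak
    return "auto-candidate"


# Made-up CPU demand samples (percent) over a measurement window.
steady = [40, 42, 38, 41]
month_end_db = [10, 10, 10, 90]

print(rightsizing_lane(steady, latency_sensitive=False))        # auto-candidate
print(rightsizing_lane(month_end_db, latency_sensitive=False))  # manual-review
print(rightsizing_lane(steady, latency_sensitive=True))         # manual-review
```

The measurement window matters as much as the threshold. If the samples do not include a month-end, the bursty database above looks steady and lands in the wrong lane.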

Disclaimer: Rightsizing, storage policy and memory tiering changes alter live workload behaviour. Validate against your supported BOM, apply changes to a non-critical cluster first, take backups, and stage downsizing in steps with a rollback plan rather than accepting every recommendation in bulk.

The order I actually run them

  1. Reclaim first. Idle VMs, stale snapshots, orphaned disks and powered-off VMs. It is the lowest-risk capacity you will ever recover and it cleans the signal for everything after.
  2. Rightsize the safe majority. Accept downsizing on steady-state, non-critical VMs. Exclude latency-sensitive and bursty workloads from automatic action and review those by hand.
  3. Set storage and density defaults. RAID-5/6 with compression as the standard policy, Auto-RAID for resilience, memory tiering on general-purpose clusters.
  4. Then tune performance where it counts. NUMA-aligned sizing and selective reservations on the genuinely hot tier-1 VMs, validated against your VCF 9 reference architecture sizing.

[Figure: The order I actually run them. Lowest-risk savings first; a clean cluster makes performance work honest. 1. Reclaim first. 2. Rightsize the safe majority. 3. Set storage + density defaults. 4. Tune performance where it counts. Efficiency before performance, every cycle; memory tiering is the bridge between them.]
Reclaim, then rightsize, then set defaults, and save real tuning for the workloads users feel.
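
If your change backlog arrives in no particular order, the four phases above can be applied as a sort key. This is a minimal sketch; the phase labels and the backlog entries are hypothetical, and a real plan would also carry owners, change windows and rollback steps.

```python
# Phase labels are shorthand for the four steps in the list above.
PHASE_ORDER = ["reclaim", "rightsize", "defaults", "tune"]


def plan(changes: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Order (phase, description) pairs by phase, keeping ties in input order."""
    rank = {phase: i for i, phase in enumerate(PHASE_ORDER)}
    return sorted(changes, key=lambda change: rank[change[0]])


# Made-up backlog in the order requests happened to arrive.
backlog = [
    ("tune", "NUMA-align the tier-1 database VM"),
    ("reclaim", "delete snapshots older than the retention window"),
    ("defaults", "apply RAID-5/6 with compression as the standard policy"),
    ("rightsize", "downsize steady-state, non-critical VMs"),
    ("reclaim", "remove orphaned disks"),
]

for phase, description in plan(backlog):
    print(f"{phase:<10} {description}")
```

Python's `sorted` is stable, so two reclaim items keep the order they were submitted in; only the phases are reordered.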

The Bottom Line

If you only have budget for one direction this quarter, do efficiency first. Reclaim and rightsizing pay back immediately and at low risk, and a clean cluster is a prerequisite for honest performance work anyway. Save real performance tuning for the small set of workloads where users feel the difference, and protect those with NUMA-aware sizing and targeted reservations rather than blanket headroom. The one lever I would turn on across the board is memory tiering: in VCF 9.1 it buys density and TCO without a reboot and without hurting most workloads, which is as close as this stack gets to a free win. Verdict: efficiency before performance, always, with memory tiering as the bridge between them. What does your current cycle lead with, reclaim or tuning?


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
