Dr. Pranay Jha


VCF 9 Operations Capacity and Cost Management: A Practical Runbook (VCF 9 Series, Part 20)

A hands-on runbook for capacity planning, rightsizing, reclamation and cost reporting in VCF 9 Operations, including the model and policy choices that decide whether the recommendations are worth trusting.

VCF 9 Series · Part 20 of 36

TL;DR · Key Takeaways

  • The Capacity Engine turns two inputs, Demand and Usable Capacity, into Time Remaining, Capacity Remaining and a Recommended Size. Those are forecasts, not facts, so set the policy before you trust the number.
  • Leave the demand model on for general clusters, but switch contention-sensitive or reservation-heavy clusters to the allocation model. The default under-reports risk where you have committed resources.
  • Reclamation frees physical capacity (idle VMs, stale snapshots, orphaned disks). Rightsizing changes a running VM’s allocation. Different operations, very different blast radius.
  • Start cost reporting with showback, not chargeback. Your rate card is wrong on day one, and chargeback disputes burn more ops time than they save.
  • Give the engine three to four weeks of clean data and exclude known anomalous spikes before you act on any projection.
Who this is for: VCF admins and architects on VCF Operations 9.0 or 9.1 who own capacity and cost reporting.
Prerequisites: a management domain with VCF Operations deployed, at least one workload domain reporting metrics, and policy edit rights.

Here is the failure pattern I see most often. A team stands up VCF 9, opens VCF Operations a week later, sees a cluster flashing “Critical” on Time Remaining, and either buys hosts they did not need or panic-migrates workloads they should have left alone. The capacity data was fine. The reading of it was not. Capacity and cost management in VCF 9 is less about the dashboards and more about the four or five policy decisions that sit underneath them. Get those right and the recommendations are genuinely useful. Get them wrong and you are automating bad guesses at scale.

This runbook walks the workflow end to end: how the engine computes capacity, which policy to set first, how to reclaim and rightsize safely, how to model growth with what-if scenarios, and how to turn all of that into a cost story your application teams will actually accept.

What the Capacity Engine Actually Computes

VCF Operations 9 runs an AI/ML-driven Capacity Engine that takes two inputs, Demand and Usable Capacity, and produces four outputs you will see everywhere in the UI: Time Remaining, Capacity Remaining, Recommended Size and Recommended Total Capacity. Time Remaining is the one people fixate on, reported per cluster as Critical, Medium, Normal or Unknown, with a Most Constrained Resource flag telling you whether CPU, memory or disk space is the bottleneck.

The detail that changes everything is the projection model. The engine defaults to the demand model, which forecasts based on what workloads are actually consuming. That is the right default for general-purpose clusters where you want to chase real utilisation. It is the wrong default for any cluster where you have made commitments: reservations, latency-sensitive tiers, or workloads sized to a contract. There the allocation model is what you want, because it counts what you have promised, not just what is being used right now. A cluster can look healthy on demand and be effectively full on allocation. That gap is where teams get surprised.
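To make the gap concrete, here is a minimal sketch of how the same cluster can report very different headroom under the two models. The figures and the names `ClusterSnapshot`, `capacity_remaining_pct` and `compare_models` are illustrative assumptions, not VCF Operations APIs, and the real engine's calculation is more involved than a single subtraction.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    usable_capacity_gb: float  # usable memory after overhead (hypothetical figure)
    demand_gb: float           # what workloads are actually consuming
    allocated_gb: float        # what has been promised (configured or reserved)


def capacity_remaining_pct(usable: float, consumed: float) -> float:
    """Share of usable capacity still free, floored at zero."""
    return max(0.0, (usable - consumed) / usable * 100.0)


def compare_models(cluster: ClusterSnapshot) -> dict:
    """Capacity remaining for the same cluster under each model."""
    return {
        "demand": capacity_remaining_pct(cluster.usable_capacity_gb, cluster.demand_gb),
        "allocation": capacity_remaining_pct(cluster.usable_capacity_gb, cluster.allocated_gb),
    }


# Hypothetical reservation-heavy cluster: lightly used, heavily committed.
cluster = ClusterSnapshot(usable_capacity_gb=4096, demand_gb=1843.2, allocated_gb=3993.6)
result = compare_models(cluster)
print(result)
```

With these made-up numbers the cluster shows about 55% remaining on demand and about 2.5% on allocation, which is exactly the kind of surprise the allocation model is meant to surface.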

[Figure: VCF 9 Capacity-to-Cost Workflow. Policy first, then assess, optimise, plan, and only then charge. Stages: 1. Set Policy (model + exclusions); 2. Assess (Time Remaining); 3. Optimise (reclaim / rightsize); 4. Plan (what-if scenarios); 5. Cost (showback first).]
The capacity-to-cost workflow in VCF 9 Operations. Each stage depends on the policy you set in stage one.
Disclaimer: Capacity policy edits, rightsizing and reclamation are production changes. Validate against the running BOM, confirm there are no active reservations or affinity rules you would violate, snapshot or back up affected VMs, and apply resize and reclaim actions inside a maintenance window. Test on a non-critical cluster first.
[Figure: Demand vs allocation: pick the right model. The default is demand; commitments need allocation. Demand model (default): forecasts on what workloads actually consume; best for general-purpose clusters. Allocation model: counts committed resources (reservations); best for latency-sensitive, contractual clusters. A cluster can look healthy on demand and be effectively full on allocation.]
Set the model per cluster class; the gap between the two is where teams get surprised.

Step 1: Set the Capacity Policy Before You Trust a Number

  1. Open VCF Operations and go to Configure > Policies. Either edit the default policy or, better, clone it per cluster class so production and lab clusters do not share one set of thresholds.
  2. Set the capacity model per policy. Leave demand for general compute. Switch reservation-heavy or latency-sensitive clusters to allocation so the engine counts committed resources.
  3. Turn on Exclude Anomalous Time Ranges and mark any known events (a migration wave, a DDoS spike, a backup storm) so they stop skewing Time Remaining. Garbage input produces a confident, wrong forecast.
  4. If you intentionally run storage near full with just-in-time expansion, use the new Exclude Storage option so disk space stops being flagged as Critical and CPU or memory contention surfaces instead. Note that disk space can still appear as the Most Constrained Resource even when excluded from the Critical calculation.
  5. Confirm your thresholds for Critical, Medium and Normal. These drive every alert. Dynamic thresholding is fine for steady workloads, but pin static values on clusters with predictable, contractual limits.
# Where the levers live in VCF Operations 9
Configure > Policies > [your policy] > Capacity
  - Capacity Model:            Demand (default)  |  Allocation
  - Exclude Anomalous Ranges:  On  (mark known spikes)
  - Exclude Storage:           On for just-in-time storage clusters
  - Buffers / Thresholds:      Critical / Medium / Normal

Assess:   Capacity > [cluster] > Time Remaining, Most Constrained Resource
Optimise: Capacity > Reclaim  |  Rightsize
Plan:     Capacity > What-If Analysis
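Why the exclusion in step 3 matters is easier to see with numbers. The sketch below is a plain Python illustration, not the VCF Operations implementation: `exclude_ranges` is a hypothetical helper, and the dates and utilisation values are invented to show how a two-day migration wave distorts a simple average.

```python
from datetime import date
from statistics import mean


def exclude_ranges(samples, excluded):
    """Drop samples whose day falls inside any (start, end) range, inclusive."""
    return [
        (day, value)
        for day, value in samples
        if not any(start <= day <= end for start, end in excluded)
    ]


# Daily CPU demand in percent (made-up values); 3-4 March is a known migration wave.
samples = [
    (date(2025, 3, 1), 40.0),
    (date(2025, 3, 2), 42.0),
    (date(2025, 3, 3), 95.0),
    (date(2025, 3, 4), 97.0),
    (date(2025, 3, 5), 41.0),
]
migration_wave = [(date(2025, 3, 3), date(2025, 3, 4))]

clean = exclude_ranges(samples, migration_wave)
raw_avg = mean(value for _, value in samples)
clean_avg = mean(value for _, value in clean)
print(raw_avg, clean_avg)
```

Two bad days out of five are enough to move the baseline from the low forties to the sixties here, and any forecast built on that baseline inherits the error.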

Step 2: Read Time Remaining and Find Reclaimable Capacity

With the policy set, the Assess Capacity page becomes trustworthy. Read it in this order: Most Constrained Resource first (it tells you which dimension to act on), then the Time Remaining graph (it shows when a cluster is projected to exhaust CPU, memory or disk under the model you chose), then the Optimization Recommendations, which quantify potential savings from reclaiming unused resources. The engine offers two levers on every constrained cluster: Reclaim Resources or Add Capacity. Always exhaust reclaim before you cost a hardware purchase.
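For intuition about what a Time Remaining figure represents, the sketch below fits a straight line to daily demand and reports how many days pass before it crosses usable capacity. This is a deliberately naive stand-in: the actual Capacity Engine is ML-driven and its algorithm is not described here, and `days_until_exhaustion` is a hypothetical function name.

```python
def days_until_exhaustion(daily_demand, usable_capacity):
    """Least-squares line through daily demand; days from the last sample
    until the line reaches usable capacity. None if demand is flat or falling."""
    n = len(daily_demand)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_demand) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_demand))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_day = (usable_capacity - intercept) / slope
    # Already past the ceiling counts as zero days remaining.
    return max(0.0, crossing_day - (n - 1))


# Four weeks of memory demand in GB growing 2 GB per day (made-up values).
history = [500 + 2 * day for day in range(28)]
remaining = days_until_exhaustion(history, usable_capacity=700)
print(remaining)
```

The point of the exercise is that the answer depends entirely on the history you feed in and the ceiling you measure against, which is why the policy and exclusions come first.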

Reclamation is about physical capacity you are wasting: powered-off and idle VMs, snapshots that were never cleaned up, and orphaned disks left behind by failed provisioning or deletions. This is the safest optimisation in the toolkit because deleting a stale snapshot or an orphaned VMDK does not touch a running workload. My rule on a brownfield environment: run the reclamation report first, clear the obvious orphans and aged snapshots, and you will frequently recover enough headroom to push out a hardware refresh by a quarter or more. Check the snapshot age column before bulk-deleting, because some shops keep deliberate long-lived snapshots for compliance.
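The snapshot-age check can be sketched as a simple filter. Everything here is an assumption for illustration: the `Snapshot` record, the `compliance_hold` flag and the 30-day cutoff are hypothetical and do not correspond to fields in the VCF Operations reclamation report.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    vm_name: str
    created: date
    size_gb: float
    compliance_hold: bool = False  # hypothetical marker for deliberate long-lived snapshots


def stale_snapshots(snapshots, today, max_age_days=30):
    """Snapshots older than the cutoff that are not held for compliance."""
    return [
        snap
        for snap in snapshots
        if not snap.compliance_hold and (today - snap.created).days > max_age_days
    ]


inventory = [
    Snapshot("app01", date(2025, 1, 10), 120.0),
    Snapshot("db01", date(2024, 11, 1), 300.0, compliance_hold=True),
    Snapshot("web02", date(2025, 5, 25), 15.0),
]
candidates = stale_snapshots(inventory, today=date(2025, 6, 1))
reclaimable_gb = sum(snap.size_gb for snap in candidates)
print([snap.vm_name for snap in candidates], reclaimable_gb)
```

Only `app01` qualifies in this example: `db01` is older but held deliberately, and `web02` is too recent to call stale.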

Step 3: Rightsize Without Breaking Workloads

Rightsizing is different from reclamation and it is where the real risk lives. Rightsizing changes a VM’s allocated CPU or memory to match observed usage, scaling up an under-provisioned VM or scaling down an over-provisioned one. The scale-up direction is low risk. The scale-down direction is not, and this is where I diverge from the “just apply the recommendations” advice.

  1. Open Capacity > Rightsize and read the rationale column, not just the recommendation. The engine explains why it suggests a change; if the justification is a short, quiet observation window, wait.
  2. Build exclusions deliberately. VCF Operations 9 lets you exclude VMs from rightsizing and reclamation by property tag, VM age and history, and it auto-excludes Broadcom appliances. Tag your databases, JVM-heavy app servers and latency-sensitive tiers as excluded from automated memory downsizing.
  3. Schedule applied resizes inside a maintenance window. A CPU or memory change that requires a reboot is not a daytime operation on a production VM.
  4. Apply in waves and verify performance after each wave before continuing.

In practice: never auto-apply memory downsizing to JVM or in-memory database workloads. The engine sees free guest memory and recommends a cut, but the JVM heap or buffer pool was sized for peak, and the reclaim shows up later as garbage-collection thrash or swap. Rightsize the obvious CPU over-allocations automatically if you must, but gate every memory reduction behind human review.
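One way to encode that rule is a small triage function that sorts each recommendation into auto-apply, human review or exclude. This is a sketch under assumptions: the tag names in `PROTECTED_TAGS`, the `Recommendation` record and the `triage` function are hypothetical, and "auto" still means applying inside a maintenance window.

```python
from dataclasses import dataclass

# Hypothetical tag names; use whatever tagging scheme your environment already has.
PROTECTED_TAGS = frozenset({"database", "jvm", "latency-sensitive"})


@dataclass
class Recommendation:
    vm_name: str
    resource: str      # "cpu" or "memory"
    current: int       # vCPUs or GB currently allocated
    recommended: int   # vCPUs or GB the engine suggests
    tags: frozenset = frozenset()


def triage(rec: Recommendation) -> str:
    """Return 'auto', 'review' or 'exclude' for a rightsizing recommendation."""
    if rec.recommended >= rec.current:
        return "auto"      # scale-up or no change: low risk
    if rec.resource == "memory":
        if rec.tags & PROTECTED_TAGS:
            return "exclude"   # never downsize memory on protected tiers
        return "review"        # every other memory reduction needs a human
    return "auto"              # CPU scale-down on obvious over-allocations


recs = [
    Recommendation("app01", "cpu", 8, 4),
    Recommendation("db01", "memory", 64, 32, frozenset({"database"})),
    Recommendation("web01", "memory", 16, 8),
    Recommendation("batch01", "memory", 8, 16),
]
decisions = {rec.vm_name: triage(rec) for rec in recs}
print(decisions)
```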


[Figure: Reclaim first, rightsize carefully. Reclamation is the safe win; scale-down is where the risk lives. Reclaim (do first): idle / powered-off VMs, stale snapshots, orphaned VMDKs; does not touch running workloads. Rightsize (careful): scale-up is low risk; gate scale-down behind review; never auto-downsize JVM/DB memory; apply in a maintenance window.]
Exhaust safe reclamation before you cost hardware; gate every memory reduction behind review.

Step 4: Model Growth with What-If Scenarios

Before you commit to a hardware purchase or onboard a new application, model it. VCF Operations 9 builds what-if scenarios on the same Capacity Engine forecasts, and the 9.0 release raised the ceiling to roughly 10,000 VMs and 200 TB per scenario, which covers most real planning needs. You can add hypothetical VMs or hosts to an existing cluster, and the newer releases let you model an entirely new cluster by entering server make, model, CPU, socket count, cores, memory, year and cost. The engine then tells you whether the workload fits within your timeframe, and if it does not, it recommends what to change so it will.

  • Use the new-cluster scenario to attach a real financial number to a growth plan, then export the result for the business case before you raise a purchase order.
  • Run the scenario against both demand and allocation models if the target cluster carries reservations. The two answers bracket your real risk.
  • Keep anomalous ranges excluded here too. What-if inherits the same input metrics, so a skewed history produces a skewed plan.
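Running a scenario against both models can be sketched as two fit checks on the same hypothetical workload. The function names, the 10% buffer and the `expected_utilisation` field are assumptions for illustration; the real what-if engine works from forecasts rather than a single point-in-time sum.

```python
def fits(usable, current, added, buffer_pct=0.0):
    """True when current plus added load stays within usable capacity minus the buffer."""
    ceiling = usable * (1 - buffer_pct / 100)
    return current + added <= ceiling


def what_if(usable_gb, demand_gb, allocated_gb, new_vms, buffer_pct=10.0):
    """Check a batch of hypothetical VMs against both capacity models."""
    added_allocation = sum(vm["configured_gb"] for vm in new_vms)
    added_demand = sum(vm["configured_gb"] * vm["expected_utilisation"] for vm in new_vms)
    return {
        "demand": fits(usable_gb, demand_gb, added_demand, buffer_pct),
        "allocation": fits(usable_gb, allocated_gb, added_allocation, buffer_pct),
    }


# Twenty 16 GB VMs expected to run at half their configured memory (made-up workload).
new_vms = [{"configured_gb": 16, "expected_utilisation": 0.5} for _ in range(20)]
outcome = what_if(usable_gb=2048, demand_gb=900, allocated_gb=1600, new_vms=new_vms)
print(outcome)
```

In this example the workload fits on demand but not on allocation, so the honest answer to "do we have room?" depends on whether the target cluster carries commitments.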

Step 5: Turn Capacity into Cost, Showback Before Chargeback

Cost Management in VCF Operations attributes infrastructure spend through Cost Drivers (server hardware, storage, licenses, applications, maintenance, labor, network, facilities and more), lets you define Pricing Policies or rate cards, and then supports two consumption models: Showback distributes cost based on actual usage, and Chargeback bills tenants or application teams against your rate card. VCF 9.1 extends costing across the Kubernetes stack (nodes, clusters, vSphere namespaces, projects and organizations), adds real-time pricing estimates when a VM or Kubernetes node is deployed, and can deliver bills as automated PDFs.

My take, and it is a firm one: start with showback and stay there for at least a quarter. Chargeback is technically ready on day one, but your cost drivers are not. The first rate card always misses something (real license cost allocation, shared storage tiers, the labor line), and the moment you bill a team against a wrong number you spend the next month in disputes instead of operations. Run showback, let application owners see their real consumption, refine the cost drivers against actual invoices, and only flip to chargeback once the numbers survive a few cycles unchallenged. The cost data is most valuable as a behaviour-changer long before it becomes an invoice.
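At its simplest, showback is a proportional split of a known cost across measured usage. The sketch below assumes a single monthly cost pool and one usage metric per team; the `showback` function, the team names and the figures are hypothetical, and real cost drivers in VCF Operations are considerably more granular.

```python
def showback(total_monthly_cost, usage_by_team):
    """Split a monthly cost pool across teams in proportion to measured usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: round(total_monthly_cost * usage / total_usage, 2)
        for team, usage in usage_by_team.items()
    }


# Memory GB-hours per team for the month, in thousands (made-up values).
usage = {"payments": 500, "analytics": 300, "web": 200}
statement = showback(10_000, usage)
print(statement)
```

Because each share is rounded to cents independently, the parts may not sum exactly to the pool in general. That is harmless for a showback report and is one more reason to settle the method before anyone is billed against it.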

For the broader operations picture this builds on, see VCF Operations monitoring and observability in VCF 9. To ground capacity baselines against your design, revisit the VCF 9 reference architecture and sizing guide, and because cost attribution starts with entitlements, the VCF 9 licensing and consumption model is worth a re-read.

What I’d Do

On a fresh VCF 9 environment I would not even open the capacity dashboards in week one. Set the policies, pick the model per cluster class, exclude the spikes you already know about, and let the engine collect three to four weeks of clean data. Then work the order: reclaim the free wins, rightsize CPU broadly and memory carefully, model anything that needs new hardware, and report cost as showback until the drivers are trustworthy. The platform is good at the math. Your job is to make sure it is doing the math on the right assumptions.

What is the first thing you check when a cluster goes Critical, the model or the data behind it? Tell me how you run capacity reviews in your environment.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
