Autoscaling VKS: Cluster Autoscaler and Node Pool Scaling (VKS Series, Part 9)

The Cluster Autoscaler turns fixed node pools elastic, but only within your quota. Here is how min and max really work, how it differs from the HPA, and why the quota always wins.

By Dr. Pranay Jha · June 19, 2026


VKS Series · Part 9 of 17

TL;DR · Key Takeaways

The Cluster Autoscaler adds and removes worker nodes based on pending pods and idle capacity. The Horizontal Pod Autoscaler scales pods. Run both, tuned to cooperate.
You install the autoscaler into the cluster (VCF CLI or kubectl) and set per-node-pool min and max instead of hard-coding replicas.
Scale-up is triggered by unschedulable pods; scale-down removes underused nodes after a cool-down, and it respects pod disruption budgets.
The autoscaler can only act within your namespace quota and host capacity. An ambitious max against a tight quota just produces stuck pods: the quota always wins.

Who this is for: teams running variable workloads who want elasticity without over-provisioning. Prerequisites: Part 5 (node pools and VM classes) and a known namespace quota.

Static clusters waste money at the trough and fall over at the peak. The Cluster Autoscaler is how you stop guessing node counts and let demand drive them. On VKS it works the way it does on any Cluster API platform, adding and removing worker VMs, but with VKS-specific guardrails you must respect: the namespace quota and the VM classes from Part 5 bound what it is allowed to do. Understand those bounds and the autoscaler is reliable; ignore them and you get the “autoscaler is broken” ticket that is really a quota problem.

Two autoscalers, different jobs

It is worth separating the two things people lump together. The Horizontal Pod Autoscaler (HPA) changes how many pod replicas a workload runs, based on CPU, memory or custom metrics. The Cluster Autoscaler (CA) changes how many nodes the cluster has, based on whether pods can be scheduled. They are complementary: the HPA creates more pods, and if those pods cannot fit, the CA adds nodes for them; when load drops, the HPA removes pods and the CA reclaims the now-idle nodes. Tune them so they cooperate rather than fight, because an HPA that flaps will make the CA churn nodes expensively.

The two autoscalers hand off: pods first, then nodes, then back down again as demand subsides.

Setting min and max, and what they really mean

On VKS you install the Cluster Autoscaler into the workload cluster (the docs cover both a VCF CLI install and a kubectl install) and configure it against the node pools. The key shift is that you no longer hard-code the worker replica count; you give each pool a minimum and a maximum, and the autoscaler manages the actual number between them. Set the minimum to the steady-state load you always expect, so the cluster is never caught flat-footed, and set the maximum to what your namespace quota and host capacity can genuinely satisfy. A maximum the quota cannot meet is worse than useless: the autoscaler tries to scale up, fails to get resources, and your pending pods stay pending.

Behaviour	Trigger	What to watch
Scale up	Unschedulable (pending) pods	New-node provisioning time; quota headroom
Scale down	Nodes underused past a cool-down	Pod disruption budgets; pods that block drain
No-op (stuck)	Max reached or quota exhausted	Raise max only if quota and capacity allow
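
The three rows above can be read as a small decision function. This is a toy model for reasoning about the table, not the autoscaler's actual algorithm; the function name, its parameters and the idea of expressing quota as a node count are all assumptions made for illustration.

```python
def autoscaler_action(pending_pods: int, current_nodes: int, min_nodes: int,
                      max_nodes: int, quota_nodes: int,
                      idle_nodes_past_cooldown: int) -> str:
    """Toy classification of the scale-up / scale-down / stuck behaviours.

    quota_nodes is a hypothetical figure: how many nodes of this pool's
    VM class the namespace quota could satisfy in total.
    """
    if pending_pods > 0:
        # Growth is bounded by whichever limit is lower: pool max or quota.
        ceiling = min(max_nodes, quota_nodes)
        if current_nodes < ceiling:
            return "scale-up"
        if current_nodes >= max_nodes:
            return "stuck: max reached"
        return "stuck: quota exhausted"
    if idle_nodes_past_cooldown > 0 and current_nodes > min_nodes:
        return "scale-down"
    return "steady"


if __name__ == "__main__":
    # Max says 10, but the quota only covers 5 nodes: pods stay pending.
    print(autoscaler_action(pending_pods=3, current_nodes=5, min_nodes=2,
                            max_nodes=10, quota_nodes=5,
                            idle_nodes_past_cooldown=0))
```

The second "stuck" branch is the one worth staring at: the pool is nowhere near its configured maximum, yet nothing happens, which is exactly the case that gets reported as a broken autoscaler.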

What bites in production

Three things catch people. First, scale-up is not instant: a new node is a new VM that must be provisioned, booted and joined, so latency-sensitive bursts still need a sensible minimum rather than relying on cold scale-up. Second, scale-down is polite: it respects pod disruption budgets and refuses to drain a node if doing so would violate one, so a single mis-set PDB or an unevictable pod can pin nodes you wanted reclaimed. Third, the autoscaler and your namespace quota must agree, and the quota always wins. The most common “autoscaler is broken” report is really “the quota was exhausted and nobody told the autoscaler it could not have more.”

Capacity-check your max: never set a node-pool maximum you have not confirmed the namespace quota and host capacity can actually satisfy. An aggressive max against a tight quota does not give you elasticity; it gives you pending pods and a confusing incident.
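
A minimal sketch of that capacity check, assuming you can express both the VM class and the remaining namespace quota as whole CPU cores and GiB of memory. The class names, fields and numbers are hypothetical, and the model deliberately ignores storage quota, control plane nodes and host-level headroom, all of which you would also need to confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VMClass:
    """Per-node size. The values used below are illustrative only."""
    cpu_cores: int
    memory_gib: int


@dataclass(frozen=True)
class QuotaHeadroom:
    """CPU and memory still available in the namespace for worker nodes."""
    cpu_cores: int
    memory_gib: int


def max_nodes_within_quota(vm_class: VMClass, headroom: QuotaHeadroom) -> int:
    """Largest node count the headroom can satisfy on both dimensions."""
    by_cpu = headroom.cpu_cores // vm_class.cpu_cores
    by_memory = headroom.memory_gib // vm_class.memory_gib
    return min(by_cpu, by_memory)


def check_pool_max(requested_max: int, vm_class: VMClass,
                   headroom: QuotaHeadroom) -> str:
    achievable = max_nodes_within_quota(vm_class, headroom)
    if requested_max <= achievable:
        return f"ok: max {requested_max} fits (quota allows up to {achievable})"
    return (f"too high: max {requested_max} exceeds the {achievable} nodes "
            f"the quota can satisfy")


if __name__ == "__main__":
    worker = VMClass(cpu_cores=4, memory_gib=16)
    headroom = QuotaHeadroom(cpu_cores=40, memory_gib=128)
    print(check_pool_max(6, worker, headroom))
    print(check_pool_max(10, worker, headroom))
```

In this example CPU would allow ten nodes but memory only eight, so memory is the binding constraint. Checking a single dimension is an easy way to end up with a max the quota cannot meet.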

How the autoscaler chooses which pool to grow

When you have more than one node pool, the autoscaler only considers pools whose nodes can actually satisfy the pending pods. If a pod requests a GPU, only the GPU pool can host it, so that is the pool that scales. If a pod just needs CPU and memory, several pools may qualify, and the autoscaler's expander setting decides between them; the least-waste expander, for example, grows the pool that wastes the least capacity for the pending workload. Check which expander your install uses rather than assuming. This matters because it gives you a lever: by shaping pods with the right resource requests, node selectors and taints, you steer which pool absorbs which growth. Taint and label your expensive pools (GPU, high-memory) so that only the workloads that truly need them can trigger their growth, or you will watch a mislabelled batch job scale up a pool of GPU nodes nobody intended to pay for.
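
To make the pool-selection lever concrete, here is a simplified sketch of "filter to the pools that fit, then prefer the least wasted capacity". It is not the autoscaler's real expander code: the Pool and Pod types, the single boolean taint and the waste measure (unused CPU fraction, then unused memory fraction as a tie-break) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Pool:
    name: str
    cpu: float
    memory_gib: float
    gpus: int = 0
    tainted: bool = False  # stands in for a taint reserving the pool


@dataclass(frozen=True)
class Pod:
    cpu: float
    memory_gib: float
    gpus: int = 0
    tolerates_taint: bool = False


def fits(pod: Pod, pool: Pool) -> bool:
    """A pool is a candidate only if one of its nodes could host the pod."""
    if pool.tainted and not pod.tolerates_taint:
        return False
    return (pod.cpu <= pool.cpu and pod.memory_gib <= pool.memory_gib
            and pod.gpus <= pool.gpus)


def waste(pod: Pod, pool: Pool) -> Tuple[float, float]:
    """Unused CPU fraction first, unused memory fraction as tie-break."""
    return (1 - pod.cpu / pool.cpu, 1 - pod.memory_gib / pool.memory_gib)


def pick_pool(pod: Pod, pools: List[Pool]) -> Optional[Pool]:
    candidates = [p for p in pools if fits(pod, p)]
    return min(candidates, key=lambda p: waste(pod, p)) if candidates else None


if __name__ == "__main__":
    general = Pool("general", cpu=4, memory_gib=16)
    highmem = Pool("highmem", cpu=8, memory_gib=64)
    open_gpu = Pool("gpu", cpu=16, memory_gib=64, gpus=1)
    big_batch = Pod(cpu=12, memory_gib=48)  # needs no GPU at all

    # Without a taint, the oversized batch pod grows the GPU pool.
    print(pick_pool(big_batch, [general, highmem, open_gpu]).name)
```

With the GPU pool tainted, the same oversized batch pod has no candidate pool and stays pending, which is the cheaper failure: you find out the request is wrong instead of paying for GPU nodes.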

The corollary is that resource requests are not a formality. The autoscaler reasons entirely from requests, not actual usage, so a pod that requests far more than it uses makes the cluster scale up for capacity it never consumes, and a pod that requests too little gets packed onto a node and then starved. Right-sizing requests is half of making autoscaling behave, and it is the half developers most often skip.
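
A quick back-of-the-envelope calculation shows how much requests drive node count. This sketch assumes CPU is the only constraint and that every pod is identical; the replica count, request sizes and the 3500m allocatable figure are made-up numbers, and real scheduling also accounts for memory, daemon sets and other pods on the node.

```python
import math


def nodes_needed(replicas: int, request_millicores: int,
                 node_allocatable_millicores: int) -> int:
    """Nodes required when pods are packed purely by CPU request."""
    pods_per_node = node_allocatable_millicores // request_millicores
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(replicas / pods_per_node)


if __name__ == "__main__":
    # 40 replicas that each really use about 250m of CPU.
    padded = nodes_needed(40, request_millicores=1000,
                          node_allocatable_millicores=3500)
    right_sized = nodes_needed(40, request_millicores=250,
                               node_allocatable_millicores=3500)
    print(padded, right_sized)  # prints: 14 3
```

Same workload, same actual usage, and the padded request asks the autoscaler for more than four times as many VMs.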

Tuning scale-down so it actually reclaims

Scale-up tends to just work; scale-down is where the autoscaler quietly underperforms, and almost always for the same reasons. A node will only be removed when its pods can be rescheduled elsewhere and no pod disruption budget blocks the eviction. So a single replica with a strict PDB, a pod with local storage the scheduler will not move, or a system daemon without a tolerant budget can pin an otherwise idle node indefinitely. The cluster looks like the autoscaler is broken; really it is doing exactly what it was told and refusing to violate a constraint you set. The fix is to audit pod disruption budgets and make sure batch and stateless workloads can actually be evicted, and to give system add-ons sane budgets rather than implicit ones that block drains.
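
When auditing, the arithmetic behind a blocking budget is simple enough to sketch. For a budget expressed as an integer minAvailable, the evictions currently allowed are the healthy pods minus that minimum. The helper below treats a workload as a drain blocker when that number is zero; the workload names and the Budget type are hypothetical, and the real autoscaler applies further checks (local storage, unmanaged pods and so on) that this ignores.

```python
from typing import Dict, List, NamedTuple


class Budget(NamedTuple):
    healthy_pods: int
    min_available: int


def allowed_disruptions(budget: Budget) -> int:
    """Evictions the budget permits right now (never negative)."""
    return max(0, budget.healthy_pods - budget.min_available)


def drain_blockers(workloads_on_node: List[str],
                   budgets: Dict[str, Budget]) -> List[str]:
    """Workloads on the node whose budget allows no eviction at all."""
    return [name for name in workloads_on_node
            if name in budgets and allowed_disruptions(budgets[name]) == 0]


if __name__ == "__main__":
    budgets = {
        "checkout": Budget(healthy_pods=3, min_available=2),
        "legacy-api": Budget(healthy_pods=1, min_available=1),
    }
    # batch-job has no budget, so it never blocks in this model.
    print(drain_blockers(["checkout", "legacy-api", "batch-job"], budgets))
```

The single-replica workload with minAvailable of 1 is the classic pin: it can never lose a pod, so whichever node it lands on can never be drained.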

The scale-down delay and utilisation threshold are also worth tuning to your workload’s rhythm. Set them too aggressive and the cluster thrashes, adding and removing nodes as load wobbles, which on VKS means constantly provisioning and destroying VMs. Set them too conservative and you pay for idle nodes long after the peak passed. Match the timing to how spiky your real traffic is, and remember that on VKS a node is a VM with a real provisioning cost, so a little hysteresis is cheaper than churn.
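
The effect of the delay is easy to see in a small simulation. This is a deliberately crude model of one node, assuming it is removed after its utilisation has stayed below the threshold for a set number of consecutive samples and is re-provisioned as soon as demand returns; the utilisation series, threshold and delay values are invented for illustration.

```python
from typing import Sequence


def count_removals(utilisation: Sequence[float], threshold: float,
                   delay_ticks: int) -> int:
    """How many times the node would be torn down over the series."""
    removals = 0
    below = 0
    present = True
    for u in utilisation:
        if u >= threshold:
            present = True  # demand is back: the node is (re)provisioned
            below = 0
        elif present:
            below += 1
            if below >= delay_ticks:
                removals += 1
                present = False
                below = 0
    return removals


if __name__ == "__main__":
    # Load wobbles around the threshold, then genuinely drops off.
    wobble = [0.7, 0.3, 0.7, 0.3, 0.7, 0.3, 0.3, 0.3, 0.3, 0.3]
    print(count_removals(wobble, threshold=0.5, delay_ticks=1))  # prints: 3
    print(count_removals(wobble, threshold=0.5, delay_ticks=3))  # prints: 1
```

With no patience the node is destroyed three times, and rebuilt twice in between; with a three-sample delay it is removed once, after the load has really gone. Each of those avoided cycles is a VM you did not provision and delete.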

Pairing it with the HPA, and the metrics it needs

The Cluster Autoscaler only does half the job. It adds nodes when pods cannot fit, but it does not create the pods; that is the Horizontal Pod Autoscaler’s role, and the two have to be wired together to get real elasticity. The HPA scales replicas based on metrics, so it needs a working metrics pipeline: the metrics server for basic CPU and memory, or a custom and external metrics adapter (often fed from Prometheus) for anything more interesting like queue depth or requests per second. Without that pipeline the HPA has nothing to act on, and the Cluster Autoscaler then has no surge of pending pods to respond to, so the cluster sits at its floor while the workload struggles.
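
It helps to see what the HPA does with a metric once it has one. The Kubernetes documentation describes the core calculation as desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default). The sketch below reproduces just that arithmetic; the function name and the sample numbers are assumptions, and it leaves out details such as unready pods and missing metrics.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int,
                         max_replicas: int, tolerance: float = 0.1) -> int:
    """Core HPA arithmetic, clamped to the configured replica bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% CPU against a 60% target.
    print(hpa_desired_replicas(4, 90, 60, min_replicas=2, max_replicas=20))
    # 63% against 60% is inside the tolerance band.
    print(hpa_desired_replicas(4, 63, 60, min_replicas=2, max_replicas=20))
```

If the metric is missing, there is no ratio to compute, which is why a broken metrics pipeline leaves the replica count, and therefore the node count, exactly where it was.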

So the chain you actually want is: real metrics feed the HPA, the HPA adds pods, pods that do not fit make the Cluster Autoscaler add nodes, and the whole thing reverses cleanly as load falls. Tune the HPA so it does not flap, because a flapping HPA makes the Cluster Autoscaler churn VMs, which is the expensive failure mode. Scale on a metric that reflects actual demand rather than a lagging proxy, give the HPA a stabilisation window, and the node-level and pod-level autoscalers will cooperate instead of fighting.
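
A stabilisation window is the main anti-flap tool, and the idea fits in a few lines. The HPA's scale-down stabilisation acts on the highest recommendation seen within the window, so replicas only drop once the lower value has persisted. The class below is a simplified illustration that counts samples rather than seconds; the class name and window sizes are assumptions.

```python
from collections import deque
from typing import Deque, List


class ScaleDownStabilizer:
    """Acts on the highest recent recommendation, which delays scale-down."""

    def __init__(self, window_samples: int) -> None:
        self._recent: Deque[int] = deque(maxlen=window_samples)

    def recommend(self, desired_replicas: int) -> int:
        self._recent.append(desired_replicas)
        return max(self._recent)


def smooth(raw: List[int], window_samples: int) -> List[int]:
    stabilizer = ScaleDownStabilizer(window_samples)
    return [stabilizer.recommend(r) for r in raw]


if __name__ == "__main__":
    flapping = [6, 3, 6, 3, 6, 3]
    print(smooth(flapping, window_samples=1))  # prints: [6, 3, 6, 3, 6, 3]
    print(smooth(flapping, window_samples=3))  # prints: [6, 6, 6, 6, 6, 6]
    print(smooth([6, 3, 3, 3, 3], window_samples=3))  # prints: [6, 6, 6, 3, 3]
```

Scale-ups still pass through immediately, because a higher value becomes the new maximum at once. Only the drops wait, so the Cluster Autoscaler never sees the pod count collapse and rebound within a few samples.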

What I’d Do

I run the Cluster Autoscaler and the HPA together on anything with variable load, and I tune the HPA’s thresholds so it does not flap and make the CA churn VMs. I set the node-pool minimum to real steady-state so bursts are not waiting on a cold node boot, and I set the maximum only after checking it against the namespace quota and host headroom, never as an aspiration. I audit pod disruption budgets so scale-down is not silently pinned by a single-replica workload. Done well, autoscaling turns fixed pools into elastic ones and nobody notices it working, which is the goal. Looking at your busiest cluster: if it hit its autoscaler maximum tomorrow, would the quota actually let it grow, or would the pods just queue?

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
