Upgrading VKS: Kubernetes Versions, the Service and Rolling Node Replacement (VKS Series, Part 12)

VKS upgrades decouple the platform from the cluster and roll nodes one at a time. Here is the model, the version-skew gating, and why most stuck upgrades are preconditions, not the engine.

by

Dr. Pranay Jha

June 19, 2026



VKS Series · Part 12 of 17

TL;DR · Key Takeaways

Two tracks upgrade separately: the Supervisor and VKS service, and the Kubernetes version of each workload cluster. New K8s versions arrive as VKr images in your content library, decoupled from the Supervisor update.
Cluster upgrades use rolling node replacement: control-plane nodes are replaced one at a time keeping quorum, then workers, for minimal downtime.
Respect version skew and run the prechecks. VKS blocks incompatible jumps and supports rollback if an upgrade does not succeed.
The re-engineered 9.1 Supervisor raised the ceilings hard: parallel upgrade throughput went from 64 to 256 clusters at once, provisioning is roughly 70% faster via linked clones, and a Supervisor scales to 500 clusters.

Who this is for: whoever owns the maintenance window and has to keep clusters current without an outage. Prerequisites: a maintained VKr content library (Part 3) and clusters built with headroom (Part 5).

Upgrades are where managed Kubernetes earns or loses your trust. Done well they are boring; done badly they are an outage. VKS leans toward boring: its rolling replacement model and version gating are built to keep clusters available through a Kubernetes bump. But there are moving parts, and a stuck upgrade is a real failure mode, real enough that I wrote a whole post on why VKS upgrades get stuck. This part explains the model so that troubleshooting makes sense, and so you build clusters that roll cleanly in the first place.

Two upgrade tracks, deliberately decoupled

Two things upgrade, and conflating them causes confusion. One is the platform: the Supervisor and the VKS service running on it. The other is the Kubernetes version of each workload cluster. Since vSphere 8.0 Update 2, the delivery of new Kubernetes releases has been detached from the Supervisor update mechanism and handled through the service as vSphere Kubernetes releases (VKr). That decoupling is the point: new Kubernetes versions can flow into your content library as VKr images without upgrading the whole Supervisor first, and you might refresh Supervisor Kubernetes versions a few times a year to keep pace with the upstream cadence. So a cluster upgrade is usually: confirm the target VKr is in the library, confirm compatibility, then change the cluster’s version. The service handles the rest.

The rolling replacement model

VKS does not upgrade a node in place; it replaces it. During a cluster upgrade, the control plane is updated by adding a new, updated control-plane VM, migrating state to it, and removing an old one. This happens one node at a time, so a three-node control plane keeps quorum throughout. Once the control plane is done, workers are replaced the same way: a new node on the target version joins, an old node is drained and removed, and the cycle repeats across the pool. Because it is rolling, a properly built cluster (three control-plane nodes, enough worker headroom, multiple replicas of each app) stays available the whole time.

Each node is replaced, not patched in place, so there is always somewhere for workloads to run.
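The add-then-remove sequence described above can be sketched as a small simulation. This is only an illustration of the ordering, not how VKS implements it: the function name `rolling_replace` and the version strings are hypothetical, and each node is reduced to the version it runs.

```python
from typing import List, Tuple


def rolling_replace(node_versions: List[str], target: str) -> List[Tuple[str, ...]]:
    """Simulate replacing nodes one at a time: add a new node, then remove an old one.

    Returns the node set after every step so the sequence can be inspected.
    """
    nodes = list(node_versions)
    steps = [tuple(nodes)]
    for old_version in node_versions:
        if old_version == target:
            continue                 # already on the target, nothing to replace
        nodes.append(target)         # a new node on the target version joins first
        steps.append(tuple(nodes))
        nodes.remove(old_version)    # only then is one old node removed
        steps.append(tuple(nodes))
    return steps


if __name__ == "__main__":
    # Illustrative version strings, not taken from any real VKr catalogue.
    for step in rolling_replace(["v1.32", "v1.32", "v1.32"], "v1.33"):
        print(len(step), step)
```

In this sketch the node count never falls below the starting count, which is the property that lets a three-node control plane or a worker pool with headroom stay available during the roll.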

Version skew, prechecks, and the 9.1 ceilings

Kubernetes only tolerates a bounded version skew, and you cannot leap arbitrarily many minor versions at once. VKS enforces this: it verifies cluster compatibility for updates and runs precheck conditions before it begins, blocking an upgrade that would violate the supported skew or land on an incompatible release. You step through versions in sequence rather than jumping, and if an upgrade does not succeed the model supports rollback. The better posture is to pass the prechecks first: confirm the target VKr is present, the cluster is healthy, and there is enough capacity headroom to roll a node without stranding workloads.
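To make "step through versions in sequence" concrete, here is a minimal sketch that expands a current and target version into the minor-version hops in between. The helper names are hypothetical, and the one-minor-at-a-time rule is an assumption drawn from the upstream Kubernetes upgrade convention; the skew VKS actually enforces for a given VKr should be read from its compatibility information.

```python
from typing import List, Tuple


def parse_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.32.0' or '1.32'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str) -> List[str]:
    """List the minor versions to pass through, one hop per minor release."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("major version changes are outside this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not modelled here")
    return [f"v{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    # Illustrative versions only.
    print(upgrade_path("v1.30.1", "v1.33.0"))
```

A path with more than one entry means more than one upgrade cycle, and each hop needs its own VKr in the content library.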

It is also worth knowing how much room you have. The re-engineered 9.1 Supervisor lifted the ceilings hard: up to 500 clusters per Supervisor, host capacity doubled to 5,000, parallel upgrade throughput raised from 64 to 256 clusters at once, and provisioning roughly 70% faster through linked-clone technology. Most shops will not approach those numbers, but the parallel-upgrade jump is the one that matters operationally, because cluster fleets that used to take a weekend to roll now fit inside a maintenance window.
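A quick back-of-the-envelope calculation shows why the parallel-upgrade number matters. The sketch below counts how many sequential batches a fleet needs at a given parallelism limit. It assumes every batch takes about the same time, which is a simplification, and `batches_needed` is a hypothetical helper.

```python
import math


def batches_needed(clusters: int, parallel_limit: int) -> int:
    """Number of sequential batches required to upgrade `clusters` clusters."""
    if clusters < 0 or parallel_limit < 1:
        raise ValueError("clusters must be >= 0 and parallel_limit >= 1")
    return math.ceil(clusters / parallel_limit)


if __name__ == "__main__":
    fleet = 500  # the per-Supervisor ceiling quoted above
    print(batches_needed(fleet, 64))   # 8 batches at the old limit
    print(batches_needed(fleet, 256))  # 2 batches at the 9.1 limit
```

At the quoted ceilings, a full 500-cluster Supervisor drops from eight sequential batches to two.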

Most stuck upgrades are preconditions, not the engine: no spare capacity to place the new node, a pod disruption budget that blocks the drain, or a single-replica workload that refuses to move. Build clusters that can roll (three control-plane nodes, headroom, replicated apps) and the upgrade becomes the non-event it should be.
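The three preconditions above are easy to check before the window opens. The following is a minimal sketch over hand-entered data; `Workload`, `roll_blockers` and the field names are hypothetical, and a real check would read replica counts and pod disruption budgets from the cluster rather than from literals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workload:
    name: str
    replicas: int
    pdb_min_available: Optional[int] = None  # None means no PDB is defined


def roll_blockers(workloads: List[Workload], spare_node_slots: int) -> List[str]:
    """Return human-readable reasons a rolling upgrade is likely to stall."""
    problems = []
    if spare_node_slots < 1:
        problems.append("no spare capacity to place the new node")
    for w in workloads:
        if w.replicas < 2:
            problems.append(f"{w.name}: single replica, a drain interrupts it")
        # A PDB whose minAvailable equals or exceeds the replica count
        # leaves zero allowed disruptions, so eviction during drain is refused.
        if w.pdb_min_available is not None and w.pdb_min_available >= w.replicas:
            problems.append(f"{w.name}: PDB allows zero disruptions, drain will block")
    return problems


if __name__ == "__main__":
    fleet = [Workload("web", 3, 2), Workload("db", 1, 1)]
    for problem in roll_blockers(fleet, spare_node_slots=0):
        print(problem)
```

An empty list is not proof that the upgrade will succeed, only that none of these three common blockers is present.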

What happens to etcd during a control-plane roll

The part of a control-plane upgrade worth understanding is what happens to etcd, because that is the cluster’s memory and the thing you least want to mishandle. As each new control-plane VM joins, it becomes a member of the etcd cluster and state is replicated to it; as the old member leaves, the quorum shifts to the new set. With three control-plane nodes the cluster keeps a quorum throughout, so the API never goes dark. With a single control-plane node there is no quorum to keep, which is one more reason single-control-plane clusters are for throwaway work only: their upgrade is not smooth, and a failure mid-roll has nowhere to fall back to. The system manages all of this, but knowing it is happening explains why control-plane rolls are the slower, more careful half of an upgrade and why etcd needs the fast storage the sizing part argued for.
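The quorum arithmetic behind that paragraph is short enough to write down. This sketch uses the standard majority rule for a Raft-based store such as etcd; the function names are mine, not part of any etcd or VKS API.

```python
def quorum(members: int) -> int:
    """Smallest majority of `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    # 1 = single control-plane node, 3 = steady state,
    # 4 = mid-roll, after the new member joins and before the old one leaves.
    for size in (1, 3, 4):
        print(size, quorum(size), failures_tolerated(size))
```

A single member tolerates zero failures, which is the arithmetic reason a single-control-plane upgrade has nowhere to fall back to. Three members tolerate one, and so does the temporary four-member set during the roll.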

It also explains a class of stuck upgrade: if etcd is under memory or disk-latency pressure, the new member can struggle to catch up, and the roll slows or stalls not because the upgrade logic failed but because the data layer cannot keep pace. A control plane that was comfortable at idle can be marginal under the extra load of a member join on a busy cluster. Headroom on the control plane is not just for steady state; it is what lets the upgrade move.

Add-ons, CNI and the compatibility you have to check

A Kubernetes version bump is not just the core; everything you layered on top has to survive it too. The CNI, the GPU operator, ingress controllers, service meshes, storage drivers and CRDs all have version compatibility windows with Kubernetes, and a minor-version jump can move an API a controller depends on or deprecate one it uses. The platform gates the core skew for you, but it does not know about the third-party operator your team installed last quarter. So the precheck that actually prevents surprises is yours to run: inventory the add-ons on each cluster, confirm each supports the target Kubernetes version, and update the ones that do not before, not during, the upgrade. The classic post-upgrade incident is an ingress controller or an operator that quietly stops reconciling because an API it used was removed, and the cluster looks healthy while a whole capability is silently dead.
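One way to run that inventory is to record, for each add-on, the range of Kubernetes minor versions its vendor supports, then test the target against it. The sketch below does this. The add-on names, versions and ranges are invented for illustration, and real ranges must come from each project's compatibility matrix.

```python
from dataclasses import dataclass
from typing import List, Tuple

Minor = Tuple[int, int]  # (major, minor), e.g. (1, 33)


@dataclass
class Addon:
    name: str
    version: str
    min_k8s: Minor  # oldest supported Kubernetes minor, per the vendor matrix
    max_k8s: Minor  # newest supported Kubernetes minor, per the vendor matrix


def incompatible_addons(addons: List[Addon], target: Minor) -> List[str]:
    """Names of add-ons whose supported range does not include the target."""
    return [a.name for a in addons if not (a.min_k8s <= target <= a.max_k8s)]


if __name__ == "__main__":
    # Hypothetical inventory; none of these ranges describe real products.
    inventory = [
        Addon("example-ingress", "2.4.0", (1, 29), (1, 33)),
        Addon("example-operator", "0.9.1", (1, 28), (1, 32)),
    ]
    print(incompatible_addons(inventory, (1, 33)))
```

Anything this returns is an add-on to update before the cluster upgrade, not during it.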

Bake this into the runbook as a named step. A test cluster carrying the same add-on stack, upgraded first, is the cheapest way to find these breaks before they hit production. If you run GitOps, the add-on versions live in Git, which makes the inventory trivial and the test-cluster rehearsal a matter of pointing the same manifests at a scratch cluster on the new version.

Upgrading a fleet, not a cluster

One cluster is easy; thirty is a programme, and this is where the 9.1 throughput numbers stop being trivia. With parallel upgrade capacity raised from 64 to 256 clusters at once, a fleet that used to roll over a weekend can fit inside a maintenance window, but only if you have sequenced it. Decide the order deliberately: non-production first to shake out add-on breaks, then production in waves grouped so that no single wave can take out a whole tier of a service. Confirm the target VKr is in the content library before you start, because nothing rolls without it, and stage the platform-level updates (Supervisor, service) separately from the per-cluster Kubernetes bumps so you are never changing two layers at once in the same window.

The discipline that makes fleet upgrades boring is treating them as a pipeline rather than a heroic push: test cluster, non-prod wave, prod waves, each gated on the previous one passing. The parallelism is there to make the waves fast, not to tempt you into upgrading everything at once. A fleet rolled in sequenced waves with prechecks at each gate is the difference between a routine quarterly task and the kind of all-nighter that makes people afraid of upgrades and lets clusters drift dangerously out of support.
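The gated pipeline can be sketched as a loop that stops at the first wave with a failure. This is a minimal illustration under stated assumptions: `run_waves` is a hypothetical helper, and `upgrade_ok` stands in for whatever actually triggers and verifies a cluster upgrade in your tooling.

```python
from typing import Callable, List, Tuple

Wave = Tuple[str, List[str]]  # (wave name, cluster names in that wave)


def run_waves(waves: List[Wave], upgrade_ok: Callable[[str], bool]) -> List[str]:
    """Upgrade wave by wave; a wave only starts if the previous one fully passed.

    Returns the names of the waves that completed successfully.
    """
    completed = []
    for name, clusters in waves:
        # Every cluster in the wave is attempted (this is where parallelism
        # would apply); the gate is evaluated once the wave has finished.
        results = [upgrade_ok(cluster) for cluster in clusters]
        if not all(results):
            break  # gate failed: later waves are left untouched
        completed.append(name)
    return completed


if __name__ == "__main__":
    plan = [
        ("test", ["scratch-1"]),
        ("non-prod", ["dev-1", "stage-1"]),
        ("prod-a", ["prod-1", "prod-2"]),
    ]
    # Stand-in result: pretend stage-1 fails its post-upgrade check.
    print(run_waves(plan, lambda cluster: cluster != "stage-1"))
```

With the stand-in failure in the non-production wave, only the test wave is reported as complete and production is never touched, which is exactly what the gate is for.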

What I’d Do

I keep the content library current so the VKr I want is always there, and I treat platform upgrades and cluster upgrades as separate, scheduled activities rather than one big bang. Before any cluster upgrade I run the prechecks and confirm three things: target VKr present, cluster healthy, capacity headroom to roll a node. I design clusters to roll from the start (three control-plane nodes, replicated apps, sane pod disruption budgets) because that single discipline turns upgrades from incidents into background tasks. And I lean on the 9.1 parallel-upgrade headroom to fit fleet rollouts into a real window instead of a lost weekend. Pick your most important cluster: if you triggered a Kubernetes upgrade right now, is there room to add one more node, or would the first drain stall on a budget nobody has looked at in months?


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
