TL;DR — Key Takeaways
- A VKS (vSphere Kubernetes Service) upgrade that stalls in VCF 9 is rarely a Kubernetes bug — it is almost always version compatibility, a blocked node drain, or control-plane health.
- Compatibility first: a SystemChecksSucceeded precheck failure means the target vSphere Kubernetes release (VKr) is not a valid hop from where you are.
- The #1 stall is a PodDisruptionBudget with zero allowed disruptions, which pins a node in Ready,SchedulingDisabled forever.
- Control plane done but workers stuck? Look at MachineDeployment version and whether the cluster has spare capacity to roll a new node.
- When all else looks healthy, suspect etcd member health or leftover Pinniped-Concierge pods (VKS 3.3+).
kubectl against a Supervisor. Prerequisites: Supervisor context access and a cluster mid-upgrade or refusing to start one.
Provisioning a fresh VKS cluster is the easy part — the failure modes there are well known, and I covered them in 5 things that break VKS clusters in VCF 9. Upgrades are a different animal. A rolling upgrade replaces every node in place, drains workloads, and respects the guarantees your applications asked for — so it can wedge in places a clean build never touches. When the progress bar freezes at 40%, the instinct is to blame Kubernetes. It almost never is. Here are the five things that actually stall a VKS cluster upgrade in VCF 9, and how to clear each one.
1. Incompatible target version — the upgrade never even starts
If the update refuses to initiate, as opposed to starting and hanging, you are looking at a compatibility precheck, not a runtime fault. VKS reports that the update cannot be initiated, with the condition SystemChecksSucceeded set to False. The usual cause is that the vSphere Kubernetes release (VKr) you are targeting is not a legal single hop from your current one, or the Supervisor itself is too old to serve it. As a concrete 2026 example: VKS 3.6 requires Supervisor Kubernetes 1.30 or later, drops compatibility for VKr 1.31, and expects every cluster to already be on at least VKr 1.32 before you move. In addition, you can upgrade to VKS 3.6 directly only if the installed VKS version is at least 3.3. Skip a rung on that ladder and the precheck blocks you.
# From the Supervisor context, list the releases the Supervisor offers
kubectl get tanzukubernetesrelease
# Read the resolve-tkr annotation on a specific cluster
kubectl get cluster <cluster-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.run\.tanzu\.vmware\.com/resolve-tkr}'
# Read why the precheck failed
kubectl describe cluster <cluster-name> -n <namespace> | grep -A5 SystemChecks
The fix is to walk the ladder one supported VKr at a time rather than jumping to the newest. Always validate the hop against the Broadcom interoperability matrix and the “Verify VKS Cluster Compatibility for Updates” checklist before you touch production.
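To make the "one rung at a time" idea concrete, here is a minimal sketch of a pre-flight check you could run before requesting an upgrade. It assumes the rule is "same major version, at most one minor version newer", which is the general Kubernetes upgrade convention; the function name is hypothetical, and the interoperability matrix remains the real authority on which hops are supported.

```shell
# Hypothetical helper, not part of VKS or kubectl.
# Succeeds (exit 0) only if TARGET has the same major version as CURRENT and
# is the same minor version or exactly one minor version newer.
# Accepts strings such as "v1.32.0" or "1.33.1"; everything after the minor
# number is ignored.
is_single_minor_hop() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN {
    sub(/^v/, "", cur); sub(/^v/, "", tgt)   # drop an optional leading "v"
    split(cur, c, "."); split(tgt, t, ".")   # c[1]=major, c[2]=minor
    diff = t[2] - c[2]
    exit !(c[1] == t[1] && (diff == 0 || diff == 1))
  }'
}

# Example use:
# is_single_minor_hop v1.32.0 v1.33.1 && echo "hop looks plausible"
# is_single_minor_hop v1.31.4 v1.33.1 || echo "skips a minor version"
```

A passing check only tells you the hop is not obviously skipping a minor version. It says nothing about Supervisor or VKS version requirements, which still need to be verified separately.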
2. A PodDisruptionBudget pins a node and the drain never finishes
This is the single most common reason a started upgrade hangs. A rolling upgrade cordons a node and drains it before deleting it. If a workload on that node is protected by a PodDisruptionBudget (PDB) whose allowed disruptions is zero, the eviction can never succeed, the node sits in Ready,SchedulingDisabled and then Deleting, and the whole rollout waits on it. Note the behavior change: starting in VKS 3.3, if a node does not drain within the cluster’s configured node-drain timeout, the upgrade does not push past it. A frequent culprit is Gatekeeper or Pinniped pods carrying restrictive PDBs.
# Inside the GUEST cluster, find PDBs with zero allowed disruptions
kubectl get pdb -A
# Look at the ALLOWED DISRUPTIONS column — any 0 is a suspect
# Confirm which node is wedged
kubectl get nodes | grep SchedulingDisabled
The correct fix is to work with the application owner to make the PDB tolerant (raise maxUnavailable or scale the workload so one pod can move). If the PDBs are genuinely correct and you have verified it, VKS 3.5.0 and later expose a deliberately scary escape hatch — the cluster annotation kubernetes.vmware.com/dangerous-skip-pdb-check-for-update: "true" — which skips the PDB gate. The word dangerous is in the key for a reason; only reach for it when you understand the disruption you are allowing.
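Reading the ALLOWED DISRUPTIONS column by eye gets tedious on a busy cluster, so here is a minimal sketch that prints only the PDBs currently allowing zero disruptions. The function name is hypothetical. It relies on the standard status.disruptionsAllowed field of a PodDisruptionBudget and on kubectl's custom-columns output, which prints <none> when a field is unset.

```shell
# Hypothetical helper: reads rows of "namespace name disruptionsAllowed" on
# stdin and prints namespace/name for every PDB that allows zero disruptions.
# Rows where the field is unset (shown as <none>) are ignored.
pdbs_blocking_drain() {
  awk 'NF >= 3 && $3 == "0" { print $1 "/" $2 }'
}

# Usage inside the guest cluster (commented out so the block runs anywhere):
# kubectl get pdb -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | pdbs_blocking_drain
```

Any PDB it prints is a candidate, not a verdict: a zero only blocks the upgrade if one of the protected pods runs on the node being drained.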
3. Control plane upgraded, worker nodes stuck behind it
A classic half-finished state: kubectl get cluster shows the control plane on the new version, but the worker MachineDeployment is stuck on the old one and no new worker VM appears. Two things commonly cause this. First, the MachineDeployment version was never reconciled to the target — the control plane rolled but the worker spec did not follow. Second, and more often, the cluster has no room to roll a new node: a rolling update creates a replacement node before deleting the old one, so it needs spare host capacity, a free IP in the workload network, and an available VM class. Run out of any of those and the new Machine sits in Provisioning indefinitely.
# From the Supervisor context, compare control plane vs worker versions
kubectl get machinedeployment -n <namespace>
kubectl get machine -n <namespace> -o wide
# A Machine stuck in Provisioning? Read its events for the real reason
kubectl describe machine <machine-name> -n <namespace>
Free up capacity (reclaim hosts, widen the network pool, confirm the VM class exists and is assigned to the namespace) and the rollout resumes on its own. If you are unsure how the Supervisor, MachineDeployments, and guest nodes relate, my breakdown of VMware Tanzu in vSphere terms maps each Kubernetes object to the vSphere construct underneath it.
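To see at a glance which Machines are holding the rollout back, a minimal sketch along these lines can help. It assumes the Cluster API Machine fields status.phase and spec.version; the function name and the target version argument are hypothetical, so pass whatever version string your Machines actually report.

```shell
# Hypothetical helper: reads rows of "name phase version" on stdin and prints
# every Machine that is either not Running or not yet on the target version.
#   $1  target version string exactly as it appears in .spec.version
machines_behind() {
  awk -v target="$1" '
    NF >= 3 && ($2 != "Running" || $3 != target) {
      print $1 ": phase=" $2 ", version=" $3
    }'
}

# Usage from the Supervisor context (commented out so the block runs anywhere):
# kubectl get machine -n <namespace> --no-headers \
#   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,VERSION:.spec.version \
#   | machines_behind <target-version>
```

Old workers that are still Running on the previous version show up too, which is intentional: together with a replacement stuck in Provisioning they show how far the worker roll has progressed.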
4. etcd or control-plane health quietly halts the rollout
Cluster API will not roll a control plane node while the control plane is unhealthy. That is a sane safeguard, but it produces a frustratingly silent stall. A known signature is the upgrade reporting EtcdMemberHealthy as unknown: a previous node replacement left an orphaned etcd member, quorum looks ambiguous, and the KubeadmControlPlane refuses to proceed until it is resolved. Until etcd reports healthy, nothing moves.
# From the Supervisor context, inspect control-plane conditions
kubectl get kubeadmcontrolplane -n <namespace>
kubectl describe kubeadmcontrolplane <kcp-name> -n <namespace> \
| grep -A3 -i etcd
If you find an orphaned or unhealthy etcd member, this is squarely an “open a support case before you improvise” situation, because etcd surgery on a live control plane is how clusters get permanently lost. Confirm backups exist first, then follow Broadcom’s guidance to reconcile membership.
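Because this stall is silent, it helps to print only the conditions that are not True instead of grepping through a full describe. The following is a minimal, read-only sketch; the function name is hypothetical, and it assumes the KubeadmControlPlane exposes the usual status.conditions list with type, status, and reason fields.

```shell
# Hypothetical helper: reads tab-separated "type<TAB>status<TAB>reason" rows
# on stdin and prints every condition whose status is not "True".
conditions_not_true() {
  awk -F '\t' 'NF >= 2 && $2 != "True" {
    reason = ($3 != "") ? " (" $3 ")" : ""
    print $1 "=" $2 reason
  }'
}

# Usage from the Supervisor context (commented out so the block runs anywhere):
# kubectl get kubeadmcontrolplane <kcp-name> -n <namespace> \
#   -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' \
#   | conditions_not_true
```

Empty output means every reported condition is True. Anything listed as False or Unknown is worth attaching to the support case.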
5. Leftover Pinniped-Concierge pods block the rollout (VKS 3.3+)
On VKS 3.3 and higher there is a documented case where a workload-cluster upgrade stalls because of Pinniped-Concierge pods — the component that brokers identity into the guest cluster. Old Pinniped pods fail to terminate or reschedule cleanly during the node roll, the drain stalls on them, and the upgrade waits. It looks like a generic stuck-drain (issue #2) but the offending pods live in the Pinniped system namespace, so it is worth checking specifically before you go hunting through every application PDB.
# Inside the guest cluster, check the Pinniped components
kubectl get pods -A | grep -i pinniped
kubectl describe pod <pinniped-pod> -n <pinniped-namespace>
Follow the Broadcom KB for the exact remediation for your VKS build rather than force-deleting pods blindly, since Pinniped underpins cluster authentication and a careless delete can lock you out.
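Before opening the KB, it is useful to confirm that Pinniped pods really are the ones refusing to go away. Here is a minimal, read-only sketch that lists Pinniped pods which already carry a deletion timestamp, meaning they were asked to terminate but are still present. The function name is hypothetical, and it matches on "pinniped" in the namespace or pod name instead of assuming a specific namespace.

```shell
# Hypothetical helper: reads rows of "namespace name deletionTimestamp" on
# stdin and prints Pinniped pods that have been marked for deletion.
# kubectl prints <none> when .metadata.deletionTimestamp is unset.
pinniped_pods_terminating() {
  awk 'NF >= 3 && tolower($1 " " $2) ~ /pinniped/ && $3 != "<none>" {
    print $1 "/" $2 " deleting since " $3
  }'
}

# Usage inside the guest cluster (commented out so the block runs anywhere):
# kubectl get pods -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,DELETING:.metadata.deletionTimestamp \
#   | pinniped_pods_terminating
```

The sketch only reports; it deliberately deletes nothing, in line with the warning above about force-deleting Pinniped pods.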
A triage order that saves an evening
- Did the upgrade start? If it refused, it is compatibility — check VKr hops and Supervisor version (issue #1).
- Is a node stuck in SchedulingDisabled? Hunt PDBs with zero allowed disruptions, Pinniped first (issues #2 and #5).
- Control plane ahead of workers? Check MachineDeployment version and spare capacity / IPs / VM class (issue #3).
- Everything looks healthy but nothing moves? Inspect etcd and KubeadmControlPlane conditions (issue #4).
- Still stuck? Use the documented Restart a Failed VKS Cluster Upgrade procedure rather than re-issuing the upgrade.
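The first four questions in the list above can be written down as a tiny decision function, which is handy to paste into a runbook. This is only a sketch of the ordering: the function name, its yes/no arguments, and the wording of its output are all hypothetical.

```shell
# Hypothetical helper: answer three yes/no questions, get the first place to look.
#   $1  did the upgrade start?
#   $2  is any node stuck in SchedulingDisabled?
#   $3  is the control plane on a newer version than the workers?
triage() {
  if [ "$1" != "yes" ]; then
    echo "issue 1: check VKr hop and Supervisor version"
  elif [ "$2" = "yes" ]; then
    echo "issues 2 and 5: check PDBs with zero allowed disruptions, Pinniped first"
  elif [ "$3" = "yes" ]; then
    echo "issue 3: check MachineDeployment version, capacity, IPs, VM class"
  else
    echo "issue 4: inspect etcd and KubeadmControlPlane conditions"
  fi
}

# Example: the upgrade started, a node is cordoned, workers are not yet behind
# triage yes yes no
```

The order matters: each question is only asked once the earlier, cheaper checks have been ruled out.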
Final Thoughts
VKS upgrades stall in a small number of predictable places, and almost all of them are about movement — can a node drain, can a replacement be built, can the control plane prove it is healthy enough to roll. Get into the habit of asking “what is this upgrade waiting on?” instead of “what is broken?”, and most stuck-upgrade tickets resolve in minutes. One forward-looking note worth planning around: to avoid an unplanned rolling update on the way to VCF 9.1, get your VKS to 3.6.1 or later, retire legacy v1alpha3 clusters, and move clusters to ClusterClass 3.6.0 or later so every node adopts the updated machine-agent. For how VKS sits inside the broader platform, see my what’s new in VCF 9 overview.
References
- Broadcom TechDocs — Verify VKS Cluster Compatibility for Updates
- Broadcom KB — Upgrade Stuck due to Node Stuck Deleting caused by PodDisruptionBudget
- Broadcom KB — Workload Cluster Upgrade Stuck on VKS 3.3+ due to Pinniped-Concierge Pods
- Broadcom TechDocs — Restart a Failed VKS Cluster Upgrade
- Broadcom TechDocs — VMware vSphere Kubernetes Service Release Notes








