Troubleshooting VKS: The Failure Modes That Actually Bite (VKS Series, Part 15)

Most VKS failures fall into three families, and all three are diagnosed from the same place. Here is the triage habit that turns a stuck cluster into a named cause fast.

by

Dr. Pranay Jha

June 19, 2026



VKS Series · Part 15 of 17

TL;DR · Key Takeaways

Most VKS failures cluster into three families: provisioning that never finishes, networking that leaves things unreachable, and upgrades that stall mid-roll.
The fastest diagnosis path is the same every time: read the Cluster and Machine objects on the Supervisor, not just the symptoms inside the workload cluster.
Provisioning hangs are usually capacity, storage policy or content library. Networking hangs are usually CIDR overlap or a load balancer with no VIPs to give.
Upgrade stalls are usually a node that cannot drain (blocked by a pod disruption budget or an unevictable pod) or a cluster with no capacity to place the replacement node.

Who this is for: whoever gets paged when a cluster will not build, will not route, or will not upgrade. Prerequisites: the rest of the series. This part assumes you know the subsystems and only need to find the broken one fast.

By now you have seen each VKS subsystem in isolation. This part is about what happens when they fail in the real world, and how to find the cause quickly instead of guessing. The good news is that VKS failures are not mysterious once you know where to look; they recur in predictable families. I have written standalone deep dives on the things that break VKS clusters and why upgrades get stuck; this consolidates the method into one triage habit.

Start at the Cluster and Machine objects, every time

Whatever the symptom, the first move is identical: log in to the Supervisor, switch to the namespace, and read the Cluster and Machine objects. They carry the conditions that tell you which phase is stuck and why. A workload cluster that never went ready announces it here long before you find it by poking around inside the cluster.

$ kubectl get cluster,machine -n team-a
NAME                    PHASE          AGE
cluster/prod            Provisioning   21m

NAME                    PHASE          AGE
machine/prod-cp-abcde   Provisioning   21m   # control plane stuck
machine/prod-np1-fghij  Pending        21m   # worker cannot place

$ kubectl describe cluster prod -n team-a
  Conditions:
    Type    Reason                   Message
    Ready   WaitingForControlPlane   0/3 control plane machines ready

A machine stuck in Pending or Provisioning is the thread to pull. The condition message usually names the cause, or points you straight at the right place to look next. From there, the three families branch.
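On a namespace with many machines, the stuck ones are easier to spot if everything already Running is filtered out. The following is a minimal sketch, not part of VKS itself: stuck_machines is a hypothetical helper name, and it assumes the Machine objects expose their phase at .status.phase, as Cluster API Machines do.

```shell
# Keep only machines whose phase is not Running.
# Expects "NAME PHASE" rows on stdin, one machine per line, no header.
stuck_machines() {
  awk '$2 != "Running" { print $1 ": " $2 }'
}

# Typical use from the Supervisor context (namespace team-a as in the example above):
#   kubectl get machine -n team-a --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | stuck_machines
```

Anything this prints is a candidate for kubectl describe; an empty result means the problem is probably not at the machine level.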

The same starting point every time; the conditions tell you which of the three branches you are on.

Provisioning, networking, upgrade

When a new cluster will not become ready, the usual suspects are capacity, storage policy and the content library. Machines stuck in Pending often cannot place their VMs because there is no host headroom or the namespace quota is exhausted. A storage failure usually means the StorageClass points at an SPBM policy that matches no compatible datastore. And if the cluster will not build at all while the Supervisor is healthy, check that a synced VKr content library is associated with it (Part 3).

Networking failures take two shapes. A Service of type LoadBalancer stuck in <pending> with no external IP means the Supervisor’s load balancer has no VIP to give it, which usually points to an NSX or Avi misconfiguration (Part 7). Pods that run but cannot reach something they should, weeks after a clean build, are very often a sign of CIDR overlap: the pod or service range collides with a network the cluster needs to route to (Part 6). The tell is that internal traffic works but one specific external range does not.

$ kubectl get svc -n app
NAME     TYPE           EXTERNAL-IP   PORT(S)        AGE
web-lb   LoadBalancer   <pending>     80:31840/TCP   9m   # no VIP allocated
# -> check the Supervisor load balancer (NSX native / Foundation / Avi) and its IP pool
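For the CIDR-overlap case described above, the collision can be confirmed with plain arithmetic before anyone touches the network. This is a minimal POSIX shell sketch for IPv4 only; the function names are hypothetical, and the ranges in the usage note are made-up examples rather than VKS defaults.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Print "first last" (as integers) for a CIDR block such as 10.96.0.0/12.
cidr_range() (
  ip=${1%/*}
  bits=${1#*/}
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  first=$(( base & mask ))
  last=$(( first | (mask ^ 0xFFFFFFFF) ))
  echo "$first $last"
)

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidrs_overlap() (
  set -- $(cidr_range "$1") $(cidr_range "$2")
  # $1/$2 = first/last of block A, $3/$4 = first/last of block B
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)
```

For example, cidrs_overlap 10.96.0.0/12 10.100.0.0/16 succeeds because the second range sits inside the first, which is exactly the situation where internal traffic works but one external range is unreachable.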

A rolling upgrade that gets part-way and stops is almost always blocked on a drain. A node will not drain if evicting its pods would violate a pod disruption budget, or if a pod has nowhere to go, such as the pod of a single-replica StatefulSet or an unevictable pod. The replacement node also needs somewhere to be placed, so a cluster at full capacity has no room to roll. Check the machine mid-replacement, check the PDBs and confirm there is headroom. Clear the blocker and the roll resumes on its own.
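A quick way to check the PDB side of a stalled drain is to list the budgets that currently allow no disruptions at all. This is a minimal sketch run against the workload cluster, not the Supervisor; blocking_pdbs is a hypothetical helper name, and it reads the standard .status.disruptionsAllowed field of a PodDisruptionBudget.

```shell
# Keep only PodDisruptionBudgets that currently allow zero disruptions.
# Expects "NAMESPACE NAME ALLOWED" rows on stdin, no header.
blocking_pdbs() {
  awk '$3 == "0" { print $1 "/" $2 }'
}

# Typical use from the workload cluster context:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```

A budget listed here is not necessarily wrong, but any pod it covers on the draining node cannot be evicted until another replica is ready elsewhere.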

Diagnose from the Supervisor, not the symptom: nine out of ten VKS incidents announce their own cause in the Cluster and Machine conditions. The teams that struggle are the ones debugging from inside the workload cluster while the answer is sitting on the Supervisor.

A worked example: stuck at WaitingForControlPlane

Walk through the most common stuck-provisioning case to see the method in action. A new cluster sits at WaitingForControlPlane with zero control plane machines ready. You start, as always, at the Supervisor objects, and you find the control-plane machine stuck in Provisioning. The condition message is the next breadcrumb. If it complains about storage, the StorageClass is pointing at an SPBM policy that matches no compatible datastore: fix the policy or the class. If it complains that it cannot place the VM, you are out of capacity or quota: free some or raise the limit. If the machine provisions but never becomes ready, the node booted but cannot reach the Supervisor or pull its images, which points to DNS, the network path, or registry trust. Each branch is a different fix, and the condition told you which one before you touched anything.

# the machine is the thread; its events carry the real cause
kubectl describe machine prod-cp-abcde -n team-a

# common condition messages and what each one means
#   "no compatible datastore"      -> SPBM policy / StorageClass mismatch
#   "insufficient resources"        -> host capacity or namespace quota
#   "failed to pull image"          -> DNS, proxy, or registry trust

The discipline is to read the message and act on it, not to delete and recreate the cluster hoping it works the second time. Recreating without fixing the cause just reproduces the same stuck state and wastes ten minutes per attempt. The condition is the diagnosis; trust it.

Where the logs actually live

When the conditions are not enough, you need logs, and on VKS they live at several layers that are easy to confuse. The Cluster API and VKS controllers run on the Supervisor, so provisioning and lifecycle logs are there. The workload cluster’s own control plane components (API server, scheduler and controllers) log inside that cluster, and those logs become reachable once the cluster is up enough to talk to. The node VMs have boot and kubelet logs, which matter when a node provisions but will not join. And vCenter holds the events for the VM placement and storage operations underneath all of it. Knowing which layer owns which symptom saves you grepping the wrong place: a node that will not join is a node and Supervisor problem, not a workload-cluster-API problem, because the workload API may not even be reachable yet.

This is exactly why centralised logging from the earlier observability part pays off under pressure. When a cluster is half-built or a node is flapping, the logs you need are spread across the Supervisor, the node, and vCenter, and having them aggregated rather than chasing them across systems turns a long incident into a short one.

When to escalate, and what to attach

Some failures are genuinely platform bugs, not environment problems, and knowing when to stop self-diagnosing is part of the skill. If you have confirmed the fundamentals (DNS, NTP, network, storage policy, content library and capacity) and the Cluster and Machine conditions point at an internal error rather than a missing prerequisite, that is the moment to open a support case rather than keep poking. Attach a support bundle from the Supervisor, the describe output of the stuck Cluster and Machine objects, and the relevant controller logs, because a case with that evidence gets resolved far faster than one that says only “my cluster will not build.” Note the exact VKS, Supervisor and VCF versions, since behaviour differs across them and that is the first thing support will ask.
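The describe output is worth capturing the moment you decide to escalate, before anything is deleted or retried. This is a minimal sketch with hypothetical names (collect_evidence, the KUBECTL override and the file names are mine); it gathers only the kubectl-visible evidence and does not replace the support bundle or the controller logs.

```shell
# Command used to talk to the Supervisor; override it to dry-run without a cluster.
KUBECTL=${KUBECTL:-kubectl}

# Save describe output and recent events for one cluster into a directory.
# Usage: collect_evidence <namespace> <cluster-name> <output-dir>
collect_evidence() (
  ns=$1
  cluster=$2
  out=$3
  mkdir -p "$out" || exit 1
  $KUBECTL describe cluster "$cluster" -n "$ns" > "$out/cluster.txt"
  $KUBECTL describe machine -n "$ns" > "$out/machines.txt"
  $KUBECTL get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt"
)
```

For the example in this part, collect_evidence team-a prod ./case-evidence would leave three text files ready to attach to the case alongside the version details.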

The judgement to develop is the line between “I have not finished checking the environment” and “the environment is clean and this is the platform.” Most incidents are the former, which is why the triage habit comes first. But the genuine platform bugs do exist, and recognising one early, with the evidence already gathered, is far better than spending a day re-checking DNS for the fifth time on a problem that was never yours to fix.

What I’d Do

I make “read the Cluster and Machine conditions first” the non-negotiable opening move of every VKS incident, because it collapses guesswork into a named cause faster than anything else. I keep a one-page triage card pinned to the three families so whoever is on call branches correctly instead of thrashing. And I prevent most of these before they fire by building clusters with headroom, sane pod disruption budgets and non-overlapping CIDRs, which is just the advice from Parts 5, 6 and 12 paying off under pressure. Firefighting is cheaper when the design did not invite the fire. When your last VKS incident happened, did the responder start from the Supervisor objects, or from inside the cluster guessing?


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
