5 Things That Break VKS Clusters in VCF 9 (and How to Fix Them)

Most vSphere Kubernetes Service (VKS) failures in VCF 9 aren’t Kubernetes bugs—they’re infrastructure. Here are the 5 things that break VKS clusters most often and how to diagnose each one fast.

Dr. Pranay Jha

June 12, 2026


TL;DR: Most VKS (vSphere Kubernetes Service) cluster failures in VCF 9 are not Kubernetes bugs — they trace back to plumbing: DNS, content libraries, storage policies, RBAC, and the Supervisor itself.
A cluster stuck in Creating almost always means the Supervisor cannot pull a node image — check the content library and DNS first.
kubectl vsphere login failures are frequently a full Supervisor control plane disk or the wrong server address, not a credential problem.
Missing StorageClasses and empty namespaces are permission and storage-policy symptoms, not cluster corruption.
When in doubt, work bottom-up: Supervisor health → namespace config → guest cluster.

Who this is for: vSphere admins and platform engineers running vSphere Kubernetes Service (VKS) on VCF 9, comfortable with vCenter and basic kubectl.

VKS — the service formerly known as TKG Service — makes spinning up conformant Kubernetes clusters feel like a one-line YAML apply. Right up until it doesn’t. When a cluster hangs, the instinct is to blame Kubernetes, but in VCF 9 the failure almost always lives in the infrastructure layer underneath: the Supervisor, NSX, storage policies, content libraries, or DNS. If you understand how the pieces relate (my breakdown of VMware Tanzu in vSphere terms is a good primer), the fix is usually fast. Here are the five things that break VKS clusters most often, and how to diagnose each one.

1. Cluster stuck in “Creating” — the content library can’t serve node images

This is the single most common VKS failure. You apply a cluster manifest, the Cluster object appears, and then… nothing. No control plane VM ever gets built. Nine times out of ten the Supervisor cannot retrieve the OVA node image from the content library — usually because vCenter can’t resolve its own FQDN to the right IP, so image retrieval silently fails. Start by confirming the Supervisor can actually see the images:

# List the VM images synced into the content library
kubectl get virtualmachineimages -A

# Check what Kubernetes Releases the Supervisor considers available
kubectl get tkr

If virtualmachineimages is empty or the release you asked for isn’t listed, the problem is the library, not the manifest. Verify the content library is created, synchronized with the desired Kubernetes releases, and associated with the Supervisor. A nasty timing edge case also exists: if you detach a content library and re-attach it before the old Kubernetes resources finish deleting, the images never get recreated. Give the cleanup time to finish, or detach and wait before re-attaching.
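
The two checks above (is the release listed, does the FQDN resolve to the right IP) can be turned into small pass/fail filters. This is a minimal sketch, not an official tool: the helper names has_name and resolves_to are made up for illustration, and the only assumption about kubectl output is that the first column of each row is the resource name.

```shell
# Exit 0 when the first column of the piped rows contains exactly this name.
has_name() {
  awk -v want="$1" '$1 == want { found = 1 } END { exit (found ? 0 : 1) }'
}

# Exit 0 when the piped list of addresses (one per line) contains this IP.
resolves_to() {
  grep -Fxq -- "$1"
}
```

Possible usage, assuming dig is available wherever you run the lookup: kubectl get tkr --no-headers | has_name <release-name> for the release check, and dig +short <vcenter-fqdn> | resolves_to <expected-ip> for the DNS check. A non-zero exit status from either points at the library or DNS rather than the manifest.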

2. `kubectl vsphere login` fails — and it isn’t your password

A login that returns Login failed: bad gateway sends people down a credential rabbit hole. It’s rarely credentials. The most common root cause is a Supervisor control plane VM whose root disk is above 80% — or completely full. When the control plane can’t write, the auth proxy returns a gateway error. The second most common cause is simply pointing at the wrong address.

# Use the Supervisor Control Plane Node Address (vCenter > Workload
# Management > Supervisors), NOT a guest cluster IP
kubectl vsphere login --server=<supervisor-control-plane-vip> \
  --vsphere-username administrator@vsphere.local \
  --insecure-skip-tls-verify

# Then confirm your contexts resolved
kubectl config get-contexts

If the login fails only because you pointed at the wrong address, the fix is just the --server flag. If you suspect disk pressure, check control plane VM disk usage from vCenter before anything else.
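
If you do have shell access to a Supervisor control plane VM (an assumption; the steps above only rely on vCenter), the 80% threshold can be checked with a small filter over df output. This is a sketch: the function name disk_at_or_above is hypothetical, and it expects the POSIX df -P layout, where the fifth column of the data row is the usage percentage.

```shell
# Reads `df -P <mount>` output on stdin; exits 0 when usage is at or above
# the given percentage, 1 otherwise (including when no data row is present).
disk_at_or_above() {
  awk -v t="$1" '
    NR == 2 { sub(/%/, "", $5); hit = ($5 + 0 >= t + 0) }
    END     { exit (hit ? 0 : 1) }
  '
}
```

Possible usage on the VM: df -P / | disk_at_or_above 80 && echo "root disk at or above 80%".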

3. Empty namespace or missing StorageClass — PVCs stuck Pending

You log in, switch to your namespace, and there’s nothing there — no storage class, no VM classes, and your PersistentVolumeClaims sit in Pending forever. This is a configuration gap at the vSphere Namespace level, not a broken cluster. A StorageClass appears in Kubernetes only because a vSphere admin assigned a storage policy to the namespace. If that policy was never assigned (or was deleted from vSphere), the StorageClass vanishes and provisioning stalls. In VCF 9 this can even block a Supervisor upgrade when an expected StorageClass is missing.

# What storage classes does the namespace actually expose?
kubectl get storageclass

# Why is the PVC stuck? Read the events at the bottom.
kubectl describe pvc <pvc-name> -n <namespace>

Fix it in vCenter: assign (or re-assign) the storage policy, VM classes, and content library to the vSphere Namespace. Note the inverse gotcha too — a storage policy removed from vSphere can keep appearing as a Kubernetes StorageClass until the system reconciles, so don’t trust a stale class name.
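
To see at a glance which claims are stuck and which class name they reference, a short filter helps. This is a sketch under stated assumptions: pending_pvcs is a made-up helper name, and it expects four whitespace-separated columns (namespace, name, phase, storage class) such as kubectl's custom-columns output produces.

```shell
# Reads "namespace name phase storageclass" rows on stdin and prints only
# the claims whose phase is Pending, with the class they asked for.
pending_pvcs() {
  awk '$3 == "Pending" { printf "%s/%s storageClass=%s\n", $1, $2, $4 }'
}
```

Possible usage: kubectl get pvc -A --no-headers -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,SC:.spec.storageClassName | pending_pvcs. Compare the class names it prints against kubectl get storageclass; a name that appears in one list but not the other is the storage policy to fix in vCenter.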

4. You can log in but you can’t do anything — RBAC and SSO

A subtle one. The login succeeds, a context is created, and yet every command returns Forbidden or you see zero namespaces. You only get access to namespaces where a vSphere admin has explicitly granted you permission. With VCF SSO in the picture there’s an extra step: the VCF SSO group must be authorized on the Supervisor / vSphere Namespace, otherwise you can build a valid kubeconfig context with no rights attached to it.

# Confirm what you're actually allowed to do
kubectl auth can-i --list -n <namespace>
kubectl get namespaces

If can-i comes back nearly empty, stop debugging the cluster — the gap is the permission/identity binding in vCenter. Grant the user or SSO group a role on the target namespace and re-test.
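
When the full can-i list is too noisy, a handful of targeted questions gives a quicker answer. This is a sketch: check_access is a hypothetical helper, and the three verb/resource pairs are illustrative picks rather than an official list of what VKS requires. It relies on kubectl auth can-i exiting 0 for "yes" and non-zero for "no".

```shell
# Print yes/no for a few representative actions in one namespace.
check_access() {
  ns=$1
  for pair in "get pods" "list persistentvolumeclaims" "create clusters.cluster.x-k8s.io"; do
    # $pair is intentionally unquoted so it splits into verb and resource.
    if kubectl auth can-i $pair -n "$ns" >/dev/null 2>&1; then
      echo "yes  $pair"
    else
      echo "no   $pair"
    fi
  done
}
```

Possible usage: check_access <namespace>. A column of "no" right after a successful login points at the role or SSO group binding, not at the cluster.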

5. The Supervisor itself is the problem — networking and control plane

When multiple clusters misbehave at once, suspect the Supervisor, not any single guest cluster. A few known VCF 9.x failure modes are worth memorizing. First, during a three-node Supervisor control plane deployment, a race condition can occur in which the apiserver on the second and third control plane VMs becomes ready before cert-manager injects the CA certificate into the nsop validating webhook; the documented workaround is to deploy with a single control plane node, then activate Control Plane HA afterward. Second, in high-scale environments, NSX-NCP pods can run out of memory, which delays or drops network updates to NSX and leaves cluster VirtualNetwork objects not ready, and node creation can’t proceed until the VirtualNetwork is ready. Finally, confirm there’s no IP overlap between the Supervisor load balancer VIP range and your guest cluster networks, since VKS nodes must reach the Supervisor API through that VIP.

# From the Supervisor context, sanity-check the building blocks
kubectl get virtualnetwork -A
kubectl get pods -n vmware-system-nsx | grep ncp
kubectl get cluster -A   # Look for clusters not in Provisioned/Running
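
The IP overlap check is plain arithmetic and easy to script. The sketch below assumes both ranges are written as IPv4 CIDR blocks (a VIP range given as start and end addresses would need converting first) and that the shell uses 64-bit arithmetic. The function names are made up for illustration.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Exit 0 when two IPv4 CIDR blocks share at least one address.
cidrs_overlap() {
  a_ip=${1%/*}; a_len=${1#*/}
  b_ip=${2%/*}; b_len=${2#*/}
  # Two blocks overlap exactly when they agree under the shorter prefix.
  len=$(( a_len < b_len ? a_len : b_len ))
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$a_ip") & mask )) -eq $(( $(ip_to_int "$b_ip") & mask )) ]
}
```

Possible usage: cidrs_overlap <supervisor-vip-cidr> <guest-cluster-cidr> && echo "overlap: fix the ranges before debugging anything else".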

For deeper context on what these clusters are made of and how the Supervisor orchestrates them, my Learn Kubernetes series covers the fundamentals, and the what’s new in VCF 9 overview explains how VKS fits into the unified private-cloud stack.

A quick triage order that saves time

1. Can you log in to the Supervisor? If not → disk usage and server address (issue #2).
2. Is the Supervisor healthy? Check control plane VMs, NCP, VirtualNetworks (issue #5).
3. Is the namespace configured? Storage policy, VM classes, permissions (issues #3 and #4).
4. Only then debug the guest cluster — and start with the content library (issue #1).
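
The same order can be sketched as a script that stops at the first layer that fails. This is a rough illustration, not a health check: run_step and triage are hypothetical names, and each step only verifies that a query succeeds and returns at least one row. It assumes you are already in the Supervisor context and that kubectl prints nothing on stdout when a --no-headers query finds no resources.

```shell
# Run one query; pass only if it succeeds and prints at least one row.
run_step() {
  label=$1; shift
  out=$("$@" 2>/dev/null)
  if [ $? -eq 0 ] && [ -n "$out" ]; then
    echo "ok   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Bottom-up triage: Supervisor, then networking, then namespace, then images.
triage() {
  run_step "Supervisor API reachable" kubectl get namespaces --no-headers || return 1
  run_step "VirtualNetworks present" kubectl get virtualnetwork -A --no-headers || return 1
  run_step "StorageClasses present" kubectl get storageclass --no-headers || return 1
  run_step "node images present" kubectl get virtualmachineimages -A --no-headers || return 1
}
```

Running triage prints one line per layer and stops at the first FAIL, which tells you which of the five sections to open.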

Disclaimer: The commands above are diagnostic, but any remediation that touches a production Supervisor — reassigning storage policies, modifying content libraries, or reconfiguring control plane HA — should be validated against your target BOM, checked for interoperability, backed up, and tested in a non-production environment first. Run prechecks before changing anything live.

Final Thoughts

The pattern across all five is the same: VKS clusters break because of the infrastructure they sit on, not because Kubernetes is fragile. Train yourself to debug bottom-up — Supervisor, then namespace, then guest cluster — and most “stuck cluster” tickets resolve in minutes instead of hours. Keep an eye on disk, DNS, storage policies, and RBAC, and VKS becomes the boring, reliable platform it’s meant to be.

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.


Tags: Kubernetes, Troubleshooting, VCF, VCF9, VKS, vsphere, vSphere Supervisor
