Enabling vSphere Supervisor and the VKS Runtime in VCF 9 (VKS Series, Part 3)

Two gates stand between a workload domain and a running VKS cluster: a healthy Supervisor and a content library. Here are the prerequisites, and the blockers that actually bite.

by

Dr. Pranay Jha

June 19, 2026

Enabling vSphere Supervisor & the VKS Runtime

VKS Series · Part 3 of 17

TL;DR · Key Takeaways

Two gates stand between a plain workload domain and a running VKS cluster: enable a healthy Supervisor, then give it a content library of vSphere Kubernetes releases.
Decide one-zone versus three-zone up front. Three zones are mandatory for cluster-level HA, and the choice is expensive to change after enablement.
On VCF 9.1 the only supported networking options are VDS and NSX VPCs. If your plan still assumes NSX Classic Tier-0/Tier-1 segments, it is already legacy.
Without a synced VKr content library associated with the Supervisor, the Supervisor can be perfectly healthy and still refuse to create a single cluster.
The overwhelming majority of failed enablements are environmental (DNS, NTP, network path or storage policy), not a fault in the Supervisor software.

Who this is for: the person who actually has to click Enable, and wants it to work the first time. Prerequisites: a working VI workload domain, an IP plan, and a decision on VDS versus NSX VPC networking.

VKS is a Supervisor Service, so it can do precisely nothing until the Supervisor is enabled and healthy. Enablement has a reputation for being fiddly, and it earns it, but almost never because of the software. It earns it because the workflow provisions real infrastructure as it runs, and real infrastructure is unforgiving about the boring fundamentals. This part covers the path from a bare workload domain to a Supervisor that can provision VKS clusters, and the short list of blockers that account for most stuck runs. If you want the deeper troubleshooting companion, I wrote one on what blocks Supervisor enablement.

Get the prerequisites right, or nothing else matters

Before you start, four things must be true and tested. First, vSphere Zones are defined if you want zonal HA: one zone for single-site or dev/test, three zones for cluster-level resilience. Second, storage policies exist for the control plane and for workloads, and they actually match compatible datastores. Third, the network path is chosen and configured, and on VCF 9.1 that means VDS or NSX VPCs; NSX Classic is gone from 9.1 onward. Fourth, the management and workload networks are reachable, with working forward and reverse DNS and time in sync everywhere. Zones tell the Supervisor where it may place deployments; you can add more later, but the initial topology is far easier to get right than to change.
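The forward-and-reverse DNS requirement is easy to state and easy to get half right. A minimal sketch of a check follows, assuming a Linux jump host where `getent hosts` answers both name and address lookups; the `check_dns` function, the `LOOKUP` variable and the names in the usage line are hypothetical, not part of any VMware tooling.

```shell
# Hypothetical preflight helper: confirm a name resolves to the expected IP
# and that the IP resolves back to the same name.
# LOOKUP is an assumption: any command that prints "<ip> <name>" when given
# either a hostname or an address. `getent hosts` does this on most Linux hosts.
LOOKUP="${LOOKUP:-getent hosts}"

# check_dns <fqdn> <expected_ip>
check_dns() {
    name="$1"
    ip="$2"
    # Forward lookup: first column of the first answer is the address.
    fwd=$($LOOKUP "$name" | awk 'NR==1 {print $1}')
    # Reverse lookup: second column of the first answer is the name.
    rev=$($LOOKUP "$ip" | awk 'NR==1 {print $2}')
    if [ "$fwd" != "$ip" ]; then
        echo "FAIL forward: $name -> ${fwd:-no answer} (expected $ip)"
        return 1
    fi
    if [ "$rev" != "$name" ]; then
        echo "FAIL reverse: $ip -> ${rev:-no answer} (expected $name)"
        return 1
    fi
    echo "OK $name <-> $ip"
}
```

Something like `check_dns vcenter.example.test 10.0.0.5` would then be run once per endpoint in your IP plan, from a machine that uses the same DNS servers the Supervisor will use.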

Enablement is two gates. People celebrate at step 2 and then cannot work out why step 4 produces nothing.

Enabling the Supervisor, and the content library people forget

In VCF 9 you can enable the Supervisor as part of creating a workload domain, or through the Supervisor enablement workflow in vCenter against an existing cluster. Enablement deploys the three control plane VMs, configures the spherelet on each ESX host, and wires the Supervisor into the storage policies and networks you nominated. It runs in stages and reports progress; resist intervening early, because plenty of “failures” are a stage simply waiting on an IP address or a certificate that resolves itself. When it genuinely stalls, the cause is almost always one of the environmental blockers below.

A healthy Supervisor is necessary but not sufficient. To provision VKS clusters you must associate a content library with vSphere Kubernetes Service on the Supervisor. That library holds the VKr images (the OS and Kubernetes versions) that workload-cluster nodes are built from. Create a subscribed library, let it sync, then assign it under content distribution. If it is missing, empty or unsynced, cluster creation has nothing to build from and fails in a way that looks mysterious until you check the library. Keep it maintained, too: new Kubernetes versions arrive as new VKr images, and a stale library is a quiet way to find yourself unable to provision or upgrade to the version you want (Part 12).

The blockers that actually bite

Blocker	Symptom	Fix
DNS / NTP	Control plane VMs never become ready; certificate errors	Forward and reverse DNS for every endpoint; time in sync across hosts and vCenter
Network path	Stalls allocating addresses or load-balancer VIPs	Validate the VDS / NSX VPC config, IP pools and load balancer before enabling
Storage policy	No compatible datastore for the control plane or nodes	Create SPBM policies that match real, compatible datastores
Content library	Supervisor healthy, but no cluster will create	Create, sync and associate a VKr content library
Capacity	Fails to place control-plane VMs	Leave headroom; the Supervisor control plane is not free

Test it in non-prod first: enablement touches storage, networking and certificates at once, so the cheapest insurance is a dry run in a non-production instance. Validate DNS, NTP, your storage policy and your network path there, and the production run becomes anticlimactic.
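For the NTP half of the DNS / NTP row, one rough way to make "time in sync" measurable is to compare epoch timestamps collected from each host against a reference. This is only a sketch: the function names are hypothetical, the tolerance is yours to choose, and how you collect each timestamp (for example `date +%s` over SSH) depends on your environment.

```shell
# clock_skew <epoch_a> <epoch_b>: absolute difference in seconds.
clock_skew() {
    skew=$(( $1 - $2 ))
    if [ "$skew" -lt 0 ]; then
        skew=$(( 0 - skew ))
    fi
    echo "$skew"
}

# check_skew <label> <reference_epoch> <host_epoch> <max_skew_seconds>
# The limit is a parameter because the acceptable tolerance is an assumption
# you should set for your own environment.
check_skew() {
    s=$(clock_skew "$2" "$3")
    if [ "$s" -gt "$4" ]; then
        echo "FAIL $1: clock differs by ${s}s (limit ${4}s)"
        return 1
    fi
    echo "OK $1: clock differs by ${s}s"
}
```

Collecting the timestamps takes a moment per host, so treat small differences as measurement noise and look for the outlier that is off by minutes.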

What enablement actually provisions

It helps to know what the workflow is doing while it sits at a stage for ten minutes, because that tells you where to look when it does not move. Enablement deploys three Supervisor control plane VMs and assigns them management IPs, then stands up the Kubernetes API server, etcd and the controllers across them. It configures the spherelet on every ESX host so the hosts can act as Kubernetes nodes, wires in the storage policies you nominated for the control plane and for ephemeral disks, and attaches the workload network so pods and load balancer VIPs can be addressed. On an NSX deployment it also programs segments and the load balancer; on VDS it binds the VLAN-backed port groups. Only once all of that converges does the Supervisor report Running.

The practical takeaway is that enablement is not one action, it is a dozen dependent ones, and each can wait on something external. A stage that looks frozen is usually a control plane VM waiting on a DHCP lease, a certificate waiting on a reachable NTP source, or a load balancer waiting on an IP pool that ran dry. None of those are the Supervisor failing. They are the environment not being ready, which is why the validation discipline below matters more than any retry.
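An IP pool that runs dry is the easiest of these to catch on paper before you enable anything. The sketch below counts the addresses in an inclusive IPv4 range and compares that with what you expect to consume; the function names are hypothetical, and the required count is whatever your own IP plan calls for.

```shell
# ip_to_int <dotted_ipv4>: convert an address to an integer so that
# ranges can be subtracted.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# pool_size <first_ip> <last_ip>: number of addresses in an inclusive range.
pool_size() {
    echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

# check_pool <label> <first_ip> <last_ip> <addresses_needed>
check_pool() {
    have=$(pool_size "$2" "$3")
    if [ "$have" -lt "$4" ]; then
        echo "FAIL $1: $have addresses, need $4"
        return 1
    fi
    echo "OK $1: $have addresses, need $4"
}
```

For example, `check_pool vip 10.0.0.10 10.0.0.20 8` reports a pool of 11 addresses against a need of 8. The check only covers the arithmetic; it does not know whether something else on the network has already taken addresses from the range.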

Single-zone or three-zone, decided here

The zone decision is made at enablement and is genuinely expensive to undo, so it deserves a moment even though we revisit it in the architecture parts. A single-zone Supervisor places all three control plane VMs in one vSphere cluster and relies on vSphere HA to restart them if a host fails. That is fine for a single site, for dev and test, or for any environment where a whole-cluster outage is an acceptable risk. A three-zone Supervisor spreads one control plane VM per zone, so the platform control plane survives the loss of an entire cluster, which is the configuration you want for production Kubernetes that has to outlive a rack or a maintenance event.

The trade nobody mentions in the demo: on three zones a vSphere Namespace draws its resources equally from all three underlying clusters, so your capacity math becomes a three-way constraint rather than a single pool. A namespace can span one to three zones, so decide deliberately which namespaces pin to a single zone for performance and which span all three for availability. Three zones also raise your host floor sharply, because you need a quorum in each, so the resilience comes with a real hardware bill. Validate that bill against the workload before you commit, because retrofitting a third zone after the fact means re-enabling the Supervisor.
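To see why the three-way constraint differs from a single pool, here is a small arithmetic sketch. It assumes the equal split described above, so the zone with the least free capacity caps what a spanning namespace can use; the function names and the figures in the usage note are hypothetical, and the unit (GB of memory, GHz of CPU) is whatever you are sizing.

```shell
# min3 <a> <b> <c>: smallest of three integers.
min3() {
    m=$1
    if [ "$2" -lt "$m" ]; then m=$2; fi
    if [ "$3" -lt "$m" ]; then m=$3; fi
    echo "$m"
}

# spanning_capacity <zone1_free> <zone2_free> <zone3_free>
# With an equal three-way draw, the smallest zone limits the namespace,
# so usable capacity is three times the minimum, not the sum.
spanning_capacity() {
    echo $(( 3 * $(min3 "$1" "$2" "$3") ))
}

# pooled_capacity: the naive single-pool sum, shown for comparison.
pooled_capacity() {
    echo $(( $1 + $2 + $3 ))
}
```

With free capacity of 100, 80 and 120 units per zone, `pooled_capacity` gives 300 while `spanning_capacity` gives 240: the 60-unit gap is capacity that exists on paper but that an evenly spread namespace cannot reach.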

Proving the Supervisor is genuinely ready

A Running status in vCenter is necessary, but it is not proof. Before you hand a namespace to a team, confirm the control plane is reachable, the VKr content library is associated and synced, and a throwaway cluster actually builds. The fastest end-to-end proof is to log in and create the smallest possible cluster, watch it reach Provisioned, pull its kubeconfig and run a single command against it:

# list the Kubernetes releases the content library actually offers
kubectl get vkr

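# create a tiny test cluster, then watch it converge
kubectl apply -f test-cluster.yaml
# --watch takes one resource type at a time, so watch clusters and
# machines separately (Ctrl-C to stop, or use a second terminal)
kubectl get cluster -n test-ns -w
kubectl get machine -n test-ns -w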

# once Provisioned, prove the API answers
kubectl --kubeconfig test.kubeconfig get nodes

If that cluster builds and answers, your platform works end to end and you can onboard teams with confidence. If kubectl get vkr returns nothing, the content library is the problem, not the Supervisor. Tear the test cluster down afterward so it is not sitting in your capacity math. This five-minute proof has saved me from handing a half-ready platform to a team that then spent a day convinced their manifests were wrong when the real issue was an unsynced library.
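If you script this proof, the empty-release case is worth turning into an explicit gate so the runbook stops with a useful message instead of failing later at cluster creation. A minimal sketch, assuming you are already logged in to the Supervisor and reusing the `vkr` resource name from the commands above; `require_releases` is a hypothetical helper name.

```shell
# require_releases: succeed only if the Supervisor lists at least one release.
# Return codes (hypothetical convention): 0 = releases found,
# 1 = none listed, 2 = kubectl itself failed.
require_releases() {
    # --no-headers keeps the column header out of the count.
    if ! releases=$(kubectl get vkr --no-headers 2>/dev/null); then
        echo "kubectl failed: check your login and context before blaming the library"
        return 2
    fi
    if [ -z "$releases" ]; then
        echo "no releases listed: check that the content library is associated and synced"
        return 1
    fi
    count=$(printf '%s\n' "$releases" | wc -l | tr -d ' ')
    echo "$count release(s) available"
}
```

Running `require_releases && kubectl apply -f test-cluster.yaml` keeps the test cluster from being submitted when there is nothing to build it from, and separates a login problem from a library problem.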

What I’d Do

Before clicking Enable, I would validate four things and only four: name resolution both ways, time sync everywhere, a network path that is genuinely finished, and a storage policy that matches a real datastore. Every stuck enablement I have helped unpick came down to one of those, never the Supervisor binary. Decide your zone count deliberately, three for production resilience, one where the extra hosts are not justified, because it is painful to change later. And the moment the Supervisor reports healthy, do not declare victory: create and associate the VKr content library immediately, then prove the whole thing by provisioning one throwaway cluster. If that cluster comes up, your platform works. If it does not, you have a much smaller problem to chase than if you had found out three weeks and ten tickets later. What does your current enablement runbook check before it clicks the button, and is the content library on that list?

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
