TL;DR · Key Takeaways
- Two gates stand between a plain workload domain and a running VKS cluster: enable a healthy Supervisor, then give it a content library of vSphere Kubernetes releases.
- Decide one-zone versus three-zone up front. Three zones are mandatory for cluster-level HA, and the topology is expensive to change after enablement.
- On VCF 9.1 the only supported networking is VDS or NSX VPCs. If your plan still assumes NSX Classic Tier-0/Tier-1 segments, it is already legacy.
- Without a synced VKr content library associated with the Supervisor, the Supervisor can be perfectly healthy and still refuse to create a single cluster.
- The overwhelming majority of failed enablements are environmental (DNS, NTP, network or storage policy), not the Supervisor software.
VKS is a Supervisor Service, so it can do precisely nothing until the Supervisor is enabled and healthy. Enablement has a reputation for being fiddly, and it earns it, but almost never because of the software. It earns it because the workflow provisions real infrastructure as it runs, and real infrastructure is unforgiving about the boring fundamentals. This part covers the path from a bare workload domain to a Supervisor that can provision VKS clusters, and the short list of blockers that account for most stuck runs. If you want the deeper troubleshooting companion, I wrote one on what blocks Supervisor enablement.
Get the prerequisites right, or nothing else matters
Before you start, four things must be true and tested. vSphere Zones are defined if you want zonal HA: one zone for single-site or dev/test, three zones for cluster-level resilience. Storage policies exist for the control plane and for workloads, and they actually match compatible datastores. The network path is chosen and configured, and on VCF 9.1 that means VDS or NSX VPCs; NSX Classic is gone from 9.1 onward. And the management and workload networks are reachable, with working forward and reverse DNS and time in sync everywhere. Zones tell the Supervisor where it may place deployments; you can add more later, but the initial topology is far easier to get right than to change.
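The forward-and-reverse DNS prerequisite is easy to test mechanically. The following is a minimal sketch, not a supported tool: it assumes `dig` is installed on the machine you run it from, and the hostnames in the usage comment are hypothetical placeholders for your own vCenter, ESX and Supervisor endpoints.

```shell
#!/bin/sh
# Preflight sketch: forward and reverse DNS must agree for each endpoint.
# All hostnames are hypothetical placeholders; dig is assumed to be installed.

# Lower-case a DNS name and drop the trailing dot that PTR answers carry.
normalize_name() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# Succeeds when two names are the same after normalisation.
names_match() {
  [ "$(normalize_name "$1")" = "$(normalize_name "$2")" ]
}

# Resolve the A record, then the PTR for that address, and compare.
check_fqdn() {
  fqdn="$1"
  # Keep only IPv4 answers so a CNAME line is not mistaken for an address.
  ip="$(dig +short A "$fqdn" | grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n 1)"
  if [ -z "$ip" ]; then
    echo "FAIL forward: $fqdn has no A record"
    return 1
  fi
  ptr="$(dig +short -x "$ip" | head -n 1)"
  if names_match "$fqdn" "$ptr"; then
    echo "OK   $fqdn <-> $ip"
  else
    echo "FAIL reverse: $ip -> ${ptr:-no PTR}, expected $fqdn"
    return 1
  fi
}

# Example run (uncomment and substitute your own endpoints):
# for h in vcenter.lab.example esx01.lab.example; do check_fqdn "$h"; done
```

Run it from a machine that uses the same DNS servers the Supervisor will use, otherwise a pass proves little.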
Enabling the Supervisor, and the content library people forget
In VCF 9 you can enable the Supervisor as part of creating a workload domain, or through the Supervisor enablement workflow in vCenter against an existing cluster. Enablement deploys the three control plane VMs, configures the spherelet on each ESX host, and wires the Supervisor into the storage policies and networks you nominated. It runs in stages and reports progress; resist intervening early, because plenty of “failures” are a stage simply waiting on an IP address or a certificate that resolves itself. When it genuinely stalls, the cause is almost always one of the environmental blockers below.
A healthy Supervisor is necessary but not sufficient. To provision VKS clusters you must associate a content library with vSphere Kubernetes Service on the Supervisor. That library holds the VKr images, the OS and Kubernetes versions, that workload-cluster nodes are built from. Create a subscribed library, let it sync, then assign it under content distribution. If it is missing, empty or unsynced, cluster creation has nothing to build from and fails in a way that looks mysterious until you check the library. Keep it maintained, too: new Kubernetes versions arrive as new VKr images, and a stale library is a quiet way to find yourself unable to provision or upgrade to the version you want (Part 12).
The blockers that actually bite
| Blocker | Symptom | Fix |
|---|---|---|
| DNS / NTP | Control plane VMs never become ready; certificate errors | Forward and reverse DNS for every endpoint; time in sync across hosts and vCenter |
| Network path | Stalls allocating addresses or load-balancer VIPs | Validate the VDS / NSX VPC config, IP pools and load balancer before enabling |
| Storage policy | No compatible datastore for the control plane or nodes | Create SPBM policies that match real, compatible datastores |
| Content library | Supervisor healthy, but no cluster will create | Create, sync and associate a VKr content library |
| Capacity | Fails to place control-plane VMs | Leave headroom; the Supervisor control plane is not free |
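The time-sync half of the DNS / NTP row can be spot-checked before enabling. This is a rough sketch under stated assumptions: SSH is enabled on the hosts you test, `date +%s` works in the remote shell, and the `MAX_SKEW` threshold is a hypothetical value chosen for illustration. SSH round-trip time inflates the measured skew slightly, so treat a borderline result as a prompt to look at the NTP configuration rather than as a verdict.

```shell
#!/bin/sh
# Clock-skew sketch. MAX_SKEW and any host names you pass in are hypothetical.
MAX_SKEW=2   # seconds of drift tolerated before we call it a failure

# Absolute difference between two epoch timestamps, in seconds.
skew_seconds() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( 0 - d )); fi
  echo "$d"
}

# Succeeds when the skew between two timestamps is within MAX_SKEW.
skew_ok() {
  [ "$(skew_seconds "$1" "$2")" -le "$MAX_SKEW" ]
}

# Compare this machine's clock with one remote host over SSH.
# Assumes SSH access is enabled and `date +%s` works on the remote shell.
check_host_time() {
  remote="$(ssh "$1" date +%s)" || { echo "FAIL $1: unreachable"; return 1; }
  here="$(date +%s)"
  if skew_ok "$here" "$remote"; then
    echo "OK   $1 skew $(skew_seconds "$here" "$remote")s"
  else
    echo "FAIL $1 skew $(skew_seconds "$here" "$remote")s"
    return 1
  fi
}

# Example run (uncomment and substitute your own hosts):
# for h in root@esx01.lab.example root@esx02.lab.example; do check_host_time "$h"; done
```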
What enablement actually provisions
It helps to know what the workflow is doing while it sits at a stage for ten minutes, because that tells you where to look when it does not move. Enablement deploys three Supervisor control plane VMs and assigns them management IPs, then stands up the Kubernetes API server, etcd and the controllers across them. It configures the spherelet on every ESX host so the hosts can act as Kubernetes nodes, wires in the storage policies you nominated for the control plane and for ephemeral disks, and attaches the workload network so pods and load balancer VIPs can be addressed. On an NSX deployment it also programs segments and the load balancer; on VDS it binds the VLAN-backed port groups. Only once all of that converges does the Supervisor report Running.
The practical takeaway is that enablement is not one action, it is a dozen dependent ones, and each can wait on something external. A stage that looks frozen is usually a control plane VM waiting on a DHCP lease, a certificate waiting on a reachable NTP source, or a load balancer waiting on an IP pool that ran dry. None of those are the Supervisor failing. They are the environment not being ready, which is why the validation discipline below matters more than any retry.
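An IP pool that runs dry is something you can catch with arithmetic before the workflow does. The sketch below counts the addresses in an inclusive start-end range and compares the result with a required minimum. The `REQUIRED` value is an assumption for illustration only; take the real figure from the sizing guidance for your release and network design.

```shell
#!/bin/sh
# Address-pool sketch. REQUIRED is an assumption; take the real figure
# from the sizing guidance for your release and network design.
REQUIRED=5

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into four positional octets
  IFS=$old_ifs
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# Number of addresses in an inclusive start-end range.
range_size() {
  echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

# Succeeds when the range holds at least REQUIRED addresses.
pool_is_large_enough() {
  [ "$(range_size "$1" "$2")" -ge "$REQUIRED" ]
}

# Example run (hypothetical range):
# pool_is_large_enough 10.0.0.10 10.0.0.14 && echo "pool ok" || echo "pool too small"
```

The same check applies to load balancer VIP ranges: size them for the clusters and services you expect, not for the first test cluster.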
Single-zone or three-zone, decided here
The zone decision is made at enablement and is genuinely expensive to undo, so it deserves a moment even though we revisit it in the architecture parts. A single-zone Supervisor places all three control plane VMs in one vSphere cluster and relies on vSphere HA to restart them if a host fails. That is fine for a single site, for dev and test, or for any environment where a whole-cluster outage is an acceptable risk. A three-zone Supervisor places one control plane VM in each zone, so the platform control plane survives the loss of an entire cluster. That is the configuration you want for production Kubernetes that has to outlive a rack failure or a maintenance event.
The trade nobody mentions in the demo: on three zones a vSphere Namespace can span one to three zones, and one that spans all three draws its resources equally from all three underlying clusters, so your capacity math becomes a three-way constraint rather than a single pool. Decide deliberately which namespaces pin to a single zone for performance and which span all three for availability. A three-zone layout also raises your host floor sharply, because you need a quorum in each zone, so the resilience comes with a real hardware bill. Validate that bill against the workload before you commit, because retrofitting a third zone after the fact means re-enabling the Supervisor.
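To make the three-way constraint concrete, here is a deliberately simplified sketch. It assumes a namespace that spans three zones takes an equal share from each, so the zone with the least headroom caps the total. Real placement is more nuanced than this model, and the function names and figures are made up for illustration.

```shell
#!/bin/sh
# Capacity sketch for a namespace spanning three zones.
# Simplifying assumption: the namespace draws an equal share from each zone,
# so the zone with the least headroom caps the total. Units are arbitrary
# (GHz, GiB); any figures you pass in are your own.

# Smallest of three integers.
min3() {
  m=$1
  if [ "$2" -lt "$m" ]; then m=$2; fi
  if [ "$3" -lt "$m" ]; then m=$3; fi
  echo "$m"
}

# Largest namespace limit that all three zones can back equally.
max_spanning_limit() {
  echo $(( 3 * $(min3 "$1" "$2" "$3") ))
}
```

With headroom of 400, 250 and 600 GiB across the zones, `max_spanning_limit 400 250 600` prints 750, not the 1250 a single-pool sum would suggest. Under this model the smallest zone sets the ceiling, which is why uneven clusters waste capacity.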
Proving the Supervisor is genuinely ready
A Running status in vCenter is necessary, but it is not proof. Before you hand a namespace to a team, confirm the control plane is reachable, the VKr content library is associated and synced, and a throwaway cluster actually builds. The fastest end-to-end proof is to log in and create the smallest possible cluster, watch it reach Provisioned, pull its kubeconfig and run a single command against it:
```shell
# list the Kubernetes releases the content library actually offers
kubectl get vkr

# create a tiny test cluster, then watch it converge
kubectl apply -f test-cluster.yaml
kubectl get cluster,machine -n test-ns -w

# once Provisioned, prove the API answers
kubectl --kubeconfig test.kubeconfig get nodes
```
If that cluster builds and answers, your platform works end to end and you can onboard teams with confidence. If kubectl get vkr returns nothing, the content library is the problem, not the Supervisor. Tear the test cluster down afterward so it is not sitting in your capacity math. This five-minute proof has saved me from handing a half-ready platform to a team that then spent a day convinced their manifests were wrong when the real issue was an unsynced library.
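If you want this proof in a runbook rather than in your head, the empty-library check can be scripted. The sketch below is an illustration, not an official procedure: it reuses the `kubectl get vkr` command from the steps above, the cluster and namespace names are hypothetical, and the commented `kubectl wait --for=jsonpath=...` form assumes a reasonably recent kubectl (1.23 or later).

```shell
#!/bin/sh
# Readiness-proof sketch. Namespace and cluster names are hypothetical;
# `kubectl get vkr` is the command used in the steps above.

# Turn a release count into the diagnosis described in the text.
diagnose_releases() {
  if [ "$1" -gt 0 ]; then
    echo "ok: $1 release(s) available"
  else
    echo "content library: no releases visible, check association and sync"
    return 1
  fi
}

# Count the releases the Supervisor exposes (run against the Supervisor context).
count_releases() {
  kubectl get vkr --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Example run (uncomment once logged in to the Supervisor):
# diagnose_releases "$(count_releases)" &&
#   kubectl wait --for=jsonpath='{.status.phase}'=Provisioned \
#     cluster/test-cluster -n test-ns --timeout=30m
```

Failing fast on an empty release list keeps the blame on the library, where it belongs, before anyone starts debugging a cluster manifest.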
What I’d Do
Before clicking Enable, I would validate four things and only four: name resolution both ways, time sync everywhere, a network path that is genuinely finished, and a storage policy that matches a real datastore. Every stuck enablement I have helped unpick came down to one of those, never the Supervisor binary. Decide your zone count deliberately (three for production resilience, one where the extra hosts are not justified), because it is painful to change later. And the moment the Supervisor reports healthy, do not declare victory: create and associate the VKr content library immediately, then prove the whole thing by provisioning one throwaway cluster. If that cluster comes up, your platform works. If it does not, you have a much smaller problem to chase than if you had found out three weeks and ten tickets later. What does your current enablement runbook check before it clicks the button, and is the content library on that list?
References
- Broadcom TechDocs: vSphere Supervisor (VCF 9.0)
- Broadcom TechDocs: Add or Update VKS Content Libraries on a Supervisor
- VKS Deployment & Setup, interactive step-by-step walkthrough









