TL;DR · Key Takeaways
- You provision a VKS cluster the Cluster API way: write a Cluster manifest that references a built-in ClusterClass, target a vSphere Namespace, and apply it with kubectl.
- The TanzuKubernetesCluster API is deprecated. Start on the Cluster v1beta1/v1beta2 API with a versioned builtin-generic class, not on copied legacy YAML.
- Pin nothing by hand that the catalog can tell you: the ClusterClass version, the Kubernetes version and the VM classes must come from what your VKS build and content library actually offer.
- Watch the Cluster and Machine objects, not the kubectl exit code. The apply records intent; the cluster is ready only when the control plane and nodes report healthy.
This is where VKS stops being theory. We provision an actual workload cluster, and the mechanics are deliberately Kubernetes-native: you describe the cluster you want as a manifest, apply it, and let the service reconcile reality to match. If you have used Cluster API anywhere else, the shape will feel familiar. If you have not, it is simpler than it looks once you see the moving parts, and most of the friction comes from two avoidable mistakes, the deprecated API and copied version strings, that this part heads off.
The API you use, and the one you should leave alone
Current VKS provisions clusters through the upstream Cluster API, using the Cluster v1beta1/v1beta2 resource. Your Cluster object references a ClusterClass, and VKS ships built-in, versioned classes (builtin-generic-v3.3.0, builtin-generic-v3.4.0 and so on) that each encode a tested cluster topology. You supply the Kubernetes version, node counts, VM classes, storage and other variables, and the class assembles the cluster. The older TanzuKubernetesCluster (TKC) API was deprecated with the move to ClusterClass. It still exists for backward compatibility through a legacy class, but new clusters should use the Cluster API. This is the trap when you copy YAML from older blog posts: a TKC manifest may still apply, yet you have started on a deprecated path you will have to migrate off. Begin with the versioned built-in class instead.
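If you want a guardrail rather than a convention, a check like the one below can run in CI before any manifest reaches a Supervisor. It is a minimal sketch in Python, not a VKS tool: `legacy_findings` is a name made up for illustration, the manifest is assumed to be parsed into a dict already, and the legacy class name is taken from the Broadcom reference on the tanzukubernetescluster class.

```python
# Sketch: flag manifests that start on the deprecated path.
# Assumes the manifest is already parsed into a dict (for example from JSON).

LEGACY_KIND = "TanzuKubernetesCluster"
LEGACY_CLASS = "tanzukubernetescluster"  # legacy ClusterClass kept for backward compatibility


def legacy_findings(manifest):
    """Return the reasons a manifest counts as legacy; an empty list means none found."""
    findings = []
    if manifest.get("kind") == LEGACY_KIND:
        findings.append("kind is TanzuKubernetesCluster (deprecated TKC API)")
    topology = (manifest.get("spec") or {}).get("topology") or {}
    if topology.get("class") == LEGACY_CLASS:
        findings.append("topology.class is the legacy tanzukubernetescluster class")
    return findings


# Hypothetical inputs: one copied TKC manifest, one on the versioned built-in class.
old = {"kind": "TanzuKubernetesCluster", "metadata": {"name": "copied-from-a-blog"}}
new = {"kind": "Cluster", "spec": {"topology": {"class": "builtin-generic-v3.3.0"}}}
print(legacy_findings(old))  # one finding
print(legacy_findings(new))  # empty list
```

Failing the pipeline on a non-empty list is cheaper than discovering the migration later.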
The workflow, end to end
You authenticate to the Supervisor, which gives kubectl a context for each vSphere Namespace you can reach. You switch context to the target namespace, then apply a Cluster manifest. A minimal manifest names the cluster, picks a ClusterClass and Kubernetes version, and declares the control plane and worker topology with the VM classes and storage class to use:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod
  namespace: team-a
spec:
  topology:
    class: builtin-generic-v3.3.0  # from your VKS build
    version: v1.31.x---vmware.x  # from your content library
    controlPlane:
      replicas: 3  # 3 for HA, 1 for throwaway
    workers:
      machineDeployments:
        - class: node-pool
          name: np-1
          replicas: 3
    variables:
      - name: vmClass
        value: guaranteed-large  # must exist in the namespace
      - name: storageClass
        value: vks-storage-policy  # maps to an SPBM policy
Apply that, and VKS takes over: it provisions the control plane VMs, then the workers in the node pool, installs the CNI, and brings the cluster to Ready. The exact class and version strings come from what your content library and VKS version offer, so list the available releases rather than copying a version verbatim. That one habit prevents a surprising share of provisioning failures.
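The "check the catalog first" habit is easy to automate once you have the offered classes and releases in hand. The sketch below assumes you have already collected those two lists from your own Supervisor (how you list them is outside the sketch); `catalog_mismatches` is a hypothetical helper, and the offered version string in the example is made up purely to show a mismatch.

```python
# Sketch: refuse a manifest whose class or version is not in the catalog you listed.
# offered_classes / offered_versions are whatever your own environment reported.


def catalog_mismatches(manifest, offered_classes, offered_versions):
    """Return human-readable problems; an empty list means both strings are offered."""
    topology = (manifest.get("spec") or {}).get("topology") or {}
    problems = []
    if topology.get("class") not in offered_classes:
        problems.append(f"class {topology.get('class')!r} is not offered here")
    if topology.get("version") not in offered_versions:
        problems.append(f"version {topology.get('version')!r} is not offered here")
    return problems


manifest = {
    "spec": {
        "topology": {
            "class": "builtin-generic-v3.3.0",
            "version": "v1.31.x---vmware.x",  # the placeholder from the manifest above
        }
    }
}
# Made-up catalog contents, for illustration only.
offered_classes = ["builtin-generic-v3.3.0"]
offered_versions = ["v1.31.4---vmware.1-example"]

for problem in catalog_mismatches(manifest, offered_classes, offered_versions):
    print(problem)
```

Here the class passes and the placeholder version is reported, which is exactly the copied-string failure caught before the apply instead of after it.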
Lifecycle states, read honestly
A cluster does not go from absent to ready instantly. It moves through phases as the control plane provisions, the workers join, and add-ons settle. The mistake is treating a clean kubectl apply as success; the apply only records intent. Watch the objects:
| What you check | Healthy sign | If it stalls |
|---|---|---|
| Cluster phase | Provisioning → Provisioned | Read the Ready condition message |
| Control plane Machines | All Running, API reachable | Capacity or storage policy (Part 15) |
| Worker Machines | Joined and Ready | Pending usually means no room to place |
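The checks in the table can be folded into one small function, so "ready" means the same thing to every script that asks. This is a sketch under assumptions: the inputs are plain dicts such as you would get by parsing `kubectl get ... -o json` output for the Cluster and its Machines, the field names follow the upstream Cluster API v1beta1 status layout, and `readiness_report` is a hypothetical name.

```python
# Sketch: decide "ready" from the objects, not from the apply exit code.
# Field names (status.phase, status.conditions) are assumed to follow the
# upstream Cluster API v1beta1 layout; verify against your own objects.


def readiness_report(cluster, machines):
    status = cluster.get("status") or {}
    # Pick out the Ready condition; its message explains a stall.
    ready = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Ready"),
        {},
    )
    # Any Machine that is not Running is still being built, or is stuck.
    lagging = [
        m["metadata"]["name"]
        for m in machines
        if (m.get("status") or {}).get("phase") != "Running"
    ]
    healthy = (
        status.get("phase") == "Provisioned"
        and ready.get("status") == "True"
        and bool(machines)
        and not lagging
    )
    return {
        "healthy": healthy,
        "phase": status.get("phase", "Unknown"),
        "ready_message": ready.get("message", ""),
        "machines_not_running": lagging,
    }


# Hypothetical snapshot taken shortly after an apply.
cluster = {"status": {"phase": "Provisioning", "conditions": [
    {"type": "Ready", "status": "False", "message": "waiting for control plane"}]}}
machines = [
    {"metadata": {"name": "prod-cp-0"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "prod-np-1-a"}, "status": {"phase": "Pending"}},
]
print(readiness_report(cluster, machines))
```

A clean apply followed by this report saying `healthy: False` is normal for a while; what matters is whether the message and the lagging list change between polls.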
Once ready, you pull the cluster’s kubeconfig and hand it to developers or wire it into tooling. From that point it is a normal Kubernetes cluster: kubectl, Helm, GitOps, whatever you run elsewhere. The VKS-specific part was getting it built; consuming it is standard Kubernetes, which is exactly the point.
What the ClusterClass variables actually control
The variables block is where most of the real configuration lives, and it is worth knowing what is on offer before you accept the defaults. Beyond the VM class and storage class, the built-in ClusterClass exposes the pod and service CIDRs, the CNI choice (Antrea or Calico), proxy settings for clusters that reach the internet through a corporate proxy, trust bundles for private registries and internal certificate authorities, and node labels and taints. These are not advanced extras. The CIDR fields are the ones that cause the routing failures from the networking part if you leave them on a default that overlaps a corporate subnet, and the trust and proxy fields are what make image pulls work in a locked-down enterprise rather than failing with TLS errors nobody can explain.
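The overlap problem is mechanical enough to check before you ever apply. A minimal sketch using Python's standard `ipaddress` module follows; the CIDR values are invented for illustration, `cidr_clashes` is a hypothetical helper, and it assumes every range is the same address family (mixing IPv4 and IPv6 in one comparison raises a `TypeError`).

```python
import ipaddress

# Sketch: report every planned cluster CIDR that overlaps a known corporate subnet.


def cidr_clashes(cluster_cidrs, corporate_subnets):
    """cluster_cidrs maps a label (e.g. 'pods') to a CIDR string."""
    clashes = []
    for label, cidr in cluster_cidrs.items():
        network = ipaddress.ip_network(cidr)
        for subnet in corporate_subnets:
            if network.overlaps(ipaddress.ip_network(subnet)):
                clashes.append((label, cidr, subnet))
    return clashes


# Invented example values, not defaults from any VKS release.
planned = {"pods": "192.168.0.0/16", "services": "172.20.0.0/16"}
corporate = ["192.168.10.0/24", "10.0.0.0/8"]
print(cidr_clashes(planned, corporate))
# [('pods', '192.168.0.0/16', '192.168.10.0/24')]
```

An empty result is what you want to see before the CIDRs go into the reviewed template.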
Treat the variable list as a checklist you walk once per cluster template, not a thing you discover field by field during an incident. Read the ClusterClass your VKS build ships and note which variables it actually supports, because the set grows between versions and a field that exists in builtin-generic-v3.4.0 may not exist in an older class. Copying a manifest that sets a variable your class does not recognise is a quiet way to have the value silently ignored.
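One way to walk that checklist mechanically is to compare the variable names a manifest sets against the names the ClusterClass defines. The sketch below assumes both objects are parsed into dicts and that the class lists its variables under `spec.variables`, as upstream Cluster API ClusterClasses do; it looks only at the top-level `topology.variables`, not at per-pool overrides. `unsupported_variables` and the `someNewerField` name are hypothetical.

```python
# Sketch: find variables a manifest sets that the chosen ClusterClass does not define.
# Assumes ClusterClass variables live under spec.variables (upstream Cluster API layout).


def unsupported_variables(cluster, cluster_class):
    defined = {
        v["name"] for v in (cluster_class.get("spec") or {}).get("variables", [])
    }
    topology = (cluster.get("spec") or {}).get("topology") or {}
    used = {v["name"] for v in topology.get("variables", [])}
    return sorted(used - defined)


# Trimmed, hypothetical objects for illustration.
cluster_class = {"spec": {"variables": [{"name": "vmClass"}, {"name": "storageClass"}]}}
cluster = {"spec": {"topology": {"variables": [
    {"name": "vmClass", "value": "guaranteed-large"},
    {"name": "someNewerField", "value": "x"},  # made-up name not in this class
]}}}
print(unsupported_variables(cluster, cluster_class))
# ['someNewerField']
```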
Customisations worth setting on day one
A handful of choices are far cheaper to make at creation than to retrofit. Set the pod and service CIDRs from a planned, non-overlapping block rather than the default. Add your private registry and any internal certificate authority to the trust configuration so workloads pull images cleanly from the first deploy. If your nodes reach the internet through a proxy, set it in the cluster variables, not as a hand-edited daemonset later. Decide the CNI now, because it is baked in at provisioning. And label your node pools meaningfully (workload type, environment, cost centre) so scheduling, quotas and chargeback have something to key on.
None of these is exotic, but each one quietly determines whether the cluster is usable in production or merely demo-ready. The pattern I push on clients: maintain one reviewed Cluster manifest per environment in Git, with these fields already correct, and let people change only the cluster name, size and version. That turns cluster creation into filling in three blanks instead of rediscovering the same five settings every time, and it stops the slow drift where every team’s clusters are configured slightly differently and nobody can say why.
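The "three blanks" idea can be made literal. In this sketch the reviewed manifest is held as a Python dict mirroring the example manifest earlier in this part, and `render` (a hypothetical helper) changes only the name, the worker count and the version. Treating "size" as the worker replica count of the first node pool is my simplification.

```python
import copy
import json

# Sketch: one reviewed template, three editable fields.
# The fixed values mirror the example manifest from this article.
BASE = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "CHANGE-ME", "namespace": "team-a"},
    "spec": {
        "topology": {
            "class": "builtin-generic-v3.3.0",
            "version": "CHANGE-ME",
            "controlPlane": {"replicas": 3},
            "workers": {
                "machineDeployments": [
                    {"class": "node-pool", "name": "np-1", "replicas": 3}
                ]
            },
            "variables": [
                {"name": "vmClass", "value": "guaranteed-large"},
                {"name": "storageClass", "value": "vks-storage-policy"},
            ],
        }
    },
}


def render(name, worker_replicas, version):
    """Return a copy of the template with only the three blanks filled in."""
    manifest = copy.deepcopy(BASE)  # never mutate the reviewed template
    manifest["metadata"]["name"] = name
    manifest["spec"]["topology"]["version"] = version
    pool = manifest["spec"]["topology"]["workers"]["machineDeployments"][0]
    pool["replicas"] = worker_replicas
    return manifest


# The version is still the article's placeholder; take the real one from your catalog.
print(json.dumps(render("sprint-42", 2, "v1.31.x---vmware.x"), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into `kubectl apply -f -`. Everything the function does not touch stays exactly as it was reviewed.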
Deleting and recreating clusters cleanly
Because clusters are declarative, deletion is also declarative, and that is mostly a strength. Delete the Cluster object and VKS reconciles the worker and control plane VMs out of existence. The one thing that catches people is persistent volumes: depending on the reclaim policy and how the storage was provisioned, deleting the cluster does not always delete the underlying disks, and you can leave orphaned First Class Disks consuming datastore capacity. Before you tear a stateful cluster down, account for its volumes deliberately, back up anything that matters with Velero, then delete, then confirm in vCenter that the disks actually went away.
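Part of that accounting can be scripted. The sketch below takes the parsed output of `kubectl get pv -o json` from the workload cluster and lists the volumes whose reclaim policy is `Retain`, which are the ones Kubernetes is explicitly told to keep. It catches only the reclaim-policy case, not every way a disk can be orphaned, and `retained_volumes` is a hypothetical name.

```python
# Sketch: list PersistentVolumes that are set to outlive their claims.
# Input is the dict parsed from `kubectl get pv -o json` (a List with "items").


def retained_volumes(pv_list):
    kept = []
    for pv in pv_list.get("items", []):
        spec = pv.get("spec") or {}
        if spec.get("persistentVolumeReclaimPolicy") == "Retain":
            claim = spec.get("claimRef") or {}
            owner = f"{claim.get('namespace', '?')}/{claim.get('name', '?')}"
            kept.append((pv["metadata"]["name"], owner))
    return kept


# Hypothetical, trimmed PV list.
pv_list = {"items": [
    {"metadata": {"name": "pvc-aaa"},
     "spec": {"persistentVolumeReclaimPolicy": "Retain",
              "claimRef": {"namespace": "db", "name": "data-pg-0"}}},
    {"metadata": {"name": "pvc-bbb"},
     "spec": {"persistentVolumeReclaimPolicy": "Delete",
              "claimRef": {"namespace": "web", "name": "cache"}}},
]}
print(retained_volumes(pv_list))
# [('pvc-aaa', 'db/data-pg-0')]
```

Anything on that list needs a decision (back up, migrate, or delete by hand) before the Cluster object goes away.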
This declarative lifecycle is why ephemeral, disposable clusters are a realistic pattern on VKS rather than an aspiration. A dev team can stand up a cluster for a sprint, run their work, and delete it, and the only residue is whatever storage they explicitly chose to keep. Lean into that. Long-lived pet clusters accumulate drift and risk; cattle clusters defined in Git and recreated on demand do not.
What I’d Do
Standardise on the versioned built-in ClusterClass and never let a TKC manifest into a new environment; it is a debt you take on for nothing. Keep a tiny, known-good Cluster manifest in Git as your template, with the class, version and VM class as the only fields anyone edits, and have people query the available releases before they change them. Treat provisioning as asynchronous: apply, then watch the Cluster and Machine objects until they are genuinely Provisioned, and resist re-applying out of impatience. Do that and cluster creation becomes a thirty-second action plus a wait, which is exactly what you want from a platform. When you create your next cluster, are you reading the cluster conditions, or just trusting that a clean apply meant it worked?
References
- Broadcom TechDocs: Using builtin-generic-v3.3.0 ClusterClass
- Broadcom TechDocs: ClusterClass Variables for Customizing a Cluster
- Broadcom TechDocs: Using the legacy tanzukubernetescluster ClusterClass









