Provisioning VKS Clusters: ClusterClass and the Cluster API Workflow (VKS Series, Part 4)

Provisioning a VKS cluster is a Cluster API workflow, not a wizard. Here is the manifest, the deprecated API to avoid, and how to read the cluster lifecycle honestly.

by

Dr. Pranay Jha

June 19, 2026


Provisioning VKS Clusters with ClusterClass

VKS Series · Part 4 of 17

TL;DR · Key Takeaways

You provision a VKS cluster the Cluster API way: write a Cluster manifest that references a built-in ClusterClass, target a vSphere Namespace, and apply it with kubectl.
The TanzuKubernetesCluster API is deprecated. Start on the Cluster v1beta1/v1beta2 API with a versioned builtin-generic class, not on copied legacy YAML.
Pin nothing by hand that the catalog can tell you: the ClusterClass version, the Kubernetes version and the VM classes must come from what your VKS build and content library actually offer.
Watch the Cluster and Machine objects, not the kubectl exit code. The apply records intent; the cluster is ready only when the control plane and nodes report healthy.

Who this is for: platform engineers and DevOps users who will actually create clusters. Prerequisites: a healthy Supervisor with a synced content library (Part 3), and access to a vSphere Namespace with VM classes and a storage policy assigned.

This is where VKS stops being theory. We provision an actual workload cluster, and the mechanics are deliberately Kubernetes-native: you describe the cluster you want as a manifest, apply it, and let the service reconcile reality to match. If you have used Cluster API anywhere else, the shape will feel familiar. If you have not, it is simpler than it looks once you see the moving parts. Most of the friction comes from two avoidable mistakes, using the deprecated API and copying version strings, and this part heads off both.

The API you use, and the one you should leave alone

Current VKS provisions clusters through the upstream Cluster API using the Cluster v1beta1/v1beta2 resource. Your Cluster object references a ClusterClass, and VKS ships built-in, versioned classes (builtin-generic-v3.3.0, builtin-generic-v3.4.0 and so on) that encode a tested cluster topology. You set the Kubernetes version, the node counts and variables such as the VM class and storage class, and the class assembles the cluster. The older TanzuKubernetesCluster (TKC) API was deprecated with the move to ClusterClass. It still exists for backward compatibility through a legacy class, but new clusters should use the Cluster API. This is the trap when you copy YAML from older blog posts: a TKC manifest may still apply, yet you have started on a deprecated path you will have to migrate off. Begin with the versioned built-in class instead.
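One way to keep copied TKC YAML out of a repository is a small pre-merge check. The sketch below assumes the manifest has already been parsed into a Python dict (by whatever YAML loader you use); the function names are hypothetical and it only inspects the `kind` and `apiVersion` fields mentioned above.

```python
# Hypothetical guard: flag manifests that still use the deprecated TKC kind.
LEGACY_KINDS = {"TanzuKubernetesCluster"}


def is_legacy_manifest(manifest: dict) -> bool:
    """Return True if the parsed manifest uses the deprecated TKC kind."""
    return manifest.get("kind") in LEGACY_KINDS


def uses_cluster_api(manifest: dict) -> bool:
    """Return True for a Cluster object in the cluster.x-k8s.io API group."""
    api_version = manifest.get("apiVersion", "")
    return manifest.get("kind") == "Cluster" and api_version.startswith(
        "cluster.x-k8s.io/"
    )
```

A check like this only tells you which API a manifest targets; it says nothing about whether the class or version inside it exists in your environment.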

The workflow, end to end

Log in, target the namespace, apply, let VKS reconcile, then consume. Steps 1-3 are yours; step 4 is the service.

You authenticate to the Supervisor, which gives kubectl a context for each vSphere Namespace you can reach. You switch context to the target namespace, then apply a Cluster manifest. A minimal manifest names the cluster, picks a ClusterClass and Kubernetes version, and declares the control plane and worker topology with the VM classes and storage class to use:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod
  namespace: team-a
spec:
  topology:
    class: builtin-generic-v3.3.0      # from your VKS build
    version: v1.31.x---vmware.x        # from your content library
    controlPlane:
      replicas: 3                      # 3 for HA, 1 for throwaway
    workers:
      machineDeployments:
      - class: node-pool
        name: np-1
        replicas: 3
    variables:
    - name: vmClass
      value: guaranteed-large          # must exist in the namespace
    - name: storageClass
      value: vks-storage-policy         # maps to an SPBM policy

Apply that, and VKS takes over: it provisions the control plane VMs, then the workers in the node pool, installs the CNI, and brings the cluster to Ready. The exact class and version strings come from what your content library and VKS version offer, so list the available releases rather than copying a version verbatim. That one habit prevents a surprising share of provisioning failures.
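The "query first, then write" habit can be turned into a mechanical check. This is a minimal sketch, assuming you have already collected the class names and version strings your environment offers into two sets; how you collect them depends on your VKS build, and the function name is hypothetical.

```python
def check_topology(topology: dict, offered_classes: set, offered_versions: set) -> list:
    """Compare spec.topology against what the environment offers.

    Returns a list of human-readable problems; an empty list means both
    strings were found in the sets supplied by the caller.
    """
    problems = []
    cls = topology.get("class")
    version = topology.get("version")
    if cls not in offered_classes:
        problems.append(f"ClusterClass {cls!r} is not offered here")
    if version not in offered_versions:
        problems.append(f"Kubernetes version {version!r} is not offered here")
    return problems
```

Run against the manifest above, the placeholder `v1.31.x---vmware.x` would be reported as a problem unless that literal string is in the offered set, which is the point: placeholders and copied versions fail the check before they reach the Supervisor.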

Lifecycle states, read honestly

A cluster does not go from absent to ready instantly. It moves through phases as the control plane provisions, the workers join, and add-ons settle. The mistake is treating a clean kubectl apply as success: the apply only records intent. Watch the objects:

What you check	Healthy sign	If it stalls
`Cluster` phase	Provisioning → Provisioned	Read the Ready condition message
Control plane `Machine`s	All Running, API reachable	Capacity or storage policy (Part 15)
Worker `Machine`s	Joined and Ready	Pending usually means no room to place

Once ready, you pull the cluster’s kubeconfig and hand it to developers or wire it into tooling. From that point it is a normal Kubernetes cluster: kubectl, Helm, GitOps, whatever you run elsewhere. The VKS-specific part was getting it built; consuming it is standard Kubernetes, which is exactly the point.

Version strings are not portable: the ClusterClass and Kubernetes version available to you depend on your VKS build and your content library. Always query what is offered before you write the manifest; a version copied from another environment is the single most common reason a first apply does nothing.

What the ClusterClass variables actually control

The variables block is where most of the real configuration lives, and it is worth knowing what is on offer before you accept the defaults. Beyond the VM class and storage class, the built-in ClusterClass exposes the pod and service CIDRs, the CNI choice (Antrea or Calico), proxy settings for clusters that reach the internet through a corporate proxy, trust bundles for private registries and internal certificate authorities, and node labels and taints. These are not advanced extras. The CIDR fields are the ones that cause the routing failures from the networking part if you leave them on a default that overlaps a corporate subnet, and the trust and proxy fields are what make image pulls work in a locked-down enterprise rather than failing with TLS errors nobody can explain.

Treat the variable list as a checklist you walk once per cluster template, not a thing you discover field by field during an incident. Read the ClusterClass your VKS build ships and note which variables it actually supports, because the set grows between versions and a field that exists in builtin-generic-v3.4.0 may not exist in an older class. Copying a manifest that sets a variable your class does not define is an easy way to end up with a rejected apply or a setting that never takes effect.
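Comparing a manifest against the class it references is easy to automate. The sketch below assumes both objects are available as parsed dicts and that the ClusterClass lists its variables under `spec.variables` by `name`, as upstream Cluster API classes do; it only looks at cluster-wide variables, not per-node-pool overrides. The function name is hypothetical.

```python
def unsupported_variables(cluster: dict, cluster_class: dict) -> list:
    """Return variable names set on the Cluster but not defined by the class."""
    supported = {
        var["name"] for var in cluster_class.get("spec", {}).get("variables", [])
    }
    used = [
        var["name"]
        for var in cluster.get("spec", {}).get("topology", {}).get("variables", [])
    ]
    return [name for name in used if name not in supported]
```

An empty result means every variable in the manifest is one the class knows about; it does not validate the values themselves.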

Customisations worth setting on day one

A handful of choices are far cheaper to make at creation than to retrofit. Set the pod and service CIDRs from a planned, non-overlapping block rather than the default. Add your private registry and any internal certificate authority to the trust configuration so workloads pull images cleanly from the first deploy. If your nodes reach the internet through a proxy, set it in the cluster variables, not as a hand-edited daemonset later. Decide the CNI now, because it is baked in at provisioning. And label your node pools meaningfully (workload type, environment, cost centre) so scheduling, quotas and chargeback have something to key on.
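The "planned, non-overlapping block" requirement is simple to verify before a manifest is applied. A minimal sketch using Python's standard `ipaddress` module follows; the function name and the CIDR values in the usage note are illustrative, not VKS defaults.

```python
import ipaddress


def overlapping_pairs(named_cidrs: dict) -> list:
    """Return sorted (name, name) pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    names = sorted(nets)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                pairs.append((first, second))
    return pairs
```

For example, calling it with `{"pods": "192.168.0.0/16", "services": "10.96.0.0/12", "corporate": "10.100.0.0/16"}` reports the `corporate` and `services` ranges as overlapping, because 10.100.0.0/16 sits inside 10.96.0.0/12.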

None of these is exotic, but each one quietly determines whether the cluster is usable in production or merely demo-ready. The pattern I push on clients: maintain one reviewed Cluster manifest per environment in Git, with these fields already correct, and let people change only the cluster name, size and version. That turns cluster creation into filling in three blanks instead of rediscovering the same five settings every time, and it stops the slow drift where every team’s clusters are configured slightly differently and nobody can say why.
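The "three blanks" idea can be sketched with nothing more than the standard library. This is an illustration under assumptions, not a recommended tool: the template reuses the class and namespace from the manifest earlier in this part, a real template would also carry the reviewed variables, and `render_cluster` is a hypothetical helper.

```python
from string import Template

# Reviewed template: only name, worker count and version are left open.
CLUSTER_TEMPLATE = Template("""\
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: $name
  namespace: team-a
spec:
  topology:
    class: builtin-generic-v3.3.0
    version: $version
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
      - class: node-pool
        name: np-1
        replicas: $workers
""")


def render_cluster(name: str, workers: int, version: str) -> str:
    """Fill the three blanks; everything else stays as reviewed."""
    if workers < 1:
        raise ValueError("a node pool needs at least one worker")
    return CLUSTER_TEMPLATE.substitute(name=name, workers=workers, version=version)
```

In practice most teams would reach for Kustomize, Helm or a GitOps overlay for this; the sketch only shows how small the editable surface can be.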

Deleting and recreating clusters cleanly

Because clusters are declarative, deletion is also declarative, and that is mostly a strength. Delete the Cluster object and VKS reconciles the worker and control plane VMs out of existence. The one thing that catches people is persistent volumes: depending on the reclaim policy and how the storage was provisioned, deleting the cluster does not always delete the underlying disks, and you can leave orphaned First Class Disks consuming datastore capacity. Before you tear a stateful cluster down, account for its volumes deliberately: back up anything that matters with Velero, delete the cluster, then confirm in vCenter that the disks actually went away.
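Part of "accounting for volumes" is knowing which ones are likely to outlive the cluster. The sketch below assumes you have the workload cluster's PersistentVolume list as parsed JSON (a standard Kubernetes list with `items`) and reads the standard `spec.persistentVolumeReclaimPolicy` field; the function name is hypothetical.

```python
def volumes_needing_review(pv_list: dict) -> list:
    """Return names of PersistentVolumes whose reclaim policy is not Delete."""
    names = []
    for pv in pv_list.get("items", []):
        policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy")
        if policy != "Delete":
            names.append(pv["metadata"]["name"])
    return names
```

Treat the result as a starting list, not a guarantee: a `Delete` policy does not prove the disk was cleaned up, so the check in vCenter after deletion still applies.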

This declarative lifecycle is why ephemeral, disposable clusters are a realistic pattern on VKS rather than an aspiration. A dev team can stand up a cluster for a sprint, run their work, and delete it, and the only residue is whatever storage they explicitly chose to keep. Lean into that. Long-lived pet clusters accumulate drift and risk; cattle clusters defined in Git and recreated on demand do not.

What I’d Do

Standardise on the versioned built-in ClusterClass and never let a TKC manifest into a new environment; it is a debt you take on for nothing. Keep a tiny, known-good Cluster manifest in Git as your template, with the class, version and VM class as the only fields anyone edits, and have people query the available releases before they change them. Treat provisioning as asynchronous: apply, then watch the Cluster and Machine objects until they are genuinely Provisioned, and resist re-applying out of impatience. Do that and cluster creation becomes a thirty-second action plus a wait, which is exactly what you want from a platform. When you create your next cluster, are you reading the cluster conditions, or just trusting that a clean apply meant it worked?


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
