TL;DR · Key Takeaways
- Day-2 is where a VKS platform is won or lost: backup, multi-tenancy and capacity are the routine disciplines that keep it healthy long after launch.
- Draw a hard line between infrastructure and applications. The Supervisor is infrastructure and is not backed up with Velero; VKS workload clusters and their objects are applications and are backed up with Velero (now a CNCF Sandbox project).
- Velero captures Kubernetes resources plus persistent-volume data via File System Backup or CSI snapshots. VKS cluster management can back up whole clusters and namespaces.
- Multi-tenancy runs from separate clusters per tenant to shared clusters with namespace isolation. Capacity is governed by quotas and host headroom; manage both before they bite.
Standing a platform up is the easy part. Running it for two years without it quietly degrading is the real test, and that is day-2 operations: the unglamorous routines of backing things up, sharing the estate fairly, and keeping enough headroom that nothing falls over. VKS gives you concrete tools for each, and this part covers the three that matter most, starting with the one teams most often get subtly wrong.
Backup: draw a hard line between infra and apps
This is where the cleanest designs and the messiest ones diverge. Keep a hard line between infrastructure and applications. The Supervisor is infrastructure and is not backed up with Velero: it is rebuilt, not restored. VKS workload clusters and their Kubernetes objects are applications and are backed up with Velero, which Broadcom recently contributed to the CNCF as a Sandbox project. Treat those as two separate runbooks with two separate owners. Teams that try to make one tool cover both end up with a backup that restores neither cleanly.
Velero backs up Kubernetes resources (deployments, statefulsets, services, secrets, configmaps) and, for stateful workloads, also captures persistent-volume data via File System Backup (FSB) or the CSI snapshot method. Above the workload level, VKS cluster management can back up and restore whole VKS clusters and their namespaces, using Velero under the hood. Think of it as two altitudes, with the Supervisor sitting outside both:
| Scope | Tool | Captures |
|---|---|---|
| Application / namespace | Velero | K8s resources + PV data (FSB or CSI snapshot) |
| Whole cluster | VKS cluster management | Cluster + namespaces with PV snapshots |
| Supervisor | Config + rebuild | Not Velero; document the rebuild |
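
The application-level row of the table can be sketched as a Velero `Backup` object. This is a minimal illustration rather than a definitive VKS procedure: it builds the manifest as a plain Python dict and prints JSON that `kubectl apply -f -` would accept. The `velero` install namespace, the backup name, the `orders` namespace and the 30-day TTL are all assumptions made for the example.

```python
import json


def velero_backup(name, namespaces, use_fs_backup=False, ttl="720h0m0s"):
    """Build a velero.io/v1 Backup manifest as a plain dict.

    use_fs_backup=True asks Velero to copy volume data with File System
    Backup; False leaves volume capture to snapshots.
    """
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        # Assumption: Velero is installed in the "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "includedNamespaces": list(namespaces),
            "snapshotVolumes": not use_fs_backup,
            "defaultVolumesToFsBackup": use_fs_backup,
            "ttl": ttl,  # retention, written as a Go duration string
        },
    }


if __name__ == "__main__":
    # Hypothetical workload namespace "orders".
    print(json.dumps(velero_backup("orders-manual", ["orders"]), indent=2))
```

Keeping the manifest in code or in Git, rather than in someone's shell history, is what makes the application runbook repeatable.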
Multi-tenancy and capacity
How you share the platform is a spectrum, and the right point depends on how much you trust your tenants and how much overhead you can absorb. At one end, a separate VKS cluster per tenant gives the strongest isolation (its own control plane, RBAC and failure domain) at the cost of more clusters to run and upgrade. At the other, a shared cluster carved into namespaces is efficient but demands discipline: network policy, resource quotas and Pod Security Admission (all from Part 10) must do the work that separate clusters would do for free. The vSphere Namespace is the coarse tenancy boundary above the cluster; namespaces inside a cluster are the fine one. Most mature platforms land in the middle (a cluster per environment or major tenant, with namespace isolation inside) and avoid the extreme of one giant shared cluster whose blast radius is the whole organisation.
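For the shared-cluster end of that spectrum, the three controls named above map onto three standard Kubernetes objects per tenant namespace. The sketch below generates them as plain dicts; the tenant name, the quota figures and the choice of the `restricted` Pod Security level are illustrative assumptions, not recommendations.

```python
def tenant_namespace_manifests(ns, cpu, memory):
    """Return Namespace, ResourceQuota and default-deny NetworkPolicy
    manifests for one tenant namespace inside a shared cluster."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": ns,
            # Pod Security Admission is driven by namespace labels.
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": ns},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": ns},
        # An empty podSelector matches every pod in the namespace; listing
        # both policy types with no rules denies all traffic by default.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    return [namespace, quota, default_deny]


if __name__ == "__main__":
    # Hypothetical tenant "team-a" with an illustrative entitlement.
    for manifest in tenant_namespace_manifests("team-a", "16", "64Gi"):
        print(manifest["kind"], manifest["metadata"]["name"])
```

Generating all three from one function means a tenant namespace never exists without its quota and its default-deny policy, which is the discipline the shared model depends on.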
Capacity is governed by namespace quotas and by the physical headroom of the hosts behind them. The two routine failures are a tenant hitting its quota (their autoscaler stalls, their pods stay pending) and the underlying cluster running out of host capacity to place any more node VMs. Stay ahead of both: set quotas that reflect real entitlements, watch utilisation in VCF Operations (Part 11), and keep enough host headroom that rolling upgrades and autoscaler scale-ups always have somewhere to land. Capacity you only notice when it runs out is capacity you managed too late.
Designing a backup schedule and retention
A backup tool is only as useful as the schedule behind it, and the default of “someone runs Velero occasionally” is not a schedule. Decide a recovery point objective (RPO) per workload class, meaning how much data you can afford to lose, and let that drive frequency: a transactional database may need frequent backups with short intervals, while a stateless service needs little more than its manifests captured. Set retention to match the recovery you actually need, not the most you can store, because keeping a year of daily backups for a dev cluster is cost with no payoff. Velero supports scheduled backups, so encode this rather than relying on memory. Store the backups in object storage that is not in the same failure domain as the cluster (an off-cluster, ideally off-site target), because a backup that dies with the thing it was protecting is theatre.
Separate the schedules by tier so production and dev are not treated identically. The point of a per-class RPO is that you spend backup effort where data loss actually hurts. A clear matrix (workload class to RPO to retention to target) is worth writing down once and is exactly the kind of thing an auditor will ask to see.
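One way to write that matrix down is as data that also generates the Velero `Schedule` objects, so the document and the configuration cannot drift apart. The tiers, cron expressions, retention values and the `offsite-s3` storage location name below are hypothetical placeholders; substitute the RPOs your own workload classes need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    name: str      # workload class
    cron: str      # backup frequency, which bounds the achievable RPO
    ttl: str       # retention, as a Go duration string
    location: str  # name of a Velero BackupStorageLocation (hypothetical)


# Illustrative matrix: workload class -> frequency -> retention -> target.
TIERS = [
    BackupTier("prod-db", "0 * * * *", "168h0m0s", "offsite-s3"),
    BackupTier("prod-stateless", "0 2 * * *", "336h0m0s", "offsite-s3"),
    BackupTier("dev", "0 3 * * 0", "336h0m0s", "offsite-s3"),
]


def velero_schedule(tier, namespaces):
    """Build a velero.io/v1 Schedule manifest for one tier."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Assumption: Velero is installed in the "velero" namespace.
        "metadata": {"name": f"{tier.name}-schedule", "namespace": "velero"},
        "spec": {
            "schedule": tier.cron,
            # The template is an ordinary Backup spec applied on each run.
            "template": {
                "includedNamespaces": list(namespaces),
                "ttl": tier.ttl,
                "storageLocation": tier.location,
            },
        },
    }


if __name__ == "__main__":
    for tier in TIERS:
        s = velero_schedule(tier, [tier.name])
        print(s["metadata"]["name"], s["spec"]["schedule"], s["spec"]["template"]["ttl"])
```

The same `TIERS` list can be rendered into the table an auditor asks for, which keeps the written policy and the running schedules in step.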
What a restore actually restores, and what it does not
The dangerous assumption is that a backup captures everything, and it does not. Velero captures the Kubernetes objects and, for stateful workloads, the persistent-volume data via file system backup or CSI snapshots. What it does not magically restore is anything that lived outside those objects: external dependencies, data in a managed service the cluster only pointed at, secrets that were referenced from an external store rather than held in the cluster, or DNS and load balancer state that the rest of your environment owns. A restore of the cluster brings back the Kubernetes view; it does not rebuild the world around the cluster. So a real recovery plan accounts for the dependencies too, and the only way to know your plan is complete is to test it.
This is why the restore drill is non-negotiable rather than nice-to-have. Restoring into a scratch namespace or a scratch cluster on a cadence is what surfaces the gaps: the volume that was not actually captured, the external secret that the restored pod cannot reach, the assumption that did not hold. A backup you have never restored is a hypothesis. The first time you find out whether it works should not be during the incident it was supposed to cover.
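A drill into a scratch namespace can be sketched as a Velero `Restore` object that remaps the source namespace, plus a simple comparison of what you expected to come back against what actually did. The backup name, the namespace names and the `drill_gaps` helper are hypothetical; how you collect the list of restored objects (for example from `kubectl get pvc`) is left out of the sketch.

```python
def velero_restore_drill(backup_name, source_ns, scratch_ns):
    """Build a velero.io/v1 Restore that lands source_ns in scratch_ns."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        # Assumption: Velero is installed in the "velero" namespace.
        "metadata": {"name": f"drill-{backup_name}", "namespace": "velero"},
        "spec": {
            "backupName": backup_name,
            "includedNamespaces": [source_ns],
            # Remap so the drill never touches the live namespace.
            "namespaceMapping": {source_ns: scratch_ns},
            "restorePVs": True,
        },
    }


def drill_gaps(expected, restored):
    """Return, sorted, the names that were expected but did not come back."""
    return sorted(set(expected) - set(restored))


if __name__ == "__main__":
    restore = velero_restore_drill("orders-nightly", "orders", "orders-drill")
    print(restore["spec"]["namespaceMapping"])
    # Hypothetical PVC names: one volume was never captured.
    print(drill_gaps(["pg-data", "pg-wal"], ["pg-data"]))
```

Recording the output of each drill gives you a history of which gaps were found and closed, which is better evidence than a green backup status.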
Quota models and chargeback for multi-tenancy
Multi-tenancy only works if the quotas behind it reflect real entitlements, and that is as much an organisational decision as a technical one. The vSphere Namespace quota is the lever: it caps CPU, memory and storage a tenant can consume, which is what stops one team starving the rest and what makes the autoscaler’s maximum meaningful. Set those quotas to what a tenant has actually been allocated, and on three-zone Supervisors remember the quota draws from all three clusters, so the math is a shared constraint rather than a single pool. Because consumption is attributable at the namespace level, you also have the raw material for chargeback or at least showback, telling each team what they are using, which changes behaviour faster than any policy document.
The capacity signals worth alerting on follow directly: a tenant approaching its quota (their growth is about to stall), and the underlying clusters approaching host capacity (nobody’s growth can be satisfied). Catch both as warnings in VCF Operations before they become stalled deployments. A platform that reports utilisation per tenant and warns before the ceiling is one that scales gracefully; one that discovers it is full when an autoscaler stalls is one that lurches from surprise to surprise.
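The arithmetic behind showback and those two warnings is simple enough to sketch. The example below is not how VCF Operations computes anything; it only illustrates the logic, and the 80% warning threshold, the tenant names and the usage figures are assumptions.

```python
def showback(usage, quotas):
    """Percent of quota consumed per namespace.

    usage and quotas both map namespace -> (cpu_cores, memory_gib).
    """
    report = {}
    for ns, (cpu_used, mem_used) in usage.items():
        cpu_quota, mem_quota = quotas[ns]
        report[ns] = {
            "cpu_pct": round(100 * cpu_used / cpu_quota, 1),
            "mem_pct": round(100 * mem_used / mem_quota, 1),
        }
    return report


def capacity_alerts(report, host_pct, warn=80.0):
    """Warn on tenants near quota and on hosts near capacity."""
    alerts = []
    for ns in sorted(report):
        # A tenant stalls on whichever resource runs out first.
        worst = max(report[ns]["cpu_pct"], report[ns]["mem_pct"])
        if worst >= warn:
            alerts.append(f"tenant {ns} at {worst:.0f}% of quota")
    if host_pct >= warn:
        alerts.append(f"hosts at {host_pct:.0f}% of capacity")
    return alerts


if __name__ == "__main__":
    usage = {"team-a": (14, 40), "team-b": (4, 16)}
    quotas = {"team-a": (16, 64), "team-b": (16, 64)}
    report = showback(usage, quotas)
    print(report)
    print(capacity_alerts(report, host_pct=72.0))
```

The per-tenant percentages double as the showback figure, so the number a team sees in its report is the same number that triggers its warning.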
What I’d Do
I write two backup documents from day one, one for the Supervisor as rebuildable infrastructure and one for VKS workloads with Velero, and I never let a tool or an owner straddle the line. I schedule a real restore drill into a scratch namespace on a cadence, because an untested backup is a guess. For tenancy I default to a cluster per environment or major tenant with disciplined namespace isolation inside, reserving separate-cluster-per-tenant for the cases that genuinely warrant it and refusing the one-giant-shared-cluster anti-pattern. And I manage capacity through leading indicators in VCF Operations (quota utilisation and host headroom) rather than discovering it at the moment an autoscaler stalls. Day-2 is unglamorous and it is exactly where platforms quietly succeed or rot. Honest question: have you ever actually restored one of your VKS backups, or are you trusting that it would work?
References
- Broadcom TechDocs: Backing Up and Restoring VKS Cluster Workloads
- Broadcom TechDocs: Backup and Restore of VKS Clusters with VKS Cluster Management
- CormacHogan.com: Manually Backing Up VKS Clusters Using Velero









