Dr. Pranay Jha


VKS Day-2 Operations: Backup, Multi-Tenancy and Capacity (VKS Series, Part 16)

Day-2 is where a VKS platform quietly succeeds or rots. Here is the hard line between infrastructure and app backup, the multi-tenancy spectrum, and the capacity math that bites late.

VKS Series · Part 16 of 17

TL;DR · Key Takeaways

  • Day-2 is where a VKS platform is won or lost: backup, multi-tenancy and capacity are the routine disciplines that keep it healthy long after launch.
  • Draw a hard line between infrastructure and applications. The Supervisor is infrastructure and is not backed up with Velero. VKS workload clusters and their objects are applications and are backed up with Velero (now a CNCF Sandbox project).
  • Velero captures Kubernetes resources plus persistent-volume data via File System Backup or CSI snapshots. VKS cluster management can back up whole clusters and namespaces.
  • Multi-tenancy runs from separate clusters per tenant to shared clusters with namespace isolation. Capacity is governed by quotas and host headroom; manage both before they bite.
Who this is for: the team that has to run the platform for years, not just stand it up.
Prerequisites: Parts 5 and 10 (sizing/quota and isolation), and a backup target you control.

Standing a platform up is the easy part. Running it for two years without it quietly degrading is the real test, and that is day-2 operations: the unglamorous routines of backing things up, sharing the estate fairly, and keeping enough headroom that nothing falls over. VKS gives you concrete tools for each, and this part covers the three that matter most, starting with the one teams most often get subtly wrong.

Backup: draw a hard line between infra and apps

This is where the cleanest designs and the messiest ones diverge. Keep a hard line between infrastructure and applications. The Supervisor is infrastructure and is not backed up with Velero; it is rebuilt, not restored. VKS workload clusters and their Kubernetes objects are applications and are backed up with Velero, which Broadcom recently contributed to the CNCF as a Sandbox project. Treat those as two separate runbooks with two separate owners. Teams that try to make one tool cover both end up with a backup that restores neither cleanly.

Figure: Two runbooks, two owners.
  • Infrastructure (Supervisor): the platform control plane. Not backed up with Velero; rebuilt from config, not restored.
  • Applications (VKS workloads): Kubernetes objects plus persistent-volume data. Backed up with Velero (FSB or CSI); a tenant / platform responsibility.
One tool for both restores neither cleanly. Keep them as two separate documents.
Caption: Back up the Supervisor as infrastructure and VKS workloads as applications, never with a single tool.

Velero backs up Kubernetes resources (deployments, statefulsets, services, secrets, configmaps) and, for stateful workloads, also captures persistent-volume data via File System Backup (FSB) or the CSI snapshot method. Above the workload level, VKS cluster management can back up and restore whole VKS clusters and their namespaces, using Velero under the hood. Think of it as two altitudes, plus the Supervisor, which sits outside Velero entirely:

| Scope | Tool | Captures |
| --- | --- | --- |
| Application / namespace | Velero | K8s resources + PV data (FSB or CSI snapshot) |
| Whole cluster | VKS cluster management | Cluster + namespaces with PV snapshots |
| Supervisor | Config + rebuild | Not Velero; document the rebuild |
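To make the application-level row concrete, here is a minimal sketch of a Velero Backup object built as a plain Python dict. It assumes Velero is installed in a namespace called `velero`, that CSI snapshot support is enabled in your Velero installation if you choose that method, and that the backup name, namespace list and TTL are placeholders. Treat it as an illustration of the FSB-versus-CSI choice, not as the way VKS cluster management drives Velero internally.

```python
import json


def backup_manifest(name, namespaces, pv_method="csi", ttl="720h0m0s"):
    """Return a velero.io/v1 Backup object for the given namespaces.

    pv_method: "fsb" for File System Backup, "csi" for volume snapshots.
    All names and the TTL are illustrative placeholders.
    """
    if pv_method not in ("fsb", "csi"):
        raise ValueError("pv_method must be 'fsb' or 'csi'")
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        # Assumption: Velero runs in its conventional "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "includedNamespaces": list(namespaces),
            "ttl": ttl,
            # FSB copies volume contents at file level.
            "defaultVolumesToFsBackup": pv_method == "fsb",
            # Otherwise rely on volume snapshots (CSI, if enabled).
            "snapshotVolumes": pv_method == "csi",
        },
    }


manifest = backup_manifest("orders-daily", ["orders"], pv_method="fsb")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be reviewed and applied with your usual tooling. Keeping the method as an explicit parameter forces the decision per workload rather than leaving it to a default nobody remembers choosing.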

Multi-tenancy and capacity

How you share the platform is a spectrum, and the right point depends on how much you trust your tenants and how much overhead you can absorb. At one end, a separate VKS cluster per tenant gives the strongest isolation (its own control plane, RBAC and failure domain) at the cost of more clusters to run and upgrade. At the other, a shared cluster carved into namespaces is efficient but demands discipline: network policy, resource quotas and Pod Security Admission (all from Part 10) have to do the work that separate clusters would do for free. The vSphere Namespace is the coarse tenancy boundary above the cluster; namespaces inside a cluster are the fine one. Most mature platforms land in the middle: a cluster per environment or major tenant, with namespace isolation inside. They avoid the extreme of one giant shared cluster whose blast radius is the whole organisation.

Capacity is governed by namespace quotas and by the physical headroom of the hosts behind them. The two routine failures are a tenant hitting its quota (their autoscaler stalls, their pods stay pending) and the underlying cluster running out of host capacity to place any more node VMs. Stay ahead of both: set quotas that reflect real entitlements, watch utilisation in VCF Operations (Part 11), and keep enough host headroom that rolling upgrades and autoscaler scale-ups always have somewhere to land. Capacity you only notice when it runs out is capacity you managed too late.
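The host-headroom half of that check is simple arithmetic, sketched below. The figures are invented, and the model is deliberately crude: it assumes each node VM must fit wholly on one host and ignores overcommit, HA admission control and reservations, all of which your real environment will apply.

```python
def spare_node_slots(host_free_vcpu, host_free_mem_gib, node_vcpu, node_mem_gib):
    """Count how many more node VMs fit, placing each VM wholly on one host."""
    slots = 0
    for free_vcpu, free_mem in zip(host_free_vcpu, host_free_mem_gib):
        # A host is limited by whichever resource runs out first.
        slots += min(free_vcpu // node_vcpu, free_mem // node_mem_gib)
    return slots


def has_headroom(slots, surge_nodes, autoscaler_extra):
    """True if a rolling upgrade surge and a scale-up can both land."""
    return slots >= surge_nodes + autoscaler_extra


# Hypothetical three-host cluster: free vCPU and free memory (GiB) per host.
free_vcpu = [16, 8, 4]
free_mem = [64, 24, 40]

slots = spare_node_slots(free_vcpu, free_mem, node_vcpu=4, node_mem_gib=16)
print(slots)                         # 6
print(has_headroom(slots, 1, 3))     # True
print(has_headroom(slots, 2, 5))     # False
```

Note that the second host has 8 free vCPU but only 24 GiB free, so it takes one node VM rather than two. Summing free capacity across hosts would overstate what can actually be placed.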

Test your restores: backups that have never been restored are theatre, and the day you actually need one is a terrible time to discover the volume data was not captured. Schedule a periodic restore drill into a scratch namespace; it is the cheapest insurance on the platform.

Designing a backup schedule and retention

A backup tool is only as useful as the schedule behind it, and the default of “someone runs Velero occasionally” is not a schedule. Decide a recovery point objective (RPO) per workload class, meaning how much data you can afford to lose, and let that drive frequency: a transactional database may need backups at short intervals, while a stateless service needs little more than its manifests captured. Set retention to match the recovery you actually need, not the most you can store, because keeping a year of daily backups for a dev cluster is cost with no payoff. Velero supports scheduled backups, so encode this rather than relying on memory. Store the backups in object storage that is not in the same failure domain as the cluster (an off-cluster, ideally off-site, target), because a backup that dies with the thing it was protecting is theatre.

Separate the schedules by tier so production and dev are not treated identically. The point of a per-class RPO is that you spend backup effort where data loss actually hurts. A clear matrix (workload class, RPO, retention, target) is worth writing down once and is exactly the kind of thing an auditor will ask to see.
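One way to keep that matrix honest is to hold it as data and generate the Velero Schedule objects from it. The sketch below does that in Python. The tiers, cron expressions, retention values and storage-location names are all hypothetical, and the `velero` namespace is assumed; substitute your own RPOs and targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    workload_class: str
    cron: str       # frequency derived from the RPO
    ttl: str        # retention, as a Velero duration string
    location: str   # name of a BackupStorageLocation (hypothetical)


# Hypothetical matrix: class -> RPO (as cron) -> retention -> target.
MATRIX = [
    BackupTier("prod-database", "0 * * * *", "720h0m0s", "offsite-s3"),
    BackupTier("prod-stateless", "0 2 * * *", "336h0m0s", "offsite-s3"),
    BackupTier("dev", "0 3 * * 0", "168h0m0s", "onsite-s3"),
]


def schedule_manifest(tier, namespaces):
    """Return a velero.io/v1 Schedule object for one tier of the matrix."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Assumption: Velero runs in its conventional "velero" namespace.
        "metadata": {
            "name": f"{tier.workload_class}-schedule",
            "namespace": "velero",
        },
        "spec": {
            "schedule": tier.cron,
            # The template is an ordinary Backup spec applied on each run.
            "template": {
                "includedNamespaces": list(namespaces),
                "ttl": tier.ttl,
                "storageLocation": tier.location,
            },
        },
    }


prod_db = schedule_manifest(MATRIX[0], ["orders"])
print(prod_db["metadata"]["name"], prod_db["spec"]["schedule"])
```

Because the matrix is a single list, it doubles as the document an auditor asks for: the schedule running in the cluster and the table on paper cannot drift apart.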

What a restore actually restores, and what it does not

The dangerous assumption is that a backup captures everything, and it does not. Velero captures the Kubernetes objects and, for stateful workloads, the persistent-volume data via file system backup or CSI snapshots. What it does not magically restore is anything that lived outside those objects: external dependencies, data in a managed service the cluster only pointed at, secrets that were referenced from an external store rather than held in the cluster, or DNS and load balancer state that the rest of your environment owns. A restore of the cluster brings back the Kubernetes view; it does not rebuild the world around the cluster. So a real recovery plan accounts for the dependencies too, and the only way to know your plan is complete is to test it.

This is why the restore drill is non-negotiable rather than nice-to-have. Restoring into a scratch namespace or a scratch cluster on a cadence is what surfaces the gaps: the volume that was not actually captured, the external secret that the restored pod cannot reach, the assumption that did not hold. A backup you have never restored is a hypothesis. The first time you find out whether it works should not be during the incident it was supposed to cover.
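A drill into a scratch namespace can be expressed as a Velero Restore object that remaps the source namespace. The following is a minimal sketch, assuming Velero in the `velero` namespace and using placeholder backup and namespace names. It only builds the object; verifying that the restored pods can reach their external dependencies is still a manual or scripted check of your own.

```python
def restore_drill_manifest(backup_name, source_ns, scratch_ns):
    """Return a velero.io/v1 Restore that lands source_ns in scratch_ns."""
    if source_ns == scratch_ns:
        # A drill must never restore over the live namespace.
        raise ValueError("scratch namespace must differ from the source")
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        # Assumption: Velero runs in its conventional "velero" namespace.
        "metadata": {"name": f"drill-{backup_name}", "namespace": "velero"},
        "spec": {
            "backupName": backup_name,
            "includedNamespaces": [source_ns],
            # Redirect restored objects into the scratch namespace.
            "namespaceMapping": {source_ns: scratch_ns},
            # Include volume data, since that is what drills usually expose.
            "restorePVs": True,
        },
    }


drill = restore_drill_manifest("orders-daily", "orders", "orders-drill")
print(drill["spec"]["namespaceMapping"])
```

The guard against restoring into the source namespace is the one piece of logic worth keeping even in a sketch: a drill that overwrites production is worse than no drill.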

Quota models and chargeback for multi-tenancy

Multi-tenancy only works if the quotas behind it reflect real entitlements, and that is as much an organisational decision as a technical one. The vSphere Namespace quota is the lever: it caps the CPU, memory and storage a tenant can consume, which is what stops one team starving the rest and what makes the autoscaler’s maximum meaningful. Set those quotas to what a tenant has actually been allocated. On three-zone Supervisors, remember the quota draws from all three clusters, so the math is a shared constraint rather than a single pool. Because consumption is attributable at the namespace level, you also have the raw material for chargeback, or at least showback: telling each team what it is using changes behaviour faster than any policy document.

The capacity signals worth alerting on follow directly: a tenant approaching its quota (their growth is about to stall), and the underlying clusters approaching host capacity (nobody’s growth can be satisfied). Catch both as warnings in VCF Operations before they become stalled deployments. A platform that reports utilisation per tenant and warns before the ceiling is one that scales gracefully; one that discovers it is full when an autoscaler stalls is one that lurches from surprise to surprise.
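The tenant-side signal reduces to a ratio and a threshold. The sketch below classifies tenants by quota utilisation. The 80% warning threshold and the tenant figures are assumptions for illustration, and in practice the numbers would come from your monitoring system rather than a hard-coded dict.

```python
def quota_status(used, limit, warn_at=0.8):
    """Classify quota utilisation as 'ok', 'warning' or 'exhausted'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= 1.0:
        return "exhausted"   # growth has already stalled
    if ratio >= warn_at:
        return "warning"     # growth is about to stall
    return "ok"


# Hypothetical memory usage per tenant: (used GiB, quota GiB).
tenants = {
    "team-a": (70, 100),
    "team-b": (85, 100),
    "team-c": (128, 128),
}

report = {name: quota_status(used, limit) for name, (used, limit) in tenants.items()}
print(report)
# {'team-a': 'ok', 'team-b': 'warning', 'team-c': 'exhausted'}
```

The same per-tenant figures feed showback directly: the report a team receives about its own consumption is the alerting input with a different audience.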

What I’d Do

I write two backup documents from day one, one for the Supervisor as rebuildable infrastructure and one for VKS workloads with Velero, and I never let a tool or an owner straddle the line. I schedule a real restore drill into a scratch namespace on a cadence, because an untested backup is a guess. For tenancy I default to a cluster per environment or major tenant with disciplined namespace isolation inside, reserving separate-cluster-per-tenant for the cases that genuinely warrant it and refusing the one-giant-shared-cluster anti-pattern. And I manage capacity as a leading indicator in VCF Operations (quota utilisation and host headroom) rather than discovering it at the moment an autoscaler stalls. Day-2 is unglamorous, and it is exactly where platforms quietly succeed or rot. Honest question: have you ever actually restored one of your VKS backups, or are you trusting that it would work?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
