Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Disaster Recovery and Multi-Tenancy for VMware Private AI: What to Protect and How to Share (Private AI Series, Part 29)

Most of your AI platform is reproducible; a small part is not. Here is a reference design for backing up the stateful pieces of VMware Private AI and sharing GPU clusters across teams without a free-for-all.

VMware Private AI Series · Part 29 of 30

TL;DR · Key Takeaways

  • Most of a Private AI platform is reproducible from config. Back up the small stateful core, not the GPUs.
  • The three things you cannot regenerate: the vector database, fine-tuned adapters and their datasets, and your namespace and custom resource (CR) definitions.
  • Base model weights are not backup material. They re-pull from NGC or a mirror. Treat them as cache.
  • Multi-tenancy on Private AI is vSphere namespaces plus resource quotas plus, ideally, NSX segmentation per tenant.
  • Without GPU quotas, the first team to deploy a greedy model starves everyone else. Quota is the whole game.

Disaster recovery for an AI platform trips people up because they reach for the same playbook they use for a database tier, and most of it does not apply. A serving cluster is largely stateless. If you lost every GPU host tonight and rebuilt from your manifests tomorrow, the NIM Operator would re-pull the models and your endpoints would come back. The hard part of DR here is not the bulk; it is the small set of things that genuinely cannot be regenerated. Get specific about what those are, and the rest of the design is easy.

Stateful or reproducible? Sort everything into two buckets

Walk the stack and put each component in one of two buckets: can I rebuild this from a manifest and a model pull, or is this the only copy of something? Everything in the first bucket needs a good infrastructure-as-code repo, not a backup job. Everything in the second bucket needs a real, tested backup.

[Figure: Two buckets, very different treatment]

Left column, Reproducible (IaC, not backup):
  • VKS clusters and worker nodes
  • GPU Operator, NIM Operator installs
  • Base model weights (re-pull from NGC)
  • NIMService and cache definitions
Lose it, redeploy from git. Minutes to hours.

Right column, Stateful (back this up):
  • Vector database (pgvector on DSM)
  • Fine-tuned adapters + training data
  • Namespace, quota and CR definitions
  • Guardrail configs and secrets
Lose it without a backup and it is gone.

The right-hand column is small. That is the good news: your real backup scope is narrow if you define it honestly.

Component                     Stateful?   Protection                     Target RPO
Vector database               Yes         DSM-managed Postgres backups   Hours (or re-embed)
Fine-tuned adapters + data    Yes         Object store, versioned        Per training run
Namespace + CR configs        Yes         Git (GitOps) + etcd backup     Commit-level
Base model weights            No          Re-pull from NGC or mirror     N/A
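
The two-bucket sort above can be written down as a tiny inventory check. This is only a sketch: the `Component` fields, the `protection` rule, and the flag values assigned to each component are illustrative assumptions based on the buckets described in this post, not output from any VMware or NVIDIA tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    rebuildable_from_git: bool  # can a manifest plus a model pull recreate it?
    only_copy: bool             # does it hold data that exists nowhere else?


def protection(component: Component) -> str:
    """Return 'backup' for the stateful bucket, 'iac' for the reproducible one."""
    if component.only_copy or not component.rebuildable_from_git:
        return "backup"
    return "iac"


# Hypothetical inventory; the flags mirror the buckets in the article.
INVENTORY = [
    Component("VKS cluster and worker nodes", True, False),
    Component("GPU Operator / NIM Operator install", True, False),
    Component("Base model weights", True, False),
    Component("Vector database", False, True),
    Component("Fine-tuned adapters + training data", False, True),
    Component("Guardrail configs and secrets", False, True),
]


def backup_scope(inventory):
    """Names of the components that need a real, tested backup."""
    return [c.name for c in inventory if protection(c) == "backup"]


if __name__ == "__main__":
    for name in backup_scope(INVENTORY):
        print(name)
```

Running the sort against your own inventory is a quick way to see how narrow the backup scope is before you size any backup jobs.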

A practical DR topology

You do not need a hot standby GPU cluster sitting idle; that is an expensive way to protect a mostly reproducible platform. The pragmatic pattern is a warm config and stateful-data replica: keep your manifests in git, replicate the vector database and adapter store to a second site, and be able to stand up serving on recovery GPUs when needed. The recovery time is dominated by how fast you can get GPUs and re-pull models, which is exactly why a local model mirror at the recovery site pays for itself. The vector database is the piece most worth replicating, because re-embedding a large corpus from scratch can take longer than the rest of the recovery combined.

[Figure: Warm config, replicated state]
Primary site: serving cluster (active), vector DB (primary), adapter / model store, model mirror.
Recovery site: GPUs available on demand, vector DB (replica), adapter / model store (replica), model mirror (local).
State is replicated from the primary site to the recovery site; git drives both.

Replicate the stateful core and keep a local mirror. Recovery time is bounded by GPU availability and model pull, not data volume.
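
To make the recovery-time argument concrete, here is a back-of-the-envelope sketch. The model is an assumption of mine: serving returns once GPUs are provisioned and models are pulled, while the vector database restore runs in parallel. Every number in the example is hypothetical, so substitute your own measurements.

```python
def recovery_hours(gpu_provision_h, model_gb, pull_gb_per_h, db_restore_h):
    """Estimate hours until serving is back.

    Assumes models can only be pulled once GPU hosts exist, and that the
    vector database restore runs in parallel with that path.
    """
    serving_path_h = gpu_provision_h + model_gb / pull_gb_per_h
    return max(serving_path_h, db_restore_h)


def reembed_hours(num_chunks, chunks_per_second):
    """Hours needed to rebuild embeddings from scratch instead of restoring."""
    return num_chunks / chunks_per_second / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 2 h to obtain GPUs, 400 GB of models, 1 h DB restore.
    with_mirror = recovery_hours(2, 400, pull_gb_per_h=800, db_restore_h=1)
    without_mirror = recovery_hours(2, 400, pull_gb_per_h=100, db_restore_h=1)
    print(f"local mirror:  {with_mirror:.1f} h")
    print(f"remote pull:   {without_mirror:.1f} h")
    # Hypothetical corpus: 50 million chunks at 500 chunks per second.
    print(f"re-embed only: {reembed_hours(50_000_000, 500):.1f} h")
```

With these made-up inputs the mirror shortens the serving path, and re-embedding takes far longer than either recovery path, which is the case for replicating the vector database rather than rebuilding it.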

Multi-tenancy: quota is the whole game

When several teams share a GPU cluster, the first failure mode is not a security breach; it is starvation. One team deploys a model that grabs four GPUs with generous autoscaling, and everyone else is suddenly pending. vSphere namespaces give you the isolation boundary and the self-service catalog gives you the provisioning path, but resource quotas on GPUs are what actually keep the peace. Set a hard GPU ceiling per namespace, and combine it with NSX segmentation if tenants must not see each other’s traffic.

[Figure: One cluster, fenced tenants]
Shared GPU cluster (16 GPUs): Team A namespace, quota 6 GPUs; Team B namespace, quota 6 GPUs; shared / burst pool, 4 GPUs, first come.
Reserve per-tenant quota, keep a small shared burst pool, and meter usage for chargeback.

Hard GPU quotas per namespace stop one tenant starving the rest. A small burst pool keeps utilization high.
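
The quota-plus-burst-pool behaviour can be simulated in a few lines. This is a toy model of the admission logic, not how vSphere or Kubernetes enforces quotas internally. The `GpuPool` class, the namespace names, and the all-or-nothing admission rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    quotas: dict       # namespace -> reserved GPU ceiling
    burst_total: int   # shared first-come pool
    used: dict = field(default_factory=dict)        # GPUs taken from own quota
    burst_used: dict = field(default_factory=dict)  # GPUs taken from burst pool

    def burst_free(self) -> int:
        return self.burst_total - sum(self.burst_used.values())

    def request(self, namespace: str, gpus: int) -> bool:
        """Admit all-or-nothing: reserved quota first, then the burst pool."""
        reserved_free = self.quotas.get(namespace, 0) - self.used.get(namespace, 0)
        from_reserved = min(gpus, reserved_free)
        from_burst = gpus - from_reserved
        if from_burst > self.burst_free():
            return False  # request stays pending; other reservations untouched
        self.used[namespace] = self.used.get(namespace, 0) + from_reserved
        self.burst_used[namespace] = self.burst_used.get(namespace, 0) + from_burst
        return True


if __name__ == "__main__":
    # The 16-GPU layout from the figure: 6 + 6 reserved, 4 shared.
    pool = GpuPool(quotas={"team-a": 6, "team-b": 6}, burst_total=4)
    print(pool.request("team-a", 8))  # 6 reserved + 2 burst
    print(pool.request("team-a", 4))  # needs 4 burst, only 2 left
    print(pool.request("team-b", 6))  # team B's reservation is still intact
```

The point of the simulation is the last line: however greedy Team A gets, Team B's six reserved GPUs remain available.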

For chargeback, meter GPU-hours per namespace from the same DCGM data your monitoring already collects and attribute it to the owning team. You do not need a billing platform on day one; a monthly report of GPU-hours by namespace changes behavior all by itself. The moment teams see their consumption, the speculative always-on deployments quietly disappear.
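
A GPU-hours report needs very little code once you have periodic samples. The sketch below assumes you have already exported samples as `(namespace, gpus_allocated)` pairs taken at a fixed interval. That sample format is hypothetical, and how you derive it from your DCGM metrics depends on the labels your exporter attaches.

```python
from collections import defaultdict


def gpu_hours_by_namespace(samples, interval_seconds):
    """Sum GPU-hours per namespace.

    samples: iterable of (namespace, gpus_allocated) pairs, one pair per
    namespace per sampling interval (hypothetical export format).
    """
    totals = defaultdict(float)
    for namespace, gpus in samples:
        totals[namespace] += gpus * interval_seconds / 3600
    return dict(totals)


def format_report(totals):
    """Render a plain-text report, heaviest consumer first."""
    rows = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"{namespace:<12} {hours:>8.1f} GPU-h" for namespace, hours in rows)


if __name__ == "__main__":
    # Made-up samples at a 30-minute interval.
    samples = [("team-a", 4), ("team-a", 4), ("team-b", 2), ("team-b", 0)]
    print(format_report(gpu_hours_by_namespace(samples, interval_seconds=1800)))
```

Sorting by consumption puts the always-on deployments at the top of the report, which is usually all the governance the first month needs.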

Disclaimer: test your restore, not just your backup. Periodically rebuild a namespace from git and restore the vector database into it, confirm the embeddings and model versions match, and validate that quota and segmentation policies reapply cleanly. A backup you have never restored is a hypothesis, not a recovery plan.
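
A restore drill is easier to repeat if the pass or fail check is automated. The sketch below compares a handful of recorded facts about the primary against what the restored namespace reports. The keys (`embedding_count`, `embedding_model`, `gpu_quota`) and the example values are hypothetical; how you collect them from your database and cluster is up to you.

```python
CHECKED_KEYS = ("embedding_count", "embedding_model", "gpu_quota")


def verify_restore(expected, restored, keys=CHECKED_KEYS):
    """Return a list of mismatches; an empty list means the drill passed."""
    problems = []
    for key in keys:
        if expected.get(key) != restored.get(key):
            problems.append(
                f"{key}: expected {expected.get(key)!r}, got {restored.get(key)!r}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical values recorded at backup time vs. observed after restore.
    expected = {"embedding_count": 1_200_000, "embedding_model": "embed-v2", "gpu_quota": 6}
    restored = {"embedding_count": 1_200_000, "embedding_model": "embed-v1", "gpu_quota": 6}
    for problem in verify_restore(expected, restored):
        print("MISMATCH", problem)
```

A mismatch on the embedding model is the one to watch for, since vectors produced by a different model version will not match the queries your serving tier generates.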

What I would do

Run your whole platform from git so the reproducible bucket is genuinely reproducible, then put real backup effort into only three things: the vector database, the adapter and dataset store, and the cluster state in etcd. Skip backing up base model weights entirely and keep a mirror instead. On the tenancy side, never hand a team a shared cluster without a GPU quota, because the alternative is a support ticket the first busy afternoon. Add a monthly GPU-hours-by-namespace report early; it is the cheapest governance you will ever deploy. The honest verdict: DR for Private AI is easier than for a traditional app tier if you are disciplined about IaC, and a nightmare if you are not, because then everything is a special snowflake you cannot rebuild.

Have you actually restored your vector database from backup, or only assumed you could? That is the test worth running this week.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.