TL;DR · Key Takeaways
- Most of a Private AI platform is reproducible from config. Back up the small stateful core, not the GPUs.
- The three things you cannot regenerate: the vector database, fine-tuned adapters and their datasets, and your namespace and CR definitions.
- Base model weights are not backup material. They re-pull from NGC or a mirror. Treat them as cache.
- Multi-tenancy on Private AI is vSphere namespaces plus resource quotas plus, ideally, NSX segmentation per tenant.
- Without GPU quotas, the first team to deploy a greedy model starves everyone else. Quota is the whole game.
Disaster recovery for an AI platform trips people up because they reach for the same playbook they use for a database tier, and most of it does not apply. A serving cluster is largely stateless. If you lost every GPU host tonight and rebuilt from your manifests tomorrow, the NIM Operator would re-pull the models and your endpoints would come back. The hard part of DR here is not the bulk; it is the small set of things that genuinely cannot be regenerated. Get specific about what those are, and the rest of the design is easy.
Stateful or reproducible? Sort everything into two buckets
Walk the stack and put each component in one of two buckets: can I rebuild this from a manifest and a model pull, or is this the only copy of something? Everything in the first bucket needs a good infrastructure-as-code repo, not a backup job. Everything in the second bucket needs a real, tested backup.
| Component | Stateful? | Protection | Target RPO |
|---|---|---|---|
| Vector database | Yes | DSM-managed Postgres backups | Hours (or re-embed) |
| Fine-tuned adapters + data | Yes | Object store, versioned | Per training run |
| Namespace + CR configs | Yes | Git (GitOps) + etcd backup | Commit-level |
| Base model weights | No | Re-pull from NGC or mirror | N/A |
A practical DR topology
You do not need a hot standby GPU cluster sitting idle. That is an expensive way to protect a mostly reproducible platform. The pragmatic pattern is a warm replica of config and stateful data: keep your manifests in git, replicate the vector database and adapter store to a second site, and be ready to stand up serving on recovery GPUs when needed. Recovery time is dominated by how fast you can get GPUs and re-pull models, which is exactly why a local model mirror at the recovery site pays for itself. The vector database is the piece most worth replicating, because re-embedding a large corpus from scratch can take longer than the rest of the recovery combined.
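To make the recovery-time argument concrete, here is a back-of-envelope sketch. Every name and number in it is a hypothetical input, not a measurement, and it assumes GPU provisioning and the model pull happen in sequence while the vector database restore runs in parallel.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical inputs for a rough recovery-time estimate."""
    model_size_gb: float         # total base model weights to re-pull, in gigabytes
    pull_gbps: float             # effective throughput from the model source, in Gbit/s
    gpu_provision_hours: float   # time to get recovery GPU hosts ready
    vector_restore_hours: float  # time to restore (or re-embed) the vector database


def model_pull_hours(size_gb: float, gbps: float) -> float:
    """Hours needed to transfer size_gb gigabytes at gbps gigabits per second."""
    seconds = (size_gb * 8) / gbps
    return seconds / 3600


def recovery_hours(plan: RecoveryPlan) -> float:
    """Serving path (provision, then pull) versus the parallel data restore.

    The slower of the two paths sets the overall recovery time.
    """
    serving = plan.gpu_provision_hours + model_pull_hours(
        plan.model_size_gb, plan.pull_gbps
    )
    return max(serving, plan.vector_restore_hours)


# Illustrative scenarios only: same platform, different model source and data strategy.
over_wan = RecoveryPlan(900, 0.5, 2.0, 1.0)       # re-pull over a slow WAN link
local_mirror = RecoveryPlan(900, 10.0, 2.0, 1.0)  # re-pull from a mirror on site
re_embed = RecoveryPlan(900, 10.0, 2.0, 30.0)     # no replica, re-embed the corpus
```

With these made-up inputs the WAN pull lands at about 6 hours and the local mirror at about 2.2 hours, while the re-embed scenario is bounded by the 30 hours of embedding regardless of how fast the models arrive. The point is the shape of the result, not the figures: plug in your own sizes and throughput.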
Multi-tenancy: quota is the whole game
When several teams share a GPU cluster, the first failure mode is not a security breach; it is starvation. One team deploys a model that grabs four GPUs with generous autoscaling, and everyone else's workloads are suddenly stuck pending. vSphere namespaces give you the isolation boundary and the self-service catalog gives you the provisioning path, but resource quotas on GPUs are what actually keep the peace. Set a hard GPU ceiling per namespace, and combine it with NSX segmentation if tenants must not see each other’s traffic.
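As one way to express a hard GPU ceiling, the sketch below builds a standard Kubernetes ResourceQuota object. It assumes GPUs are exposed under the `nvidia.com/gpu` resource name (the name used by the NVIDIA device plugin; MIG or other configurations may differ) and that you can apply quotas at the Kubernetes level. On a vSphere Supervisor the namespace limits may instead be set through vCenter. The function name, the quota name `gpu-ceiling`, and the namespace `team-a` are hypothetical.

```python
import json


def gpu_quota_manifest(namespace: str, max_gpus: int) -> dict:
    """Build a ResourceQuota that caps the total GPUs requested in a namespace."""
    if max_gpus < 0:
        raise ValueError("max_gpus must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-ceiling", "namespace": namespace},
        "spec": {
            "hard": {
                # Kubernetes quotas extended resources via the "requests." prefix.
                # The resource name nvidia.com/gpu is an assumption about your cluster.
                "requests.nvidia.com/gpu": str(max_gpus),
            }
        },
    }


def render(namespace: str, max_gpus: int) -> str:
    """Serialize the manifest as JSON, which kubectl accepts as well as YAML."""
    return json.dumps(gpu_quota_manifest(namespace, max_gpus), indent=2)
```

Calling `render("team-a", 4)` produces a document you could review in git and apply like any other manifest. Once the quota is in place, a fifth GPU request in that namespace is rejected at admission instead of competing with other tenants for capacity.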
For chargeback, meter GPU-hours per namespace from the same DCGM data your monitoring already collects and attribute it to the owning team. You do not need a billing platform on day one. A monthly report of GPU-hours by namespace changes behavior all by itself: the moment teams see their consumption, the speculative always-on deployments quietly disappear.
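A monthly report like that can start as a few lines of aggregation. The sketch below assumes your monitoring data has already been reduced to per-namespace samples of allocated GPUs. The `GpuSample` shape and its field names are hypothetical, and how you derive them from DCGM metrics depends on your exporter and labels. It meters allocation rather than utilization, which is a design choice: a team holding idle GPUs still pays for them.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class GpuSample(NamedTuple):
    """One scrape: GPUs a namespace held during one interval (hypothetical shape)."""
    namespace: str
    gpus_allocated: int
    interval_seconds: float  # time this sample represents, e.g. the scrape interval


def gpu_hours_by_namespace(samples: Iterable[GpuSample]) -> dict[str, float]:
    """Sum allocated GPU-seconds per namespace and convert to GPU-hours."""
    totals: dict[str, float] = defaultdict(float)
    for s in samples:
        totals[s.namespace] += s.gpus_allocated * s.interval_seconds
    return {ns: seconds / 3600 for ns, seconds in totals.items()}


def format_report(usage: dict[str, float]) -> str:
    """Render a plain-text report, largest consumer first."""
    rows = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"{ns}: {hours:.1f} GPU-hours" for ns, hours in rows)
```

Feeding it a month of samples and mailing the output of `format_report` to team leads is enough to start the conversation; a proper billing integration can come later.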
What I would do
Run your whole platform from git so the reproducible bucket is genuinely reproducible, then put real backup effort into only three things: the vector database, the adapter and dataset store, and the cluster state in etcd that holds your namespace and CR definitions. Skip backing up base model weights entirely and keep a mirror instead. On the tenancy side, never hand a team a shared cluster without a GPU quota, because the alternative is a support ticket on the first busy afternoon. Add a monthly GPU-hours-by-namespace report early; it is the cheapest governance you will ever deploy. The honest verdict: DR for Private AI is easier than for a traditional app tier if you are disciplined about IaC, and a nightmare if you are not, because then everything is a special snowflake you cannot rebuild.
Have you actually restored your vector database from backup, or only assumed you could? That is the test worth running this week.