TL;DR · Key Takeaways
- The retrieval tier is the part of a Private AI deployment that quietly decides whether your RAG answers are good. Treat the vector database as infrastructure, not an afterthought wired up next to the model.
- Data Services Manager 9.1 gives you enterprise PostgreSQL with the pgvector extension, HA, backups, and FIPS 140-2 compliance, managed from the VCF console. That is the supported path, and for most teams it beats hand-rolling a standalone vector store.
- Place the database nodes on a separate vSAN ESA cluster from your GPU hosts to avoid noisy-neighbor contention. Run PostgreSQL in 3-node or 5-node HA for production.
- Index choice is the lever that matters: HNSW for recall and incremental writes, IVFFlat for very large bulk-loaded corpora with tight build budgets. Default to HNSW.
- Size for the index, not just the rows. An HNSW index can be roughly 2 to 3 times the size of the raw vectors, and it wants to live in RAM.
Most RAG demos work. The corpus is a few hundred PDFs, the embeddings fit in memory, and similarity search returns something plausible. Then the same design meets a real document estate of a few million chunks, the index no longer fits in RAM, recall quietly drops, and the chatbot starts confidently citing the wrong policy. The model did not get worse. The retrieval tier did, and nobody designed it on purpose. This post is about designing it on purpose: where the vector database sits in VMware Private AI Foundation, how to place and size it, and the one decision (the index type) that most teams get wrong.
Where pgvector fits in the Private AI stack
The vector database is not the AI. It is the memory the AI reads from. In a Private AI Foundation RAG pipeline the flow is straightforward: a user query gets turned into an embedding by an embedding model, that vector is compared against the stored embeddings of your documents, the closest chunks come back as context, and only then does the language model (served by a NIM microservice or a model endpoint from the Model Store) generate an answer. The vector search step is the gate. If it returns the wrong chunks, no amount of GPU horsepower downstream will fix the answer.
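To make the gate concrete, here is a minimal sketch of the retrieval step in plain Python. It is a toy, not how pgvector works internally: the three-dimensional embeddings, the chunk names, and the `retrieve` helper are all made up for illustration, and a real deployment would run the comparison inside PostgreSQL against an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    """Return the ids of the k stored chunks most similar to query_vec."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Hypothetical embeddings; real ones come from the embedding model.
STORE = {
    "vpn-policy": [0.9, 0.1, 0.0],
    "leave-policy": [0.1, 0.9, 0.1],
    "expense-policy": [0.0, 0.2, 0.9],
}

query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedded user query
context_ids = retrieve(query_vec, STORE, k=2)
print(context_ids)
```

Whatever `retrieve` returns is all the language model gets to see, which is why a bad ranking here cannot be repaired downstream.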
On Private AI Foundation, the supported vector store is PostgreSQL with the pgvector extension, provisioned and managed by Data Services Manager. Broadcom documents this directly in the Private AI Ready Infrastructure validated design: the pgvector extension in DSM-deployed PostgreSQL is what enables RAG for use cases like code generation, contact-center resolution, and IT automation. You can run a different vector engine if you insist, but you give up the integration story the platform was built around.
Why DSM-managed PostgreSQL beats a hand-rolled vector store
There is a strong pull, especially from data-science teams, to just docker run a popular standalone vector database on a VM and move on. It works in a lab. It becomes your problem in production, because now you own backups, HA, patching, certificate rotation, and capacity for a stateful service that the rest of the platform does not know about.
Data Services Manager 9.1 inverts that. You get full enterprise PostgreSQL with native pgvector, provisioned in under 15 minutes, with automated HA, continuous backups, point-in-time recovery, and FIPS 140-2 compliance, all driven from the VCF console and the same VCF Automation, Terraform, and GitOps patterns you already use for VMs and Kubernetes. The data never leaves infrastructure you control, which is the entire reason a regulated customer chose Private AI over a public endpoint in the first place.
My take
For a single throwaway proof of concept with a few thousand chunks, a standalone vector container is fine and faster to stand up. For anything that will carry production data, regulated content, or more than one application, use DSM-managed PostgreSQL. The break-even point arrives the first time someone asks who is backing up the embeddings, and on a hand-rolled box the answer is usually nobody.
One honest caveat: pgvector inside PostgreSQL is not the fastest vector engine in a synthetic benchmark, and at extreme scale (hundreds of millions of vectors with very high query concurrency) a purpose-built engine can pull ahead. Most enterprise RAG corpora are nowhere near that scale, and the operational integration wins. Validate your own corpus size against the sizing section below before you assume you are the exception.
The reference topology
The placement guidance in the validated design is specific, and it matters. The DSM appliance runs in the management domain (there is a 1:1 relationship between a DSM appliance and a vCenter Server instance). The database VMs themselves should run on a separate vSAN cluster from the GPU hosts, ideally inside the same VI workload domain. The reason is contention. Your GPU cluster is busy with embedding and inference; you do not want a vector index rebuild competing with that for the same storage and CPU. Separate clusters keep the noisy-neighbor problem off your latency-sensitive retrieval path.
For production, run PostgreSQL in HA mode with 3 or 5 nodes. Plan IP addressing carefully, because HA topologies consume more addresses than people expect. The validated design spells it out: a 5-node PostgreSQL cluster needs 7 IP addresses (one per node, one for the kube_VIP, and one for database load balancing). For storage, put the databases on vSAN ESA with RAID 5 or RAID 6 erasure coding (RAID 5 needs a minimum of 4 hosts at FTT=1, RAID 6 a minimum of 6 hosts at FTT=2), which gives you RAID-1-like performance with far better space efficiency. Backups go to an S3-compatible object store such as MinIO, configured with TLS.
| Design dimension | Validated-design guidance | Why it matters |
|---|---|---|
| DSM appliance placement | Management domain, 1:1 with each vCenter | Adds resource load to the mgmt domain; plan for one appliance per WLD vCenter |
| Database cluster placement | Separate vSAN cluster from GPU hosts, same VI WLD | Removes noisy-neighbor contention from the retrieval path |
| HA topology | 3-node or 5-node PostgreSQL for production | A single database node is a single point of failure for the whole RAG system |
| IP planning | 5-node cluster = 7 IPs (nodes + kube_VIP + LB) | Undersized IP pools stall provisioning of the infrastructure policy |
| Storage policy | vSAN ESA, RAID 5 (4+ hosts) or RAID 6 (6+ hosts) | Performance close to RAID 1 with better usable capacity |
| Backup target | S3-compatible object store (e.g. MinIO) with TLS | Embeddings are expensive to recompute; protect them like any other data |
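The IP arithmetic is easy to get wrong in a planning spreadsheet, so here is a small sketch of it. The 5-node figure (7 addresses) comes from the validated design; applying the same "nodes plus kube_VIP plus load balancer" rule to other node counts is my assumption, and the function names are hypothetical.

```python
def ha_ip_count(nodes: int) -> int:
    """IPs for one PostgreSQL HA cluster: one per node, plus kube_VIP, plus DB load balancer.

    Only the 5-node case (7 IPs) is stated in the validated design; other
    node counts follow the same rule by assumption.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes + 2

def pool_is_large_enough(pool_size: int, clusters: list[int]) -> bool:
    """True if an IP pool covers every planned cluster (given as node counts)."""
    return pool_size >= sum(ha_ip_count(n) for n in clusters)

print(ha_ip_count(5))                    # the validated-design example
print(pool_is_large_enough(10, [5, 3]))  # one 5-node and one 3-node cluster
```

Under that assumption, a 5-node and a 3-node cluster together need 12 addresses, so a 10-address pool would stall the second deployment.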
HNSW or IVFFlat: choosing the index
This is the decision that separates a vector store that works from one that quietly disappoints. pgvector ships two approximate-nearest-neighbor index types, and they behave very differently. HNSW (a graph index) gives excellent recall and high query throughput, handles incremental inserts gracefully, but is slow to build and memory-hungry. IVFFlat (a clustered/inverted-list index) builds fast and is compact, but its recall is lower and, importantly, it does not adapt as data changes: the cluster centroids are computed once and go stale silently as you add or remove vectors.
| Dimension | HNSW | IVFFlat |
|---|---|---|
| Recall at high QPS | Very high (e.g. 0.998 recall sustained) | Lower at the same throughput |
| Query throughput | Best (an order of magnitude faster in published benchmarks) | Modest |
| Build time | Slow (can be ~30x slower to build) | Fast |
| Index size | Large (roughly 2-3x the raw vectors) | Compact |
| Incremental writes | Handles them well | Centroids go stale; needs periodic rebuild |
| Key knobs | m=16, ef_construction=200; tune ef_search at query time | lists ~= rows/1000 up to 1M rows, sqrt(rows) above; tune probes |
The verdict: default to HNSW. Most enterprise RAG corpora are well under a million chunks, documents trickle in continuously rather than arriving in one giant bulk load, and recall is the thing users actually feel. HNSW wins on all three. Reach for IVFFlat only when you have many millions of vectors, you can bulk-load before building, and the build time or memory footprint of HNSW genuinely rules it out. If you do choose IVFFlat, put a scheduled index rebuild on the calendar, because nothing in the system will warn you when stale centroids start eroding recall.
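For reference, this is roughly what the two index definitions look like. The sketch below only builds the SQL strings; it does not connect to a database. The table name `chunks` and column `embedding` are hypothetical, the HNSW parameters are the ones from the comparison table, and the `lists` starting point follows the pgvector README suggestion (rows/1000 up to 1M rows, sqrt(rows) beyond that). Treat all of it as a starting point to benchmark, not a recommendation.

```python
import math

# pgvector operator classes, keyed by distance metric.
OPCLASS = {
    "l2": "vector_l2_ops",
    "cosine": "vector_cosine_ops",
    "ip": "vector_ip_ops",
}

def ivfflat_lists(rows: int) -> int:
    """Starting value for IVFFlat `lists`, per the pgvector README heuristic."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return math.isqrt(rows)

def hnsw_index_sql(table, column, metric="cosine", m=16, ef_construction=200):
    """CREATE INDEX statement for an HNSW index."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {OPCLASS[metric]}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

def ivfflat_index_sql(table, column, rows, metric="cosine"):
    """CREATE INDEX statement for an IVFFlat index sized for `rows` vectors."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat ({column} {OPCLASS[metric]}) "
        f"WITH (lists = {ivfflat_lists(rows)});"
    )

print(hnsw_index_sql("chunks", "embedding"))
print(ivfflat_index_sql("chunks", "embedding", rows=4_000_000))
```

Note that the operator class is part of the index definition, so the metric you pick here has to be the one your queries use.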
Sizing the vector store
The mistake here is sizing for rows and forgetting the index. Start from the embedding model, because the embedding dimension drives everything. A single vector stored as 32-bit floats is dimensions times 4 bytes: a 1024-dimension embedding is about 4 KB of raw vector per chunk before any overhead. Multiply by chunk count for the table, then add the index on top. With HNSW that index can be 2 to 3 times the size of the raw vectors, and for fast queries you want the working set resident in RAM, not paged off disk.
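The arithmetic above fits in a few lines. This is a rough estimator only: it uses the 4 bytes per dimension and the 2 to 3 times index multiplier from this section, and it ignores per-row overhead, chunk text, metadata columns, and PostgreSQL's own working memory. The function name and the 1M-chunk example are mine.

```python
def vector_store_bytes(chunks: int, dims: int, index_factor: float = 3.0) -> dict:
    """Rough footprint of raw float32 vectors plus an HNSW index.

    index_factor is the index size as a multiple of the raw vectors
    (roughly 2 to 3 for HNSW, per the text above).
    """
    raw = chunks * dims * 4      # 32-bit floats: 4 bytes per dimension
    index = raw * index_factor   # index sits on top of the table data
    return {"raw": raw, "index": index, "total": raw + index}

estimate = vector_store_bytes(chunks=1_000_000, dims=1024)
for part, size in estimate.items():
    print(f"{part:>5}: {size / 1e9:.1f} GB")
```

For one million 1024-dimension chunks that works out to about 4.1 GB of raw vectors and about 12.3 GB of index at the 3x end of the range, which is the number the VM class has to hold in memory.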
Define DSM VM classes that align to this. The validated design is explicit that you should size VM classes against use case, workload type, data volume, and transactions per second, and that applies doubly to a vector workload where the index wants memory. The table below is a planning starting point, not a guarantee; benchmark your own embedding model and query mix before you commit.
| Corpus scale (chunks) | Index | HA topology | Memory posture |
|---|---|---|---|
| Up to ~100k (PoC, single app) | HNSW | 3-node | Small VM class; index fits comfortably in RAM |
| ~100k to 1M (department RAG) | HNSW | 3-node | Size RAM for raw vectors x ~3 plus PG overhead |
| 1M to ~10M (enterprise corpus) | HNSW, watch build time | 5-node | Memory-optimized VM class; consider lower-dim embeddings |
| Tens of millions+, bulk-loaded | IVFFlat (or partition) | 5-node | Trade recall for build/memory; schedule rebuilds |
What actually bites in production
A few patterns show up again and again on retrieval-tier work, and none of them are in the quickstart. First, dimension mismatch between the embedding model and the column. The most common bring-up failure is embedding documents with one model, then querying with a different model or a different version, so the dimensions or the vector space no longer line up and similarity scores turn to noise. Pin the embedding model version the same way you pin everything else, and store it as metadata next to the vectors.
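One way to enforce the pin is a guard in the query path that compares the model recorded alongside the stored vectors with the model that produced the query vector. This is a sketch under assumptions: the `EmbeddingSpec` record, the model names, and the `check_query_vector` helper are all hypothetical, and how you persist the spec next to the vectors is up to your schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """What produced a set of vectors: model name, pinned version, dimension."""
    model: str
    version: str
    dims: int

def check_query_vector(stored: EmbeddingSpec, query: EmbeddingSpec, query_vec) -> None:
    """Raise before searching if the query vector cannot be compared to the corpus."""
    if query != stored:
        raise ValueError(f"embedding spec mismatch: corpus={stored}, query={query}")
    if len(query_vec) != stored.dims:
        raise ValueError(f"expected {stored.dims} dimensions, got {len(query_vec)}")

# Hypothetical model names and versions.
corpus_spec = EmbeddingSpec(model="example-embedder", version="1.2.0", dims=1024)
same_spec = EmbeddingSpec(model="example-embedder", version="1.2.0", dims=1024)
newer_spec = EmbeddingSpec(model="example-embedder", version="2.0.0", dims=1024)

check_query_vector(corpus_spec, same_spec, [0.0] * 1024)  # passes silently
try:
    check_query_vector(corpus_spec, newer_spec, [0.0] * 1024)
except ValueError as err:
    print(err)
```

The second call is the dangerous case: the dimensions still match, so nothing else would complain, but the vectors live in a different space.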
Second, the distance operator has to match how the embeddings were trained. Cosine distance, L2, and inner product are not interchangeable, and using the wrong one produces results that look almost right, which is worse than obviously broken. Third, teams forget that re-embedding the whole corpus after a model change is an expensive, GPU-bound batch job, not a quick migration. Plan for it, and back up the embeddings so you are not forced into it. Fourth, ef_search (HNSW) and probes (IVFFlat) are query-time knobs you should actually tune against your recall target rather than leaving at defaults; the default ef_search of 40 is conservative for many workloads.
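A toy example shows why the operators are not interchangeable and how to measure recall when tuning. The vectors below are contrived so that each metric picks a different nearest neighbor; the comments note the matching pgvector operators (`<->`, `<=>`, `<#>`). The `recall_at_k` helper is a plain set overlap and is my own sketch: you would feed it the ids from an indexed query at a given `hnsw.ef_search` or `ivfflat.probes` setting and the ids from an exact scan over the same queries.

```python
import math

def l2_distance(a, b):             # pgvector: <->
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):         # pgvector: <=>
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def negative_inner_product(a, b):  # pgvector: <#>
    return -sum(x * y for x, y in zip(a, b))

def nearest(query, store, distance):
    """Id of the stored vector with the smallest distance to the query."""
    return min(store, key=lambda chunk_id: distance(query, store[chunk_id]))

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Contrived, unnormalized vectors.
STORE = {
    "near-point": [1.0, 0.0],
    "same-direction": [5.0, 1.0],
    "large-magnitude": [6.0, 6.0],
}
query = [1.0, 0.2]

for name, fn in [("l2", l2_distance), ("cosine", cosine_distance),
                 ("inner product", negative_inner_product)]:
    print(f"{name:>13}: {nearest(query, STORE, fn)}")
```

With normalized embeddings the three metrics rank results the same way, which is exactly why a mismatch on unnormalized vectors is easy to miss in a quick test.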
What I’d Do
Provision the vector store as DSM-managed PostgreSQL with pgvector, in a 3-node HA cluster on a dedicated vSAN ESA cluster, separate from the GPU hosts. Index with HNSW unless your corpus genuinely runs to tens of millions of bulk-loaded vectors. Size the VM class for the index plus PostgreSQL working memory, not just the row count, and keep the working set in RAM. Pin the embedding model and store its version with the data. Then back up the embeddings, because the day you need them recomputed is the day you will not have GPUs to spare. What embedding dimension is your team standardizing on, and have you sized RAM for the index it implies?
References
- Broadcom TechDocs: VMware Data Services Manager Design for Private AI Ready Infrastructure
- VMware Cloud Foundation Blog: Data Services Manager 9.1
- pgvector: open-source vector similarity search for PostgreSQL








