Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Vector Databases in VMware Private AI: Running pgvector on Data Services Manager (Private AI Series, Part 13)

A reference-architecture look at the retrieval tier of VMware Private AI: where DSM-managed PostgreSQL with pgvector sits, how to place and size it, and whether to index with HNSW or IVFFlat.

VMware Private AI Series · Part 13 of 24

TL;DR · Key Takeaways

  • The retrieval tier is the part of a Private AI deployment that quietly decides whether your RAG answers are good. Treat the vector database as infrastructure, not an afterthought wired up next to the model.
  • Data Services Manager 9.1 gives you enterprise PostgreSQL with the pgvector extension, HA, backups, and FIPS 140-2 compliance, managed from the VCF console. That is the supported path, and for most teams it beats hand-rolling a standalone vector store.
  • Place the database nodes on a separate vSAN ESA cluster from your GPU hosts to avoid noisy-neighbor contention. Run PostgreSQL in 3-node or 5-node HA for production.
  • Index choice is the lever that matters: HNSW for recall and incremental writes, IVFFlat for very large bulk-loaded corpora with tight build budgets. The default is HNSW.
  • Size for the index, not just the rows. An HNSW index can be roughly 2 to 3 times the size of the raw vectors, and it wants to live in RAM.

Who this is for: VCF architects and platform teams designing the data tier for a Private AI Foundation RAG deployment.

Prerequisites: a working PAIF 9.0 or 9.1 environment with a GPU workload domain, and DSM 9.1 licensed as a VCF Advanced Service.

Most RAG demos work. The corpus is a few hundred PDFs, the embeddings fit in memory, and similarity search returns something plausible. Then the same design meets a real document estate of a few million chunks, the index no longer fits in RAM, recall quietly drops, and the chatbot starts confidently citing the wrong policy. The model did not get worse. The retrieval tier did, and nobody designed it on purpose. This post is about designing it on purpose: where the vector database sits in VMware Private AI Foundation, how to place and size it, and the one decision (the index type) that most teams get wrong.

Where pgvector fits in the Private AI stack

The vector database is not the AI. It is the memory the AI reads from. In a Private AI Foundation RAG pipeline the flow is straightforward: a user query gets turned into an embedding by an embedding model, that vector is compared against the stored embeddings of your documents, the closest chunks come back as context, and only then does the language model (served by a NIM microservice or a model endpoint from the Model Store) generate an answer. The vector search step is the gate. If it returns the wrong chunks, no amount of GPU horsepower downstream will fix the answer.

[Figure: The retrieval tier is the gate. Where pgvector sits in a Private AI RAG request: 1. Query (user question); 2. Embed (embedding NIM); 3. Vector search (pgvector on DSM PostgreSQL); 4. Context (top-k chunks); 5. Answer (LLM NIM). Annotation: "Document chunks are embedded and indexed here."]
The model endpoint gets the credit, but step 3 decides whether the answer is right.
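
To make the gate concrete, here is a minimal sketch of steps 3 and 4 in plain Python. It is an illustration only: the three-dimensional vectors, the `STORE` list, and the helper names are hypothetical stand-ins, and in a real deployment the search runs inside PostgreSQL through pgvector, not in application code.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k_chunks(query_vec, store, k=2):
    """Step 3: return the text of the k stored chunks closest to the query vector."""
    ranked = sorted(store, key=lambda row: cosine_distance(query_vec, row["embedding"]))
    return [row["text"] for row in ranked[:k]]

def build_prompt(question, chunks):
    """Step 4: assemble the retrieved chunks into the context handed to the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical toy store; real embeddings come from the embedding model (step 2).
STORE = [
    {"text": "Leave policy: 25 days per year", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Travel policy: economy class only", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Leave carry-over: max 5 days", "embedding": [0.8, 0.2, 0.1]},
]

query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded user question
chunks = top_k_chunks(query_vec, STORE, k=2)
prompt = build_prompt("How many leave days do I get?", chunks)
print(prompt)
```

Whatever lands in `chunks` is all the language model ever sees, which is why a retrieval miss cannot be repaired downstream.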

On Private AI Foundation, the supported vector store is PostgreSQL with the pgvector extension, provisioned and managed by Data Services Manager. Broadcom documents this directly in the Private AI Ready Infrastructure validated design: the pgvector extension in DSM-deployed PostgreSQL is what enables RAG for use cases like code generation, contact-center resolution, and IT automation. You can run a different vector engine if you insist, but you give up the integration story the platform was built around.

Why DSM-managed PostgreSQL beats a hand-rolled vector store

There is a strong pull, especially from data-science teams, to just docker run a popular standalone vector database on a VM and move on. It works in a lab. It becomes your problem in production, because now you own backups, HA, patching, certificate rotation, and capacity for a stateful service that the rest of the platform does not know about.

Data Services Manager 9.1 inverts that. You get full enterprise PostgreSQL with native pgvector, provisioned in under 15 minutes, with automated HA, continuous backups, point-in-time recovery, and FIPS 140-2 compliance, all driven from the VCF console and the same VCF Automation, Terraform, and GitOps patterns you already use for VMs and Kubernetes. The data never leaves infrastructure you control, which is the entire reason a regulated customer chose Private AI over a public endpoint in the first place.

My take

For a single throwaway proof of concept with a few thousand chunks, a standalone vector container is fine and faster to stand up. For anything that will carry production data, regulated content, or more than one application, use DSM-managed PostgreSQL. The break-even point arrives the first time someone asks who is backing up the embeddings, and on a hand-rolled box the answer is usually nobody.

One honest caveat: pgvector inside PostgreSQL is not the fastest vector engine in a synthetic benchmark, and at extreme scale (hundreds of millions of vectors with very high query concurrency) a purpose-built engine can pull ahead. Most enterprise RAG corpora are nowhere near that scale, and the operational integration wins. Validate your own corpus size against the sizing section below before you assume you are the exception.

The reference topology

The placement guidance in the validated design is specific, and it matters. The DSM appliance runs in the management domain (there is a 1:1 relationship between a DSM appliance and a vCenter Server instance). The database VMs themselves should run on a separate vSAN cluster from the GPU hosts, ideally inside the same VI workload domain. The reason is contention. Your GPU cluster is busy with embedding and inference; you do not want a vector index rebuild competing with that for the same storage and CPU. Separate clusters keep the noisy-neighbor problem off your latency-sensitive retrieval path.

[Figure: Reference placement. DSM in the management domain, databases on their own vSAN ESA cluster. Management domain: DSM appliance (control plane, 1:1 with vCenter) alongside the VI WLD vCenter; the DSM appliance provisions and manages the database nodes. VI workload domain: a GPU cluster (NIM / DL VMs, vGPU profiles, MIG; embedding + inference, latency-sensitive) and a separate vSAN ESA cluster (pg node 1, pg node 2, pg node 3; PostgreSQL + pgvector, HA, RAID 5/6 erasure coding).]
Keep the database I/O and index rebuilds off the cluster that is serving GPU inference.

For production, run PostgreSQL in HA mode with 3 or 5 nodes. Plan IP addressing carefully, because HA topologies consume more addresses than people expect. The validated design spells it out: a 5-node PostgreSQL cluster needs 7 IP addresses (one per node, one for the kube_VIP, and one for database load balancing). For storage, put the databases on vSAN ESA with RAID 5 or RAID 6 erasure coding (RAID 5 needs a minimum of 4 hosts at FTT=1, RAID 6 a minimum of 6 hosts at FTT=2). Erasure coding on ESA gives you RAID-1-like performance with far better space efficiency. Backups go to an S3-compatible object store such as MinIO, configured with TLS.
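
The arithmetic above is easy to turn into a pre-flight check. The sketch below is hypothetical: it assumes the "nodes + 2" pattern quoted for the 5-node case also holds for 3 nodes (an extrapolation, so confirm it against the validated design), and it uses the host minimums exactly as stated in this post. The function names are illustrative, not part of any DSM API.

```python
# Minimum vSAN hosts per erasure-coding level, as quoted in this post.
MIN_HOSTS = {"RAID5": 4, "RAID6": 6}

def required_ips(node_count):
    """IPs for an HA PostgreSQL cluster: one per node + kube_VIP + DB load balancer.

    The 5-node -> 7 figure comes from the post; applying the same formula
    to 3 nodes is an assumption.
    """
    if node_count not in (3, 5):
        raise ValueError("this sketch only covers the 3-node and 5-node HA topologies")
    return node_count + 2

def raid_supported(raid_level, host_count):
    """True if the vSAN cluster has enough hosts for the chosen RAID level."""
    return host_count >= MIN_HOSTS[raid_level]

def check_plan(node_count, free_ips, raid_level, host_count):
    """Return a list of problems found; an empty list means the plan passes."""
    problems = []
    needed = required_ips(node_count)
    if free_ips < needed:
        problems.append(f"IP pool too small: need {needed}, have {free_ips}")
    if not raid_supported(raid_level, host_count):
        problems.append(
            f"{raid_level} needs {MIN_HOSTS[raid_level]} hosts, have {host_count}"
        )
    return problems

print(check_plan(node_count=5, free_ips=6, raid_level="RAID6", host_count=4))
```

Running a check like this before creating the infrastructure policy surfaces an undersized IP pool as a readable message instead of a stalled provisioning job.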

Design dimension | Validated-design guidance | Why it matters
DSM appliance placement | Management domain, 1:1 with each vCenter | Adds resource load to the mgmt domain; plan for one appliance per WLD vCenter
Database cluster placement | Separate vSAN cluster from GPU hosts, same VI WLD | Removes noisy-neighbor contention from the retrieval path
HA topology | 3-node or 5-node PostgreSQL for production | A single node makes the database a single point of failure for the whole RAG system
IP planning | 5-node cluster = 7 IPs (nodes + kube_VIP + LB) | Undersized IP pools stall provisioning of the infrastructure policy
Storage policy | vSAN ESA, RAID 5 (4+ hosts) or RAID 6 (6+ hosts) | Performance close to RAID 1 with better usable capacity
Backup target | S3-compatible object store (e.g. MinIO) with TLS | Embeddings are expensive to recompute; protect them like any other data

HNSW or IVFFlat: choosing the index

This is the decision that separates a vector store that works from one that quietly disappoints. pgvector ships two approximate-nearest-neighbor index types, and they behave very differently. HNSW (a graph index) gives excellent recall and high query throughput, handles incremental inserts gracefully, but is slow to build and memory-hungry. IVFFlat (a clustered/inverted-list index) builds fast and is compact, but its recall is lower and, importantly, it does not adapt as data changes: the cluster centroids are computed once and go stale silently as you add or remove vectors.

Dimension | HNSW | IVFFlat
Recall at high QPS | Very high (e.g. 0.998 recall sustained) | Lower at the same throughput
Query throughput | Best (an order of magnitude faster in published benchmarks) | Modest
Build time | Slow (can be ~30x slower to build) | Fast
Index size | Large (roughly 2-3x the raw vectors) | Compact
Incremental writes | Handles them well | Centroids go stale; needs periodic rebuild
Key knobs | m=16, ef_construction=200; tune ef_search at query time | lists ~= rows/1000 up to 1M rows, sqrt(rows) above; tune probes
[Figure: Index decision flow. When to reach for HNSW vs IVFFlat. Decision: is the corpus under ~1M vectors, and does data arrive incrementally? Yes: use HNSW (recall + throughput first; m=16, ef_construction=200; accept slower build + RAM). No, very large + bulk-loaded: use IVFFlat (fast build, compact index; lists ~= sqrt(rows); schedule periodic rebuilds). Default verdict: start with HNSW. Most enterprise RAG corpora are under 1M chunks with steady ingestion.]
IVFFlat is the exception case, not the default.
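
The decision flow and the key knobs can be sketched as a small helper that emits the index DDL. Treat it as a starting point under stated assumptions: the `chunks` table and `embedding` column are hypothetical names, the 10M-row cutoff for IVFFlat is an assumption taken from this post's sizing guidance, and the `lists` and `probes` starting points follow the pgvector README (rows / 1000 up to 1M rows, sqrt(rows) beyond that, and probes of roughly sqrt(lists)). The HNSW parameters are the ones this post recommends.

```python
import math

def ivfflat_lists(rows):
    """Starting point for lists: rows / 1000 up to 1M rows, sqrt(rows) above."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return int(math.sqrt(rows))

def ivfflat_probes(lists):
    """Starting point for probes at query time: sqrt(lists)."""
    return max(1, int(math.sqrt(lists)))

def choose_index(rows, bulk_loaded, ivfflat_threshold=10_000_000):
    """Default to HNSW; pick IVFFlat only for very large, bulk-loaded corpora.

    The threshold is an assumption for illustration, not a pgvector limit.
    """
    if bulk_loaded and rows > ivfflat_threshold:
        return "ivfflat"
    return "hnsw"

def index_ddl(table, column, rows, bulk_loaded, opclass="vector_cosine_ops"):
    """Build the CREATE INDEX statement for the chosen index type."""
    if choose_index(rows, bulk_loaded) == "hnsw":
        return (
            f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
            "WITH (m = 16, ef_construction = 200);"
        )
    lists = ivfflat_lists(rows)
    return (
        f"CREATE INDEX ON {table} USING ivfflat ({column} {opclass}) "
        f"WITH (lists = {lists});"
    )

print(index_ddl("chunks", "embedding", rows=800_000, bulk_loaded=False))
print(index_ddl("chunks", "embedding", rows=25_000_000, bulk_loaded=True))
```

The operator class (`vector_cosine_ops` here) has to match the distance operator used in queries, otherwise the planner will not use the index.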

The verdict: default to HNSW. Most enterprise RAG corpora are well under a million chunks, documents trickle in continuously rather than arriving in one giant bulk load, and recall is the thing users actually feel. HNSW wins on all three. Reach for IVFFlat only when you have many millions of vectors, you can bulk-load before building, and the build time or memory footprint of HNSW genuinely rules it out. If you do choose IVFFlat, put a scheduled index rebuild on the calendar, because nothing in the system will warn you when stale centroids start eroding recall.
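
Because stale centroids fail silently, the practical defense is to measure recall yourself. A minimal sketch, assuming you can collect two result lists per sample query: the IDs returned through the index, and the IDs from an exact scan of the same query (the pgvector README describes getting exact results by disabling index scans for the transaction). The 0.95 target and the function names are illustrative assumptions.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the ANN index also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def mean_recall(pairs):
    """Average recall over (approx_ids, exact_ids) pairs from a sample of queries."""
    scores = [recall_at_k(approx, exact) for approx, exact in pairs]
    return sum(scores) / len(scores)

def needs_rebuild(pairs, target=0.95):
    """Flag the index for a rebuild or retune when mean recall drops below target."""
    return mean_recall(pairs) < target

# Hypothetical sample: two queries, top-4 results each.
sample = [
    ([11, 12, 13, 14], [11, 12, 13, 14]),  # index agrees with the exact scan
    ([21, 22, 23, 99], [21, 22, 23, 24]),  # index missed one true neighbour
]
print(mean_recall(sample), needs_rebuild(sample))
```

Scheduling this against a fixed set of representative queries turns "recall quietly drops" into a number you can alert on.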


Sizing the vector store

The mistake here is sizing for rows and forgetting the index. Start from the embedding model, because the embedding dimension drives everything. A single vector stored as 32-bit floats is dimensions times 4 bytes: a 1024-dimension embedding is about 4 KB of raw vector per chunk before any overhead. Multiply by chunk count for the table, then add the index on top. With HNSW that index can be 2 to 3 times the size of the raw vectors, and for fast queries you want the working set resident in RAM, not paged off disk.

[Figure: Size for the index, not the rows. Where the memory actually goes (1M chunks, 1024-dim example): raw vectors ~4 GB (dims x 4 B) + HNSW index ~8-12 GB (2-3x) + PG working memory (shared_buffers, cache) = RAM target (keep working set resident, not paged). Rule of thumb: per-chunk bytes = embedding_dimensions x 4. Table size = per-chunk x chunk_count. Add 2-3x for an HNSW index, then size the VM class so the working set fits in RAM. Halving the embedding dimension roughly halves both storage and the RAM you must buy.]
The index, not the raw vectors, is what dictates the VM class you pick in DSM.
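
The rule of thumb works out as follows in a few lines of Python. This is a rough planning sketch, not a capacity model: it uses decimal gigabytes, applies the 2 to 3 times HNSW factor from this post, and deliberately leaves out row overhead, metadata columns, and PostgreSQL working memory, which you still need to add on top.

```python
def estimate_vector_store(chunk_count, dimensions, index_factor=(2.0, 3.0)):
    """Estimate raw vector and HNSW index size in decimal GB (1 GB = 1e9 bytes)."""
    bytes_per_vector = dimensions * 4  # 32-bit floats, 4 bytes per dimension
    raw_gb = chunk_count * bytes_per_vector / 1e9
    index_low_gb = raw_gb * index_factor[0]
    index_high_gb = raw_gb * index_factor[1]
    return {
        "bytes_per_vector": bytes_per_vector,
        "raw_gb": raw_gb,
        "index_gb": (index_low_gb, index_high_gb),
        # Vectors plus index; PostgreSQL overhead is NOT included.
        "working_set_gb": (raw_gb + index_low_gb, raw_gb + index_high_gb),
    }

est = estimate_vector_store(chunk_count=1_000_000, dimensions=1024)
print(est)

# Halving the embedding dimension halves the raw vector footprint.
half = estimate_vector_store(chunk_count=1_000_000, dimensions=512)
print(half["raw_gb"])
```

For the 1M-chunk, 1024-dimension example this gives about 4.1 GB of raw vectors and roughly 8 to 12 GB of index, which matches the figures above.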

Define DSM VM classes that align to this. The validated design is explicit that you should size VM classes against use case, workload type, data volume, and transactions per second, and that applies doubly to a vector workload where the index wants memory. The table below is a planning starting point, not a guarantee; benchmark your own embedding model and query mix before you commit.

Corpus scale (chunks) | Index | HA topology | Memory posture
Up to ~100k (PoC, single app) | HNSW | 3-node | Small VM class; index fits comfortably in RAM
~100k to 1M (department RAG) | HNSW | 3-node | Size RAM for raw vectors x ~3 plus PG overhead
1M to ~10M (enterprise corpus) | HNSW, watch build time | 5-node | Memory-optimized VM class; consider lower-dim embeddings
Tens of millions+, bulk-loaded | IVFFlat (or partition) | 5-node | Trade recall for build/memory; schedule rebuilds

What actually bites in production

A few patterns show up again and again on retrieval-tier work, and none of them are in the quickstart. First, watch for a mismatch between the embedding model and the stored vectors. The most common bring-up failure is embedding documents with one model, then querying with a different model or a different version, so the dimensions or the vector space no longer line up and similarity scores turn to noise. Pin the embedding model version the same way you pin everything else, and store it as metadata next to the vectors.
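
One way to enforce the pin is a guard that runs before every query. The sketch below is hypothetical: `EmbeddingSpec`, `EmbeddingMismatch`, and the model name are illustrative, and the assumption is that you store the spec alongside the vectors (for example in a metadata table) and load it at application start.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """Identity of the embedding model that produced a set of vectors."""
    model_name: str
    model_version: str
    dimensions: int

class EmbeddingMismatch(ValueError):
    """Raised when a query vector does not come from the pinned model."""

def check_query_embedding(stored, query_spec, query_vec):
    """Refuse to search if the query model or dimension differs from the stored one."""
    if (query_spec.model_name, query_spec.model_version) != (
        stored.model_name,
        stored.model_version,
    ):
        raise EmbeddingMismatch(
            f"corpus embedded with {stored.model_name} {stored.model_version}, "
            f"query embedded with {query_spec.model_name} {query_spec.model_version}"
        )
    if len(query_vec) != stored.dimensions:
        raise EmbeddingMismatch(
            f"expected {stored.dimensions} dimensions, got {len(query_vec)}"
        )

# Hypothetical pinned spec, as it might be read back from the metadata table.
stored = EmbeddingSpec("example-embedder", "1.0", 4)
check_query_embedding(stored, stored, [0.1, 0.2, 0.3, 0.4])  # passes silently
```

A dimension mismatch usually fails loudly at the database anyway; the version check matters more, because two versions with the same dimension will happily return plausible-looking noise.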

Second, the distance operator has to match how the embeddings were trained. Cosine distance, L2, and inner product are not interchangeable, and using the wrong one produces results that look almost right, which is worse than obviously broken. Third, teams forget that re-embedding the whole corpus after a model change is an expensive, GPU-bound batch job, not a quick migration. Plan for it, and back up the embeddings so you are not forced into it. Fourth, ef_search (HNSW) and probes (IVFFlat) are query-time knobs you should actually tune against your recall target rather than leaving at defaults; the default ef_search of 40 is conservative for many workloads.
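
To see why the three distance functions are not interchangeable, the toy example below ranks the same candidates under each one. The vectors and names are made up for illustration; the functions mirror what pgvector's `<->` (L2 distance), `<=>` (cosine distance), and `<#>` (negative inner product) operators compute.

```python
import math

def l2_distance(a, b):
    """Euclidean distance, as computed by pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, as computed by pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def neg_inner_product(a, b):
    """Negative inner product, as computed by pgvector's <#> operator."""
    return -sum(x * y for x, y in zip(a, b))

def nearest(query, candidates, metric):
    """Name of the candidate with the smallest distance under the given metric."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))

query = [1.0, 0.0]
candidates = {
    "same_direction_short": [0.4, 0.0],  # perfectly aligned, small magnitude
    "close_by": [1.0, 0.3],              # nearest point in space
    "large_magnitude": [3.0, 3.0],       # biggest dot product
}

for metric in (l2_distance, cosine_distance, neg_inner_product):
    print(metric.__name__, "->", nearest(query, candidates, metric))
```

Each metric picks a different winner here because the vectors are not normalized. With unit-length embeddings the three rankings coincide, which is why the wrong operator can look almost right on some models and fall apart on others. The query-time knobs are set per session in pgvector, for example `SET hnsw.ef_search = 100;` or `SET ivfflat.probes = 10;`; the values shown are placeholders to tune against your recall target.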

Disclaimer: Treat any change to a production retrieval tier as a real change. Validate the DSM and PostgreSQL versions against the PAIF interoperability matrix, confirm vSAN host counts support your chosen RAID level, verify IP-pool sizing for the HA topology, back up the database and the embeddings to your TLS object store, and test the index build and recall on a representative copy before you touch production.

What I’d Do

Provision the vector store as DSM-managed PostgreSQL with pgvector, in a 3-node HA cluster on a dedicated vSAN ESA cluster, separate from the GPU hosts. Index with HNSW unless your corpus genuinely runs to tens of millions of bulk-loaded vectors. Size the VM class for the index plus PostgreSQL working memory, not just the row count, and keep the working set in RAM. Pin the embedding model and store its version with the data. Then back up the embeddings, because the day you need them recomputed is the day you will not have GPUs to spare. What embedding dimension is your team standardizing on, and have you sized RAM for the index it implies?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
