Dr. Pranay Jha

Vector Databases: How Semantic Search Really Works (GenAI Series, Part 14)

A vector database stores embeddings and finds the ones closest in meaning to a query, fast, across millions of items. This part covers how semantic search, ANN indexes like HNSW and IVF, and pgvector work.

Read time: 8 minutes

Generative AI Series · Part 14 of 30

TL;DR · Key Takeaways

  • A vector database stores embeddings and finds the ones closest in meaning to a query, fast, even across millions of items.
  • It powers semantic search: matching by meaning rather than exact words, which is what makes RAG and recommendations work.
  • The trick is the index. Instead of comparing your query to every vector, it cleverly checks only a promising subset, trading a sliver of accuracy for enormous speed.
  • You do not always need a specialist database. pgvector bolts vector search onto PostgreSQL, and managed services exist when scale demands them.

Part 13 ended on a quiet piece of magic. A RAG system has to take your question and, out of perhaps millions of stored text chunks, find the handful that match it in meaning, and it has to do it in a few milliseconds, on every single request. That is not a small ask. Search every item one by one and a large knowledge base would take seconds per query, which is hopeless at scale. The thing that makes it fast enough to be invisible is the vector database, and the idea at its heart is genuinely clever.

[Figure: Matching words vs matching meaning. For the query “reset my password”, keyword search matches “How to reset your password” but misses “Recovering account access” and “I forgot my login”, because it misses anything not using the exact words. Semantic search matches all three, because it matches by meaning and shared words are optional.]
Keyword search needs the words to match. Semantic search only needs the meaning to match.

From embeddings to search

Recall the big idea from Part 7: an embedding turns a piece of text into a vector, a list of numbers that acts like coordinates for its meaning, so that similar meanings land near each other. A vector database is simply a system built to store huge numbers of these vectors and answer one question extremely well: given a new vector, which stored vectors are closest to it? That “closest” is meaning-closeness, measured with something like the cosine similarity we met earlier. Find the nearest vectors and you have found the most semantically related text.
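To make “meaning-closeness” concrete, here is a minimal sketch of cosine similarity in plain Python. The three-number vectors are made-up toy values standing in for real embeddings, which would come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": invented values for illustration, not output from a real model.
reset_password = [0.9, 0.1, 0.0]
recover_access = [0.8, 0.3, 0.1]
pizza_recipe = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(reset_password, recover_access)
sim_unrelated = cosine_similarity(reset_password, pizza_recipe)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The two account-related vectors point in nearly the same direction, so their score is close to 1, while the unrelated one scores near 0.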

This is why semantic search feels smarter than the old keyword kind. A keyword search for “reset my password” only finds documents containing those words, missing a help article titled “recovering account access” that is exactly what the user needs. Semantic search embeds the query and the documents into the same meaning-space, so the article sits close to the query even with no shared words. The same machinery quietly powers product recommendations, duplicate detection, and “find similar” features everywhere. Underneath all of them is one operation: nearest-neighbour search in a space of vectors.

[Figure: Search = find the nearest points. The query's nearest 3 points are retrieved; far-away points are ignored.]
Every stored chunk is a point. A query lands among them and grabs its closest neighbours.

The index: how it gets fast

Here is the catch that makes vector databases interesting. The obvious way to find the nearest vectors is to compare your query against every stored vector and keep the best. That is called brute-force search, and it gives the perfect answer, but its cost grows linearly with the size of your collection: double the vectors and you double the work per query. At a few thousand items it is fine. At ten million, it is far too slow to run on every request. Something smarter is needed, and that something is the index.
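Brute-force search is short enough to write out in full. The sketch below scores every stored vector against the query and keeps the top k. The document ids and vectors are invented for illustration, and a real system would use an optimised numeric library rather than plain Python loops.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_search(query, store, k=3):
    """Compare the query to every stored vector and return the ids of the k best.

    Exact, but the number of comparisons equals len(store) on every query.
    """
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical store: document id -> toy embedding.
store = {
    "reset-password": [0.9, 0.1, 0.0],
    "recover-access": [0.8, 0.3, 0.1],
    "forgot-login": [0.7, 0.2, 0.2],
    "pizza-recipe": [0.0, 0.2, 0.9],
    "holiday-policy": [0.1, 0.9, 0.1],
}

query = [0.85, 0.2, 0.05]  # stands in for an embedded "reset my password"
top3 = brute_force_search(query, store, k=3)
print(top3)
```

With five items this is instant. The loop body is the part an index exists to avoid running ten million times.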

A vector index is a pre-built structure that lets the database skip the vast majority of comparisons and still find the right neighbours almost every time. The key word is “almost.” These are approximate nearest-neighbour methods: they accept a tiny chance of missing a true closest item in exchange for being hundreds of times faster. In practice the approximation is so good that you would rarely notice, and the speed-up is the difference between a usable product and an unusable one. The two families you will hear about most are graph-based methods like HNSW and clustering-based methods like IVF, summarised below.

Approach | How it works | Trade-off
Brute force (flat) | Compare the query to every vector | Exact, but slow at scale
HNSW (graph) | Hop through a graph of linked vectors toward the query | Very fast and accurate, uses more memory
IVF (clusters) | Group vectors into clusters, search only the nearest few | Lean and quick, tuning the cluster count matters
+ Quantization | Compress vectors to shrink memory | Cheaper storage, slight accuracy loss
Indexes trade a little accuracy for a lot of speed. Which one wins depends on your scale and budget.
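The IVF row is the easiest of these to sketch by hand. The toy below clusters random vectors with a plain k-means loop, then answers a query by scanning only the few clusters whose centroids are nearest. It is an illustration of the idea rather than how any particular library implements it: the function names are my own, the data is random, and it uses squared Euclidean distance to keep the code short.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def assign(vectors, centroids):
    """Put each vector's index into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def build_ivf(vectors, n_clusters, iterations=10, seed=0):
    """Plain k-means: returns (centroids, inverted lists of vector indices)."""
    centroids = random.Random(seed).sample(vectors, n_clusters)
    for _ in range(iterations):
        lists = assign(vectors, centroids)
        # Recompute each centroid; keep the old one if its cluster emptied out.
        centroids = [mean([vectors[i] for i in members]) if members else centroids[c]
                     for c, members in enumerate(lists)]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=2):
    """Scan only the nprobe clusters closest to the query, then rank those candidates."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(4)] for _ in range(500)]
centroids, lists = build_ivf(vectors, n_clusters=8)
query = [0.5, 0.5, 0.5, 0.5]

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:5]
approx = ivf_search(query, vectors, centroids, lists, k=5, nprobe=2)
print("overlap with exact top 5:", len(set(exact) & set(approx)))
```

Probing every cluster reproduces the brute-force answer exactly. Probing fewer skips most of the comparisons and may occasionally miss a true neighbour that sits just across a cluster boundary, which is the trade the table describes.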

Do you even need a special database?

The phrase “vector database” makes this sound like a heavy new piece of infrastructure you must adopt. Often it is not. If you already run PostgreSQL, the pgvector extension adds vector storage and similarity search right inside the database you have, which means your embeddings live next to your normal data and you query both with familiar tools. For a great many applications, that is the whole answer, and reaching for a dedicated vector store first is over-engineering.

One feature matters more than people expect, and it is a good reason to keep your vectors near your normal data: metadata filtering. Real queries are rarely “find the closest meaning” in a vacuum. They are “find the closest meaning among documents this user is allowed to see,” or “…published in the last year,” or “…from the finance department only.” A vector search that ignores those filters can happily return a perfectly relevant passage the user has no right to read, which is a privacy incident waiting to happen. So in practice you combine the similarity search with ordinary structured filters on tags, dates, and permissions. Living inside PostgreSQL makes that combination natural, because the filters are just normal SQL conditions sitting alongside the vector query. It is one more reason the boring option is often the right one.
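Here is roughly what that combination looks like, as a sketch that builds the SQL in Python without running it. The `documents` table and its columns are hypothetical, and the `%s` placeholders assume a psycopg-style driver. The `<=>` operator is pgvector's cosine-distance operator.

```python
def to_pgvector_literal(embedding):
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"

def build_filtered_search(query_embedding, department, allowed_group_ids, limit=5):
    """Return (sql, params) for a similarity search restricted by ordinary filters.

    Table and column names are illustrative assumptions, not a fixed schema.
    """
    sql = (
        "SELECT id, title "
        "FROM documents "
        "WHERE department = %s "                          # structured filter
        "AND published_at >= now() - interval '1 year' "  # date filter
        "AND access_group_id = ANY(%s) "                  # permission filter
        "ORDER BY embedding <=> %s::vector "              # cosine distance, nearest first
        "LIMIT %s"
    )
    params = (
        department,
        list(allowed_group_ids),
        to_pgvector_literal(query_embedding),
        limit,
    )
    return sql, params

sql, params = build_filtered_search([0.1, 0.2, 0.3], "finance", [7, 12])
print(sql)
print(params)
```

The permission and date checks are ordinary WHERE conditions, so a passage the user may not read never becomes a candidate in the first place.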

Dedicated vector databases and managed services earn their place when scale or features demand it: billions of vectors, very high query rates, advanced filtering, or the desire not to operate the thing yourself. They handle sharding, replication, and index tuning for you, at a price. My honest advice is to start with what you already run, prove the use case, and graduate to a specialist system only when you hit a wall you can actually name. Choosing the fanciest option on day one is how teams end up maintaining infrastructure their workload never needed.

Reality check: the database is rarely where RAG quality is won or lost. Two systems using the same vector store can perform worlds apart depending on the embedding model and the chunking. I would tune those first and treat the choice of vector database as an operational decision about scale and cost, not the thing that will fix mediocre search results.

Go Deeper (optional, for technical readers)

Approximate nearest-neighbour search lives on a recall-versus-latency curve, and understanding it is most of the operational skill. Recall here means the fraction of the true nearest neighbours the index actually returns. You can almost always buy higher recall by letting the search work harder, exploring more of the graph or probing more clusters, but that costs time. Every ANN index exposes knobs that slide you along this curve.
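Recall is simple to measure once you have the exact answer to compare against. A minimal sketch, with invented ids standing in for the results of a brute-force search and an approximate index:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbours that the approximate index returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

def mean_recall(approx_results, exact_results):
    """Average recall over a batch of queries (lists of id lists, in the same order)."""
    scores = [recall_at_k(a, e) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

# Hypothetical results for one query: the index found 4 of the 5 true neighbours.
exact_top5 = [4, 17, 23, 42, 8]
approx_top5 = [4, 17, 42, 99, 8]

recall = recall_at_k(approx_top5, exact_top5)
print(recall)  # 0.8
```

In practice you would run this over a sample of real queries at several settings of the index knobs and plot recall against latency.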

In HNSW, a multi-layer proximity graph, the key dials are how many neighbours each node links to (often called M) and how wide the search beam is at query time (often called ef). Raise them and recall climbs while latency and memory grow. In IVF, the data is partitioned into clusters and a query only searches the few nearest ones; the nprobe setting controls how many clusters you check, again trading recall for speed. Both are frequently paired with product quantization, which compresses each vector into a compact code so billions fit in memory, at a small accuracy cost. The practical takeaway: there is no single “best” index, only the right point on the recall-latency-cost curve for your traffic and your tolerance for the occasional missed result. Benchmark on your own data, because published numbers rarely match your distribution.
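A quick back-of-the-envelope calculation shows why quantization matters at that scale. Every number below is an assumption chosen for illustration: one billion vectors, 768 dimensions stored as 32-bit floats, and a product-quantization setup that replaces each vector with 96 one-byte codes. It ignores codebooks and index overhead.

```python
N_VECTORS = 1_000_000_000   # assumed collection size
DIMENSIONS = 768            # assumed embedding width
BYTES_PER_FLOAT32 = 4
PQ_CODES_PER_VECTOR = 96    # assumed: 96 sub-vectors, each stored as a 1-byte code

raw_bytes = N_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32
pq_bytes = N_VECTORS * PQ_CODES_PER_VECTOR

raw_gb = raw_bytes / 1e9
pq_gb = pq_bytes / 1e9
compression = raw_bytes / pq_bytes

print(f"raw float32: {raw_gb:.0f} GB")      # 3072 GB
print(f"PQ codes:    {pq_gb:.0f} GB")       # 96 GB
print(f"compression: {compression:.0f}x")   # 32x
```

Under these assumptions the raw vectors need about 3 TB while the codes fit in under 100 GB, which is the difference between a cluster and a single large machine.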

This is Part 14 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. Vector search rests on the embeddings from Part 7 and powers the retrieval in Part 13 on RAG.

The Bottom Line

A vector database is the engine that makes “find the closest meaning” fast enough to run on every request. It stores embeddings, and through an approximate index it locates the nearest neighbours of a query without comparing against everything, trading a sliver of accuracy for a massive gain in speed. That single capability is what turns the embeddings of Part 7 into the working retrieval of Part 13, and into semantic search and recommendations across the web.

If I had one piece of advice, it would be to resist starting with the most exotic option. Most teams are well served by pgvector until proven otherwise, and the real quality levers sit in the embedding and chunking, not the database brand. With retrieval and storage now demystified, the practical question becomes one of choices: when you want a model to use your knowledge or behave a certain way, should you prompt it, fine-tune it, or wire up RAG? That is exactly the decision the next part settles.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
