How Words Become Numbers: Tokens and Embeddings (GenAI Series, Part 7)

A model does math, not language. How tokenizing chops text into chunks and embedding turns each into a vector of meaning-coordinates, where similar ideas sit close together.

by Dr. Pranay Jha · June 18, 2026 · 9-minute read

Generative AI Series · Part 7 of 30

TL;DR · Key Takeaways

A model does math, not language, so text must first become numbers. That happens in two steps: tokenizing, then embedding.
Tokenizing chops text into small chunks (whole words, word-parts, or punctuation) and gives each a number from a fixed vocabulary.
Embedding turns each token into a long list of numbers, a vector, that acts like coordinates for the token’s meaning.
In that space, similar meanings sit close together, so “closeness” becomes something a computer can measure. This is the quiet foundation under search, RAG, and the models themselves.

Here is a fact that sounds too dumb to matter: a model cannot read. It multiplies numbers, and that is the entire repertoire. So before anything clever can happen to your sentence, the words have to turn into numbers, and the model’s numeric answer has to turn back into words on the way out.

That conversion is two steps: tokenizing, then embedding. They are, to my mind, the most underrated ideas in generative AI. People sprint past them to get to attention or prompt tricks, then spend the next three months quietly confused about token bills, context limits, and why semantic search even works. Nearly all of that traces straight back to these two moves, so they are worth slowing down for.

Two conversions stand between your sentence and the model’s math.

Step one: chopping text into tokens

The first step, tokenizing, breaks your text into pieces called tokens and assigns each a number. You might assume a token is a word, and often it is, but not always. A tokenizer works from a fixed vocabulary, a dictionary of perhaps fifty to a hundred thousand chunks built once before training. Common words like “cat” or “the” usually get their own single token. Rarer or longer words get split into familiar parts. “Tokenization” might become “token” and “ization,” and an unusual name could fragment into several pieces. Spaces and punctuation count too. This is why, as a rough rule of thumb, a hundred words of English come to around 130 tokens, and why the model never truly sees letters or whole sentences, only this stream of vocabulary chunks.

Why bother with these odd half-words instead of plain whole words? Because language is endless and a vocabulary cannot be. New words, typos, brand names, slang, and other languages appear constantly. If the model only knew complete words, the first unfamiliar one would stop it cold. By keeping a stock of word-parts, the tokenizer can always build any string out of pieces it already knows, the way you can spell a word you have never seen using letters you already know. It is a clever compromise between two bad extremes: a vocabulary of every possible word (impossibly large) and a vocabulary of single letters (tiny but stripped of meaning). Tokens land in the useful middle.
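To make the chopping concrete, here is a minimal sketch of a toy tokenizer. Everything in it is an assumption for illustration: the tiny vocabulary, the IDs, and the greedy longest-match rule, which is a simplification and not how any particular production tokenizer is implemented. Single lowercase letters are included as the fallback pieces, so an unfamiliar word still gets spelled out from chunks the vocabulary knows.

```python
import string

# Hypothetical vocabulary: a few multi-letter chunks plus single letters as a fallback.
CHUNKS = ["the", "cat", "token", "ization", " ", "."] + list(string.ascii_lowercase)
VOCAB = {chunk: idx for idx, chunk in enumerate(CHUNKS)}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match: at each position, take the longest chunk found in the vocabulary."""
    longest = max(len(chunk) for chunk in vocab)
    pieces = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in vocab:
                pieces.append(chunk)
                i += size
                break
        else:
            # This toy vocabulary has no byte-level fallback, so it can still fail
            # on characters it has never seen (for example uppercase letters).
            raise ValueError(f"no vocabulary chunk matches {text[i]!r} at position {i}")
    return pieces


def encode(text, vocab=VOCAB):
    """Turn text into the list of token IDs the model would actually receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("the tokenization"))  # ['the', ' ', 'token', 'ization']
print(encode("the tokenization"))    # [0, 4, 2, 3]
print(tokenize("cats"))              # ['cat', 's']
```

Notice that "cats" costs two tokens here while "cat" costs one, which is the same effect that makes rare words and unusual names more expensive than common ones.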

Step two: turning tokens into meaning-coordinates

A token’s ID number is just a name tag. An ID such as 15496 for “Cats” says nothing about cats, and the exact number varies from one tokenizer to the next; it is an arbitrary slot in the dictionary. The real work happens in the second step, embedding, which replaces each ID with a long list of numbers called a vector. A typical embedding might have a few hundred or a couple of thousand of these numbers. Think of them as coordinates, but in a space with far more than the usual three dimensions. Just as two numbers can pin a point on a map and three can place it in a room, a few hundred numbers can place a word in a vast space of meaning.

Those coordinates are not handed out at random, and this is where it gets interesting. They are learned during training, and they settle so that words used in similar ways end up in similar locations. “Coffee” and “tea” drift close together. “Coffee” and “gravel” end up far apart. Nobody hand-places these points; the same guess-check-nudge learning from Part 6 pushes related words together because they show up in related contexts. The result is that the geometry of the space encodes meaning. Direction and distance carry information, which is what lets a model treat “the doctor washed her hands” and “the physician cleaned his hands” as nearly the same idea despite sharing few exact words.
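Mechanically, the embedding step is nothing more than a table lookup: one row of numbers per token ID. The sketch below assumes a hypothetical six-token vocabulary and four-number vectors, and it fills the table with random values, which is roughly where training starts before the guess-check-nudge process moves related tokens together. The sizes and names are mine, chosen only for illustration.

```python
import random

VOCAB_SIZE = 6      # hypothetical tiny vocabulary
EMBEDDING_DIM = 4   # real models use hundreds or thousands of numbers per token

random.seed(0)  # fixed seed so the example is repeatable

# One row per token ID. The values are random here; training would adjust them.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each token ID with its row of the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([0, 4, 1])
print(len(vectors), len(vectors[0]))  # 3 4
```

Three token IDs go in and three vectors of four numbers each come out. The ID only selects the row; all the meaning lives in the numbers stored there.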

The most striking demonstration that direction carries meaning is a bit of vector arithmetic that became famous with early embedding models. Take the vector for “king,” subtract the vector for “man,” add the vector for “woman,” and the point you land on sits closest to “queen.” The same trick maps “Paris” minus “France” plus “Italy” near “Rome.” What this reveals is that the space has not just clumped similar words together, it has lined up consistent relationships as repeatable directions: one direction roughly means “make it more feminine,” another roughly means “capital city of.” Nobody programmed those axes. They fell out of the training because those relationships recur everywhere in human text. Modern models are more subtle than these clean party tricks suggest, and the arithmetic is rarely this tidy in practice. But the underlying point is real, and it is the part that still feels like sleight of hand to me even knowing how it works: the layout of the space carries the meaning, not just the labels on the dots.
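The arithmetic itself is easy to sketch. The vectors below are hand-picked by me, with three made-up dimensions (royalty, femininity, person-ness), so the result comes out far tidier than it would with real learned embeddings. Treat it as an illustration of the mechanics, not as evidence of how any actual model behaves.

```python
import math

# Hand-picked toy vectors: [royalty, femininity, person-ness]. These are not learned.
VECTORS = {
    "king":   [0.9, 0.1, 0.8],
    "queen":  [0.9, 0.9, 0.8],
    "man":    [0.1, 0.1, 0.8],
    "woman":  [0.1, 0.9, 0.8],
    "gravel": [0.0, 0.0, 0.1],
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is the usual convention for this demonstration, since the starting word is often the nearest point to the result.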

Related words cluster, and the directions between them can carry meaning of their own.

Why “close” is the whole point

Once meaning lives in coordinates, “similar” turns into a distance you can compute, and that single fact powers an astonishing amount of technology. To compare two pieces of text, you embed both and check how closely their vectors point in the same direction. The usual measure is cosine similarity, which looks at the angle between two vectors rather than how long they are. Two vectors pointing the same way score near 1 (very similar), perpendicular ones score near 0 (unrelated), and opposite ones score near −1. The intuition is simple: it asks whether two words or sentences are heading in the same direction in meaning-space, ignoring how big or emphatic each one is.
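For readers who like to see the formula as code, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python lists and two-number vectors I made up, so that the three landmark scores are easy to check by hand.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction, different length
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same, 6), round(perpendicular, 6), round(opposite, 6))
```

The first pair differs only in length, and the score still comes out at essentially 1, which is the "ignoring how big or emphatic each one is" part in action.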

This is the engine of semantic search, the kind that finds “how do I reset my password” when you typed “I forgot my login,” even with no shared keywords. It is how a retrieval system, the RAG approach in a later part, fetches the right paragraph from your documents to ground an answer. It is what a vector database, another upcoming part, is built to store and search at scale. All of it reduces to the same move: turn text into vectors, then measure which vectors are close. None of it is exotic once you have seen the trick. If you remember a single mechanical idea from this whole series, I would vote for this one.
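Here is that move as a minimal sketch of semantic search. The document texts, the query, and above all the three-number vectors are invented for illustration; a real system would obtain the vectors from an embedding model and keep them in a vector database instead of a dictionary.

```python
import math

# Hypothetical embeddings, hand-assigned so that the first number loosely means
# "account access", the second "shipping", and the third "opening hours".
DOCS = {
    "How do I reset my password": [0.9, 0.1, 0.0],
    "Shipping rates and delivery times": [0.1, 0.9, 0.1],
    "Store opening hours": [0.0, 0.2, 0.9],
}
QUERY_TEXT = "I forgot my login"
QUERY_VECTOR = [0.8, 0.0, 0.1]  # assumed embedding of QUERY_TEXT


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def search(query_vector, docs, top_k=2):
    """Score every document against the query and return the top_k closest."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in docs.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


results = search(QUERY_VECTOR, DOCS)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The query shares no keywords with the password document, yet that document ranks first because the two vectors point in nearly the same direction.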

Cosine similarity scores meaning by direction, which is why unrelated-looking phrases can still match.

Reality check: tokens are why you are billed and limited by token, not by word, and why a wall of code or a foreign language can burn through a context window faster than plain English. They also explain a classic stumble: a model can struggle to count the letters in a word or spell it backwards, because it never sees the letters, only the tokens. The quirk is not stupidity; it is a consequence of the unit the model works in.

Go Deeper (optional, for technical readers)

The vocabulary is usually built with a method called byte-pair encoding (BPE) or a close cousin. It starts from the smallest units, individual bytes or characters, then repeatedly scans a large corpus and merges the most frequent adjacent pair into a new token, again and again, until it reaches a target vocabulary size. Frequent sequences like “ing” or “the” get merged into single tokens early, while rare strings stay split into smaller parts. The elegance is that a byte-level BPE tokenizer never faces an unknown word: in the worst case it falls back to individual bytes, so any input at all can be represented. This is also why token counts vary by language; tokenizers trained mostly on English spend more tokens per word on languages they saw less of, which has real cost and fairness implications.
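The merge loop is short enough to sketch in full. This is a toy version under stated assumptions: it works on characters instead of bytes, the six-word "corpus" is made up, and it stops after a fixed number of merges instead of a target vocabulary size. Production tokenizers add many details this leaves out.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn an ordered list of merge rules from a list of words."""
    # Each word starts as a tuple of single characters, stored with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode_word(word, merges):
    """Split a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["sing", "ring", "king", "thing", "the", "in"], num_merges=3)
print(merges)                        # [('i', 'n'), ('in', 'g'), ('t', 'h')]
print(encode_word("thing", merges))  # ['th', 'ing']
print(encode_word("bring", merges))  # ['b', 'r', 'ing']
```

The word "bring" never appeared in the training list, yet it still encodes cleanly: the frequent ending becomes one token and the rest falls back to single characters.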

On the embedding side, the length of each vector is the embedding dimension, often a few hundred to a few thousand numbers. Higher dimensions give the model more room to separate fine shades of meaning, at the cost of memory and compute. One subtlety worth knowing: the simple per-token embeddings described here are the input layer. As the tokens pass through a transformer’s attention layers, each token’s vector is updated based on its neighbours, so the representation becomes contextual: the vector for “bank” near “river” ends up different from the vector for “bank” near “money.” That contextual reshaping is exactly what attention does, and it is the subject of the next part.
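The memory side of that trade-off is simple arithmetic: the embedding table holds one vector per vocabulary entry, so its size is vocabulary size times embedding dimension times bytes per number. The figures below are assumptions chosen for round numbers (a 100,000-token vocabulary stored as 4-byte floats), not measurements from any particular model.

```python
def embedding_table_bytes(vocab_size, embedding_dim, bytes_per_number=4):
    """Memory needed to store one vector per vocabulary entry."""
    return vocab_size * embedding_dim * bytes_per_number


# Hypothetical 100,000-token vocabulary at three embedding dimensions.
for dim in (256, 1024, 4096):
    size_mb = embedding_table_bytes(100_000, dim) / 1_000_000
    print(f"dim={dim}: {size_mb:.1f} MB")
```

Under these assumptions the table grows from about 102 MB at 256 dimensions to about 1.6 GB at 4096, and that is the input table alone, before any of the layers that follow it.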

This is Part 7 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. If “learning” here felt unclear, see Part 6 on how networks learn, or begin the series at Part 1.

The Bottom Line

So the pipeline is short: chop text into tokens, then turn each token into a vector that pins its meaning somewhere in space. Once that is done, “similar” stops being a fuzzy human word and becomes a distance you can calculate. That one shift is the quiet thing doing the heavy lifting behind search, retrieval, and the models themselves.

There is a catch, though, and it is the reason this story is not finished. The vectors we have built so far are static. “Bank” gets one fixed point whether you are talking about a river or your savings, and that is obviously not how language works. A word’s meaning bends to the words around it. Teaching the vectors to do that bending is the job of attention, and it is where we head next.



About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
