The GenAI Words Everyone Uses, and What They Actually Mean (GenAI Series, Part 2)

Model, tokens, parameters, inference, embeddings, hallucination: the words everyone uses about generative AI, sorted into build time and use time and explained in plain English.

By Dr. Pranay Jha · June 18, 2026 · Read time: 10 minutes

Generative AI Series · Part 2 of 30

TL;DR · Key Takeaways

The jargon is not random. Almost every term belongs to one of two moments: building the model or using it.
Build time gives you training, parameters, fine-tuning, and the labels LLM and foundation model.
Use time gives you prompt, token, context, inference, and the failure mode everyone fears, hallucination.
Embedding is the bridge: it is how words get turned into numbers so the math can happen at all.

The fastest way to feel lost in an AI conversation is the vocabulary. People throw around tokens, parameters, inference, and embeddings as if everyone agreed on them years ago. The good news is that the words are not as scattered as they sound. Nearly all of them describe one of two moments in a model’s life: the moment it is built, and the moment you use it. Sort each term into the right moment and the jargon stops being a wall and starts being a map. That sorting is what this part does.

Sort each term into build time or use time and the vocabulary turns into a map.

The thing itself: model and parameters

Start with the model. We give it a full part of its own later, but the short version is that a model is a giant mathematical function that turns an input into a prediction. It is not a database of answers. It is a set of patterns squeezed into numbers.

Those numbers are the parameters. Think of a mixing desk in a studio with billions of dials. Each parameter is one dial, and the exact position of every dial is what makes this model behave the way it does. When you read that a model is “8 billion” or “70 billion” (often written 8B or 70B), that figure is the number of dials. More dials usually means more capacity to capture nuance, though it is a rough guide rather than a guarantee of quality. The whole point of building a model is to find good settings for all those dials, and that search has a name of its own.
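To make the "number of dials" concrete, here is a minimal sketch that counts the parameters of a toy fully connected network. The function name and the layer sizes are made up for illustration. Real LLMs use transformer layers rather than this simple stack, but the bookkeeping idea (weights plus biases, added up layer by layer) is similar.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a toy fully connected network.

    layer_sizes is a hypothetical list such as [4, 8, 2]: 4 inputs,
    one hidden layer of 8 units, and 2 outputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one dial per input-to-output connection
        biases = n_out          # one extra dial per output unit
        total += weights + biases
    return total


tiny = count_parameters([4, 8, 2])
print(tiny)  # 58
```

A model described as "8 billion" is the same kind of count, only with far larger and far more numerous layers.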

Build time: training, fine-tuning, and the labels

Training is the process of setting those dials by example. The model is shown enormous amounts of text, makes a guess, gets told how wrong it was, and nudges its dials a little to be less wrong next time. Repeat that billions of times and the parameters settle into positions that capture real patterns of language. Training is slow, expensive, and happens once per model, not once per question. It is the months of studio work before the album ships.
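The guess, measure, nudge loop can be sketched with a model that has exactly one dial. This is a toy under stated assumptions: the data, the learning rate, and the function name are invented for the example, and the update rule is plain gradient descent on squared error, which is one common way to do the nudging rather than a description of how any particular LLM was trained.

```python
def train_one_dial(examples, steps=200, learning_rate=0.05):
    """Fit a single parameter w so that w * x approximates y."""
    w = 0.0  # the dial starts in an arbitrary position
    for _ in range(steps):
        for x, y in examples:
            guess = w * x
            error = guess - y              # how wrong was the guess?
            gradient = 2 * error * x       # slope of squared error with respect to w
            w -= learning_rate * gradient  # nudge the dial to be less wrong
    return w


# Made-up data that follows y = 3x exactly.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train_one_dial(examples)
print(round(w, 3))
```

After enough passes the single dial settles very close to 3, the pattern hidden in the examples. Real training does the same thing across billions of dials at once.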

Fine-tuning is a shorter, more targeted round of the same thing. You take a model that already knows language broadly and train it a bit more on a narrow set of examples, say your company’s support tickets or a particular writing style. It does not rebuild the model, it adjusts it. We compare fine-tuning against other options in a later part, because people reach for it more often than they should.
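In code, the difference from training is mostly where you start and how far you go. The sketch below assumes a one-dial model whose value is already good for general data, then runs a handful of small steps on a narrow, made-up dataset. Every number and name here is hypothetical.

```python
def nudge(w, examples, steps, learning_rate):
    """Run a few gradient steps on w for the model y = w * x (approximately)."""
    for _ in range(steps):
        for x, y in examples:
            w -= learning_rate * 2 * (w * x - y) * x
    return w


pretrained_w = 3.0                          # pretend this came out of broad training
narrow_examples = [(1.0, 3.5), (2.0, 7.0)]  # hypothetical domain data, y = 3.5x

# Few steps and a small learning rate: adjust the model, do not rebuild it.
fine_tuned_w = nudge(pretrained_w, narrow_examples, steps=5, learning_rate=0.01)
print(pretrained_w, round(fine_tuned_w, 3))
```

The dial moves toward the narrow data without jumping all the way there, which is the "adjust, not rebuild" behaviour described above.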

That broad, pre-trained starting point has a name: a foundation model. It is a model trained on a wide sweep of general data so it can be pointed at many tasks rather than one. When the foundation model works on text, we usually call it a large language model, or LLM. So the relationship is simple: an LLM is a foundation model whose specialty is language. GPT-4o and Llama 3.1 are LLMs. A general image model is a foundation model that is not an LLM. The words stack rather than compete.

Use time: prompt, token, and context

Now the model is built and frozen. Everything from here is about feeding it. A prompt is simply the text you give it: your question, your instruction, the document you paste in. Whatever you type becomes the model’s starting point.

Before the model can work with your prompt, a tokenizer chops the text into tokens. A token is a small chunk, often a whole word, sometimes part of one, sometimes just a punctuation mark. “Cat” is one token. “Unbelievable” might split into “un”, “believ”, and “able”. The model never sees letters or whole sentences; it sees a stream of these tokens, and almost everything about cost and length is measured in them.
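Here is a toy illustration of that chopping. It is a greedy longest-match splitter over a tiny hand-picked vocabulary, which is an assumption made for readability. Production tokenizers learn their vocabulary from data and use more involved algorithms, so treat this as a sketch of the idea rather than how any real tokenizer works.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of text into known chunks."""
    tokens = []
    i = 0
    text = text.lower()  # simplification: ignore capitalisation
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary, hand-picked for this example.
VOCAB = {"cat", "un", "believ", "able", " ", "the", "."}

print(tokenize("Cat", VOCAB))           # ['cat']
print(tokenize("Unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

Counting the items in the returned list is, in miniature, how length and cost get measured.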

The context is everything the model can hold in view at once: your prompt plus its own reply so far, all measured in tokens. It is the model’s short-term working memory. Once a conversation grows past the size of that window, the earliest parts fall out of view, which is why a long chat can seem to forget how it began. We devote a whole part to that window and its limits later.
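The "falling out of view" effect can be sketched in a few lines. This assumes the simplest possible policy, keeping only the most recent tokens, and a deliberately tiny window. Real systems may handle overflow differently (for example by refusing the request or summarising earlier turns), so the function below is illustrative only.

```python
def fit_to_window(tokens, window_size):
    """Keep only the most recent tokens that fit in the window."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]


# A made-up conversation, already split into tokens.
conversation = ["my", "name", "is", "ada", ".", "what", "is", "my", "name", "?"]
visible = fit_to_window(conversation, window_size=6)
print(visible)  # ['.', 'what', 'is', 'my', 'name', '?']
```

The name given at the start is no longer in view by the time the question arrives, which is the forgetting described above.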

One request, five terms: prompt, token, context, inference, and the response it returns.

Running it: inference, and why it costs

Inference is the word for actually running the model to get an answer. Training built the model, inference uses it. Every time you press enter, an inference happens: the model reads your tokens and generates new ones in reply. This matters more than it sounds, because training is a one-time bill while inference happens on every single request, forever. A later part is devoted entirely to that asymmetry, since it is where most of the real-world money goes.
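A little arithmetic shows why the asymmetry matters. All figures below are hypothetical and in arbitrary currency units. The only point is the shape: one number is fixed, the other grows with every request.

```python
def cumulative_inference_cost(cost_per_1k_requests, requests):
    """Total inference spend after a given number of requests."""
    return cost_per_1k_requests * requests / 1000


# Hypothetical figures, chosen only to make the comparison easy to read.
TRAINING_COST = 1_000_000   # paid once
COST_PER_1K_REQUESTS = 10   # paid again for every thousand requests

for requests in (1_000_000, 100_000_000, 1_000_000_000):
    inference = cumulative_inference_cost(COST_PER_1K_REQUESTS, requests)
    print(requests, inference, inference >= TRAINING_COST)
```

With these made-up numbers, inference spend is negligible at a million requests, matches the training bill at a hundred million, and is ten times larger at a billion.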

One term sits between the words and the math and deserves a moment: the embedding. A model cannot do arithmetic on the word “cat” directly, so each token is turned into a list of numbers, a set of coordinates that places its meaning in a vast space. Words with similar meanings land near each other, which is how a model can tell that “king” and “queen” are related while “king” and “bicycle” are not. Embeddings are the translation layer that makes everything else possible, and they get their own part too.
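"Near each other" is something you can compute. The sketch below uses cosine similarity, a common closeness measure for embeddings, on three hand-written vectors. The coordinates are invented for the example and have only three dimensions. Real embeddings are learned and typically have hundreds or thousands.

```python
import math


def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up coordinates, not taken from any real model.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "bicycle": [0.1, 0.0, 0.9],
}

royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
unrelated = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["bicycle"])
print(round(royal, 2), round(unrelated, 2))
```

With these toy vectors the king and queen score comes out close to 1, while the king and bicycle score is much lower.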

Turn words into coordinates and “similar meaning” becomes “close together”, something a computer can measure.

The word for when it goes wrong: hallucination

The last term everyone trips over is hallucination. It is the name for when a model states something false with total confidence: a made-up citation, a wrong date, a quote nobody said. It is not lying, because lying needs a notion of truth to push against, and the model has none. It is doing exactly what it always does, producing a plausible next token, except this time plausible and true happen to part ways. Because the same machinery that writes a correct sentence writes the wrong one, the model gives you no warning. That is why hallucination gets its own dedicated part later, and why “check it against a real source” is the single most useful habit in this whole field.

Reality check: a confident tone is not evidence of accuracy. The model has one voice for facts it captured correctly and facts it invented, and it cannot tell you which is which. Fluency is the default, not a signal. Treat the polish as style, and verify anything that matters.
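A caricature makes the mechanism visible. The toy "model" below only counts which word tends to follow which in a tiny made-up corpus, then returns the most frequent continuation. It is far cruder than a real LLM, which conditions on the whole context, but it shows how "most plausible next token" can be fluent and wrong with no warning.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count which token follows which in the training text."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows


def next_token(follows, prompt):
    """Return the most frequent continuation of the prompt's last token."""
    last = prompt.split()[-1]
    candidates = follows[last].most_common(1)
    return candidates[0][0] if candidates else None


# Made-up training text for the example.
CORPUS = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of italy is rome",
]

model = train_bigrams(CORPUS)
print(next_token(model, "the capital of france is"))     # paris
print(next_token(model, "the capital of australia is"))  # paris
```

The second answer is false (the capital of Australia is Canberra), yet it comes out of exactly the same code path as the correct one, with nothing in the output to tell them apart.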

Go Deeper (optional, for technical readers)

People use parameters, weights, and activations loosely, but they are three different things. Weights (together with biases) are the learned numbers stored in the model. They are the parameters: fixed after training, identical for every request, and they are what the file on disk actually contains. A “7B model” has roughly 7 billion of them.

Activations are different. They are the intermediate values computed during a single forward pass, as your specific tokens flow through the layers. They are temporary, they differ for every prompt, and they exist only for the duration of that inference. A rough analogy: weights are the wiring of a calculator, activations are the numbers lighting up on the display while you press keys. This distinction matters for memory. Weights set the baseline footprint you need just to load the model, while activations, especially the growing key-value cache across a long context, drive how much extra memory each request consumes. Parts later in the series on the context window, on quantization, and on the memory wall all turn on exactly this split between what is stored once and what is computed every time.
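The split between "stored once" and "computed every time" can be estimated with back-of-the-envelope arithmetic. The sketch below assumes 16-bit values (2 bytes each) and a hypothetical configuration of 32 layers, 32 key-value heads, and a head dimension of 128. The key-value cache formula is the usual estimate for standard multi-head attention, and real deployments vary with architecture and precision.

```python
def weight_bytes(n_parameters, bytes_per_parameter):
    """Baseline memory needed just to load the model."""
    return n_parameters * bytes_per_parameter


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value):
    """Extra memory one request needs for its key-value cache."""
    # 2 = one key vector and one value vector per token, per layer, per head.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GIB = 1024 ** 3

# Hypothetical 7B model stored in 16-bit precision.
weights = weight_bytes(7_000_000_000, 2)
# Assumed configuration and a 4,096-token context, for illustration only.
cache = kv_cache_bytes(32, 32, 128, 4096, 2)

print(round(weights / GIB, 1))  # 13.0
print(cache / GIB)              # 2.0
```

Under these assumptions the weights cost about 13 GiB no matter what you ask, while each 4,096-token request adds roughly 2 GiB on top, and that second figure grows with context length.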

New here? This is Part 2 of a 30-part walk from zero to the infrastructure behind production AI. If a term above felt rushed, the Generative AI Complete Guide maps out which later part covers it in full. New to the idea itself? Start with Part 1, what generative AI actually is.

The terms at a glance

Term	Plain meaning	Stage
Model	A giant function that turns an input into a prediction	Build
Parameters	The learned numbers (dials) inside the model	Build
Training	Setting the dials from examples	Build
Fine-tuning	Extra targeted training on a narrow set	Build
Foundation model	A broad model pointed at many tasks	Build
LLM	A foundation model specialised in language	Build
Prompt	The text you give the model	Use
Token	A small chunk of text the model reads	Use
Context	Everything the model can see at once	Use
Inference	Running the model to get an answer	Use
Embedding	A token turned into meaning-coordinates	Bridge
Hallucination	Confident output that is false	Use

Build time sets the model up; use time runs it; embeddings bridge the two.

The Bottom Line

The vocabulary of generative AI is not a pile of unrelated buzzwords. It is two short stories. Building the model gives you training, parameters, fine-tuning, and the labels foundation model and LLM. Using the model gives you prompt, token, context, inference, and the risk of hallucination, with embeddings as the bridge that turns words into numbers in between. Keep that build-then-use split in your head and you can place almost any new term you meet. Next, we trace how the field actually got here, from rigid if-statements to ChatGPT. Which of these words has been tripping you up the most?


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
