Dr. Pranay Jha

The GenAI Words Everyone Uses, and What They Actually Mean (GenAI Series, Part 2)

Model, tokens, parameters, inference, embeddings, hallucination: the words everyone uses about generative AI, sorted into build time and use time and explained in plain English.

Read time: 10 minutes

Generative AI Series · Part 2 of 30

TL;DR · Key Takeaways

  • The jargon is not random. Almost every term belongs to one of two moments: building the model or using it.
  • Build time gives you training, parameters, fine-tuning, and the labels LLM and foundation model.
  • Use time gives you prompt, token, context, inference, and the failure mode everyone fears, hallucination.
  • Embedding is the bridge: it is how words get turned into numbers so the math can happen at all.

The fastest way to feel lost in an AI conversation is the vocabulary. People throw around tokens, parameters, inference, and embeddings as if everyone agreed on them years ago. The good news is that the words are not as scattered as they sound. Nearly all of them describe one of two moments in a model’s life: the moment it is built, and the moment you use it. Sort each term into the right moment and the jargon stops being a wall and starts being a map. That sorting is what this part does.

[Figure: The jargon map: build it, then use it. Build time: Training (learn from data), Parameters (the dials it sets), Fine-tuning (extra targeted training), Foundation model (broad base model), LLM (a text foundation model). The result: The Model (frozen patterns, ready to run). Use time: Prompt (what you type in), Token (a small chunk of text), Context (all it can see at once), Inference (running the model), Hallucination (confident wrong output). Build once, slowly. Use many times, quickly.]
Sort each term into build time or use time and the vocabulary turns into a map.

The thing itself: model and parameters

Start with the model. We give it a full part of its own later, but the short version is that a model is a giant mathematical function that turns an input into a prediction. It is not a database of answers. It is a set of patterns squeezed into numbers.

Those numbers are the parameters. Think of a mixing desk in a studio with millions of dials. Each parameter is one dial, and the exact position of every dial is what makes this model behave the way it does. When you read that a model is “8 billion” or “70 billion”, that figure is the number of dials. More dials usually means more capacity to capture nuance, though it is a rough guide rather than a guarantee of quality. The whole point of building a model is to find good settings for all those dials, and that search has a name of its own.
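To make "the number of dials" concrete, here is a minimal sketch that counts the parameters in a tiny fully connected network. The layer sizes and the helper name `dense_layer_params` are made up for illustration; real language models use other layer types and are many orders of magnitude larger, but the counting idea is the same.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """Dials in one fully connected layer.

    One weight per input-output pair, plus one bias per output.
    """
    return n_inputs * n_outputs + n_outputs


# Hypothetical toy network: 4 inputs -> 8 hidden units -> 2 outputs.
layer_sizes = [4, 8, 2]

total_params = sum(
    dense_layer_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)

print(total_params)  # (4*8 + 8) + (8*2 + 2) = 58
```

A model described as “8 billion” is the same sum carried out over far more, and far wider, layers.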

Build time: training, fine-tuning, and the labels

Training is the process of setting those dials by example. The model is shown enormous amounts of text, guesses what comes next, gets told how wrong it was, and nudges its dials a little to be less wrong next time. Repeat that billions of times and the parameters settle into positions that capture real patterns of language. Training is slow and expensive, and for a given model it happens once. It is the months of studio work before the album ships.
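The guess, measure, nudge loop can be sketched with a model that has exactly one dial. This is a toy under stated assumptions: the data, the learning rate, and the pattern (y = 3x) are invented for illustration, and a real language model predicts tokens rather than numbers. The shape of the loop is the part that carries over.

```python
# Toy training data: the hidden pattern is y = 3 * x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0               # the single dial, starting at an arbitrary position
learning_rate = 0.01  # how hard each nudge is (an illustrative choice)

for step in range(500):
    for x, target in examples:
        guess = w * x                    # make a guess
        error = guess - target           # how wrong was it?
        w -= learning_rate * error * x   # nudge the dial to be less wrong

print(round(w, 3))  # 3.0
```

Nobody told the loop that the answer was 3. It found that setting by being slightly less wrong, many times over.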

Fine-tuning is a shorter, more targeted round of the same thing. You take a model that already knows language broadly and train it a bit more on a narrow set of examples, say your company’s support tickets or a particular writing style. It does not rebuild the model; it adjusts it. We compare fine-tuning against other options in a later part, because people reach for it more often than they should.
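As a rough sketch of "adjust, not rebuild", the toy below trains a one-dial model on broad data first, then continues training briefly on a narrow set. Everything here is hypothetical: the `train` helper, both datasets, and the step counts are illustrative choices, not how any production fine-tuning pipeline is configured.

```python
def train(w: float, examples, learning_rate: float, steps: int) -> float:
    """Nudge a single dial w toward the pattern in examples; return the new value."""
    for _ in range(steps):
        for x, target in examples:
            error = w * x - target
            w -= learning_rate * error * x
    return w


general_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # broad pattern: y = 3x
narrow_data = [(1.0, 3.5), (2.0, 7.0)]               # niche pattern: y = 3.5x

# Build time, round one: long training from scratch on broad data.
pretrained_w = train(0.0, general_data, learning_rate=0.01, steps=500)

# Build time, round two: a short extra pass that starts from the pretrained dial.
fine_tuned_w = train(pretrained_w, narrow_data, learning_rate=0.01, steps=20)

print(pretrained_w, fine_tuned_w)
```

The fine-tuned dial ends up between the broad setting and the niche one: it has shifted toward the new examples without starting over from zero.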

That broad, pre-trained starting point has a name: a foundation model. It is a model trained on a wide sweep of general data so it can be pointed at many tasks rather than one. When the foundation model works on text, we usually call it a large language model, or LLM. So the relationship is simple: an LLM is a foundation model whose specialty is language. GPT-4o and Llama 3.1 are LLMs. A general image model is a foundation model that is not an LLM. The words stack rather than compete.

Use time: prompt, token, and context

Now the model is built and frozen. Everything from here is about feeding it. A prompt is simply the text you give it: your question, your instruction, the document you paste in. Whatever you type becomes the model’s starting point.

Before the model can work with your prompt, it chops the text into tokens. A token is a small chunk, often a whole word, sometimes part of one, sometimes just a punctuation mark. “Cat” is one token. “Unbelievable” might split into “un”, “believ”, and “able”. The model never sees letters or whole sentences; it sees a stream of these tokens, and almost everything about cost and length is measured in them.
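Here is a minimal sketch of that chopping step. It uses a greedy longest-match rule over a tiny hand-written vocabulary, which is a simplification: real tokenizers learn their vocabulary from data and use more involved algorithms, so treat `VOCAB` and `tokenize` as hypothetical stand-ins.

```python
# Hypothetical mini-vocabulary; real tokenizers learn tens of thousands of pieces.
VOCAB = {"cat", "un", "believ", "able", " ", "."}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split.

    Characters not covered by VOCAB become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # fallback: unknown character
            i += 1
    return tokens


print(tokenize("cat"))           # ['cat']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

One short word costs one token, one long word costs three. That difference is why bills and limits are quoted in tokens rather than words.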

The context is everything the model can hold in view at once: your prompt plus its own reply so far, all measured in tokens. It is the model’s short-term working memory. Once a conversation grows past the size of that window, the earliest parts fall out of view, which is why a long chat can seem to forget how it began. We devote a whole part to that window and its limits later.
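The "earliest parts fall out of view" behaviour can be sketched as keeping only the most recent tokens. This is one simple policy, assumed here for illustration; actual products may instead refuse the request, summarise old turns, or trim in other ways. The function name and the window size are hypothetical.

```python
def fit_to_context(tokens: list[str], window_size: int) -> list[str]:
    """Keep only the most recent window_size tokens; older ones fall out of view."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]


conversation = ["my", "name", "is", "Ada", ".", "what", "is", "my", "name", "?"]

visible = fit_to_context(conversation, window_size=6)
print(visible)  # ['.', 'what', 'is', 'my', 'name', '?']
```

The question survives, but the token that answers it (“Ada”) has already slid out of the window, which is the forgetting effect described above.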

[Figure: Anatomy of a prompt and response. Inside the context window, the prompt “Write a haiku about Mondays” is shown as tokens (Write, a, hai, ku, Mondays): tokens, not letters or whole sentences. Inference runs the model. The response (“Grey skies on my desk / coffee cooling, inbox full / the week clears its throat”) is also tokens, one at a time. Prompt and reply both live inside the context window and are both counted in tokens.]
One request, five terms: prompt, token, context, inference, and the response it returns.

Running it: inference, and why it costs

Inference is the word for actually running the model to get an answer. Training built the model; inference uses it. Every time you press enter, an inference happens: the model reads your tokens and generates new ones in reply. This matters more than it sounds, because training is a one-time bill while inference happens on every single request, forever. A later part is devoted entirely to that asymmetry, since it is where most of the real-world money goes.
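The "reads your tokens and generates new ones" loop can be sketched with a frozen lookup table standing in for the model. This is a deliberate simplification: the table `NEXT_TOKEN` is hypothetical, and it looks only at the last token, whereas a real model conditions on the whole context and picks from a probability distribution.

```python
# Hypothetical "frozen model": for each token, the single most likely next token.
NEXT_TOKEN = {
    "grey": "skies",
    "skies": "on",
    "on": "my",
    "my": "desk",
    "desk": "<end>",
}


def generate(prompt_tokens: list[str], max_new_tokens: int = 10) -> list[str]:
    """Run inference: repeatedly predict one token and append it.

    Assumes prompt_tokens is non-empty.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["grey"]))  # ['grey', 'skies', 'on', 'my', 'desk']
```

Nothing in the table changes while this runs. Each generated token is one more pass through the frozen model, which is why longer replies cost more.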

One term sits between the words and the math and deserves a moment: the embedding. A model cannot do arithmetic on the word “cat” directly, so each token is turned into a list of numbers, a set of coordinates that places its meaning in a vast space. Words with similar meanings land near each other, which is how a model can tell that “king” and “queen” are related while “king” and “bicycle” are not. Embeddings are the translation layer that makes everything else possible, and they get their own part too.

[Figure: Embeddings: meaning as coordinates. Each axis is one of many learned dimensions of meaning. “king” and “queen” sit together in a royalty cluster, “cat” and “dog” in an animal cluster, and “bicycle” is far from both.]
Turn words into coordinates and “similar meaning” becomes “close together”, something a computer can measure.
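A small sketch of "close together" as something measurable: cosine similarity between vectors. The three-number embeddings below are invented by hand for illustration; real embeddings are learned during training and typically have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for this example.
EMBEDDINGS = {
    "king":    [0.9, 0.8, 0.1],
    "queen":   [0.9, 0.7, 0.2],
    "bicycle": [0.1, 0.0, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
unrelated = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["bicycle"])

print(royal > unrelated)  # True
```

With these made-up coordinates, “king” scores much closer to “queen” than to “bicycle”, which is the kind of comparison the figure describes.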

The word for when it goes wrong: hallucination

The last term everyone trips over is hallucination. It is the name for when a model states something false with total confidence: a made-up citation, a wrong date, a quote nobody said. It is not lying, because lying needs a notion of truth to push against, and the model has none. It is doing exactly what it always does, producing a plausible next token, except this time plausible and true happen to part ways. Because the same machinery that writes a correct sentence writes the wrong one, the model gives you no warning. That is why hallucination gets its own dedicated part later, and why “check it against a real source” is the single most useful habit in this whole field.
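A toy illustration of plausible and true parting ways. The probabilities below are invented for this example and do not come from any real model; the point is only that choosing the most likely next token never consults a source of truth.

```python
# Hypothetical next-token probabilities after the prompt
# "The capital of Australia is". The numbers are made up for illustration.
next_token_probs = {"Sydney": 0.55, "Canberra": 0.40, "Melbourne": 0.05}

TRUE_ANSWER = "Canberra"  # known to us, not to the model

# The model's only rule: pick what looks most plausible.
chosen = max(next_token_probs, key=next_token_probs.get)

print(chosen)                 # Sydney
print(chosen == TRUE_ANSWER)  # False
```

The selection step is identical whether the top candidate is right or wrong, so nothing in the output signals the difference.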

Reality check: a confident tone is not evidence of accuracy. The model has one voice for facts it captured correctly and facts it invented, and it cannot tell you which is which. Fluency is the default, not a signal. Treat the polish as style, and verify anything that matters.

Go Deeper (optional, for technical readers)

People use parameters, weights, and activations loosely, but the terms are not interchangeable. Weights, together with biases, are the learned numbers stored in the model, and between them they make up the parameters: fixed after training, identical for every request, and what the file on disk actually contains. A “7B model” has roughly 7 billion of them.

Activations are different. They are the intermediate values computed during a single forward pass, as your specific tokens flow through the layers. They are temporary, they differ for every prompt, and they exist only for the duration of that inference. A rough analogy: weights are the wiring of a calculator, activations are the numbers lighting up on the display while you press keys. This distinction matters for memory. Weights set the baseline footprint you need just to load the model, while activations, especially the growing key-value cache across a long context, drive how much extra memory each request consumes. Parts later in the series on the context window, on quantization, and on the memory wall all turn on exactly this split between what is stored once and what is computed every time.
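The stored-once versus computed-every-time split can be put into back-of-the-envelope arithmetic. The sketch below assumes 2 bytes per number (16-bit precision) and a hypothetical 7B-class configuration; the layer count, key-value head count, and head size are illustrative values, not the published settings of any particular model.

```python
GIB = 1024 ** 3


def weight_memory_bytes(n_params: int, bytes_per_param: int) -> int:
    """Baseline footprint: every parameter is stored once, whatever the request."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int) -> int:
    """Per-request footprint of the key-value cache.

    Factor of 2: one key vector and one value vector
    per layer, per KV head, per token in the context.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical configuration, chosen for round numbers.
weights = weight_memory_bytes(n_params=7_000_000_000, bytes_per_param=2)
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                       n_tokens=8192, bytes_per_value=2)

print(f"weights: {weights / GIB:.1f} GiB")   # weights: 13.0 GiB
print(f"kv cache: {cache / GIB:.1f} GiB")    # kv cache: 1.0 GiB
```

Under these assumptions the weights cost the same no matter what you ask, while the cache grows in direct proportion to `n_tokens` and is paid again for every concurrent request.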

New here? This is Part 2 of a 30-part walk from zero to the infrastructure behind production AI. If a term above felt rushed, the Generative AI Complete Guide maps out which later part covers it in full. New to the idea itself? Start with Part 1, what generative AI actually is.

The terms at a glance

| Term | Plain meaning | Stage |
| --- | --- | --- |
| Model | A giant function that predicts the next token | Use |
| Parameters | The learned numbers (dials) inside the model | Build |
| Training | Setting the dials from examples | Build |
| Fine-tuning | Extra targeted training on a narrow set | Build |
| Foundation model | A broad model pointed at many tasks | Build |
| LLM | A foundation model specialised in language | Build |
| Prompt | The text you give the model | Use |
| Token | A small chunk of text the model reads | Use |
| Context | Everything the model can see at once | Use |
| Inference | Running the model to get an answer | Use |
| Embedding | A token turned into meaning-coordinates | Bridge |
| Hallucination | Confident output that is false | Use |

Build time sets the model up; use time runs it; embeddings bridge the two.

The Bottom Line

The vocabulary of generative AI is not a pile of unrelated buzzwords. It is two short stories. Building the model gives you training, parameters, fine-tuning, and the labels foundation model and LLM. Using the model gives you prompt, token, context, inference, and the risk of hallucination, with embeddings as the bridge that turns words into numbers in between. Keep that build-then-use split in your head and you can place almost any new term you meet. Next, we trace how the field actually got here, from rigid if-statements to ChatGPT. Which of these words had been tripping you up the most?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
