Dr. Pranay Jha


What a Model Really Is (GenAI Series, Part 4)

A model is not a database of answers. It is one large function that predicts the next token, built from billions of parameters. This part explains what model sizes and open versus closed weights really mean.

Read time: 9 minutes

Generative AI Series · Part 4 of 30

TL;DR · Key Takeaways

  • A model is not a database of answers. It is one very large function that does a single trick: predict the next token, over and over.
  • The parameters are the millions or billions of numbers inside that function. They are the model. The file you download is mostly just those numbers.
  • Labels like 7B and 70B are counts of those parameters, and frontier models run larger still. Bigger usually means more capable, but it is a rough guide, not a promise.
  • Open weights mean you can download the numbers and run them yourself. Closed weights stay behind a company’s door and you reach them through an API.

People say “the model” constantly, as if it were obvious what one is. It is worth slowing down, because almost every later idea in this series (cost, speed, memory, fine-tuning) hangs on what a model actually is under the hood. The short version may surprise you: a modern AI model is not a brain, not a search engine, and not a vault of stored facts. It is a single mathematical function, frozen after training, that does one humble thing astonishingly well. Once you see that one thing clearly, a lot of the mystery drains away.

[Figure: One trick, repeated: predict the next token. The input so far, “The cat sat on the”, goes into the model (one big function), which returns ranked guesses for the next token: mat 41%, floor 22%, sofa 14%. Pick one, add it to the input, and run the whole thing again.]
Every paragraph a model writes is this one step, taken hundreds of times in a row.

A model is a function that guesses what comes next

At its core, a language model does one thing: given some text so far, it predicts what token should come next. That is the entire job. Feed it “The cat sat on the” and it produces a ranked list of likely continuations, with “mat” near the top and “photosynthesis” near the bottom. It picks one, sticks it onto the end, and runs again on the slightly longer text. Repeat a few hundred times and you have a paragraph. Everything a chatbot does (answering a question, writing code, drafting an email) is this same next-token guess, looped.
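The loop is small enough to sketch in a few lines. What follows is a toy, not a real model: the lookup table, its probabilities, and the names `TOY_TABLE`, `predict_next`, and `generate` are all invented for illustration, and the sketch always takes the top guess. A real model computes its guesses from billions of parameters instead of reading them from a table.

```python
# Toy stand-in for a language model. The table below is hand-written and
# hypothetical; a real model computes these probabilities from its parameters.
# The three figures for ("on", "the") are illustrative and omit the rest of
# the vocabulary, so they do not sum to 1.
TOY_TABLE = {
    ("on", "the"): {"mat": 0.41, "floor": 0.22, "sofa": 0.14},
    ("the", "mat"): {".": 1.0},
    ("mat", "."): {"<end>": 1.0},
}


def predict_next(tokens):
    """Return candidate next tokens with probabilities for the text so far."""
    context = tuple(tokens[-2:])  # this toy only looks at the last two tokens
    return TOY_TABLE.get(context, {"<end>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Repeat the single step: guess, append, run again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        guesses = predict_next(tokens)
        next_token = max(guesses, key=guesses.get)  # always take the top guess
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["the", "cat", "sat", "on", "the"])))
# prints: the cat sat on the mat .
```

Notice that `generate` contains no knowledge at all. Everything the toy "knows" sits in the table, which is the part a real model replaces with parameters.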

This is why it helps to think of a model as a function rather than a mind. A function takes an input and returns an output by a fixed rule. Type the same prompt into the same model with the same settings and the same machinery runs every time. The reason replies vary is not that the model is thinking differently. It is that we usually let it pick from among its top guesses with a dash of randomness, a knob we will cover in the part on temperature. The function itself is fixed the moment training ends. It does not learn from your chat, it does not remember yesterday, and it holds no live connection to the world. It is a very large, very still set of numbers waiting to be run.
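A short sketch can separate the two ideas: the fixed function and the randomness layered on top of it. The raw scores and the names `SCORES`, `to_probabilities`, and `sample_tokens` are hypothetical. The temperature-scaled softmax shown here is one common way to turn scores into probabilities, offered as an illustration, not as how any particular product works.

```python
import math
import random

CANDIDATES = ["mat", "floor", "sofa"]
SCORES = [2.0, 1.4, 0.9]  # hypothetical raw scores from a frozen model


def to_probabilities(scores, temperature=1.0):
    """Softmax with temperature: lower values sharpen, higher values flatten."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(seed, count=5, temperature=1.0):
    """Draw `count` tokens using a seeded random generator."""
    rng = random.Random(seed)
    probs = to_probabilities(SCORES, temperature)
    return rng.choices(CANDIDATES, weights=probs, k=count)


# The function's output is identical on every call with the same input.
print(to_probabilities(SCORES) == to_probabilities(SCORES))  # prints: True
# Same seed, same picks; change the seed and the picks can differ.
print(sample_tokens(seed=1) == sample_tokens(seed=1))  # prints: True
```

The variation users see comes entirely from the sampling step. Fix the seed and the settings, and this sketch repeats itself exactly.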

The numbers inside: parameters are the model

So what is the function made of? Numbers, mostly. A model is a long chain of multiplications and additions, and every one of those operations uses a stored number called a parameter (also called a weight). In Part 2 we pictured these as the dials on a giant mixing desk. Training is the process of slowly turning all the dials to the settings that make the next-token guesses good. When training finishes, the dials lock. Those locked numbers are the model. When you download a model, the file you get is essentially a huge list of these parameters, plus a little configuration describing how they fit together.
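To make "multiplications and additions using stored numbers" concrete, here is a minimal sketch of a single tiny layer. The values in `WEIGHTS` and `BIASES` are made up, and a real model stacks many far larger layers with extra steps in between, but the basic arithmetic is the same kind.

```python
# A tiny layer with 2 inputs and 2 outputs. The values are made up;
# in a real model, training sets them and then they stay fixed.
WEIGHTS = [
    [0.5, -1.0],   # weights feeding output 0
    [2.0, 0.25],   # weights feeding output 1
]
BIASES = [0.5, -0.25]


def layer(inputs):
    """Multiply each input by a weight, add them up, add a bias."""
    outputs = []
    for row, bias in zip(WEIGHTS, BIASES):
        total = bias
        for weight, value in zip(row, inputs):
            total += weight * value
        outputs.append(total)
    return outputs


def count_parameters():
    """Every stored number counts: 4 weights + 2 biases."""
    return sum(len(row) for row in WEIGHTS) + len(BIASES)


print(layer([1.0, 2.0]))   # prints: [-1.0, 2.25]
print(count_parameters())  # prints: 6
```

This toy has 6 parameters. A 7B model is the same idea with roughly seven billion of them.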

This is also why a model knows things without storing them like a library does. There is no row in a table that says “Paris is the capital of France.” Instead, that fact is smeared across millions of parameter values that together make “Paris” the likely answer when the surrounding words point that way. The knowledge lives in the pattern of the numbers, not in any single slot. It is closer to muscle memory than to a filing cabinet, which is exactly why a model can be fluent and confidently wrong in the same breath: it is reproducing a pattern, not looking up a record.

What “7B” and “70B” actually mean

When you see a model described as “7B” or “70B,” that letter B means billion, and the number counts its parameters. A 7B model has roughly seven billion of those stored numbers; a 70B model has about ten times as many. The largest frontier models from the big labs are not always public about their exact size, but they run into the hundreds of billions, and some use designs that push effectively into the trillions. More parameters give a model more room to capture subtle patterns, so as a rough rule, bigger models tend to be more capable.

But size carries a cost that matters in practice. In a standard dense model, every parameter has to be held in memory and pushed through math on every token, so a bigger model needs more expensive hardware, runs slower, and costs more per answer. That is the trade at the heart of the whole infrastructure half of this series. A small 7B model can run on a single decent GPU, or even a laptop, while a frontier model needs a cluster. Choosing a model size is really choosing a point on the line between capability and cost, and the right point depends entirely on the job.

[Figure: The size ladder: capability vs cost. Small (~7B) runs on a laptop or one GPU. Mid (~70B) needs a server-grade GPU. Frontier (100B+) needs a GPU cluster. Capability rises along the ladder, and so do cost, memory, and latency per answer.]
Picking a model size is picking a spot on the line between what it can do and what it costs to run.
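A back-of-the-envelope calculation shows why the ladder has the shape it does. The sketch below assumes each parameter is stored as a 16-bit number (2 bytes) and counts the weights only. Real deployments also need memory for the running conversation and working buffers, so treat these figures as a floor. The 24 GB budget and the function names are hypothetical.

```python
BYTES_PER_PARAM_16BIT = 2  # assumption: weights stored as 16-bit numbers


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_16BIT):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_device(num_params, device_memory_gb):
    """True if the weights alone fit; real use needs extra headroom."""
    return weight_memory_gb(num_params) <= device_memory_gb


SMALL = 7_000_000_000    # a "7B" model
MID = 70_000_000_000     # a "70B" model
GPU_MEMORY_GB = 24       # hypothetical single-GPU memory budget

print(weight_memory_gb(SMALL))               # prints: 14.0
print(weight_memory_gb(MID))                 # prints: 140.0
print(fits_on_device(SMALL, GPU_MEMORY_GB))  # prints: True
print(fits_on_device(MID, GPU_MEMORY_GB))    # prints: False
```

Ten times the parameters means ten times the memory for the weights, which is the step from one GPU to several.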

Open weights vs closed weights

One more split shapes nearly every practical decision: can you have the numbers, or not? With an open-weights model, the company releases the actual parameter file for anyone to download, run on their own hardware, inspect, and adapt. Meta’s Llama family and models from Mistral are well-known examples. With a closed-weights model, the parameters stay private inside the company. You never touch the file; you send your text to their servers over an API and get a reply back. OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini work this way.

Neither is simply better; they trade different things. Open weights give you control, privacy, and the option to run entirely on your own infrastructure, which is a large part of why the later infrastructure parts of this series lean on them. In exchange, you own the work of hosting and the hardware bill. Closed models are easy to start with and often sit at the cutting edge of quality, but you rely on a vendor, send your data to them, and pay per use. This is the first glimpse of the build-versus-buy question we will return to in full once the infrastructure phase begins.

[Figure: Two ways to reach a model. Open weights: you download the parameter file and run it on your own hardware. Plus: control, privacy, full ownership. Minus: you host it and buy the hardware. Examples: Llama, Mistral. Closed weights: the file stays with the vendor and you call it over an API. Plus: easy to start, often top quality. Minus: vendor dependence, data leaves you. Examples: GPT-4o, Claude, Gemini.]
The same kind of model, two very different ownership models. The choice shapes cost, privacy, and control.
Reality check: a model has no memory of you between conversations and no live link to the world. By default it knows only what was baked into its parameters at training time, which is why it can be confidently out of date. Giving it fresh, private, or real-time facts is a separate job, and it has a name we cover later: retrieval.
Go Deeper (optional, for technical readers)

Parameter count is a convenient headline, but it is a weak predictor of quality on its own. Three other factors often matter more. First, training data: the 2022 Chinchilla work showed that many large models of the time were undertrained for their size, and that a smaller model trained on more tokens can beat a larger model trained on fewer. Quality and quantity of data frequently outweigh raw parameter count.
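The Chinchilla result is often boiled down to a rule of thumb of roughly 20 training tokens per parameter. That figure, and the model sizes and token counts in the sketch below, are recalled from the published paper, not taken from this article, so treat them as approximate. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rough heuristic often quoted from Chinchilla, not an exact law


def suggested_training_tokens(num_params):
    """Training tokens the heuristic suggests for a model of this size."""
    return num_params * TOKENS_PER_PARAM


def tokens_per_param(num_params, num_tokens):
    """How many training tokens a model actually saw per parameter."""
    return num_tokens / num_params


# Approximate published figures, used here only as an illustration.
CHINCHILLA_PARAMS = 70_000_000_000
GOPHER_PARAMS = 280_000_000_000
GOPHER_TOKENS = 300_000_000_000

print(suggested_training_tokens(CHINCHILLA_PARAMS))  # prints: 1400000000000
print(round(tokens_per_param(GOPHER_PARAMS, GOPHER_TOKENS), 2))  # prints: 1.07
```

By this yardstick, the larger model saw about one token per parameter where the heuristic suggests about twenty, which is what "undertrained" means here.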

Second, architecture and tuning. A mixture-of-experts model may list a huge total parameter count yet activate only a fraction of it per token, so its effective compute and its memory footprint tell different stories, something we unpack in the part on MoE. Post-training steps like instruction tuning and preference alignment can lift a model’s usefulness far more than adding parameters would. Third, precision: the same parameters stored at lower numerical precision through quantization shrink the memory bill, often with little quality loss, which is why “how many parameters” and “how much GPU memory” are related but not the same question. The honest summary: parameter count sets a rough ceiling on capacity, but data, architecture, tuning, and precision decide how much of that capacity becomes real-world quality.
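Two of these points reduce to simple arithmetic. The sketch below uses nominal byte sizes for common precisions (real quantized formats carry a little extra overhead) and a purely hypothetical mixture-of-experts model with 100B total and 25B active parameters. None of these numbers describe a specific product.

```python
# Nominal storage per parameter; real quantized files add small overheads.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone at a given precision (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def active_fraction(total_params, active_params):
    """Share of a mixture-of-experts model used for each token."""
    return active_params / total_params


SEVEN_B = 7_000_000_000
for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(SEVEN_B, precision))
# prints: fp32 28.0, then fp16 14.0, then int8 7.0, then int4 3.5

# Hypothetical MoE model: memory follows the total, compute follows the active part.
MOE_TOTAL = 100_000_000_000
MOE_ACTIVE = 25_000_000_000
print(weight_memory_gb(MOE_TOTAL, "fp16"))     # prints: 200.0
print(active_fraction(MOE_TOTAL, MOE_ACTIVE))  # prints: 0.25
```

The same 7B parameters range from 28 GB to 3.5 GB depending on precision, and the hypothetical MoE model must hold all 100B parameters in memory while doing the per-token work of a 25B model.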

This is Part 4 of a 30-part walk from zero to the infrastructure behind production AI. The full map of what comes next lives on the Generative AI Complete Guide. New here? Start at Part 1, what generative AI actually is, or brush up on the vocabulary in Part 2.

The Bottom Line

Strip away the hype and a model is refreshingly concrete. It is one large mathematical function whose only move is to predict the next token, built entirely from billions of fixed numbers called parameters that were tuned during training and then frozen. Sizes like 7B and 70B count those numbers and trace a rough line from small and cheap to large and capable. And whether the numbers are open for you to download or closed behind a vendor’s API will shape how you build with them. Hold on to the picture of a still, silent function made of numbers, because the next part puts it to the test: what can this thing genuinely do, and where does it fall flat? What size of model do you think your own use case actually needs?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
