Dr. Pranay Jha

Why AI Models Make Things Up (and What Temperature Does) (GenAI Series, Part 11)

AI models generate text by sampling likely words from a probability distribution. Why that produces confident hallucinations, what the temperature setting really does, and how to reduce the damage.

Generative AI Series · Part 11 of 30

TL;DR · Key Takeaways

  • A model does not pick the “true” next word. It picks a likely one, by sampling from a probability distribution over all possible tokens.
  • When the model’s knowledge is thin, the most likely-sounding word is often not the correct one. That gap, plausible but false, is a hallucination.
  • Temperature is the dial that controls how adventurous the sampling is. Low temperature plays it safe; high temperature gets creative, and looser with the truth.
  • You cannot fully eliminate hallucination, but you can reduce it: lower the temperature for facts, ground the model in real sources, and verify anything that matters.

In 2023, two lawyers filed a legal brief built on half a dozen court cases that did not exist. A chatbot had invented them, complete with plausible names, docket numbers, and quotations, and the lawyers had trusted it. The judge was not amused, and the story went around the world as a cautionary tale. What rarely got explained is the part that actually matters: the model was not malfunctioning when it did this. It was doing exactly what it always does. To see why a system can be so fluent and so wrong at the same time, you have to look at what happens when it chooses a single word.

[Figure: One word, a whole spread of guesses. Prompt: “The capital of Australia is …” Candidate probabilities: Sydney 46%, Canberra 34% (correct), Melbourne 12%, Perth 5%, others 3%. The model rolls weighted dice across this list. Most likely is not always right.]
The model never sees one answer. It sees a ranked field of candidates and has to pick.

Every word is a weighted roll of the dice

Go back to the core move from Part 4: the model predicts the next token. What it actually produces is not a single word but a probability for every token in its vocabulary, a long ranked list of candidates. After “The sky is” it might assign “blue” a high probability, “clear” a smaller one, “falling” a tiny one, and so on across tens of thousands of options. Then it has to commit to one, and it does that by sampling: picking from the list with each candidate’s chance of being chosen set by its probability. It is a roll of weighted dice, not a lookup of the right answer.
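
The weighted dice roll can be sketched in a few lines of Python. This is a toy illustration, not how any particular model is implemented: the probabilities are the made-up figures from the Australia example above, and `sample_next_token` is a hypothetical helper name.

```python
import random
from collections import Counter

# Illustrative next-token probabilities for "The capital of Australia is ..."
# (toy numbers from the example above, not real model output).
candidates = {
    "Sydney": 0.46,
    "Canberra": 0.34,
    "Melbourne": 0.12,
    "Perth": 0.05,
    "<other>": 0.03,
}

def sample_next_token(probs, rng):
    """Pick one token, with each token's chance equal to its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = Counter(sample_next_token(candidates, rng) for _ in range(10_000))
for token, count in draws.most_common():
    print(f"{token:10s} {count / 10_000:.1%}")
```

Over many draws the observed shares settle close to the listed probabilities, which means the correct answer, “Canberra”, comes up only about a third of the time.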

This design is a feature, not an accident. If a model always picked the single most probable word, it would be rigid and repetitive, giving the identical canned response to a prompt every time and writing prose with all the life of a tax form. The flexibility that lets it brainstorm, vary its phrasing, and sound human comes precisely from this willingness to sometimes choose a less-than-top option. The same mechanism that makes it a good writer is the mechanism that lets it wander off the facts. You do not get one without the other.

What the temperature dial actually does

Temperature is the setting that controls how reckless that dice roll is. Picture it as reshaping the list of candidates before the model draws from it. At low temperature, the probabilities get sharpened: the front-runner towers over everything else, so the model almost always takes the safest, most expected word. Output becomes focused, consistent, and a little dull. At high temperature, the probabilities get flattened: the gap between the likely and the unlikely narrows, so long-shot words get a real chance. Output becomes varied, surprising, and more prone to nonsense.

A temperature near zero makes the model nearly deterministic, almost always choosing the top candidate, which is what you want for a factual lookup or structured data extraction where you would happily trade flair for reliability. A temperature toward the higher end of the range loosens it up for brainstorming, fiction, or marketing copy where you want range and the odd unexpected turn. There is no single “correct” temperature. It is a deliberate trade between caution and creativity, and matching it to the task is one of the simplest, most overlooked levers you have.
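
To make the sharpening and flattening concrete, here is a minimal sketch that reshapes the toy Australia distribution at three temperatures. It assumes we start from probabilities rather than raw model scores, so it raises each probability to the power 1/T and renormalises, which is mathematically the same as dividing the scores by T before converting them to probabilities. The function name `apply_temperature` and the chosen temperatures are illustrative.

```python
# Toy probabilities for "The capital of Australia is ..." (not real model output).
candidates = {
    "Sydney": 0.46,
    "Canberra": 0.34,
    "Melbourne": 0.12,
    "Perth": 0.05,
    "<other>": 0.03,
}

def apply_temperature(probs, temperature):
    """Reshape a distribution: raise each probability to 1/T, then renormalise."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}

for t in (0.2, 1.0, 2.0):
    reshaped = apply_temperature(candidates, t)
    row = ", ".join(f"{tok} {p:.0%}" for tok, p in reshaped.items())
    print(f"T={t}: {row}")
```

At T = 0.2 the front-runner's share climbs from 46% to roughly 82%, while at T = 2.0 it drops to roughly 34% and the long shots gain ground. The ranking of the candidates never changes, only the size of the gaps between them.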

[Figure: Same words, two settings of the dial. Low temperature: safe, focused, repeatable; the top word wins. High temperature: varied, creative, riskier; long shots get a real chance.]
Temperature does not add knowledge. It only changes how boldly the model gambles among the words it already has.

Why this turns into confident nonsense

Now the hallucination falls out naturally. When you ask about something the model knows well, the true answer also happens to be the high-probability one, and the dice almost always land on it. But ask for a specific court case, an exact statistic, or a citation the model never really learned, and there is no single high-probability truth waiting in the list. There is just a cloud of plausible-looking fragments: real-sounding names, believable numbers, the format of a legal reference. The model samples from that cloud and assembles something that looks exactly like a valid citation, because it has learned the shape of citations even when it does not have the specific one.

And it says it in the same even, confident voice it uses for everything, because internally nothing special happened. It did not flip into a “guessing mode.” Every token it has ever produced was a sample from a probability distribution; this time the distribution simply had no truth in it to find. That is the uncomfortable heart of the matter, and I think it is the single most important thing for anyone using these tools to internalise: the model has no built-in sense of the difference between recalling a fact and fabricating one. Both feel identical from the inside, which is why both come out sounding equally sure.

[Figure: Why a fake citation looks so real. Example: “Harrison v. Meridian Logistics, 614 F.3d 218 (9th Cir. 2011)”, annotated with: plausible names; realistic-looking citation number; real court, believable year. Every part is the right shape. The case is still entirely invented.]
The model nailed the format of a citation, which is exactly what makes the fake so convincing.

How to actually keep a lid on it

You cannot delete hallucination, because it is baked into how generation works, but you can stack the odds heavily in your favour. Start with the dial: for anything factual, turn the temperature down. A low setting makes the model take the safest, most-supported continuation, which is usually the one closest to what it genuinely learned. Next, do not make it rely on memory at all when you can avoid it. Hand it the real source material and ask it to answer from that (the retrieval approach covered in an upcoming part), so the true answer is sitting right there in the context instead of being dredged up from fuzzy training.

Then build in verification. Ask for sources and check them, because a model will cheerfully produce a citation and a real one looks no different from a fake one until you look it up. Cross-check anything that carries consequences. And pick the right tool for the job in the first place: a system that must never invent a fact is a poor fit for raw open-ended generation and a much better fit for generation grounded in a trusted database. None of these are exotic. Together they are the difference between a useful assistant and an embarrassing brief full of cases that never happened.

Reality check: turning temperature to zero does not switch on a “truth mode.” It makes the model more consistent, not more correct. If the most probable answer is wrong, a low temperature will hand you that wrong answer reliably, every single time. Consistency and accuracy are different things, and conflating them is how people talk themselves into trusting a confidently repeated mistake.
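
The gap between consistency and accuracy can be measured on the same toy distribution. The sketch below is an illustration under stated assumptions: the probabilities are invented, “consistency” is defined here as the share of runs that return the most common answer, and “accuracy” as the share that return “Canberra”. Neither metric name comes from any library.

```python
import random
from collections import Counter

# Toy probabilities (not real model output); the top candidate is wrong.
probs = {"Sydney": 0.46, "Canberra": 0.34, "Melbourne": 0.12,
         "Perth": 0.05, "<other>": 0.03}
CORRECT = "Canberra"

def reshape(probs, temperature):
    """Apply temperature to probabilities: p ** (1/T), renormalised."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}

def consistency_and_accuracy(probs, temperature, runs, rng):
    """Ask the same 'question' many times and score the answers."""
    dist = reshape(probs, temperature)
    answers = rng.choices(list(dist), weights=list(dist.values()), k=runs)
    counts = Counter(answers)
    consistency = counts.most_common(1)[0][1] / runs  # share of the modal answer
    accuracy = counts[CORRECT] / runs                 # share of correct answers
    return consistency, accuracy

rng = random.Random(0)  # fixed seed so the run is repeatable
for t in (1.0, 0.1):
    c, a = consistency_and_accuracy(probs, t, 2000, rng)
    print(f"T={t}: consistency={c:.0%}, accuracy={a:.0%}")
```

Lowering the temperature pushes consistency up and, in this example, pushes accuracy down, because the answer being repeated so reliably is the wrong one.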

Go Deeper (optional, for technical readers)

Temperature is only one of the sampling controls, and they are worth distinguishing. The rawest option is greedy decoding: always take the single highest-probability token. It is deterministic and fine for short, factual completions, but it tends to produce flat, repetitive text and can get stuck in loops. Temperature sits one level up: before the final softmax, the logits (the model's raw scores for each token) are divided by a number T. T below 1 sharpens the distribution toward the top token, T equal to 1 leaves it unchanged, T above 1 flattens it, and T near 0 collapses back to greedy.
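
A minimal sketch of that step, using only the Python standard library. The three logits are hypothetical raw scores invented for illustration, and `softmax_with_temperature` and `greedy_pick` are illustrative names, not functions from any framework.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_pick for the T -> 0 limit")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
print("greedy picks index", greedy_pick(logits))
```

At T = 0.1 almost all of the probability lands on the first token, which is the same token `greedy_pick` returns. That is the sense in which a temperature near zero collapses back to greedy decoding.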

The two other dials you will meet shape which candidates are eligible before sampling. Top-k keeps only the k most probable tokens, zeroes out the rest, and renormalises what remains, so the model can never pick something truly bizarre, but a fixed k is clumsy when the distribution is sometimes sharp and sometimes flat. Top-p, or nucleus sampling, fixes that by keeping the smallest set of tokens whose probabilities add up to at least p (say 0.9), so the candidate pool grows when the model is uncertain and shrinks when it is confident. In practice people combine these: a modest temperature with top-p is a common default that balances coherence and variety. None of them, it bears repeating, adds knowledge. They only govern how the model gambles among the words it already considers possible.
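
Both filters are short enough to sketch directly. This is a simplified illustration on plain dictionaries, with invented probabilities; real inference libraries apply the same idea to tensors of logits, and the function names here are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, drop the rest, renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities reach p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Toy distributions (not real model output): one sharp, one spread out.
confident = {"blue": 0.92, "clear": 0.05, "grey": 0.02, "falling": 0.01}
uncertain = {"Sydney": 0.46, "Canberra": 0.34, "Melbourne": 0.12,
             "Perth": 0.05, "<other>": 0.03}

print("top-p pool when confident:", list(top_p_filter(confident, 0.9)))
print("top-p pool when uncertain:", list(top_p_filter(uncertain, 0.9)))
print("top-k pool with k=2:      ", list(top_k_filter(uncertain, 2)))
```

With p = 0.9 the pool holds a single token for the sharp distribution and three for the spread-out one, while top-k with k = 2 keeps two tokens regardless of how the probabilities are shaped.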

This is Part 11, the last of Phase 2, in a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. For the plain-English version of this problem see Part 5 on what GenAI can and cannot do; the next-token basics are in Part 4.

Sampling controls at a glance

Control         | What it does                                 | Reach for it when
Greedy          | Always take the top token                    | Short, deterministic, factual output
Temperature     | Sharpens (low) or flattens (high) the odds   | Tuning caution vs creativity
Top-k           | Keep only the k most likely tokens           | Blocking truly bizarre picks
Top-p (nucleus) | Keep the smallest set summing to p           | Adapting to the model's confidence
None of these add knowledge; they only govern how boldly the model gambles.

The Bottom Line

Hallucination is not a glitch bolted onto an otherwise truthful machine. It is the shadow of the very thing that makes these models useful. They generate by sampling likely words from a probability distribution, and temperature decides how boldly they sample. When the truth is the likely word, you get a correct answer; when it is not, you get a confident fabrication wearing the same calm voice, because the model cannot tell the two apart.

So the goal is never to wait for a version that stops making things up. It is to build habits and systems that assume it will: lower the temperature for facts, put real sources in front of it, and verify what counts. That mindset (treat fluency as a default, never as proof) closes out the foundations. From here the series turns practical. Phase 3 is about using these models well, and it starts where every good result starts, with the prompt.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
