Dr. Pranay Jha

The Context Window, and Why Models Forget (GenAI Series, Part 10)

The context window is everything an AI can see at once. This part explains why models have no memory between turns, why longer prompts cost more, and why details get lost in the middle.

Generative AI Series · Part 10 of 30

TL;DR · Key Takeaways

  • The context window is everything the model can see at once: your prompt, the conversation so far, and its reply, all measured in tokens.
  • A model has no memory between turns. The app resends the whole conversation every time, and once it overflows the window, the oldest parts drop off. That is the “forgetting.”
  • Longer context is not free. Attention cost grows with the square of the length, so doubling the input roughly quadruples the work.
  • Even within the window, models attend best to the start and end and can lose details buried in the middle.

You are deep into a long chat with an AI. Forty messages in, you refer back to the name you gave it at the start, and it has no idea what you mean. Or you paste a fifty-page contract, ask a question about clause two, and get a confident answer about something on page forty instead. Both moments feel like the model getting dumber mid-conversation. It is not. You have run into the edges of the context window, and once you understand that one boundary, a whole class of strange AI behaviour suddenly makes sense.

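[Figure: "The window only holds so much." Messages 1 through 7 and the reply in progress sit on a timeline. A dashed box labelled "Context window (what the model can see)" covers the most recent part, and the earliest messages are marked "fallen out of view." As the chat grows, the window slides forward and the oldest messages drop off the back.]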
Everything inside the dashed box is visible. Everything to the left of it is, to the model, gone.

What the context window really is

The context window is the model’s field of view, measured in tokens. Whatever sits inside it, the model can use; whatever falls outside, it cannot. And it has to hold everything at once: your current question, the earlier back-and-forth, any document you pasted, the system instructions you never see, and the reply being generated right now. All of it competes for the same fixed budget. A window described as “128k” means it can hold about 128,000 tokens, very roughly 100,000 words, across all of that combined.

The number has grown fast. Early models held a few thousand tokens, barely a long email. Recent ones, as of 2026, stretch to hundreds of thousands or even a million, enough for a whole book. But there is always a ceiling, and two things share the space: what goes in (your prompt and history) and what comes out (the reply). Ask for a long essay and you leave less room for input. Paste a huge input and you leave less room for the answer. They draw from one pool.

[Figure: "One token budget, shared by all." A fixed-width bar is split into the system prompt, the input (your prompt, chat history, and pasted docs), and the output (the reply). The more you put into the input, the less is left for the output. Total width is fixed: widen one section and another has to shrink. This is why a giant pasted document can cut your answer short.]
Input and output are not separate allowances. They split one fixed budget of tokens.
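
The shared budget is easy to see with a little arithmetic. Here is a minimal sketch, assuming a 128k window and made-up token counts for the system prompt, history, and pasted text; the function name `max_reply_tokens` is purely illustrative, and real services often add their own separate cap on reply length as well.

```python
def max_reply_tokens(window: int, system: int, history: int, prompt: int) -> int:
    """Tokens left for the reply once everything on the input side is counted."""
    used = system + history + prompt
    return max(window - used, 0)  # never negative: an overfull input leaves nothing


WINDOW = 128_000  # the "128k" window from the text

# Illustrative numbers only: a short chat versus a huge pasted document.
short_chat = max_reply_tokens(WINDOW, system=1_000, history=4_000, prompt=500)
big_paste = max_reply_tokens(WINDOW, system=1_000, history=4_000, prompt=120_000)

print(short_chat)  # 122500
print(big_paste)   # 3000
```

Same window, same model, but the big paste leaves room for only a short answer.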

Why models “forget”

Here is the part that trips people up. A model keeps no memory between messages. None. Each time you hit send, the application quietly bundles up the entire conversation so far and feeds the whole thing back in as one long prompt. The model reads it fresh, answers, and forgets it again the instant it is done. The illusion of an ongoing conversation is maintained by your chat app re-sending the transcript every single turn, not by the model remembering anything.

So “forgetting” is simply the transcript outgrowing the window. When the conversation gets longer than the budget, something has to give, and the usual casualty is the oldest material. It gets trimmed away so the recent messages still fit. The model is not losing focus or getting tired. It literally can no longer see the part of the chat that scrolled off the back. Once you picture that sliding window, the fix becomes obvious too: if something matters, bring it back into view by restating it, rather than trusting the model to recall it from forty messages ago.
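
To make the sliding window concrete, here is a rough sketch of what a chat app might do before each call. It assumes the simplest policy (keep the system prompt, then keep the newest messages that fit) and uses a hypothetical `count_tokens` that just counts words; real apps use the model's own tokenizer, and their trimming rules vary.

```python
def count_tokens(text: str) -> int:
    # Hypothetical stand-in: counts words, not real tokens.
    return len(text.split())


def fit_to_window(system: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus as many of the newest messages as fit."""
    remaining = budget - count_tokens(system)
    kept: list[str] = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break  # this message and everything older falls out of view
        kept.append(message)
        remaining -= cost
    return [system] + kept[::-1]  # restore chronological order


chat = [
    "user: call yourself Orion",
    "assistant: sure, I am Orion",
    "user: summarise this long report for me",
    "assistant: here is the summary you asked for",
    "user: what is your name",
]

visible = fit_to_window("system: be helpful", chat, budget=20)
for line in visible:
    print(line)
```

With this tiny budget, only the last two messages survive alongside the system prompt. The name given in the first message is no longer anywhere in what the model receives, so the final question cannot be answered from the transcript.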

This also explains the “memory” features that products have started adding. When a chat app claims to remember you across sessions, it is not that the model learned anything. The app is keeping notes on the side, a short profile or a running summary of what you said, and quietly slipping those notes back into the window on future turns. Some tools do the same thing within a single long chat: once the transcript gets close to the limit, they compress the older messages into a brief summary and carry that forward instead of the full text. It is a sensible hack, but it is worth knowing it is happening, because a summary is lossy. The detail that gets summarised away is gone just as surely as if it had scrolled off the back.
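
The summarise-and-carry-forward trick can be sketched in a few lines. This is an illustration under loud assumptions: `summarize` is a hypothetical placeholder that keeps the first three words of each message (a real app would ask a model to write the summary), and `compact` is an invented name, not any product's API.

```python
def summarize(messages: list[str]) -> str:
    # Hypothetical placeholder for a model-written summary.
    # Keeping three words per message makes the information loss easy to see.
    return "summary: " + " / ".join(" ".join(m.split()[:3]) for m in messages)


def compact(messages: list[str], keep_recent: int) -> list[str]:
    """Replace all but the newest `keep_recent` messages with one summary line."""
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to compress
    cut = len(messages) - keep_recent
    older, recent = messages[:cut], messages[cut:]
    return [summarize(older)] + recent


chat = [
    "user: my budget is 4,200 euros and the deadline is Friday",
    "assistant: noted, I will plan around that",
    "user: suggest a venue",
    "assistant: how about the riverside hall",
]

compacted = compact(chat, keep_recent=2)
for line in compacted:
    print(line)
```

The recent messages come through intact, but the budget figure from the first message does not appear in the summary line. A smarter summariser would lose less, yet something always gets left out.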

Why a bigger window is not a free lunch

If forgetting is caused by a small window, why not just make the window enormous and paste in everything? Two reasons, and the first is cost. Recall from Part 8 that attention has every token look at every other token. Double the input and you have not doubled the comparisons, you have roughly quadrupled them. That is what “scales with the square of the length” means in practice, and it shows up as real money and real waiting: a very long prompt costs more and responds slower, every time you send it. A million-token window is a remarkable feat, but filling it on every request is a way to set fire to your inference budget.
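
The quadrupling falls straight out of counting pairs. As a rough sketch, treat the attention work as proportional to the number of token pairs, which ignores constant factors and the parts of the model that are not attention:

```latex
\[
\text{pairs}(n) \approx n^2
\qquad\Longrightarrow\qquad
\frac{\text{pairs}(2n)}{\text{pairs}(n)} = \frac{(2n)^2}{n^2} = 4
\]
```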

The second reason is subtler and, honestly, more surprising. Models do not pay equal attention across a long context. They reliably notice what is near the beginning and near the end, and they get hazier about material stranded in the middle. Researchers named this the “lost in the middle” effect, and it means a crucial sentence buried halfway down a giant document can be quietly skated over, even though it technically fit in the window. A big context window guarantees the model can see something. It does not guarantee the model will use it well.

[Figure, two panels. Left, "Longer prompts cost more than they look": cost and time plotted against input length in tokens, growing faster than linearly. Right, "Lost in the middle": recall across the start, middle, and end of the context, worst in the middle.]
Left: cost climbs steeply with length. Right: even when it fits, the middle is the weak spot.
Reality check: a giant context window is marketed as a memory upgrade, and I think that framing quietly misleads people. It is a bigger desk, not a filing cabinet. Dumping a thousand pages in on every call is usually slower, pricier, and less accurate than fetching just the few relevant passages and handing those over, which is exactly the job of retrieval, coming up soon.

Go Deeper (optional, for technical readers)

The quadratic cost comes straight from self-attention. For a sequence of n tokens, every token attends to every other, so the attention scores form an n × n matrix. That is roughly n² operations and, for the scores, n² memory. Going from 1,000 to 10,000 tokens is a 10× jump in length but about a 100× jump in this part of the work. On top of that sits the KV cache: to avoid recomputing the past on every new token, the model stores the key and value vectors for all previous tokens, and that cache grows linearly with context length, often becoming the real memory bottleneck during generation.
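
A back-of-the-envelope calculation shows the two growth rates side by side. The model shape below (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value) is an assumption picked for illustration, not any particular model, and the score count is for a single attention head in a single layer.

```python
def score_matrix_entries(n_tokens: int) -> int:
    """Entries in one n x n attention score matrix (one head, one layer)."""
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for every token seen so far."""
    # The leading 2 is one key vector plus one value vector
    # per token, per layer, per key/value head.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (1_000, 10_000):
    scores = score_matrix_entries(n)
    cache_mib = kv_cache_bytes(n, **SHAPE) / 2**20
    print(f"{n:>6} tokens: {scores:>11,} scores, {cache_mib:,.0f} MiB of KV cache")
```

Under these assumptions the score count jumps 100× for a 10× longer input, while the cache grows 10×, from 125 MiB to 1,250 MiB. The cache grows more slowly, but it has to stay resident for the whole generation, which is why it so often ends up as the limit.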

The long-context tricks you hear about are mostly ways to dodge that n² wall. FlashAttention reorders the computation so the full score matrix is never written out to GPU memory. It still computes exact attention and still does roughly n² arithmetic, but the memory needed for the scores no longer grows with n², and it runs far faster in practice. Sparse and sliding-window attention let each token attend to only a subset of others (nearby tokens, plus a few global ones) instead of all of them, trading some reach for a big speedup. And techniques like RoPE scaling (position interpolation) stretch a model trained on short contexts to handle longer ones without full retraining. These are why million-token windows became feasible at all, though none of them repeals the basic truth that more context means more work.
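
To get a feel for what sliding-window attention saves, here is a small counting sketch. It assumes causal attention (each token looks only at itself and earlier tokens, as in typical chat models) and a window of 512 tokens chosen purely for illustration; it counts token pairs and ignores the few global tokens mentioned above.

```python
def full_attention_pairs(n: int) -> int:
    """Causal full attention: token i attends to itself and all i earlier tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Causal sliding window: token i attends to itself and up to window - 1 earlier tokens."""
    return sum(min(i + 1, window) for i in range(n))


for n in (10_000, 20_000):
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n, window=512)
    print(f"{n:>6} tokens: full {full:>11,} pairs, sliding window {local:>10,} pairs")
```

Doubling the length roughly quadruples the full count but only about doubles the windowed one. The price is reach: a token can no longer look directly at anything more than 511 positions behind it.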

This is Part 10 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. The quadratic-cost point builds on Part 8 on attention; the cost angle continues from Part 9 on inference.

The Bottom Line

The context window is the single boundary that explains the most about how AI behaves in daily use. It is a fixed token budget holding the prompt, the history, and the reply all together. The model carries nothing between turns, so the “memory” you experience is your app re-feeding the transcript, and the “forgetting” is that transcript outgrowing the budget. Stretching the window helps, but it runs into a steep cost curve and the awkward fact that models read the middle of a long input poorly.

The practical takeaway I would tattoo on every prompt: do not confuse a large window with good recall, and do not pay to stuff in context the model will barely use. Be deliberate about what you put in front of it. That instinct, to feed the model the right things rather than all the things, is the seed of retrieval and vector search. The series gets there after a detour through the other reason models surprise us: why they make things up, and what the temperature setting really does.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
