TL;DR · Key Takeaways
- The context window is everything the model can see at once: your prompt, the conversation so far, and its reply, all measured in tokens.
- A model has no memory between turns. The app resends the whole conversation every time, and once it overflows the window, the oldest parts drop off. That is the “forgetting.”
- Longer context is not free. Attention cost grows with the square of the length, so doubling the input roughly quadruples the work.
- Even within the window, models attend best to the start and end and can lose details buried in the middle.
You are deep into a long chat with an AI. Forty messages in, you refer back to the name you gave it at the start, and it has no idea what you mean. Or you paste a fifty-page contract, ask a question about clause two, and get a confident answer about something on page forty instead. Both moments feel like the model getting dumber mid-conversation. It is not. You have run into the edges of the context window, and once you understand that one boundary, a whole class of strange AI behaviour suddenly makes sense.
What the context window really is
The context window is the model’s field of view, measured in tokens. Whatever sits inside it, the model can use; whatever falls outside, it cannot. And it has to hold everything at once: your current question, the earlier back-and-forth, any document you pasted, the system instructions you never see, and the reply being generated right now. All of it competes for the same fixed budget. A window described as “128k” means it can hold about 128,000 tokens, very roughly 100,000 words, across all of that combined.
The number has grown fast. Early models held a few thousand tokens, barely a long email. Recent ones, as of 2026, stretch to hundreds of thousands or even a million, enough for a whole book. But there is always a ceiling, and two things share the space: what goes in (your prompt and history) and what comes out (the reply). Ask for a long essay and you leave less room for input. Paste a huge input and you leave less room for the answer. They draw from one pool.
Why models “forget”
Here is the part that trips people up. A model keeps no memory between messages. None. Each time you hit send, the application quietly bundles up the entire conversation so far and feeds the whole thing back in as one long prompt. The model reads it fresh, answers, and forgets it again the instant it is done. The illusion of an ongoing conversation is maintained by your chat app re-sending the transcript every single turn, not by the model remembering anything.
So “forgetting” is simply the transcript outgrowing the window. When the conversation gets longer than the budget, something has to give, and the usual casualty is the oldest material. It gets trimmed away so the recent messages still fit. The model is not losing focus or getting tired. It literally can no longer see the part of the chat that scrolled off the back. Once you picture that sliding window, the fix becomes obvious too: if something matters, bring it back into view by restating it, rather than trusting the model to recall it from forty messages ago.
This also explains the “memory” features that products have started adding. When a chat app claims to remember you across sessions, it is not that the model learned anything. The app is keeping notes on the side, a short profile or a running summary of what you said, and quietly slipping those notes back into the window on future turns. Some tools do the same thing within a single long chat: once the transcript gets close to the limit, they compress the older messages into a brief summary and carry that forward instead of the full text. It is a sensible hack, but it is worth knowing it is happening, because a summary is lossy. The detail that gets summarised away is gone just as surely as if it had scrolled off the back.
Why a bigger window is not a free lunch
If forgetting is caused by a small window, why not just make the window enormous and paste in everything? Two reasons, and the first is cost. Recall from Part 8 that attention has every token look at every other token. Double the input and you have not doubled the comparisons, you have roughly quadrupled them. That is what “scales with the square of the length” means in practice, and it shows up as real money and real waiting: a very long prompt costs more and responds slower, every time you send it. A million-token window is a remarkable feat, but filling it on every request is a way to set fire to your inference budget.
The second reason is subtler and, honestly, more surprising. Models do not pay equal attention across a long context. They reliably notice what is near the beginning and near the end, and they get hazier about material stranded in the middle. Researchers named this the “lost in the middle” effect, and it means a crucial sentence buried halfway down a giant document can be quietly skated over, even though it technically fit in the window. A big context window guarantees the model can see something. It does not guarantee the model will use it well.
▾ Go Deeper (optional, for technical readers)
The quadratic cost comes straight from self-attention. For a sequence of n tokens, every token attends to every other, so the attention scores form an n × n matrix. That is roughly n² operations and, for the scores, n² memory. Going from 1,000 to 10,000 tokens is a 10× jump in length but about a 100× jump in this part of the work. On top of that sits the KV cache: to avoid recomputing the past on every new token, the model stores the key and value vectors for all previous tokens, and that cache grows linearly with context length, often becoming the real memory bottleneck during generation.
The long-context tricks you hear about are mostly ways to dodge that n² wall. FlashAttention reorders the computation to avoid ever writing the full score matrix to memory, giving exact attention far more efficiently. Sparse and sliding-window attention let each token attend to only a subset of others (nearby tokens, plus a few global ones) instead of all of them, trading some reach for a big speedup. And techniques like RoPE scaling (position interpolation) stretch a model trained on short contexts to handle longer ones without full retraining. These are why million-token windows became feasible at all, though none of them repeals the basic truth that more context means more work.
This is Part 10 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. The quadratic-cost point builds on Part 8 on attention; the cost angle continues from Part 9 on inference.
The Bottom Line
The context window is the single boundary that explains the most about how AI behaves in daily use. It is a fixed token budget holding the prompt, the history, and the reply all together. The model carries nothing between turns, so the “memory” you experience is your app re-feeding the transcript, and the “forgetting” is that transcript outgrowing the budget. Stretching the window helps, but it runs into a steep cost curve and the awkward fact that models read the middle of a long input poorly.
The practical takeaway I would tattoo on every prompt: do not confuse a large window with good recall, and do not pay to stuff in context the model will barely use. Be deliberate about what you put in front of it. That instinct, feed the model the right things rather than all the things, is the seed of retrieval and vector search, which are where this series heads after a detour through the other reason models surprise us: why they make things up, and what the temperature setting really does.
References
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
- FlashAttention: fast, memory-efficient exact attention (Dao et al., 2022)
- Count tokens in your own text (OpenAI Tokenizer)
« Part 9: training vs inference | Generative AI Complete Guide | Next: Part 11, why models make things up »








