TL;DR · Key Takeaways
- A mixture-of-experts (MoE) model holds many specialised sub-networks but uses only a few for each token, so it has huge capacity at a modest compute cost.
- This is why some models advertise enormous total parameter counts yet run about as cheaply as far smaller ones: most of the model sits idle on any given token.
- The catch is memory: every expert must still live in VRAM even though only a few are used, so MoE saves compute, not space.
- Alongside MoE, the trends to watch are longer context, reasoning models that think before answering, and a steady push toward efficiency.
How can a model have hundreds of billions of parameters, even a trillion, and still cost roughly what a much smaller model costs to run? It sounds like a contradiction, given everything Phase 4 said about size driving cost. The resolution is an architecture called mixture-of-experts, and it has quietly become one of the most important ideas in how frontier models are built. It is also the architecture behind the “expert parallelism” we kept deferring. Understanding it is the key to making sense of a lot of recent model announcements, and it opens onto the broader question this penultimate part is really about: where AI architecture is heading next.
Dense models vs mixture-of-experts
The models we have described so far are dense: every parameter is involved in processing every token. Run a word through a dense 70-billion-parameter model and all 70 billion participate. That is simple and effective, but it means cost scales directly with size, because there is no way to use a bigger model without doing more work on every single token.
A mixture-of-experts model breaks that link. Instead of one big dense block, it contains many smaller sub-networks called experts, and a small router that, for each token, picks just a couple of experts to handle it. The rest sit out that token entirely. So a model might hold dozens of experts totalling hundreds of billions of parameters, but only activate, say, two of them per token, doing the compute of a far smaller model. The total capacity is enormous, the work per token is modest, and the router learns during training which experts are good at what, so different tokens get sent to the sub-networks best suited to them. It is the difference between making one generalist do everything and keeping a roomful of specialists where only the relevant two are ever consulted at a time.
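To make the routing step concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not any particular model's implementation: the names `route_token` and `moe_layer` are hypothetical, the experts are toy functions, and the router scores are supplied by hand (in a real model a small learned layer computes them from the token). Renormalising the scores of the chosen experts with a softmax is one common choice for the mixing weights.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k highest-scoring experts and give each a mixing weight."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted sum of the chosen experts' outputs; the others are never called."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Toy setup (all values are made up): 8 experts, each just scales its input,
# and each records when it is called so we can see which ones did any work.
calls = []

def make_expert(i):
    def expert(x):
        calls.append(i)
        return [(i + 1) * v for v in x]
    return expert

experts = [make_expert(i) for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # hand-picked scores
output = moe_layer([1.0, 2.0], experts, router_logits, top_k=2)
print("experts that ran:", sorted(calls))
```

Only the two experts with the highest scores are ever invoked, which is where the compute saving comes from. All eight still have to exist in the `experts` list, which foreshadows the memory cost discussed next.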
The win, and the catch
The win is real and important. MoE lets you grow a model’s total knowledge and capacity (more experts, more specialisation) without growing the compute spent on each token in step. For the same inference budget you get a model that behaves as if it were much larger, because it effectively is, just not all at once. This is a major reason the gap between “total parameters” and “active parameters” now appears in model descriptions, and why a model advertised at a headline-grabbing size can still be economical to serve. It is one of the cleaner free-ish lunches the field has found.
But there is a catch, and it lands exactly on the constraint from Part 23. MoE saves compute, not memory. Every expert might be needed by some token, so all of them have to be loaded into VRAM at once, even though only a couple fire for any given token. A mixture-of-experts model with hundreds of billions of total parameters needs the memory to hold hundreds of billions of parameters, even while doing the compute of a much smaller one. So MoE shifts the bottleneck: it eases the compute wall and leans even harder on the memory wall. That trade is usually worth it at the frontier, where memory can be thrown at the problem, but it is the reason MoE is not simply a free win, and why it pairs so naturally with the expert parallelism of the last part, spreading those many experts across many GPUs.
For anyone choosing or deploying models, this has a concrete consequence worth internalising. An MoE model can be a bargain to run and still demanding to host: you may need enough VRAM for a very large model even though each token only does the work of a small one. So an MoE model that looks cheap by its active-parameter count can still rule itself out if you cannot fit all its experts in memory. When you compare a dense model against an MoE one, hold two numbers in mind at once: active parameters for the compute bill, and total parameters for the memory you must provision. They answer different questions, and an MoE model deliberately pulls them far apart.
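The two-number comparison can be sketched as a back-of-the-envelope calculation. Everything here is an assumption for illustration: the helper names are hypothetical, the two models are invented, weights are assumed to be stored at 2 bytes per parameter (16-bit), and compute uses the common rule of thumb of about 2 FLOPs per active parameter per token. The estimate covers weights only and ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory for the weights alone, driven by TOTAL parameters."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rough forward-pass compute per token, driven by ACTIVE parameters."""
    return 2 * active_params

def fits_in_vram(total_params, vram_gb, bytes_per_param=2):
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gb(total_params, bytes_per_param) <= vram_gb

# Two hypothetical models (the sizes are invented for the comparison).
models = {
    "dense-70B": {"total": 70e9, "active": 70e9},
    "moe-400B-total-20B-active": {"total": 400e9, "active": 20e9},
}

for name, m in models.items():
    mem = weight_memory_gb(m["total"])
    flops = forward_flops_per_token(m["active"])
    print(f"{name}: {mem:.0f} GB of weights, {flops / 1e9:.0f} GFLOPs per token")
```

Under these assumptions the hypothetical MoE model does less than a third of the dense model's work per token (40 versus 140 GFLOPs) while needing almost six times the weight memory (800 GB versus 140 GB), which is the "bargain to run, demanding to host" pattern in numbers.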
The other directions the field is moving
MoE is one trend; a few others are reshaping things just as much. Longer context is a sustained push: through the efficiency tricks from Part 10 (sparse and windowed attention, smarter position handling), models have stretched from a few thousand tokens to hundreds of thousands and beyond, changing what they can take in at once even as the “lost in the middle” caveats persist. Reasoning models are the newer and perhaps deeper shift. Instead of answering immediately, these models are trained to produce a long internal chain of working before committing to a final answer, in effect spending more computation at inference time to think harder on problems that need it. On math, logic, and multi-step tasks this can lift quality sharply, and it introduces a new dial: pay more compute per query in exchange for better reasoning, a trade that did not exist when every answer cost the same.
Around these sit steadier currents. There is a strong drive toward efficiency: smaller models that, thanks to better data and distillation, rival yesterday’s giants and run on a laptop, which is the theme of Part 19 playing out in real time. Multimodality from Part 17 keeps deepening, with models that natively handle text, images, audio, and video together. And agentic capabilities keep improving, with the reliability caveats of Part 16. No one knows which of these will dominate, but the honest pattern is that progress is now coming less from simply making models bigger and more from making them smarter about how they spend their parameters and their compute.
Go Deeper (optional, for technical readers)
The precise reason MoE gives more capacity at similar inference cost is the split between total and active parameters. Inference compute (FLOPs per token) scales with the active parameters, the ones that actually fire, while model capacity, roughly its ability to store knowledge and skills, scales with the total. A model like Mixtral 8x7B has eight experts in each MoE layer and around 47 billion total parameters, but routes each token to just two of those experts, so only about 13 billion are active per token. It therefore costs roughly what a 13B model costs to run while carrying knowledge closer to that of a much larger one. Decouple compute from capacity and you can climb the capability curve far more cheaply, in compute terms, than dense scaling allows.
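The total and active figures quoted above are enough to back out a rough split between the parameters every token uses and the parameters that belong to a single expert. The sketch below assumes a simple two-bucket model, which is a simplification rather than the published architecture breakdown: total = shared + (number of experts × per-expert), and active = shared + (experts per token × per-expert). The function name is hypothetical and the inputs are the rounded numbers from the text.

```python
def split_shared_and_expert(total_params, active_params,
                            num_experts, experts_per_token):
    """Solve the two-bucket model for (shared, per_expert) parameter counts.

    total  = shared + num_experts       * per_expert
    active = shared + experts_per_token * per_expert
    """
    per_expert = (total_params - active_params) / (num_experts - experts_per_token)
    shared = active_params - experts_per_token * per_expert
    return shared, per_expert

# Rounded figures from the text: ~47B total, ~13B active, 8 experts, 2 per token.
shared, per_expert = split_shared_and_expert(47e9, 13e9, 8, 2)
print(f"shared across all tokens: about {shared / 1e9:.1f}B")
print(f"per expert:               about {per_expert / 1e9:.1f}B")
```

With these rounded inputs, each expert comes out at a little under 6 billion parameters, with a small shared remainder that every token uses. That sharing is also why the total sits below the 8 × 7 = 56 billion the name might suggest.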
The engineering is not free, which is why MoE took time to mature. The router must spread tokens across experts reasonably evenly; if it sends too many to a few favourites, those experts bottleneck while others sit idle, so training adds load-balancing losses to discourage that collapse. Memory is the harder cost: all experts must reside in VRAM, so an MoE model’s memory footprint tracks its total parameters even though its compute tracks the active ones, which is exactly why MoE leans so hard on the memory wall and on expert parallelism to spread experts across GPUs. There is also a communication cost, because routing tokens to experts that live on different GPUs means shuffling data across the network mid-layer, an all-to-all exchange that the interconnect of Part 26 has to absorb. The takeaway: MoE is a deliberate trade of memory and communication complexity for a much better compute-to-capability ratio, and getting that trade right is a big part of why it now appears across the frontier.
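As an illustration of the load-balancing idea, here is a minimal sketch of an auxiliary loss in the style of the Switch Transformers paper listed in the references: the number of experts times the sum, over experts, of the fraction of tokens sent to each expert multiplied by the average router probability it received. Treat it as a sketch under assumptions: it uses top-1 routing for simplicity, the function name and the coefficient `alpha` are illustrative, and the token data is made up.

```python
def load_balancing_loss(assignments, router_probs, num_experts, alpha=0.01):
    """Auxiliary loss that is smallest when tokens are spread evenly.

    assignments:  chosen expert index for each token (top-1 routing).
    router_probs: for each token, the router's probabilities over all experts.
    """
    num_tokens = len(assignments)
    # f[i]: fraction of tokens actually dispatched to expert i.
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # p[i]: average router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Made-up batch of 4 tokens and 4 experts.
balanced = load_balancing_loss(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    num_experts=4,
)
collapsed = load_balancing_loss(
    assignments=[0, 0, 0, 0],
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    num_experts=4,
)
print("balanced: ", balanced)
print("collapsed:", collapsed)
```

The collapsed batch, where one favourite expert takes every token, scores a higher penalty than the evenly spread one, so adding this term to the training loss nudges the router away from overloading a few experts.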
This is Part 29 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It pays off the “expert parallelism” from Part 28 and the memory wall of Part 23.
Dense vs mixture-of-experts
| | Dense | Mixture-of-Experts |
|---|---|---|
| Params active per token | All of them | Only a few experts |
| Compute per token | Scales with total size | Scales with active params |
| Memory needed | Total parameters | Total parameters (all experts loaded) |
| Capacity | Tied to compute | Decoupled, much higher |
The Bottom Line
Mixture-of-experts is the architecture that broke the rigid link between model size and running cost. By holding many specialist experts but activating only a few per token, an MoE model gets the capacity of something huge at the compute of something modest, which resolves the trillion-parameter paradox. The price is memory, since every expert must still sit in VRAM, which is why MoE eases one wall by leaning on another. Alongside it, longer context, reasoning models that think before they answer, and a broad push for efficiency are the currents reshaping what models can do.
The pattern worth carrying forward is that the frontier has shifted from “make it bigger” to “make it smarter about what it activates and when.” That is a healthier kind of progress, and a more economical one. We have now travelled from what generative AI is all the way to how it is trained and where it is going. The final part steps back to ask the questions a clear-eyed practitioner should: what does the economics really look like, how much of the hype is real, and what should you actually do about it.
References
- Switch Transformers: scaling to trillion-parameter models with MoE (Fedus et al., 2021)
- Mixtral of Experts (Jiang et al., 2024)
- Mixture of experts explained (Hugging Face)








