Dr. Pranay Jha

Mixture-of-Experts and Where AI Architecture Is Heading (GenAI Series, Part 29)

Mixture-of-experts models hold enormous capacity but activate only a few experts per token, so they run cheaply. How MoE works, its memory catch, and the trends to watch.

Read time: 10 minutes

Generative AI Series · Part 29 of 30

TL;DR · Key Takeaways

  • A mixture-of-experts (MoE) model holds many specialised sub-networks but uses only a few for each token, so it has huge capacity at a modest compute cost.
  • This is why some models advertise enormous total parameter counts yet run about as cheaply as far smaller ones: most of the model sits idle on any given token.
  • The catch is memory: every expert must still live in VRAM even though only a few are used, so MoE saves compute, not space.
  • Alongside MoE, the trends to watch are longer context, reasoning models that think before answering, and a steady push toward efficiency.

How can a model have hundreds of billions of parameters, even a trillion, and still cost roughly what a much smaller model costs to run? It sounds like a contradiction, given everything Phase 4 said about size driving cost. The resolution is an architecture called mixture-of-experts, and it has quietly become one of the most important ideas in how frontier models are built. It is also the “expert parallelism” we kept deferring. Understanding it is the key to making sense of a lot of recent model announcements, and it opens onto the broader question this penultimate part is really about: where AI architecture is heading next.

Dense models vs mixture-of-experts

The models we have described so far are dense: every parameter is involved in processing every token. Run a word through a dense 70-billion-parameter model and all 70 billion participate. That is simple and effective, but it means cost scales directly with size, because there is no way to use a bigger model without doing more work on every single token.

A mixture-of-experts model breaks that link. Instead of one big dense block, it contains many smaller sub-networks called experts, and a small router that, for each token, picks just a couple of experts to handle it. The rest sit out that token entirely. So a model might hold dozens of experts totalling hundreds of billions of parameters, but only activate, say, two of them per token, doing the compute of a far smaller model. The total capacity is enormous, the work per token is modest, and the router learns during training which experts are good at what, so different tokens get sent to the sub-networks best suited to them. It is the difference between making one generalist do everything and keeping a roomful of specialists where only the relevant two are ever consulted at a time.
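The routing step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the function names (`route_top_k`, `moe_layer`) and the toy experts are hypothetical, and the router scores are passed in directly, whereas in a real model they come from a small learned layer applied to each token.

```python
import math


def softmax(logits):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalised so the chosen experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(router_logits, k):
        expert_output = experts[index](token)  # the other experts are never called
        output = [o + weight * e for o, e in zip(output, expert_output)]
    return output


# Toy experts (hypothetical): each one just scales the token vector.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Illustrative router scores that favour experts 1 and 3 equally.
print(moe_layer([1.0, 1.0], [0.0, 5.0, 1.0, 5.0], experts))  # [3.0, 3.0]
```

Only two of the four experts run for this token, which is the whole point: the work done depends on `k`, not on how many experts the model holds.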

[Figure: Use all of it, or just the right part. Dense: all parameters, every token uses everything. Mixture-of-experts: a router in front of experts 1 to 4, only 2 active and the rest idle; huge total capacity, small compute per token.]
The router is the new ingredient: it decides which few experts each token actually needs.

The win, and the catch

The win is real and important. MoE lets you grow a model’s total knowledge and capacity (more experts, more specialisation) without growing the compute spent on each token in step. For the same inference budget you get a model that behaves as if it were much larger, because it effectively is, just not all at once. This is a major reason the gap between “total parameters” and “active parameters” now appears in model descriptions, and why a model advertised at a headline-grabbing size can still be economical to serve. It is one of the cleaner free-ish lunches the field has found.

But there is a catch, and it lands exactly on the constraint from Part 23. MoE saves compute, not memory. Every expert might be needed by some token, so all of them have to be loaded into VRAM at once, even though only a couple fire for any given token. A mixture-of-experts model with hundreds of billions of total parameters needs the memory to hold hundreds of billions of parameters, even while doing the compute of a much smaller one. So MoE shifts the bottleneck: it eases the compute wall and leans even harder on the memory wall. That trade is usually worth it at the frontier, where memory can be thrown at the problem, but it is the reason MoE is not simply a free win, and why it pairs so naturally with the expert parallelism of the last part, spreading those many experts across many GPUs.

For anyone choosing or deploying models, this has a concrete consequence worth internalising. An MoE model can be a bargain to run and still demanding to host: you may need enough VRAM for a very large model even though each token only does the work of a small one. So an MoE model that looks cheap by its active-parameter count can still rule itself out if you cannot fit all its experts in memory. When you compare a dense model against an MoE one, hold two numbers in mind at once, active parameters for the compute bill and total parameters for the memory you must provision, because they answer different questions and an MoE model deliberately pulls them far apart.
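To make the two-number habit concrete, here is a rough sizing sketch. Everything in it is an assumption for illustration: the two models are hypothetical, weights are taken to be 16-bit (2 bytes each), the memory figure covers weights only (no KV cache or activations), and compute uses the common rule of thumb of about 2 FLOPs per active parameter per token.

```python
from dataclasses import dataclass

BYTES_PER_PARAM_16BIT = 2  # assumption: FP16/BF16 weights, no quantisation


@dataclass(frozen=True)
class ModelSize:
    name: str
    total_params: float   # everything that must be resident in memory
    active_params: float  # what actually fires for any one token


def weight_memory_gb(model, bytes_per_param=BYTES_PER_PARAM_16BIT):
    """Memory for the weights alone; KV cache and activations come on top."""
    return model.total_params * bytes_per_param / 1e9


def flops_per_token(model):
    """Rule of thumb: roughly 2 FLOPs per active parameter per token."""
    return 2 * model.active_params


# Hypothetical models, chosen only to show how far the two numbers can diverge.
dense = ModelSize("dense-40B", total_params=40e9, active_params=40e9)
moe = ModelSize("moe-400B-A40B", total_params=400e9, active_params=40e9)

for model in (dense, moe):
    print(f"{model.name}: {weight_memory_gb(model):.0f} GB of weights, "
          f"{flops_per_token(model) / 1e9:.0f} GFLOPs per token")
```

Under these assumptions both models do the same work per token, but the MoE one needs ten times the memory just to hold its weights. That is the gap to check before the active-parameter count talks you into a deployment.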

The other directions the field is moving

MoE is one trend; a few others are reshaping things just as much. Longer context is a sustained push: through the efficiency tricks from Part 10 (sparse and windowed attention, smarter position handling), models have stretched from a few thousand tokens to hundreds of thousands and beyond, changing what they can take in at once even as the “lost in the middle” caveats persist. Reasoning models are the newer and perhaps deeper shift. Instead of answering immediately, these models are trained to produce a long internal chain of working before committing to a final answer, in effect spending more computation at inference time to think harder on problems that need it. On math, logic, and multi-step tasks this can lift quality sharply, and it introduces a new dial: pay more compute per query in exchange for better reasoning, a trade that did not exist when models spent roughly the same effort on every question.

Around these sit steadier currents. There is a strong drive toward efficiency: smaller models that, thanks to better data and distillation, rival yesterday’s giants and run on a laptop, which is the theme of Part 19 playing out in real time. Multimodality from Part 17 keeps deepening, with models that natively handle text, images, audio, and video together. And agentic capabilities keep improving, with the reliability caveats of Part 16. No one knows which of these will dominate, but the honest pattern is that progress is now coming less from simply making models bigger and more from making them smarter about how they spend their parameters and their compute.

[Figure: Where architecture is heading. Mixture-of-experts, longer context, reasoning (“thinking”), efficient small models, deeper multimodality. The common thread: smarter use of parameters and compute, not just more of them.]
A snapshot of the field as of 2026, not a prophecy. The throughline is efficiency of capability, not raw scale.
Reality check: when you read that a new model has some colossal parameter count, ask whether it is dense or MoE, and what the active parameter count is. Those are very different claims about cost and capability. I treat a headline total-parameter number for an MoE model the way I treat a car’s top speed: technically true, rarely the figure that tells you what it is like to actually live with.
Go Deeper (optional, for technical readers)

The precise reason MoE gives more capacity at similar inference cost is the split between total and active parameters. Inference compute (FLOPs per token) scales with the active parameters, the ones that actually fire, while model capacity, roughly its ability to store knowledge and skills, scales with the total. A model like Mixtral 8x7B has eight experts in each layer and around 47 billion total parameters, but routes each token to just two of them, so only about 13 billion are active per token. (The total is below 8 × 7 = 56 billion because only the feed-forward blocks are replicated as experts; the attention layers and embeddings are shared.) It therefore costs roughly like a 13B model to run while carrying knowledge closer to that of a much larger one. Decouple compute from capacity and you can climb the capability curve far more cheaply, in compute terms, than dense scaling allows.
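The 47 billion and 13 billion figures can be roughly reproduced from the model's shape. The sketch below is back-of-envelope arithmetic, not an official breakdown. The hyperparameters are not given in this article; they are what I understand Mixtral 8x7B's published configuration to be, so treat them as assumptions. Each expert is counted as three feed-forward matrices, and attention, router, norms and embeddings are counted once because every token shares them.

```python
# Assumed Mixtral 8x7B hyperparameters (not stated in this article).
HIDDEN = 4096            # model width
FFN_HIDDEN = 14336       # width inside each expert's feed-forward block
LAYERS = 32
EXPERTS = 8              # experts per layer
EXPERTS_PER_TOKEN = 2    # experts the router selects per token, per layer
VOCAB = 32000
QUERY_HEADS = 32
KV_HEADS = 8             # grouped-query attention: fewer key/value heads
HEAD_DIM = 128


def count_params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    expert = 3 * HIDDEN * FFN_HIDDEN  # gate, up and down projections
    attention = (2 * HIDDEN * QUERY_HEADS * HEAD_DIM   # query and output
                 + 2 * HIDDEN * KV_HEADS * HEAD_DIM)   # key and value
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN
    per_layer = experts_counted * expert + attention + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # input embedding plus output head
    return LAYERS * per_layer + embeddings + HIDDEN  # plus the final norm


total = count_params(EXPERTS)             # what must sit in memory
active = count_params(EXPERTS_PER_TOKEN)  # what one token actually touches

print(f"total  ≈ {total / 1e9:.1f}B")   # total  ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B")  # active ≈ 12.9B
```

Nearly all of the difference between the two numbers sits in the expert blocks, which is why adding experts grows capacity and memory while leaving per-token compute almost untouched.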

The engineering is not free, which is why MoE took time to mature. The router must spread tokens across experts reasonably evenly; if it sends too many to a few favourites, those experts bottleneck while others sit idle, so training adds load-balancing losses to discourage that collapse. Memory is the harder cost: all experts must reside in VRAM, so an MoE model’s memory footprint tracks its total parameters even though its compute tracks the active ones, which is exactly why MoE leans so hard on the memory wall and on expert parallelism to spread experts across GPUs. There is also a communication cost, because routing tokens to experts that live on different GPUs means shuffling data across the network mid-layer, an all-to-all exchange that the interconnect of Part 26 has to absorb. The takeaway: MoE is a deliberate trade of memory and communication complexity for a much better compute-to-capability ratio, and getting that trade right is a big part of why it now appears across the frontier.
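One common form of the load-balancing loss mentioned above is the auxiliary loss popularised by the Switch Transformer work: multiply, for each expert, the fraction of tokens sent to it by the average router probability it received, sum over experts, and scale by the number of experts. The sketch below assumes top-1 routing for simplicity, and the function name is hypothetical. In training this term is multiplied by a small coefficient and added to the main loss.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_probs: one probability list (over experts) per token.
    assignments:  the expert index each token was actually sent to.
    f_i is the fraction of tokens dispatched to expert i and P_i is the
    mean router probability for expert i. Perfectly even routing gives 1.0;
    sending everything to one expert gives num_experts.
    """
    num_tokens = len(assignments)
    token_fraction = [0.0] * num_experts
    for expert in assignments:
        token_fraction[expert] += 1.0 / num_tokens
    mean_prob = [
        sum(probs[expert] for probs in router_probs) / num_tokens
        for expert in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(token_fraction, mean_prob))


# Balanced: four tokens spread evenly over four experts.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed: the router sends every token to expert 0 with full confidence.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)

print(balanced, collapsed)  # 1.0 4.0
```

The collapsed router scores four times worse than the balanced one here, so gradient descent has a steady incentive to spread tokens out rather than pile them onto a favourite.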

This is Part 29 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It pays off the “expert parallelism” from Part 28 and the memory wall of Part 23.

Dense vs mixture-of-experts

| | Dense | Mixture-of-Experts |
| --- | --- | --- |
| Params active per token | All of them | Only a few experts |
| Compute per token | Scales with total size | Scales with active params |
| Memory needed | Total parameters | Total parameters (all experts loaded) |
| Capacity | Tied to compute | Decoupled, much higher |
MoE saves compute, not memory: that is the whole trade.

The Bottom Line

Mixture-of-experts is the architecture that broke the rigid link between model size and running cost. By holding many specialist experts but activating only a few per token, an MoE model gets the capacity of something huge at the compute of something modest, which resolves the trillion-parameter paradox. The price is memory, since every expert must still sit in VRAM, which is why MoE eases one wall by leaning on another. Alongside it, longer context, reasoning models that think before they answer, and a broad push for efficiency are the currents reshaping what models can do.

The pattern worth carrying forward is that the frontier has shifted from “make it bigger” to “make it smarter about what it activates and when.” That is a healthier kind of progress, and a more economical one. We have now travelled from what generative AI is all the way to how it is trained and where it is going. The final part steps back to ask the questions a clear-eyed practitioner should: what does the economics really look like, how much of the hype is real, and what should you actually do about it.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
