The Core Idea

A Mixture of Experts (MoE) model is a machine learning architecture where several specialized sub-networks (the “experts”) are trained together, and a separate mechanism (the “router” or “gate”) decides which expert(s) to consult for each individual input.

The key motivation is conditional computation: instead of activating every parameter of a giant model for every input, an MoE activates only a subset. This allows you to scale up total model capacity without proportionally scaling up inference cost.

The Mechanism

A typical (sparse) MoE layer works as follows:

  1. An input token or example arrives at the MoE layer.
  2. A small neural network called the router computes a score for each available expert.
  3. The router selects the top $k$ experts (for example $k = 1$ or $k = 2$). Only those selected experts process the input; the rest are idle for this token.
  4. The experts’ outputs are combined, typically as a weighted sum where the weights are the router’s scores for the selected experts, often renormalized so that they sum to 1.

So if you have 64 experts and top-2 routing, each token uses only 2 experts’ parameters in that layer, but the model as a whole benefits from the knowledge stored across all 64.
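The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the names `top_k_route` and `moe_layer` are hypothetical, the router logits are passed in directly rather than computed from the input by a learned linear layer, and renormalizing the selected weights to sum to 1 is one common convention assumed here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized to sum to 1 (an assumed convention).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, router_logits, experts, k=2):
    """Combine the outputs of the selected experts as a weighted sum."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts: each one scales the input by a different constant.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token
routes = top_k_route(logits, k=2)
output = moe_layer([1.0, 1.0], logits, experts, k=2)
```

With these toy values, experts 1 and 3 tie for the highest score, so each receives weight 0.5 and the other two experts are never called.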

Prerequisites and Jargon

  • Sub-network / Expert: A distinct set of neural network parameters (usually a feed-forward network in transformer MoEs). “Expert” does not mean it was pre-trained on a niche dataset; it simply refers to one such sub-network.
  • Dense vs. Sparse: A dense model activates all parameters for every input. A sparse MoE activates only a fraction, saving compute. This is what makes modern MoEs practical at scale.
  • Router / Gate: The small module that decides routing. Its output is usually a probability distribution over experts (often via softmax).
  • Top-$k$ routing: Selecting the $k$ experts with the highest router scores. Top-1 and top-2 are common choices in large language model MoEs, and some recent models route each token to more experts.
  • Load balancing: A training concern. If the router sends most tokens to the same few experts, those experts become overloaded and others undertrained. MoE training commonly adds auxiliary loss terms to encourage tokens to be distributed evenly across experts.
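  • Capacity factor: A multiplier that sets a per-expert limit on how many tokens the expert can process in a batch. The limit (the expert capacity) is typically the even share of tokens per expert, scaled by the capacity factor. Tokens routed beyond this capacity may be dropped or passed through residual connections. This prevents memory blowups from uneven routing.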
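To make the last two terms concrete, here is a small sketch of an expert capacity calculation and of one widely used auxiliary load-balancing loss (the form used in the Switch Transformer work, shown without its scaling coefficient). The function names are hypothetical, top-1 routing is assumed for simplicity, and real training code would compute these on tensors inside the model.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Max tokens one expert accepts: the even share scaled by capacity_factor."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def load_balancing_loss(assignments, router_probs, num_experts):
    """Auxiliary loss of the form num_experts * sum_i f_i * P_i.

    assignments  : chosen expert index for each token (top-1 routing assumed)
    router_probs : per-token list of router probabilities over experts
    f_i          : fraction of tokens sent to expert i
    P_i          : mean router probability assigned to expert i
    """
    n_tokens = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        f_i = sum(1 for a in assignments if a == i) / n_tokens
        p_i = sum(p[i] for p in router_probs) / n_tokens
        loss += f_i * p_i
    return num_experts * loss

# 1024 tokens, 8 experts, 25% headroom above the even share of 128 tokens.
cap = expert_capacity(1024, 8, 1.25)

# Evenly spread routing versus routing that collapses onto expert 0.
balanced = load_balancing_loss([0, 1, 0, 1], [[0.5, 0.5]] * 4, 2)
collapsed = load_balancing_loss([0, 0, 0, 0], [[0.9, 0.1]] * 4, 2)
```

The loss is smallest when both the token counts and the router probabilities are spread uniformly, so the collapsed case scores higher than the balanced one.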

How It Differs from a Single Large Model

With a dense model, doubling parameters roughly doubles the compute spent on every token, in both training and inference.

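With a sparse MoE, you can increase the total parameter count dramatically while keeping per-token FLOPs (floating point operations) much lower. For example, Mixtral 8x7B (8 experts per MoE layer, top-2 routing) has roughly 47B parameters in total, but only about 13B are active for any given token. The total is below 8 × 7B = 56B because only the feed-forward blocks are replicated per expert; the attention and embedding parameters are shared.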
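The total-versus-active arithmetic can be written out explicitly. In the sketch below, the split between shared parameters and per-expert parameters is an assumed, illustrative one (back-solved to land near publicly quoted totals for an 8-expert, top-2 model), not a published breakdown; `moe_param_counts` is a hypothetical helper.

```python
def moe_param_counts(shared_b, per_expert_b, num_experts, top_k):
    """Return (total, active) parameter counts in billions.

    shared_b     : parameters every token uses (attention, embeddings, norms)
    per_expert_b : parameters of one expert, summed over all MoE layers
    """
    total = shared_b + num_experts * per_expert_b
    active = shared_b + top_k * per_expert_b
    return total, active

# Assumed illustrative split, in billions of parameters.
total_b, active_b = moe_param_counts(
    shared_b=1.7, per_expert_b=5.65, num_experts=8, top_k=2
)
ratio = active_b / total_b  # fraction of parameters touched per token
```

Because the shared parameters are counted once rather than once per expert, the total stays well under the naive 8 × 7B figure, while the active count grows only with $k$, not with the number of experts.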

This is why recent large language models like Mixtral (Mistral AI’s sparse MoE) and, reportedly, some GPT-4-class systems use MoE: they get the capacity of a very large dense model at a fraction of the per-token inference compute.

Conceptual Example

Imagine a customer support system with three experts:

  • Expert A: billing and payments
  • Expert B: technical troubleshooting
  • Expert C: account access

When a user message arrives, the router decides which expert is best suited. “My card was declined” routes to Expert A; “My app keeps crashing” routes to Expert B. The final answer blends their outputs as needed.

In practice, modern MoE experts are not this semantically interpretable. They often specialize in ways that are hard to name, such as handling certain syntactic patterns or token frequencies, but the routing concept is the same.

Trade-offs

  • Pros: Larger effective model capacity; lower per-token compute than a dense model with the same total parameter count; better quality at a given inference compute budget.
  • Cons: Higher memory requirements (all experts must be resident in memory even if not all are used); more complex training (load balancing, router stability); potential underutilization of experts; harder to deploy efficiently across multiple GPUs due to uneven per-expert load.

Where You Will Encounter It

MoE is most prominent in large language models, but the concept is general and predates transformers by decades. You will see it in modern models such as:

  • Mixtral 8x7B / 8x22B: open sparse MoE LLMs
  • GPT-4 (widely reported to be an MoE)
  • DeepSeek-V3 / DeepSeek-R1: sparse MoE architectures with shared experts
  • Various multimodal and vision models

A recent variant is the shared expert design: one or more always-active experts process every token, while routed experts add specialized capacity. This is intended to improve baseline quality and reduce routing instability.
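As a rough sketch of the shared expert idea, the layer output can be formed by adding the always-active expert's output to the weighted sum of the routed experts' outputs. The simple additive combination and the name `shared_expert_moe` are assumptions for illustration; actual models differ in how they scale and combine these terms.

```python
def shared_expert_moe(x, shared_expert, routed_experts, routes):
    """Add the shared expert's output to the weighted routed outputs.

    routes: list of (expert_index, weight) pairs chosen by the router.
    """
    out = list(shared_expert(x))  # always evaluated, no gating
    for idx, weight in routes:
        y = routed_experts[idx](x)  # only the routed experts that were selected
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup: the shared expert is the identity; routed experts scale the input.
shared = lambda v: v
routed = [lambda v: [10.0 * vi for vi in v], lambda v: [20.0 * vi for vi in v]]
result = shared_expert_moe([1.0, 2.0], shared, routed, routes=[(1, 0.5)])
```

Even if the router makes a poor choice for a token, the shared expert still contributes, which is one intuition for why this design may stabilize training.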