Public technical briefing from the MLNavigator Research Group.

Tags: architecture · MoE · nondeterminism · inference · governance

Mixture-of-Experts Routing Adds a Second Layer of Inference Variation

December 28, 2025 · MLNavigator Team

How MoE routing works

Mixture-of-Experts (MoE) models like Mixtral 8x7B [1] replace the standard feed-forward block in each transformer layer with multiple parallel "expert" subnetworks. A small router network selects which experts process each token. In Mixtral's case, each token is routed to 2 of 8 available experts per layer.

The model has 47 billion total parameters but only activates roughly 13 billion for any given token [1]. That is the efficiency argument for MoE: large capacity without proportional compute cost.
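The routing step described above can be sketched in a few lines. This is a simplified illustration, not Mixtral's implementation: the logits below are hypothetical, and a real router is a learned linear layer producing one logit per expert per token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Select the top-k experts for one token (k=2 of 8 in Mixtral's case).

    Returns (expert_indices, weights): the chosen experts and the softmax
    probabilities renormalized over just those experts, which determine how
    the selected experts' outputs are combined.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    weights = [probs[i] / mass for i in top]
    return top, weights

# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.3, -0.5, 1.9, 0.0, -1.2, 0.4, 0.7]
experts, weights = route_top_k(logits, k=2)
# Only the two selected experts' feed-forward blocks run for this token;
# the other six experts' parameters are not touched.
```

The key governance-relevant property is visible here: which parameters run is an output of a computation on the input, not a fixed property of the model.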

[Figure: MoE routing diagram showing the same token routed to different experts across two runs]

Where the variation enters

The router network is a learned component — it is a small neural network that produces a probability distribution over experts for each token. Two properties of this design introduce variation:

First, routing depends on the full input context. A single token change in the prompt can shift routing decisions across multiple layers, causing a different subset of the model's parameters to activate. The relationship between input perturbation and routing change is nonlinear and hard to predict.
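This sensitivity is easy to demonstrate with hypothetical numbers. When two experts' router logits are nearly tied for a token, an arbitrarily small upstream shift flips which expert runs:

```python
def top2(logits):
    """Indices of the two largest router logits. Softmax is monotonic,
    so top-k selection can be read directly off the logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:2])

# Hypothetical router logits where experts 3 and 4 are nearly tied.
base      = [0.1, 2.0, -0.3, 1.10, 1.09, 0.2, -1.0, 0.5]
perturbed = [0.1, 2.0, -0.3, 1.09, 1.10, 0.2, -1.0, 0.5]  # tiny upstream shift

assert top2(base) == [1, 3]
assert top2(perturbed) == [1, 4]  # a 0.01 logit shift changed which expert runs
```

Because routing is hard (discrete top-k), the expert assignment changes all at once rather than degrading smoothly, and the effect compounds across layers.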

Second, load balancing is approximate. MoE models use auxiliary loss functions during training to encourage balanced expert utilization, but NVIDIA's experiments with MoE architectures [2] show that despite this, "there could still be large distributional imbalances" in how tokens are assigned to experts. Some experts get overloaded while others sit idle.
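To make "auxiliary loss" concrete, here is one common formulation (the Switch Transformer style load-balancing term; other MoE models use variants of it, so treat this as a representative sketch rather than any specific model's loss):

```python
def load_balance_loss(assignments, probs, num_experts):
    """Switch Transformer-style auxiliary loss: L = N * sum_i f_i * P_i,
    where f_i is the fraction of tokens dispatched to expert i and P_i is
    the mean router probability assigned to expert i across tokens.
    Perfectly uniform routing gives the minimum value, 1.0."""
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    p = [sum(tok[i] for tok in probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced: 4 tokens spread over 4 experts with uniform router probs -> 1.0
uniform = load_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)

# Collapsed: every token sent to expert 0 with full confidence -> 4.0
collapsed = load_balance_loss([0, 0, 0, 0], [[1.0, 0, 0, 0]] * 4, num_experts=4)
```

The point for this article: the loss only *penalizes* imbalance during training; it does not enforce balance at inference time, which is why the distributional imbalances NVIDIA observed can persist.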

Combined with the floating-point nondeterminism inherent in GPU computation [3], two identical inference runs can activate different experts for the same tokens and then compute slightly different results within those experts.
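The floating-point piece is worth seeing in miniature. Addition of floats is not associative, and parallel GPU reductions do not guarantee a fixed summation order, so the same values summed in different orders can yield different results:

```python
# Floating-point addition is not associative.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # cancellation happens first, then 1.0 survives
right = a + (b + c)  # 1.0 is absorbed into -1e16 before the cancellation

assert left == 1.0
assert right == 0.0
assert left != right  # same operands, different grouping, different answer
```

When the values being summed are router logits, a grouping-dependent difference of this kind is exactly what can push a near-tied expert pair past the selection boundary.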

Why this matters for review

In a standard dense model, every parameter participates in every forward pass. The mapping from input to output is complex, but at least the set of parameters involved is fixed. In an MoE model, the set of parameters that produced a given output is itself variable. That raises three questions for any compliance reviewer:

  1. Which parameters contributed? In a dense model: all of them. In an MoE model: a subset, determined by routing, which may vary. Explaining why the model produced a particular output now requires knowing which experts were selected, not just what the input was.

  2. Can the output be reconstructed? If routing decisions aren't logged, the output can't be reconstructed even if the input and weights are known. The reviewer would need the exact expert assignments at each layer for each token — information that standard inference pipelines don't retain.

  3. How sensitive is the output to routing? Studies on routing robustness [4] indicate that standard routers can select suboptimal experts on out-of-distribution inputs, and that "when facing inputs that deviate from the training distribution, the router must extrapolate from fixed priors."

Practical implications

If you're evaluating MoE models for governed deployments, three things follow:

Routing logs are necessary for reconstruction. If your team needs the ability to replay or audit specific inference runs, the expert assignments must be captured alongside the input and output. This is logging overhead that dense models don't require.
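What such a routing log might look like, as a sketch: the schema below (run ID, layer, token index, selected experts, routing weights) is hypothetical, not a standard, but it captures the minimum information needed to replay which parameters produced an output.

```python
import json

def log_routing(run_id, layer, token_index, expert_ids, expert_weights, sink):
    """Append one routing decision to an audit sink (hypothetical schema).

    One record per (layer, token) is what makes an MoE inference run
    reconstructible: input + weights + these records determine which
    parameters participated in producing the output.
    """
    record = {
        "run_id": run_id,
        "layer": layer,
        "token_index": token_index,
        "experts": expert_ids,       # e.g. the top-2 selected expert indices
        "weights": expert_weights,   # renormalized routing weights
    }
    sink.append(json.dumps(record, sort_keys=True))

audit_log = []
log_routing("run-001", layer=0, token_index=0,
            expert_ids=[1, 3], expert_weights=[0.6, 0.4], sink=audit_log)
```

Note the volume implication: one record per token per MoE layer. For a 32-layer model generating a few hundred tokens, that is thousands of records per request, which is the "logging overhead that dense models don't require."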

Architecture selection is a governance decision. Choosing an MoE model over a dense model changes what can be reconstructed, what can be explained, and what variation you accept. That tradeoff should be made explicitly — before deployment, not discovered during an audit.

Dense models are simpler to govern. Where reproducibility and explainability are priorities, dense architectures remove one entire category of variation. The hardware-level nondeterminism still applies [5], but at least the question of which parameters contributed has a fixed answer.

MoE architectures are effective and increasingly common. The governance implications of routing variability just need to be understood before the first production inference, not after an auditor asks a question you can't answer.

Appendix: References

[1] Jiang et al., "Mixtral of Experts," arXiv:2401.04088 (2024), arxiv.org/abs/2401.04088.

[2] NVIDIA Developer Blog, "Applying Mixture of Experts in LLM Architectures," developer.nvidia.com.

[3] MLNavigator Research Group, "Nondeterminism is a Governance Failure, Not Just a Quirky Detail," nondeterminism-governance-failure.

[4] Huang et al., "Understanding the Routing Robustness in Mixture of Experts," arXiv:2601.02144 (2026), arxiv.org/html/2601.02144.

[5] MLNavigator Research Group, "Hardware Variation Governance," gpu-determinism-hardware.