This article reflects work by the MLNavigator Research Group; the deployment lives in AdapterOS.

Tags: determinism, MoE, inference, reproducibility, routing

Mixture-of-Experts Systems Quietly Amplify Nondeterminism

December 28, 2025

Most engineers who work with GPUs understand that floating-point operations can produce slightly different results across runs. This is documented behavior. What is less commonly discussed is that Mixture-of-Experts architectures introduce a second layer of variance—one that operates on top of the hardware-level nondeterminism and can amplify its effects in ways that are difficult to trace.

Why MoE Is Attractive

Mixture-of-Experts is an architectural pattern that routes each input to a subset of specialized subnetworks rather than passing it through every parameter. A model with 400 billion parameters might activate only 50 billion for any given token. The remaining parameters sit idle for that forward pass.
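
The arithmetic behind that gap is easy to sketch. The numbers below are hypothetical rather than taken from any particular model, but they show how a large total parameter count coexists with a much smaller active count:

# Illustrative MoE parameter accounting. The configuration is hypothetical,
# not taken from any real model; it only shows how total capacity and
# per-token cost diverge.
num_experts   = 30        # experts in the MoE layers (aggregated view)
expert_params = 12.5e9    # parameters per expert, summed across layers
shared_params = 25e9      # embeddings, attention, routers, other dense weights
top_k         = 2         # experts activated per token

total_params  = shared_params + num_experts * expert_params   # 400e9
active_params = shared_params + top_k * expert_params         #  50e9
print(f"total: {total_params / 1e9:.0f}B, active: {active_params / 1e9:.0f}B")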

This sparsity is the appeal. Training and inference costs scale with the number of active parameters, not the total parameter count. A sparse model can match the quality of a dense model at a fraction of the computational cost. Industry adoption reflects this advantage: Mixtral is an openly documented sparse MoE model, GPT-4 is widely reported to rely on a similar conditional computation strategy, and several other frontier models follow the same pattern.

The efficiency gains are real. So is the complexity they introduce.

[FIG 1: MoE Routing Branch Point] Caption: Input tokens flow through a router network that scores each expert. The top-k experts (here, k=2) are selected, and only those experts process the input. The routing decision is discrete: an expert is either selected or it is not.

Determinism as an Execution Property

Determinism means that identical inputs produce identical outputs. Not similar. Identical. Bit-for-bit reproducible.

This property matters for several practical reasons. Debugging requires reproducibility—isolating a failure is difficult when the failure does not reliably occur. Regression testing assumes that unchanged code produces unchanged outputs. Compliance audits may require demonstrating that a system produces consistent results. Scientific reproducibility depends on the ability to replicate experiments exactly.

Determinism is not a feature that can be toggled on. It is a property of the entire execution path: the algorithm, the implementation, the libraries, the hardware, and the runtime environment. If any component introduces variance, the system is not deterministic.

GPU execution is already non-deterministic in many common configurations. Thread scheduling, atomic operations, and algorithm selection can all introduce variance. This is documented behavior that engineers working with CUDA and cuDNN learn to either accept or mitigate.
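
To make that baseline concrete, the snippet below shows the kind of switches engineers reach for in PyTorch (one framework among several; the exact knobs vary). They constrain kernel and algorithm selection, nothing more:

import os
import torch

# Typical framework-level mitigations for GPU nondeterminism, with PyTorch
# as one example; other frameworks expose similar controls. These constrain
# kernel and algorithm selection only; they say nothing about MoE routing.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some deterministic cuBLAS paths
torch.manual_seed(0)                               # fix RNG state
torch.use_deterministic_algorithms(True)           # raise on ops without a deterministic kernel
torch.backends.cudnn.deterministic = True          # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False             # disable autotuned algorithm selection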

MoE adds another source of variance, and it operates at a different level of abstraction.

How MoE Introduces Drift

The router in an MoE system is a small neural network that takes an input and produces a score for each expert. These scores pass through a softmax function, producing a probability distribution. The top-k experts—those with the highest scores—are selected to process the input.

This selection is a discrete decision. An expert with a score of 0.5001 is selected. An expert with a score of 0.4999 is not. The gap between these scores can be arbitrarily small, but the decision is binary.
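
The whole mechanism fits in a few lines. The sketch below is deliberately minimal: NumPy, a single token, no capacity limits or load balancing, and illustrative shapes. What matters is the structure: continuous scores go in, a discrete set of expert indices comes out.

import numpy as np

def route(x, router_weights, k=2):
    # Score every expert with a linear router, normalize with a softmax,
    # then make the discrete top-k selection described above.
    logits = x @ router_weights                # one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    selected = np.argsort(probs)[::-1][:k]     # discrete: an expert is in or out
    return selected, probs[selected]

rng = np.random.default_rng(0)
hidden = rng.standard_normal(512)              # a single token's hidden state
router_w = rng.standard_normal((512, 8))       # linear router over 8 experts
experts, gates = route(hidden, router_w)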

[FIG 2: Near-Tie Routing Sensitivity] Caption: Two experts with nearly identical router scores. Expert A scores 0.5001; Expert B scores 0.4999. Expert A is selected. A change in the fourth decimal place of the router's internal computation could reverse this decision.

Floating-point arithmetic affects the router's computation just as it affects any other neural network operation. Small differences in the order of operations, rounding behavior, or intermediate precision can shift the router scores by amounts that are normally negligible—but not when those scores are near a tie.

Consider what happens when hardware-level nondeterminism perturbs the router's computation. On one run, Expert 3 scores 0.2501 and Expert 7 scores 0.2499. Expert 3 is selected. On the next run, the same input produces scores of 0.2499 and 0.2501. Expert 7 is selected instead.

The input was identical. The model weights were identical. The selected expert was different.

This is not a malfunction. The router scores were always within floating-point tolerance of each other. The tie was real. The hardware simply resolved it differently on different runs.
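
The effect is easy to reproduce in miniature. The toy below is not a router; it only shows that the same dot product, accumulated in two different orders, can land on slightly different float32 values, and that any comparison falling inside that gap is settled by accumulation order rather than by the input.

import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(4096).astype(np.float32)
w = rng.standard_normal(4096).astype(np.float32)

# The same mathematical dot product, accumulated in two different orders.
# Exact arithmetic says the results are equal; float32 usually disagrees.
score_fwd = np.sum(x * w)
score_rev = np.sum((x * w)[::-1])
print(score_fwd == score_rev)        # often False
print(abs(score_fwd - score_rev))    # a tiny gap, harmless almost everywhere

# If a competing expert's score happens to fall inside that gap, the argmax,
# and with it the routing decision, is settled by accumulation order rather
# than by the input.
competing = (float(score_fwd) + float(score_rev)) / 2.0
print(np.argmax([score_fwd, competing]),   # may print 0 ...
      np.argmax([score_rev, competing]))   # ... while this prints 1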

Why Drift Is a Systems Problem

The consequences of routing variance depend on what the experts have learned. If two experts produce similar outputs for similar inputs, routing variance may be benign—the final output changes negligibly. But if the experts have specialized to handle different aspects of the input distribution, routing to the wrong expert can produce noticeably different behavior.

Debugging becomes harder. A failure that occurs only when a specific expert is selected may not reproduce reliably. The engineer sees intermittent failures with no obvious pattern. The routing decision, buried inside the model's forward pass, is not typically logged or inspected.

[FIG 3: Execution Drift Across Runs] Caption: The same input processed twice by the same MoE model. Run A routes through Experts 2 and 5. Run B routes through Experts 2 and 7. The outputs differ, though both runs used identical inputs and model weights.

Regression testing becomes statistical rather than deterministic. A test that compares outputs to a golden reference must allow for some tolerance. The tolerance threshold is difficult to set correctly. If routing variance causes outputs to vary by 0.1%, a threshold of 0.01% will produce false failures. A threshold of 1% will miss real regressions.
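
In practice this often becomes a tolerance check against a stored reference, along the lines of the sketch below. The function name and the tolerances are placeholders; choosing the tolerances well is exactly the problem described above.

import numpy as np

def check_against_golden(output, golden, rtol=1e-3, atol=1e-5):
    # Statistical rather than exact comparison: pass when the new output is
    # "close enough" to the stored reference. Too tight a tolerance and
    # routing variance produces false failures; too loose and real
    # regressions slip through.
    output, golden = np.asarray(output), np.asarray(golden)
    max_diff = float(np.max(np.abs(output - golden)))
    return bool(np.allclose(output, golden, rtol=rtol, atol=atol)), max_diff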

Reproducibility requires more than saving model weights. A complete specification must include the routing decisions for each input, or the system must guarantee that routing decisions are stable. Most systems do neither.
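
One illustrative way to narrow that gap is to log the routing path alongside the output so that two runs can at least be compared after the fact. The record format here is invented for the example:

import hashlib
import json

def routing_record(input_id, experts_per_layer):
    # experts_per_layer maps an MoE layer index to the expert indices it
    # selected, e.g. {0: [2, 5], 1: [2, 7]}. The digest makes run-to-run
    # comparison cheap: equal digests mean identical routing, and unequal
    # digests localize where the drift occurred.
    payload = {
        "input_id": input_id,
        "experts": {str(layer): sorted(ids) for layer, ids in experts_per_layer.items()},
    }
    payload["route_digest"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return payload

record = routing_record("req-0042", {0: [2, 5], 1: [2, 7]})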

Silent failures can occur when the system produces plausible outputs through the wrong expert path. The output passes validation checks. The downstream application consumes it. No error is raised. The system behaves differently than expected, but the difference is not detected until much later—or never.

Industry Norms and Design Tradeoffs

Most production MoE systems accept routing variance as a design tradeoff. The performance benefits of sparse activation outweigh the reproducibility costs for many use cases. Inference throughput matters more than bit-exact reproducibility.

This is a reasonable tradeoff for applications where statistical correctness is sufficient. A language model generating text does not need to produce the same tokens every time. A recommendation system does not need perfect reproducibility.

The tradeoff is less reasonable for applications where audit trails, debugging, or compliance require deterministic behavior. Financial systems, medical applications, and safety-critical deployments may need guarantees that standard MoE implementations do not provide.

Deterministic adapter routing is one approach to this problem. Rather than allowing the router to make probabilistic decisions that can vary between runs, the routing logic is designed to produce identical decisions for identical inputs regardless of hardware variance. AdapterOS implements this approach, ensuring that the same input always routes to the same adapter path. The routing decision becomes a stable property of the input, not an artifact of execution order.

This is not the only approach, and it involves its own tradeoffs: deterministic routing may constrain the flexibility of the routing policy or require additional computation to ensure stability. (The AdapterOS mechanism is patent pending.)
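
As a generic illustration of the idea, and not a description of the AdapterOS mechanism, one way to make selection a stable function of the input is to quantize the scores before comparing them and break any remaining ties by expert index:

import numpy as np

def stable_top_k(scores, k=2, decimals=3):
    # Generic sketch, not AdapterOS's implementation: round the scores so
    # that sub-tolerance perturbations cannot reorder experts, then break
    # exact ties by expert index. Selection becomes a pure function of the
    # rounded scores. The caveat is that quantization moves the instability
    # to the rounding boundary rather than eliminating it outright.
    q = np.round(np.asarray(scores, dtype=np.float64), decimals)
    order = sorted(range(len(q)), key=lambda i: (-q[i], i))
    return order[:k]

run_a = [0.2501, 0.1000, 0.2499, 0.1500]   # scores observed on one run
run_b = [0.2499, 0.1000, 0.2501, 0.1500]   # same input, low bits perturbed
print(stable_top_k(run_a), stable_top_k(run_b))   # both [0, 2]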

The point is not that one approach is universally correct. The point is that routing variance exists, that it has consequences, and that engineers building MoE systems should understand where that variance comes from and whether it matters for their application.


Mixture-of-Experts is an efficient architecture. Efficiency comes from sparse activation, and sparse activation requires routing decisions. Routing decisions are discrete, and discrete decisions made from continuous scores are sensitive to small perturbations. The perturbations come from the same hardware-level nondeterminism that affects all GPU computation. The amplification comes from the architecture itself.

This is not a flaw to be fixed. It is a property to be understood.