Public technical briefing from the MLNavigator Research Group.

Tags: architecture · MoE · nondeterminism · inference · governance

Mixture-of-Experts Routing Adds a Second Layer of Inference Variation

December 28, 2025 · MLNavigator Team

How MoE routing works

Mixture-of-Experts (MoE) models like Mixtral 8x7B [1] replace the standard feed-forward block in each transformer layer with multiple parallel "expert" subnetworks. A small router network selects which experts process each token. In Mixtral's case, each token is routed to 2 of 8 available experts per layer.

The model has 47 billion total parameters but only activates roughly 13 billion for any given token [1]. That is the efficiency argument for MoE: large capacity without proportional compute cost.
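The routing step described above can be sketched in a few lines. This is a simplified illustration, not Mixtral's implementation: the logits below are hypothetical, and a real router is a learned linear layer producing one logit per expert per token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Select the top-k experts for one token (k=2 of 8 in Mixtral's case).

    Returns (expert_indices, weights): the chosen experts and the softmax
    probabilities renormalized over just those experts, which determine how
    the selected experts' outputs are combined.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    weights = [probs[i] / mass for i in top]
    return top, weights

# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.3, -0.5, 1.9, 0.0, -1.2, 0.4, 0.7]
experts, weights = route_top_k(logits, k=2)
# Only the two selected experts' feed-forward blocks run for this token;
# the other six experts' parameters are not touched.
```

The key governance-relevant property is visible here: which parameters run is an output of a computation on the input, not a fixed property of the model.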

[Figure: MoE routing diagram showing the same token routed to different experts across two runs]

Where the variation enters

The router network is a learned component — it is a small neural network that produces a probability distribution over experts for each token. Two properties of this design introduce variation:

First, routing depends on the full input context. A single token change in the prompt can shift routing decisions across multiple layers, causing a different subset of the model's parameters to activate. The relationship between input perturbation and routing change is nonlinear and hard to predict.
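This sensitivity is easy to demonstrate with hypothetical numbers. When two experts' router logits are nearly tied for a token, an arbitrarily small upstream shift flips which expert runs:

```python
def top2(logits):
    """Indices of the two largest router logits. Softmax is monotonic,
    so top-k selection can be read directly off the logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:2])

# Hypothetical router logits where experts 3 and 4 are nearly tied.
base      = [0.1, 2.0, -0.3, 1.10, 1.09, 0.2, -1.0, 0.5]
perturbed = [0.1, 2.0, -0.3, 1.09, 1.10, 0.2, -1.0, 0.5]  # tiny upstream shift

assert top2(base) == [1, 3]
assert top2(perturbed) == [1, 4]  # a 0.01 logit shift changed which expert runs
```

Because routing is hard (discrete top-k), the expert assignment changes all at once rather than degrading smoothly, and the effect compounds across layers.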

Second, load balancing is approximate. MoE models use auxiliary loss functions during training to encourage balanced expert utilization, but NVIDIA's experiments with MoE architectures [2] show that despite this, "there could still be large distributional imbalances" in how tokens are assigned to experts. Some experts get overloaded while others sit idle.
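To make "auxiliary loss" concrete, here is one common formulation (the Switch Transformer style load-balancing term; other MoE models use variants of it, so treat this as a representative sketch rather than any specific model's loss):

```python
def load_balance_loss(assignments, probs, num_experts):
    """Switch Transformer-style auxiliary loss: L = N * sum_i f_i * P_i,
    where f_i is the fraction of tokens dispatched to expert i and P_i is
    the mean router probability assigned to expert i across tokens.
    Perfectly uniform routing gives the minimum value, 1.0."""
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    p = [sum(tok[i] for tok in probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced: 4 tokens spread over 4 experts with uniform router probs -> 1.0
uniform = load_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)

# Collapsed: every token sent to expert 0 with full confidence -> 4.0
collapsed = load_balance_loss([0, 0, 0, 0], [[1.0, 0, 0, 0]] * 4, num_experts=4)
```

The point for this article: the loss only *penalizes* imbalance during training; it does not enforce balance at inference time, which is why the distributional imbalances NVIDIA observed can persist.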

Combined with the floating-point nondeterminism inherent in GPU computation [3], two identical inference runs can activate different experts for the same tokens and then compute slightly different results within those experts.
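The floating-point piece is worth seeing in miniature. Addition of floats is not associative, and parallel GPU reductions do not guarantee a fixed summation order, so the same values summed in different orders can yield different results:

```python
# Floating-point addition is not associative.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # cancellation happens first, then 1.0 survives
right = a + (b + c)  # 1.0 is absorbed into -1e16 before the cancellation

assert left == 1.0
assert right == 0.0
assert left != right  # same operands, different grouping, different answer
```

When the values being summed are router logits, a grouping-dependent difference of this kind is exactly what can push a near-tied expert pair past the selection boundary.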

Why this matters for review

In a standard dense model, every parameter participates in every forward pass. The mapping from input to output is complex, but at least the set of parameters involved is fixed. In an MoE model, the set of parameters that produced a given output is itself variable. That raises three questions for any compliance reviewer:

  1. Which parameters contributed? In a dense model: all of them. In an MoE model: a subset, determined by routing, which may vary. Explaining why the model produced a particular output now requires knowing which experts were selected, not just what the input was.

  2. Can the output be reconstructed? If routing decisions aren't logged, the output can't be reconstructed even if the input and weights are known. The reviewer would need the exact expert assignments at each layer for each token — information that standard inference pipelines don't retain.

  3. How sensitive is the output to routing? Studies on routing robustness [4] indicate that standard routers can select suboptimal experts on out-of-distribution inputs, and that "when facing inputs that deviate from the training distribution, the router must extrapolate from fixed priors."

Practical implications

If you're evaluating MoE models for governed deployments, three things follow:

Routing logs are necessary for reconstruction. If your team needs the ability to replay or audit specific inference runs, the expert assignments must be captured alongside the input and output. This is logging overhead that dense models don't require.
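What such a routing log might look like, as a sketch: the schema below (run ID, layer, token index, selected experts, routing weights) is hypothetical, not a standard, but it captures the minimum information needed to replay which parameters produced an output.

```python
import json

def log_routing(run_id, layer, token_index, expert_ids, expert_weights, sink):
    """Append one routing decision to an audit sink (hypothetical schema).

    One record per (layer, token) is what makes an MoE inference run
    reconstructible: input + weights + these records determine which
    parameters participated in producing the output.
    """
    record = {
        "run_id": run_id,
        "layer": layer,
        "token_index": token_index,
        "experts": expert_ids,       # e.g. the top-2 selected expert indices
        "weights": expert_weights,   # renormalized routing weights
    }
    sink.append(json.dumps(record, sort_keys=True))

audit_log = []
log_routing("run-001", layer=0, token_index=0,
            expert_ids=[1, 3], expert_weights=[0.6, 0.4], sink=audit_log)
```

Note the volume implication: one record per token per MoE layer. For a 32-layer model generating a few hundred tokens, that is thousands of records per request, which is the "logging overhead that dense models don't require."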

Architecture selection is a governance decision. Choosing an MoE model over a dense model changes what can be reconstructed, what can be explained, and what variation you accept. That tradeoff should be made explicitly — before deployment, not discovered during an audit.

Dense models are simpler to govern. Where reproducibility and explainability are priorities, dense architectures remove one entire category of variation. The hardware-level nondeterminism still applies [5], but at least the question of which parameters contributed has a fixed answer.

MoE architectures are effective and increasingly common. The governance implications of routing variability just need to be understood before the first production inference, not after an auditor asks a question you can't answer.

Appendix: References

[1] Jiang et al., "Mixtral of Experts," arXiv:2401.04088 (2024), arxiv.org/abs/2401.04088.

[2] NVIDIA Developer Blog, "Applying Mixture of Experts in LLM Architectures," developer.nvidia.com.

[3] MLNavigator Research Group, "Nondeterminism is a Governance Failure, Not Just a Quirky Detail," nondeterminism-governance-failure.

[4] Huang et al., "Understanding the Routing Robustness in Mixture of Experts," arXiv:2601.02144 (2026), arxiv.org/html/2601.02144.

[5] MLNavigator Research Group, "Hardware Variation Governance," gpu-determinism-hardware.