
Public technical briefing from the MLNavigator Research Group.


When the Same Model Gives Different Answers, Governance Breaks

January 08, 2026·MLNavigator Team

The problem in one sentence

As NVIDIA's framework determinism guidelines detail, running the same prompt through the same model on the same GPU twice can produce different outputs, because the order of parallel execution varies between runs. In any environment where someone needs to reconstruct what happened and why, we believe that is a material problem.

[Figure: the same model and input producing different outputs across two GPU runs, due to thread scheduling and floating-point rounding]

Where the variation comes from

Modern GPU inference pipelines are not deterministic by default. The root cause is floating-point non-associativity (FPNA): the order in which floating-point additions execute changes the result, and parallel hardware does not guarantee execution order.
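FPNA can be demonstrated without any GPU at all. The sketch below uses plain Python doubles, with values chosen to make the rounding visible (they are illustrative, not taken from any real kernel):

```python
# Floating-point addition is not associative: regrouping the same
# operands changes which low-order bits are lost to rounding.
vals = [1e16, 1.0, -1e16, 1.0]

# Left-to-right order: the lone 1.0 is absorbed by 1e16 and lost.
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]   # 1.0

# Reordered: the large terms cancel first, so both 1.0s survive.
reordered = ((vals[0] + vals[2]) + vals[1]) + vals[3]       # 2.0

print(left_to_right, reordered)
```

The mathematical sum is identical in both groupings; only the execution order differs, which is exactly the degree of freedom a GPU scheduler exercises.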

Three specific mechanisms contribute:

Parallel reduction ordering. When a GPU sums thousands of partial products across cores, the order depends on thread scheduling. Different scheduling produces different rounding sequences, which propagate through subsequent layers.

Atomic operations in convolution kernels. NVIDIA's cuDNN documentation explicitly lists several routines that "use atomic operations in a way that introduces truly random floating-point rounding errors." These include backward-pass convolution and max-pooling routines, some of which also run under certain inference configurations.

Tensor Core precision differences. Operations using Tensor Cores follow a different floating-point pipeline than scalar CUDA cores, introducing another source of variation. The hardware-level details matter when deployments span GPU architectures.

These are architectural properties of parallel floating-point hardware, baked in at the silicon level.
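The first mechanism, reduction ordering, can be simulated on a CPU by summing the same partial products in two different orders. This is a sketch, not GPU code; the shuffle stands in for nondeterministic thread scheduling:

```python
import random

random.seed(0)  # fixed seed so the demonstration itself is repeatable
partials = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

# One "schedule": accumulate in the original order.
order_a = sum(partials)

# Another "schedule": accumulate the identical values after a shuffle.
shuffled = list(partials)
random.shuffle(shuffled)
order_b = sum(shuffled)

# Mathematically equal sums, numerically different in the low bits.
print(order_a == order_b)       # typically False
print(abs(order_a - order_b))   # tiny, but nonzero
```

The deviation here is at the level of the last few bits, but in a deep network each layer's output feeds the next, so these rounding differences can compound and eventually flip a sampled token.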

Why we believe this matters for governance

Audit frameworks in regulated environments often assume you can answer a basic question: given the same inputs, would the system produce the same output? If the answer is "usually, but not always, and we can't predict when it will differ," we see several governance assumptions breaking down.

Incident reconstruction hits a gap first. If a deployment produced an output that led to a downstream decision, and the system can't reliably reproduce that output, we find that reviewers struggle to distinguish between "the model behaved correctly given its inputs" and "something changed."

Change validation breaks next. Teams that validate model deployments by comparing outputs across environments can't trust exact-match comparisons. A mismatch might be a real regression or it might be FPNA noise. Without a tolerance framework, every mismatch triggers manual review.

Certification evidence gets harder to define. Compliance programs that require documented system behavior face a definitional problem — the system's behavior includes a nondeterministic component that can't be fully documented in advance.
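One way to operationalize the tolerance framework mentioned above is to replace exact-match checks with a bounded comparison over model scores. A minimal sketch in plain Python; the function name and thresholds are illustrative, not from any standard:

```python
import math

def within_tolerance(ref, new, rel=1e-5, abs_tol=1e-6):
    """Compare two score/logit sequences under a documented tolerance
    instead of requiring bit-exact equality. Illustrative helper."""
    if len(ref) != len(new):
        return False
    return all(math.isclose(r, n, rel_tol=rel, abs_tol=abs_tol)
               for r, n in zip(ref, new))

# FPNA-scale noise passes; a real regression does not.
baseline  = [0.1234567, -2.5000000, 7.8900001]
fpna_run  = [0.1234568, -2.5000001, 7.8900000]  # last-digit drift
regressed = [0.1230000, -2.4000000, 7.8900001]  # genuine change

print(within_tolerance(baseline, fpna_run))   # True
print(within_tolerance(baseline, regressed))  # False
```

The thresholds themselves become the policy artifact: choosing and documenting them is the "operational policy decision" the tolerance approach requires.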

What "deterministic mode" actually costs

PyTorch offers a deterministic mode that forces operations to use reproducible implementations. The framework's own documentation is direct about the tradeoff: "deterministic operations are often slower than nondeterministic operations." Some operations have no deterministic implementation at all and will raise errors.
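Enabling that mode looks roughly like the following. These are real PyTorch APIs; the environment variable is required for deterministic cuBLAS on CUDA 10.2+, and operations with no deterministic implementation will raise a RuntimeError once the flag is set:

```python
import os

# Must be set before CUDA libraries initialize; required for
# deterministic cuBLAS behavior on CUDA 10.2 and later.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

# Ask PyTorch to use deterministic kernels everywhere, raising an
# error on operations that have no deterministic implementation.
torch.use_deterministic_algorithms(True)

# Disable cuDNN autotuning, which can select different kernels per run.
torch.backends.cudnn.benchmark = False
```

Note that this pins only the framework layer; the same flags on a different GPU architecture, driver, or library version do not guarantee matching bits across machines.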

NVIDIA's GTC 2019 guidance is more specific: bit-exact reproducibility requires locking six layers of the software and hardware stack simultaneously. Change any one and the guarantee disappears.

For organizations that also need to run inference on Mixture-of-Experts architectures, the problem compounds further — routing decisions add another layer of variation on top of the numerical one.

What to do about it

There's no clean fix that preserves both performance and perfect reproducibility. The tradeoffs are real, and they require explicit choices:

Lock the full software stack or define a tolerance. If bit-exact reproduction is required, the entire inference stack must be version-controlled and deployment-pinned — see the full requirements list. If exact reproduction is not required, define what deviation is acceptable and document the rationale as an operational policy decision.

Log more than the output. If the system cannot guarantee identical outputs, the operational record needs to capture enough context to explain expected variation. That means logging not just the prompt and response, but the specific software versions, hardware identifiers, and inference configuration active at the time.

Treat nondeterminism as a declared property. Rather than pretending the system is deterministic, document it as a system characteristic and design review processes around it. An auditor who encounters unexplained variation is in a worse position than one who encounters documented, bounded variation.
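A sketch of what a richer operational record might capture, using only the standard library. The schema and field names are hypothetical, and in a real deployment the hardware and version probes would come from the serving stack rather than `platform`:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def build_inference_record(prompt, response, config):
    """Bundle an output with the context needed to explain
    run-to-run variation later. Hypothetical schema, not a standard."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "inference_config": config,          # e.g. precision, batch size, seed
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        # A real record would also capture: model weights hash, framework
        # and driver versions, GPU model/UUID, kernel and library build IDs.
    }

record = build_inference_record(
    prompt="What is our refund policy?",
    response="...",
    config={"precision": "fp16", "batch_size": 8, "temperature": 0.0},
)
print(json.dumps(record, indent=2))
```

With a record like this, a reviewer who sees two differing outputs can at least confirm that the stack was identical, which narrows the explanation to expected numerical variation.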

The hardware-level dimensions go deeper still, particularly when deployments span different GPU architectures.