
This article reflects MLNavigator Research Group work. Deployment lives in adapterOS.

Tags: determinism, GPU, CUDA, cuDNN, reproducibility, operations

GPU Determinism: What Is Guaranteed, What Is Not, and What to Control

February 11, 2026 · MLNavigator Team

If two runs use the same prompt, same model, and same seed but produce slightly different numeric values, that is usually not an application bug. It is expected behavior from parallel floating-point execution unless determinism controls are explicitly enabled.

For regulated teams, this matters because review, incident response, and audit workflows depend on knowing which parts of execution are deterministic and which parts are variance-tolerant.

Why This Matters for Operators and Reviewers

  • Engineering teams can separate expected numeric variance from true regressions.
  • Security and compliance teams can review explicit determinism scope instead of vague reproducibility claims.
  • Program teams across suppliers, integrators, and regulated operators can align tolerance policy with operational risk.
  • Audit artifacts become stronger when determinism controls are versioned and bound into execution receipts.

Scope First: Determinism Is Not All-or-Nothing

In production AI systems, determinism is usually scoped, not universal.

  • Some computations must be repeatable because they affect governance outcomes (for example, adapter routing or policy gates).
  • Other computations can tolerate bounded variance if that variance is documented.

That is why adapterOS treats determinism as a policy boundary. This note focuses on the hardware/runtime side of variance. For policy-scoped control design, see kernel allow-lists and Q15 commit boundaries.

Where GPU Variance Comes From

1. Parallel execution changes operation order

Floating-point arithmetic is not associative. Reordering additions can change rounded intermediate values, even when the mathematical expression is equivalent.

On GPUs, block scheduling and warp timing are optimized for throughput, not stable global ordering. That can change the sequence of reductions between runs.

Figure: reduction order changes the rounding path. Run A groups the sum as (x1+x2) + (x3+x4) = sum_A; Run B groups it as (x1+x3) + (x2+x4) = sum_B. Same inputs, different reduction path, slightly different rounded result.
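The effect is easy to reproduce on a CPU in pure Python, because it comes from IEEE 754 rounding, not from anything GPU-specific:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3
run_a = (a + b) + c   # 0.6000000000000001
run_b = a + (b + c)   # 0.6
print(run_a == run_b)  # False

# Cancellation makes order sensitivity dramatic.
vals = [1e16, 1.0, 1.0, -1e16]
sum_a = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # each 1.0 is absorbed by 1e16 -> 0.0
sum_b = vals[0] + ((vals[1] + vals[2]) + vals[3])  # the 2.0 survives -> 2.0
print(sum_a, sum_b)  # 0.0 2.0
```

A GPU reduction does the same thing at scale: each scheduling of the reduction tree picks one grouping out of many, and each grouping can round differently.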

2. Atomic accumulation is serialized, but not globally ordered

Operations that use atomicAdd prevent memory races, but they do not enforce a deterministic global update order across all threads. The total is valid within floating-point tolerance, yet least-significant bits can differ between runs.
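A CPU-side analogy (a sketch, not GPU code): accumulating the same values in different orders yields totals that agree within tolerance but can differ in the least-significant bits, whereas an exactly rounded reduction such as `math.fsum` is order-independent:

```python
import math
import random

random.seed(1234)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def accumulate(values):
    # Naive running sum, like an unordered chain of atomic adds.
    total = 0.0
    for v in values:
        total += v
    return total

forward = accumulate(xs)
backward = accumulate(list(reversed(xs)))

# The totals agree within floating-point tolerance...
assert abs(forward - backward) < 1e-8
# ...while an exactly rounded sum does not depend on order at all.
assert math.fsum(xs) == math.fsum(reversed(xs))
```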

3. Library autotuning can select different kernels

Frameworks and libraries often benchmark kernel options and choose the fastest one for current conditions. Different choices can mean different operation order and slightly different numeric results.

Determinism Controls That Actually Work

The most reliable approach is layered control, not a single flag.

| Control Layer | What to Lock | Why It Matters |
| --- | --- | --- |
| Environment | GPU model, driver, CUDA/cuDNN/cuBLAS versions | Removes cross-environment drift |
| Runtime policy | Deterministic kernels where required | Constrains operation-order variance |
| Execution topology | Streams, batching behavior, seed handling | Avoids schedule-dependent divergence |
| Evidence | Receipt fields for policy/version/tolerance | Makes claims auditable |
Figure: determinism control stack.

  • Layer 1: Hardware + software baseline (version-locked)
  • Layer 2: Kernel policy for compliance-critical decision paths
  • Layer 3: Runtime topology (streaming, batching, seed policy)
  • Layer 4: Receipt evidence (policy hash, tolerance, route signature)

Repeatability claims are only as strong as all layers combined.
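As one illustration of the evidence layer, a receipt could bind the determinism policy to a stable hash. This is a hypothetical sketch: the field names (`policy_hash`, `tolerance`) and the version strings shown are illustrative, not an adapterOS schema:

```python
import hashlib
import json

# Hypothetical determinism policy record; in practice the version fields
# would be read from the live environment rather than hard-coded.
policy = {
    "gpu_model": "A100-SXM4-80GB",   # illustrative value
    "driver": "550.54.15",           # illustrative value
    "cuda": "12.4",                  # illustrative value
    "cudnn_deterministic": True,
    "cudnn_benchmark": False,
    "deterministic_algorithms": True,
    "tolerance": 0.0,                # 0.0 = bitwise repeatability claimed
}

# Canonical JSON -> stable hash: the same policy always hashes the same.
canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
policy_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

receipt = {"policy_hash": policy_hash, "tolerance": policy["tolerance"]}
print(receipt["policy_hash"][:16])
```

Because the JSON is canonicalized (sorted keys, fixed separators), any change to a single policy field changes the hash, which is what makes the receipt reviewable.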

Practical Baseline (PyTorch/CUDA)

import os
import torch

# cuBLAS needs a fixed workspace size for deterministic GEMMs (CUDA >= 10.2);
# without this, use_deterministic_algorithms(True) raises on some cuBLAS ops.
# Must be set before the first cuBLAS call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Reproducible seed handling
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Prevent runtime benchmark-driven algorithm drift
torch.backends.cudnn.benchmark = False

# Enforce deterministic implementations where available
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)

This baseline is useful for debugging and controlled replay. For production, teams often apply stricter controls only where determinism materially affects governance outcomes, then document tolerance elsewhere.
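When replaying under such a baseline, the comparison rule should match the determinism claim: bitwise equality where strict mode is enabled, a documented tolerance elsewhere. A minimal sketch (`verify_replay` is a hypothetical helper, not a PyTorch API):

```python
import math

def verify_replay(reference, replay, strict=False, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two runs' outputs under the declared determinism policy."""
    if len(reference) != len(replay):
        return False
    if strict:
        # Strict mode claims bitwise repeatability: exact equality only.
        return all(a == b for a, b in zip(reference, replay))
    # Scoped mode: bounded, documented variance is acceptable.
    return all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, b in zip(reference, replay))

print(verify_replay([0.6], [0.6000000000000001], strict=True))   # False
print(verify_replay([0.6], [0.6000000000000001], strict=False))  # True
```

The tolerance values used in scoped mode are exactly what belongs in the receipt evidence described above, so a reviewer can reproduce the comparison.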

Cost and Tradeoff

Determinism has a real throughput and latency cost in many workloads. Some kernels have slower deterministic implementations; some operations may fail under strict mode if deterministic implementations are unavailable.

The right decision is workload-specific:

  • Strict mode for forensics, incident replay, and compliance-sensitive decision points.
  • Scoped mode for production pathways where bounded variance is acceptable and documented.
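This workload-specific choice can be expressed as a small policy table. A hypothetical sketch (the path names and the `DeterminismMode` enum are illustrative, not an adapterOS interface):

```python
from enum import Enum

class DeterminismMode(Enum):
    STRICT = "strict"   # bitwise repeatability required
    SCOPED = "scoped"   # bounded, documented variance acceptable

# Hypothetical mapping from decision path to required mode.
POLICY = {
    "adapter_routing": DeterminismMode.STRICT,   # governance outcome
    "policy_gate": DeterminismMode.STRICT,       # compliance-sensitive
    "bulk_inference": DeterminismMode.SCOPED,    # throughput-sensitive
}

def mode_for(path: str) -> DeterminismMode:
    # Default to STRICT so unlisted paths fail safe.
    return POLICY.get(path, DeterminismMode.STRICT)

print(mode_for("bulk_inference").value)  # scoped
print(mode_for("unlisted_path").value)   # strict
```

Defaulting unknown paths to strict mode trades throughput for safety, which matches the audit posture described above.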

Scope of This Note

This note covers hardware/runtime variance mechanics only. Policy-scoped determinism controls, such as kernel allow-lists and Q15 commit boundaries, are covered in separate adapterOS notes.

Bottom Line

GPU nondeterminism is a system property, not a failure mode by default. The engineering task is to make determinism explicit: control it where required, bound it where acceptable, and record that boundary in evidence that reviewers can verify.