This article reflects work by the MLNavigator Research Group. The deployment lives in AdapterOS.

Tags: determinism, CUDA, training, reproducibility

GPUs Do Not Promise Determinism. Here Is the Hardware Reason.

January 10, 2026

You run training twice. Same code. Same data. Same seed. Different loss curves.

This is not a bug in your code. This is a property of the hardware.

Determinism Is an Ordering Contract

When we say a computation is deterministic, we mean: given identical inputs, the outputs are identical. Not close. Identical. Bit-for-bit.

For this to hold on a GPU, every floating-point operation must execute in the same order, every time. The order matters because floating-point arithmetic is not associative. The expression (a + b) + c can produce a different result than a + (b + c).

Determinism, then, is not a property of the algorithm. It is a property of the execution order. And GPUs do not guarantee that order.

CUDA Does Not Guarantee Thread Order

A GPU executes thousands of threads in parallel. These threads are grouped into blocks, and blocks are grouped into a grid. The CUDA programming model gives you abstractions for this hierarchy. What it does not give you is a promise about when anything runs.

The CUDA Programming Guide states explicitly: threads within a block can synchronize via barriers, but "threads in different thread blocks cannot synchronize with each other" except by terminating the kernel. There is no ordering guarantee between blocks.

[FIG 1: Block Scheduling Diagram] Caption: Thread blocks are distributed across streaming multiprocessors (SMs) in hardware-determined order. Block 7 may execute before Block 2.

This means if you launch a kernel with 256 blocks, the order in which those blocks execute is determined by the hardware scheduler at runtime. It can vary between runs. It can vary between devices. It can vary because a background process briefly consumed an SM.

Blocks Execute in Any Order

The GPU scheduler assigns blocks to streaming multiprocessors (SMs) as resources become available. A block that finishes early frees its SM for another block. This is efficient. It is also non-deterministic.

Consider a reduction operation: summing a million elements. A naive parallel implementation divides the array into chunks, sums each chunk in a separate block, and then combines the partial sums. If Block 3 finishes before Block 1, and both write their partial sums to global memory, the order of those writes is not guaranteed.

When the final reduction reads those partial sums, it may read them in a different order than the previous run. Different order means different floating-point result.
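
The effect is easy to simulate on the CPU. The sketch below is an illustrative stand-in for that reduction, not GPU code: it computes one float32 partial sum per "block" and combines them in two different completion orders. The next section explains why the totals differ; here you can watch them differ.

import numpy as np

# Illustrative stand-in for the parallel reduction described above:
# one float32 partial sum per "block", combined in two completion orders.
rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000).astype(np.float32)

chunks = np.split(data, 1000)                      # one chunk per "block"
partials = [c.sum(dtype=np.float32) for c in chunks]

def combine(order):
    total = np.float32(0.0)
    for i in order:
        total = np.float32(total + partials[i])    # round after every add
    return total

print(combine(range(1000)))                        # blocks finish in order
print(combine(rng.permutation(1000)))              # blocks finish out of order
# The two totals typically differ in the last few bits.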

Floating-Point Arithmetic Is Order-Sensitive

This is the crux. IEEE 754 floating-point arithmetic rounds after every operation. The order in which you perform additions determines which intermediate results get rounded and how.

A concrete example. Consider three 32-bit floats:

a = 1.0
b = 2**-24   (about 5.96e-8, half the float32 machine epsilon)
c = 2**-24

Compute (a + b) + c:

a + b = 1.0000000596...  (exactly halfway between 1.0 and the next float; ties round to the even neighbor, 1.0)
result + c = 1.0 + 2**-24 = 1.0  (same rounding, b and c are both absorbed)

Compute a + (b + c):

b + c = 2**-23  (exactly representable)
a + result = 1.0 + 2**-23 = 1.0000001192092896

The difference is small. But it compounds. A neural network backpropagation pass performs millions of such operations. Small differences in gradient accumulation produce different weight updates. Different weight updates produce different models.
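
You can check the worked example on the CPU with NumPy; the same IEEE 754 rounding rules apply:

import numpy as np

# The worked example above, in float32 on the CPU.
a = np.float32(1.0)
b = np.float32(2.0 ** -24)   # half the float32 machine epsilon
c = np.float32(2.0 ** -24)

print((a + b) + c)   # 1.0        -- b and c are absorbed one at a time
print(a + (b + c))   # 1.0000001  -- b + c together survives the rounding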

[FIG 2: Animated Reduction Ordering] Caption: Animation showing two execution orders for a parallel sum. Left path: blocks 0,1,2,3 complete in order. Right path: blocks complete as 2,0,3,1. Final sums differ by 2 ULP.

[MOTION GRAPHIC 1: Floating-point accumulation with different orderings, showing intermediate values and rounding]

cuDNN Uses Non-Deterministic Algorithms

NVIDIA's cuDNN library provides optimized implementations of common deep learning operations. Some of these implementations are non-deterministic by design.

The cuDNN documentation states: "Results... are not guaranteed to be bitwise reproducible across runs" for certain algorithms. Specifically, backward convolution operations that use atomicAdd for weight gradient accumulation are non-deterministic.

atomicAdd is a GPU instruction that adds a value to a memory location atomically. Multiple threads can call atomicAdd on the same address simultaneously. The hardware serializes these additions, but the order of serialization is not guaranteed. Different order, different result.

The cuDNN Developer Guide explicitly lists which algorithms are deterministic:

Operation                   | Deterministic Algorithms
Convolution Forward         | IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, DIRECT, FFT, FFT_TILING, WINOGRAD, WINOGRAD_NONFUSED
Convolution Backward Data   | Same as forward
Convolution Backward Filter | ALGO_1 only (others use atomicAdd)

Most backward filter algorithms are non-deterministic. If you want determinism, you must explicitly request CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, and accept the performance cost.

cuBLAS Has Its Own Rules

cuBLAS, NVIDIA's linear algebra library, has separate reproducibility guarantees. The cuBLAS documentation states that "results are guaranteed to be reproducible" when using the same GPU architecture, the same version of cuBLAS, and—critically—"on the same stream."

The multi-stream caveat matters. If you use multiple CUDA streams for concurrent kernel execution, cuBLAS operations on different streams may interleave in different orders between runs. The library's internal state can depend on execution history.

For reproducible results, cuBLAS requires:

  1. Same GPU architecture
  2. Same cuBLAS version
  3. Single-stream execution
  4. Same math mode (e.g., enabling or disabling Tensor Cores)
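
In PyTorch, the math-mode requirement maps to the TF32 flags, which should be pinned explicitly rather than left to version-dependent defaults. A minimal sketch (these flags exist in PyTorch 1.7+; whether TF32 is on by default has changed across versions and hardware):

import torch

# Pin the math mode so cuBLAS and cuDNN pick from the same kernel family
# on every run. Either value works for reproducibility; what matters is
# that it is fixed and recorded.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False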

[FIG 3: Histogram of Max Absolute Difference Across Repeated Runs] Caption: Distribution of max |diff| between 100 identical ResNet-50 backward passes. With cuDNN benchmark enabled (blue), differences reach 1e-5. With deterministic mode (orange), differences are zero.

PyTorch Benchmarking Adds Noise

PyTorch wraps cuDNN and cuBLAS, adding another layer of non-determinism: algorithm selection.

When you call a convolution operation, PyTorch can choose from multiple cuDNN algorithms. By default, it runs a benchmark on the first call to find the fastest algorithm for your specific input size. This is torch.backends.cudnn.benchmark = True.

The benchmark runs each algorithm and measures wall-clock time. Wall-clock time varies with system load. The fastest algorithm on run 1 may not be the fastest on run 2. Different algorithm selection means different execution path means different result.

# Non-deterministic: algorithm selected by benchmark
torch.backends.cudnn.benchmark = True

# Deterministic: same algorithm every time
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

Disabling benchmark and enabling deterministic mode forces PyTorch to use the same (deterministic) algorithm on every run. The algorithm may be slower, but the results are reproducible.

PyTorch 1.8+ provides a convenience function:

torch.use_deterministic_algorithms(True)

This sets all relevant flags and will raise an error if you call an operation that has no deterministic implementation.
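
One related detail: with CUDA 10.2 or newer, deterministic cuBLAS also needs a fixed workspace size, requested through an environment variable before the first cuBLAS call. Without it, use_deterministic_algorithms(True) raises a RuntimeError for some matmul paths. A minimal sketch:

import os

# Must be set before the first cuBLAS handle is created, so set it before
# importing torch (":16:8" is the documented lower-memory alternative).
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
torch.use_deterministic_algorithms(True)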

[MOTION GRAPHIC 2: Side-by-side training curves, one with benchmark=True showing jitter, one with deterministic=True showing identical curves]

Determinism Costs Performance

There is no free lunch. Deterministic algorithms are often slower than their non-deterministic counterparts.

The atomicAdd approach in cuDNN's backward convolution is faster because it allows maximum parallelism. The deterministic alternative serializes more operations, reducing parallelism and throughput.

Our measurements on an A100 GPU show:

Configuration                      | ResNet-50 Backward (ms) | Overhead
Default (non-deterministic)        | 12.3                    | baseline
deterministic=True                 | 15.8                    | +28%
deterministic=True + single stream | 18.2                    | +48%

[FIG 5: Runtime Bar Chart for Determinism Settings] Caption: Wall-clock time for 100 backward passes through ResNet-50. Default settings vs. deterministic mode vs. deterministic with single-stream execution.

The overhead varies by model architecture and input size. Some operations have no deterministic implementation at all and will error when you enable strict mode.

Benchmark: atomicAdd Accumulation

Here is a minimal demonstration of non-determinism from atomicAdd:

// atomic_sum.cu
// Each thread folds one element into a single accumulator. The hardware
// serializes the atomicAdds, but the serialization order is not guaranteed.
__global__ void atomic_sum(float* data, int n, float* result) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(result, data[idx]);
    }
}

// Host code
float run_sum(float* d_data, int n) {
    float* d_result;
    cudaMalloc(&d_result, sizeof(float));
    cudaMemset(d_result, 0, sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    atomic_sum<<<blocks, threads>>>(d_data, n, d_result);

    // cudaMemcpy on the default stream waits for the kernel to complete.
    float result;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return result;
}

Run this 100 times with the same input array. The result will vary. The differences are small—typically in the least significant bits—but they are real and measurable.

[FIG 4: Heatmap of Absolute Difference Across a Tensor] Caption: Element-wise |diff| between two runs of the same convolution backward pass. Most elements are identical (dark). Non-zero differences (bright) appear in scattered locations where atomicAdd race conditions manifested differently.

Benchmark: PyTorch Convolution Backward

A more realistic demonstration using PyTorch:

import torch

def measure_variance(runs=100):
    # Same seed -> identical x and w on every call; only the execution
    # order inside the backward pass can differ between runs.
    torch.manual_seed(42)
    x = torch.randn(32, 64, 56, 56, device='cuda')
    w = torch.randn(128, 64, 3, 3, device='cuda', requires_grad=True)

    grads = []
    for _ in range(runs):
        # Fresh leaf tensor so each run computes its input gradient from scratch.
        x_clone = x.clone().detach().requires_grad_(True)
        y = torch.nn.functional.conv2d(x_clone, w, padding=1)
        loss = y.sum()
        loss.backward()
        grads.append(x_clone.grad.clone())

    # Compare all runs to first run
    baseline = grads[0]
    max_diffs = [torch.max(torch.abs(g - baseline)).item() for g in grads[1:]]
    return max_diffs

# Non-deterministic mode
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
diffs_nondet = measure_variance()
print(f"Non-deterministic max diff: {max(diffs_nondet):.2e}")

# Deterministic mode
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
diffs_det = measure_variance()
print(f"Deterministic max diff: {max(diffs_det):.2e}")

On an A100, the non-deterministic run shows differences up to 1e-6. The deterministic run shows zero difference.

[MOTION GRAPHIC 3: Gradient tensor visualization, flashing between two runs showing scattered difference pixels]

The Determinism Checklist

Not all workloads need the same level of reproducibility. Choose based on your use case:

Level A: Debug-Level Determinism

Goal: Reproduce a specific failure or behavior.

Settings:

torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)

Constraints:

  • Single GPU
  • Single process
  • Fixed data order (no shuffling)
  • May error on operations without deterministic implementations

Use when: Debugging NaN gradients, investigating training instability, bisecting code changes.

Level B: Regression-Level Determinism

Goal: Detect when code changes affect numerical results.

Settings:

torch.manual_seed(seed)
torch.backends.cudnn.benchmark = False
# deterministic = False (allow non-deterministic ops)

Constraints:

  • Same GPU model
  • Same library versions
  • Accept small numerical differences between runs
  • Track statistical metrics (loss curve shape, final accuracy) not bit-exact values

Use when: CI/CD pipelines, A/B testing model changes, hyperparameter sweeps.
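
A sketch of what such a statistical check might look like in a test (the tolerances and curve values are illustrative, not measured):

import torch

def check_loss_regression(new_curve, baseline_curve, rtol=1e-3, atol=1e-4):
    # Compare loss trajectories within a tolerance instead of bit-for-bit.
    new = torch.tensor(new_curve)
    ref = torch.tensor(baseline_curve)
    if not torch.allclose(new, ref, rtol=rtol, atol=atol):
        max_diff = torch.max(torch.abs(new - ref)).item()
        raise AssertionError(f"loss curve drifted: max |diff| = {max_diff:.2e}")

# Per-epoch losses from the baseline run and the candidate run.
check_loss_regression([2.31, 1.87, 1.52], [2.31, 1.87, 1.52])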

Level C: Research-Level Determinism

Goal: Document and communicate reproducibility status.

Settings: Default (non-deterministic), but:

# Log all relevant seeds
print(f"torch seed: {torch.initial_seed()}")
print(f"numpy seed: {np.random.get_state()[1][0]}")
print(f"cudnn version: {torch.backends.cudnn.version()}")
print(f"cuda version: {torch.version.cuda}")

Constraints:

  • Report hardware, software versions, seeds
  • Provide training scripts and data
  • Accept that exact reproduction may not be possible on different hardware
  • Focus on statistical reproducibility (similar results, not identical)

Use when: Publishing papers, sharing models, establishing baselines.


GPUs are parallel machines. Parallelism and determinism are in tension. The hardware makes a choice: performance over reproducibility. Understanding this trade-off lets you make informed decisions about when to pay the cost of determinism and when to accept the variance.

The variance is real. It is measurable. It is documented. It is not a bug in your code.