The cross-architecture guarantee is: there is none
NVIDIA's cuDNN documentation is specific on this point. Routines are designed to produce the same bit-wise results across runs on GPUs with the same architecture. Across different architectures — Volta to Turing, Turing to Ampere, Ampere to Hopper — no cuDNN routine guarantees bit-wise reproducibility.
The reason is straightforward: each GPU generation implements floating-point arithmetic differently, schedules threads differently, and handles memory access patterns differently.
What changes between architectures
Each GPU architecture ships a different Tensor Core design. The multiply-accumulate pipeline on Ampere handles precision differently than Hopper's. cuDNN notes that Tensor Core results are "very close, but not always identical" to scalar floating-point, and the specific deviation pattern changes with the hardware.
Thread scheduling varies too. The number of streaming multiprocessors, warp schedulers, and register files differs across architectures. As detailed in NVIDIA's documentation on floating-point precision, parallel reductions that sum partial products across warps can be sensitive to execution order — and scheduling order is inherently architecture-dependent.
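The order sensitivity is easy to demonstrate even on a CPU: floating-point addition is not associative, so the same values grouped differently produce different sums. A minimal Python illustration:

```python
# Floating-point addition is not associative, so the grouping a parallel
# reduction happens to use changes the result. Same three values, two orders:
left_to_right = (0.1 + 0.2) + 0.3   # partial sums accumulated one way
right_to_left = 0.1 + (0.2 + 0.3)   # the same values, grouped differently

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

A GPU reduction does the same thing at scale: the grouping of partial sums follows the warp and block schedule, which is architecture-dependent.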
Memory hierarchy plays a role as well: L2 cache sizes, shared memory configurations, and memory bandwidth all affect which data is available when during parallel computation, which in turn can affect the order of floating-point accumulations.
Then there's cuBLAS. NVIDIA's matrix operations library can exhibit nondeterministic behavior depending on buffer management heuristics. The CUBLAS_WORKSPACE_CONFIG environment variable can force deterministic behavior, but only within a single architecture.
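As a sketch (the environment variable is framework-agnostic, though the follow-on calls assume PyTorch), the variable must be set before the CUDA context is created, i.e. before the framework is imported:

```python
import os

# CUBLAS_WORKSPACE_CONFIG must be set before the CUDA context exists.
# ":4096:8" and ":16:8" are the two values NVIDIA documents for
# deterministic cuBLAS workspace behavior.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# In PyTorch one would then also opt into deterministic kernels:
#   import torch
#   torch.use_deterministic_algorithms(True)
# Note: this pins behavior only on a fixed architecture/driver/library
# stack; it does not make results portable across GPU generations.
```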
What "same environment" actually requires
NVIDIA's GTC 2019 presentation on determinism lists the full set of conditions for bit-exact reproducibility:
- Same GPU architecture
- Same driver version
- Same CUDA version
- Same cuDNN version
- Same framework version
- Same number of GPUs
- Same distribution setup
Drop any one condition and the guarantee is void. In practice, this means that a model validated on a development machine with an A100 cannot be assumed to produce identical outputs on a production machine with an H100 — even with identical weights, identical inputs, and identical framework code.
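One way to operationalize that list is to hash the reproducibility-relevant environment into a fingerprint recorded with every validation run. A minimal sketch, with hypothetical values standing in for what would actually be queried from `nvidia-smi` or the framework:

```python
import hashlib
import json

def environment_fingerprint(env: dict) -> str:
    """Hash the reproducibility-relevant environment so any drift is visible."""
    canonical = json.dumps(env, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical values for illustration; in practice these would be queried
# from nvidia-smi, torch.version, etc. at validation time.
baseline = environment_fingerprint({
    "gpu_architecture": "sm_80 (A100)",
    "driver_version": "535.104.05",
    "cuda_version": "12.2",
    "cudnn_version": "8.9.4",
    "framework_version": "torch 2.1.0",
    "gpu_count": 8,
    "distribution": "single node, NCCL",
})
# If the fingerprint at inference time differs from the validated baseline,
# bit-exact comparison against archived outputs is no longer meaningful.
```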
Why this matters for controlled deployments
Any team deploying AI under change-control requirements faces a specific challenge: hardware upgrades break reproducibility guarantees. A GPU refresh that improves throughput by 40% also invalidates every output comparison baseline established on the previous hardware.
This creates a tension between operational efficiency and governance. The newer hardware is faster and cheaper to run. But the organization's validation evidence was produced on the old hardware, and re-validation on new hardware will not produce bit-identical results, even if the model is behaving correctly.
The practical consequence is that governance frameworks need to account for hardware variation as a declared property of the system rather than treating it as an anomaly. Tolerance-based comparison — where outputs are considered equivalent within a defined numerical bound — replaces exact-match comparison. That bound needs to be defined, justified, and documented.
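A minimal sketch of such a tolerance-based check, using placeholder bounds that a real deployment would have to derive from observed cross-architecture variation and justify in its documentation:

```python
import math

def outputs_equivalent(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Element-wise tolerance comparison in place of exact-match.

    The rel_tol/abs_tol values here are placeholders: the real bound must
    be measured, justified, and documented for the specific model.
    """
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )

# Two runs that differ in the last digit pass; a real divergence fails.
print(outputs_equivalent([0.730104, 0.269896], [0.730105, 0.269895]))  # True
print(outputs_equivalent([0.730104, 0.269896], [0.741000, 0.259000]))  # False
```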
Tensor Cores vs. CUDA Cores: the performance-reproducibility tradeoff
One approach to improving cross-run reproducibility is restricting computation to CUDA cores and avoiding Tensor Cores entirely. Ingonyama's engineering team reported success with this approach, achieving identical outputs across three different GPU architectures by restricting to CUDA cores and carefully managing floating-point operation ordering.
The cost is significant. Tensor Cores exist because they are dramatically faster for matrix operations; disabling them can reduce inference throughput by 2-10x depending on the model and batch size. For most production workloads, that cost is too high to accept.
For deployments where the model architecture itself introduces routing variation — expert-routing models, for example — that algorithmic nondeterminism compounds with the hardware-level variation, making the reproducibility problem harder still.
What organizations should document
Hardware-related reproducibility limits should be treated as part of the deployment specification:
- The specific GPU architecture, driver, and CUDA/cuDNN versions in use
- Whether Tensor Cores are enabled and what precision modes are active
- The defined tolerance for output comparison across runs
- The re-validation procedure when any hardware or driver component changes
- Whether deterministic mode is enabled and what performance cost is accepted
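One way to make that checklist concrete is a machine-readable deployment record with a simple re-validation trigger. The field names below are illustrative, not a standard schema:

```python
# Hypothetical, minimal deployment record covering the checklist above.
DEPLOYMENT_SPEC = {
    "hardware": {
        "gpu_architecture": "sm_90 (H100)",
        "driver_version": "550.54.14",
        "cuda_version": "12.4",
        "cudnn_version": "9.0.0",
    },
    "numerics": {
        "tensor_cores_enabled": True,
        "precision_modes": ["tf32", "fp16"],
        "deterministic_mode": False,
    },
    "comparison": {
        "method": "tolerance",  # not exact-match
        "rel_tol": 1e-5,
        "abs_tol": 1e-6,
        "justification_doc": "validation/tolerance-rationale.md",
    },
}

def requires_revalidation(old_spec: dict, new_spec: dict) -> bool:
    """Any drift in hardware or numerics fields invalidates the baseline."""
    return any(old_spec[k] != new_spec[k] for k in ("hardware", "numerics"))
```

A driver bump, a CUDA upgrade, or a change in precision modes flips `requires_revalidation` to `True`, which is exactly the property an auditor will want to see enforced.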
If an auditor will eventually ask "would this system produce the same result if you ran it again?" — and in CMMC-scoped environments, they will — the answers need to exist before the question does.