This article reflects MLNavigator Research Group work. Deployment lives in adapterOS.

Kernel Allow-Lists: Determinism as Configurable Policy

January 22, 2026 · James KC Auchterlonie

The standard approach to deterministic GPU computation is a global flag: set CUBLAS_WORKSPACE_CONFIG, enable torch.use_deterministic_algorithms(True), and accept the performance hit across every operation. This works for research reproducibility, where the goal is bitwise-identical training runs regardless of cost. It is less appropriate for production inference, where some operations are on the critical path for reproducibility and others are not.
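For concreteness, the global-flag approach looks like this in PyTorch. CUBLAS_WORKSPACE_CONFIG, torch.use_deterministic_algorithms, and torch.backends.cudnn.benchmark are the real cuBLAS/PyTorch controls; the import guard is only so the sketch runs on a machine without torch installed.

```python
import os

# CUBLAS_WORKSPACE_CONFIG must be set before the first CUDA call;
# ":4096:8" gives cuBLAS a fixed workspace per stream so its
# reductions are run-to-run deterministic.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

try:
    import torch

    # Error out whenever an op would fall back to a non-deterministic
    # kernel, rather than silently allowing it.
    torch.use_deterministic_algorithms(True)

    # Disable the cuDNN autotuner, which can pick different conv
    # kernels across runs depending on timing.
    torch.backends.cudnn.benchmark = False
except ImportError:
    pass  # the settings above still illustrate the approach
```

Note that this is all-or-nothing: every operation in the process pays the cost, which is exactly the coarseness the allow-list approach below avoids.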

A kernel allow-list is a different approach. Rather than forcing every operation to be deterministic, you specify which operations are permitted during inference execution and which are blocked. The result is a per-model or per-adapter policy that explicitly encodes the tradeoff between performance and reproducibility, and that policy itself becomes part of the cryptographic receipt.

[Figure: kernel policy in strict determinism mode. Each kernel is marked either deterministic (allowed) or non-deterministic (blocked); 19 of 32 kernels are permitted, and the policy is bound to the receipt.]

Why Kernels Matter

A "kernel" in GPU computing is a function that runs on the device. When you call a matrix multiplication, the framework selects a kernel implementation based on the input dimensions, the available hardware, and various heuristics. Different kernel implementations may use different reduction orders, different thread block configurations, or different algorithmic strategies—all of which can produce slightly different floating-point results for the same mathematical operation.

This selection is typically invisible to the application. You call matmul(A, B) and the framework picks the fastest kernel that produces a correct result. Correctness here means "within floating-point tolerance," not "bitwise identical to what another kernel would produce." The framework's kernel selection heuristics may change between driver versions, between hardware generations, or even between runs on the same hardware if the auto-tuner reaches a different conclusion about which kernel is fastest for a given problem size.

For most applications, this invisible selection is fine. For applications that need to produce a cryptographic receipt binding the computation to its result, it is a source of variance that must be either eliminated or documented.

The Allow-List Approach

An allow-list specifies exactly which kernel implementations are permitted for each operation type. If the framework would normally select from eight candidate kernels for a given matrix multiplication, the allow-list might restrict that selection to two kernels that are known to produce deterministic results for the relevant input dimensions.
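A minimal sketch of that selection step, assuming a hypothetical policy object (the class and kernel names are illustrative, not a real framework API): the framework proposes its candidate kernels, and the policy filters them before one is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelAllowList:
    # op type -> set of permitted kernel implementation names
    permitted: dict

    def select(self, op: str, candidates: list) -> str:
        """Return the first candidate kernel the policy permits."""
        allowed = [k for k in candidates if k in self.permitted.get(op, set())]
        if not allowed:
            raise RuntimeError(f"no permitted kernel for op {op!r}")
        return allowed[0]


# Restrict matmul to two kernels known to use deterministic reductions.
policy = KernelAllowList(permitted={
    "matmul": {"gemm_splitk_det", "gemm_tiled_det"},
})

# The framework would normally choose freely among all candidates;
# the policy narrows the choice to the deterministic pair.
candidates = ["gemm_autotuned", "gemm_splitk_det",
              "gemm_atomic", "gemm_tiled_det"]
print(policy.select("matmul", candidates))  # -> gemm_splitk_det
```

Raising on an empty intersection, rather than silently falling back, is the important design choice: a blocked kernel should be a visible policy violation, not a quiet substitution.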

This is more granular than a global determinism flag. You might allow non-deterministic kernels for the initial embedding lookup—where the output is a simple table lookup that is inherently deterministic regardless of kernel implementation—while restricting the attention computation to deterministic reduction kernels, because attention scores feed directly into the routing decision.

The policy can vary per model and per adapter. A model deployed in a regulated environment with strict reproducibility requirements might use a restrictive allow-list that permits only verified deterministic kernels. The same model deployed for low-latency interactive use might use a permissive allow-list that allows all standard kernels, accepting minor floating-point variation in exchange for throughput.

Policy as Part of the Receipt

The allow-list configuration is not just a runtime setting—it is bound into the cryptographic receipt alongside the model configuration, adapter selection, and input-output relationship. When a verifier validates a receipt, they can confirm not only what the system computed but under what determinism constraints the computation was performed.

This distinction matters for compliance. A receipt produced under a strict allow-list—one that permits only deterministic kernels—carries a stronger reproducibility guarantee than a receipt produced under a permissive policy. Both receipts are valid, but they make different claims about the computation they describe. The strict receipt says "this output was produced under conditions that guarantee bitwise reproducibility." The permissive receipt says "this output was produced under conditions that may introduce floating-point variation in specific operations."
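One way to bind the policy into the receipt is to hash a canonical serialization of the allow-list alongside the other receipt fields. The field names below are illustrative, not the actual receipt schema; the point is only that the policy digest is part of the hashed payload, so a verifier can tell which determinism policy was in force.

```python
import hashlib
import json


def digest(obj) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same
    # policy always serializes, and therefore hashes, identically.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


policy = {
    "version": 3,
    "matmul": ["gemm_splitk_det", "gemm_tiled_det"],
    "softmax": ["softmax_det"],
}

receipt = {
    "model": "base-7b",                    # illustrative identifiers
    "adapter": "claims-triage-v2",
    "input_hash": digest("example input"),
    "output_hash": digest("example output"),
    "kernel_policy_hash": digest(policy),  # policy bound into the receipt
}
receipt_hash = digest(receipt)
print(receipt_hash[:16])
```

Any change to the allow-list changes `kernel_policy_hash`, and therefore the receipt hash, so a strict receipt and a permissive receipt are distinguishable by construction.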

Making the policy explicit and verifiable is more honest than the alternative, which is claiming determinism without documenting what it means or silently accepting non-determinism without disclosing it.

Practical Configuration

In practice, the operations that most commonly introduce non-determinism in inference are reductions (summations over large tensors), attention score computation (which involves both reduction and softmax), and any operation that uses atomic additions on shared memory. These operations have well-known deterministic alternatives that are slower but produce consistent results.
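The underlying issue is that floating-point addition is not associative, so the same reduction evaluated in a different order can round differently. A small pure-Python illustration, with a toy fixed-order pairwise reduction standing in for a deterministic kernel:

```python
import random

random.seed(0)
# Mixed magnitudes make order-dependent rounding visible.
xs = [random.uniform(-1e8, 1e8) for _ in range(10_000)] + [1e-8] * 10_000

forward = sum(xs)
backward = sum(reversed(xs))
print(forward == backward)  # often False: rounding depends on order


def pairwise_sum(v):
    """Reduce in a fixed, canonical order (recursive halving)."""
    if len(v) == 1:
        return v[0]
    mid = len(v) // 2
    return pairwise_sum(v[:mid]) + pairwise_sum(v[mid:])


# Same order every run, so the result is run-to-run stable.
print(pairwise_sum(xs) == pairwise_sum(xs))  # -> True
```

A deterministic GPU reduction kernel plays the same role: it fixes the accumulation order (and avoids atomics), trading some speed for a result that is identical across runs.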

A reasonable default allow-list for an adapter-routed system might enforce deterministic kernels for the router scoring and gating computation—where reproducibility directly affects the routing decision—while allowing standard kernels for the forward pass through the base model and adapter layers. This targets the determinism guarantee at the operations that most affect receipt reproducibility, without penalizing operations where minor floating-point variance has no practical consequence.
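That default might be expressed as a per-operation policy table along these lines (the operation names and `mode` values are illustrative, not a real schema): strict where the result feeds the routing decision, permissive where minor variance is tolerable.

```python
# Illustrative default policy for an adapter-routed system.
DEFAULT_POLICY = {
    "router.score":    {"mode": "deterministic"},  # feeds routing decision
    "router.gate":     {"mode": "deterministic"},  # feeds routing decision
    "base.forward":    {"mode": "permissive"},     # throughput-sensitive
    "adapter.forward": {"mode": "permissive"},     # throughput-sensitive
}
```

Tightening the policy for a regulated deployment is then a matter of flipping `base.forward` and `adapter.forward` to `deterministic`, not of rebuilding the model or the inference pipeline.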

The configuration is not static. As new kernel implementations are validated, they can be added to the allow-list. As hardware changes, the allow-list can be updated to reflect the kernels available on the new platform. The key design principle is that the allow-list is versioned, explicit, and bound into the receipt, so that any change to the determinism policy is visible and auditable.
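Because the receipt binds a digest of the policy, any revision, however small, is visible to a verifier. A sketch, assuming the same canonical-JSON hashing as above (names are illustrative):

```python
import hashlib
import json


def policy_digest(policy: dict) -> str:
    blob = json.dumps(policy, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


v3 = {"version": 3, "matmul": ["gemm_splitk_det"]}
# v4 adds a newly validated kernel and bumps the version.
v4 = {"version": 4, "matmul": ["gemm_splitk_det", "gemm_tiled_det"]}

print(policy_digest(v3) != policy_digest(v4))  # -> True: the change is visible
```

Receipts produced before and after the update carry different policy digests, so an auditor can pinpoint exactly which determinism policy governed any given computation.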

What This Enables

Configurable determinism policies enable deployment across environments with different reproducibility requirements without maintaining separate model builds or inference pipelines. The same model binary, the same adapter set, the same inference code can produce receipts under different determinism guarantees by changing the allow-list configuration.

For organizations that operate across multiple compliance regimes—defense contracts requiring strict reproducibility alongside commercial deployments optimizing for throughput—this is the difference between maintaining one system with configurable policy and maintaining two separate systems with different guarantees. The policy is the variable, not the implementation.