This article reflects MLNavigator Research Group work. Deployment lives in adapterOS.

Tags: determinism · quantization · fixed-point · Q15 · inference · reproducibility · routing

Q15 Fixed-Point Quantization as a Determinism Boundary

January 18, 2026 · James KC Auchterlonie

There is a persistent assumption in deterministic inference work that every floating-point operation in the pipeline must produce identical results across runs. This assumption is expensive, restrictive, and—if you design the system carefully—unnecessary.

The insight behind Q15 commit boundaries is that determinism is not a property you need everywhere. It is a property you need at the points where decisions become permanent. Everything upstream of that boundary can be as messy as hardware floating-point arithmetic wants to be, provided you quantize the result into a fixed representation before anything gets recorded, transmitted, or used to branch execution.

[Figure: three runs (α, β, γ) of the same input on different hardware. The floating-point scores vary run to run, but the Q15 committed values past the commit boundary are α = 23380, β = 9387, γ = 23380; α = γ.]

The Problem with Floating-Point Determinism

Real-number addition is associative; IEEE 754 floating-point addition is not. The expression (a + b) + c does not necessarily equal a + (b + c) when computed in finite precision, and compilers, drivers, and hardware schedulers routinely reorder operations for performance. A GPU kernel that sums 128 partial products across threads may accumulate them in a different order on Tuesday than it did on Monday, depending on which thread block happened to finish first.
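The non-associativity is easy to demonstrate with a classic cancellation example in double precision:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # inner sum cancels exactly to 0.0, then 1.0 is added
right = a + (b + c)  # 1.0 is absorbed by -1e16 (below its ulp of 2.0)

print(left, right)   # 1.0 0.0
```

A reduction that reorders its operands is doing exactly this kind of regrouping, just less dramatically.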

This is well-understood. The usual response is to lock down the entire computation graph—pin kernel versions, disable auto-tuning, force deterministic reduction algorithms, accept the performance penalty. NVIDIA's CUBLAS_WORKSPACE_CONFIG flag exists precisely for this purpose. It works. It is also a blunt instrument that trades throughput for bitwise reproducibility across every operation, including operations whose exact values do not matter.

Where Determinism Actually Matters

In an adapter-routed inference system, the computation that matters most is the routing decision. Which adapters get selected, and with what gate weights, determines the entire downstream computation. If two runs select different adapters for the same input, the outputs diverge not by a few bits in the fifteenth decimal place but by the entire content of the response.

The routing computation involves scoring each adapter against input features, sorting by score, applying a probability distribution, and producing gate values. Each of these steps involves floating-point arithmetic that may vary across hardware. But the routing decision itself—which adapters were selected and how much weight each one carries—is what the receipt must bind and what a replay must reproduce.
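The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual system's API; the function name, top-k selection, and softmax gating are assumptions based on the description in the text.

```python
import numpy as np

def route(scores: np.ndarray, k: int = 2):
    """Hypothetical routing sketch: per-adapter scores -> top-k -> gates."""
    # Select the k highest-scoring adapters.
    selected = np.argsort(scores)[::-1][:k]

    # Softmax over the selected scores yields the gate weights.
    logits = scores[selected] - scores[selected].max()  # numerical stability
    weights = np.exp(logits)
    gates = weights / weights.sum()

    return selected, gates

selected, gates = route(np.array([0.2, 1.7, 0.9, 1.3]), k=2)
# selected -> adapters [1, 3]; gates sum to 1.0
```

Every arithmetic step here runs in floating point and may vary across hardware; it is the (selected, gates) pair, not the raw scores, that the commit boundary must pin down.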

This is where quantization becomes a design tool rather than a compression technique.

Q15 as a Commit Boundary

Q15 is a fixed-point representation that maps floating-point values in the range [−1, 1] to signed 16-bit integers with a denominator of 32767. The conversion is straightforward: multiply the floating-point value by 32767 and round to the nearest integer, with the tie-breaking rule pinned down, since a commit boundary cannot tolerate rounding ambiguity. The result is a value that can be stored, transmitted, compared, and replayed without any ambiguity about what number it represents.
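A minimal sketch of the conversion, assuming round-half-away-from-zero as the tie-breaking rule (the article's convention only says "round to the nearest integer"):

```python
def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1] to Q15: scale by 32767, round to nearest.

    Ties round away from zero -- the tie rule must be fixed, whichever
    one is chosen, or the commit boundary reintroduces ambiguity.
    """
    if not -1.0 <= x <= 1.0:
        raise ValueError("Q15 input must lie in [-1, 1]")
    scaled = x * 32767.0
    return int(scaled + 0.5) if scaled >= 0 else int(scaled - 0.5)

def from_q15(q: int) -> float:
    """Recover the canonical value a Q15 integer represents."""
    return q / 32767.0
```

Note the asymmetry: `to_q15` is lossy and runs once at the boundary, while `from_q15` is exact and can run anywhere, on any machine, with the same result.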

When the routing layer produces gate values—say, 0.71352 and 0.28648 for a two-adapter selection—it quantizes them to Q15 before committing the routing decision. The quantized values (23380 and 9387, respectively) are what get recorded in the receipt, what get used to weight the adapter outputs, and what a verifier checks during replay.

The floating-point computation upstream of this quantization can vary between runs. It can vary between chips. It can use FP16 on one machine and FP32 on another. None of that matters, because the quantized values define the canonical routing decision. If the floating-point scores on two different machines both quantize to the same Q15 values, the routing is identical and the receipt is reproducible. If they quantize to different values—possible only when the scores straddle a rounding boundary, and guaranteed once they differ by more than a full quantization step of 1/32767, roughly 0.003%—then the routing genuinely differs, and the receipt should reflect that difference.
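A small worked example of the cross-hardware argument. The two gate values below are hypothetical FP32 and FP16 pipeline results that disagree in the sixth decimal place yet commit to the same Q15 integer:

```python
# Two machines compute the same gate in different precision.
gate_fp32 = 0.713518   # hypothetical result from an FP32 pipeline
gate_fp16 = 0.713523   # hypothetical result from an FP16 pipeline

# Both land between the same pair of rounding boundaries,
# so both commit to the identical Q15 value.
q_a = round(gate_fp32 * 32767)
q_b = round(gate_fp16 * 32767)
print(q_a, q_b)  # 23380 23380
```

The receipts produced on the two machines are therefore byte-identical at the routing level, despite the divergent floating-point intermediates.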

Why Q15 Specifically

The choice of Q15 is not arbitrary. It provides roughly 15 bits of precision (hence the name), which is sufficient to represent gate values with meaningful granularity while remaining compact enough to embed directly in receipt structures without inflating their size. The fixed denominator of 32767 avoids the rounding-mode ambiguities that plague floating-point comparison—two Q15 values are either identical or they are not, with no epsilon required.

Q15 is also a well-established representation in digital signal processing, where the same tension between computational efficiency and output reproducibility has been negotiated for decades. Audio codecs, control systems, and communications protocols all use fixed-point commit boundaries for exactly this reason: you want your intermediate computations to be fast, and you want your committed outputs to be unambiguous.

The Architectural Implication

Designing around commit boundaries changes how you think about the determinism problem. Instead of asking "how do I make every floating-point operation reproducible?" you ask "where in the pipeline do decisions become permanent, and how do I make those decisions unambiguous?"

This reframing has practical consequences. You do not need to disable GPU auto-tuning for the scoring computation. You do not need to force deterministic reductions for the softmax normalization. Determinism becomes a property you require of specific kernels and boundaries rather than a global constraint on the graph. What you do need is a correct quantization step, and the discipline that everything downstream of it uses the quantized values rather than the floating-point originals.
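That downstream discipline can be sketched as a pair of hypothetical helpers: one commit step that quantizes the float gates and discards the originals, and one combine step that reconstructs weights only from the committed integers. The names and shapes are illustrative assumptions, not the system's actual interface.

```python
import numpy as np

Q15_ONE = 32767

def commit_gates(float_gates):
    """Commit boundary: quantize once; the float originals are never reused."""
    return [round(g * Q15_ONE) for g in float_gates]

def combine(adapter_outputs, q15_gates):
    """Weight adapter outputs by committed gates, not raw floats.

    The reconstruction q / 32767 is exact and identical on every machine.
    """
    weights = [q / Q15_ONE for q in q15_gates]
    return sum(w * out for w, out in zip(weights, adapter_outputs))

q_gates = commit_gates([0.71352, 0.28648])   # -> [23380, 9387]
outputs = [np.ones(4), np.zeros(4)]
blended = combine(outputs, q_gates)
```

If any downstream code reaches back to the pre-commit floats, the boundary is breached and the reproducibility argument collapses, so the discipline is as important as the quantization itself.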

The performance savings are substantial. Deterministic GPU kernels are typically 10–30% slower than their non-deterministic counterparts, depending on the operation and the hardware. By restricting the determinism requirement to the commit boundary rather than the entire computation graph, you recover most of that throughput while maintaining the property that actually matters: identical inputs produce identical receipts.

What This Does Not Solve

Q15 quantization does not make the inference output identical across hardware. The forward pass through the base model and adapters still involves floating-point arithmetic that may vary between platforms. What it guarantees is that the routing decision—which adapters were used and how much each one contributed—is identical and verifiable.

For many applications, this is sufficient. The routing decision is the highest-leverage source of output variance in an adapter-based system, and eliminating that variance while documenting the remaining sources is a more honest engineering position than claiming bitwise determinism across the entire stack.

The receipt says: these adapters were selected with these weights. Verify that, and you have verified the most consequential decision in the inference pipeline.