Operational Outcomes
- Teams can lower inference cost and latency while keeping cache decisions auditable.
- Security and compliance reviews can verify when reuse occurred and why it was allowed.
- Multi-tenant operators can reduce cross-tenant risk by proving stale cache entries were not reused.
- Incident response can trace cache behavior with receipt-bound evidence instead of internal-only logs.
KV Cache Economics
Every transformer layer produces a set of Key and Value tensors during the attention computation for each token. Once computed, these tensors can be stored and reused so that subsequent tokens in the same sequence attend to earlier context without recomputing it from scratch. This is the KV cache, and it is one of the biggest optimizations in production inference. Without it, generating each new token would require re-running the K and V projections over the entire preceding sequence, so per-token cost grows with sequence length and total generation cost grows quadratically. With it, each step projects only the new token and attends to the stored keys and values, so the projection work per token stays roughly constant.
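A minimal sketch, in numpy, of what the cache buys you at decode time: the cached path projects only the new token and attends over stored keys and values, while the uncached path redoes the projections for the whole sequence. The single-head simplification and function names are illustrative, not any particular engine's API.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_step_cached(x_new, W_q, W_k, W_v, cache):
    """Project only the new token, append to the cache, attend over stored K/V."""
    cache["K"] = np.vstack([cache["K"], x_new @ W_k])
    cache["V"] = np.vstack([cache["V"], x_new @ W_v])
    return attend(x_new @ W_q, cache["K"], cache["V"])

def decode_step_uncached(X_full, W_q, W_k, W_v):
    """Without a cache, K and V are recomputed for every token in the sequence."""
    K, V = X_full @ W_k, X_full @ W_v
    return attend(X_full[-1] @ W_q, K, V)

# Usage: the cache starts empty, e.g. {"K": np.zeros((0, d)), "V": np.zeros((0, d))}
```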
The performance impact is substantial. Systems like vLLM and SGLang have built entire serving architectures around efficient KV-cache management, using paged storage and radix-tree indexing to maximize reuse across requests that share common prefixes. LMCache extends this further by extracting and sharing KV caches across multiple engine instances. Prefix caching alone can eliminate thousands of tokens of redundant computation per request when system prompts, few-shot examples, or document contexts are shared across a batch.
The economics are straightforward. Reusing a cached 2,000-token prefix saves approximately 2,000 tokens of compute on every subsequent request that shares that prefix. At scale, this is the difference between inference that is commercially viable and inference that burns money. The question of who benefits from these savings—and whether customers can verify that savings were passed through—is a separate problem with its own cryptographic solutions.
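A back-of-the-envelope version of that arithmetic, with every number illustrative rather than measured:

```python
# Rough prefill savings from prefix reuse; all figures are illustrative.
PREFIX_TOKENS = 2_000          # shared system prompt + few-shot examples
REQUESTS_PER_HOUR = 50_000     # requests that share the prefix
FLOPS_PER_TOKEN = 2 * 70e9     # ~2 x parameter count for a 70B dense model

saved_tokens = PREFIX_TOKENS * REQUESTS_PER_HOUR
saved_flops = saved_tokens * FLOPS_PER_TOKEN
print(f"~{saved_tokens:,} prefill tokens (~{saved_flops:.2e} FLOPs) avoided per hour")
```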
Adapters Change What the Cache Contains
The complication arrives when adapters enter the picture. Lightweight fine-tuning methods like LoRA, IA³, and prefix tuning modify the model's behavior by injecting small learned parameter deltas into existing layers. Many of these methods specifically target the attention projections—the Q, K, and V weight matrices—because attention is where task-specific behavior concentrates most efficiently.
This creates a straightforward coherence problem. The KV cache stores the output of K and V projections as they existed at the time of computation. If an adapter subsequently changes those projections—by modifying the LoRA A and B matrices, by adjusting a scaling factor, or by activating an entirely different adapter—the cached tensors reflect a model state that is no longer active. Reusing them means attending to context that was computed by a different effective model than the one currently generating tokens.
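A small sketch of that mismatch, assuming the standard LoRA formulation where the effective projection is the base weight plus a scaled low-rank delta (alpha / r scaling). The dimensions, tenant labels, and function names are made up for illustration.

```python
import numpy as np

def effective_k_proj(W_k, lora_A, lora_B, alpha, r):
    """Effective K projection with a LoRA delta merged in: W + (alpha / r) * B @ A."""
    return W_k + (alpha / r) * (lora_B @ lora_A)

d, r = 64, 8
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d, d))
A1, B1 = rng.normal(size=(r, d)), rng.normal(size=(d, r))   # adapter active at cache time
A2, B2 = rng.normal(size=(r, d)), rng.normal(size=(d, r))   # adapter active now
x = rng.normal(size=(d,))

k_cached = effective_k_proj(W_k, A1, B1, alpha=16, r=r) @ x
k_current = effective_k_proj(W_k, A2, B2, alpha=16, r=r) @ x
print(np.allclose(k_cached, k_current))   # False: reuse would attend to stale keys
```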
The consequences range from subtle quality degradation to outright incorrect outputs. In multi-tenant inference environments where different customers use different adapters on shared hardware, stale cache reuse can leak one tenant's contextual representations into another tenant's generation. This is a correctness problem, an isolation problem, and in regulated environments, a compliance problem.
The Adapter Boundary Is the Decision Point
The safe approach treats every adapter change as a potential cache invalidation event. The question is how to make that determination efficiently, without discarding cache entries that remain perfectly valid.
The key insight is that adapter modifications are not uniform across layers. A LoRA adapter might modify the K and V projections in layers 12 through 24 while leaving layers 0 through 11 untouched. In that case, the KV cache for layers 0 through 11 remains valid even after an adapter swap. Blanket invalidation—discarding the entire cache on any adapter change—is safe but wasteful. Per-layer invalidation, where only the layers whose effective weights changed are recomputed, preserves the performance benefit of caching for the majority of the model while ensuring correctness for the layers that actually differ.
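A minimal sketch of per-layer invalidation under that assumption, with the cache keyed by layer index and the set of touched layers supplied by whatever loads the adapter; both names are hypothetical.

```python
def invalidate_on_adapter_swap(kv_cache: dict[int, object], touched_layers: set[int]) -> None:
    """Drop cached KV only for layers whose effective K/V weights changed;
    untouched layers keep their entries."""
    for layer in touched_layers:
        kv_cache.pop(layer, None)

# Example: a LoRA that touches K/V projections only in layers 12-24 of a 32-layer model
kv_cache = {layer: object() for layer in range(32)}       # placeholder cached tensors
invalidate_on_adapter_swap(kv_cache, set(range(12, 25)))
print(sorted(kv_cache))                                    # layers 0-11 and 25-31 survive
```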
Recent work on Activated LoRA demonstrates this principle at scale. By designing adapters that modify attention projections only after a predefined activation sequence, the base model's KV cache can be reused for all tokens preceding that activation point. The reported gains are substantial: up to 58x end-to-end latency reduction and over 100x time-to-first-token improvement compared to standard LoRA serving, with benefits that scale with both model size and sequence length. FastLibra takes a complementary approach, co-optimizing the caching of both adapter weights and KV tensors to reduce time-to-first-token by 63% and improve peak throughput by 35% in multi-LoRA serving scenarios.
These results confirm that the performance opportunity is substantial, but they also underscore that the invalidation logic must be precise. A cache hit on a stale entry is worse than a cache miss, because a miss merely costs time while a stale hit corrupts the output.
State Hashing Makes Reuse Verifiable
The mechanism for deciding whether a cached entry remains valid is a per-layer state hash. For each attention block, you compute a short cryptographic hash—BLAKE3 is well-suited here, given its speed and parallelism—over the effective K/V path weights. This means the base weight matrices plus any active adapter deltas, scaling factors, and bias terms that influence the K or V projection for that layer. The hash is stored alongside the cached KV tensors as metadata.
On reuse, the system computes the current state hash for each layer and compares it against the stored hash. A match means the effective weights that produced the cached tensors are identical to the weights that would produce them now, so reuse is safe. A mismatch means the weights have changed, so the KV for that layer must be recomputed. The scope of the hash is deliberately narrow: only the K and V projection weights, their associated LoRA A/B matrices, bias terms, and scaling factors. Query projections and feed-forward network adapters are excluded because they are applied fresh at every step and do not affect the cached tensors.
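A sketch of what such a per-layer hash might look like, using hashlib.blake2b from the standard library as a stand-in for BLAKE3. The KVPathState structure and its field names are illustrative, not any serving framework's API.

```python
import hashlib
from dataclasses import dataclass
import numpy as np

@dataclass
class KVPathState:
    """Everything that influences the K or V projection for one attention layer."""
    base_k: np.ndarray
    base_v: np.ndarray
    lora_A_k: np.ndarray | None = None
    lora_B_k: np.ndarray | None = None
    lora_A_v: np.ndarray | None = None
    lora_B_v: np.ndarray | None = None
    bias_k: np.ndarray | None = None
    bias_v: np.ndarray | None = None
    scaling: float = 1.0

def kv_state_hash(state: KVPathState) -> str:
    """Hash the effective K/V path; blake2b stands in for BLAKE3 here."""
    h = hashlib.blake2b(digest_size=16)
    for field in (state.base_k, state.base_v, state.lora_A_k, state.lora_B_k,
                  state.lora_A_v, state.lora_B_v, state.bias_k, state.bias_v):
        h.update(b"\x00" if field is None else np.ascontiguousarray(field).tobytes())
    h.update(np.float64(state.scaling).tobytes())
    return h.hexdigest()

def reuse_ok(cached_hash: str, current_state: KVPathState) -> bool:
    """A match means the effective K/V weights are unchanged, so reuse is safe."""
    return cached_hash == kv_state_hash(current_state)
```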
This approach has a natural affinity with the determinism infrastructure that verifiable inference already requires. The state hash can be bound to the same cryptographic receipt that records the model version, adapter identity, and inference configuration. The result is an auditable record that proves which cache entries were reused, which were recomputed, and why—turning a performance optimization into a verifiable control.
For systems that use Q15 commit boundaries to concentrate determinism at decision points rather than demanding it uniformly, the state hash serves an analogous function: it is the commit boundary for the KV cache, the point at which the system decides whether cached computation is trustworthy enough to reuse.
Additional Complexity from MoE Routing
Adapter changes are one source of cache invalidity. Routing changes are another. In Mixture-of-Experts architectures, each token can be routed to different expert subnetworks depending on the router's gating decisions. If the router selects different experts for a given token position than it did when the cache was created, the cached K/V tensors may not reflect the computation path that the current model state would produce.
The solution is to extend the state hash to include routing metadata. For deterministic routers, this means including the selected expert IDs or a deterministic route signature in the cache key. For stochastic routers, the situation is more nuanced—the cache entry should include the full routing decision (which experts, with what weights) so that the system can verify whether the same routing would occur on recomputation.
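A sketch of how routing metadata might be folded into the cache key, again with blake2b standing in for BLAKE3. The RoutingDecision structure and its fields are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingDecision:
    layer: int
    token_position: int
    expert_ids: tuple[int, ...]        # top-k experts selected by the router
    gate_weights: tuple[float, ...]    # recorded so stochastic routing can be re-verified

def route_signature(decisions: list[RoutingDecision]) -> str:
    """Deterministic digest of the routing decisions that produced the cached tensors."""
    h = hashlib.blake2b(digest_size=16)
    for d in decisions:
        h.update(f"{d.layer}:{d.token_position}:{d.expert_ids}:{d.gate_weights}".encode())
    return h.hexdigest()

def cache_key(layer: int, state_hash: str, route_sig: str) -> str:
    """Reuse requires both the K/V state hash and the route signature to match."""
    return f"{layer}:{state_hash}:{route_sig}"
```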
This is where the per-layer approach proves its value. In an MoE model with 64 expert layers and 32 attention layers, a routing change in one expert layer invalidates only the KV entries downstream of that change, while all other cache entries remain valid. The granularity matters because MoE models are expensive to run, and the savings from caching are largest exactly where compute costs the most.
Segmented Caches for Mid-Sequence Adapter Swaps
Some inference pipelines swap adapters mid-sequence. A common pattern is to use one adapter for understanding the input context and a different adapter for generating the output—or to process a multi-turn conversation where different turns were handled by different adapter configurations.
The safe design for this scenario is a segmented cache. Each segment corresponds to a contiguous span of tokens produced under a single adapter configuration. When the adapter changes, the system seals the current segment—marking it as immutable and recording the state hash that produced it—and starts a new segment under the new configuration. Previous segments are referenced, never mutated. Each segment carries its own state hash metadata, so the validity of each segment can be verified independently.
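One possible shape for such a segmented cache, sketched below; the segment fields and method names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CacheSegment:
    adapter_id: str
    state_hashes: dict[int, str]      # per-layer state hash recorded at segment creation
    start_token: int
    end_token: int | None = None      # set when the segment is sealed
    sealed: bool = False

@dataclass
class SegmentedKVCache:
    segments: list[CacheSegment] = field(default_factory=list)

    def begin_segment(self, adapter_id: str, state_hashes: dict[int, str], start_token: int) -> None:
        self.segments.append(CacheSegment(adapter_id, state_hashes, start_token))

    def on_adapter_swap(self, new_adapter_id: str, new_hashes: dict[int, str], token: int) -> None:
        """Seal the active segment (immutable from here on) and open a new one
        under the new adapter configuration."""
        if self.segments and not self.segments[-1].sealed:
            self.segments[-1].end_token = token
            self.segments[-1].sealed = True
        self.begin_segment(new_adapter_id, new_hashes, token)
```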
This segmented approach also interacts cleanly with the token billing question. Each segment represents a distinct unit of computation with a verifiable adapter state. Billing systems that attribute computation to specific adapter configurations can use segment boundaries as natural accounting units, and the state hashes provide the evidence needed to distinguish cached computation from fresh computation in the billing record.
Quantization State Belongs in the Hash
One subtle failure mode deserves explicit attention. If the model's quantization state changes between the time the cache was created and the time it is reused—for example, if per-tensor quantization scales are recalibrated, or if the system switches between different quantization formats—the effective weights have changed even though the logical adapter configuration has not. The state hash must include the quantization parameters (per-tensor scales, zero points, format identifiers) to avoid false matches.
This matters particularly for systems that use dynamic quantization, where the quantization parameters are computed from the activation statistics of recent batches. In these systems, the quantization state is continuously shifting, and a cache entry that was valid five minutes ago may correspond to slightly different effective weights now. Including the quantization state in the hash catches these shifts automatically.
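A sketch of folding quantization state into the hash, building on the per-layer state hash string from the earlier sketch; the QuantState fields are assumptions, not a real library's schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantState:
    fmt: str                      # e.g. "fp8_e4m3" or "int8_per_tensor" (illustrative labels)
    scales: tuple[float, ...]     # per-tensor (or per-channel) scales for the K/V path
    zero_points: tuple[int, ...]

def quant_aware_hash(kv_state_hash: str, q: QuantState) -> str:
    """Combine the K/V weight hash with the quantization parameters so a
    recalibration or format switch cannot produce a false match."""
    h = hashlib.blake2b(digest_size=16)
    h.update(kv_state_hash.encode())
    h.update(q.fmt.encode())
    h.update(repr(q.scales).encode())
    h.update(repr(q.zero_points).encode())
    return h.hexdigest()
```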
The hardware-level determinism challenges compound this. Floating-point nondeterminism in GPU execution can produce slightly different intermediate values across runs, and if those intermediate values feed into quantization scale computation, the resulting quantization parameters—and therefore the effective weights—may differ between the cache-creation run and the cache-reuse run. Tying the hash to the exact quantization state provides a clean detection mechanism for this class of divergence.
Recommended Default Policy
The practical default for production systems is a three-part reuse predicate: the per-layer state hash must match, the adapter epoch (a monotonic counter bumped on any K/V-relevant parameter change) must match, and the route signature (if the model uses MoE or conditional computation) must match. If all three conditions hold for a given layer at a given step, reuse the cached KV for that layer. If any condition fails, recompute only that layer's KV while reusing the rest.
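That predicate is small enough to write down directly. A sketch, with the per-layer tag structure assumed for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerCacheTag:
    """Metadata stored with (and compared against) each layer's cached KV."""
    state_hash: str        # per-layer hash over the effective K/V path
    adapter_epoch: int     # monotonic counter bumped on any K/V-relevant change
    route_signature: str   # constant (e.g. "") for dense, non-MoE models

def may_reuse(cached: LayerCacheTag, current: LayerCacheTag) -> bool:
    """All three conditions must hold; otherwise recompute this layer's KV."""
    return (cached.state_hash == current.state_hash
            and cached.adapter_epoch == current.adapter_epoch
            and cached.route_signature == current.route_signature)
```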
The telemetry to support this is lightweight: log the layer ID, whether the hash matched, the active adapter ID, the epoch, and the route signature for each cache decision. Periodically—on the order of one in every thousand tokens—run a divergence sample where you compare the logits produced by reused KV against the logits that would result from full recomputation. This gives you a continuous, low-overhead measurement of the cache policy's fidelity in production.
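A sketch of what that telemetry and sampling loop might look like; the log_fn, recompute_logits, and cached_logits hooks are hypothetical placeholders for whatever the serving stack provides.

```python
import random
import numpy as np

SAMPLE_RATE = 1 / 1000   # roughly one divergence check per thousand tokens

def record_cache_decision(log_fn, layer_id, hash_matched, adapter_id, epoch, route_sig):
    """Lightweight per-decision log record."""
    log_fn({"layer": layer_id, "hash_matched": hash_matched,
            "adapter": adapter_id, "epoch": epoch, "route_sig": route_sig})

def maybe_sample_divergence(log_fn, cached_logits, recompute_logits):
    """Occasionally recompute from scratch and compare against logits built on reused KV."""
    if random.random() < SAMPLE_RATE:
        fresh = recompute_logits()                        # full recomputation, no cache
        max_abs_diff = float(np.max(np.abs(cached_logits - fresh)))
        log_fn({"divergence_max_abs": max_abs_diff})
```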
Ship with the conservative default—reuse only when all hashes match—and selectively relax it for specific adapter and layer combinations once field data confirms that divergence is effectively zero. This is the same pattern that kernel allow-lists use for determinism policy: start strict, measure, and open up where the evidence supports it. The cache policy becomes a configurable, auditable artifact rather than a hidden implementation detail.