This article reflects work by the MLNavigator Research Group. Deployment lives in adapterOS.

Tags: caching, tokens, billing, receipts, verification, metering, audit, enterprise

Cache Credits: Cryptographic Proof You Were Not Overcharged

February 06, 2026 · James KC Auchterlonie

Prompt caching is one of the most effective optimizations in production inference. When a new request shares a prefix with a previous computation—the same system prompt, the same document context, the same few-shot examples—the system can reuse the intermediate representations from the cached prefix rather than recomputing them from scratch. The savings are significant: reusing a cached 2000-token prefix avoids roughly 2000 tokens' worth of computation on every subsequent request that shares that prefix.
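To make the mechanics concrete, here is a minimal sketch of a prefix cache keyed by a hash of the token sequence. The `PrefixCache` class and its methods are hypothetical, not any real engine's API; production systems cache attention key/value states and hash block-aligned prefixes rather than probing every possible length as this toy does.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache mapping a hash of a token prefix to the number
    of tokens it covers. Illustrative only: real engines cache attention
    key/value states and use block-aligned hashing for efficiency."""

    def __init__(self):
        self._entries = {}  # prefix hash -> tokens covered

    @staticmethod
    def _prefix_id(tokens):
        # Stable identifier for a token sequence.
        return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()

    def store(self, tokens):
        self._entries[self._prefix_id(tokens)] = len(tokens)

    def lookup(self, tokens):
        """Return (cache_id, cached_count) for the longest stored prefix
        of `tokens`, or (None, 0) on a miss."""
        for end in range(len(tokens), 0, -1):
            cid = self._prefix_id(tokens[:end])
            if cid in self._entries:
                return cid, end
        return None, 0

cache = PrefixCache()
cache.store(list(range(2000)))              # cache a 2000-token prefix
cid, hit = cache.lookup(list(range(3000)))  # new request shares that prefix
assert hit == 2000                          # 2000 tokens need no recompute
```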

The question is: who benefits from that savings?

In most commercial inference services, the answer is ambiguous. The provider may or may not pass cache savings through to the customer. The billing statement shows tokens consumed, but it does not distinguish between tokens that were actually computed and tokens that were served from cache. The customer has no mechanism to verify whether caching occurred, how many tokens it saved, or whether the billing reflects the reduced computation.

This is not a trust problem in the sense that providers are necessarily overcharging. It is a transparency problem. The customer cannot verify the bill, and the provider cannot prove the bill is accurate, because the information required for verification—which tokens were cached, which were computed, and how the attribution was calculated—is not part of the billing artifact.

[Figure: Verifiable token attribution. A request with 3000 logical tokens and a 2000-token cached prefix yields 1000 attributed tokens; the receipt records logical: 3000, cached: 2000, attributed: 1000, savings: 66.7%.]

Logical Tokens Versus Attributed Tokens

The distinction that makes verifiable cache credits possible is between logical tokens and attributed tokens.

Logical tokens are the tokens in the request as the caller sees them. If you send a 3000-token prompt and receive a 500-token response, the logical token count is 3500—3000 input, 500 output. This is the count the caller can verify independently, because they have the prompt and the response.

Cached tokens are the tokens whose intermediate representations were reused from a previous computation. If 2000 of the 3000 input tokens matched a cached prefix, the cached token count is 2000.

Attributed tokens are the tokens that required actual computation: logical tokens minus cached tokens. In this example, attributed input tokens are 3000 − 2000 = 1000. The system only computed the forward pass for 1000 new input tokens plus the 500 output tokens.
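In code, the relationship is a single subtraction. A minimal sketch, with the `TokenCounts` structure invented here for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenCounts:
    logical_input: int  # input tokens as the caller sees them
    cached: int         # input tokens served from a cached prefix
    output: int         # generated tokens (always computed)

    @property
    def attributed_input(self) -> int:
        # Tokens that actually required a forward pass.
        return self.logical_input - self.cached

counts = TokenCounts(logical_input=3000, cached=2000, output=500)
assert counts.attributed_input == 1000  # 3000 - 2000
```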

This arithmetic is straightforward, but its implications for billing are substantial. The customer's computational cost should be proportional to attributed tokens, not logical tokens. A 3000-token prompt that hits a 2000-token cache should cost roughly one-third of a 3000-token prompt computed from scratch, because only one-third of the input computation was actually performed.

Binding Cache Credits into the Receipt

The cryptographic receipt commits all three token counts—logical, cached, and attributed—into the verifiable digest. This commitment has several consequences that simple billing logs do not provide.

First, the token counts are tamper-evident. If the receipt says 2000 tokens were cached, any modification to that number produces a different digest that fails verification. The provider cannot claim fewer cached tokens after the fact, and the customer cannot claim more.

Second, the cache identifier—a hash of the cached prefix—is included in the receipt. This means a verifier can confirm that a specific cached computation was reused, not just that some caching occurred. If two requests claim to reuse the same cache, they should reference the same cache identifier. If they reference different identifiers, they used different caches, and the attributed token counts may differ accordingly.

Third, the attribution arithmetic is verifiable. A verifier can confirm that attributed = logical − cached and that the three values in the receipt are internally consistent. This eliminates an entire category of billing error where logical and attributed counts do not reconcile.
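As a hedged sketch of what such a receipt and its verifier could look like, the example below commits the counts and the cache identifier into a SHA-256 digest over a canonical JSON encoding. The field names and encoding are assumptions for illustration, not the actual receipt format; a real scheme would also sign the digest so a provider cannot simply re-issue a different receipt.

```python
import hashlib
import json

def make_receipt(cache_id, logical, cached):
    """Commit the three token counts and the cache identifier into one
    digest. Toy format: canonical JSON hashed with SHA-256."""
    body = {
        "cache_id": cache_id,            # hash of the reused prefix
        "logical": logical,
        "cached": cached,
        "attributed": logical - cached,
    }
    canonical = json.dumps(body, sort_keys=True).encode()
    return {**body, "digest": hashlib.sha256(canonical).hexdigest()}

def verify_receipt(receipt):
    """Check tamper evidence, then internal consistency."""
    body = {k: v for k, v in receipt.items() if k != "digest"}
    canonical = json.dumps(body, sort_keys=True).encode()
    if hashlib.sha256(canonical).hexdigest() != receipt["digest"]:
        return False  # a count was altered after issuance
    return receipt["attributed"] == receipt["logical"] - receipt["cached"]

r = make_receipt(cache_id="3f9a...", logical=3000, cached=2000)
assert verify_receipt(r)
tampered = {**r, "cached": 2500}   # claim more caching after the fact
assert not verify_receipt(tampered)
```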

Why This Matters at Scale

At small volumes, the difference between logical and attributed billing is negligible. At enterprise scale, it is not.

Consider an organization that processes 50,000 documents per day through an inference pipeline. Each document includes a 1500-token system prompt that is identical across all requests. Without cache credits, the organization is billed for 75 million system-prompt tokens per day. With proper cache attribution, that prefix is computed once and served from cache thereafter, so the attributed count drops to roughly 1,500 system-prompt tokens per day, a savings of about 74,998,500 tokens per day.
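Spelled out in code, assuming the cache stays warm across the entire day:

```python
docs_per_day = 50_000
prompt_tokens = 1_500   # identical system prompt on every request

logical = docs_per_day * prompt_tokens   # 75,000,000 tokens/day without credits
attributed = prompt_tokens               # computed once, then served from cache
savings = logical - attributed           # 74,998,500 tokens/day avoided
print(f"{savings:,} tokens/day of avoided computation")
```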

At current API pricing, this difference can be hundreds of thousands of dollars per month. The organization should not have to trust that the provider is applying cache credits correctly. It should be able to verify.

Verifiable cache credits also change the procurement conversation. Instead of negotiating a discount that the customer hopes reflects caching savings, the customer can audit actual cache utilization. If the provider claims 60% cache hit rates but the receipts show 30%, the discrepancy is documented and quantifiable. If the provider claims to have passed through cache savings but the attributed token counts do not reflect it, the receipts provide the evidence for a billing dispute.

The Completeness Property

One subtle but important property of binding cache credits into the receipt is completeness. The receipt accounts for every token in the request: each one is either cached or attributed, and the two categories sum to the logical total. There is no unaccounted residual.

This completeness property means that the receipt is a self-contained billing artifact. A finance team does not need access to the inference system's internal cache state, does not need to correlate billing logs with operational logs, and does not need to trust the provider's billing pipeline. The receipt itself contains all the information required to verify the charge: what was requested (logical tokens), what was reused (cached tokens), and what should be billed (attributed tokens).
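A finance-side check needs nothing beyond the receipt itself. A sketch, reusing the illustrative receipt fields from the earlier example; the per-token price is a placeholder, not real API pricing:

```python
def audit(receipt, price_per_million=3.00):
    """Verify completeness and price the charge from the receipt alone.
    The price is a placeholder, not real API pricing."""
    # Completeness: every token is either cached or attributed.
    if receipt["cached"] + receipt["attributed"] != receipt["logical"]:
        raise ValueError("receipt does not account for every token")
    return receipt["attributed"] / 1_000_000 * price_per_million

charge = audit({"logical": 3000, "cached": 2000, "attributed": 1000})
print(f"${charge:.4f}")   # $0.0030 for 1000 attributed tokens
```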

For organizations operating under audit requirements—SOC 2, CMMC, internal compliance—this self-contained verifiability is a meaningful improvement over the current state of inference billing, which typically requires reconciling multiple data sources, trusting provider-reported metrics, and accepting some level of unresolvable variance as a cost of doing business.

Cache Expiry and Invalidation

One question that arises with verifiable cache credits is what happens when the cache is invalidated. If the system prompt changes, the model is updated, or the cache TTL expires, previously cached prefixes are no longer valid and the next request must compute them from scratch.

The receipt handles this naturally. If no cache was available, the cached token count is zero and the attributed token count equals the logical token count. The receipt does not claim cache credits that were not actually applied. If a partial cache was available—the first 1000 of 1500 system prompt tokens matched a cache entry, but the remaining 500 did not—the receipt reflects partial caching with the appropriate attributed count.
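Partial matches fall out of the same arithmetic. A sketch of longest-common-prefix attribution, using a hypothetical helper and token-level comparison:

```python
def attribute_with_partial_cache(request_tokens, cached_prefix):
    """Count leading tokens that match the cached prefix; attribute
    only the remainder."""
    cached = 0
    for req, pre in zip(request_tokens, cached_prefix):
        if req != pre:
            break
        cached += 1
    return {"logical": len(request_tokens),
            "cached": cached,
            "attributed": len(request_tokens) - cached}

# First 1000 of 1500 system-prompt tokens match; 500 must be recomputed.
request = list(range(1500))
stale = list(range(1000)) + [-1] * 500   # prefix diverges after token 1000
assert attribute_with_partial_cache(request, stale)["attributed"] == 500
```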

This transparency around cache invalidation is valuable for capacity planning. An organization that monitors its cache hit rates through receipts can detect when a change in system prompts, model versions, or request patterns has degraded caching effectiveness, and can adjust its prompting strategy or caching configuration accordingly—without waiting for the billing variance to show up at the end of the month.
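Monitoring cache effectiveness then reduces to a fold over receipts. A sketch, again using the illustrative field names from the receipt example above:

```python
def cache_hit_rate(receipts):
    """Fraction of logical input tokens served from cache across a batch
    of receipts; a sudden drop flags invalidation or prompt drift."""
    logical = sum(r["logical"] for r in receipts)
    cached = sum(r["cached"] for r in receipts)
    return cached / logical if logical else 0.0

receipts = [{"logical": 3000, "cached": 2000},
            {"logical": 3000, "cached": 0}]    # one hit, one cold miss
assert cache_hit_rate(receipts) == 2000 / 6000
```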