Why LLM Caching Is Only for Input Tokens

Prompt caching in LLM APIs (Anthropic, OpenAI, Google) applies to inputs and not outputs. The asymmetry comes down to how inputs and outputs are computed, and what’s actually reusable across requests.

Inputs are processed in parallel; outputs are generated sequentially

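When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass: the prefill phase. With causal attention, each token’s KV tensors depend only on that token and the ones before it, so for a fixed model they are a deterministic function of the prefix. That is why they can be stashed and reused if the same prefix shows up again, even when the rest of the prompt differs.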
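To make the prefix-reuse idea concrete, here is a minimal sketch of a cross-request prefix cache. It is a toy, not any provider's implementation: `compute_kv` and `PrefixCache` are hypothetical names, and a small tuple stands in for the real per-layer tensors. The only property it tries to show is that a token's cached entry depends on nothing after it.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for one token's key/value tensors.
KV = Tuple[int, int]


def compute_kv(prefix: Tuple[int, ...]) -> KV:
    """Toy 'KV' for the last token of `prefix`; it depends only on the prefix."""
    return (sum(prefix) % 9973, len(prefix))


class PrefixCache:
    """Toy cross-request cache keyed by token prefix (an assumption, not a real API)."""

    def __init__(self) -> None:
        self._kv_by_prefix: Dict[Tuple[int, ...], KV] = {}

    def prefill(self, tokens: List[int]) -> Tuple[List[KV], int]:
        """Return the KV entry for every token and how many had to be computed."""
        toks = tuple(tokens)
        kvs: List[KV] = []
        computed = 0
        for i in range(len(toks)):
            prefix = toks[: i + 1]
            if prefix not in self._kv_by_prefix:
                # Cache miss: this position has to be computed.
                self._kv_by_prefix[prefix] = compute_kv(prefix)
                computed += 1
            kvs.append(self._kv_by_prefix[prefix])
        return kvs, computed


cache = PrefixCache()
system_prompt = [11, 12, 13, 14]  # shared across requests
kvs_a, computed_a = cache.prefill(system_prompt + [21, 22])  # cold: all 6 computed
kvs_b, computed_b = cache.prefill(system_prompt + [31])      # warm: only the new token
```

Real systems cache in larger blocks and evict entries after a while, but the shape is the same: the second request pays only for the tokens after the shared prefix.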

Output generation is the decode phase: one token at a time, each step depending on the sampled previous token. Even with the same input, sampling (temperature, top-p) can produce different sequences. There’s nothing stable to cache across requests.
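A small sketch of why decode output isn't stable across requests. The "model" here is a hypothetical `toy_logits` function, not a real network; the point is only that greedy decoding repeats itself while temperature sampling depends on the random draw at each step.

```python
import math
import random
from typing import Dict, List


def toy_logits(context: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a model: logits depend only on the context so far."""
    bias = len(context) % 3
    return {"a": 1.0 + bias, "b": 1.5, "c": 0.5 + 0.5 * bias}


def sample_next(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token; temperature 0 is treated as greedy argmax."""
    if temperature == 0.0:
        return max(logits, key=lambda tok: logits[tok])
    # Softmax with temperature, shifted by the max logit for numerical stability.
    top = max(logits.values())
    weights = [math.exp((v - top) / temperature) for v in logits.values()]
    return rng.choices(list(logits.keys()), weights=weights, k=1)[0]


def generate(prompt: List[str], steps: int, temperature: float, seed: int) -> List[str]:
    """Sequential decode: each step conditions on everything generated so far."""
    rng = random.Random(seed)
    out: List[str] = []
    for _ in range(steps):
        out.append(sample_next(toy_logits(prompt + out), temperature, rng))
    return out


prompt = ["x", "y"]
greedy_runs = {tuple(generate(prompt, 8, 0.0, seed)) for seed in range(20)}
sampled_runs = {tuple(generate(prompt, 8, 1.0, seed)) for seed in range(20)}
```

With the same prompt, `greedy_runs` collapses to a single sequence, while `sampled_runs` contains several. A cache keyed on the input alone would have no single "right" output to return in the sampled case.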

Inputs are reused; outputs typically aren’t

Prompt caching exists because long system prompts, RAG contexts, code repos, and document attachments get sent again and again across calls. Caching them turns expensive recomputation into a lookup.

Outputs, by contrast, are consumed once by the user and rarely sent back verbatim. Caching a 2000-token answer to reuse it would require a future request to ask for that exact answer — which almost never happens. When it does, that’s response caching at the application layer (e.g. memoizing a deterministic query), not model-level caching.
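For contrast, application-layer response caching looks roughly like this. It is a sketch under assumptions: `call_model` is a hypothetical client function you would supply, and the cache only makes sense when the query is deterministic (for example temperature 0) or when returning an earlier answer is acceptable.

```python
import hashlib
import json
from typing import Callable, Dict

ModelFn = Callable[[str, str, Dict[str, object]], str]


def cache_key(model: str, prompt: str, params: Dict[str, object]) -> str:
    """Stable key over everything that affects the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Memoizes whole responses in the application, outside the model."""

    def __init__(self, call_model: ModelFn) -> None:
        self._call_model = call_model  # hypothetical client function
        self._responses: Dict[str, str] = {}
        self.misses = 0

    def complete(self, model: str, prompt: str, params: Dict[str, object]) -> str:
        key = cache_key(model, prompt, params)
        if key not in self._responses:
            # Only an exact repeat of model, prompt, and params avoids this call.
            self.misses += 1
            self._responses[key] = self._call_model(model, prompt, params)
        return self._responses[key]
```

Note what the key contains: the full request. Nothing about the model's internals is reused, which is why this sits in your code rather than in the provider's serving stack.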

But outputs DO get cached — just on the next turn

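In a multi-turn conversation, the assistant message from an earlier turn becomes part of a later turn’s input. From that point on it sits inside the prompt prefix, and the provider’s prompt cache can reuse those tokens like any other input, as long as the cache entry hasn’t expired (entries are short-lived, typically minutes rather than days).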

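So there isn’t really a “no caching for outputs” rule. It’s more that output tokens only become cacheable once they’ve transitioned into being input tokens for a subsequent call. The mechanics differ by provider when you replay a conversation: OpenAI caches matching prefixes automatically, Gemini supports implicit caching on newer models, and Anthropic has you opt in by marking the prefix with cache_control.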
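The transition from output to input is easy to see if you lay the requests side by side. In this sketch the token lists are hypothetical word-level stand-ins for real tokenizer output, and `common_prefix_len` is an illustrative helper.

```python
from typing import List


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens two requests share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = ["You", "are", "helpful", "."]
user_1 = ["What", "is", "prefill", "?"]
assistant_1 = ["Prefill", "computes", "KV", "tensors", "."]  # output of turn 1
user_2 = ["And", "decode", "?"]
assistant_2 = ["Decode", "emits", "one", "token", "per", "step", "."]  # output of turn 2
user_3 = ["Thanks", "."]

turn_2_input = system + user_1 + assistant_1 + user_2
turn_3_input = turn_2_input + assistant_2 + user_3

# Everything sent on turn 2 is a prefix of what is sent on turn 3.
shared = common_prefix_len(turn_2_input, turn_3_input)

# The turn-1 assistant message lies inside that shared prefix.
start = len(system) + len(user_1)
replayed_output = turn_3_input[start : start + len(assistant_1)]
```

Here `shared` equals the full length of `turn_2_input`, and `replayed_output` is exactly `assistant_1`: tokens that were once output are now ordinary prefix tokens that a prompt cache can cover.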

The KV cache during decode is a separate thing

Inside a single generation, each newly produced token’s KV vectors get appended to a running cache, so the next step doesn’t have to recompute keys and values for every earlier token. It computes them only for the newest token and attends over the cached ones. That’s a within-request optimization, and essentially every inference stack uses it.

It’s not what people mean by “prompt caching.” The cross-request prompt cache that providers charge a discount for is specifically about the prefill phase being skippable.
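The within-request KV cache can be sketched by counting how many key/value projections a decode loop performs with and without it. Everything here is a toy: `project_kv` and `next_token` are hypothetical stand-ins for the real projection and attention steps.

```python
from typing import List, Tuple

KV = Tuple[int, int]


def project_kv(token: int) -> KV:
    """Hypothetical stand-in for the key/value projection of one token."""
    return (token * 2, token * 3)


def next_token(kvs: List[KV]) -> int:
    """Toy 'attention': reads every cached entry to pick the next token."""
    return sum(k + v for k, v in kvs) % 50


def decode_without_cache(prompt: List[int], steps: int) -> Tuple[List[int], int]:
    """Recompute KV for the whole sequence at every step."""
    seq = list(prompt)
    projections = 0
    for _ in range(steps):
        kvs = [project_kv(t) for t in seq]
        projections += len(kvs)
        seq.append(next_token(kvs))
    return seq, projections


def decode_with_cache(prompt: List[int], steps: int) -> Tuple[List[int], int]:
    """Compute KV once per token and append to a running cache."""
    seq = list(prompt)
    kvs = [project_kv(t) for t in seq]  # prefill
    projections = len(kvs)
    for _ in range(steps):
        nxt = next_token(kvs)
        seq.append(nxt)
        kvs.append(project_kv(nxt))  # one new projection per generated token
        projections += 1
    return seq, projections


seq_slow, cost_slow = decode_without_cache([3, 1, 4, 1], steps=3)
seq_fast, cost_fast = decode_with_cache([3, 1, 4, 1], steps=3)
```

Both loops produce the same sequence, but the uncached one performs 15 projections against 7 for the cached one, and the gap grows quadratically with length. Note that `next_token` still reads every cached entry in both versions: the cache removes recomputation, not attention over the history.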

Pricing reflects the compute asymmetry

  • Output tokens cost more than input tokens (often 4–5×) because each one requires its own forward pass through the model, with no parallelism across the positions of a sequence.
  • Cached input tokens cost even less than uncached input, because the provider skips the prefill compute for the cached prefix and just loads the stored tensors.
  • There’s no equivalent shortcut for output generation. You can’t “skip” producing a token you haven’t produced yet.
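A worked example of how the three price tiers combine. The rates below are hypothetical round numbers chosen to match the ratios described above (output at 5× input, cached input at a tenth of input); they are not any provider's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Prices in dollars per million tokens (hypothetical values)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def request_cost(rates: Rates, prompt_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Dollar cost of one request, given how much of the prompt hit the cache."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    total = (
        uncached_tokens * rates.input_per_mtok
        + cached_tokens * rates.cached_input_per_mtok
        + output_tokens * rates.output_per_mtok
    )
    return total / 1_000_000


# Assumed rates: $3 input, $0.30 cached input, $15 output per million tokens.
rates = Rates(input_per_mtok=3.0, cached_input_per_mtok=0.30, output_per_mtok=15.0)

cold = request_cost(rates, prompt_tokens=10_000, cached_tokens=0, output_tokens=500)
warm = request_cost(rates, prompt_tokens=10_000, cached_tokens=9_000, output_tokens=500)
```

Under these assumed rates the cold request costs about $0.0375 and the warm one about $0.0132. The output portion ($0.0075) is identical in both, which is the asymmetry in miniature: caching shrinks the input bill and leaves the output bill untouched.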

TL;DR

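                           | Input                        | Output
---------------------------+------------------------------+--------------------------------------------
Computation                | Parallel prefill             | Sequential decode
Determinism                | Deterministic given input    | Stochastic (sampling)
Reuse pattern              | Same prompts sent repeatedly | Generated once, rarely resent
Cacheable across requests? | Yes                          | Not until it becomes input on the next turn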

Inputs are deterministic, parallelizable, and frequently reused — perfect cache candidates. Outputs are sequential, stochastic, and consumed once. The moment they’re not consumed once, they’ve become inputs anyway.