LLM Prompt Caching: Implicit vs Explicit

Caching in LLM inference is about reusing the KV cache computed from a prompt prefix so the model doesn’t re-process the same tokens on every request. The “implicit vs explicit” distinction is about who manages that cache.

Prompt Prefix: The Underlying Mechanism

“Prefix” means literally the starting tokens of the prompt: the tokens from position 0 onward, in order, that two requests have in common before they diverge. ...
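As a rough illustration of what "prefix" means here, the sketch below counts how many leading tokens two requests share before they diverge. The function name `shared_prefix_len` and the token IDs are made up for the example and are not taken from any provider's API; real systems may also match prefixes at a coarser granularity than single tokens.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return how many leading tokens a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break  # first divergence ends the shared prefix
        n += 1
    return n


# Hypothetical token IDs: a shared system prompt followed by different user turns.
system_prompt = [101, 7, 7, 42]
request_a = system_prompt + [5, 6]
request_b = system_prompt + [9, 6]

# Only the leading run counts; the matching trailing 6 comes after the divergence.
print(shared_prefix_len(request_a, request_b))
```

Only the tokens before the first difference are candidates for reuse, which is why stable content (system prompt, tool definitions, long documents) is usually placed at the start of the prompt.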

May 21, 2026 · 3 min

Why LLM Caching Is Only for Input Tokens

Why prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs vs. outputs are computed, and what’s actually reusable across requests.

Inputs are processed in parallel; outputs are generated sequentially

When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass, known as the prefill phase. Those KV tensors are a deterministic function of the input, so they can be stashed and reused if the same prefix shows up again. ...
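The reuse described above can be sketched with a toy cache. Everything here is an assumption made for illustration: `ToyPrefixCache` is a hypothetical class, and a hash of the tokens seen so far stands in for the real per-token KV tensors. The only property it is meant to show is that a value that depends solely on the preceding input can be stored once and reused when the same prefix returns.

```python
from typing import List, Tuple


class ToyPrefixCache:
    """Toy stand-in for a KV cache keyed by prompt prefix (not a real inference engine)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[int], List[int]]] = []
        self.tokens_computed = 0  # counts "prefill work" actually performed

    def prefill(self, tokens: List[int]) -> List[int]:
        # Find the cached prompt that shares the longest leading run with this one.
        best_len, best_kv = 0, []
        for cached_tokens, cached_kv in self._entries:
            n = 0
            for x, y in zip(cached_tokens, tokens):
                if x != y:
                    break
                n += 1
            if n > best_len:
                best_len, best_kv = n, cached_kv[:n]

        # Reuse the stored values for the shared prefix, compute only the rest.
        kv = list(best_kv)
        for pos in range(best_len, len(tokens)):
            # Fake "KV" for position pos: depends only on tokens[0..pos].
            kv.append(hash(tuple(tokens[: pos + 1])))
            self.tokens_computed += 1

        self._entries.append((list(tokens), kv))
        return kv


cache = ToyPrefixCache()
cache.prefill([1, 2, 3, 4, 5])   # cold: all 5 positions are computed
cache.prefill([1, 2, 3, 9])      # warm: positions 0-2 are reused, 1 is computed
print(cache.tokens_computed)
```

Because each stored value depends only on the tokens up to its position, the warm request produces the same result as a cold one while doing less work.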

May 21, 2026 · 3 min