LLM Prompt Caching: Implicit vs Explicit

Caching in LLM inference is about reusing the KV-cache computed from a prompt prefix so the model doesn’t re-process the same tokens on every request. The “implicit vs explicit” distinction is about who manages that cache.

Prompt Prefix: The Underlying Mechanism

“Prefix” means literally the starting tokens of the prompt — the bytes from position 0 onward, in order, that two requests have in common before they diverge.

When a transformer processes a prompt, it computes attention keys and values for each token. The KV state for token N depends on every token before it. So if request A is [system prompt][doc X][question 1] and request B is [system prompt][doc X][question 2], the KV state for [system prompt][doc X] is identical in both — the model can skip recomputing it and pick up at the divergence point.

Key constraint: it has to be a true prefix — byte-identical, from token zero. If request B differs by even one token (a different system prompt, an extra space, a swapped ordering), the cache is invalidated from that point onward, because every downstream token’s attention now depends on different upstream state. You cannot cache the middle or end of a prompt while changing the beginning.

Prompt ordering for cacheability

Structure prompts so volatile content comes last:

System prompt (stable across all calls)
Tool/function definitions (stable per app version)
Large static context — RAG documents, knowledge base chunks (stable per session)
Conversation history (grows but is append-only, so prior turns stay prefix-stable)
The new user message (the only volatile part)

Flip that order — put the user message first — and the cache busts on every turn.

Implicit Caching

Automatic. The provider’s inference layer detects when a new request shares a long prefix with a recent one and silently reuses the cached state. No code changes — just send the same system prompt or document context at the start of each request, and if the platform’s heuristics fire (usually requiring a minimum prefix length and a short time since the last hit), you get a discount on the cached tokens.

Providers: Gemini, OpenAI, and DeepSeek all do versions of this.

Pros: Zero integration effort.

Cons: Best-effort. No TTL you control, no guarantee a given request will hit, and cold-start requests pay full price.

Explicit Caching

You tell the provider: “store this exact context, give me back a handle, and bill me for storage until it expires.” Subsequent requests reference the handle instead of resending the content.

Providers:

Anthropic — cache_control: {type: "ephemeral"} markers on message blocks
Gemini — CachedContent API

Pros: Deterministic. Within the TTL, you will hit. Storage fee + one-time write cost, but reads are dramatically cheaper than re-tokenizing.

Cons: Requires integration code; you pay storage cost and must manage cache lifecycle (creation, TTL, invalidation).

When to Use Which

Implicit wins for workloads with naturally repeating prefixes and bursty traffic where you can’t easily reason about cache lifetime — chatbots, IDE autocomplete, anything where a system prompt is shared across many short-lived calls.
Explicit wins when you have a large, stable context (long document, big tool/schema block, knowledge base chunk) queried many times over a known window. Trade integration code + storage cost for guaranteed savings.

In practice teams often layer them — explicit caches for the heavy stable stuff (system prompt + tool definitions + large RAG context), implicit caching catching the rest opportunistically.

Prompt Prefix: The Underlying Mechanism#

Prompt ordering for cacheability#

Implicit Caching#

Explicit Caching#

When to Use Which#

Prompt Prefix: The Underlying Mechanism

Prompt ordering for cacheability

Implicit Caching

Explicit Caching

When to Use Which