LLM Prompt Cache Options Across Providers

A reference covering cache TTL options and other cache-control dimensions across major LLM providers as of May 2026.

TTL mechanics

Fixed-duration TTLs

Anthropic: 5-min (default) and 1-hour (extended). Cache writes cost 1.25× base input for 5-min TTL, 2× for 1-hour. Cache reads ≈ 10% of base input. TTL refreshes on each read (sliding window).

AWS Bedrock: 5-min default, 1-hour added Jan 2026 for Claude Sonnet 4.5, Haiku 4.5, Opus 4.5. Also refresh-on-read.

OpenRouter (Gemini path): 5-min TTL that does NOT update on read (fixed window). This is gateway-specific behavior worth checking when going through proxies.

Arbitrary / configurable TTL

Google Gemini explicit caching: No minimum or maximum bounds on TTL. Default 60 min. You can update TTL on an existing cache and delete it early to stop billing. Billed as cached_tokens × storage_duration (per token-hour), not via a write-time premium.

Opaque / provider-managed retention

OpenAI: No exposed TTL. Baseline ~5–10 min of idle retention; off-peak can persist up to 1 hour. Extended prompt caching retains KV tensors 1–2h typical, up to 24h max.

DeepSeek, Grok, Moonshot, Groq, Kimi K2: Automatic, provider-managed, no exposed TTL.

Implicit vs explicit control

Implicit (zero-config): OpenAI, DeepSeek, Grok, Moonshot, Groq, Gemini implicit tier. Server decides what to cache when it detects a recurring prefix.

Explicit (marked / lifecycle-managed): Anthropic and Alibaba use inline cache_control: {"type": "ephemeral"} markers. Gemini explicit caching exposes full CRUD on cache objects via API (create, get, update, delete), so caches behave like first-class resources, similar to Valkey keys.

Cache breakpoints / layering

Anthropic supports up to 4 cache_control breakpoints in a single request. You can mix TTLs within one request, but longer TTL blocks must appear before shorter TTL blocks in the prompt structure (tools → system → messages order).

Practical use: 1-hour cache for stable system prompt + tool defs, 5-min cache for mid-conversation context, paying the higher write premium only on the truly stable prefix. ...
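To make the write-premium trade-off concrete, here is a minimal Python sketch of the arithmetic using the Anthropic multipliers quoted above (1.25× and 2× for writes, roughly 10% for reads). The function names are hypothetical, prices are normalized to 1.0 per base input token, and the model assumes every request after the first hits the cache within its TTL; it ignores output tokens and any uncached suffix.

```python
# Multipliers taken from the excerpt; everything else is an illustrative assumption.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # cache-write cost relative to base input
READ_MULTIPLIER = 0.10                      # cache-read cost relative to base input


def uncached_cost(prefix_tokens, requests, base_price_per_token=1.0):
    """Cost of sending the same prefix on every request with no caching."""
    return prefix_tokens * requests * base_price_per_token


def cached_cost(prefix_tokens, requests, ttl, base_price_per_token=1.0):
    """One cache write, then (requests - 1) cache reads.

    Assumes every later request arrives before the TTL expires.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    write = WRITE_MULTIPLIER[ttl] * prefix_tokens * base_price_per_token
    reads = READ_MULTIPLIER * prefix_tokens * base_price_per_token * (requests - 1)
    return write + reads


def break_even_requests(ttl):
    """Smallest request count at which caching is strictly cheaper."""
    n = 1
    while cached_cost(1000, n, ttl) >= uncached_cost(1000, n):
        n += 1
    return n


if __name__ == "__main__":
    for ttl in ("5m", "1h"):
        print(ttl, "break-even at", break_even_requests(ttl), "requests")
```

Under these assumptions the 5-min tier pays for itself from the second request and the 1-hour tier from the third, which is consistent with reserving the 1-hour TTL for the prefix that is reused most often.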

May 21, 2026 · 4 min

LLM Prompt Caching: Implicit vs Explicit

Caching in LLM inference is about reusing the KV-cache computed from a prompt prefix so the model doesn’t re-process the same tokens on every request. The “implicit vs explicit” distinction is about who manages that cache.

Prompt Prefix: The Underlying Mechanism

“Prefix” means literally the starting tokens of the prompt: the bytes from position 0 onward, in order, that two requests have in common before they diverge. ...
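The prefix definition above can be sketched in a few lines of Python. This is an illustration only: `common_prefix_len` is a hypothetical helper, and the token IDs are made up rather than produced by any real tokenizer.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count matching tokens from position 0 until the first mismatch."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two requests that share a system prompt (tokens 11, 12, 13) and then diverge.
request_a = [11, 12, 13, 40, 41]
request_b = [11, 12, 13, 50, 41]

shared = common_prefix_len(request_a, request_b)
print(shared)  # only these leading tokens are candidates for KV-cache reuse
```

Note that the matching token 41 at the end of both requests does not count: once the sequences diverge, nothing after that point belongs to the shared prefix.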

May 21, 2026 · 3 min

Vectors vs Tensors

Vectors vs Tensors: Are They the Same?

Short answer: related but not identical. A vector is a special case of a tensor.

The math hierarchy

Term     Rank   Shape example
Scalar   0      a single number
Vector   1      [d] (a 1D array)
Matrix   2      [m, n] (a 2D array)
Tensor   N      [d1, d2, ..., dN] (generic N-dimensional array)

Every vector is a tensor (specifically, a rank-1 tensor). Not every tensor is a vector. ...
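The rank and shape columns of the hierarchy above can be illustrated with plain nested Python lists. This is a minimal sketch: `shape` and `rank` are hypothetical helpers, and they assume the nested lists are rectangular (every sub-list at a given depth has the same length).

```python
def shape(x):
    """Return the shape of a rectangular nested list as a list of dimensions."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0] if x else None  # descend into the first element, if any
    return dims


def rank(x):
    """Rank is the number of dimensions, i.e. the length of the shape."""
    return len(shape(x))


scalar = 3.0
vector = [1.0, 2.0, 3.0]
matrix = [[1, 2], [3, 4], [5, 6]]
tensor3 = [[[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]]]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor3", tensor3)]:
    print(name, "rank", rank(value), "shape", shape(value))
```

The vector is simply the rank-1 row of the same hierarchy, which is the sense in which every vector is a tensor but not the other way round.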

May 21, 2026 · 2 min

Why LLM Caching Is Only for Input Tokens

Why prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs vs. outputs are computed, and what’s actually reusable across requests.

Inputs are processed in parallel; outputs are generated sequentially

When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass (the prefill phase). Those KV tensors are a deterministic function of the input, so they can be stashed and reused if the same prefix shows up again. ...
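The "deterministic function of the input" point can be shown with a toy Python model. Nothing here resembles a real transformer: `toy_kv` is a hypothetical stand-in that hashes the tokens up to a position, chosen only because, like causal KV entries, its result depends on nothing after that position. A real prefill computes all positions in one parallel pass; the loop below is just for readability.

```python
import hashlib


def toy_kv(prefix):
    """Stand-in for the KV entry at the last position of `prefix`.

    Deterministic, and dependent only on the tokens up to that position.
    """
    return hashlib.sha256(repr(prefix).encode()).hexdigest()[:8]


def prefill(tokens, cache):
    """Return per-position toy KV entries plus how many had to be computed."""
    kv, computed = [], 0
    for i in range(1, len(tokens) + 1):
        key = tuple(tokens[:i])
        if key not in cache:
            cache[key] = toy_kv(key)  # cache miss: do the work once
            computed += 1
        kv.append(cache[key])
    return kv, computed


cache = {}
kv_first, computed_first = prefill([1, 2, 3, 4], cache)
kv_second, computed_second = prefill([1, 2, 3, 9, 9], cache)

print(computed_first, computed_second)
print(kv_second[:3] == kv_first[:3])
```

The second request reuses the entries for its first three positions and only computes the two positions after the point of divergence, which is the reuse pattern that makes caching input prefixes worthwhile.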

May 21, 2026 · 3 min