Why LLM Caching Is Only for Input Tokens

Why prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs and outputs are computed, and what is actually reusable across requests.

Inputs are processed in parallel; outputs are generated sequentially

When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass: the prefill phase. Those KV tensors are a deterministic function of the input, so they can be stored and reused if the same prefix shows up again. ...

May 21, 2026 · 3 min
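To make the prefill reuse described in the caching excerpt above concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how any provider implements caching: `toy_kv` and `PrefixKVCache` are hypothetical names, and the "KV" values are simple tuples rather than tensors. In a real transformer the KV entry at a position depends on every earlier token, which is why the cache key here is the whole prefix.

```python
from typing import Dict, List, Tuple

KV = Tuple[int, str]


def toy_kv(position: int, token: str) -> KV:
    # Hypothetical stand-in for the per-token key/value computation in prefill.
    return (position, token.upper())


class PrefixKVCache:
    """Toy cache: maps an exact token prefix to the KV entries computed for it."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, ...], List[KV]] = {}
        self.tokens_computed = 0  # counts toy_kv calls so reuse is visible

    def prefill(self, tokens: List[str]) -> List[KV]:
        kv: List[KV] = []
        start = 0
        # Look for the longest prefix of this prompt that is already cached.
        for end in range(len(tokens), 0, -1):
            cached = self._store.get(tuple(tokens[:end]))
            if cached is not None:
                kv = list(cached)
                start = end
                break
        # Compute KV only for the tokens the cache did not cover,
        # and remember each new prefix along the way.
        for i in range(start, len(tokens)):
            kv.append(toy_kv(i, tokens[i]))
            self.tokens_computed += 1
            self._store[tuple(tokens[: i + 1])] = list(kv)
        return kv


cache = PrefixKVCache()
first = cache.prefill(["system", "rules", "question-1"])
second = cache.prefill(["system", "rules", "question-2"])
```

The second call shares the two-token prefix with the first, so only its final token is computed. Storing every prefix length keeps the toy short; it is not meant as a realistic storage policy.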

Attention in Machine Learning

Attention is a mechanism that lets a model dynamically decide which parts of the input matter most when producing each piece of output. Instead of compressing everything into one fixed representation, the model computes a weighted combination of inputs, where the weights are produced by learned parameters and depend on context.

Intuition

When translating “the cat sat on the mat” to French, generating the word for “cat” should mostly pay attention to “cat” in the source, not “mat” or “on.” Attention makes this routing explicit and differentiable. ...

May 17, 2026 · 3 min
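The "weighted combination of inputs" in the attention excerpt above can be sketched in a few lines of plain Python. This assumes scaled dot-product attention, one common formulation that the excerpt itself does not name. The function names `softmax` and `attend` are illustrative, and the query, key, and value vectors are fixed by hand here, whereas a real model would produce them with learned projections.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float],
           keys: List[List[float]],
           values: List[List[float]]) -> List[float]:
    # Score each key against the query, scaled by sqrt of the vector size.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Turn scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # Output is the weighted combination of the value vectors.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# The query lines up with the first key, so the first value dominates the output.
output = attend(query=[4.0, 0.0],
                keys=[[4.0, 0.0], [0.0, 4.0]],
                values=[[1.0, 0.0], [0.0, 1.0]])
```

Because the weights come from a softmax, every input contributes something, but inputs whose keys match the query contribute far more.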

Mixture of Experts (MoE)

Mixture of Experts is an architecture pattern in machine learning where a model is divided into many specialized sub-networks (“experts”), with a routing mechanism that selectively activates only a subset of them for any given input.

Core Idea

Instead of passing every input through all parameters of a model, MoE routes each token (or input) to only a few relevant experts. This decouples total parameter count from compute per forward pass: a model can have a very large parameter count while each token's forward pass uses only a fraction of it. ...

April 23, 2026 · 3 min
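The routing idea in the Mixture of Experts excerpt above can be sketched as top-k gating in plain Python. This is a minimal illustration, not a specific model's router: `top_k_route` and `moe_forward` are hypothetical names, the experts are trivial scalar functions, and the gate scores are passed in by hand, whereas a real model would compute them with a learned router.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def top_k_route(gate_scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    # Pick the k highest-scoring experts (ties keep the lower index first).
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the k weights sum to 1.
    m = max(gate_scores[i] for i in top)
    exps = [math.exp(gate_scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x: float, experts: Sequence[Expert],
                gate_scores: Sequence[float], k: int = 2) -> float:
    # Only the routed experts are called; the rest do no work for this input.
    routes = top_k_route(gate_scores, k)
    return sum(weight * experts[i](x) for i, weight in routes)


calls: List[int] = []  # records which experts actually ran


def make_expert(index: int) -> Expert:
    def expert(x: float) -> float:
        calls.append(index)
        return x * (index + 1)
    return expert


experts = [make_expert(i) for i in range(4)]
result = moe_forward(1.0, experts, gate_scores=[0.1, 2.0, -1.0, 2.0], k=2)
```

With four experts and `k=2`, only two expert functions run for this input, which is the sense in which total parameter count and per-input compute are decoupled.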