Speculative Decoding
Speculative decoding is an inference optimization that exploits an asymmetry in how LLMs work: the large model can check several candidate tokens in one parallel forward pass for roughly the cost of generating a single token.

The Basic Setup
You pair two models: a small, fast “draft” model and your large “target” model. The draft model generates several tokens ahead (typically 4–8), one token at a time but at a low cost per step. The large model then verifies all of those candidate tokens in parallel in one forward pass. If the draft tokens match what the large model would have produced, you accept them all at once. If a token diverges, you reject it (and everything after it) and fall back to the large model’s output for that position. ...
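The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for real models, decoding is greedy (exact token match), and the "parallel" verification is simulated with a list comprehension where a real transformer would score every position in one batched forward pass. The extra token appended when every draft token is accepted follows the common formulation of the technique and is not spelled out in the text above.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical toy 'target' model: deterministic greedy next token."""
    return (context[-1] * 3 + 1) % 10


def draft_next(context: List[Token]) -> Token:
    """Hypothetical toy 'draft' model: agrees with the target except after a 3."""
    if context[-1] == 3:
        return 9  # deliberate disagreement so that rejection can be observed
    return target_next(context)


def speculative_step(context: List[Token], draft: NextTokenFn,
                     target: NextTokenFn, k: int = 4) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens, one cheap step at a time.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model gives its own choice at every position.
    #    Simulated here with k + 1 calls; a real transformer would
    #    produce all of these in a single forward pass.
    target_choices = [target(context + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which draft and target agree.
    accepted: List[Token] = []
    for i in range(k):
        if proposed[i] != target_choices[i]:
            break  # reject this token and everything after it
        accepted.append(proposed[i])

    # 4. Fall back to the target's token at the first mismatch
    #    (or take its extra token if every draft token was accepted).
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(prompt: List[Token], n_tokens: int, draft: NextTokenFn,
             target: NextTokenFn, k: int = 4) -> List[Token]:
    """Generate n_tokens after the prompt using speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + n_tokens]


def generate_baseline(prompt: List[Token], n_tokens: int,
                      target: NextTokenFn) -> List[Token]:
    """Plain autoregressive decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out
```

With greedy matching, the speculative output is identical to what the target model would produce on its own; the draft model only changes how many target passes are needed. Systems that sample rather than decode greedily replace the exact-match check in step 3 with a probabilistic acceptance rule, which this sketch does not cover.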