Inference

What is Speculative Decoding?

Explains speculative decoding, which pairs a small draft model with a large target model to accelerate LLM inference without changing the target model's output distribution.

Distinguishes RLVR as training-time weight updates from inference-time agent verification loops.

Explains that RL for LLMs is a training and alignment stage, not part of inference, and shows where it sits in the overall pipeline.

Explains prefix caching, which reuses attention key-value (KV) computations to speed up inference across requests that share a prefix.

Explains implicit vs explicit LLM prompt caching, prefix constraints, provider support, and when to use each.

Explains why LLM prompt caching applies to reusable input-token prefill, not sequential output decoding.