Transformers

Autoregressive Image Generation

Explains autoregressive image generation as sequential visual-token prediction using Transformer-style next-token modeling.

Why LLM Caching Is Only for Input Tokens

Explains why LLM prompt caching applies to reusable input-token prefill, not sequential output decoding.

Attention in Machine Learning

Explains the attention mechanism in ML, covering Query/Key/Value, self-attention, multi-head, causal, and cross-attention, plus efficiency variants such as FlashAttention and GQA.

Mixture of Experts (MoE)

Overview of the MoE architecture, including routing, key components, variants, and trade-offs in machine learning models.