LLM Thinking Token Budgets

Token budget parameters for thinking LLMs usually cap how many internal reasoning tokens the model may spend before producing the visible answer. Common names by API/provider include:
max_tokens / max_output_tokens: caps generated output tokens, sometimes including hidden reasoning tokens, depending on the API.
reasoning_effort: a qualitative budget such as low, medium, or high; the API maps this to an internal reasoning-token allowance.
thinking_budget / budget_tokens: an explicit number of hidden reasoning tokens allowed, for models that expose thinking controls.
max_completion_tokens: in some APIs, caps reasoning tokens and final answer tokens together.
Why it matters: ...

May 25, 2026 · 1 min

LLM Prompt Cache Options Across Providers

A reference covering cache TTL options and other cache-control dimensions across major LLM providers as of May 2026.
TTL mechanics
Fixed-duration TTLs
Anthropic: 5-min (default) and 1-hour (extended). Cache writes cost 1.25× base input for the 5-min TTL and 2× for the 1-hour TTL. Cache reads cost ≈ 10% of base input. The TTL refreshes on each read (sliding window).
AWS Bedrock: 5-min default, 1-hour added Jan 2026 for Claude Sonnet 4.5, Haiku 4.5, Opus 4.5. Also refresh-on-read.
OpenRouter (Gemini path): 5-min TTL that does NOT update on read (fixed window). This is gateway-specific behavior worth checking when going through proxies.
Arbitrary / configurable TTL
Google Gemini explicit caching: No minimum or maximum bounds on TTL. Default 60 min. You can update the TTL on an existing cache and delete it early to stop billing. Billed as cached_tokens × storage_duration (per token-hour), not via a write-time premium.
Opaque / provider-managed retention
OpenAI: No exposed TTL. Baseline ~5–10 min of idle retention; off-peak can persist up to 1 hour. Extended prompt caching retains KV tensors 1–2h typical, up to 24h max.
DeepSeek, Grok, Moonshot, Groq, Kimi K2: Automatic, provider-managed, no exposed TTL.
Implicit vs explicit control
Implicit (zero-config): OpenAI, DeepSeek, Grok, Moonshot, Groq, Gemini implicit tier. The server decides what to cache when it detects a recurring prefix.
Explicit (marked / lifecycle-managed): Anthropic and Alibaba use inline cache_control: {"type": "ephemeral"} markers. Gemini explicit caching exposes full CRUD on cache objects via API (create, get, update, delete), so caches behave like first-class resources, similar to Valkey keys.
Cache breakpoints / layering
Anthropic supports up to 4 cache_control breakpoints in a single request. You can mix TTLs within one request, but longer-TTL blocks must appear before shorter-TTL blocks in the prompt structure (tools → system → messages order).
Practical use: 1-hour cache for the stable system prompt + tool defs, 5-min cache for mid-conversation context, paying the higher write premium only on the truly stable prefix. ...
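The write and read multipliers quoted for Anthropic in this excerpt can be turned into a quick break-even check. The sketch below is only an illustration: the function names are hypothetical, costs are expressed in "base input token" units rather than real prices, and it assumes every request after the first arrives within the TTL and hits the cache.

```python
# Multipliers relative to the base input-token price, as quoted in the excerpt.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
READ_MULTIPLIER = 0.10  # "≈ 10% of base input", so treat results as approximate


def cached_prefix_cost(prefix_tokens: int, requests: int, ttl: str = "5m") -> float:
    """Cost of a shared prefix with caching: one cache write, then cache reads.

    Assumes (hypothetically) that every request after the first is a cache hit.
    """
    if requests < 1:
        return 0.0
    write = prefix_tokens * WRITE_MULTIPLIER[ttl]
    reads = prefix_tokens * READ_MULTIPLIER * (requests - 1)
    return write + reads


def uncached_prefix_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of re-sending the same prefix at the full base price every time."""
    return float(prefix_tokens * requests)


if __name__ == "__main__":
    for ttl in ("5m", "1h"):
        for n in (1, 2, 3):
            print(ttl, n, cached_prefix_cost(1000, n, ttl), uncached_prefix_cost(1000, n))
```

Under these assumptions the 5-min write premium pays for itself from the second request onward, while the 1-hour premium needs a third request, which is consistent with reserving the 1-hour TTL for the truly stable prefix.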

May 21, 2026 · 4 min

LLM Prompt Caching: Implicit vs Explicit

Caching in LLM inference is about reusing the KV-cache computed from a prompt prefix so the model doesn’t re-process the same tokens on every request. The “implicit vs explicit” distinction is about who manages that cache.
Prompt Prefix: The Underlying Mechanism
“Prefix” means literally the starting tokens of the prompt: the bytes from position 0 onward, in order, that two requests have in common before they diverge. ...
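To make the "prefix" definition concrete, here is a minimal sketch that measures how many leading tokens two requests share. The function name and the toy token IDs are hypothetical; real providers match on their own tokenization and usually apply minimum lengths or block granularity that this sketch ignores.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the leading positions where both token sequences agree.

    Matching stops at the first mismatch: tokens that happen to agree later
    in the prompt do not count, because a prefix starts at position 0.
    """
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    request_1 = [101, 7, 7, 42, 9]   # toy token IDs
    request_2 = [101, 7, 7, 13, 9]   # diverges at position 3
    print(shared_prefix_len(request_1, request_2))
```

In this toy pair only the first three tokens are reusable; the matching final token does not help because the sequences diverged before it.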

May 21, 2026 · 3 min

Why LLM Caching Is Only for Input Tokens

Why prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs vs. outputs are computed, and what’s actually reusable across requests.
Inputs are processed in parallel; outputs are generated sequentially
When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass (the prefill phase). Those KV tensors are a deterministic function of the input, so they can be stashed and reused if the same prefix shows up again. ...

May 21, 2026 · 3 min

Model Drift

Model drift is the general phenomenon where a deployed model’s predictive performance degrades over time, even though nothing about the model itself has changed. The model is the same; the world it operates in isn’t.
Taxonomy
Drift is usually classified by what’s shifting in the underlying probability distributions.
Data drift (covariate shift)
The distribution of input features P(X) changes, but the relationship P(Y|X) stays the same. Example: a fraud detection model starts seeing a higher fraction of mobile-wallet payments. The inputs look different, but the rules for “is this fraud” haven’t changed. ...

May 21, 2026 · 4 min

Tool-DC Strategic Anchor Grouping — Web Search Example

This is a concrete example illustrating how the Strategic Anchor Grouping mechanism works in the Tool-DC framework. See also: notes/ml/tool-dc-framework.md.
Setup
Query: “search the web for recent AI news”
Tool library: 20 tools total
Retriever returns top 3: T_top = [Google Search, Bing Search, DuckDuckGo Search]
T_tail = 17 remaining tools (Calculator, Weather API, Wikipedia, Code Executor, etc.)
With K=3, Tool-DC creates 4 groups: ...

May 19, 2026 · 4 min

AgentFlow

AgentFlow: In-the-Flow Agentic System Optimization
Source: arXiv:2510.05592, ICLR 2026 Oral (Top 1.1%)
Authors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu (Stanford University, Texas A&M, UC San Diego, Lambda)
The Problem It Solves
Standard tool-augmented LLMs (like Search-R1 or ToRL) train a single monolithic policy that interleaves thinking and tool calls in one big context. This works acceptably on short tasks but scales poorly on long-horizon problems: the context grows, the reward signal is sparse (you only find out at the very end whether you succeeded), and the model generalizes weakly to new tool configurations. AgentFlow is built to fix all three of those. ...

May 19, 2026 · 3 min

Tool-DC Framework

Tool-DC Framework: Try, Check and Retry for Long-context Tool-Calling
Source: arXiv:2603.11495, accepted at ACL 2026
Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du (Wuhan University), Dacheng Tao (NTU)
The Core Problem
When you give an LLM access to a large library of tools (say 20, 50, or hundreds of APIs), performance degrades sharply. The paper shows that even going from fewer than 10 tools to 20 causes significant accuracy drops across all tested models, especially smaller ones. Two things go wrong: the sheer length of the context buries the signal, and semantically similar tools with slightly different argument schemas confuse the model when it’s trying to fill in the right parameters. ...

May 19, 2026 · 3 min

Top-K in RAG Search

In Retrieval-Augmented Generation (RAG), top-k is the number of most relevant document chunks the retriever returns from the vector store for a given query. The “k” is literally just a number: top-3, top-5, top-10, etc.
How it works
1. Embed the query into a vector.
2. Run a similarity search (cosine, dot product, etc.) against the indexed chunks.
3. The retriever ranks every chunk by similarity score.
4. Top-k says “give me the k highest-scoring ones”.
5. Those chunks get stuffed into the LLM’s context as grounding material before generation.
Choosing k: the tradeoff
Too low (k=1, 2): ...
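The retrieval steps described in this excerpt can be sketched in a few lines. This is an illustration only: it uses brute-force cosine similarity over an in-memory list rather than a real vector store, and the function names, toy vectors, and chunk texts are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int) -> List[str]:
    """Score every chunk against the query and return the k best texts."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; a real system would get these from an embedding model.
    index = [
        ("chunk about caching", [1.0, 0.0]),
        ("chunk about drift", [0.0, 1.0]),
        ("chunk about both", [1.0, 1.0]),
    ]
    print(top_k([1.0, 0.1], index, k=2))
```

Raising k here only lengthens the returned list; whether the extra chunks help or dilute the context is the tradeoff the post goes on to discuss.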

May 18, 2026 · 2 min

SWE-bench & SWE-bench Pro Explained

SWE-bench is a benchmark that tests whether an AI model can actually fix real GitHub issues from open-source Python repositories (such as Django, Flask, and scikit-learn). The model is given a repo and a bug report or feature request, and has to produce a code patch that makes the failing tests pass, without being told what to change. It’s considered one of the more meaningful coding benchmarks because it tests end-to-end software engineering ability: reading existing code, understanding context, making targeted changes, and not breaking other things. ...

May 16, 2026 · 2 min

LLM as Judge

LLM as Judge means using a language model to evaluate the outputs of another model (or itself) instead of relying on humans or rigid automated metrics like BLEU/ROUGE/exact-match. Give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.
Why it exists
For open-ended generation (chat answers, code explanations, summaries, agent traces), string-overlap metrics don’t capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on a lot of tasks, so they’re now the default evaluator in MT-Bench, Chatbot Arena, G-Eval, and most internal eval pipelines. ...

May 14, 2026 · 2 min

Fine-Tuning Techniques for LLMs

Fine-tuning techniques can be grouped along a few axes: what you optimize (full weights vs. small additions), what signal you train on (labels, instructions, preferences, rewards), and how the data is generated (human, synthetic, AI-judged).
Full Fine-Tuning (FFT)
Update every parameter in the model on a target dataset. Highest capacity, but expensive in memory and prone to catastrophic forgetting. Mostly reserved for smaller models or when you have lots of high-quality data and compute. ...

April 25, 2026 · 4 min

Deterministic Graders (for LLM / AI Evaluation)

Definition
A deterministic grader is an evaluation function that produces the same result every time for the same input: no randomness, no LLM-in-the-loop judgment. You check the model’s output against a fixed, code-based rule.
Concrete Examples
Exact string match: “Does the output equal Paris?”
Regex match: “Does the output contain a valid ISO date?”
Structured-output validation: “Does this parse as JSON and pass the schema?”
Code execution / unit tests: “Run the generated function against these test cases. Did they pass?”
Numeric tolerance: “Is the answer within 0.01 of the expected value?”
Set membership: “Is the classification label one of {positive, negative, neutral}?”
Contrast: Model-Graded / LLM-as-Judge
The opposite approach is a model-graded (or “LLM-as-judge”) evaluator, where you ask another model something like “Is this answer helpful and correct?” ...
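Several of the grader types listed in this excerpt fit in a few lines each. The sketch below is illustrative rather than a reference implementation: the function names are hypothetical, the regex only checks the YYYY-MM-DD shape (not whether the date is real), and the "schema" check is simplified to a required-keys test instead of full JSON Schema validation.

```python
import json
import re

# Matches the YYYY-MM-DD shape only; it does not reject dates like 2026-13-40.
ISO_DATE_SHAPE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def exact_match(output: str, expected: str) -> bool:
    """Exact string match after trimming surrounding whitespace."""
    return output.strip() == expected


def contains_iso_date(output: str) -> bool:
    """Regex match: does the output contain something shaped like an ISO date?"""
    return ISO_DATE_SHAPE.search(output) is not None


def is_json_with_keys(output: str, required_keys) -> bool:
    """Structured-output check: parses as a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys) <= data.keys()


def within_tolerance(output: str, expected: float, tol: float = 0.01) -> bool:
    """Numeric tolerance: is the parsed number within tol of the expected value?"""
    try:
        return abs(float(output) - expected) <= tol
    except ValueError:
        return False


def in_label_set(output: str,
                 labels=frozenset({"positive", "negative", "neutral"})) -> bool:
    """Set membership on a normalized (trimmed, lowercased) label."""
    return output.strip().lower() in labels


if __name__ == "__main__":
    print(exact_match(" Paris\n", "Paris"))
    print(contains_iso_date("Due on 2026-04-24."))
    print(is_json_with_keys('{"label": "positive"}', ["label"]))
    print(within_tolerance("3.145", 3.14))
    print(in_label_set("Positive"))
```

Each function is a pure function of its inputs, which is what makes the grading repeatable: the same output always receives the same verdict.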

April 24, 2026 · 2 min

Chain of Thought (CoT)

Chain of Thought is a prompting technique where an AI model is guided (or learns) to reason through a problem step by step before arriving at a final answer, rather than jumping straight to the conclusion. The core idea is that breaking down complex reasoning into intermediate steps leads to more accurate and reliable outputs, much like how a person might work through a math problem by showing their work. ...

April 23, 2026 · 2 min

Multi-Turn Conversation in AI

Multi-turn conversation in AI refers to a dialogue system where a model maintains context across multiple exchanges, rather than treating each message as an isolated input.
Single-Turn vs Multi-Turn
In a single-turn interaction, the model sees one prompt and produces one response, with no memory of anything before or after. In a multi-turn interaction, the model receives the full conversation history (all prior messages) with each new request, allowing it to: ...

April 21, 2026 · 2 min