Anthropology

Anthropology is the study of humans: where we came from, how we live, how our bodies and cultures vary, and how human societies change over time. A simple way to define it: Anthropology is the holistic study of humanity across time, place, biology, culture, language, and material life. “Holistic” means anthropology tries to understand humans as whole beings rather than looking at only one part of life. For example, an anthropologist studying food might ask not only what people eat, but also how food relates to family, religion, economics, farming, migration, health, identity, and history. ...

May 27, 2026 · 5 min

Commit Intent in AI Harness Engineering

Commit intent is the discipline of having an agent explicitly declare what it is about to do, and why, immediately before it actually invokes a tool — separating the decision from the execution as two distinct steps in the harness. Concretely, before a tool call goes out, the agent emits a short, structured statement: the action being taken, the target, the expected outcome, and often the reasoning that justifies it. Only after that intent is committed does the harness fire the actual tool call. This sounds redundant — the tool call itself already encodes “what” — but it solves several real problems in agentic systems. ...

May 25, 2026 · 3 min
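
As a rough illustration of the commit-intent entry above, the sketch below separates the declaration step from the execution step. The names `Intent`, `Harness`, `commit`, and `execute` are hypothetical and not taken from any particular framework; the only rule being shown is that a tool call fires only for an intent that was committed first.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Intent:
    """Structured statement emitted before a tool call (hypothetical shape)."""
    action: str            # name of the tool about to be invoked
    target: str            # what the action operates on
    expected_outcome: str  # what the agent expects to get back
    reasoning: str         # why the agent is taking this step


class Harness:
    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self.tools = tools
        self.log: List[Intent] = []  # committed intents, in order

    def commit(self, intent: Intent) -> None:
        # Reject intents that name an unknown tool or omit key fields.
        if intent.action not in self.tools:
            raise ValueError(f"unknown tool: {intent.action}")
        if not (intent.target and intent.expected_outcome):
            raise ValueError("intent must state a target and an expected outcome")
        self.log.append(intent)

    def execute(self, intent: Intent, **kwargs: Any) -> Any:
        # The tool call only fires for an intent that was committed first.
        if intent not in self.log:
            raise RuntimeError("tool call attempted without a committed intent")
        return self.tools[intent.action](**kwargs)
```

A caller would build an `Intent`, pass it to `commit`, and only then call `execute`; the committed log doubles as an audit trail of what the agent said it was about to do.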

Sub-Agent vs Tool-Agent in AI Harness Engineering

A sub-agent is another agentic process delegated a goal. It has its own prompt/context, can reason over steps, may call tools, and returns a synthesized result or handoff. Use it when the work benefits from independent judgment. Example: Investigate why the auth tests are flaky and report root cause plus fix options. A tool-agent is a tool-shaped interface that may internally use agentic behavior, but from the harness perspective it is invoked like a tool: bounded input, bounded output, narrower contract. Use it when you want a capability, not an independent collaborator. ...

May 25, 2026 · 2 min

LLM Thinking Token Budgets

Token budget parameters for thinking LLMs usually cap how many internal reasoning tokens the model may spend before producing the visible answer. Common names by API/provider include:
- max_tokens / max_output_tokens: caps generated output tokens, sometimes including hidden reasoning tokens depending on the API.
- reasoning_effort: a qualitative budget such as low, medium, or high; the API maps this to an internal reasoning-token allowance.
- thinking_budget / budget_tokens: an explicit number of hidden reasoning tokens allowed, for models that expose thinking controls.
- max_completion_tokens: in some APIs, caps reasoning tokens and final answer tokens together.
Why it matters: ...

May 25, 2026 · 1 min

LLM Prompt Cache Options Across Providers

A reference covering cache TTL options and other cache-control dimensions across major LLM providers as of May 2026.
TTL mechanics
Fixed-duration TTLs
- Anthropic: 5-min (default) and 1-hour (extended). Cache writes cost 1.25× base input for the 5-min TTL and 2× for the 1-hour TTL. Cache reads cost ≈ 10% of base input. The TTL refreshes on each read (sliding window).
- AWS Bedrock: 5-min default; 1-hour added Jan 2026 for Claude Sonnet 4.5, Haiku 4.5, and Opus 4.5. Also refresh-on-read.
- OpenRouter (Gemini path): 5-min TTL that does NOT update on read (fixed window). This is gateway-specific behavior worth checking when going through proxies.
Arbitrary / configurable TTL
- Google Gemini explicit caching: no minimum or maximum bounds on TTL; default 60 min. You can update the TTL on an existing cache and delete it early to stop billing. Billed as cached_tokens × storage_duration (per token-hour), not via a write-time premium.
Opaque / provider-managed retention
- OpenAI: no exposed TTL. Baseline ~5–10 min of idle retention; off-peak can persist up to 1 hour. Extended prompt caching retains KV tensors 1–2h typical, up to 24h max.
- DeepSeek, Grok, Moonshot, Groq, Kimi K2: automatic, provider-managed, no exposed TTL.
Implicit vs explicit control
- Implicit (zero-config): OpenAI, DeepSeek, Grok, Moonshot, Groq, Gemini implicit tier. The server decides what to cache when it detects a recurring prefix.
- Explicit (marked / lifecycle-managed): Anthropic and Alibaba use inline cache_control: {"type": "ephemeral"} markers. Gemini explicit caching exposes full CRUD on cache objects via API (create, get, update, delete), so caches behave like first-class resources, similar to Valkey keys.
Cache breakpoints / layering
Anthropic supports up to 4 cache_control breakpoints in a single request. You can mix TTLs within one request, but longer-TTL blocks must appear before shorter-TTL blocks in the prompt structure (tools → system → messages order). Practical use: 1-hour cache for the stable system prompt + tool defs, 5-min cache for mid-conversation context, paying the higher write premium only on the truly stable prefix. ...

May 21, 2026 · 4 min
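
To make the write-premium tradeoff in the entry above concrete, here is a small arithmetic sketch. It uses only the multipliers quoted in the entry (1.25× and 2× for writes, roughly 10% for reads) and assumes, as a simplification, that every request after the first hits the cache. The function names and the unit `base_price` are hypothetical.

```python
# Multipliers as quoted in the entry above for Anthropic-style fixed TTLs.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
READ_MULTIPLIER = 0.10  # "≈ 10% of base input", treated as exact here


def cached_cost(prefix_tokens: int, requests: int, ttl: str = "5m",
                base_price: float = 1.0) -> float:
    """Cost of sending the same prefix `requests` times: one cache write,
    then cache reads. Assumes every later request is a cache hit."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    write = prefix_tokens * base_price * WRITE_MULTIPLIER[ttl]
    reads = (requests - 1) * prefix_tokens * base_price * READ_MULTIPLIER
    return write + reads


def uncached_cost(prefix_tokens: int, requests: int,
                  base_price: float = 1.0) -> float:
    """Cost of sending the same prefix `requests` times with no caching."""
    return prefix_tokens * requests * base_price
```

Under these assumptions the 5-min tier already pays for itself on the second request, while the 1-hour tier needs a third request to beat the uncached cost.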

LLM Prompt Caching: Implicit vs Explicit

Caching in LLM inference is about reusing the KV-cache computed from a prompt prefix so the model doesn’t re-process the same tokens on every request. The “implicit vs explicit” distinction is about who manages that cache.
Prompt Prefix: The Underlying Mechanism
“Prefix” means literally the starting tokens of the prompt: the tokens from position 0 onward, in order, that two requests have in common before they diverge. When a transformer processes a prompt, it computes attention keys and values for each token. The KV state for token N depends on every token before it. So if request A is [system prompt][doc X][question 1] and request B is [system prompt][doc X][question 2], the KV state for [system prompt][doc X] is identical in both, and the model can skip recomputing it and pick up at the divergence point. ...

May 21, 2026 · 3 min
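
The prefix idea in the entry above can be sketched in a few lines. This is only an illustration: the lists of strings stand in for token IDs from a real tokenizer, and `shared_prefix_len` is a hypothetical helper, not part of any provider's API.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence, b: Sequence) -> int:
    """Number of leading tokens two requests have in common before they diverge."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-ins for tokenized requests A and B from the entry above.
request_a = ["<system prompt>", "<doc X>", "<question 1>"]
request_b = ["<system prompt>", "<doc X>", "<question 2>"]

reusable = shared_prefix_len(request_a, request_b)  # KV state reusable for this many tokens
```

Only the first `reusable` positions can be served from cache; everything from the divergence point onward has to be computed fresh.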

Vectors vs Tensors

Short answer: related but not identical. A vector is a special case of a tensor.
The math hierarchy
Term    Rank  Shape example
Scalar  0     a single number
Vector  1     [d] (a 1D array)
Matrix  2     [m, n] (a 2D array)
Tensor  N     [d1, d2, ..., dN] (generic N-dimensional array)
Every vector is a tensor (specifically, a rank-1 tensor). Not every tensor is a vector.
Why the terminology blurs
In deep learning frameworks (PyTorch, JAX, TensorFlow), everything is called a “tensor” by convention, even scalars and vectors, because that’s the underlying data type the framework operates on. That’s a major reason the words get used interchangeably in ML writing. ...

May 21, 2026 · 2 min
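
The rank hierarchy in the entry above can be checked with plain nested lists. This is a minimal sketch that assumes rectangular (non-ragged) nesting; `shape` and `rank` are hypothetical helpers, not framework functions.

```python
def shape(x) -> tuple:
    """Shape of a rectangular nested list; a bare number has shape ()."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0] if x else None  # descend into the first element
    return tuple(dims)


def rank(x) -> int:
    """Rank is the number of dimensions: 0 scalar, 1 vector, 2 matrix, N tensor."""
    return len(shape(x))


scalar = 3.0
vector = [1.0, 2.0, 3.0]
matrix = [[1, 2], [3, 4], [5, 6]]
tensor3 = [[[0] * 4 for _ in range(3)] for _ in range(2)]
```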

Why LLM Caching Is Only for Input Tokens

This note explains why prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs vs. outputs are computed, and what’s actually reusable across requests.
Inputs are processed in parallel; outputs are generated sequentially
When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass, the prefill phase. Those KV tensors are a deterministic function of the input, so they can be stashed and reused if the same prefix shows up again. ...

May 21, 2026 · 3 min

Model Drift

Model drift is the general phenomenon where a deployed model’s predictive performance degrades over time, even though nothing about the model itself has changed. The model is the same; the world it operates in isn’t.
Taxonomy
Drift is usually classified by what’s shifting in the underlying probability distributions.
Data drift (covariate shift)
The distribution of input features P(X) changes, but the relationship P(Y|X) stays the same. Example: a fraud detection model starts seeing a higher fraction of mobile-wallet payments. The inputs look different, but the rules for “is this fraud” haven’t changed. ...

May 21, 2026 · 4 min

PPO — Proximal Policy Optimization

PPO is a reinforcement learning algorithm from OpenAI (Schulman et al., 2017) that became the default workhorse for RLHF; it’s what trained InstructGPT and the original ChatGPT.
Core Idea
Policy gradient methods are unstable because a single large update can collapse the policy. PPO fixes this by staying close to the previous policy on each update (the “proximal” part). It does this with a clipped surrogate objective:
$$L = \min\big(r(\theta)\,A,\ \operatorname{clip}(r(\theta),\,1-\epsilon,\,1+\epsilon)\,A\big)$$
Where: ...

May 19, 2026 · 2 min
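
The clipped surrogate objective from the PPO entry above can be evaluated for a single sample as follows. This is a sketch under stated assumptions: `ratio` stands for r(θ), the probability ratio between the new and old policy, `advantage` stands for A, and the default `eps=0.2` is a commonly used value rather than something the entry specifies.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, pushing the ratio above 1 + eps earns no extra objective, so there is no incentive to move further from the old policy. With a negative advantage, the `min` keeps the more pessimistic of the two terms, so a large ratio is still fully penalized.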

GRPO — Group Relative Policy Optimization

GRPO is a reinforcement learning algorithm introduced by DeepSeek (DeepSeekMath, later DeepSeek-R1) as a more efficient alternative to PPO for fine-tuning LLMs with RL.
Core Idea
PPO needs a separate value model (critic) of comparable size to the policy to estimate the baseline for advantage calculation. That doubles memory and compute. GRPO ditches the critic entirely. Instead, for each prompt it samples a group of G outputs from the current policy, scores each with the reward model, and uses the group’s mean and standard deviation as the baseline: ...

May 19, 2026 · 2 min
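
The group baseline described in the GRPO entry above can be sketched as a normalization over one prompt's sampled rewards. Two choices here are assumptions, not taken from the entry: the population standard deviation is used, and a small `eps` is added to avoid dividing by zero when all rewards in the group are equal.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled output relative to its own group:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the G samples
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for G = 4 outputs sampled for the same prompt (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Outputs that beat the group average get a positive advantage and the rest get a negative one, with no critic network involved.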

Tool-DC Strategic Anchor Grouping — Web Search Example

This is a concrete example illustrating how the Strategic Anchor Grouping mechanism works in the Tool-DC framework. See also: notes/ml/tool-dc-framework.md.
Setup
Query: “search the web for recent AI news”
Tool library: 20 tools total
Retriever returns top 3: T_top = [Google Search, Bing Search, DuckDuckGo Search]
T_tail = 17 remaining tools (Calculator, Weather API, Wikipedia, Code Executor, etc.)
With K=3, Tool-DC creates 4 groups:
S₀: Full top-K group (kept as baseline)
[Google Search, Bing Search, DuckDuckGo Search]
This is the problematic group. All three tools do essentially the same thing (search the web) but have slightly different argument schemas: ...

May 19, 2026 · 4 min

AgentFlow

Source: arXiv:2510.05592, ICLR 2026 Oral (Top 1.1%)
Authors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu (Stanford University, Texas A&M, UC San Diego, Lambda)
The Problem It Solves
Standard tool-augmented LLMs (like Search-R1 or ToRL) train a single monolithic policy that interleaves thinking and tool calls in one big context. This works acceptably on short tasks but scales poorly on long-horizon problems: the context grows, the reward signal is sparse (you only find out at the very end whether you succeeded), and the model generalizes weakly to new tool configurations. AgentFlow is built to fix all three of those. ...

May 19, 2026 · 3 min

Tool-DC Framework

Source: arXiv:2603.11495, accepted at ACL 2026
Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du (Wuhan University), Dacheng Tao (NTU)
The Core Problem
When you give an LLM access to a large library of tools (say 20, 50, or hundreds of APIs), performance degrades sharply. The paper shows that even going from fewer than 10 tools to 20 causes significant accuracy drops across all tested models, especially smaller ones. Two things go wrong: the sheer length of the context buries the signal, and semantically similar tools with slightly different argument schemas confuse the model when it’s trying to fill in the right parameters. ...

May 19, 2026 · 3 min

Top-K in RAG Search

In Retrieval-Augmented Generation (RAG), top-k is the number of most relevant document chunks the retriever returns from the vector store for a given query. The “k” is literally just a number: top-3, top-5, top-10, etc.
How it works
1. Embed the query into a vector.
2. Run a similarity search (cosine, dot product, etc.) against the indexed chunks.
3. The retriever ranks every chunk by similarity score.
4. Top-k says “give me the k highest-scoring ones.”
5. Those chunks get stuffed into the LLM’s context as grounding material before generation.
Choosing k: the tradeoff
Too low (k=1, 2): ...

May 18, 2026 · 2 min
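
The retrieval steps in the top-k entry above can be sketched with a brute-force search. This is a toy illustration under stated assumptions: the embeddings are hand-written two-dimensional vectors rather than the output of an embedding model, cosine similarity is the scoring function, and `top_k` is a hypothetical helper rather than a vector-store API.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the k highest-scoring ones."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (chunk text, embedding) pairs with made-up vectors.
index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
hits = top_k([1.0, 0.0], index, k=2)
```

A production retriever would typically replace the full scan with an approximate nearest-neighbor index, but the ranking-then-truncating logic is the same.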

Why Skirts Became Feminine and Trousers Masculine

The short answer is: skirts are not an inherently “female” form of clothing, and trousers are not an inherently “male” one. Across history, many men wore non-bifurcated garments — garments that do not pass between the legs — such as tunics, robes, kilts, togas, kaftans, sarongs, and long shirts. The strong Western association of women = skirts/dresses and men = trousers developed gradually from a mix of practical needs, class signals, gender norms, modesty rules, and later industrial-era fashion conventions. ...

May 17, 2026 · 7 min

Attention in Machine Learning

Attention is a mechanism that lets a model dynamically decide which parts of the input matter most when producing each piece of output. Instead of compressing everything into one fixed representation, the model computes a weighted combination of inputs where the weights are learned and depend on context.
Intuition
When translating “the cat sat on the mat” to French, generating the word for “cat” should mostly pay attention to “cat” in the source, not “mat” or “on.” Attention makes this routing explicit and differentiable. ...

May 17, 2026 · 3 min
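
The "weighted combination of inputs" in the attention entry above can be sketched for a single query. The scoring rule used here, scaled dot-product followed by softmax, is one common choice and an assumption on top of the entry, which describes attention in general terms. In a trained model the query, key, and value vectors come from learned projections; here they are made-up numbers.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float],
           keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted combination of the values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# The query lines up with the first key, so the first value should dominate.
weights, output = attend([10.0, 0.0],
                         keys=[[1.0, 0.0], [0.0, 1.0]],
                         values=[[1.0, 2.0], [3.0, 4.0]])
```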

Molecular Dating of Clothing Origins via Body Louse Evolution

Authors: Ralf Kittler, Manfred Kayser, Mark Stoneking (Max Planck Institute for Evolutionary Anthropology)
Journal: Current Biology, Vol. 13, Issue 16, pp. 1414–1417 (19 August 2003)
DOI: 10.1016/S0960-9822(03)00507-4 · PMID: 12932325
The Question
When did humans start wearing clothing regularly? Clothing leaves almost no archaeological trace, so the date has long been speculative. The authors use an unusual proxy: the human body louse.
The Key Insight
Two forms of Pediculus humanus parasitize humans: ...

May 17, 2026 · 2 min

Paleolithic Eyed Needles and the Evolution of Dress

Authors: Ian Gilligan, Francesco d’Errico, Luc Doyon, Wei Wang, Yaroslav V. Kuzmin
Published: Science Advances 10, eadp2887, 28 June 2024
Type: Review (Anthropology)
TL;DR
Eyed needles weren’t invented to tailor clothes; bone awls already did that. Their arrival ~40,000 years ago signals something bigger: the rise of layered garments (including underwear) and the shift from decorating skin to decorating clothing, transforming clothes from physical necessity into social dress. ...

May 17, 2026 · 2 min

MCP Interaction Model

Components (official MCP nomenclature)
- Host: The user-facing application that embeds the LLM and enforces policy (Claude Desktop, Claude Code, an IDE plugin, etc.). It owns the user, the model, and the trust boundary.
- Client: A protocol connector that lives inside the Host. One Client per Server, holding a 1:1 stateful session. The Host spawns Clients as needed.
- Server: The process that exposes capabilities (tools, resources, prompts) over the MCP protocol. Can be local (stdio transport) or remote (Streamable HTTP transport).
- Authorization Server (AS): For remote Servers, the OAuth 2.1 issuer of access tokens. May be the Server itself or a separate identity provider.
- Resource Server (RS): The OAuth role played by the remote MCP Server when it validates bearer tokens on incoming requests.
- User: The human who approves connections, consents to tool calls, and answers elicitations.
- LLM: Not technically an MCP component, but the reasoning engine the Host drives; it never talks to a Server directly.
Phase 1: Transport & connection
- Host → Client: The Host launches a Client configured for a specific Server (command + args for stdio, or URL for HTTP).
- Client ↔ Server: Transport established.
  - stdio: The Host spawns the Server as a subprocess; JSON-RPC over stdin/stdout.
  - Streamable HTTP: The Client opens an HTTPS connection; bidirectional via POST + SSE stream.
Phase 2: Authorization (remote Servers only)
MCP uses OAuth 2.1 + PKCE, with Resource Indicators (RFC 8707) and Dynamic Client Registration (RFC 7591). ...

May 16, 2026 · 4 min
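
For the stdio transport mentioned in the MCP entry above, the wire format can be sketched as one JSON-RPC 2.0 message per line. This is a minimal framing sketch, not an MCP client: `encode_request` and `decode_line` are hypothetical helpers, no subprocess is spawned, and the `tools/list` method is used only as an example request.

```python
import json
from typing import Any, Dict, Optional


def encode_request(request_id: int, method: str,
                   params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as a single newline-terminated line."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    return json.dumps(message, separators=(",", ":")) + "\n"


def decode_line(line: str) -> Dict[str, Any]:
    """Parse one line read from the peer and check the JSON-RPC version tag."""
    message = json.loads(line)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    return message


wire = encode_request(1, "tools/list")
```

In a real Client, `wire` would be written to the Server subprocess's stdin and each line read back from its stdout would go through something like `decode_line`.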