[{"content":"Anti-narration in Harness Engineering In AI harness engineering, “anti‑narration” means the harness is designed to prevent large language models (LLMs) from producing fluent but unverified stories: it enforces verification before accepting outputs, ensuring correctness over coherence. It’s not about stopping hallucinations directly, but about breaking the tendency of AI systems to narrate confidently without grounding.\n🔎 What “Anti‑Narration” Means Narration vs. Hallucination\nNarration: The structural tendency of LLMs to produce coherent, completed stories or answers. Hallucination: Fabricated or false information. Harness engineering focuses on narration because coherence can mask errors: a fluent answer may sound right but be wrong. moltbook.com Anti‑Narration Guardrails\nVerification steps are inserted between “this sounds right” and “this is right.” Examples: runtime validation, manifest checks, review gates. The harness forces outputs to be checked against trusted data before being accepted. moltbook.com 🛠 Harness Engineering Context Harness engineering is the discipline of building the control system around an AI model. It includes:\nGuides: Constraint files, system prompts, and rules that direct the agent. Sensors: Validation loops, drift detectors, and parsers that check outputs. Data Context Layer: Certified, lineage‑verified data pipelines feeding the model. Orchestration Logic: Sequences tasks, routes outputs, and enforces review gates. Atlan 📊 Comparison: Anti‑Narration vs Anti‑Hallucination Concept Focus Mechanism Outcome Anti‑Narration Preventing premature, fluent stories Verification before acceptance Stops “sounds right” answers from being trusted Anti‑Hallucination Preventing false facts Fact‑checking, retrieval augmentation Reduces fabricated details but doesn’t stop narrative drift ⚠️ Risks \u0026 Trade‑offs Risk of Overconfidence: LLMs optimize for coherence, not correctness. 
Without anti‑narration, they produce polished but wrong answers. Trade‑off in Speed: Verification slows down output, but ensures reliability. Permanent Scaffold: Harnesses must remain external to generation — correctness requires reference outside the model loop. moltbook.com ✅ Key Takeaway In harness engineering, anti‑narration is the structural safeguard: it doesn’t stop hallucinations directly, but it prevents the system from presenting unchecked narratives as truth. This makes AI agents more trustworthy in production environments, especially where data quality and validation loops are critical.\n","permalink":"https://knowledged.to/ai/concepts/anti-narration/","summary":"\u003ch1 id=\"anti-narration-in-harness-engineering\"\u003eAnti-narration in Harness Engineering\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eIn AI harness engineering, “anti‑narration” means the harness is designed to prevent large language models (LLMs) from producing fluent but unverified stories — it enforces verification before accepting outputs, ensuring correctness over coherence. It’s not about stopping hallucinations directly, but about breaking the tendency of AI systems to narrate confidently without grounding.\u003c/strong\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"-what-antinarration-means\"\u003e🔎 What “Anti‑Narration” Means\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eNarration vs. Hallucination\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cem\u003eNarration\u003c/em\u003e: The structural tendency of LLMs to produce coherent, completed stories or answers.\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eHallucination\u003c/em\u003e: Fabricated or false information.\u003c/li\u003e\n\u003cli\u003eHarness engineering focuses on narration because coherence can mask errors — a fluent answer may sound right but be wrong.  
\u003ca href=\"https://www.moltbook.com/post/b60b0b60-df89-4ea2-bef3-8fc64cc90c6b\"\u003emoltbook.com\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eAnti‑Narration Guardrails\u003c/strong\u003e\u003c/p\u003e","title":"Anti-Narration in Harness Engineering"},{"content":"Commit Intent in AI Harness Engineering Commit intent is the discipline of having an agent explicitly declare what it is about to do, and why, immediately before it actually invokes a tool — separating the decision from the execution as two distinct steps in the harness.\nConcretely, before a tool call goes out, the agent emits a short, structured statement: the action being taken, the target, the expected outcome, and often the reasoning that justifies it. Only after that intent is committed does the harness fire the actual tool call. This sounds redundant — the tool call itself already encodes \u0026ldquo;what\u0026rdquo; — but it solves several real problems in agentic systems.\nWhy it matters Drift prevention Without an intent step, models will sometimes reason their way to one conclusion and then call a tool that does something subtly different (wrong argument, wrong target, hallucinated parameter). Forcing the agent to articulate the intent first makes the model commit to a specific action in natural language, then anchors the actual tool call to that statement. Misalignments between intent and execution become detectable.\nGating and approval The intent is a natural interception point. The harness can pause here for human-in-the-loop approval on irreversible or expensive actions (deletes, payments, prod deploys, sending external messages) without needing to parse and re-validate tool arguments. Action-type taxonomies in skill systems (e.g. 
\u0026ldquo;prohibited / explicit-permission / regular\u0026rdquo;) are essentially built on this idea.\nObservability and replay Intents form an audit log of reasoning that\u0026rsquo;s far more useful than just \u0026ldquo;tool X was called with args Y.\u0026rdquo; When debugging a bad trajectory, you can see where the agent\u0026rsquo;s thinking went wrong, not just where the call went wrong.\nCoherence across multi-step plans When an agent commits intent for a sequence (\u0026ldquo;I will: first check the schedule, then update the deployment, then verify the rollout\u0026rdquo;), it tends to stay on plan instead of getting distracted by intermediate tool outputs.\nRelation to other patterns The pattern overlaps with — but isn\u0026rsquo;t quite the same as — ReAct\u0026rsquo;s \u0026ldquo;Thought\u0026rdquo; step or tool preambles:\nReAct mixes reasoning and action loosely; the \u0026ldquo;Thought\u0026rdquo; is freeform and not structurally enforced. Tool preambles are often descriptive (\u0026ldquo;I\u0026rsquo;m going to search for X\u0026rdquo;) and sometimes emitted alongside or after the call. Commit intent is the stricter discipline of making the pre-action declaration a structured, required artifact that the harness can inspect, gate on, and log independently from the tool call itself. Design implications When building a harness with commit intent:\nTreat the intent as a first-class artifact in the trace, not just freeform text inside a thinking block. Make gating policies key off intent metadata (action category, target, reversibility), not raw tool names. Allow the harness to reject an intent and force the agent to re-plan, rather than only being able to fail after a tool call has already happened. Log intents and tool calls as paired records so post-hoc analysis can detect intent/execution drift. 
","permalink":"https://knowledged.to/notes/ml/commit-intent/","summary":"\u003ch1 id=\"commit-intent-in-ai-harness-engineering\"\u003eCommit Intent in AI Harness Engineering\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eCommit intent\u003c/strong\u003e is the discipline of having an agent explicitly declare \u003cem\u003ewhat it is about to do, and why\u003c/em\u003e, immediately before it actually invokes a tool — separating the decision from the execution as two distinct steps in the harness.\u003c/p\u003e\n\u003cp\u003eConcretely, before a tool call goes out, the agent emits a short, structured statement: the action being taken, the target, the expected outcome, and often the reasoning that justifies it. Only after that intent is committed does the harness fire the actual tool call. This sounds redundant — the tool call itself already encodes \u0026ldquo;what\u0026rdquo; — but it solves several real problems in agentic systems.\u003c/p\u003e","title":"Commit Intent in AI Harness Engineering"},{"content":"Sub-Agent vs Tool-Agent in AI Harness Engineering A sub-agent is another agentic process delegated a goal. It has its own prompt/context, can reason over steps, may call tools, and returns a synthesized result or handoff. Use it when the work benefits from independent judgment.\nExample: Investigate why the auth tests are flaky and report root cause plus fix options.\nA tool-agent is a tool-shaped interface that may internally use agentic behavior, but from the harness perspective it is invoked like a tool: bounded input, bounded output, narrower contract. 
Use it when you want a capability, not an independent collaborator.\nExample: Run code search and summarize where SessionManager is used, or open a browser, click through login, and report the screenshot result.\nPractical distinction:\nAspect Sub-agent Tool-agent Harness role Delegated worker Callable capability Autonomy High Usually constrained Interface Goal/task prompt Tool schema or command-like call Context Often has its own working context Usually receives explicit inputs Output Judgment, plan, result, handoff Structured result, observation, action result Best for Parallel reasoning, review, implementation, investigation Browser control, repo search, test running, retrieval, diagnostics Mental model: a sub-agent is someone you assign work to; a tool-agent is something you operate through.\nThe line can blur. A browser tool-agent might plan clicks internally, and a sub-agent might be wrapped as a callable tool. The engineering distinction is the contract: sub-agents are goal-directed collaborators; tool-agents are capability endpoints with tighter I/O and lifecycle boundaries.\n","permalink":"https://knowledged.to/notes/ml/sub-agent-vs-tool-agent/","summary":"\u003ch1 id=\"sub-agent-vs-tool-agent-in-ai-harness-engineering\"\u003eSub-Agent vs Tool-Agent in AI Harness Engineering\u003c/h1\u003e\n\u003cp\u003eA sub-agent is another agentic process delegated a goal. It has its own prompt/context, can reason over steps, may call tools, and returns a synthesized result or handoff. Use it when the work benefits from independent judgment.\u003c/p\u003e\n\u003cp\u003eExample: Investigate why the auth tests are flaky and report root cause plus fix options.\u003c/p\u003e\n\u003cp\u003eA tool-agent is a tool-shaped interface that may internally use agentic behavior, but from the harness perspective it is invoked like a tool: bounded input, bounded output, narrower contract. 
Use it when you want a capability, not an independent collaborator.\u003c/p\u003e","title":"Sub-Agent vs Tool-Agent in AI Harness Engineering"},{"content":"Thinking Token Budget Token budget parameters for thinking LLMs usually cap how many internal reasoning tokens the model may spend before producing the visible answer.\nCommon names by API/provider include:\nmax_tokens / max_output_tokens: caps generated output tokens, sometimes including hidden reasoning tokens depending on the API. reasoning_effort: qualitative budget like low, medium, high; the API maps this to an internal reasoning-token allowance. thinking_budget / budget_tokens: explicit number of hidden reasoning tokens allowed for models that expose thinking controls. max_completion_tokens: in some APIs, caps both reasoning tokens and final answer tokens together. Why it matters:\nHigher budget: useful for hard math, coding, planning, and multi-step debugging. Lower budget: cheaper, faster, enough for simple Q\u0026A or formatting tasks. Too low: model may answer prematurely or miss steps. Too high: slower and more expensive, sometimes overthinks simple tasks. 
Mental model: total completion budget = hidden reasoning tokens + visible answer tokens\nIf the completion cap is tight, a thinking model may spend tokens reasoning and have too little room left for the final answer.\nExample qualitative setting: { \"model\": \"reasoning-model\", \"reasoning_effort\": \"medium\", \"max_output_tokens\": 1000 }\nExample explicit thinking budget, with the output cap set above the thinking budget so the answer still has room: { \"thinking\": { \"type\": \"enabled\", \"budget_tokens\": 2048 }, \"max_output_tokens\": 4096 }\nThe exact parameter name depends on the model provider and API.\n","permalink":"https://knowledged.to/notes/ml/llm-thinking-token-budgets/","summary":"\u003ch1 id=\"thinking-token-budget\"\u003eThinking Token Budget\u003c/h1\u003e\n\u003cp\u003eToken budget parameters for thinking LLMs usually cap how many internal reasoning tokens the model may spend before producing the visible answer.\u003c/p\u003e\n\u003cp\u003eCommon names by API/provider include:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003emax_tokens / max_output_tokens: caps generated output tokens, sometimes including hidden reasoning tokens depending on the API.\u003c/li\u003e\n\u003cli\u003ereasoning_effort: qualitative budget like low, medium, high; the API maps this to an internal reasoning-token allowance.\u003c/li\u003e\n\u003cli\u003ethinking_budget / budget_tokens: explicit number of hidden reasoning tokens allowed for models that expose thinking controls.\u003c/li\u003e\n\u003cli\u003emax_completion_tokens: in some APIs, caps both reasoning tokens and final answer tokens together.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWhy it matters:\u003c/p\u003e","title":"LLM Thinking Token Budgets"},{"content":"LLM Prompt Cache Options Across Providers A reference covering cache TTL options and other cache-control dimensions across major 
LLM providers as of May 2026.\nTTL mechanics Fixed-duration TTLs Anthropic: 5-min (default) and 1-hour (extended). Cache writes cost 1.25× base input for 5-min TTL, 2× for 1-hour. Cache reads ≈ 10% of base input. TTL refreshes on each read (sliding window). AWS Bedrock: 5-min default, 1-hour added Jan 2026 for Claude Sonnet 4.5, Haiku 4.5, Opus 4.5. Also refresh-on-read. OpenRouter (Gemini path): 5-min TTL that does NOT update on read (fixed window) — gateway-specific behavior worth checking when going through proxies. Arbitrary / configurable TTL Google Gemini explicit caching: No minimum or maximum bounds on TTL. Default 60 min. You can update TTL on an existing cache and delete it early to stop billing. Billed as cached_tokens × storage_duration (per token-hour), not via a write-time premium. Opaque / provider-managed retention OpenAI: No exposed TTL. Baseline ~5–10 min of idle retention; off-peak can persist up to 1 hour. Extended prompt caching retains KV tensors 1–2h typical, up to 24h max. DeepSeek, Grok, Moonshot, Groq, Kimi K2: Automatic, provider-managed, no exposed TTL. Implicit vs explicit control Implicit (zero-config): OpenAI, DeepSeek, Grok, Moonshot, Groq, Gemini implicit tier. Server decides what to cache when it detects a recurring prefix. Explicit (marked / lifecycle-managed): Anthropic and Alibaba use inline cache_control: {\u0026quot;type\u0026quot;: \u0026quot;ephemeral\u0026quot;} markers. Gemini explicit caching exposes full CRUD on cache objects via API (create, get, update, delete) — caches behave like first-class resources, similar to Valkey keys. Cache breakpoints / layering Anthropic supports up to 4 cache_control breakpoints in a single request. You can mix TTLs within one request, but longer TTL blocks must appear before shorter TTL blocks in the prompt structure (tools → system → messages order). 
Practical use: 1-hour cache for stable system prompt + tool defs, 5-min cache for mid-conversation context, paying the higher write premium only on the truly stable prefix.\nOpenAI caches in 128-token increments above a 1,024-token prefix floor. No user-controllable breakpoints.\nPricing model dimensions Provider Write cost Read cost Storage cost Anthropic (5-min) 1.25× base input ~10% of base none Anthropic (1-hour) 2× base input ~10% of base none OpenAI base input ~50% of base none DeepSeek base input ~10% of base none Gemini explicit base input ~10% on 2.5+, ~25% on 2.0 per cached-token-hour Gemini implicit base input 10% on cache hit none The storage-time model on Gemini explicit inverts the incentive: a long TTL on idle content costs money even if unused, whereas longer TTLs on Anthropic just mean a one-time higher write premium and want lots of reads to amortize.\nCache scope / isolation Anthropic: Workspace-level isolation (moved from org-level on Feb 5, 2026). Cache entries do not cross workspaces inside the same org. OpenAI: Server-affinity routing — cache hits depend on landing on the same machine that processed the prior request. No explicit user control; affects perceived hit rate at scale. Gemini: Project-scoped. Client-side warming strategies Keepalive pings: Tiny request every TTL−1 min hitting the cached prefix to reset the sliding window. Cheap on Anthropic since pings are cache reads (~10% of base input). Pre-warming on session start: Fire one cheap write request before the user\u0026rsquo;s first real interaction to hide the write latency. Cache-aware request routing: For server-affinity providers (OpenAI), sticky session IDs or consistent hashing on a proxy can improve hit rates. 
Self-hosted runtimes (vLLM, SGLang) Full menu when running inference yourself:\nLRU / LFU eviction policies, or custom Prefix-tree (radix) cache vs flat KV cache Configurable memory budgets per node Pinning specific prefixes Cross-request KV reuse across multi-turn conversations Manual invalidation Closest analog to managing your own Valkey cache layer.\nAdjacent: semantic caching (distinct from prompt/KV caching) Prompt/KV caching is exact-prefix match. Semantic caching keys on embedding similarity of the query and caches responses — fundamentally different layer. Tools: GPTCache, Redis/Valkey + embeddings, Portkey, Helicone. Trade-offs: false-positive cache hits, embedding compute cost, but big wins when traffic has high semantic overlap without exact prefix overlap. Often layered on top of provider-side KV caching.\nQuick decision notes Workload with stable system prompt, sporadic requests \u0026gt;5 min apart: Anthropic 1-hour TTL or keepalive pings on 5-min. Workload with very long static context, infrequent reuse over hours: Gemini explicit caching with a custom TTL sized to expected reuse window. High-volume API traffic with naturally repetitive prefixes: OpenAI or DeepSeek — zero-config wins. Heterogeneous prompts with semantic overlap but no exact prefix: build a semantic cache layer. Need deterministic eviction control: self-host vLLM/SGLang. ","permalink":"https://knowledged.to/notes/ml/llm-prompt-cache-provider-options/","summary":"\u003ch1 id=\"llm-prompt-cache-options-across-providers\"\u003eLLM Prompt Cache Options Across Providers\u003c/h1\u003e\n\u003cp\u003eA reference covering cache TTL options and other cache-control dimensions across major LLM providers as of May 2026.\u003c/p\u003e\n\u003ch2 id=\"ttl-mechanics\"\u003eTTL mechanics\u003c/h2\u003e\n\u003ch3 id=\"fixed-duration-ttls\"\u003eFixed-duration TTLs\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAnthropic\u003c/strong\u003e: 5-min (default) and 1-hour (extended). 
Cache writes cost 1.25× base input for 5-min TTL, 2× for 1-hour. Cache reads ≈ 10% of base input. TTL refreshes on each read (sliding window).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAWS Bedrock\u003c/strong\u003e: 5-min default, 1-hour added Jan 2026 for Claude Sonnet 4.5, Haiku 4.5, Opus 4.5. Also refresh-on-read.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOpenRouter (Gemini path)\u003c/strong\u003e: 5-min TTL that does NOT update on read (fixed window) — gateway-specific behavior worth checking when going through proxies.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"arbitrary--configurable-ttl\"\u003eArbitrary / configurable TTL\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eGoogle Gemini explicit caching\u003c/strong\u003e: No minimum or maximum bounds on TTL. Default 60 min. You can \u003ccode\u003eupdate\u003c/code\u003e TTL on an existing cache and \u003ccode\u003edelete\u003c/code\u003e it early to stop billing. Billed as \u003ccode\u003ecached_tokens × storage_duration\u003c/code\u003e (per token-hour), not via a write-time premium.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"opaque--provider-managed-retention\"\u003eOpaque / provider-managed retention\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eOpenAI\u003c/strong\u003e: No exposed TTL. Baseline ~5–10 min of idle retention; off-peak can persist up to 1 hour. Extended prompt caching retains KV tensors 1–2h typical, up to 24h max.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDeepSeek, Grok, Moonshot, Groq, Kimi K2\u003c/strong\u003e: Automatic, provider-managed, no exposed TTL.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"implicit-vs-explicit-control\"\u003eImplicit vs explicit control\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eImplicit (zero-config)\u003c/strong\u003e: OpenAI, DeepSeek, Grok, Moonshot, Groq, Gemini implicit tier. 
Server decides what to cache when it detects a recurring prefix.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eExplicit (marked / lifecycle-managed)\u003c/strong\u003e: Anthropic and Alibaba use inline \u003ccode\u003ecache_control: {\u0026quot;type\u0026quot;: \u0026quot;ephemeral\u0026quot;}\u003c/code\u003e markers. Gemini explicit caching exposes full CRUD on cache objects via API (\u003ccode\u003ecreate\u003c/code\u003e, \u003ccode\u003eget\u003c/code\u003e, \u003ccode\u003eupdate\u003c/code\u003e, \u003ccode\u003edelete\u003c/code\u003e) — caches behave like first-class resources, similar to Valkey keys.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"cache-breakpoints--layering\"\u003eCache breakpoints / layering\u003c/h2\u003e\n\u003cp\u003eAnthropic supports up to 4 \u003ccode\u003ecache_control\u003c/code\u003e breakpoints in a single request. You can mix TTLs within one request, but \u003cstrong\u003elonger TTL blocks must appear before shorter TTL blocks\u003c/strong\u003e in the prompt structure (tools → system → messages order). Practical use: 1-hour cache for stable system prompt + tool defs, 5-min cache for mid-conversation context, paying the higher write premium only on the truly stable prefix.\u003c/p\u003e","title":"LLM Prompt Cache Options Across Providers"},{"content":"LLM Prompt Caching: Implicit vs Explicit Caching in LLM inference is about reusing the KV-cache computed from a prompt prefix so the model doesn\u0026rsquo;t re-process the same tokens on every request. The \u0026ldquo;implicit vs explicit\u0026rdquo; distinction is about who manages that cache.\nPrompt Prefix: The Underlying Mechanism \u0026ldquo;Prefix\u0026rdquo; means literally the starting tokens of the prompt — the bytes from position 0 onward, in order, that two requests have in common before they diverge.\nWhen a transformer processes a prompt, it computes attention keys and values for each token. The KV state for token N depends on every token before it. 
So if request A is [system prompt][doc X][question 1] and request B is [system prompt][doc X][question 2], the KV state for [system prompt][doc X] is identical in both — the model can skip recomputing it and pick up at the divergence point.\nKey constraint: it has to be a true prefix — byte-identical, from token zero. If request B differs by even one token (a different system prompt, an extra space, a swapped ordering), the cache is invalidated from that point onward, because every downstream token\u0026rsquo;s attention now depends on different upstream state. You cannot cache the middle or end of a prompt while changing the beginning.\nPrompt ordering for cacheability Structure prompts so volatile content comes last:\nSystem prompt (stable across all calls) Tool/function definitions (stable per app version) Large static context — RAG documents, knowledge base chunks (stable per session) Conversation history (grows but is append-only, so prior turns stay prefix-stable) The new user message (the only volatile part) Flip that order — put the user message first — and the cache busts on every turn.\nImplicit Caching Automatic. The provider\u0026rsquo;s inference layer detects when a new request shares a long prefix with a recent one and silently reuses the cached state. No code changes — just send the same system prompt or document context at the start of each request, and if the platform\u0026rsquo;s heuristics fire (usually requiring a minimum prefix length and a short time since the last hit), you get a discount on the cached tokens.\nProviders: Gemini, OpenAI, and DeepSeek all do versions of this.\nPros: Zero integration effort.\nCons: Best-effort. 
No TTL you control, no guarantee a given request will hit, and cold-start requests pay full price.\nExplicit Caching You tell the provider: “store this exact context and bill me for it until it expires.” With Gemini, subsequent requests reference a returned handle instead of resending the content; with Anthropic, requests resend the prompt with cache markers and the provider matches the marked prefix.\nProviders:\nAnthropic: cache_control: {\"type\": \"ephemeral\"} markers on content blocks Gemini: CachedContent API Pros: Predictable. Within the TTL, a request with an identical prefix should hit. You pay a write premium (Anthropic) or a storage fee (Gemini), but cached reads are far cheaper than re-processing the same tokens.\nCons: Requires integration code; you pay the write or storage cost and must manage cache lifecycle (creation, TTL, invalidation).\nWhen to Use Which Implicit wins for workloads with naturally repeating prefixes and bursty traffic where you can’t easily reason about cache lifetime: chatbots, IDE autocomplete, anything where a system prompt is shared across many short-lived calls. Explicit wins when you have a large, stable context (long document, big tool/schema block, knowledge base chunk) queried many times over a known window. Trade integration code and cache cost for predictable savings. In practice teams often layer them: explicit caches for the heavy stable stuff (system prompt + tool definitions + large RAG context), implicit caching catching the rest opportunistically.\n","permalink":"https://knowledged.to/notes/ml/llm-prompt-caching-implicit-vs-explicit/","summary":"\u003ch1 id=\"llm-prompt-caching-implicit-vs-explicit\"\u003eLLM Prompt Caching: Implicit vs Explicit\u003c/h1\u003e\n\u003cp\u003eCaching in LLM inference is about reusing the \u003cstrong\u003eKV-cache\u003c/strong\u003e computed from a prompt prefix so the model doesn\u0026rsquo;t re-process the same tokens on every request. 
The \u0026ldquo;implicit vs explicit\u0026rdquo; distinction is about \u003cem\u003ewho manages that cache\u003c/em\u003e.\u003c/p\u003e\n\u003ch2 id=\"prompt-prefix-the-underlying-mechanism\"\u003ePrompt Prefix: The Underlying Mechanism\u003c/h2\u003e\n\u003cp\u003e\u0026ldquo;Prefix\u0026rdquo; means literally the starting tokens of the prompt — the bytes from position 0 onward, in order, that two requests have in common before they diverge.\u003c/p\u003e","title":"LLM Prompt Caching: Implicit vs Explicit"},{"content":"Vectors vs Tensors — Are They the Same? Short answer: related but not identical. A vector is a special case of a tensor.\nThe math hierarchy Term Rank Shape example Scalar 0 a single number Vector 1 [d] — a 1D array Matrix 2 [m, n] — a 2D array Tensor N [d1, d2, ..., dN] — generic N-dimensional array Every vector is a tensor (specifically, a rank-1 tensor). Not every tensor is a vector.\nWhy the terminology blurs In deep learning frameworks (PyTorch, JAX, TensorFlow), everything is called a \u0026ldquo;tensor\u0026rdquo; by convention — even scalars and vectors — because that\u0026rsquo;s the underlying data type the framework operates on. That\u0026rsquo;s a major reason the words get used interchangeably in ML writing.\nIn the KV cache context specifically Both terms apply at different zoom levels:\nPer-token, per-head, per-layer: the key and value really are vectors — typically 64 or 128 numbers each (the head dimension). That\u0026rsquo;s why a lot of content reaches for \u0026ldquo;vector\u0026rdquo; — it\u0026rsquo;s accurate at that grain and more intuitive. The full cached structure: has shape roughly [batch, num_layers, num_heads, sequence_length, head_dim]. That\u0026rsquo;s a rank-5 tensor. When you \u0026ldquo;load the cache,\u0026rdquo; you\u0026rsquo;re loading the whole multi-dimensional block, not a single vector — so \u0026ldquo;tensor\u0026rdquo; is more precise for the stored artifact. 
Analogy Saying \u0026ldquo;loading vectors\u0026rdquo; is like describing a spreadsheet as \u0026ldquo;loading numbers.\u0026rdquo; Technically true (every cell is a number), but it understates the structure. Saying \u0026ldquo;loading tensors\u0026rdquo; is like saying \u0026ldquo;loading the spreadsheet\u0026rdquo; — captures the actual shape of the thing. TL;DR Per-token K and V → vectors The full KV cache block → tensor In framework code → everything is called a tensor regardless of rank In casual technical writing → often used interchangeably, and that\u0026rsquo;s usually fine ","permalink":"https://knowledged.to/notes/ml/vectors-vs-tensors/","summary":"\u003ch1 id=\"vectors-vs-tensors--are-they-the-same\"\u003eVectors vs Tensors — Are They the Same?\u003c/h1\u003e\n\u003cp\u003eShort answer: related but not identical. A vector is a special case of a tensor.\u003c/p\u003e\n\u003ch2 id=\"the-math-hierarchy\"\u003eThe math hierarchy\u003c/h2\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eTerm\u003c/th\u003e\n          \u003cth\u003eRank\u003c/th\u003e\n          \u003cth\u003eShape example\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eScalar\u003c/td\u003e\n          \u003ctd\u003e0\u003c/td\u003e\n          \u003ctd\u003ea single number\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eVector\u003c/td\u003e\n          \u003ctd\u003e1\u003c/td\u003e\n          \u003ctd\u003e\u003ccode\u003e[d]\u003c/code\u003e — a 1D array\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eMatrix\u003c/td\u003e\n          \u003ctd\u003e2\u003c/td\u003e\n          \u003ctd\u003e\u003ccode\u003e[m, n]\u003c/code\u003e — a 2D array\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eTensor\u003c/td\u003e\n          \u003ctd\u003eN\u003c/td\u003e\n         
 \u003ctd\u003e\u003ccode\u003e[d1, d2, ..., dN]\u003c/code\u003e — generic N-dimensional array\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eEvery vector is a tensor (specifically, a rank-1 tensor). Not every tensor is a vector.\u003c/p\u003e","title":"Vectors vs Tensors"},{"content":"Why LLM Caching Is Only for Input Tokens Why prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs vs. outputs are computed, and what\u0026rsquo;s actually reusable across requests.\nInputs are processed in parallel; outputs are generated sequentially When a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass — the prefill phase. Those KV tensors are a deterministic function of the input, so they can be stashed and reused if the same prefix shows up again.\nOutput generation is the decode phase: one token at a time, each step depending on the sampled previous token. Even with the same input, sampling (temperature, top-p) can produce different sequences. There\u0026rsquo;s nothing stable to cache across requests.\nInputs are reused; outputs typically aren\u0026rsquo;t Prompt caching exists because long system prompts, RAG contexts, code repos, and document attachments get sent again and again across calls. Caching them turns expensive recomputation into a lookup.\nOutputs, by contrast, are consumed once by the user and rarely sent back verbatim. Caching a 2000-token answer to reuse it would require a future request to ask for that exact answer — which almost never happens. When it does, that\u0026rsquo;s response caching at the application layer (e.g. memoizing a deterministic query), not model-level caching.\nBut outputs DO get cached — just on the next turn In a multi-turn conversation, yesterday\u0026rsquo;s assistant message becomes part of today\u0026rsquo;s input. 
At that point, the provider\u0026rsquo;s prompt cache will reuse those tokens.\nSo there isn\u0026rsquo;t really a \u0026ldquo;no caching for outputs\u0026rdquo; rule. It\u0026rsquo;s more that output tokens only become cacheable once they\u0026rsquo;ve transitioned into being input tokens for a subsequent call. Anthropic, OpenAI, and Google all do this implicitly when you replay a conversation.\nThe KV cache during decode is a separate thing Inside a single generation, each newly produced token\u0026rsquo;s KV vectors get appended to a running cache so the next decode step doesn\u0026rsquo;t have to recompute keys and values for every earlier token. That\u0026rsquo;s a within-request optimization and is universal.\nIt\u0026rsquo;s not what people mean by \u0026ldquo;prompt caching.\u0026rdquo; The cross-request prompt cache that providers offer a discount for is specifically about the prefill phase being skippable.\nPricing reflects the compute asymmetry Output tokens cost more than input tokens (often 4–5×) because each one requires a full forward pass through the model with no parallelism. Cached input tokens cost even less than uncached input — you\u0026rsquo;re skipping the prefill compute for the cached prefix and just loading tensors. There\u0026rsquo;s no equivalent shortcut for output generation. You can\u0026rsquo;t \u0026ldquo;skip\u0026rdquo; producing a token you haven\u0026rsquo;t produced yet. TL;DR Input Output Computation Parallel prefill Sequential decode Determinism Deterministic given input Stochastic (sampling) Reuse pattern Same prompts sent repeatedly Generated once, rarely resent Cacheable across requests? Yes Not until it becomes input on the next turn Inputs are deterministic, parallelizable, and frequently reused — perfect cache candidates. Outputs are sequential, stochastic, and consumed once. 
The moment they\u0026rsquo;re not consumed once, they\u0026rsquo;ve become inputs anyway.\n","permalink":"https://knowledged.to/notes/ml/llm-caching-input-tokens/","summary":"\u003ch1 id=\"why-llm-caching-is-only-for-input-tokens\"\u003eWhy LLM Caching Is Only for Input Tokens\u003c/h1\u003e\n\u003cp\u003eWhy prompt caching applies to inputs and not outputs in LLM APIs (Anthropic, OpenAI, Google). The asymmetry comes down to how inputs vs. outputs are computed, and what\u0026rsquo;s actually reusable across requests.\u003c/p\u003e\n\u003ch2 id=\"inputs-are-processed-in-parallel-outputs-are-generated-sequentially\"\u003eInputs are processed in parallel; outputs are generated sequentially\u003c/h2\u003e\n\u003cp\u003eWhen a prompt comes in, the transformer computes KV (key/value) tensors for every token in one forward pass — the \u003cstrong\u003eprefill\u003c/strong\u003e phase. Those KV tensors are a deterministic function of the input, so they can be stashed and reused if the same prefix shows up again.\u003c/p\u003e","title":"Why LLM Caching Is Only for Input Tokens"},{"content":"Model Drift Model drift is the general phenomenon where a deployed model\u0026rsquo;s predictive performance degrades over time, even though nothing about the model itself has changed. The model is the same; the world it operates in isn\u0026rsquo;t.\nTaxonomy Drift is usually classified by what\u0026rsquo;s shifting in the underlying probability distributions.\nData drift (covariate shift) The distribution of input features P(X) changes, but the relationship P(Y|X) stays the same. A fraud detection model starts seeing a higher fraction of mobile-wallet payments — inputs look different, but the rules for \u0026ldquo;is this fraud\u0026rdquo; haven\u0026rsquo;t changed.\nConcept drift The more dangerous case: P(Y|X) itself changes. The same input now maps to a different correct output. 
Spam classifiers face this constantly — spammers adapt, so features that signaled spam two years ago no longer do. A credit-risk model built before COVID had its P(default | income, employment) relationship rewritten by the pandemic.\nLabel drift (prior shift) P(Y) changes — the base rate of the target variable shifts. During a recession, the base rate of loan defaults rises regardless of any individual borrower\u0026rsquo;s profile.\nTemporal patterns Sudden — a market crash, a policy change Gradual — slow demographic or behavioral evolution Incremental — many small changes compounding Recurring/seasonal — retail patterns around Black Friday, flu cases in winter Detection Production detection combines statistical tests on inputs/outputs with direct performance monitoring against ground truth (when it eventually arrives).\nInput distribution tests Kolmogorov-Smirnov — continuous features Chi-squared — categorical features Population Stability Index (PSI) — a workhorse in finance ML; interpretable and stable KL divergence, Jensen-Shannon, Wasserstein distance — compare full distributions rather than summary statistics Output/performance monitoring Prediction confidence distributions Calibration curves Lagged accuracy/precision/recall once labels arrive Embedding-based drift detection — project incoming data through the model\u0026rsquo;s embedding layer and run distribution tests in that space; catches semantic shifts that raw feature stats miss The hard part isn\u0026rsquo;t the statistical test, it\u0026rsquo;s setting a threshold sensitive enough to catch real degradation but not so jumpy that you retrain on noise.\nMitigation strategies Strategy How it works Tradeoff Scheduled retraining Cron-driven, periodic Predictable, may retrain too often or too rarely Trigger-based retraining Kicks off when drift metrics cross a threshold Reactive, depends on threshold tuning Online/continual learning Incremental updates as data streams in Catastrophic forgetting risk 
Champion-challenger Challenger trains on recent data in shadow; promoted if it beats champion on holdout Operationally clean, doubles training cost Catastrophic forgetting: when continual learning makes the model lose old capabilities as it absorbs new ones.\nScheduled retraining is the default for a reason — predictable and easy to reason about.\nLLM-specific drift For large language models, drift takes specific forms distinct from the classical taxonomy.\nKnowledge staleness A model with a Jan 2024 training cutoff doesn\u0026rsquo;t know about post-cutoff events. Technically not drift — the model didn\u0026rsquo;t degrade, the world moved on — but the user-perceived effect is the same. Standard mitigation is RAG over a continuously updated corpus rather than retraining the base model.\nBehavioral drift across versions Chen, Zou, and Zaharia (Stanford/Berkeley, 2023): \u0026ldquo;How Is ChatGPT\u0026rsquo;s Behavior Changing over Time?\u0026rdquo; documented measurable shifts in GPT-4\u0026rsquo;s behavior on identical prompts across a few months. Some tasks improved, others regressed. Likely culprits: successive RLHF rounds, safety fine-tuning, inference-stack changes — not base weights \u0026ldquo;decaying\u0026rdquo; on their own.\nAlignment tax Each RLHF or safety pass tends to slightly degrade some capabilities (creative writing, instruction-following on edge cases) in exchange for behavioral gains. Over many iterations this compounds. 
Part of why users perceive models as \u0026ldquo;getting worse\u0026rdquo; even when benchmark scores improve.\nMode collapse / diversity loss RLHF over-converges on a narrow style — outputs become more uniform, hedged, and predictable, even if individual responses are higher-quality on average.\nOperationally relevant flavors for production LLM stacks For Go services hitting hosted LLMs with user-facing SLAs and billing on top, three matter most:\nKnowledge staleness — managed via RAG and fresh retrieval Prompt drift — user inputs evolve as people get savvier with the product Provider-side behavioral drift — if you use a hosted model whose weights you don\u0026rsquo;t control, the provider can silently change behavior Mitigation for (3): ship a regression eval suite against your LLM provider as part of CI. A fixed set of (prompt, expected behavior) pairs running nightly catches silent provider changes before users do. With OTel-heavy observability, treat eval scores as another time-series metric alongside latency and error rate.\nReferences Chen, Zou, Zaharia (2023). How Is ChatGPT\u0026rsquo;s Behavior Changing over Time? — Stanford/Berkeley. Population Stability Index — standard in credit risk monitoring; see finance ML literature. ","permalink":"https://knowledged.to/notes/ml/model-drift/","summary":"\u003ch1 id=\"model-drift\"\u003eModel Drift\u003c/h1\u003e\n\u003cp\u003eModel drift is the general phenomenon where a deployed model\u0026rsquo;s predictive performance degrades over time, even though nothing about the model itself has changed. 
The model is the same; the world it operates in isn\u0026rsquo;t.\u003c/p\u003e\n\u003ch2 id=\"taxonomy\"\u003eTaxonomy\u003c/h2\u003e\n\u003cp\u003eDrift is usually classified by what\u0026rsquo;s shifting in the underlying probability distributions.\u003c/p\u003e\n\u003ch3 id=\"data-drift-covariate-shift\"\u003eData drift (covariate shift)\u003c/h3\u003e\n\u003cp\u003eThe distribution of input features \u003ccode\u003eP(X)\u003c/code\u003e changes, but the relationship \u003ccode\u003eP(Y|X)\u003c/code\u003e stays the same. A fraud detection model starts seeing a higher fraction of mobile-wallet payments — inputs look different, but the rules for \u0026ldquo;is this fraud\u0026rdquo; haven\u0026rsquo;t changed.\u003c/p\u003e","title":"Model Drift"},{"content":"PPO — Proximal Policy Optimization PPO is a reinforcement learning algorithm from OpenAI (Schulman et al., 2017) that became the default workhorse for RLHF — it\u0026rsquo;s what trained InstructGPT and the original ChatGPT.\nCore Idea Policy gradient methods are unstable because a single large update can collapse the policy. PPO fixes this by staying close to the previous policy on each update — the \u0026ldquo;proximal\u0026rdquo; part. 
It does this with a clipped surrogate objective:\nL = min(r(θ) · A, clip(r(θ), 1 - ε, 1 + ε) · A) Where:\nr(θ) = probability ratio between new and old policy A = advantage (how much better an action was than the baseline) ε = clip range, typically 0.1–0.2 If the new policy tries to change the probability of an action by more than ε, the gradient gets clipped — preventing destructive updates while still allowing improvement.\nIn RLHF Specifically PPO uses four models loaded simultaneously:\nPolicy — the LLM being trained Reference — frozen copy of the policy for the KL penalty (keeps the model from drifting too far from its SFT origin) Reward model — scores completions Value model (critic) — estimates expected return for advantage calculation That fourth model is exactly what GRPO eliminates by using group-relative baselines instead.\nWhy It Dominated Simpler than TRPO (its predecessor, which used a hard KL constraint via constrained optimization). More stable than vanilla policy gradient. Works well across a huge range of tasks — robotics, games, and LLM fine-tuning all use the same algorithm with minimal changes. Limitations Memory-heavy: four models in GPU memory at once. Critic is hard to train with sparse/delayed rewards — common in RLHF where the reward only comes at end of generation. Hyperparameter-sensitive: KL coefficient, clip range, value loss weighting all need tuning. 
These limitations motivated alternatives like DPO (no RL at all, direct preference optimization on pairs) and GRPO (drops the critic).\n","permalink":"https://knowledged.to/notes/ml/ppo-proximal-policy-optimization/","summary":"\u003ch1 id=\"ppo--proximal-policy-optimization\"\u003ePPO — Proximal Policy Optimization\u003c/h1\u003e\n\u003cp\u003ePPO is a reinforcement learning algorithm from OpenAI (Schulman et al., 2017) that became the default workhorse for RLHF — it\u0026rsquo;s what trained InstructGPT and the original ChatGPT.\u003c/p\u003e\n\u003ch2 id=\"core-idea\"\u003eCore Idea\u003c/h2\u003e\n\u003cp\u003ePolicy gradient methods are unstable because a single large update can collapse the policy. PPO fixes this by \u003cstrong\u003estaying close to the previous policy on each update\u003c/strong\u003e — the \u0026ldquo;proximal\u0026rdquo; part. It does this with a \u003cstrong\u003eclipped surrogate objective\u003c/strong\u003e:\u003c/p\u003e","title":"PPO — Proximal Policy Optimization"},{"content":"GRPO — Group Relative Policy Optimization GRPO is a reinforcement learning algorithm introduced by DeepSeek (DeepSeekMath, later DeepSeek-R1) as a more efficient alternative to PPO for fine-tuning LLMs with RL.\nCore Idea PPO needs a separate value model (critic) of comparable size to the policy to estimate the baseline for advantage calculation. That doubles memory and compute. GRPO ditches the critic entirely.\nInstead, for each prompt it samples a group of G outputs from the current policy, scores each with the reward model, and uses the group\u0026rsquo;s mean and standard deviation as the baseline:\nA_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G) An output\u0026rsquo;s \u0026ldquo;advantage\u0026rdquo; is just how much better or worse it scored than its siblings from the same prompt. Outputs above the group mean get pushed up, below get pushed down. 
The relative ranking within the group is the signal.\nWhy It Matters Cheaper: no critic network → roughly half the memory footprint vs PPO. Naturally suited to verifiable rewards: for math/code where you can grade outputs with a checker, sample G attempts, grade them, let the relative scores drive learning. No need to train a value model — notoriously hard to fit for sparse rewards. Stable: keeps PPO\u0026rsquo;s clipped surrogate objective and KL penalty against a reference model, so it inherits PPO\u0026rsquo;s stability properties without the critic. Where It\u0026rsquo;s Used DeepSeek-R1 reasoning training is the headline use case — GRPO with rule-based rewards (correctness + format) elicited chain-of-thought without any SFT bootstrapping in the R1-Zero variant. Has become a common choice for RLHF/RLVR pipelines where you want PPO\u0026rsquo;s behavior without the critic overhead. Trade-off Leans on having a useful reward signal across the group. If all G samples score identically (all wrong, all right), the advantage collapses to zero and you learn nothing from that prompt. Pairs best with tasks where sampling produces meaningful score variance.\nRelationship to PPO GRPO is essentially PPO with the critic replaced by a group-relative Monte Carlo baseline. 
Everything else — the clipped surrogate objective, the KL penalty against the reference model, the importance sampling ratio — is inherited from PPO.\n","permalink":"https://knowledged.to/notes/ml/grpo-group-relative-policy-optimization/","summary":"\u003ch1 id=\"grpo--group-relative-policy-optimization\"\u003eGRPO — Group Relative Policy Optimization\u003c/h1\u003e\n\u003cp\u003eGRPO is a reinforcement learning algorithm introduced by DeepSeek (DeepSeekMath, later DeepSeek-R1) as a more efficient alternative to PPO for fine-tuning LLMs with RL.\u003c/p\u003e\n\u003ch2 id=\"core-idea\"\u003eCore Idea\u003c/h2\u003e\n\u003cp\u003ePPO needs a separate \u003cstrong\u003evalue model (critic)\u003c/strong\u003e of comparable size to the policy to estimate the baseline for advantage calculation. That doubles memory and compute. GRPO ditches the critic entirely.\u003c/p\u003e\n\u003cp\u003eInstead, for each prompt it samples a \u003cstrong\u003egroup\u003c/strong\u003e of G outputs from the current policy, scores each with the reward model, and uses the group\u0026rsquo;s mean and standard deviation as the baseline:\u003c/p\u003e","title":"GRPO — Group Relative Policy Optimization"},{"content":"Tool-DC: Strategic Anchor Grouping — Web Search Example This is a concrete example illustrating how the Strategic Anchor Grouping mechanism works in the Tool-DC framework. See also: notes/ml/tool-dc-framework.md.\nSetup Query: \u0026ldquo;search the web for recent AI news\u0026rdquo;\nTool library: 20 tools total\nRetriever returns top 3:\nT_top = [Google Search, Bing Search, DuckDuckGo Search] T_tail = 17 remaining tools (Calculator, Weather API, Wikipedia, Code Executor, etc.) With K=3, Tool-DC creates 4 groups:\nS₀ — Full top-K group (kept as baseline) [Google Search, Bing Search, DuckDuckGo Search] This is the problematic group. 
All three tools do essentially the same thing — search the web — but have slightly different argument schemas:\nGoogle Search: query + num_results + safe_search Bing Search: query + count + market DuckDuckGo Search: query + region The model sees three nearly-identical tools and gets confused: it might call Bing but fill in num_results (Google\u0026rsquo;s argument), producing a hallucinated, schema-invalid call.\nS₁ — Google Search as anchor [Google Search, Calculator, Weather API, Code Executor, Wikipedia] Google Search is the only web search tool in the group. The distractors are clearly unrelated. The model has no trouble picking Google Search and correctly filling query, num_results, and safe_search because there\u0026rsquo;s nothing competing for its attention.\nS₂ — Bing Search as anchor [Bing Search, Database Query, Email Sender, File Reader, Translator] Bing is the clear winner in this group. The model correctly calls Bing with query, count, and market — no confusion with Google\u0026rsquo;s schema.\nS₃ — DuckDuckGo Search as anchor [DuckDuckGo Search, Image Generator, Slack Messenger, Calendar API, Currency Converter] DuckDuckGo stands out cleanly. Arguments filled correctly.\nAfter Parallel Inference: Check Step The Check step validates each output against its schema:\nS₀ → ❌ invalid (model called Bing but used num_results — wrong argument key) S₁ → ✅ Google Search(query=\u0026quot;recent AI news\u0026quot;, num_results=5, safe_search=true) S₂ → ✅ Bing Search(query=\u0026quot;recent AI news\u0026quot;, count=5, market=\u0026quot;en-US\u0026quot;) S₃ → ✅ DuckDuckGo Search(query=\u0026quot;recent AI news\u0026quot;, region=\u0026quot;us-en\u0026quot;) Retry Step Three valid candidates. 
The Retry step assembles a clean group from just those three tools and gives the model one final pass:\n[Google Search, Bing Search, DuckDuckGo Search] + prior validated outputs as context This looks like S₀ again — but now the model has seen the validated, schema-correct calls for each tool as context. Instead of starting cold trying to pick between three identical-seeming tools, it\u0026rsquo;s refining a decision informed by what already passed validation. It might pick Google Search as the final answer, now with the correct arguments.\nCore Intuition The anchor groups aren\u0026rsquo;t about finding the best tool in isolation — they\u0026rsquo;re about giving each similar tool a fair, low-noise environment to prove it can produce a valid call. The confusion between Google/Bing/DuckDuckGo doesn\u0026rsquo;t happen because they\u0026rsquo;re never in the same context during the Try phase. The Retry phase then handles the final disambiguation, but with schema-validated evidence in hand rather than starting from scratch.\nKey points:\nSimilar tools are deliberately separated into different anchor groups Each anchor is paired with clearly unrelated distractors from T_tail Distractors are disjoint across groups (no tool appears twice) T_tail tools get coverage — the correct tool might have been ranked outside top-K by the retriever The Retry step is not starting fresh — it has schema-validated evidence to work from ","permalink":"https://knowledged.to/notes/ml/tool-dc-strategic-anchor-grouping-example/","summary":"\u003ch1 id=\"tool-dc-strategic-anchor-grouping--web-search-example\"\u003eTool-DC: Strategic Anchor Grouping — Web Search Example\u003c/h1\u003e\n\u003cp\u003eThis is a concrete example illustrating how the Strategic Anchor Grouping mechanism works in the Tool-DC framework. 
See also: \u003ccode\u003enotes/ml/tool-dc-framework.md\u003c/code\u003e.\u003c/p\u003e\n\u003ch2 id=\"setup\"\u003eSetup\u003c/h2\u003e\n\u003cp\u003eQuery: \u003cstrong\u003e\u0026ldquo;search the web for recent AI news\u0026rdquo;\u003c/strong\u003e\u003cbr\u003e\nTool library: 20 tools total\u003cbr\u003e\nRetriever returns top 3:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eT_top\u003c/strong\u003e = [Google Search, Bing Search, DuckDuckGo Search]\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eT_tail\u003c/strong\u003e = 17 remaining tools (Calculator, Weather API, Wikipedia, Code Executor, etc.)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWith K=3, Tool-DC creates \u003cstrong\u003e4 groups\u003c/strong\u003e:\u003c/p\u003e","title":"Tool-DC Strategic Anchor Grouping — Web Search Example"},{"content":"AgentFlow: In-the-Flow Agentic System Optimization Source: arXiv:2510.05592 — ICLR 2026 Oral (Top 1.1%)\nAuthors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu (Stanford University, Texas A\u0026amp;M, UC San Diego, Lambda)\nThe Problem It Solves Standard tool-augmented LLMs (like Search-R1 or ToRL) train a single monolithic policy that interleaves thinking and tool calls in one big context. This works okay on short tasks but scales poorly on long-horizon problems: the context grows, the reward signal is sparse (you only find out at the very end whether you succeeded), and the model generalizes weakly to new tool configurations. AgentFlow is built to fix all three of those.\nThe System: Four Specialized Modules AgentFlow decomposes the agent into four roles connected through a shared evolving memory:\nPlanner — the only trainable module; a policy (πθ) that looks at the query, the available tools, and the current memory state, then decides what to do next: which sub-goal to pursue and which tool to call. Executor — actually runs the tool and returns results. 
Verifier — checks whether the result solves the sub-goal, producing a binary yes/no signal. If no, memory is updated and the planner tries again. Generator — when the verifier says yes (or the turn budget is exhausted), takes the full memory and produces the final answer. The key design choice: only the Planner is trained. The other three modules can be anything (frozen LLMs, rule-based systems, external APIs), and the system still benefits from training the planner on-policy in the live multi-turn environment.\nThe Training Algorithm: Flow-GRPO This is the paper\u0026rsquo;s main technical contribution. The challenge is that RL across multi-turn trajectories is hard: credit assignment is tricky (which of the 10 turns was responsible for success or failure?), and the full trajectory is too long to optimize in one shot.\nFlow-GRPO solves this with two ideas:\n1. Broadcast a single trajectory-level reward to every turn. Rather than trying to assign partial credit to each step, every action in the trajectory gets the same reward — 1 if the final answer was correct, 0 if not (evaluated by an LLM-as-judge). If the overall trajectory succeeded, every decision along the way is reinforced.\n2. Group-normalize advantages across parallel rollouts. For each query, the system samples G trajectories in parallel. The advantage for each trajectory is normalized by the group mean and standard deviation — the same idea as GRPO — keeping training stable even with sparse rewards.\nThe combination turns intractable multi-turn RL into a sequence of tractable single-turn policy updates.\nCritical finding: Offline SFT as a baseline caused a catastrophic 19% performance collapse. Online RL (Flow-GRPO) gave a 17.2% improvement. 
The on-policy, in-the-flow nature of training is essential — you can\u0026rsquo;t learn from static demonstrations.\nResults Tested across 10 benchmarks with a 7B backbone (Qwen-2.5-7B), outperforming GPT-4o:\nTask Type Benchmarks Gain over baselines Search Bamboogle, 2Wiki, HotpotQA, Musique +14.9% Agentic GAIA +14.0% Math AIME 2024, AMC 23, Game of 24 +14.5% Scientific GPQA, MedQA +4.1% Additional scaling findings:\nPerformance keeps improving as inference turns increase from 3 to 10 Consistent gains across backbone sizes from 3B to 7B If internal tool engines are upgraded (e.g. 7B → GPT-4o tools), performance improves further without retraining Practical Relevance for AI Engineers The model and code are open source. If you\u0026rsquo;re building a multi-step agent — anything that calls tools across multiple turns — the AgentFlow architecture is a concrete blueprint:\nSeparate planner, executor, verifier, and generator Train only the planner, on-policy, with trajectory-level rewards Use Flow-GRPO for stable multi-turn RL Code: https://github.com/lupantech/AgentFlow\nModel: https://huggingface.co/AgentFlow\nDemo: https://huggingface.co/spaces/AgentFlow/agentflow\nReferences Project page: https://agentflow.stanford.edu/ Paper: https://arxiv.org/abs/2510.05592 GitHub: https://github.com/lupantech/AgentFlow ","permalink":"https://knowledged.to/notes/ml/agentflow/","summary":"\u003ch1 id=\"agentflow-in-the-flow-agentic-system-optimization\"\u003eAgentFlow: In-the-Flow Agentic System Optimization\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eSource:\u003c/strong\u003e arXiv:2510.05592 — ICLR 2026 Oral (Top 1.1%)\u003cbr\u003e\n\u003cstrong\u003eAuthors:\u003c/strong\u003e Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu (Stanford University, Texas A\u0026amp;M, UC San Diego, Lambda)\u003c/p\u003e\n\u003ch2 id=\"the-problem-it-solves\"\u003eThe Problem It Solves\u003c/h2\u003e\n\u003cp\u003eStandard 
tool-augmented LLMs (like Search-R1 or ToRL) train a single monolithic policy that interleaves thinking and tool calls in one big context. This works okay on short tasks but scales poorly on long-horizon problems: the context grows, the reward signal is sparse (you only find out at the very end whether you succeeded), and the model generalizes weakly to new tool configurations. AgentFlow is built to fix all three of those.\u003c/p\u003e","title":"AgentFlow"},{"content":"Tool-DC Framework: Try, Check and Retry for Long-context Tool-Calling Source: arXiv:2603.11495 — Accepted at ACL 2026\nAuthors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du (Wuhan University), Dacheng Tao (NTU)\nThe Core Problem When you give an LLM access to a large library of tools — say 20, 50, or hundreds of APIs — performance degrades sharply. The paper shows that even going from fewer than 10 tools to 20 causes significant accuracy drops across all tested models, especially smaller ones. Two things go wrong: the sheer length of the context buries the signal, and semantically similar tools with slightly different argument schemas confuse the model when it\u0026rsquo;s trying to fill in the right parameters.\nThe Try-Check-Retry Pipeline (Training-free variant) Tool-DC\u0026rsquo;s training-free version (TF) is a divide-and-conquer inference wrapper you can drop onto any model without retraining.\nTry — Grouping and Local Inference. Rather than showing the model all N tools at once, Tool-DC first uses a retriever (e.g. BM25) to pull the top-K most relevant tools. It then constructs K parallel groups: each group has one of those top-K tools as an \u0026ldquo;anchor,\u0026rdquo; plus a disjoint subset of lower-ranked tools from the remainder. The key insight is that each anchor tool gets its own group, which prevents similar tools from competing with each other in the same context. 
The model then runs local inference independently on each group, outputting either a tool call (tool name + arguments) or a null token.\nCheck — Schema Consistency Validation. Each local output is filtered by a rule-based validator against three constraints: the function name must exist in the tool set, all required argument keys must be present, and argument values must match the defined data types. This step catches hallucinated function names and malformed argument structures before they propagate. The valid outputs form a refined candidate set.\nRetry — Global Aggregation. The original tool definitions for everything in the validated candidate set are retrieved and assembled into a much smaller, high-signal context. The model then makes a final global call over this clean subset — essentially getting a second pass where the noise has been filtered out and it can self-refine.\nThe Training-based Variant (TB) The TF version requires multiple forward passes, which adds latency. The training-based version (TB) addresses this by internalizing the Try-Check-Retry reasoning into the model weights via fine-tuning. The process: run TF on a training dataset, collect the successful reasoning traces (local inference → validation → global decision), synthesize those into Chain-of-Thought data with a structured rationale template (Candidate Selection → Validation → Final Review), and fine-tune the model on it. At inference time, the model executes the same reasoning in a single forward pass.\nResult: Using TB, Qwen2.5-7B scores 83.16% on the Berkeley Function-Calling Leaderboard, surpassing OpenAI o3 and Claude Haiku 4.5.\nResults Tool-DC (TF): up to +25.10% average gains on BFCL and ACEBench benchmarks vs. 
baseline Tool-DC (TB): Qwen2.5-7B reaches 83.16% on BFCL, outperforming proprietary models including OpenAI o3 and Claude Haiku 4.5 Practical Relevance for AI Engineers If you\u0026rsquo;re building agents with large tool registries — MCP servers, API-heavy workflows, or any system where the model chooses from dozens of functions — Tool-DC is directly applicable. The TF variant is plug-and-play with no training required. The TB variant is worth exploring if you\u0026rsquo;re fine-tuning a smaller open model and want to close the gap with proprietary models on tool-calling tasks.\nReferences Paper: https://arxiv.org/abs/2603.11495 Full HTML: https://arxiv.org/html/2603.11495 ","permalink":"https://knowledged.to/notes/ml/tool-dc-framework/","summary":"\u003ch1 id=\"tool-dc-framework-try-check-and-retry-for-long-context-tool-calling\"\u003eTool-DC Framework: Try, Check and Retry for Long-context Tool-Calling\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eSource:\u003c/strong\u003e arXiv:2603.11495 — Accepted at ACL 2026\u003cbr\u003e\n\u003cstrong\u003eAuthors:\u003c/strong\u003e Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du (Wuhan University), Dacheng Tao (NTU)\u003c/p\u003e\n\u003ch2 id=\"the-core-problem\"\u003eThe Core Problem\u003c/h2\u003e\n\u003cp\u003eWhen you give an LLM access to a large library of tools — say 20, 50, or hundreds of APIs — performance degrades sharply. The paper shows that even going from fewer than 10 tools to 20 causes significant accuracy drops across all tested models, especially smaller ones. 
Two things go wrong: the sheer length of the context buries the signal, and semantically similar tools with slightly different argument schemas confuse the model when it\u0026rsquo;s trying to fill in the right parameters.\u003c/p\u003e","title":"Tool-DC Framework"},{"content":"Top-K in RAG Search In Retrieval-Augmented Generation (RAG), top-k is the number of most relevant document chunks the retriever returns from the vector store for a given query. The \u0026ldquo;k\u0026rdquo; is literally just a number — top-3, top-5, top-10, etc.\nHow it works Embed the query into a vector Run a similarity search (cosine, dot product, etc.) against indexed chunks Retriever ranks every chunk by similarity score Top-k says \u0026ldquo;give me the k highest-scoring ones\u0026rdquo; Those chunks get stuffed into the LLM\u0026rsquo;s context as grounding material before generation Choosing k — the tradeoff Too low (k=1, 2):\nRisk missing relevant context If the answer is split across multiple chunks, or the best chunk wasn\u0026rsquo;t ranked #1, you\u0026rsquo;re stuck Too high (k=20+):\nDilutes the signal with marginally-relevant chunks Burns context window and tokens Can actually hurt answer quality — research shows LLMs degrade with too much irrelevant context (\u0026ldquo;lost in the middle\u0026rdquo; problem) Typical values Defaults are usually k=3 to k=10, depending on chunk size and task Common pattern: pair with a reranker Stage 1: retrieve top-k=20 with cheap vector similarity (high recall) Stage 2: rerank with a cross-encoder, keep top 3-5 for the final prompt (high precision) Related knob: similarity threshold Some retrievers also expose a similarity threshold — drop anything below a score cutoff regardless of rank. 
Useful when \u0026ldquo;no relevant context\u0026rdquo; is a valid outcome and you don\u0026rsquo;t want to force k chunks when none are actually good.\nQuick reference k value Use case 1-2 High-precision lookup, short context budgets 3-5 Most common production default 10-20 First stage before reranking 20+ Almost always pair with a reranker ","permalink":"https://knowledged.to/notes/ml/top-k-in-rag-search/","summary":"\u003ch1 id=\"top-k-in-rag-search\"\u003eTop-K in RAG Search\u003c/h1\u003e\n\u003cp\u003eIn Retrieval-Augmented Generation (RAG), \u003cstrong\u003etop-k\u003c/strong\u003e is the number of most relevant document chunks the retriever returns from the vector store for a given query. The \u0026ldquo;k\u0026rdquo; is literally just a number — top-3, top-5, top-10, etc.\u003c/p\u003e\n\u003ch2 id=\"how-it-works\"\u003eHow it works\u003c/h2\u003e\n\u003col\u003e\n\u003cli\u003eEmbed the query into a vector\u003c/li\u003e\n\u003cli\u003eRun a similarity search (cosine, dot product, etc.) against indexed chunks\u003c/li\u003e\n\u003cli\u003eRetriever ranks every chunk by similarity score\u003c/li\u003e\n\u003cli\u003eTop-k says \u0026ldquo;give me the k highest-scoring ones\u0026rdquo;\u003c/li\u003e\n\u003cli\u003eThose chunks get stuffed into the LLM\u0026rsquo;s context as grounding material before generation\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch2 id=\"choosing-k--the-tradeoff\"\u003eChoosing k — the tradeoff\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eToo low (k=1, 2):\u003c/strong\u003e\u003c/p\u003e","title":"Top-K in RAG Search"},{"content":"Why Skirts Became Associated With Women and Trousers With Men The short answer is: skirts are not an inherently “female” form of clothing, and trousers are not an inherently “male” one. Across history, many men wore non-bifurcated garments — garments that do not pass between the legs — such as tunics, robes, kilts, togas, kaftans, sarongs, and long shirts. 
The strong Western association of women = skirts/dresses and men = trousers developed gradually from a mix of practical needs, class signals, gender norms, modesty rules, and later industrial-era fashion conventions.\nKey terms Bifurcated clothing: Clothing divided into two leg sections, such as trousers, breeches, leggings, or pants. Non-bifurcated clothing: Clothing that hangs as one piece around the lower body, such as skirts, robes, dresses, tunics, kilts, and sarongs. Gendered clothing: Clothing whose social meaning is tied to masculinity or femininity, even if the garment itself has no biological necessity. Men did not always wear trousers It is easy to assume that men’s clothing “naturally” evolved into trousers because modern Western menswear is trouser-based. But historically, that was not universal.\nExamples of male non-bifurcated clothing include:\nAncient Rome: Men commonly wore tunics and togas. Trousers were associated with some “barbarian” peoples outside Rome. Ancient Greece: Men wore chitons and himations, both draped garments. Scotland and Ireland: Kilts and related wrapped garments were worn by men. Middle East and North Africa: Robes, kaftans, djellabas, and thobes have long been worn by men. South and Southeast Asia: Men have worn dhotis, lungis, sarongs, and similar garments. Japan: Men wore kimono and other robe-like garments, sometimes with hakama, a divided or pleated lower garment. So the deeper question is not “Why did women’s clothing fail to become trouser-like?” but rather: Why did trousers become strongly associated with men in many societies, especially in modern Europe and its cultural descendants?\nTrousers were especially useful for riding horses One major reason trousers spread among men is that they are practical for horse riding.\nWhen riding astride — with one leg on each side of the animal — bifurcated garments protect the inner thighs, reduce chafing, and allow easier movement. 
This mattered especially for:\nsoldiers, cavalry, messengers, nomadic horse cultures, herders, hunters, travelers. Many early trouser-wearing cultures were associated with riding or cold climates. In places where men were more likely to be soldiers, riders, or outdoor laborers, trousers became linked with masculine public activity.\nThis was not because women were physically unable to wear trousers. Rather, in many societies, women were less socially expected — or less socially allowed — to ride astride, fight, travel independently, or do certain kinds of public labor.\nSkirts and robes were practical in many contexts Non-bifurcated garments are not primitive or impractical. They have real advantages:\nThey are simple to cut and sew, especially before industrial textile production. They use rectangular cloth efficiently. They allow ventilation in warm climates. They can fit a range of body sizes. They are easier to adjust for pregnancy or body changes. They can be layered for warmth. They can communicate status through fabric volume, decoration, and drape. Before modern tailoring, many garments were made from woven rectangles of cloth. A robe, tunic, wrap, or skirt could be easier to produce than fitted trousers, which require more shaping and seams.\nWomen’s dress was shaped by modesty rules In many European societies, especially from the medieval period onward, women’s clothing became closely tied to ideas of modesty, sexual propriety, and social control.\nParadoxically, skirts could be considered more “modest” than trousers because they concealed the exact shape and separation of the legs. 
Trousers, by contrast, visibly outline the body and clearly divide the legs, which some cultures considered inappropriate for women.\nThis is one reason women’s trousers were often controversial: not because they exposed more skin, but because they symbolically revealed or emphasized the legged structure of the body and were associated with male mobility and authority.\nTrousers became symbols of male public power In Europe, trousers and related garments became increasingly associated with men who moved through public life: soldiers, laborers, merchants, riders, and officials.\nOver time, this created a symbolic link:\nTrousers = masculinity, mobility, work, citizenship, authority\nMeanwhile, women’s skirts and dresses became associated with domesticity, femininity, sexual respectability, and class presentation.\nThis symbolism became so strong that “wearing the pants” became an idiom meaning holding authority in a household or relationship.\nClass also mattered Upper-class women’s dresses were often deliberately impractical. Long skirts, trains, corsets, petticoats, and delicate fabrics signaled that the wearer did not need to perform heavy manual labor.\nIn this sense, women’s fashion often displayed social status through restricted movement:\nA long skirt could show refinement. Pale fabric could show that the wearer did not do dirty labor. Complex undergarments could show wealth and leisure. A narrow silhouette could show discipline and elite femininity. This does not mean all women were idle. Working-class women worked hard, often in skirts that were shorter, tucked, aproned, or otherwise adapted. 
But elite fashion strongly influenced what counted as “proper” feminine dress.\nWomen did wear bifurcated garments in some cultures and situations The idea that women never wore pants is false.\nWomen have worn bifurcated garments in many contexts, including:\nCentral Asian and Middle Eastern cultures, where loose trousers were worn by both women and men in some periods. China, where women at times wore trousers, especially for labor or under robes. Horse-riding cultures, where practical riding clothes could include divided garments. Industrial labor, especially during wartime, when women working in factories wore trousers or overalls. Sports and bicycling, where bloomers and divided skirts emerged in the 19th century. The stronger taboo against women wearing trousers was especially pronounced in certain European and Christian-influenced contexts, and later in societies shaped by European fashion norms.\nSide-saddle riding reinforced skirts for elite women In European aristocratic culture, women were often expected to ride side-saddle, with both legs on one side of the horse, rather than astride. This allowed them to ride while wearing long skirts and was considered more modest and feminine.\nSide-saddle riding was not simply a natural adaptation to skirts; it also reinforced the idea that respectable women should not sit astride animals in a way associated with men.\nSo clothing, riding technique, and gender norms supported each other:\nWomen were expected to wear skirts. Skirts made astride riding harder. Side-saddle riding became the respectable feminine method. This reinforced the idea that trousers and astride riding were masculine. Modern trousers for women became common only recently in the West In the 19th and early 20th centuries, women who wore trousers in Western societies were often seen as challenging gender roles. 
Dress reformers argued that women needed more practical clothing for health, work, bicycling, and political equality.\nMajor shifts happened through:\nthe women’s rights movement, bicycling and sports, World War I and World War II factory work, Hollywood fashion, postwar casual wear, feminist movements of the 1960s and 1970s, changing workplace norms. By the late 20th century, trousers had become ordinary women’s clothing in many Western countries, though skirts and dresses still retained feminine associations.\nSummary Women’s clothing did not evolve without fabric between the legs because of a single biological or practical reason. The skirt/dress association came from many overlapping historical forces:\nMen also historically wore skirts, robes, and tunics. Trousers spread partly because they were useful for riding, war, and some forms of labor. Men were more often assigned public, military, and mobile roles. Women’s clothing was shaped by modesty rules and ideals of femininity. Elite women’s fashion often signaled status through impracticality. Trousers became symbolic of male authority and public life. Women did wear trousers in many cultures, but Western norms often discouraged it. The modern Western split — men in trousers, women in skirts or dresses — is therefore best understood as a cultural and historical convention, not a universal or inevitable outcome of clothing design.\n","permalink":"https://knowledged.to/notes/fashion/skirts-trousers-gender-history/","summary":"\u003ch1 id=\"why-skirts-became-associated-with-women-and-trousers-with-men\"\u003eWhy Skirts Became Associated With Women and Trousers With Men\u003c/h1\u003e\n\u003cp\u003eThe short answer is: \u003cstrong\u003eskirts are not an inherently “female” form of clothing, and trousers are not an inherently “male” one\u003c/strong\u003e. 
Across history, many men wore non-bifurcated garments — garments that do \u003cstrong\u003enot\u003c/strong\u003e pass between the legs — such as tunics, robes, kilts, togas, kaftans, sarongs, and long shirts. The strong Western association of \u003cstrong\u003ewomen = skirts/dresses\u003c/strong\u003e and \u003cstrong\u003emen = trousers\u003c/strong\u003e developed gradually from a mix of practical needs, class signals, gender norms, modesty rules, and later industrial-era fashion conventions.\u003c/p\u003e","title":"Why Skirts Became Feminine and Trousers Masculine"},{"content":"Multi-Layer Perceptron (MLP) A Multi-Layer Perceptron (MLP) is one of the foundational types of artificial neural network. It learns to map inputs to outputs by passing data through a series of layers of interconnected nodes (\u0026ldquo;neurons\u0026rdquo;), adjusting internal weights during training until its predictions improve.\nBackground: The Single Perceptron To understand an MLP, start with its building block — the perceptron (single neuron):\nIt takes several numerical inputs $x_1, x_2, \\ldots, x_n$. Each input is multiplied by a learned weight $w_i$ (how important that input is). The results are summed, a bias term $b$ is added (a constant that shifts the output), and the total is passed through an activation function $f$ to produce an output. $$\\text{output} = f\\left(\\sum_{i} w_i x_i + b\\right)$$\nA single perceptron can only learn linearly separable patterns — i.e., problems whose decision boundary is a straight line (or hyperplane). Real-world problems are rarely that simple.\nWhat Makes It \u0026ldquo;Multi-Layer\u0026rdquo;? An MLP stacks multiple layers of perceptrons:\nLayer Role Input layer Receives raw features (e.g., pixel values, numbers). No computation here. Hidden layer(s) One or more intermediate layers that learn abstract representations. This is where the real learning happens. 
Output layer Produces the final prediction (e.g., a class probability or a continuous value). The layers between input and output are called hidden because their values are not directly observed in the data.\nInput Layer: x1, x2, x3 ──► Hidden Layer: [neuron], [neuron] ──► Output Layer: [neuron] ──► ŷ Activation Functions Each neuron applies an activation function to introduce non-linearity — without this, stacking layers would still only produce a linear model, no matter how deep. Common choices:\nReLU (Rectified Linear Unit): $f(x) = \\max(0, x)$ — most widely used in hidden layers today. Sigmoid: $f(x) = \\frac{1}{1+e^{-x}}$ — squashes output to (0, 1); used in binary classification outputs. Softmax: Generalises sigmoid to multiple classes; used in multi-class output layers. Tanh: Squashes output to (−1, 1); sometimes used in hidden layers. How an MLP Learns: Backpropagation Training an MLP means finding the weights that minimise prediction error. This is done by:\nForward pass — feed an input through the network to get a prediction $\\hat{y}$. Compute loss — measure how wrong the prediction is using a loss function (e.g., Mean Squared Error for regression, Cross-Entropy for classification). Backward pass (backpropagation) — use the chain rule of calculus to compute how much each weight contributed to the error, producing a gradient for every weight. Weight update — adjust every weight slightly in the direction that reduces the error, using an optimiser like Stochastic Gradient Descent (SGD) or Adam. This cycle repeats over many epochs (full passes through the training data) until the loss is acceptably low.\nA Concrete Example Suppose you want to classify whether an email is spam (1) or not (0) using two features: word count and exclamation-mark count.\nInput layer: 2 neurons (one per feature). Hidden layer: 4 neurons with ReLU activation. 
Output layer: 1 neuron with Sigmoid activation → outputs a probability between 0 and 1. During training, the MLP learns weights that combine word count and exclamation marks in a non-linear way to separate spam from non-spam.\nKey Properties \u0026amp; Limitations Property Detail Universal approximation An MLP with at least one hidden layer and a non-linear activation can approximate any continuous function (given enough neurons). Fully connected Every neuron in one layer connects to every neuron in the next — hence also called a fully connected network or dense network. Scalability MLPs struggle with raw images or sequences; specialised architectures (CNNs for images, RNNs/Transformers for sequences) usually outperform them there. Overfitting With many parameters, MLPs can memorise training data. Regularisation techniques like dropout (randomly zeroing neurons during training) and weight decay help. Summary An MLP is a feedforward neural network with:\nAn input layer, one or more hidden layers, and an output layer. Non-linear activation functions enabling it to learn complex patterns. Trained via backpropagation and gradient descent. It is often the first neural network architecture to learn, and understanding it well forms the foundation for studying deeper and more specialised architectures.\n","permalink":"https://knowledged.to/ai/concepts/multi-layer-perceptron/","summary":"\u003ch1 id=\"multi-layer-perceptron-mlp\"\u003eMulti-Layer Perceptron (MLP)\u003c/h1\u003e\n\u003cp\u003eA \u003cstrong\u003eMulti-Layer Perceptron (MLP)\u003c/strong\u003e is one of the foundational types of artificial neural network. 
It learns to map inputs to outputs by passing data through a series of layers of interconnected nodes (\u0026ldquo;neurons\u0026rdquo;), adjusting internal weights during training until its predictions improve.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"background-the-single-perceptron\"\u003eBackground: The Single Perceptron\u003c/h2\u003e\n\u003cp\u003eTo understand an MLP, start with its building block — the \u003cstrong\u003eperceptron\u003c/strong\u003e (single neuron):\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eIt takes several numerical inputs $x_1, x_2, \\ldots, x_n$.\u003c/li\u003e\n\u003cli\u003eEach input is multiplied by a learned \u003cstrong\u003eweight\u003c/strong\u003e $w_i$ (how important that input is).\u003c/li\u003e\n\u003cli\u003eThe results are summed, a \u003cstrong\u003ebias\u003c/strong\u003e term $b$ is added (a constant that shifts the output), and the total is passed through an \u003cstrong\u003eactivation function\u003c/strong\u003e $f$ to produce an output.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e$$\\text{output} = f\\left(\\sum_{i} w_i x_i + b\\right)$$\u003c/p\u003e","title":"Multi-Layer Perceptron (MLP)"},{"content":"Attention in Machine Learning Attention is a mechanism that lets a model dynamically decide which parts of the input matter most when producing each piece of output. 
Instead of compressing everything into one fixed representation, the model computes a weighted combination of inputs where the weights are learned and depend on context.\nIntuition When translating \u0026ldquo;the cat sat on the mat\u0026rdquo; to French, generating the word for \u0026ldquo;cat\u0026rdquo; should mostly pay attention to \u0026ldquo;cat\u0026rdquo; in the source — not \u0026ldquo;mat\u0026rdquo; or \u0026ldquo;on.\u0026rdquo; Attention makes this routing explicit and differentiable.\nBefore attention (Bahdanau et al., 2014, in neural machine translation), encoder-decoder RNNs had to squeeze the whole source sentence into a single hidden vector, which broke down on longer inputs.\nMechanics: Query, Key, Value The standard formulation is scaled dot-product attention:\nEach input position produces three vectors via learned linear projections: a query (Q), a key (K), and a value (V). For a given query, compute its dot product with every key. This gives a similarity score: how relevant is each position to what I\u0026rsquo;m currently looking for? Divide the scores by √d_k (to keep gradients stable under large d_k) and apply softmax to turn them into a probability distribution. Take the weighted sum of the values using those probabilities. In one line:\nAttention(Q, K, V) = softmax(QKᵀ / √d_k) · V Self-attention and why it was a big deal The Transformer (Vaswani et al., 2017, \u0026ldquo;Attention Is All You Need\u0026rdquo;) made self-attention the central operation: every token attends to every other token in the same sequence — Q, K, V all come from the same input. This unlocked two things RNNs couldn\u0026rsquo;t do well:\nLong-range dependencies — any position can directly reference any other in one step, instead of information having to flow through many recurrent timesteps. Parallelism — all positions are processed simultaneously, which is why Transformers train so much faster than RNNs on GPUs. 
Important variants Multi-head attention — run several attention operations in parallel with different projections, then concatenate. Each head can specialize (one tracks syntax, another tracks coreference, etc.). Causal / masked attention — in decoders, mask out future positions so a token only attends to previous ones. This is what makes autoregressive generation possible. Cross-attention — Q comes from the decoder, K and V from the encoder. Used in seq2seq Transformers and in diffusion models for conditioning on text. Efficiency variants: FlashAttention — memory-efficient exact attention via tiling and recomputation. Grouped-query attention (GQA) / multi-query attention (MQA) — share K, V across heads to shrink the KV cache. Sliding-window / sparse attention — for long contexts where full O(n²) attention is too expensive. TL;DR Attention is content-based, soft, differentiable lookup. Self-attention applied that lookup to a sequence\u0026rsquo;s own tokens. Modern LLMs are essentially scaled-up stacks of self-attention layers (plus feedforward blocks and normalization).\nReferences Bahdanau, Cho, Bengio (2014). Neural Machine Translation by Jointly Learning to Align and Translate. Vaswani et al. (2017). Attention Is All You Need. Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. ","permalink":"https://knowledged.to/notes/ml/attention/","summary":"\u003ch1 id=\"attention-in-machine-learning\"\u003eAttention in Machine Learning\u003c/h1\u003e\n\u003cp\u003eAttention is a mechanism that lets a model dynamically decide \u003cem\u003ewhich parts of the input matter most\u003c/em\u003e when producing each piece of output. 
Instead of compressing everything into one fixed representation, the model computes a weighted combination of inputs where the weights are learned and depend on context.\u003c/p\u003e\n\u003ch2 id=\"intuition\"\u003eIntuition\u003c/h2\u003e\n\u003cp\u003eWhen translating \u0026ldquo;the cat sat on the mat\u0026rdquo; to French, generating the word for \u0026ldquo;cat\u0026rdquo; should mostly pay attention to \u0026ldquo;cat\u0026rdquo; in the source — not \u0026ldquo;mat\u0026rdquo; or \u0026ldquo;on.\u0026rdquo; Attention makes this routing explicit and differentiable.\u003c/p\u003e","title":"Attention in Machine Learning"},{"content":"Molecular Evolution of Pediculus humanus and the Origin of Clothing Authors: Ralf Kittler, Manfred Kayser, Mark Stoneking (Max Planck Institute for Evolutionary Anthropology) Journal: Current Biology, Vol. 13, Issue 16, pp. 1414–1417 (19 August 2003) DOI: 10.1016/S0960-9822(03)00507-4 · PMID: 12932325\nThe Question When did humans start wearing clothing regularly? Clothing leaves almost no archaeological trace, so the date has long been speculative. The authors use an unusual proxy: the human body louse.\nThe Key Insight Two forms of Pediculus humanus parasitize humans:\nHead louse (P. h. capitis) — lives and feeds on the scalp. Body louse (P. h. corporis) — feeds on the body but lives in clothing. Body lice could not have evolved before clothing existed. Dating the origin of the body louse therefore puts a lower bound on the regular use of clothing.\nMethods Sequenced two mitochondrial and two nuclear DNA segments from ~40 head and body lice collected worldwide. Used a chimpanzee louse (Pediculus schaeffi) as an outgroup to calibrate divergence. Applied a molecular clock to estimate when body lice diverged from head lice. Analyzed mtDNA diversity patterns for demographic signals. Findings Origin of body lice: ~72,000 ± 42,000 years ago. 
Greater louse diversity in Africa than elsewhere → African origin of human lice, mirroring the out-of-Africa pattern in their hosts. mtDNA signatures show a demographic expansion that tracks the modern human dispersal from Africa. Why It Matters Clothing is a surprisingly recent innovation in human evolution — well after the emergence of anatomically modern humans (~200,000 ya) but plausibly aligned with northward expansion into colder climates. Demonstrates how parasite genetics can illuminate host behavior that leaves no direct fossil or archaeological record. Provides an independent molecular date that complements (and slightly post-dates) the earliest indirect archaeological hints of clothing (e.g., hide-scraping tools). Caveats Wide confidence interval (±42,000 years) due to small sample and molecular-clock uncertainty. Establishes when body lice became a distinct ecological form — occasional or seasonal clothing use could predate this. Sources Cell.com: https://www.cell.com/current-biology/fulltext/S0960-9822(03)00507-4 PubMed: https://pubmed.ncbi.nlm.nih.gov/12932325/ ScienceDirect: https://www.sciencedirect.com/science/article/pii/S0960982203005074 ","permalink":"https://knowledged.to/notes/fashion/pediculus-humanus-origin-of-clothing/","summary":"\u003ch1 id=\"molecular-evolution-of-pediculus-humanus-and-the-origin-of-clothing\"\u003eMolecular Evolution of \u003cem\u003ePediculus humanus\u003c/em\u003e and the Origin of Clothing\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors:\u003c/strong\u003e Ralf Kittler, Manfred Kayser, Mark Stoneking (Max Planck Institute for Evolutionary Anthropology)\n\u003cstrong\u003eJournal:\u003c/strong\u003e \u003cem\u003eCurrent Biology\u003c/em\u003e, Vol. 13, Issue 16, pp. 
1414–1417 (19 August 2003)\n\u003cstrong\u003eDOI:\u003c/strong\u003e 10.1016/S0960-9822(03)00507-4 · \u003cstrong\u003ePMID:\u003c/strong\u003e 12932325\u003c/p\u003e\n\u003ch2 id=\"the-question\"\u003eThe Question\u003c/h2\u003e\n\u003cp\u003eWhen did humans start wearing clothing regularly? Clothing leaves almost no archaeological trace, so the date has long been speculative. The authors use an unusual proxy: the human body louse.\u003c/p\u003e","title":"Molecular Dating of Clothing Origins via Body Louse Evolution"},{"content":"Paleolithic Eyed Needles and the Evolution of Dress Authors: Ian Gilligan, Francesco d\u0026rsquo;Errico, Luc Doyon, Wei Wang, Yaroslav V. Kuzmin Published: Science Advances 10, eadp2887 — 28 June 2024 (DOI) Type: Review (Anthropology)\nTL;DR Eyed needles weren\u0026rsquo;t invented to tailor clothes — bone awls already did that. Their arrival ~40,000 years ago signals something bigger: the rise of layered garments (including underwear) and the shift from decorating skin to decorating clothing, transforming clothes from physical necessity into social dress.\nKey Argument Tailoring predates eyed needles. Bone awls (~80 kya, Blombos Cave) and even lithic burins were already producing fitted garments. A 39,600 cal B.P. punctured bone from Canyars (Catalonia) shows tailored leather was made 14,000 years before eyed needles reached Europe. So why invent the eye? Two complementary drivers requiring finer, faster sewing: Underwear / multi-layer assemblages for thermal insulation in deteriorating Late Pleistocene climates. Adornment of clothing — sewing beads, pendants, and fur trim onto garments as more of the body got covered. Clothing → Dress. Once bodies were continuously covered, social signaling migrated from skin (ochre, tattoos) onto cloth surfaces. Clothing acquired symbolic functions and became permanent, decoupled from climate. Earliest Eyed Needles (selected) Site Age (cal B.P.) 
Region Denisova Cave ~40,000 Southern Siberia Mezmaiskaya Cave ~38,000 Caucasus Zhoukoudian Upper Cave 35–33,000 NE East Asia Yana RHS (192 needles, 8 varieties) 33,000 Arctic Siberia (71°N) Shuidonggou 2 32,000 N. Central E. Asia Kostenki-15 / Potočka Cave 30,000 Europe Broken Mammoth 14,000 NW North America Evolutionary Scenario Stage Tech Era Loose hides, ad hoc cover Stone hide-scrapers Mid-Pleistocene Simple wrapped garments Notched tools, borers ~300 kya+ Fitted/tailored clothing Bone awls ~80 kya (Blombos) Finer sewing, layered + decorated Eyed needles ~40 kya, Eurasia Regional morphological diversity Specialized needle kits LGM onward Supporting Evidence (multi-disciplinary) Archaeology: geographic match between eyed needles and cold late-MIS 3 / LGM environments; rise of beads sewn onto garments (Sunghir burials, Üçağızlı, Shuidonggou 2). Paleoclimate: intensification of glacial cycles from mid-Pleistocene. Physiology: Paleolithic furs ≈ doubled insulation of modern wovens → 2-layer Paleolithic ~ 4-layer modern. Genetics: clothing-lice divergence supports deep clothing origins. Negative evidence: No Pleistocene eyed needles in the Southern Hemisphere — milder climate, no thermal driver. Why It Matters Eyed needles are a small technological step but a quantum leap culturally: they mark the transition where clothing stopped being seasonal protection and became a permanent vehicle for identity, status, and modesty — a state that persists to the present.\n","permalink":"https://knowledged.to/notes/fashion/paleolithic-eyed-needles-and-dress/","summary":"\u003ch1 id=\"paleolithic-eyed-needles-and-the-evolution-of-dress\"\u003ePaleolithic Eyed Needles and the Evolution of Dress\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors:\u003c/strong\u003e Ian Gilligan, Francesco d\u0026rsquo;Errico, Luc Doyon, Wei Wang, Yaroslav V. 
Kuzmin\n\u003cstrong\u003ePublished:\u003c/strong\u003e \u003cem\u003eScience Advances\u003c/em\u003e 10, eadp2887 — 28 June 2024 (\u003ca href=\"https://www.science.org/doi/10.1126/sciadv.adp2887\"\u003eDOI\u003c/a\u003e)\n\u003cstrong\u003eType:\u003c/strong\u003e Review (Anthropology)\u003c/p\u003e\n\u003ch2 id=\"tldr\"\u003eTL;DR\u003c/h2\u003e\n\u003cp\u003eEyed needles weren\u0026rsquo;t invented to tailor clothes — bone awls already did that. Their arrival ~40,000 years ago signals something bigger: the rise of \u003cstrong\u003elayered garments (including underwear)\u003c/strong\u003e and the shift from decorating skin to \u003cstrong\u003edecorating clothing\u003c/strong\u003e, transforming clothes from physical necessity into social \u003cem\u003edress\u003c/em\u003e.\u003c/p\u003e","title":"Paleolithic Eyed Needles and the Evolution of Dress"},{"content":"MCP Interaction Model Components (official MCP nomenclature) Host — The user-facing application that embeds the LLM and enforces policy (Claude Desktop, Claude Code, an IDE plugin, etc.). It owns the user, the model, and the trust boundary. Client — A protocol connector that lives inside the Host. One Client per Server, holding a 1:1 stateful session. The Host spawns Clients as needed. Server — The process that exposes capabilities (tools, resources, prompts) over the MCP protocol. Can be local (stdio transport) or remote (Streamable HTTP transport). Authorization Server (AS) — For remote Servers: the OAuth 2.1 issuer of access tokens. May be the Server itself or a separate identity provider. Resource Server (RS) — OAuth role played by the remote MCP Server when it validates bearer tokens on incoming requests. User — The human who approves connections, consents to tool calls, and answers elicitations. LLM — Not technically an MCP component, but the reasoning engine the Host drives; never talks to a Server directly. 
Phase 1 — Transport \u0026amp; connection Host → Client: Host launches a Client configured for a specific Server (command + args for stdio, or URL for HTTP). Client ↔ Server: Transport established. stdio: Host spawns the Server as a subprocess; JSON-RPC over stdin/stdout. Streamable HTTP: Client opens an HTTPS connection; bidirectional via POST + SSE stream. Phase 2 — Authorization (remote Servers only) MCP uses OAuth 2.1 + PKCE, with Resource Indicators (RFC 8707) and Dynamic Client Registration (RFC 7591).\nClient → Server: Initial request without a token. Server (as RS) → Client: 401 Unauthorized with WWW-Authenticate pointing at /.well-known/oauth-protected-resource. Client → RS metadata endpoint: Fetches Protected Resource Metadata, which names the Authorization Server(s). Client → AS metadata endpoint: Fetches /.well-known/oauth-authorization-server (RFC 8414). Client → AS: Dynamic Client Registration (if supported) to obtain a client_id. Client → Host → User: Host opens browser to AS\u0026rsquo;s /authorize with PKCE challenge + resource parameter (binds token to this Server). User ↔ AS: User authenticates and consents. AS → Client: Redirect with authorization code. Client → AS: Exchanges code + PKCE verifier at /token for an access token (and optional refresh token). Client → Server: Retries request with Authorization: Bearer \u0026lt;token\u0026gt;. Server validates audience, scopes, expiry. Phase 3 — Initialization handshake Client → Server: initialize request — declares protocol version, Client capabilities (roots, sampling, elicitation), and Client info. Server → Client: initialize response — agreed protocol version, Server capabilities (tools, resources, prompts, logging), Server info. Client → Server: notifications/initialized — session is now live. Phase 4 — Capability discovery Client → Server: tools/list, resources/list, prompts/list, resources/templates/list. Server → Client: Returns JSON Schemas, URIs, descriptions. 
Host: Injects these into the LLM\u0026rsquo;s context as available tools/resources, often filtered by user-granted permissions. Phase 5 — Operation (the steady state) Tool calling (model-initiated) LLM → Host: Emits a tool-use request. Host → User: (Policy-dependent) prompts for permission. Host → Client → Server: tools/call with name + arguments. Server: Executes; may consult its own backend / APIs. Server → Client → Host → LLM: CallToolResult (content blocks such as text, image, or resource links, with isError: true set on failure). Resource reading (host/app-initiated) Client → Server: resources/read with a URI. Server → Client: Contents (text or blob). Optional Client → Server resources/subscribe; Server pushes notifications/resources/updated. Prompts (user-initiated) User → Host: Picks a prompt (e.g. via slash menu). Client → Server: prompts/get with arguments. Server → Client: Rendered message list, fed into the LLM. Sampling (Server-initiated LLM call) Server → Client: sampling/createMessage — Server asks the Host\u0026rsquo;s LLM to complete something. Host → User: Confirms (the spec says a human SHOULD be in the loop and able to deny the request). Host → LLM → Host → Client → Server: Completion returned. Elicitation (Server asks the User for input) Server → Client: elicitation/create with a JSON Schema describing requested fields. Host → User: Renders a form. User → Host → Client → Server: Structured response or decline. Roots (Client tells Server which filesystem scopes are in-bounds) Server → Client: roots/list request on demand; the Client answers with its current roots. Client → Server: notifications/roots/list_changed when they change. 
Notifications (out-of-band, either direction) notifications/tools/list_changed, …/resources/list_changed, …/prompts/list_changed notifications/progress for long-running calls (tied to a progressToken) notifications/message for log lines (logging/setLevel controls verbosity) notifications/cancelled to abort an in-flight request Phase 6 — Shutdown Client → Server: Closes transport (stdio: close stdin → Server exits; HTTP: terminate session, optionally DELETE the session ID). Trust boundaries The Host is the only component that talks to both the User and the LLM. The Server never sees either directly — every user prompt or model token routed to it goes through the Host\u0026rsquo;s policy layer. Authentication (who the user is) happens at the AS; authorization (what this token can do at this Server) is enforced by the Server as RS on every request via token validation + scopes. Consent for tool execution, sampling, and elicitation is the Host\u0026rsquo;s responsibility, not the Server\u0026rsquo;s. ","permalink":"https://knowledged.to/notes/ml/mcp-interaction-model/","summary":"\u003ch1 id=\"mcp-interaction-model\"\u003eMCP Interaction Model\u003c/h1\u003e\n\u003ch2 id=\"components-official-mcp-nomenclature\"\u003eComponents (official MCP nomenclature)\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eHost\u003c/strong\u003e — The user-facing application that embeds the LLM and enforces policy (Claude Desktop, Claude Code, an IDE plugin, etc.). It owns the user, the model, and the trust boundary.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eClient\u003c/strong\u003e — A protocol connector that lives inside the Host. \u003cstrong\u003eOne Client per Server\u003c/strong\u003e, holding a 1:1 stateful session. The Host spawns Clients as needed.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eServer\u003c/strong\u003e — The process that exposes capabilities (tools, resources, prompts) over the MCP protocol. 
Can be local (stdio transport) or remote (Streamable HTTP transport).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAuthorization Server (AS)\u003c/strong\u003e — For remote Servers: the OAuth 2.1 issuer of access tokens. May be the Server itself or a separate identity provider.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eResource Server (RS)\u003c/strong\u003e — OAuth role played by the remote MCP Server when it validates bearer tokens on incoming requests.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUser\u003c/strong\u003e — The human who approves connections, consents to tool calls, and answers elicitations.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLLM\u003c/strong\u003e — Not technically an MCP component, but the reasoning engine the Host drives; never talks to a Server directly.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"phase-1--transport--connection\"\u003ePhase 1 — Transport \u0026amp; connection\u003c/h2\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eHost → Client\u003c/strong\u003e: Host launches a Client configured for a specific Server (command + args for stdio, or URL for HTTP).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eClient ↔ Server\u003c/strong\u003e: Transport established.\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003estdio\u003c/strong\u003e: Host spawns the Server as a subprocess; JSON-RPC over stdin/stdout.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStreamable HTTP\u003c/strong\u003e: Client opens an HTTPS connection; bidirectional via POST + SSE stream.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch2 id=\"phase-2--authorization-remote-servers-only\"\u003ePhase 2 — Authorization (remote Servers only)\u003c/h2\u003e\n\u003cp\u003eMCP uses \u003cstrong\u003eOAuth 2.1 + PKCE\u003c/strong\u003e, with Resource Indicators (RFC 8707) and Dynamic Client Registration (RFC 7591).\u003c/p\u003e","title":"MCP Interaction Model"},{"content":"SWE-bench \u0026amp; SWE-bench Pro Explained 
SWE-bench is a benchmark that tests whether an AI model can actually fix real GitHub issues from open-source Python repositories (like Django, Flask, scikit-learn, etc.). The model is given a repo, a bug report or feature request, and has to produce a code patch that makes the failing tests pass — without being told what to change.\nIt\u0026rsquo;s considered one of the more meaningful coding benchmarks because it tests end-to-end software engineering ability: reading existing code, understanding context, making targeted changes, and not breaking other things.\nSWE-bench Pro is a harder variant with:\nMore complex, multi-file issues Less \u0026ldquo;solved\u0026rdquo; training data (so models can\u0026rsquo;t pattern-match from memorized solutions) Tasks that require reasoning across larger codebases What \u0026ldquo;64.3% on SWE-bench Pro\u0026rdquo; means in practice: Claude Opus 4.7 successfully resolves ~64 out of every 100 benchmark tasks it\u0026rsquo;s given. The remaining ~36 it either gets wrong or doesn\u0026rsquo;t attempt. That\u0026rsquo;s a high bar: each task comes from a real issue that maintainers resolved with an actual code change, and the patch has to make the tests tied to that fix pass without breaking the rest of the suite.\nWhy it matters for AI app builders: If you\u0026rsquo;re using Claude in a coding agent, agentic code review, or automated PR workflow, SWE-bench Pro performance is a reasonable proxy for how well it will handle messy, real-world codebases as opposed to clean textbook problems. A model with a high score here is less likely to produce patches that break unrelated tests or misread the codebase structure.\nThe short version: it\u0026rsquo;s currently the closest thing the industry has to a \u0026ldquo;does this model actually write working code?\u0026rdquo; test.\nSWE = Software Engineering. 
The full name is Software Engineering Benchmark, created by researchers at Princeton University and the University of Chicago in 2023.\n","permalink":"https://knowledged.to/ai/benchmarks/swe-bench/","summary":"\u003ch1 id=\"swe-bench--swe-bench-pro-explained\"\u003eSWE-bench \u0026amp; SWE-bench Pro Explained\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003eSWE-bench\u003c/strong\u003e is a benchmark that tests whether an AI model can actually fix real GitHub issues from open-source Python repositories (like Django, Flask, scikit-learn, etc.). The model is given a repo, a bug report or feature request, and has to produce a code patch that makes the failing tests pass — without being told what to change.\u003c/p\u003e\n\u003cp\u003eIt\u0026rsquo;s considered one of the more meaningful coding benchmarks because it tests end-to-end software engineering ability: reading existing code, understanding context, making targeted changes, and not breaking other things.\u003c/p\u003e","title":"SWE-bench \u0026 SWE-bench Pro Explained"},{"content":"Building AI Agents in Go If you want to build AI agents in Go, there are a few Agent SDKs and frameworks available in 2026 that make it easier to integrate with LLMs, tools, and multi-agent workflows.\nBelow is a runnable Go example using a modern Agent SDK pattern. 
I\u0026rsquo;ll show you a minimal agent that can receive a prompt, call an LLM API, and return a response.\nExample: Minimal AI Agent in Go package main import ( \u0026#34;context\u0026#34; \u0026#34;fmt\u0026#34; \u0026#34;log\u0026#34; \u0026#34;os\u0026#34; \u0026#34;time\u0026#34; \u0026#34;github.com/Ingenimax/agent-sdk-go/agent\u0026#34; \u0026#34;github.com/Ingenimax/agent-sdk-go/llm\u0026#34; ) func main() { // Load API key from environment variable apiKey := os.Getenv(\u0026#34;OPENAI_API_KEY\u0026#34;) if apiKey == \u0026#34;\u0026#34; { log.Fatal(\u0026#34;Please set the OPENAI_API_KEY environment variable\u0026#34;) } // Create a new LLM client (example: OpenAI GPT model) llmClient, err := llm.NewOpenAI(apiKey, llm.WithModel(\u0026#34;gpt-4o-mini\u0026#34;)) if err != nil { log.Fatalf(\u0026#34;Failed to create LLM client: %v\u0026#34;, err) } // Create an agent with a simple reasoning function myAgent := agent.New(\u0026#34;helper-agent\u0026#34;, agent.WithLLM(llmClient), agent.WithSystemPrompt(\u0026#34;You are a helpful assistant that answers concisely.\u0026#34;), ) // Context with timeout for safety ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) defer cancel() // Run the agent with a user query response, err := myAgent.Run(ctx, \u0026#34;Explain the difference between concurrency and parallelism in Go.\u0026#34;) if err != nil { log.Fatalf(\u0026#34;Agent error: %v\u0026#34;, err) } fmt.Println(\u0026#34;Agent Response:\u0026#34;) fmt.Println(response) } How This Works agent-sdk-go – A Go framework for building AI agents with modular tools, memory, and reasoning loops. LLM Client – Connects to an LLM provider (OpenAI in this example). Agent – Wraps the LLM with a system prompt and optional tools. Run – Executes the reasoning loop and returns the answer. Installation go get github.com/Ingenimax/agent-sdk-go Features of Modern Go Agent SDKs Tool Integration – Agents can call APIs, databases, or custom functions. 
Multi-Agent Workflows – Agents can hand off tasks to other agents. Memory – Store and recall conversation history. Streaming – Get partial responses in real time. Concurrency – Use Go\u0026rsquo;s goroutines for parallel tool calls. ","permalink":"https://knowledged.to/notes/ml/ai-agents-in-go/","summary":"\u003ch1 id=\"building-ai-agents-in-go\"\u003eBuilding AI Agents in Go\u003c/h1\u003e\n\u003cp\u003eIf you want to build \u003cstrong\u003eAI agents in Go\u003c/strong\u003e, there are a few \u003cstrong\u003eAgent SDKs\u003c/strong\u003e and frameworks available in 2026 that make it easier to integrate with LLMs, tools, and multi-agent workflows.\u003c/p\u003e\n\u003cp\u003eBelow is a \u003cstrong\u003erunnable Go example\u003c/strong\u003e using a modern Agent SDK pattern. I\u0026rsquo;ll show you a \u003cstrong\u003eminimal agent\u003c/strong\u003e that can receive a prompt, call an LLM API, and return a response.\u003c/p\u003e\n\u003ch2 id=\"example-minimal-ai-agent-in-go\"\u003eExample: Minimal AI Agent in Go\u003c/h2\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"\u003e\u003ccode class=\"language-go\" data-lang=\"go\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003epackage\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003emain\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e (\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;context\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan 
style=\"color:#e6db74\"\u003e\u0026#34;fmt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;log\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;os\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;time\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;github.com/Ingenimax/agent-sdk-go/agent\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;github.com/Ingenimax/agent-sdk-go/llm\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#66d9ef\"\u003efunc\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003emain\u003c/span\u003e() {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#75715e\"\u003e// Load API key from environment variable\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#a6e22e\"\u003eapiKey\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003eos\u003c/span\u003e.\u003cspan 
style=\"color:#a6e22e\"\u003eGetenv\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;OPENAI_API_KEY\u0026#34;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#66d9ef\"\u003eif\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003eapiKey\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e==\u003c/span\u003e \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        \u003cspan style=\"color:#a6e22e\"\u003elog\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eFatal\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;Please set the OPENAI_API_KEY environment variable\u0026#34;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#75715e\"\u003e// Create a new LLM client (example: OpenAI GPT model)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#a6e22e\"\u003ellmClient\u003c/span\u003e, \u003cspan style=\"color:#a6e22e\"\u003eerr\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003ellm\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eNewOpenAI\u003c/span\u003e(\u003cspan style=\"color:#a6e22e\"\u003eapiKey\u003c/span\u003e, \u003cspan style=\"color:#a6e22e\"\u003ellm\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eWithModel\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;gpt-4o-mini\u0026#34;\u003c/span\u003e))\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#66d9ef\"\u003eif\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003eerr\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e!=\u003c/span\u003e \u003cspan style=\"color:#66d9ef\"\u003enil\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        \u003cspan style=\"color:#a6e22e\"\u003elog\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eFatalf\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;Failed to create LLM client: %v\u0026#34;\u003c/span\u003e, \u003cspan style=\"color:#a6e22e\"\u003eerr\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#75715e\"\u003e// Create an agent with a simple reasoning function\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#a6e22e\"\u003emyAgent\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003eagent\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eNew\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;helper-agent\u0026#34;\u003c/span\u003e,\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        \u003cspan style=\"color:#a6e22e\"\u003eagent\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eWithLLM\u003c/span\u003e(\u003cspan style=\"color:#a6e22e\"\u003ellmClient\u003c/span\u003e),\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        \u003cspan style=\"color:#a6e22e\"\u003eagent\u003c/span\u003e.\u003cspan 
style=\"color:#a6e22e\"\u003eWithSystemPrompt\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;You are a helpful assistant that answers concisely.\u0026#34;\u003c/span\u003e),\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    )\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#75715e\"\u003e// Context with timeout for safety\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#a6e22e\"\u003ectx\u003c/span\u003e, \u003cspan style=\"color:#a6e22e\"\u003ecancel\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003econtext\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eWithTimeout\u003c/span\u003e(\u003cspan style=\"color:#a6e22e\"\u003econtext\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eBackground\u003c/span\u003e(), \u003cspan style=\"color:#ae81ff\"\u003e15\u003c/span\u003e\u003cspan style=\"color:#f92672\"\u003e*\u003c/span\u003e\u003cspan style=\"color:#a6e22e\"\u003etime\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eSecond\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#66d9ef\"\u003edefer\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003ecancel\u003c/span\u003e()\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#75715e\"\u003e// Run the agent with a user query\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan 
style=\"color:#a6e22e\"\u003eresponse\u003c/span\u003e, \u003cspan style=\"color:#a6e22e\"\u003eerr\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003emyAgent\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eRun\u003c/span\u003e(\u003cspan style=\"color:#a6e22e\"\u003ectx\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#34;Explain the difference between concurrency and parallelism in Go.\u0026#34;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#66d9ef\"\u003eif\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003eerr\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e!=\u003c/span\u003e \u003cspan style=\"color:#66d9ef\"\u003enil\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        \u003cspan style=\"color:#a6e22e\"\u003elog\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003eFatalf\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;Agent error: %v\u0026#34;\u003c/span\u003e, \u003cspan style=\"color:#a6e22e\"\u003eerr\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#a6e22e\"\u003efmt\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003ePrintln\u003c/span\u003e(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;Agent Response:\u0026#34;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#a6e22e\"\u003efmt\u003c/span\u003e.\u003cspan style=\"color:#a6e22e\"\u003ePrintln\u003c/span\u003e(\u003cspan 
style=\"color:#a6e22e\"\u003eresponse\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003ch2 id=\"how-this-works\"\u003eHow This Works\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003e\u003ccode\u003eagent-sdk-go\u003c/code\u003e\u003c/strong\u003e – A Go framework for building AI agents with modular tools, memory, and reasoning loops.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLLM Client\u003c/strong\u003e – Connects to an LLM provider (OpenAI in this example).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAgent\u003c/strong\u003e – Wraps the LLM with a system prompt and optional tools.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRun\u003c/strong\u003e – Executes the reasoning loop and returns the answer.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"installation\"\u003eInstallation\u003c/h2\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003ego get github.com/Ingenimax/agent-sdk-go\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003ch2 id=\"features-of-modern-go-agent-sdks\"\u003eFeatures of Modern Go Agent SDKs\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eTool Integration\u003c/strong\u003e – Agents can call APIs, databases, or custom functions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMulti-Agent Workflows\u003c/strong\u003e – Agents can hand off tasks to other agents.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMemory\u003c/strong\u003e – Store and recall conversation history.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStreaming\u003c/strong\u003e – Get partial responses in real 
time.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eConcurrency\u003c/strong\u003e – Use Go\u0026rsquo;s goroutines for parallel tool calls.\u003c/li\u003e\n\u003c/ul\u003e\n\u003chr\u003e","title":"AI Agents in Go"},{"content":"Six-Dimension Art Evaluation Rubric Source paper: Learning-based Artificial Intelligence Artwork: Methodology Taxonomy and Quality Evaluation, ACM Computing Surveys (2024).\nOrigin The rubric was built from art vocabulary and traditional principles of painting analysis, then validated through a user study to confirm the weightings felt reasonable across different artwork types. The goal was a consistent, repeatable way to evaluate AI-generated artworks across different styles.\nThe Six Dimensions Beauty (50%) — The dominant criterion. Encompasses overall compositional harmony: balance, proportion, the arrangement of visual elements, and the pleasing relationship between subjects. An image can score well on every other dimension and still feel wrong if the composition is off. This is where Gestalt principles are most directly applied — does the whole hang together?\nColor (10%) — Palette coherence and emotional resonance. Not just whether colors are technically accurate, but whether the color relationships feel intentional and expressive — harmony, contrast, temperature, and the mood they collectively create.\nTexture (10%) — Surface quality and material plausibility. In AI-generated imagery this is particularly diagnostic: does skin feel like skin, fabric like fabric, stone like stone? Also covers handling quality — brushstroke character in painterly work, grain in photography-style renders.\nContent Detail (10%) — The richness and specificity of what\u0026rsquo;s depicted. Vague, generic content scores lower; precise, particular content scores higher. 
Captures whether the image has something specific to say visually, or whether it\u0026rsquo;s a generic assembly of shapes.\nLine (10%) — The clarity, expressiveness, and intentionality of linework and edges. In illustrative or painterly work this is about the quality of mark-making. In photorealistic work it\u0026rsquo;s about edge definition and whether contours feel decisive. Weak or confused linework is a common failure mode in AI generation.\nStyle (10%) — Consistency and distinctiveness of the artistic voice across the image. Does the work feel coherent — as if it came from a single artistic sensibility — or like a pastiche of different references pasted together? This dimension penalizes the generic, blended-everything quality that AI tools often default to.\nWhy Beauty Gets Half the Weight The researchers\u0026rsquo; user study found that compositional harmony was the single strongest predictor of how people overall judged an artwork\u0026rsquo;s quality. The other five dimensions tend to function as amplifiers or detractors of foundational compositional success. You can\u0026rsquo;t rescue a badly composed image with great color alone.\nPractical Use The rubric works as a structured critique checklist when evaluating AI outputs — instead of asking \u0026ldquo;is this good?\u0026rdquo;, you ask six specific questions. 
It also works as a prompting guide: if you know which dimension is weak in a generation, you can target your next prompt specifically at that dimension (e.g., \u0026ldquo;stronger sense of light source and shadow\u0026rdquo; targets beauty; \u0026ldquo;consistent loose brushwork throughout\u0026rdquo; targets style).\nSources Learning-based AI Artwork: Methodology Taxonomy and Quality Evaluation (ACM) Creative generation and evaluation system of art design (Springer) ","permalink":"https://knowledged.to/notes/ml/art-evaluation-rubric/","summary":"\u003ch1 id=\"six-dimension-art-evaluation-rubric\"\u003eSix-Dimension Art Evaluation Rubric\u003c/h1\u003e\n\u003cp\u003eSource paper: \u003ca href=\"https://dl.acm.org/doi/10.1145/3698105\"\u003e\u003cem\u003eLearning-based Artificial Intelligence Artwork: Methodology Taxonomy and Quality Evaluation\u003c/em\u003e\u003c/a\u003e, ACM Computing Surveys (2024).\u003c/p\u003e\n\u003ch2 id=\"origin\"\u003eOrigin\u003c/h2\u003e\n\u003cp\u003eThe rubric was built from \u003cem\u003eart vocabulary\u003c/em\u003e and traditional principles of painting analysis, then validated through a user study to confirm the weightings felt reasonable across different artwork types. The goal was a consistent, repeatable way to evaluate AI-generated artworks across different styles.\u003c/p\u003e\n\u003ch2 id=\"the-six-dimensions\"\u003eThe Six Dimensions\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eBeauty (50%)\u003c/strong\u003e — The dominant criterion. Encompasses overall compositional harmony: balance, proportion, the arrangement of visual elements, and the pleasing relationship between subjects. An image can score well on every other dimension and still feel wrong if the composition is off. 
This is where Gestalt principles are most directly applied — does the whole hang together?\u003c/p\u003e","title":"Six-Dimension Art Evaluation Rubric"},{"content":"Gestalt Principles Gestalt principles are a set of rules from psychology describing how the human mind naturally organizes visual information into meaningful wholes rather than perceiving a collection of separate parts. The name comes from the German word Gestalt, meaning \u0026ldquo;shape\u0026rdquo; or \u0026ldquo;form,\u0026rdquo; and the core idea is captured in the phrase: the whole is greater than the sum of its parts.\nThe Main Principles Proximity — elements that are close together are perceived as belonging to the same group.\nSimilarity — elements that look alike (same color, shape, size) are grouped together mentally.\nContinuity — the eye naturally follows lines and curves, preferring smooth, continuous paths over abrupt changes.\nClosure — the mind fills in missing information to complete a familiar shape, even when parts of it are absent.\nFigure/Ground — we instinctively separate a scene into a foreground subject (figure) and a background (ground).\nCommon Fate — elements moving in the same direction are perceived as a group.\nSymmetry — symmetrical compositions feel stable and orderly; the mind prefers balanced arrangements.\nOrigin Gestalt principles were developed in the early 20th century by German psychologists Max Wertheimer, Kurt Koffka, and Wolfgang Köhler.\nApplications They remain foundational in graphic design, UI/UX, and visual art. 
In AI research, Gestalt principles are now being built directly into aesthetic evaluation models, because compositions that respect these principles tend to feel more coherent and visually satisfying to human viewers.\n","permalink":"https://knowledged.to/notes/psychology/gestalt-principles/","summary":"\u003ch1 id=\"gestalt-principles\"\u003eGestalt Principles\u003c/h1\u003e\n\u003cp\u003eGestalt principles are a set of rules from psychology describing how the human mind naturally organizes visual information into meaningful wholes rather than perceiving a collection of separate parts. The name comes from the German word \u003cem\u003eGestalt\u003c/em\u003e, meaning \u0026ldquo;shape\u0026rdquo; or \u0026ldquo;form,\u0026rdquo; and the core idea is captured in the phrase: \u003cstrong\u003ethe whole is greater than the sum of its parts\u003c/strong\u003e.\u003c/p\u003e\n\u003ch2 id=\"the-main-principles\"\u003eThe Main Principles\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eProximity\u003c/strong\u003e — elements that are close together are perceived as belonging to the same group.\u003c/p\u003e","title":"Gestalt Principles"},{"content":"Rubric: Meaning and Origin A rubric is a scoring guide or evaluation framework that breaks down quality into specific, defined criteria. It provides a structured way to assess something by listing what to look for and, often, how much weight each criterion carries — rather than relying on a vague overall impression.\nIn everyday use, rubrics appear most commonly in education (e.g., grading rubrics for essays) and in evaluation contexts where consistent, transparent judgment is needed.\nOrigin The word comes from the Latin rubrica, meaning \u0026ldquo;red ochre\u0026rdquo; or \u0026ldquo;red earth.\u0026rdquo; In medieval manuscripts, scribes used red ink to write headings, titles, and instructional text — these red-lettered sections were called rubrics. 
Over time the term shifted from referring to the physical red markings to meaning any set of rules, headings, or guiding instructions, and eventually settled into its modern sense of a structured evaluation guide.\n","permalink":"https://knowledged.to/notes/vocabulary/rubric/","summary":"\u003ch1 id=\"rubric-meaning-and-origin\"\u003eRubric: Meaning and Origin\u003c/h1\u003e\n\u003cp\u003eA \u003cstrong\u003erubric\u003c/strong\u003e is a scoring guide or evaluation framework that breaks down quality into specific, defined criteria. It provides a structured way to assess something by listing what to look for and, often, how much weight each criterion carries — rather than relying on a vague overall impression.\u003c/p\u003e\n\u003cp\u003eIn everyday use, rubrics appear most commonly in education (e.g., grading rubrics for essays) and in evaluation contexts where consistent, transparent judgment is needed.\u003c/p\u003e","title":"Rubric: Meaning and Origin"},{"content":"LLM as Judge Using a language model to evaluate the outputs of another model (or itself) instead of relying on humans or rigid automated metrics like BLEU/ROUGE/exact-match. Give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.\nWhy it exists For open-ended generation — chat answers, code explanations, summaries, agent traces — string-overlap metrics don\u0026rsquo;t capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on a lot of tasks, so they\u0026rsquo;re now the default evaluator in MT-Bench, Chatbot Arena, G-Eval, and most internal eval pipelines.\nIt\u0026rsquo;s also the \u0026ldquo;AI feedback\u0026rdquo; half of RLAIF and Constitutional AI — the judge produces the preference signal that would otherwise come from a human, which then feeds into DPO or GRPO. 
This is why it shows up next to ELO: Chatbot Arena runs pairwise LLM (and human) judgments and uses Elo to convert win-rates into a ranking.\nCommon shapes Single-answer grading — judge scores one response against a rubric (1–10, pass/fail, criterion-wise). Pairwise — judge picks A or B given the same prompt. Cleaner signal, pairs well with Elo. Reference-based vs reference-free — with or without a gold answer to compare against. Known biases Position bias — favors the first option in pairwise. Verbosity bias — longer answers score higher. Self-preference bias — a model rates its own outputs higher. Style over substance — confident, well-formatted, wrong answers beat hedged correct ones. Mitigations Swap positions and average across both orderings. Use a judge that\u0026rsquo;s stronger than the models being evaluated. Force the judge to produce chain-of-thought reasoning before the score. Ensemble multiple judges. Calibrate against a small human-labeled set so you know how much to trust the numbers. Mental model LLM-as-judge is a cheap, scalable, biased estimator of human preference. Useful as a gradient during iteration; dangerous if you treat its absolute scores as ground truth.\nRelated ELO scoring for AI evaluation (Chatbot Arena ranking) RLAIF, Constitutional AI DPO, GRPO (consume preference signals the judge produces) MT-Bench, G-Eval ","permalink":"https://knowledged.to/ai/concepts/llm-as-judge/","summary":"\u003ch1 id=\"llm-as-judge\"\u003eLLM as Judge\u003c/h1\u003e\n\u003cp\u003eUsing a language model to evaluate the outputs of another model (or itself) instead of relying on humans or rigid automated metrics like BLEU/ROUGE/exact-match. 
Give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.\u003c/p\u003e\n\u003ch2 id=\"why-it-exists\"\u003eWhy it exists\u003c/h2\u003e\n\u003cp\u003eFor open-ended generation — chat answers, code explanations, summaries, agent traces — string-overlap metrics don\u0026rsquo;t capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on a lot of tasks, so they\u0026rsquo;re now the default evaluator in MT-Bench, Chatbot Arena, G-Eval, and most internal eval pipelines.\u003c/p\u003e","title":"LLM as Judge"},{"content":"Creativity meter The problem today at BuddyHQ is that we are aiming for output quality that is far better than what the ChatGPTs and Claudes provide. This requires the Creative Director of BuddyHQ to ask questions that enrich the understanding of the needs of our users. This means we need to ask our users the right questions when they have not provided enough details for BuddyHQ to operate.\nThis is where we face the challenge. For some users this will be alright, because they are looking to use the final output. For others, this may be a distraction, as they are looking to get to the output quickly and refine from there.\nHow do we choose which path to take? Can we introduce a creativity meter that generates the assets based on creativity meter input?\nMeter starting point - Follow my instructions, when not clear, ask Meter mid point - Follow my instructions, when not clear, make assumptions and proceed Meter end point - Take idea from the user, be creative in all choices ","permalink":"https://knowledged.to/notes/engineering/buddyhq-creativity-meter/","summary":"\u003ch1 id=\"creativity-meter\"\u003eCreativity meter\u003c/h1\u003e\n\u003cp\u003eThe problem today at BuddyHQ is that we are aiming for output quality that is far better than what the ChatGPTs and Claudes provide. 
This requires the Creative Director of BuddyHQ to ask questions that enrich the understanding of the needs of our users. This means we need to ask our users the right questions when they have not provided enough details for BuddyHQ to operate.\u003c/p\u003e","title":"BuddyHQ Creativity Meter — Idea"},{"content":"In harness engineering, a commitment gate is a point in the workflow where the agent must prove a change meets defined criteria before it can be merged or committed. It’s a quality-control checkpoint that turns “looks good” into an enforceable decision, usually through tests, lint rules, architectural checks, or explicit approval rules.[1][2]\nWhat it does A commitment gate is meant to stop bad or incomplete work from being accepted just because the agent produced it. In harness-engineering terms, this fits the broader pattern of using constraints, feedback loops, and quality gates to make AI agents reliable. OpenAI’s harness-engineering write-up emphasizes that the real job is designing environments and feedback loops so agents can work safely and consistently, rather than relying on humans to catch every mistake.[2][1]\nIn practice Typical commitment gates can include things like:\nCI checks that must pass before merge. Architectural or policy linters. Verification steps that compare the output against a spec. Human review for high-risk actions or edge cases. The idea is that the agent can keep moving fast, but only within boundaries that make the result trustworthy.[1][2]\nWhy it matters Without a commitment gate, an agent may generate code that is syntactically valid but still wrong, unsafe, or architecturally inconsistent. 
With a gate, the system only “commits” when the work satisfies the chosen standards, which is the core reliability mechanism in harness engineering.[2][1]\nSources [1] Harness engineering: leveraging Codex in an agent-first \u0026hellip; https://openai.com/index/harness-engineering/ [2] Harness Engineering for AI Coding Agents: Constraints \u0026hellip; https://www.augmentcode.com/guides/harness-engineering-ai-coding-agents [3] What is harness engineering? - SIG https://www.softwareimprovementgroup.com/blog/what-is-harness-engineering/ [4] Harness Engineering: What It Means for QA https://testcollab.com/blog/harness-engineering [5] Harness Engineering as Categorical Architecture https://arxiv.org/abs/2605.12239 [6] Cloud Cost Management feature: Commitment Orchestrator https://www.harness.io/products/cloud-cost-management/commitment-orchestrator [7] Signed Commits using Git Experience https://developer.harness.io/docs/platform/git-experience/signed-commits-harness [8] Agent Harness Engineering https://addyosmani.com/blog/agent-harness-engineering/ [9] Harness Commitment Orchestrator: A Modernized FinOps \u0026hellip; https://www.harness.io/blog/harness-commitment-orchestrator-a-modernized-finops-experience [10] Harness \u0026amp; SonarQube Integration | Code Quality \u0026amp; Security https://www.sonarsource.com/integrations/harness/ [11] Harness Engineering with Nothing but Markdown https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b [12] Harness Engineering https://engineering.harness.io [13] Commitment Orchestrator Events APIs https://apidocs.harness.io/commitment-orchestrator-events-apis [14] ai-boost/awesome-harness-engineering https://github.com/ai-boost/awesome-harness-engineering [15] Harness - APIs.io 
https://apis.io/providers/harness/\n","permalink":"https://knowledged.to/ai/concepts/commitment-gate/","summary":"\u003cp\u003eIn harness engineering, a \u003cstrong\u003ecommitment gate\u003c/strong\u003e is a point in the workflow where the agent must prove a change meets defined criteria before it can be merged or committed. It’s a quality-control checkpoint that turns “looks good” into an enforceable decision, usually through tests, lint rules, architectural checks, or explicit approval rules.[1][2]\u003c/p\u003e\n\u003ch2 id=\"what-it-does\"\u003eWhat it does\u003c/h2\u003e\n\u003cp\u003eA commitment gate is meant to stop bad or incomplete work from being accepted just because the agent produced it. In harness-engineering terms, this fits the broader pattern of using constraints, feedback loops, and quality gates to make AI agents reliable. OpenAI’s harness-engineering write-up emphasizes that the real job is designing environments and feedback loops so agents can work safely and consistently, rather than relying on humans to catch every mistake.[2][1]\u003c/p\u003e","title":"Commitment Gate (Harness Engineering)"},{"content":"Commitment Gate (Harness Engineering) A commitment gate is a verification checkpoint between an agent producing a candidate output and that output being \u0026ldquo;locked in\u0026rdquo; — emitted as a final answer, written to disk, or used to call an irreversible tool. Instead of running skills along a fixed script and fusing results at the end (where errors propagate silently into late fusion), a harness with commitment gates pauses at each commit point, runs relative checks, and either lets the result through or triggers a targeted recovery loop.\nCanonical formulation (Affordance Agent Harness) A-Harness gates commitments with three relative checks:\nCross-skill agreement — do independent skills/tools converge on the same answer? Disagreement signals underdetermined evidence, not noise to average away. 
Cross-scale stability — does the answer hold under perturbations of scale, framing, or granularity? Brittleness usually means the model latched onto a spurious feature. Evidence sufficiency — is there enough grounding to commit, or is the agent extrapolating? A failed gate doesn\u0026rsquo;t kill the trajectory — it routes back to the planner/router to gather more evidence, retry a different skill, or escalate. Errors become control-flow signals at the gate, not silent corruption that a final judge has to untangle later.\nWhere it sits in the harness vocabulary Phase gates (e.g. in Plan-Execute-Verify loops) fire at workflow boundaries — coarse-grained. Commitment gates fire at every point where the agent would otherwise burn down optionality — fine-grained. The two compose: PEV gives you outer phase structure, commitment gates enforce verification within a phase before any irreversible step.\nAnalogies for backend / infra mental model The agent equivalent of a CI check that blocks merge. A Temporal activity that won\u0026rsquo;t transition state until a precondition holds. A deterministic outer-harness constraint enforcing what the probabilistic inner model can\u0026rsquo;t guarantee on its own. Why it matters LLM compliance with instructions is probabilistic, not deterministic. Commitment gates are the deterministic outer layer that turns probabilistic reasoning into dependable action by refusing to commit on weak evidence and forcing targeted retries instead of silent failure propagation.\nReferences Affordance Agent Harness: Verification-Gated Skill Orchestration (arXiv) — origin of the three-check formulation. Augment Code, Harness Engineering for AI Coding Agents — broader harness-engineering framing, deterministic outer constraints vs. probabilistic inner model. Adnan Masood, Agent Harness Engineering — The Rise of the AI Control Plane — PEV phase-gate context. 
","permalink":"https://knowledged.to/notes/ml/commitment-gate/","summary":"\u003ch1 id=\"commitment-gate-harness-engineering\"\u003eCommitment Gate (Harness Engineering)\u003c/h1\u003e\n\u003cp\u003eA \u003cstrong\u003ecommitment gate\u003c/strong\u003e is a verification checkpoint between an agent producing a candidate output and that output being \u0026ldquo;locked in\u0026rdquo; — emitted as a final answer, written to disk, or used to call an irreversible tool. Instead of running skills along a fixed script and fusing results at the end (where errors propagate silently into late fusion), a harness with commitment gates pauses at each commit point, runs relative checks, and either lets the result through or triggers a targeted recovery loop.\u003c/p\u003e","title":"Commitment Gate (Harness Engineering)"},{"content":"Defense-in-Depth Defense-in-depth is a security strategy that uses multiple layers of defenses so that if one layer fails, others still protect the system. The idea comes from military fortification — castles didn\u0026rsquo;t rely on a single wall; they had moats, outer walls, inner walls, keeps, and so on. Breaching one didn\u0026rsquo;t mean the attacker won.\nIn Information Security This translates to combining different controls rather than depending on any single one. A typical stack might include:\nPerimeter defenses — firewalls, network segmentation Endpoint protection — antivirus, host hardening Identity controls — authentication, MFA, least-privilege access Application-level safeguards — input validation, secure coding Data protections — encryption at rest and in transit Monitoring and detection — logging, intrusion detection, SIEM Operational practices — patching, backups, incident response Core Assumption Any single control will eventually fail or be bypassed — through bugs, misconfiguration, insider action, or a novel attack. 
Layering means an attacker has to defeat several independent mechanisms in sequence, which:\nRaises the cost of attack Increases the chance of detection Limits blast radius when something does go wrong What It Is Not Not stacking redundant copies of the same control (ten firewalls in a row) Not an excuse for weak individual layers The layers should be diverse — different mechanisms addressing different failure modes — and each should be reasonably strong on its own.\nBeyond IT The concept applies broadly: safety engineering, nuclear plants, aviation, and everyday systems work all use the same principle. For example, in a webhook processing pipeline:\nIdempotent webhook handlers (application layer) Database constraints (data layer) Monitoring and alerting (detection layer) Each layer catches failures the others might miss.\n","permalink":"https://knowledged.to/notes/security/defense-in-depth/","summary":"\u003ch1 id=\"defense-in-depth\"\u003eDefense-in-Depth\u003c/h1\u003e\n\u003cp\u003eDefense-in-depth is a security strategy that uses multiple layers of defenses so that if one layer fails, others still protect the system. The idea comes from military fortification — castles didn\u0026rsquo;t rely on a single wall; they had moats, outer walls, inner walls, keeps, and so on. Breaching one didn\u0026rsquo;t mean the attacker won.\u003c/p\u003e\n\u003ch2 id=\"in-information-security\"\u003eIn Information Security\u003c/h2\u003e\n\u003cp\u003eThis translates to combining different controls rather than depending on any single one. 
A typical stack might include:\u003c/p\u003e","title":"Defense-in-Depth"},{"content":"BuddyHQ Desktop App — Architecture Decision Date: 2026-04-28\nStack Being Wrapped appui: React 18 + Vite 6 SPA — FabricJS canvas, Revideo video editor, SSE chat, Zustand, TanStack Query, private packages (@buddyhq/richtext-editor, @buddyhq/video-renderer) appwebexpms: Go 1.25 BFF at app.buddyhq.ai — cookie-based JWT auth, SSE proxied to mwms, 16 downstream service clients, 62 handler files Options Evaluated PWA — appui reuse 97%, appwebexpms 100% Days to ship. No bundle overhead. But: not a real desktop app, no system tray/global shortcuts/native menus, not App Store distributable.\nElectron + appui (cloud BFF) — RECOMMENDED FOR V1 appui reuse ~90%, appwebexpms 100% unchanged. Bundle ~200 MB. 2–4 weeks to v1.\nChromium renderer eliminates all WebKit risk for FabricJS and Revideo appwebexpms needs zero changes — desktop is just another browser client Cookie JWT + SSE work natively in Electron IPC layer is TypeScript (no new language for Go team) electron-builder, electron-updater, code signing all solved Cons: large bundle, memory ~300–600 MB at idle, Mac App Store is painful Minimal structure: electron/main.ts (BrowserWindow + IPC), electron/preload.ts (contextBridge), appui/src unchanged + platform.ts shim for native vs web APIs.\nTauri 2.0 + appui (cloud BFF) — V2 if mobile matters appui reuse ~85%, appwebexpms 100% unchanged. Bundle ~15 MB. 4–8 weeks to v1.\nMuch smaller bundle, better security model, Tauri 2.0 supports iOS + Android Serious risk: WebKit ≠ Chromium — FabricJS, Revideo, and @buddyhq/video-renderer are untested on WebKit. Budget 2–3 weeks for WebKit fixes. SSE buffering issues on WebKit require careful testing Cookie SameSite config for app.buddyhq.ai needs Tauri http plugin setup Rust IPC (Tauri commands) is a new language for the Go team Tauri + Local Go Sidecar — NOT for v1 appwebexpms cannot be reused as-is (depends on Valkey, GCS, 16 K8s microservices). 
Would need a new appdesktopbff stripping cloud dependencies. Doubles maintenance surface. 4–6 months.\nNative SwiftUI — NOT for v1 Zero appui reuse. Full rewrite: canvas editor, image editor (FabricJS), video timeline (Revideo), SSE chat, rich text. macOS only. 12–24 months.\nDecision Electron for v1. appwebexpms unchanged — desktop app points at https://app.buddyhq.ai.\nIf mobile (iOS/Android) lands on the 12-month roadmap, switch to Tauri 2.0 from the start and absorb the WebKit debugging cost early.\nappwebexpms Changes for v1 None. If offline support needed later: add SQLite cache in Electron main process (better-sqlite3). If desktop-specific APIs needed: add /desktop/* route namespace to appwebexpms.\n","permalink":"https://knowledged.to/notes/engineering/buddyhq-desktop-app-architecture-decision/","summary":"\u003ch1 id=\"buddyhq-desktop-app--architecture-decision\"\u003eBuddyHQ Desktop App — Architecture Decision\u003c/h1\u003e\n\u003cp\u003eDate: 2026-04-28\u003c/p\u003e\n\u003ch2 id=\"stack-being-wrapped\"\u003eStack Being Wrapped\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eappui: React 18 + Vite 6 SPA — FabricJS canvas, Revideo video editor, SSE chat, Zustand, TanStack Query, private packages (@buddyhq/richtext-editor, @buddyhq/video-renderer)\u003c/li\u003e\n\u003cli\u003eappwebexpms: Go 1.25 BFF at app.buddyhq.ai — cookie-based JWT auth, SSE proxied to mwms, 16 downstream service clients, 62 handler files\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"options-evaluated\"\u003eOptions Evaluated\u003c/h2\u003e\n\u003ch3 id=\"pwa--appui-reuse-97-appwebexpms-100\"\u003ePWA — appui reuse 97%, appwebexpms 100%\u003c/h3\u003e\n\u003cp\u003eDays to ship. No bundle overhead. 
But: not a real desktop app, no system tray/global shortcuts/native menus, not App Store distributable.\u003c/p\u003e","title":"BuddyHQ Desktop App — Architecture Decision"},{"content":"BuddyHQ Desktop App — Architecture Analysis Context Codebase analyzed: /Users/subhash/code/bhq (polyrepo-in-a-folder, independent deployable services)\nKey components evaluated Component What it is Key constraints appui React 18 + Vite 6 SPA FabricJS canvas, Revideo video editor, SSE chat, private packages (@buddyhq/richtext-editor, @buddyhq/video-renderer) appwebexpms Go 1.25 BFF at app.buddyhq.ai Cookie-based JWT auth, SSE proxy to mwms, 16 downstream service clients appui tech stack: React 18, TypeScript, Vite 6, Tailwind CSS, Radix UI, shadcn/ui, Zustand 5, TanStack Query 5, React Router 7, FabricJS, Revideo, Framer Motion, react-dnd, react-moveable.\nappwebexpms: Go, chi router, cookie JWT auth, Valkey (Redis-compat), GCS, 62 API handler files, SSE proxied to mwms with FlushInterval=-1.\nOne existing native experiment: knowledged-mac (SwiftUI macOS utility for the knowledged KB service — not a full app port).\nOption 1 — PWA Add manifest.json + service worker to appui.\nappui reuse: ~97% | appwebexpms reuse: 100% Bundle: ~0 MB extra Pros: Zero new infra, ships in days, auto-updates, cross-platform Cons: Not a real desktop app. No system tray, global shortcuts, native notifications, menubar, OS file pickers. Not App Store distributable. No offline support. Verdict: Good for an installable shortcut, not a first-class desktop product. Option 2 — Electron + appui (cloud BFF) ✅ RECOMMENDED FOR V1 Wrap appui in an Electron shell. BrowserWindow loads the Vite build. 
Electron uses Chromium — cookies, SSE, REST all work natively.\nappui reuse: ~90% | appwebexpms reuse: 100% Bundle: ~150–250 MB Architecture:\n[Electron Main Process] BrowserWindow; Native menus, tray; Auto-updater (electron-updater); IPC bridge via contextBridge; Local file I/O via ipcMain\n[Renderer = appui] Same React SPA; Talks to app.buddyhq.ai; Cookie auth works natively; SSE works natively; All private packages included\nPros:\nChromium renderer = zero compatibility risk for FabricJS, Revideo, private packages appwebexpms needs zero changes — desktop app is just another browser client Native menus, tray, global shortcuts, OS file pickers via IPC electron-updater, electron-builder, code signing, notarization all solved macOS + Windows + Linux from one codebase Cookie-based JWT works via Electron session API IPC layer is TypeScript — no new language for Go team Cons:\nBundle ~200 MB (Chromium included) Memory ~300–600 MB at idle Mac App Store distribution is painful (sandboxing, notarization) Security: nodeIntegration must be disabled, contextBridge required appui changes needed:\nAdd platform.ts utility detecting window.electron Route native file picker vs through IPC Deep-link handling via app.setAsDefaultProtocolClient Minimal structure:\nappui-desktop/ electron/main.ts — BrowserWindow, auto-updater, IPC handlers. 
electron/preload.ts — contextBridge exposing safe IPC APIs. electron/tray.ts — System tray icon + menu. src/ — appui/src (pnpm workspace or copy). package.json — electron, electron-builder, electron-updater\nTime to v1: 2–4 weeks.\nOption 3 — Tauri 2.0 + appui (cloud BFF) Tauri replaces Electron\u0026rsquo;s Chromium with the OS native WebView (WebKit on macOS/iOS, WebView2 on Windows). Rust binary handles the native shell.\nappui reuse: ~85% | appwebexpms reuse: 100% Bundle: ~10–20 MB Pros:\nBundle ~10–20 MB (no bundled Chromium) Memory ~100–200 MB at idle Tauri 2.0 targets macOS + Windows + Linux + iOS + Android — mobile path Better security model (capabilities-based permissions) Mac App Store friendly Faster startup Cons (serious for this stack):\nWebKit on macOS ≠ Chromium. FabricJS, Revideo, @buddyhq/video-renderer are tested against Chrome. Canvas2D, WebGL, OffscreenCanvas differences on WebKit are real. Budget 2–3 weeks of WebKit-specific fixes. SSE on WebKit has historically had buffering issues (fixed in recent Safari but Tauri pins specific WebKit versions). useSSEChat hook needs careful testing. Cookie-based JWT: Set-Cookie with SameSite requires Tauri http plugin + setCookies configuration for app.buddyhq.ai. Non-trivial. Rust learning curve for Go-heavy team (IPC = Tauri commands in Rust). @buddyhq/video-renderer may use WebCodecs/Workers in Chromium-specific ways. Time to v1: 4–8 weeks (+ WebKit debugging).\nBest for: macOS-first, bundle size is a marketing concern, mobile (iOS/Android) is on 12-month roadmap.\nOption 4 — Tauri 2.0 + appui + Local Go Sidecar Tauri + a slimmed-down appwebexpms-derived Go binary as a Tauri sidecar. 
App talks to localhost instead of app.buddyhq.ai.\nappui reuse: ~85% | appwebexpms reuse: ~30–40% (subset of handlers) Pros:\nTrue offline support for some features Faster response times (localhost) Air-gapped environments Cons:\nappwebexpms cannot be used as-is: depends on Valkey, GCS, 16 downstream microservices in K8s. A stripped appdesktopbff must be created — significant new codebase. Cross-compilation toolchain for Go sidecar (GOOS=darwin/windows/linux GOARCH=arm64/amd64) Sidecar process lifecycle complexity (crash recovery, port conflicts, startup sequencing) Auth model changes: cloud BFF uses HttpOnly cookie JWT; local BFF needs Keychain/Credential Store Doubles maintenance surface — every appwebexpms API change must be mirrored Time to v1: 4–6 months. Not recommended for v1.\nOption 5 — Native macOS SwiftUI Build from scratch in SwiftUI consuming appwebexpms APIs directly. Existing precedent: knowledged-mac.\nappui reuse: 0% | appwebexpms reuse: 100% (as API server, unchanged) Pros:\nBest native experience (Handoff, Spotlight, Share Sheet, VoiceOver free) Bundle ~5–15 MB Mac App Store friendly by default Cons:\nZero code reuse from appui — canvas, image editor (FabricJS), video timeline (Revideo), SSE chat, rich text editor all must be rebuilt macOS only (Windows requires separate stack) Team expertise is React + Go, not Swift 12–24+ month project for full feature set Decision Matrix PWA Electron Tauri Tauri+Sidecar SwiftUI appui reuse ~97% ~90% ~85% ~85% 0% appwebexpms reuse 100% 100% 100% 30-40% 100% (API) Bundle size ~0 MB ~200 MB ~15 MB ~25 MB ~10 MB Windows support ✓ ✓ ✓ ✓ ✗ FabricJS/Revideo risk None None Medium Medium N/A SSE risk None None Low Low Medium Cookie auth risk None None Low High Low Time to v1 Days 2-4 weeks 4-8 weeks 4-6 months 12-24 months Native OS integration Poor Good Good Good Excellent Mobile path ✗ ✗ ✓ ✓ ✗ Team expertise fit ✓✓ ✓✓ ✓ ✓ ✗ Recommendation Electron for v1. 
Evaluate Tauri migration for v2 if mobile becomes a priority.\nRationale:\nappui IS the product — canvas, video editor, image editor, SSE chat are tested against Chromium. Electron eliminates WebKit risk. appwebexpms needs zero changes — desktop app is another browser client. Time to market: 2–4 weeks vs 4–8 weeks for Tauri. Go team doesn\u0026rsquo;t need to learn Rust — Electron IPC is TypeScript. Bundle size concern is real but accepted by users of Figma, Notion, Slack, Linear, VS Code. appwebexpms for v1: No changes needed. Desktop app points at https://app.buddyhq.ai. If later needed:\nOffline support: Add SQLite cache layer (better-sqlite3) in Electron main process for recent conversations/assets. Cloud BFF still handles writes. Desktop-specific APIs: Add /desktop/* route namespace to appwebexpms for features needing desktop context (local file path handling). If mobile is on the 12-month roadmap: Use Tauri 2.0 from the start. Budget 2–3 weeks extra for FabricJS/Revideo WebKit compatibility work. 
Tauri 2.0\u0026rsquo;s iOS/Android target from the same frontend codebase is a genuine competitive advantage.\n","permalink":"https://knowledged.to/notes/engineering/buddyhq-desktop-app-architecture/","summary":"\u003ch1 id=\"buddyhq-desktop-app--architecture-analysis\"\u003eBuddyHQ Desktop App — Architecture Analysis\u003c/h1\u003e\n\u003ch2 id=\"context\"\u003eContext\u003c/h2\u003e\n\u003cp\u003eCodebase analyzed: /Users/subhash/code/bhq (polyrepo-in-a-folder, independent deployable services)\u003c/p\u003e\n\u003ch3 id=\"key-components-evaluated\"\u003eKey components evaluated\u003c/h3\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eComponent\u003c/th\u003e\n          \u003cth\u003eWhat it is\u003c/th\u003e\n          \u003cth\u003eKey constraints\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e\u003ccode\u003eappui\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003eReact 18 + Vite 6 SPA\u003c/td\u003e\n          \u003ctd\u003eFabricJS canvas, Revideo video editor, SSE chat, private packages (@buddyhq/richtext-editor, @buddyhq/video-renderer)\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e\u003ccode\u003eappwebexpms\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003eGo 1.25 BFF at app.buddyhq.ai\u003c/td\u003e\n          \u003ctd\u003eCookie-based JWT auth, SSE proxy to mwms, 16 downstream service clients\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003ccode\u003eappui\u003c/code\u003e tech stack: React 18, TypeScript, Vite 6, Tailwind CSS, Radix UI, shadcn/ui, Zustand 5, TanStack Query 5, React Router 7, FabricJS, Revideo, Framer Motion, react-dnd, react-moveable.\u003c/p\u003e","title":"BuddyHQ Desktop App — Architecture Analysis"},{"content":"Open-weight models Open-weight models are AI models where the trained parameters (weights) are 
made publicly available so others can download, run, and often fine-tune them locally.\nThe core idea (plain terms) When an AI model is trained, it learns billions (or trillions) of numbers—these are its weights.\nAn open-weight model gives you access to those numbers.\nThat means you can:\nRun the model on your own machine or server Fine-tune it with your own data Inspect or modify how it behaves (to some extent) How this differs from other terms 1. Open-weight vs Closed models Open-weight: You get the weights Example: LLaMA 2, Mistral 7B Closed model: You only get API access Example: GPT-4 With closed models, you use them—but you don’t own or inspect them.\n2. Open-weight vs Open-source This is where people get sloppy—don’t.\nOpen-weight → weights are available Open-source → weights plus training data, code, and full transparency Most “open” AI models today are actually open-weight, not fully open-source.\nWhy open-weight models matter They give you control and flexibility:\nRun offline (privacy-sensitive use cases) Lower cost at scale (no API fees) Customize behavior deeply (fine-tuning, LoRA, etc.) Experiment freely (research, tooling, edge deployments) This is why models like Gemma are popular with developers.\nThe catch (don’t ignore this) Open-weight doesn’t automatically mean:\nFully transparent Free for all use cases (licenses can restrict usage) Easy to run (hardware requirements can be heavy) So before jumping in, always check:\nLicense terms Model size vs your hardware Ecosystem support (e.g., MLX, Ollama, etc.) 
Simple analogy Closed model = Streaming on Netflix Open-weight model = Downloading the movie file Open-source model = Getting the movie plus the script, raw footage, and editing tools If you’re planning to use something like Gemma on your Mac (e.g., with MLX), open-weight models are exactly what make that possible.\n","permalink":"https://knowledged.to/notes/ml/open-weight-models/","summary":"\u003ch2 id=\"open-weight-models\"\u003eOpen-weight models\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eOpen-weight models\u003c/strong\u003e are AI models where the \u003cstrong\u003etrained parameters (weights)\u003c/strong\u003e are made publicly available so others can download, run, and often fine-tune them locally.\u003c/p\u003e\n\u003chr\u003e\n\u003ch3 id=\"the-core-idea-plain-terms\"\u003eThe core idea (plain terms)\u003c/h3\u003e\n\u003cp\u003eWhen an AI model is trained, it learns billions (or trillions) of numbers—these are its \u003cstrong\u003eweights\u003c/strong\u003e.\u003cbr\u003e\nAn \u003cem\u003eopen-weight\u003c/em\u003e model gives you access to those numbers.\u003c/p\u003e\n\u003cp\u003eThat means you can:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eRun the model on your own machine or server\u003c/li\u003e\n\u003cli\u003eFine-tune it with your own data\u003c/li\u003e\n\u003cli\u003eInspect or modify how it behaves (to some extent)\u003c/li\u003e\n\u003c/ul\u003e\n\u003chr\u003e\n\u003ch3 id=\"how-this-differs-from-other-terms\"\u003eHow this differs from other terms\u003c/h3\u003e\n\u003ch4 id=\"1-open-weight-vs-closed-models\"\u003e1. 
Open-weight vs Closed models\u003c/h4\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eOpen-weight\u003c/strong\u003e: You get the weights\n\u003cul\u003e\n\u003cli\u003eExample: LLaMA 2, Mistral 7B\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eClosed model\u003c/strong\u003e: You only get API access\n\u003cul\u003e\n\u003cli\u003eExample: GPT-4\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWith closed models, you \u003cem\u003euse\u003c/em\u003e them—but you don’t \u003cem\u003eown or inspect\u003c/em\u003e them.\u003c/p\u003e","title":"Open-weight Models"},{"content":"Cross-Entropy in AI Cross-entropy is a concept from Information Theory that is widely used in machine learning to measure how different two probability distributions are. In AI, it is most commonly used as a loss function to evaluate how well a model’s predicted probabilities match the actual (true) labels. 🧠 Intuition Think of cross-entropy as a way to answer:\n“How surprised is the model when it sees the true answer?”\nIf the model assigns high probability to the correct answer → low surprise → low loss If the model assigns low probability to the correct answer → high surprise → high loss 📐 Formal Definition For a true distribution P and predicted distribution Q, cross-entropy is:\nH(P, Q) = - Σ P(x) * log(Q(x))\nIn classification (simplified case): If the true label is one-hot encoded:\nLoss = -log(predicted_probability_of_true_class)\n🔍 Example Suppose you’re doing a classification task with 3 classes:\nTrue label: [0, 1, 0] (Class 2 is correct) Case 1: Good prediction Predicted: [0.1, 0.8, 0.1] Loss = -log(0.8) ≈ 0.22 (low)\nCase 2: Bad prediction Predicted: [0.7, 0.2, 0.1] Loss = -log(0.2) ≈ 1.61 (high)\n👉 The worse the prediction, the higher the loss. 
⚙️ Why Cross-Entropy is Used Works naturally with probabilities Strongly penalizes confident wrong predictions Differentiable → ideal for gradient-based optimization Pairs well with softmax in classification models 🔗 Relationship to Other Concepts Entropy: Measures uncertainty in a distribution Cross-Entropy: Measures mismatch between two distributions KL(P || Q) = CrossEntropy(P, Q) - Entropy(P)\n🧩 Where You’ll See It Classification models (e.g., logistic regression, neural networks) Language models (predicting next word probabilities) Image classification tasks Any probabilistic prediction system 🧭 Quick Summary Cross-entropy measures how wrong a predicted probability distribution is Lower is better It’s the default loss function for most classification problems in AI ","permalink":"https://knowledged.to/notes/ml/cross-entropy-in-ai/","summary":"\u003ch2 id=\"cross-entropy-in-ai\"\u003eCross-Entropy in AI\u003c/h2\u003e\n\u003ch2 id=\"in-ai-it-is-most-commonly-used-as-a-loss-function-to-evaluate-how-well-a-models-predicted-probabilities-match-the-actual-true-labels\"\u003e\u003cstrong\u003eCross-entropy\u003c/strong\u003e is a concept from Information Theory that is widely used in \u003cstrong\u003emachine learning\u003c/strong\u003e to measure how different two probability distributions are.\nIn AI, it is most commonly used as a \u003cstrong\u003eloss function\u003c/strong\u003e to evaluate how well a model’s predicted probabilities match the actual (true) labels.\u003c/h2\u003e\n\u003ch2 id=\"-intuition\"\u003e🧠 Intuition\u003c/h2\u003e\n\u003cp\u003eThink of cross-entropy as a way to answer:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003e\u003cem\u003e“How surprised is the model when it sees the true answer?”\u003c/em\u003e\u003c/p\u003e","title":"Cross-Entropy in AI"},{"content":"Fine-Tuning Techniques for LLMs Fine-tuning techniques can be grouped along a few axes: what you optimize (full weights vs. 
small additions), what signal you train on (labels, instructions, preferences, rewards), and how the data is generated (human, synthetic, AI-judged).\nFull Fine-Tuning (FFT) Update every parameter in the model on a target dataset. Highest capacity, but expensive in memory and prone to catastrophic forgetting. Mostly reserved for smaller models or when you have lots of high-quality data and compute.\nSupervised Fine-Tuning (SFT) / Instruction Tuning Train on (prompt, ideal_response) pairs with standard cross-entropy. \u0026ldquo;Instruction tuning\u0026rdquo; is just SFT where the dataset is a mix of tasks phrased as instructions (FLAN, Alpaca, etc.). This is almost always step one in the post-training pipeline before any preference work.\nContinued Pretraining / Domain-Adaptive Pretraining Same objective as pretraining (next-token prediction) but on a domain corpus — code, legal, medical, your company\u0026rsquo;s docs. Done before SFT when you need to inject knowledge rather than behavior.\nParameter-Efficient Fine-Tuning (PEFT) Freeze the base model, train a small number of new or selected parameters. The dominant family in practice:\nLoRA — inject trainable low-rank matrices A and B such that ΔW = BA is added to frozen weights. Trains \u0026lt;1% of parameters with near-FFT quality. QLoRA — load the base in 4-bit (NF4) and train LoRA adapters on top. Lets you fine-tune models of up to roughly 65B parameters on a single 48 GB GPU. DoRA (Weight-Decomposed LoRA) — decomposes weights into magnitude and direction, applies LoRA only to direction. Closes more of the gap to FFT. Adapters (Houlsby, Pfeiffer) — small bottleneck MLPs inserted between transformer layers. They predate LoRA; LoRA mostly replaced them because its update can be merged into the base weights, so it adds no inference latency. Prefix Tuning — prepend trainable vectors to the keys/values at every layer. Prompt Tuning / Soft Prompts — only learn embeddings prepended to the input; cheapest, weakest. 
P-Tuning v2 — prefix tuning generalized across all layers, competitive with FFT on many tasks. IA³ — learn three vectors per layer that rescale keys, values, and FFN activations. Even fewer parameters than LoRA. BitFit — train only the bias terms. Surprisingly decent baseline. Preference / Alignment Fine-Tuning After SFT, you align the model to preferences over responses:\nRLHF (PPO) — train a reward model on human-ranked pairs, then use PPO to maximize that reward with a KL penalty against the SFT model. The original ChatGPT recipe. Complex, unstable, expensive. DPO (Direct Preference Optimization) — skips the reward model entirely; derives a closed-form loss directly on preference pairs. Much simpler and now the default for most open-source alignment. IPO, KTO, ORPO, SimPO — variants that fix specific DPO failure modes (overfitting, needing paired data, requiring a separate SFT stage, etc.). ORPO is notable for combining SFT and preference learning into one stage. GRPO (Group Relative Policy Optimization) — drops the value/critic network of PPO; instead samples a group of completions per prompt and uses their relative rewards as advantages. Used in DeepSeek-R1. Memory-efficient and works well when you have a verifiable reward signal. RLAIF / Constitutional AI — same loop as RLHF, but preferences come from an AI judge guided by a written constitution rather than human labelers. Reasoning / Verifiable-Reward Fine-Tuning The newer branch (DeepSeek-R1, OpenAI o-series style). Use RL (often GRPO) with rule-based, verifiable rewards — does the math answer match? does the code pass the tests? — to elicit long chain-of-thought without needing a learned reward model. Can be combined with distilling the resulting reasoning traces into smaller models.\nKnowledge Distillation Train a student to match a teacher\u0026rsquo;s outputs. 
In modern LLM practice this usually means SFT on the teacher\u0026rsquo;s generated responses (and increasingly its reasoning traces), rather than the original logit-matching formulation. The DeepSeek-R1 distilled variants are the canonical recent example.\nMulti-task and Mixture Fine-Tuning Train on a mixture of tasks/datasets simultaneously, often with task-specific prompt templates. T0, FLAN-T5, and most modern instruction-tuned models do this. Helps generalization but requires careful data balancing.\nTypical Modern Recipe For a chat model: pretraining → continued pretraining (optional) → SFT → DPO (or RLHF, or GRPO if reasoning-focused), with LoRA/QLoRA used at any stage where you want to save compute.\n","permalink":"https://knowledged.to/notes/ml/fine-tuning-techniques/","summary":"\u003ch1 id=\"fine-tuning-techniques-for-llms\"\u003eFine-Tuning Techniques for LLMs\u003c/h1\u003e\n\u003cp\u003eFine-tuning techniques can be grouped along a few axes: \u003cstrong\u003ewhat you optimize\u003c/strong\u003e (full weights vs. small additions), \u003cstrong\u003ewhat signal you train on\u003c/strong\u003e (labels, instructions, preferences, rewards), and \u003cstrong\u003ehow the data is generated\u003c/strong\u003e (human, synthetic, AI-judged).\u003c/p\u003e\n\u003ch2 id=\"full-fine-tuning-fft\"\u003eFull Fine-Tuning (FFT)\u003c/h2\u003e\n\u003cp\u003eUpdate every parameter in the model on a target dataset. Highest capacity, but expensive in memory and prone to catastrophic forgetting. Mostly reserved for smaller models or when you have lots of high-quality data and compute.\u003c/p\u003e","title":"Fine-Tuning Techniques for LLMs"},{"content":"Deterministic Graders (for LLM / AI Evaluation) Definition A deterministic grader is an evaluation function that produces the same result every time for the same input — no randomness, no LLM-in-the-loop judgment. 
You check the model\u0026rsquo;s output against a fixed, code-based rule.\nConcrete Examples Exact string match — \u0026ldquo;Does the output equal Paris?\u0026rdquo; Regex match — \u0026ldquo;Does the output contain a valid ISO date?\u0026rdquo; Structured-output validation — \u0026ldquo;Does this parse as JSON and pass the schema?\u0026rdquo; Code execution / unit tests — \u0026ldquo;Run the generated function against these test cases. Did they pass?\u0026rdquo; Numeric tolerance — \u0026ldquo;Is the answer within 0.01 of the expected value?\u0026rdquo; Set membership — \u0026ldquo;Is the classification label one of {positive, negative, neutral}?\u0026rdquo; Contrast: Model-Graded / LLM-as-Judge The opposite approach is a model-graded (or \u0026ldquo;LLM-as-judge\u0026rdquo;) evaluator, where you ask another model something like \u0026ldquo;Is this answer helpful and correct?\u0026rdquo;\nThat is non-deterministic: the same output can get different scores across runs, the judge has its own biases, and it costs tokens per eval.\nWhy Prefer Deterministic Graders Reproducible. Re-running the eval suite produces identical numbers. Regressions become real signal instead of noise. Cheap and fast. A regex runs in microseconds; a judge model costs a real API call per example. Debuggable. When a test fails, the rule that failed is right there in code. With a model judge, you\u0026rsquo;re debugging another model\u0026rsquo;s opinion. Trustworthy. No risk of the judge being wrong, sycophantic, or inconsistent across runs. Practical Rule If you can express the correctness check as code (string match, schema validation, unit test, numeric comparison), do that. 
Reserve model-graded eval for cases that genuinely need it — open-ended generation, tone, creative writing, summarization quality — where no code rule captures what \u0026ldquo;good\u0026rdquo; means.\nHybrid Pattern (Common in Production) Deterministic graders for the parts you can verify (structure, key facts, tool-call correctness, schema conformance). A smaller model-graded slice for subjective dimensions (helpfulness, tone, fluency). This keeps most of your eval signal reproducible while still covering the open-ended parts.\nRelated Best Practice (from 2026 AI engineering guidance) \u0026ldquo;Use the simplest grader that works.\u0026rdquo; Prefer deterministic checks over model-based grading wherever the correctness criterion can be expressed as code. Evaluate on held-out, diverse, real-world inputs and give partial credit across outcome, tool use, and safety dimensions rather than binary pass/fail.\n","permalink":"https://knowledged.to/ai/concepts/deterministic-graders/","summary":"\u003ch1 id=\"deterministic-graders-for-llm--ai-evaluation\"\u003eDeterministic Graders (for LLM / AI Evaluation)\u003c/h1\u003e\n\u003ch2 id=\"definition\"\u003eDefinition\u003c/h2\u003e\n\u003cp\u003eA \u003cstrong\u003edeterministic grader\u003c/strong\u003e is an evaluation function that produces the same result every time for the same input — no randomness, no LLM-in-the-loop judgment. 
You check the model\u0026rsquo;s output against a fixed, code-based rule.\u003c/p\u003e\n\u003ch2 id=\"concrete-examples\"\u003eConcrete Examples\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eExact string match\u003c/strong\u003e — \u0026ldquo;Does the output equal \u003ccode\u003eParis\u003c/code\u003e?\u0026rdquo;\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegex match\u003c/strong\u003e — \u0026ldquo;Does the output contain a valid ISO date?\u0026rdquo;\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStructured-output validation\u003c/strong\u003e — \u0026ldquo;Does this parse as JSON and pass the schema?\u0026rdquo;\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCode execution / unit tests\u003c/strong\u003e — \u0026ldquo;Run the generated function against these test cases. Did they pass?\u0026rdquo;\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNumeric tolerance\u003c/strong\u003e — \u0026ldquo;Is the answer within 0.01 of the expected value?\u0026rdquo;\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSet membership\u003c/strong\u003e — \u0026ldquo;Is the classification label one of \u003ccode\u003e{positive, negative, neutral}\u003c/code\u003e?\u0026rdquo;\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"contrast-model-graded--llm-as-judge\"\u003eContrast: Model-Graded / LLM-as-Judge\u003c/h2\u003e\n\u003cp\u003eThe opposite approach is a \u003cstrong\u003emodel-graded\u003c/strong\u003e (or \u0026ldquo;LLM-as-judge\u0026rdquo;) evaluator, where you ask another model something like \u0026ldquo;Is this answer helpful and correct?\u0026rdquo;\u003c/p\u003e","title":"Deterministic Graders (for LLM / AI Evaluation)"},{"content":"Unsloth Studio — Fine-tuning Dataset Formats Unsloth Studio supports several dataset formats depending on your fine-tuning goal. Files can be uploaded directly as JSONL, JSON, CSV, Parquet, PDF, or DOCX.\nFormat Overview 1. Raw Text (Continued Pretraining) Used to inject domain knowledge without any structure. 
The model learns from continuous prose.\nThe mitochondria is the powerhouse of the cell. ATP synthesis occurs via oxidative phosphorylation... Best for: books, articles, documentation dumps, codebases.\n2. Alpaca Format (Single-turn Instruction) A structured instruction-following format with three fields: instruction, input (optional), and output.\n{\u0026#34;instruction\u0026#34;: \u0026#34;Summarize what a Temporal workflow is\u0026#34;, \u0026#34;input\u0026#34;: \u0026#34;\u0026#34;, \u0026#34;output\u0026#34;: \u0026#34;A Temporal workflow is a durable function that executes reliably across failures, retries, and server restarts.\u0026#34;} {\u0026#34;instruction\u0026#34;: \u0026#34;Explain idempotency in the context of billing\u0026#34;, \u0026#34;input\u0026#34;: \u0026#34;A payment service that charges per API call\u0026#34;, \u0026#34;output\u0026#34;: \u0026#34;Idempotency ensures that if the same charge request is submitted multiple times, only one charge is processed. This is typically achieved using a unique idempotency key per transaction.\u0026#34;} The input field is optional — omit it if the instruction is self-contained.\n3. ShareGPT Format (Multi-turn Conversation) Used for chatbot-style fine-tuning. 
Uses \u0026quot;from\u0026quot; / \u0026quot;value\u0026quot; keys, alternating between human and gpt turns.\n{\u0026#34;conversations\u0026#34;: [{\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;What is a distributed transaction?\u0026#34;}, {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;A distributed transaction spans multiple services or databases, ensuring all-or-nothing semantics across them.\u0026#34;}]} {\u0026#34;conversations\u0026#34;: [{\u0026#34;from\u0026#34;: \u0026#34;human\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;How do retries work in async systems?\u0026#34;}, {\u0026#34;from\u0026#34;: \u0026#34;gpt\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Retries re-attempt a failed operation after a delay. They should be combined with idempotency to avoid duplicate side effects, and exponential backoff to reduce load during failures.\u0026#34;}]} Unsloth provides a standardize_sharegpt() utility to normalize minor variations in this format.\n4. ChatML Format (Multi-turn, OpenAI-style) The message format used by OpenAI and the default in Hugging Face tooling. 
Uses \u0026quot;role\u0026quot; / \u0026quot;content\u0026quot; keys.\n{\u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;system\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;You are a systems engineering expert.\u0026#34;}, {\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Explain backpressure in message queues.\u0026#34;}, {\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Backpressure is a mechanism where a consumer signals to a producer to slow down when it cannot keep up with the rate of incoming messages.\u0026#34;}]} {\u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;What is the difference between at-least-once and exactly-once delivery?\u0026#34;}, {\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;At-least-once guarantees a message is delivered but may result in duplicates. Exactly-once guarantees no duplicates but requires coordination overhead such as idempotency checks or transactional messaging.\u0026#34;}]} Studio auto-maps columns when it detects this structure.\n5. Reasoning Format (Chain-of-Thought / R1-style) Used when fine-tuning models that should exhibit step-by-step reasoning. The answer includes a \u0026lt;think\u0026gt; block followed by the final response.\n{\u0026#34;instruction\u0026#34;: \u0026#34;Is this API design idempotent?\u0026#34;, \u0026#34;input\u0026#34;: \u0026#34;POST /charge with body {amount, user_id}\u0026#34;, \u0026#34;output\u0026#34;: \u0026#34;\u0026lt;think\u0026gt;The endpoint uses POST and takes a user_id and amount. Without an explicit idempotency key, resubmitting the same request could result in double charges.\u0026lt;/think\u0026gt;\\n\\nNo, this design is not idempotent as written. 
Include an `idempotency_key` field in the request body and check it server-side before processing.\u0026#34;} Use this for distilled reasoning models (e.g. DeepSeek-R1 variants). To train a non-reasoning model to gain reasoning ability, use GRPO/RL instead.\nJSONL Format Rules JSONL (JSON Lines) means one JSON object per line, with no commas between lines and no outer array.\nCorrect JSONL:\n{\"instruction\": \"...\", \"output\": \"...\"}\n{\"instruction\": \"...\", \"output\": \"...\"}\nWrong — this is plain JSON, not JSONL:\n[{\"instruction\": \"...\", \"output\": \"...\"}, {\"instruction\": \"...\", \"output\": \"...\"}]\nThe wrapped array format will cause parse errors in Unsloth and the underlying Hugging Face datasets loader.\nDataset Size Guidelines Dataset Size Recommended Model 1,000+ rows Fine-tune the base model 300–1,000 rows Either base or instruct model \u0026lt; 300 rows Fine-tune the instruct model Minimum recommended: 100 rows for reasonable results.\nHow Studio Handles Column Mapping If Studio cannot automatically detect the format, a Dataset Preview dialog opens where you manually assign columns to roles: instruction, input, output, image, etc. Suggested mappings are pre-filled where possible.\nSynthetic Dataset Generation (Data Recipes) If you have unstructured source material (PDFs, CSVs, DOCX), Unsloth Studio\u0026rsquo;s Data Recipes feature (powered by NVIDIA NeMo DataDesigner) can automatically convert it into a training-ready dataset via a visual node workflow. 
You can also feed the generated data back into an LLM to iteratively improve quality.\n","permalink":"https://knowledged.to/notes/ml/unsloth-studio-dataset-formats/","summary":"\u003ch1 id=\"unsloth-studio--fine-tuning-dataset-formats\"\u003eUnsloth Studio — Fine-tuning Dataset Formats\u003c/h1\u003e\n\u003cp\u003eUnsloth Studio supports several dataset formats depending on your fine-tuning goal. Files can be uploaded directly as \u003cstrong\u003eJSONL, JSON, CSV, Parquet, PDF, or DOCX\u003c/strong\u003e.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"format-overview\"\u003eFormat Overview\u003c/h2\u003e\n\u003ch3 id=\"1-raw-text-continued-pretraining\"\u003e1. Raw Text (Continued Pretraining)\u003c/h3\u003e\n\u003cp\u003eUsed to inject domain knowledge without any structure. The model learns from continuous prose.\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003eThe mitochondria is the powerhouse of the cell. ATP synthesis occurs via oxidative phosphorylation...\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eBest for: books, articles, documentation dumps, codebases.\u003c/p\u003e","title":"Unsloth Studio — Fine-tuning Dataset Formats"},{"content":"Mixture of Experts (MoE) Mixture of Experts is an architecture pattern in machine learning where a model is divided into many specialized sub-networks (\u0026ldquo;experts\u0026rdquo;), with a routing mechanism that selectively activates only a subset of them for any given input.\nCore Idea Instead of passing every input through all parameters of a model, MoE routes each token (or input) to only a few relevant experts. This decouples total parameter count from compute per forward pass — you can have a massive model that\u0026rsquo;s still fast and efficient to run.\nKey Components 1. 
Experts Each expert is typically a feed-forward network (FFN). In Transformer-based MoE models, the dense FFN layer in each Transformer block is replaced by a bank of N expert FFNs.\n2. Router / Gating Network A small learned network that takes the input token\u0026rsquo;s representation and outputs a probability distribution over all experts. The top-K experts (usually K=1 or K=2) are selected for each token.\n3. Sparse Activation Only the selected K experts compute outputs for a given token. The results are weighted by the router\u0026rsquo;s scores and summed. If you have 64 experts but K=2, only ~3% of expert parameters activate per token.\nWhy It Matters Property Dense Model MoE Model Total parameters Fixed Very large Active parameters per token All Small fraction Training compute High Lower per step Inference speed Baseline Faster (same active params) Memory footprint Proportional High (all experts in memory) This is how models like Mixtral 8x7B, Gemini 1.5, and DeepSeek-V3 (and, reportedly, GPT-4, whose architecture has not been officially disclosed) achieve massive capacity without proportional compute costs.\nChallenges Load Balancing Routers tend to collapse — they learn to always route to the same few popular experts, leaving others unused. Solutions include auxiliary load-balancing losses that penalize uneven expert utilization.\nCommunication Overhead (Distributed Training) In large-scale training, experts are sharded across GPUs. Routing tokens to experts on different devices requires all-to-all communication, which is expensive.\nMemory All experts must be held in memory even if most are idle during a given forward pass. This makes MoE models memory-hungry despite being compute-efficient.\nTraining Instability The discrete routing (top-K selection) is non-differentiable, which can cause instability. Techniques like straight-through estimators or soft routing during early training are used to mitigate this.\nModern MoE Variants Mixtral 8x7B — 8 experts per layer, top-2 routing. 
Effectively ~13B active params out of 47B total. DeepSeek-V3 / MoE — Uses fine-grained experts with a large number of experts per layer (e.g., 256), with shared experts that are always active plus routed ones. Switch Transformer (Google) — Pioneered top-1 routing for simplicity and showed MoE scales well. Expert Choice routing — Instead of tokens choosing experts, experts choose their top-K tokens. Better load balancing by design. MoE in the Context of Transformers In a standard Transformer block:\nAttention → FFN\nIn an MoE Transformer block:\nAttention → Router → [Expert 1, Expert 2, ..., Expert N] → weighted sum\nThe attention layers remain dense — only the FFN layers are sparsified with experts.\nIntuition Think of it like a team of specialists. A generalist handles everything but isn\u0026rsquo;t optimal for any task. MoE lets you route a medical question to the \u0026ldquo;medicine expert,\u0026rdquo; a coding question to the \u0026ldquo;code expert,\u0026rdquo; etc. — but all learned end-to-end without hand-labeling who specializes in what.\n","permalink":"https://knowledged.to/notes/ml/mixture-of-experts/","summary":"\u003ch1 id=\"mixture-of-experts-moe\"\u003eMixture of Experts (MoE)\u003c/h1\u003e\n\u003cp\u003eMixture of Experts is an architecture pattern in machine learning where a model is divided into many specialized sub-networks (\u0026ldquo;experts\u0026rdquo;), with a routing mechanism that selectively activates only a subset of them for any given input.\u003c/p\u003e\n\u003ch2 id=\"core-idea\"\u003eCore Idea\u003c/h2\u003e\n\u003cp\u003eInstead of passing every input through all parameters of a model, MoE routes each token (or input) to only a few relevant experts. 
This decouples \u003cstrong\u003etotal parameter count\u003c/strong\u003e from \u003cstrong\u003ecompute per forward pass\u003c/strong\u003e — you can have a massive model that\u0026rsquo;s still fast and efficient to run.\u003c/p\u003e","title":"Mixture of Experts (MoE)"},{"content":"Chain of Thought (CoT) Chain of Thought is a prompting technique where an AI model is guided — or learns — to reason through a problem step by step before arriving at a final answer, rather than jumping straight to the conclusion.\nThe core idea is that breaking down complex reasoning into intermediate steps leads to more accurate and reliable outputs, much like how a person might work through a math problem by showing their work.\nTwo Main Flavors Explicit CoT (prompted) — You instruct the model to \u0026ldquo;think step by step.\u0026rdquo; For example:\n\u0026ldquo;Q: If a train travels 60 mph for 2 hours, how far does it go? Let\u0026rsquo;s think step by step.\u0026rdquo; \u0026ldquo;A: Speed is 60 mph. Time is 2 hours. Distance = speed × time = 60 × 2 = 120 miles.\u0026rdquo;\nImplicit CoT (trained) — Models like OpenAI\u0026rsquo;s o1/o3 or Anthropic\u0026rsquo;s Claude are trained to reason internally before producing a final answer, often without the user seeing the scratchpad.\nWhy It Works Forces the model to decompose problems rather than guess Reduces errors on multi-step tasks (math, logic, coding) Makes reasoning auditable — you can see where it went wrong Particularly powerful for tasks that require planning or sequential decisions Where It Matters Most Math word problems, logical puzzles, code debugging, multi-hop question answering, and anything requiring more than a one-shot lookup.\nKey Insight One of the most impactful ideas in modern LLM prompting. 
It has influenced how newer models are trained at a fundamental level: the \u0026ldquo;thinking\u0026rdquo; tokens that models produce before answering are direct descendants of this idea.\n","permalink":"https://knowledged.to/notes/ml/chain-of-thought/","summary":"\u003ch1 id=\"chain-of-thought-cot\"\u003eChain of Thought (CoT)\u003c/h1\u003e\n\u003cp\u003eChain of Thought is a prompting technique where an AI model is guided — or learns — to reason through a problem step by step before arriving at a final answer, rather than jumping straight to the conclusion.\u003c/p\u003e\n\u003cp\u003eThe core idea is that breaking down complex reasoning into intermediate steps leads to more accurate and reliable outputs, much like how a person might work through a math problem by showing their work.\u003c/p\u003e","title":"Chain of Thought (CoT)"},{"content":"Visual Chain-of-Thought Reasoning Visual chain-of-thought (CoT) reasoning is the extension of standard chain-of-thought prompting to multimodal settings — where the model reasons step-by-step over both visual and textual information together.\nCore Idea In standard CoT, a language model breaks a problem into intermediate reasoning steps before arriving at a final answer. 
Visual CoT does the same, but the reasoning chain involves interpreting, referencing, and drawing inferences from images, diagrams, charts, or visual scenes alongside text.\nHow It Works Rather than just answering \u0026ldquo;what\u0026rsquo;s in this image?\u0026rdquo;, a model doing visual CoT might:\nIdentify relevant objects or regions in the image Ground those elements to the question being asked Reason about spatial relationships, numerical values, or logical implications Conclude with a final answer derived from that visual reasoning chain For example, given a geometry diagram and asked \u0026ldquo;what is the area?\u0026rdquo;, the model might reason: \u0026ldquo;I see a rectangle with labeled width 4 and height 6 → area formula is l × w → 4 × 6 = 24\u0026rdquo; — rather than guessing directly.\nKey Techniques Attention-guided reasoning — The model learns to focus on specific image regions at each reasoning step, almost like \u0026ldquo;looking\u0026rdquo; at different parts of the image sequentially.\nRationale generation — The model produces natural language rationales that describe what it sees and why it matters, making the visual reasoning transparent.\nVisual grounding — Reasoning steps are tied to specific bounding boxes or regions, so each step has a spatial anchor in the image.\nIterative refinement — Some approaches let the model \u0026ldquo;re-examine\u0026rdquo; the image after forming a partial hypothesis, correcting itself based on what it finds.\nWhy It Matters Models that just map image → answer tend to hallucinate or miss subtle visual details Step-by-step reasoning forces the model to commit to intermediate conclusions, reducing errors It makes model behavior interpretable — you can see where the reasoning went wrong Particularly powerful for tasks like: math diagrams, scientific figures, medical imaging, document understanding, and visual question answering (VQA) Relation to Modern Multimodal Models Models like GPT-4o, Gemini, and Claude 
with vision capabilities implicitly perform some degree of visual CoT. Explicit visual CoT training (e.g., via RLHF or process reward models on reasoning traces) pushes this further — the model learns to externalize its visual reasoning rather than doing it silently in latent space.\nIt\u0026rsquo;s an active research area, with work exploring whether models truly \u0026ldquo;see and reason\u0026rdquo; or are pattern-matching on visual features with language post-hoc.\n","permalink":"https://knowledged.to/notes/ml/multimodal-visual-chain-of-thought/","summary":"\u003ch1 id=\"visual-chain-of-thought-reasoning\"\u003eVisual Chain-of-Thought Reasoning\u003c/h1\u003e\n\u003cp\u003eVisual chain-of-thought (CoT) reasoning is the extension of standard chain-of-thought prompting to multimodal settings — where the model reasons step-by-step over \u003cstrong\u003eboth visual and textual information\u003c/strong\u003e together.\u003c/p\u003e\n\u003ch2 id=\"core-idea\"\u003eCore Idea\u003c/h2\u003e\n\u003cp\u003eIn standard CoT, a language model breaks a problem into intermediate reasoning steps before arriving at a final answer. 
Visual CoT does the same, but the reasoning chain involves interpreting, referencing, and drawing inferences from \u003cstrong\u003eimages, diagrams, charts, or visual scenes\u003c/strong\u003e alongside text.\u003c/p\u003e","title":"Visual Chain-of-Thought Reasoning"},{"content":"Multi-Turn Conversation in AI Multi-turn conversation in AI refers to a dialogue system where a model maintains context across multiple exchanges — rather than treating each message as an isolated input.\nSingle-Turn vs Multi-Turn In a single-turn interaction, the model sees one prompt and produces one response, with no memory of anything before or after.\nIn a multi-turn interaction, the model receives the full conversation history (all prior messages) with each new request, allowing it to:\nRefer back to earlier context (\u0026ldquo;as I mentioned above\u0026hellip;\u0026rdquo;) Resolve pronouns and implicit references (\u0026ldquo;make it shorter\u0026rdquo; — the model knows what \u0026ldquo;it\u0026rdquo; is) Track goals across steps (e.g., iteratively building a piece of code) Maintain persona or constraints set earlier in the conversation How It Works Technically There\u0026rsquo;s no persistent memory inside the model itself. Instead, the entire conversation history is passed as input on every turn — the model just sees a longer and longer context window. This is why very long conversations can hit token limits.\nA typical message structure:\n[system prompt]\nuser: \"Write a Go function to reverse a string\"\nassistant: \"Here it is: ...\"\n
user: \"Now add error handling\" ← model uses prior context to understand this\nassistant: \"Updated version: ...\"\nThe key insight is that \u0026ldquo;memory\u0026rdquo; in multi-turn AI is really just context injection — the application layer is responsible for storing and replaying the conversation history, not the model itself.\n","permalink":"https://knowledged.to/ai/concepts/multi-turn-conversation/","summary":"\u003ch1 id=\"multi-turn-conversation-in-ai\"\u003eMulti-Turn Conversation in AI\u003c/h1\u003e\n\u003cp\u003eMulti-turn conversation in AI refers to a dialogue system where a model maintains context across multiple exchanges — rather than treating each message as an isolated input.\u003c/p\u003e\n\u003ch2 id=\"single-turn-vs-multi-turn\"\u003eSingle-Turn vs Multi-Turn\u003c/h2\u003e\n\u003cp\u003eIn a \u003cstrong\u003esingle-turn\u003c/strong\u003e interaction, the model sees one prompt and produces one response, with no memory of anything before or after.\u003c/p\u003e\n\u003cp\u003eIn a \u003cstrong\u003emulti-turn\u003c/strong\u003e interaction, the model receives the full conversation history (all prior messages) with each new request, allowing it to:\u003c/p\u003e","title":"Multi-Turn Conversation in AI"},{"content":"Agent Harness Engineering Agent harness engineering is the practice of building the scaffolding, infrastructure, and tooling that surrounds an AI agent — everything that isn\u0026rsquo;t the model itself but makes the model useful, reliable, and safe in production.\nThe model is the engine; the harness is the chassis, controls, safety systems, and instrumentation around it.\nCore Components Execution Environment The runtime that manages how the agent runs — process lifecycle, sandboxing, resource limits, timeouts, and isolation between agent instances.\nTool/Function Orchestration Wiring up the tools the agent can call (APIs, code execution, file systems, databases), handling tool call/response cycles, retrying failures, and enforcing what 
tools are accessible in a given context.\nMemory and Context Management Deciding what goes into the context window at each step — conversation history, retrieved documents, prior tool results, system prompts — and how to compress or evict it when space runs out.\nLooping and Control Flow Managing multi-step reasoning loops (ReAct, plan-and-execute, etc.), detecting when the agent is done vs. stuck, handling infinite loops, and enforcing max-step budgets.\nObservation and Tracing Capturing every LLM call, tool invocation, input/output, latency, and token cost — usually via OpenTelemetry or a similar tracing layer — for debugging and monitoring agent behavior.\nSafety and Guardrails Input/output filtering, action confirmation gates (especially for irreversible actions), policy enforcement, and injection defense so the agent can\u0026rsquo;t be hijacked by malicious content it reads.\nState Persistence For long-running agents, checkpointing agent state to durable storage (e.g., Temporal workflows) so execution can be resumed after crashes or timeouts.\nWhy It Matters Without a solid harness, agents tend to:\nRun away (infinite loops, runaway tool calls) Fail silently (swallowed errors, wrong context) Be impossible to debug (no tracing) Be dangerous in production (no guardrails) Stack Mapping (Go + Temporal) Temporal — durable execution, retry logic, state persistence Go orchestration layer — tool dispatch, context shaping OpenTelemetry — tracing and observability The \u0026ldquo;harness\u0026rdquo; concept is the unified name for all of that glue surrounding the model.\n","permalink":"https://knowledged.to/notes/ml/agent-harness-engineering/","summary":"\u003ch1 id=\"agent-harness-engineering\"\u003eAgent Harness Engineering\u003c/h1\u003e\n\u003cp\u003eAgent harness engineering is the practice of building the scaffolding, infrastructure, and tooling that surrounds an AI agent — everything that isn\u0026rsquo;t the model itself but makes the model useful, reliable, 
and safe in production.\u003c/p\u003e\n\u003cp\u003eThe model is the engine; the harness is the chassis, controls, safety systems, and instrumentation around it.\u003c/p\u003e\n\u003ch2 id=\"core-components\"\u003eCore Components\u003c/h2\u003e\n\u003ch3 id=\"execution-environment\"\u003eExecution Environment\u003c/h3\u003e\n\u003cp\u003eThe runtime that manages how the agent runs — process lifecycle, sandboxing, resource limits, timeouts, and isolation between agent instances.\u003c/p\u003e","title":"Agent Harness Engineering"},{"content":"Diffusion Models in AI Diffusion models are a class of generative AI models that learn to create data (images, audio, video, etc.) by learning to reverse a gradual noising process.\nThe Core Idea The training process has two phases:\nForward process (destroying data): Take a real image and progressively add Gaussian noise over many steps (say, 1000 steps) until it becomes pure random noise. This is fixed and requires no learning.\nReverse process (learning to reconstruct): Train a neural network (usually a U-Net or Transformer) to predict and remove the noise at each step — essentially learning to \u0026ldquo;denoise.\u0026rdquo; At inference time, you start from pure noise and repeatedly apply this denoising to generate a new sample.\nWhy It Works The model never learns to go from noise → image in one shot (too hard). Instead it learns a much simpler local question at each step: \u0026ldquo;given this slightly noisy image, what\u0026rsquo;s the noise I should subtract?\u0026rdquo; Chaining 1000 such small steps produces a coherent sample.\nKey Variants DDPM (Denoising Diffusion Probabilistic Models) — the foundational formulation (Ho et al., 2020) DDIM — faster sampling by skipping steps, reducing inference from 1000 → ~50 steps Latent Diffusion Models (LDM) — run the diffusion process in a compressed latent space rather than pixel space, dramatically cutting compute. This is what Stable Diffusion uses. 
Classifier-Free Guidance (CFG) — technique to steer generation toward a text prompt by jointly training a conditioned and unconditioned model How Text-to-Image Works Models like Stable Diffusion, DALL-E 3, and Flux add conditioning: the denoising network also takes a text embedding (from a CLIP or T5 encoder) as input at every step. The network learns to denoise toward an image that matches the prompt, not just any coherent image.\nComparison to Other Generative Models Model Mechanism Tradeoffs Diffusion Reverse noising High quality, slow sampling GAN Generator vs. discriminator Fast, but training instability VAE Encode → latent → decode Fast, but blurry outputs Flow Matching Learn a vector field (ODE) Cleaner math, increasingly dominant Why They Became Dominant Diffusion models produce significantly better sample quality and diversity than GANs, without the notorious training instability. The latent diffusion trick made them practical at scale, leading to the current generation of image/video/audio models (Stable Diffusion, Sora, Udio, etc.).\nFlow matching (used in Flux, Stable Diffusion 3, and Meta\u0026rsquo;s models) is now emerging as a cleaner successor — same intuition, but learns a straight-line path through the data manifold rather than a noisy diffusion path.\n","permalink":"https://knowledged.to/notes/ml/diffusion-models/","summary":"\u003ch1 id=\"diffusion-models-in-ai\"\u003eDiffusion Models in AI\u003c/h1\u003e\n\u003cp\u003eDiffusion models are a class of generative AI models that learn to create data (images, audio, video, etc.) by learning to reverse a gradual noising process.\u003c/p\u003e\n\u003ch2 id=\"the-core-idea\"\u003eThe Core Idea\u003c/h2\u003e\n\u003cp\u003eThe training process has two phases:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eForward process (destroying data):\u003c/strong\u003e Take a real image and progressively add Gaussian noise over many steps (say, 1000 steps) until it becomes pure random noise. 
This is fixed and requires no learning.\u003c/p\u003e","title":"Diffusion Models in AI"},{"content":"AI Prompts: System Prompt and Other Types System Prompt A system prompt is a set of instructions given to an AI model before any conversation begins. It\u0026rsquo;s written by the developer or application builder (not the end user) and sets the AI\u0026rsquo;s behavior, persona, tone, rules, and constraints for the entire session. The user typically doesn\u0026rsquo;t see it.\nThink of it like a job briefing you give an employee before they meet a customer — it shapes how they behave without the customer knowing the specifics.\nFor example, a system prompt might say: \u0026ldquo;You are a helpful customer support agent for Acme Corp. Only answer questions about our products. Always be polite and concise.\u0026rdquo;\nOther Types of Prompts in AI User Prompt — This is the message the end user actually types. It\u0026rsquo;s the direct question or request sent to the AI in real time. Most people interact only at this level.\nAssistant Prompt (or AI turn) — In multi-turn conversations, the AI\u0026rsquo;s previous responses can be included as part of the conversation history, effectively \u0026ldquo;prompting\u0026rdquo; the next response. This is how context is maintained across a chat.\nFew-shot Prompt — Examples embedded in a prompt to teach the model a pattern. Instead of just explaining a task, you show the model 2–3 input/output examples and ask it to follow suit.\nZero-shot Prompt — Asking the model to do something with no examples at all, relying purely on its training. \u0026ldquo;Translate this sentence to French.\u0026rdquo;\nChain-of-Thought Prompt — A technique where you instruct the model to think step by step before giving a final answer, which improves reasoning on complex tasks.\nMeta-prompt — A prompt whose purpose is to generate or improve other prompts. 
Useful when you want the AI to help you craft better instructions.\nRetrieval-Augmented Prompt — A prompt that includes relevant context retrieved from an external database or document at runtime, so the model can answer questions grounded in specific, up-to-date information.\nIn practice, most production AI applications layer several of these together — a system prompt sets the rules, retrieved documents provide context, and the user prompt drives the actual query.\n","permalink":"https://knowledged.to/notes/ml/ai-prompts/","summary":"\u003ch1 id=\"ai-prompts-system-prompt-and-other-types\"\u003eAI Prompts: System Prompt and Other Types\u003c/h1\u003e\n\u003ch2 id=\"system-prompt\"\u003eSystem Prompt\u003c/h2\u003e\n\u003cp\u003eA \u003cstrong\u003esystem prompt\u003c/strong\u003e is a set of instructions given to an AI model \u003cem\u003ebefore\u003c/em\u003e any conversation begins. It\u0026rsquo;s written by the developer or application builder (not the end user) and sets the AI\u0026rsquo;s behavior, persona, tone, rules, and constraints for the entire session. The user typically doesn\u0026rsquo;t see it.\u003c/p\u003e\n\u003cp\u003eThink of it like a job briefing you give an employee before they meet a customer — it shapes how they behave without the customer knowing the specifics.\u003c/p\u003e","title":"AI Prompts: System Prompt and Other Types"},{"content":"Elastic Looped Transformers (ELT) Elastic Looped Transformers (ELT) are a recent architectural innovation that rethinks how transformer layers are applied — moving from a fixed, one-pass stack to a dynamic, recurrent execution model.\nThe Standard Transformer Problem In a conventional transformer, you have a fixed stack of N layers (say, 96 layers in a large model). Every input always passes through all 96 layers exactly once. This is rigid in two ways:\nEvery input gets the same compute budget, regardless of whether it\u0026rsquo;s a trivial question or a complex reasoning problem. 
Depth is fixed at architecture design time — you can\u0026rsquo;t adapt it post-training without retraining. The Core Idea: Looping ELT takes a shallower set of transformer layers and runs them multiple times in a loop — hence \u0026ldquo;looped.\u0026rdquo; Instead of having 96 distinct layers, you might have 12 layers that execute 8 times, with hidden states passed from one loop iteration to the next.\nThis makes the architecture inherently recurrent — information from one pass through the layer block feeds into the next, allowing the model to iteratively refine its representations. Each loop can be thought of as a \u0026ldquo;thinking step,\u0026rdquo; where the model revisits the same input with an updated internal state.\nThe \u0026ldquo;Elastic\u0026rdquo; Part The elastic aspect allows the number of loop iterations to be adjusted dynamically at inference time based on the difficulty or complexity of the input:\nSimple inputs might need only 3–4 loops; complex multi-step reasoning might use 12+. This is called adaptive compute or dynamic depth. You can tune the speed/quality tradeoff at inference time without retraining. Easy inputs are processed cheaply, hard inputs get more compute. The same model checkpoint can serve both latency-sensitive and quality-sensitive use cases. How Hidden State Carries Over Between loop iterations, the model\u0026rsquo;s hidden state (the tensor representations at each token position) is passed forward. 
The model builds a progressively refined understanding of the input across iterations — similar to how a human might re-read a complex sentence before answering.\nSome ELT designs incorporate a learned halting mechanism — a small auxiliary network that predicts whether another loop iteration would meaningfully improve the output, allowing the model to stop early when it\u0026rsquo;s confident.\nComparison to Other Approaches Approach Compute per input Adaptivity Memory Standard Transformer Fixed (all layers, once) None KV cache only Mixture of Experts Variable (sparse routing) Partial Large parameter count ELT Variable (loop count) High Recurrent hidden state Mamba/RWKV Fixed (recurrent step) None Compressed state ELT sits in an interesting position — more adaptive than standard transformers, more memory-efficient than MoE, and more expressive than fixed-recurrent models like Mamba.\nWhy It Matters for AI Engineers Inference cost optimization — dynamically allocate compute per request based on complexity rather than paying the same cost for everything. Reasoning tasks — iterative refinement makes ELT well-suited for chain-of-thought style reasoning baked into the architecture; each loop iteration is implicitly a reasoning step. Edge/constrained deployment — cap loop count for resource-constrained environments; run more iterations in the cloud — same weights, different compute budget. Current State (April 2026) ELT remains primarily a research architecture — not yet adopted at mainstream transformer deployment scale. 
Main open challenges:\nTraining stability (looped architectures can suffer from gradient issues across iterations) Overhead of the halting mechanism Represents a promising direction in the broader movement toward adaptive compute — the idea that model intelligence should scale with problem difficulty, not be statically provisioned.\n","permalink":"https://knowledged.to/notes/ml/elastic-looped-transformers/","summary":"\u003ch1 id=\"elastic-looped-transformers-elt\"\u003eElastic Looped Transformers (ELT)\u003c/h1\u003e\n\u003cp\u003eElastic Looped Transformers (ELT) are a recent architectural innovation that rethinks how transformer layers are applied — moving from a fixed, one-pass stack to a dynamic, recurrent execution model.\u003c/p\u003e\n\u003ch2 id=\"the-standard-transformer-problem\"\u003eThe Standard Transformer Problem\u003c/h2\u003e\n\u003cp\u003eIn a conventional transformer, you have a fixed stack of N layers (say, 96 layers in a large model). Every input always passes through all 96 layers exactly once. This is rigid in two ways:\u003c/p\u003e","title":"Elastic Looped Transformers (ELT)"},{"content":"Tempo Framework Tempo is a framework designed to solve one of the hardest problems in multimodal AI: understanding very long videos without blowing up your context window or compute budget.\nThe Core Problem It Solves Videos are brutally expensive for transformers. A 1-hour video at even 1 frame per second gives you 3,600 frames. At typical vision encoding resolutions, each frame becomes hundreds of tokens — potentially millions of tokens total, far beyond what any current model can process in a single context window. And even if it could, the attention computation would be prohibitively expensive (attention is O(n²) in sequence length).\nPrior approaches mostly resorted to uniform frame sampling — just pick every Nth frame and hope you don\u0026rsquo;t miss anything important. 
This works poorly for real-world videos where interesting events are sparse and unevenly distributed.\nTempo\u0026rsquo;s Approach Tempo introduces a query-aware temporal compressor built around a Small Vision-Language Model (SVLM). The key word is query-aware — rather than compressing the video uniformly, Tempo compresses it differently depending on what question you\u0026rsquo;re actually asking.\nPipeline Frame encoding — video frames are encoded into visual embeddings as usual. Query-aware compression — the SVLM takes both the visual embeddings and the user\u0026rsquo;s query, then identifies which temporal segments are relevant to that specific question. Relevant frames/segments are preserved at higher fidelity; unrelated stretches get aggressively compressed or dropped. Compressed representation passed to the main MLLM — the large multimodal model receives a much shorter, query-focused token sequence rather than the raw full video, and generates the answer. Why the SVLM Approach Is Smart Using a small model for compression is elegant:\nThe SVLM is cheap to run — its job isn\u0026rsquo;t to answer the question, just to identify relevance. It acts as a smart pre-filter, doing temporal attention at a coarse level so the expensive large model only reasons over the parts that matter. Architecturally similar to RAG: a retriever (cheap) narrows down context before the generator (expensive) does the heavy lifting — just applied to the time dimension of video rather than a document corpus. 
What It Enables Hour-long video Q\u0026amp;A — e.g., \u0026ldquo;what was the presenter doing when they mentioned the revenue figures?\u0026rdquo; over a full lecture or meeting recording Video summarization with focus — \u0026ldquo;summarize only the parts relevant to the product demo\u0026rdquo; Temporal grounding — \u0026ldquo;when does X happen?\u0026rdquo; over long content Surveillance and monitoring — finding specific events in hours of footage without manual scrubbing Limitations and Open Questions Compression quality depends on the SVLM\u0026rsquo;s ability to judge relevance — made before the large model sees full context. If the query is ambiguous, the SVLM might compress away important material. Queries requiring understanding of patterns across time (e.g., \u0026ldquo;how does the speaker\u0026rsquo;s tone change over the talk?\u0026rdquo;) are harder than point-in-time retrieval. Broader Significance Tempo is part of a broader wave of research on hierarchical / cascaded multimodal processing — the principle that raw, unfiltered perceptual data should not go directly to your most expensive model. Use cheap, fast models for coarse filtering and structuring, then hand off a condensed, task-relevant representation to the powerful model. This pattern is likely to become standard practice as video becomes a primary modality in AI applications.\n","permalink":"https://knowledged.to/notes/ml/tempo-framework/","summary":"\u003ch1 id=\"tempo-framework\"\u003eTempo Framework\u003c/h1\u003e\n\u003cp\u003eTempo is a framework designed to solve one of the hardest problems in multimodal AI: \u003cstrong\u003eunderstanding very long videos\u003c/strong\u003e without blowing up your context window or compute budget.\u003c/p\u003e\n\u003ch2 id=\"the-core-problem-it-solves\"\u003eThe Core Problem It Solves\u003c/h2\u003e\n\u003cp\u003eVideos are brutally expensive for transformers. A 1-hour video at even 1 frame per second gives you 3,600 frames. 
At typical vision encoding resolutions, each frame becomes hundreds of tokens — potentially millions of tokens total, far beyond what any current model can process in a single context window. And even if it could, the attention computation would be prohibitively expensive (attention is O(n²) in sequence length).\u003c/p\u003e","title":"Tempo Framework"},{"content":"Memory-Augmented Architectures Memory-augmented architectures are neural network designs that give a model access to an explicit, addressable memory store that exists separately from the model\u0026rsquo;s weights. Standard transformers have two forms of \u0026ldquo;memory\u0026rdquo; baked in — the weights (long-term parametric knowledge frozen at training time) and the context window (short-term working memory limited to the current input). Memory-augmented architectures add a third, dynamic layer in between.\nWhy It Matters Standard transformers are stateless between calls. Everything the model \u0026ldquo;knows\u0026rdquo; about your session either lives in the weights or gets re-fed through the context window every time. This creates hard limits: context windows are expensive to fill, they get stale, and they can\u0026rsquo;t persist knowledge across sessions without explicit engineering workarounds.\nHow It Works A memory-augmented model can read from and write to an external memory at inference time. The core mechanism usually involves:\nWriting — after processing information, the model produces a key-value pair (or embedding) and stores it in the memory bank. 
This can happen continuously, not just during training.\nReading — when the model needs information, it generates a query vector and performs a soft lookup against memory (similar to attention), retrieving the most relevant stored representations.\nForgetting / updating — good systems also have mechanisms to overwrite stale entries or decay old memories, so the store doesn\u0026rsquo;t grow unbounded.\nArchitectures in This Space Neural Turing Machines (NTMs) / Differentiable Neural Computers (DNCs) — the original academic formulations from DeepMind. The model had explicit read/write heads over a tape-like memory. Theoretically powerful but hard to train stably.\nMemory Transformers (MemTrans, Memorizing Transformers) — extend attention to reach into a large external key-value store of past token representations. The model retrieves relevant past context without needing to fit it all in the active context window.\nRetrieval-Augmented Generation (RAG) — the production-pragmatic version. An external vector database acts as memory; a retriever fetches relevant chunks at query time. Easier to build and update than learned memory, though less tightly integrated.\nTitans (Google, 2025) — introduces a learned \u0026ldquo;long-term memory\u0026rdquo; module with its own gradient-based update rule, allowing the model to memorize information during inference, not just training. Showed strong results on tasks requiring very long-range reasoning.\nRecurrent memory approaches (RWKV, Mamba, xLSTM) — instead of explicit external stores, these compress history into a fixed-size hidden state that gets updated at each step. More efficient than full attention but lossy — information can be forgotten.\nThe 4–17x Performance Gain When a model has access to persistent, structured memory, it can effectively \u0026ldquo;do more\u0026rdquo; per unit of compute than a larger static model would. 
Rather than encoding everything in weights (which requires enormous scale), you offload factual and episodic knowledge to memory and keep the model focused on reasoning — yielding qualitative capability jumps without proportional scaling.\nPractical Implications for AI Engineers If you\u0026rsquo;re building production agents, memory-augmented thinking reshapes your architecture in practical ways:\nPure RAG is table stakes in 2026. The frontier is systems where agents write back to memory — updating what they\u0026rsquo;ve learned from a session, building user-specific context over time, and retrieving it selectively. Frameworks like LlamaIndex and LangGraph already have primitives for this. The research side is now focused on making the read/write more differentiable and less hand-engineered. Key Takeaway Memory-augmented architectures are the bridge between a \u0026ldquo;stateless model call\u0026rdquo; and a \u0026ldquo;persistent intelligent agent.\u0026rdquo; Three memory tiers to design around: weights (parametric, frozen), context window (ephemeral, expensive), and external memory (dynamic, persistent).\n","permalink":"https://knowledged.to/notes/ml/memory-augmented-architectures/","summary":"\u003ch1 id=\"memory-augmented-architectures\"\u003eMemory-Augmented Architectures\u003c/h1\u003e\n\u003cp\u003eMemory-augmented architectures are neural network designs that give a model access to an explicit, addressable memory store that exists separately from the model\u0026rsquo;s weights. Standard transformers have two forms of \u0026ldquo;memory\u0026rdquo; baked in — the weights (long-term parametric knowledge frozen at training time) and the context window (short-term working memory limited to the current input). Memory-augmented architectures add a third, dynamic layer in between.\u003c/p\u003e\n\u003ch2 id=\"why-it-matters\"\u003eWhy It Matters\u003c/h2\u003e\n\u003cp\u003eStandard transformers are stateless between calls. 
Everything the model \u0026ldquo;knows\u0026rdquo; about your session either lives in the weights or gets re-fed through the context window every time. This creates hard limits: context windows are expensive to fill, they get stale, and they can\u0026rsquo;t persist knowledge across sessions without explicit engineering workarounds.\u003c/p\u003e","title":"Memory-Augmented Architectures"},{"content":"Forward Pass and Single Pass in LLMs These terms are fundamental to understanding how LLMs work under the hood.\nForward Pass A forward pass is a single run of data through a neural network, from input to output. In an LLM, it means feeding a sequence of tokens into the model and computing a probability distribution over the vocabulary for the next token (or all token positions simultaneously).\nHere\u0026rsquo;s what actually happens during a forward pass in a transformer:\nEmbedding — each input token is converted to a high-dimensional vector Attention layers — each token attends to every other token in the sequence, computing relationships (this is the expensive part, O(n²) in sequence length) Feed-forward layers — each token\u0026rsquo;s representation is transformed independently through a series of matrix multiplications Output projection — the final hidden state is projected onto the vocabulary (50K+ tokens) to produce logits (raw scores) Softmax — logits are converted to probabilities, and you sample or argmax to pick the next token The cost of one forward pass is dominated by loading the model weights from GPU memory (HBM). A 70B parameter model at FP16 is ~140GB of weights, and you need to stream all of those through the GPU\u0026rsquo;s compute cores for every single pass. This is why inference is memory-bandwidth-bound.\nSingle Pass \u0026ldquo;Single pass\u0026rdquo; means doing the forward pass exactly once for a given input — you feed in your tokens, run through the entire network once, and get your output logits back. 
It\u0026rsquo;s contrasted with iterative or multi-step processes that would require multiple network executions.\nHow These Connect to Speculative Decoding Normal autoregressive generation works like this:\nInput: \"The capital of France\"\nPass 1 → predicts \"is\"\nPass 2 → predicts \"Paris\"\nPass 3 → predicts \".\"\nThree separate forward passes, each loading all model weights. The passes are serial: you can\u0026rsquo;t start pass 2 until you have the token from pass 1.\nSpeculative decoding breaks this seriality. When the large model runs its forward pass to verify the draft model\u0026rsquo;s candidates, it processes all candidate positions in parallel in that single pass. Here\u0026rsquo;s why that\u0026rsquo;s possible:\nTransformers are inherently parallel across the sequence dimension during a forward pass. Given the sequence [\u0026quot;The\u0026quot;, \u0026quot;capital\u0026quot;, \u0026quot;of\u0026quot;, \u0026quot;France\u0026quot;, \u0026quot;is\u0026quot;, \u0026quot;Paris\u0026quot;, \u0026quot;.\u0026quot;], the model can compute the probability of every token given its predecessors simultaneously in one shot, which is how training works. Speculative decoding borrows this property for inference:\nDraft model generates: [\"is\", \"Paris\", \".\"] (3 separate small passes)\nLarge model gets: \"The capital of France\" + [\"is\", \"Paris\", \".\"]\n→ verifies all 3 draft tokens in ONE forward pass\n→ accepts \"is\" ✓, accepts \"Paris\" ✓, accepts \".\" ✓\nSo instead of 3 expensive large-model passes, you do 1. 
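The accept/reject step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not how any particular inference engine implements it: verify_draft, target_next and toy_target are hypothetical names, decoding is greedy, and the loop only mimics the acceptance logic (a real engine obtains all of the target predictions from one batched forward pass).

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[str],
    draft: Sequence[str],
    target_next: Callable[[Sequence[str]], str],
) -> List[str]:
    # target_next is a hypothetical stand-in for the large model:
    # given the tokens so far, it returns the greedy next token.
    accepted: List[str] = []
    for token in draft:
        expected = target_next(list(prefix) + accepted)
        if token != expected:
            # First mismatch: keep the target's token, discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    return accepted


# Toy target model (assumption): a fixed continuation keyed by sequence length.
TABLE = {4: 'is', 5: 'Paris', 6: '.'}


def toy_target(tokens: Sequence[str]) -> str:
    return TABLE.get(len(tokens), '.')


prefix = ['The', 'capital', 'of', 'France']
all_accepted = verify_draft(prefix, ['is', 'Paris', '.'], toy_target)
one_rejected = verify_draft(prefix, ['is', 'Rome', '.'], toy_target)
```

With the toy target, the first draft is accepted in full, while the second keeps the first token, replaces the mismatching one with the target's choice, and drops everything after it.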
The memory bandwidth cost of that one pass is nearly the same whether you\u0026rsquo;re verifying 1 token or 8, because the weight-loading dominates — not the arithmetic on token positions.\nThe KV Cache Wrinkle Modern LLMs use a KV cache (key-value cache) to avoid recomputing attention for tokens already processed. Each forward pass only computes attention for new tokens against the cached representations of prior tokens. This makes each incremental generation step cheaper than a full forward pass, but the fundamental bottleneck — streaming weights from memory — remains the same. Speculative decoding\u0026rsquo;s gains hold regardless.\nKey Takeaway A forward pass is the atomic unit of computation in a neural network. Speculative decoding\u0026rsquo;s trick is turning what would be N serial forward passes on the large model into 1, by exploiting the parallelism that transformers already have built in.\n","permalink":"https://knowledged.to/notes/ml/forward-pass-and-single-pass/","summary":"\u003ch1 id=\"forward-pass-and-single-pass-in-llms\"\u003eForward Pass and Single Pass in LLMs\u003c/h1\u003e\n\u003cp\u003eThese terms are fundamental to understanding how LLMs work under the hood.\u003c/p\u003e\n\u003ch2 id=\"forward-pass\"\u003eForward Pass\u003c/h2\u003e\n\u003cp\u003eA forward pass is a single run of data through a neural network, from input to output. 
In an LLM, it means feeding a sequence of tokens into the model and computing a probability distribution over the vocabulary for the next token (or all token positions simultaneously).\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s what actually happens during a forward pass in a transformer:\u003c/p\u003e","title":"Forward Pass and Single Pass in LLMs"},{"content":"Speculative Decoding Speculative decoding is a clever inference optimization technique that exploits a fundamental asymmetry in how LLMs work: verifying several candidate tokens in one forward pass costs about the same as generating a single token.\nThe Basic Setup You run two models side by side: a small, fast \u0026ldquo;draft\u0026rdquo; model and your large \u0026ldquo;target\u0026rdquo; model. The draft model generates several tokens ahead (typically 4–8), one cheap pass per token. The large model then verifies all of those candidate tokens in parallel in one forward pass. If the draft tokens match what the large model would have produced, you accept them all at once. If a token diverges, you reject it (and everything after it) and fall back to the large model\u0026rsquo;s output for that position.\nWhy This Is So Effective At small batch sizes, LLM inference is memory-bandwidth-bound, not compute-bound. The GPU spends most of its time loading model weights from HBM (high bandwidth memory), not doing matrix multiplications. A forward pass that verifies 8 tokens costs nearly the same memory bandwidth as verifying 1 token, so you get multiple accepted tokens for roughly the price of one. The result is typically a 2–3x speedup, and because every accepted token is checked against the target model, the output matches what the target model alone would produce; it\u0026rsquo;s not an approximation.\nThe Catch: Draft Model Quality Matters The speedup depends entirely on how often the draft model\u0026rsquo;s predictions are accepted. If the draft model diverges frequently (low acceptance rate), you\u0026rsquo;re paying the overhead of running two models for minimal gain. 
In practice, a good draft model for a given target model has a 70–85% token acceptance rate, which is where the 2–3x gains come from.\nVariants Worth Knowing Self-speculative decoding — uses the target model itself with early exit layers as the draft, avoiding the need for a separate model Medusa — adds multiple parallel draft \u0026ldquo;heads\u0026rdquo; to a single model, predicting several tokens ahead simultaneously without a separate model EAGLE / EAGLE-2 — uses a featherweight autoregressive head trained specifically to mimic the target model\u0026rsquo;s distribution, achieving higher acceptance rates than standard speculative decoding SpecInfer — optimized for batched serving scenarios where multiple requests are in-flight When It Helps Most Speculative decoding shines in low-batch, latency-sensitive workloads (like interactive chat or copilot features) where you can dedicate resources to a single request. In high-throughput batch scenarios, continuous batching already keeps the GPU saturated, so the gains are less pronounced.\nPractical Implementation If you\u0026rsquo;re self-hosting models with vLLM or SGLang, both support speculative decoding natively. You configure a speculative_model alongside your target model, and the inference engine handles the rest. For hosted APIs, some providers are now baking it in transparently — it\u0026rsquo;s worth checking whether your provider supports it, as it can cut latency noticeably for streaming responses.\nKey Takeaway Speculative decoding offers identical output quality with the same mathematical guarantees, just faster — a genuine free lunch in systems engineering. 
As of April 2026, it typically delivers a 2–3x speedup and is supported natively in vLLM and SGLang.\n","permalink":"https://knowledged.to/notes/ml/speculative-decoding/","summary":"\u003ch1 id=\"speculative-decoding\"\u003eSpeculative Decoding\u003c/h1\u003e\n\u003cp\u003eSpeculative decoding is a clever inference optimization technique that exploits a fundamental asymmetry in how LLMs work: \u003cstrong\u003everifying several candidate tokens in one forward pass costs about the same as generating a single token\u003c/strong\u003e.\u003c/p\u003e\n\u003ch2 id=\"the-basic-setup\"\u003eThe Basic Setup\u003c/h2\u003e\n\u003cp\u003eYou run two models side by side: a small, fast \u0026ldquo;draft\u0026rdquo; model and your large \u0026ldquo;target\u0026rdquo; model. The draft model generates several tokens ahead (typically 4–8), one cheap pass per token. The large model then verifies all of those candidate tokens \u003cem\u003ein parallel\u003c/em\u003e in one forward pass. If the draft tokens match what the large model would have produced, you accept them all at once. If a token diverges, you reject it (and everything after it) and fall back to the large model\u0026rsquo;s output for that position.\u003c/p\u003e","title":"Speculative Decoding"},{"content":"What Are Model Weights in an LLM? Model weights are the learned numbers inside the neural network.\nDuring training, the model adjusts billions of numeric parameters so that, given some input text, it becomes better at predicting the next token. Those parameters are the weights.\nShort Intuition A useful way to think about it:\nThe model architecture is the blueprint. The weights are the filled-in values that make the blueprint useful. Without weights, the model is just an empty structure. 
What Weights Do Weights control how information flows through the network.\nThey determine:\nhow strongly one internal feature affects another which patterns the model has learned from training data how the model transforms input tokens into probabilities for the next token In practice, weights are the model\u0026rsquo;s learned behavior encoded as numbers.\nWhy They Matter When people say a model is \u0026ldquo;7B\u0026rdquo; or \u0026ldquo;70B\u0026rdquo;, they are usually referring to the number of parameters or weights.\nMore weights often mean:\nmore memory usage more computation potentially stronger modeling capacity But more weights also mean the model is heavier to load and run.\nIn Real Systems Like Ollama A model file stored on disk mostly contains these learned weights.\nWhen Ollama loads a model, it is mainly loading those numbers into RAM and sometimes VRAM so inference can begin.\nThat is one major reason the first prompt is often slower: the system has to bring the model\u0026rsquo;s learned parameters into memory before it can generate text.\nTiny Analogy If a neural network is a huge board of adjustable knobs, the weights are the positions of those knobs after training.\nTraining is the process of turning all those knobs until the model becomes good at prediction.\n","permalink":"https://knowledged.to/notes/ml/llm-model-weights/","summary":"\u003ch1 id=\"what-are-model-weights-in-an-llm\"\u003eWhat Are Model Weights in an LLM?\u003c/h1\u003e\n\u003cp\u003eModel weights are the learned numbers inside the neural network.\u003c/p\u003e\n\u003cp\u003eDuring training, the model adjusts billions of numeric parameters so that, given some input text, it becomes better at predicting the next token. 
Those parameters are the weights.\u003c/p\u003e\n\u003ch2 id=\"short-intuition\"\u003eShort Intuition\u003c/h2\u003e\n\u003cp\u003eA useful way to think about it:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eThe model architecture is the blueprint.\u003c/li\u003e\n\u003cli\u003eThe weights are the filled-in values that make the blueprint useful.\u003c/li\u003e\n\u003cli\u003eWithout weights, the model is just an empty structure.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"what-weights-do\"\u003eWhat Weights Do\u003c/h2\u003e\n\u003cp\u003eWeights control how information flows through the network.\u003c/p\u003e","title":"What Are Model Weights in an LLM?"},{"content":"GGUF Models GGUF (GPT-Generated Unified Format) is a binary file format for storing and distributing large language models, designed specifically for efficient local inference.\nBackground Introduced by the llama.cpp project in 2023 as a replacement for the older GGML format. The name reflects its origins but it\u0026rsquo;s now used broadly across many model families beyond GPT.\nKey Characteristics Self-contained — A single .gguf file bundles everything needed to run a model: weights, tokenizer vocabulary, metadata, and architecture config. No separate config files needed.\nQuantization-friendly — GGUF is the go-to format for quantized models. Common quantization levels include Q4_K_M, Q5_K_M, Q8_0, etc. 
Quantization reduces model size and memory requirements by lowering numerical precision (e.g., from 32-bit floats to 4-bit integers), with varying tradeoffs in quality.\nCPU + GPU inference — Unlike formats optimized purely for GPU (like safetensors in training pipelines), GGUF models can run efficiently on CPU, with optional GPU offloading for layers that fit in VRAM.\nMetadata-rich — The format includes a structured key-value metadata section describing the architecture, context length, rope scaling, and more — making it easier for runtimes to load models correctly without external config.\nWhy It Matters It\u0026rsquo;s the dominant format for running open-weight models locally (Llama, Mistral, Phi, Gemma, etc.) using tools like:\nllama.cpp — the reference runtime Ollama — wraps llama.cpp for a Docker-like local model experience LM Studio — GUI for running GGUF models Jan — another local inference UI Typical Filename Anatomy A filename like Meta-Llama-3-8B-Instruct.Q4_K_M.gguf tells you:\nModel family and size: Llama 3, 8B parameters Variant: Instruct-tuned Quantization: Q4_K_M (4-bit, K-quant, medium quality) It\u0026rsquo;s essentially the standard packaging format for the local LLM ecosystem.\n","permalink":"https://knowledged.to/notes/ml/gguf-models/","summary":"\u003ch1 id=\"gguf-models\"\u003eGGUF Models\u003c/h1\u003e\n\u003cp\u003eGGUF (GPT-Generated Unified Format) is a binary file format for storing and distributing large language models, designed specifically for efficient local inference.\u003c/p\u003e\n\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eIntroduced by the \u003cstrong\u003ellama.cpp\u003c/strong\u003e project in 2023 as a replacement for the older GGML format. 
The name reflects its origins but it\u0026rsquo;s now used broadly across many model families beyond GPT.\u003c/p\u003e\n\u003ch2 id=\"key-characteristics\"\u003eKey Characteristics\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eSelf-contained\u003c/strong\u003e — A single \u003ccode\u003e.gguf\u003c/code\u003e file bundles everything needed to run a model: weights, tokenizer vocabulary, metadata, and architecture config. No separate config files needed.\u003c/p\u003e","title":"GGUF Models"},{"content":"Prompt Bias in AI Prompt bias is a type of AI bias that comes from how a question or instruction is written, not just from the model itself.\nIn simple terms:\nThe wording, framing, or assumptions in a prompt can push an AI toward a particular answer—even if that answer isn’t neutral or fully accurate.\nWhat Prompt Bias Looks Like Here’s a quick comparison:\nNeutral prompt:\n“What are the effects of remote work on productivity?”\nBiased prompt:\n“Why does remote work reduce productivity?”\nThe second one already assumes a conclusion. The AI is more likely to justify that assumption instead of questioning it.\nCommon Forms of Prompt Bias 1. Leading Questions Prompts that nudge the AI toward a specific answer.\n“Why is product X better than product Y?” Problem: It presumes superiority instead of evaluating both. 2. Framing Bias The way information is presented influences the output.\n“How dangerous is AI?” vs “What are the risks and benefits of AI?” Same topic, very different outcomes. 3. Assumption Bias The prompt contains hidden or unverified assumptions.\n“Why do most startups fail due to bad leadership?” Reality: failure has many causes. 4. Emotional or Loaded Language Strong wording skews the tone of the response.\n“Why is this terrible policy harmful?” Words like terrible push negativity. 5. Context Injection Bias Selective context leads to skewed answers.\nProviding only negative reviews of a product and asking for a summary. 6. 
Instructional Bias The way you tell the AI to behave affects the output.\n“Argue that…” vs “Analyze both sides…” Why Prompt Bias Matters If you’re building tools, this is where things get real:\nYou control the output more than the model does.\nBad prompts = biased system, even with a good model.\nUser trust gets affected.\nIf your tool subtly nudges answers, people will feel it.\nIt compounds over time.\nEspecially in systems that store/reuse prompts.\nPractical Advice (No fluff) If you’re designing prompts for a system:\n1. Strip assumptions Instead of:\n“Why is Kubernetes hard to use?”\nUse:\n“What challenges do users face with Kubernetes?”\n2. Force balance when needed Explicitly ask:\n“List pros and cons” “Provide multiple perspectives” 3. Separate facts from opinions Ask for:\n“Evidence-based explanation” “Common viewpoints vs fringe viewpoints” 4. Be careful with system prompts This is where hidden bias creeps in:\n“Be helpful and optimistic” → can suppress criticism “Be critical” → can overemphasize negatives 5. 
Test prompts like code Don’t assume they’re neutral.\nTry variations and compare outputs.\nThe Hard Truth Most “AI bias” people complain about is actually prompt bias in disguise.\nIf you’re building an AI product:\nThe model is the engine The prompt is the steering wheel If the steering is off, don’t blame the engine.\n","permalink":"https://knowledged.to/notes/ml/prompt-bias-in-ai/","summary":"\u003ch1 id=\"prompt-bias-in-ai\"\u003ePrompt Bias in AI\u003c/h1\u003e\n\u003cp\u003e\u003cstrong\u003ePrompt bias\u003c/strong\u003e is a type of AI bias that comes from \u003cem\u003ehow a question or instruction is written\u003c/em\u003e, not just from the model itself.\u003c/p\u003e\n\u003cp\u003eIn simple terms:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eThe wording, framing, or assumptions in a prompt can push an AI toward a particular answer—even if that answer isn’t neutral or fully accurate.\u003c/p\u003e\u003c/blockquote\u003e\n\u003chr\u003e\n\u003ch2 id=\"what-prompt-bias-looks-like\"\u003eWhat Prompt Bias Looks Like\u003c/h2\u003e\n\u003cp\u003eHere’s a quick comparison:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003eNeutral prompt:\u003c/strong\u003e\u003cbr\u003e\n“What are the effects of remote work on productivity?”\u003c/p\u003e","title":"Prompt Bias in AI"},{"content":"Primacy Bias in LLM Style Selection What primacy bias is Primacy bias is the tendency of an AI model to give disproportionate weight to items that appear earlier in a list or prompt. 
When a model is asked to choose from many options, options shown first can become over-represented in the final answer even when later options are equally or more appropriate.\nIn practical terms, this means that a selector prompt like:\nstyle-a style-b style-c \u0026hellip; can systematically prefer style-a more often than expected if the candidates are always presented in the same order.\nWhy primacy bias happens The exact internal cause is not always directly observable from application code, but the working theory is well understood:\nLLMs process prompts sequentially, so early items establish the initial frame for the decision. During long candidate-list tasks, the model may anchor on the first few plausible options before it has fully compared the rest. If the prompt shape is repeated across many requests with stable ordering, the bias compounds into a visible production pattern. When the candidate list is long, the model may perform a shallow satisficing search instead of a full global comparison, which makes early acceptable answers more likely to survive. This is not the same as a hard-coded rule in the application. It is an emergent prompt-ordering bias that can show up when model selection is driven by long, ordered candidate lists.\nWhat happened in the BHQ case After BHQ-2033 moved analyze_visual_styles out of ctxms and into agenticms, style selection began using local cached catalogs plus an LLM selector prompt.\nThe relevant behavior at that point was:\nstyle candidates were filtered by playbook compatibility the filtered candidates were sorted alphabetically by style ID the sorted candidate list was passed to the LLM in that same alphabetical order the fallback path also picked the first few candidates from that same sorted list This created two different bias surfaces:\n1. Prompt-order bias in the main LLM selector For the normal selection path, the model was shown candidate styles in alphabetical order. 
Because annotated-statement-typography sorts very early, it often appeared near the top of the prompt. The working theory was that this caused the model to choose it disproportionately often, especially when many candidates were broadly compatible.\n2. Alphabetical bias in the fallback path If the LLM call failed or returned invalid JSON, the fallback logic returned the first three styles from the alphabetically sorted list. In that path, annotated-statement-typography was not just favored by prompt position; it was guaranteed to be in the fallback set whenever it survived filtering.\nConcrete evidence from this incident The incident report that became BHQ-2149 was based on repeated observation that annotated-statement-typography appeared in multiple consecutive style-selection results.\nCode review confirmed:\ncandidate ordering was alphabetical before prompt construction fallback selection was also alphabetical-first annotated-statement-typography had broad playbook tags and frequently survived filtering Additional investigation showed that the style was not uniquely broad, but it was broad enough to stay in the candidate pool for common combinations such as Authority and Transformation. Once it survived filtering, alphabetical ordering pushed it toward the front of the prompt and into the fallback set.\nFixes applied Two fixes were introduced in agenticms:\nCandidate order passed to the LLM is no longer alphabetical. Instead, it is reordered using a deterministic per-request hash, which removes stable lexicographic primacy while keeping debugging reproducible for the same brief.\nFallback selection no longer takes the first three alphabetically sorted candidates. 
It now uses the same deterministic non-lexicographic ordering before choosing the fallback set.\nThis preserves reproducibility while eliminating the structural alphabetical advantage that certain styles had.\nPractical lesson Whenever an LLM is asked to choose from a long set of candidates, candidate order is part of the model behavior surface.\nIf the order is stable and meaningful only to the implementation (for example alphabetical by slug), that order can accidentally become a hidden ranking signal. In production systems, that can look like the model has a \u0026ldquo;favorite\u0026rdquo; option when the real issue is deterministic prompt ordering.\nDesign guidance To reduce primacy bias in future selector prompts:\ndo not present candidates in lexicographic or insertion order unless that order is semantically meaningful prefer deterministic shuffling or another neutral ordering strategy ensure fallback logic does not reintroduce the same ordering bias log both the candidate count and the chosen items so repeated patterns can be detected early audit metadata breadth separately from prompt-order effects; these are related but distinct bias sources ","permalink":"https://knowledged.to/notes/ml/primacy-bias-in-llm-style-selection/","summary":"\u003ch1 id=\"primacy-bias-in-llm-style-selection\"\u003ePrimacy Bias in LLM Style Selection\u003c/h1\u003e\n\u003ch2 id=\"what-primacy-bias-is\"\u003eWhat primacy bias is\u003c/h2\u003e\n\u003cp\u003ePrimacy bias is the tendency of an AI model to give disproportionate weight to items that appear earlier in a list or prompt. 
When a model is asked to choose from many options, options shown first can become over-represented in the final answer even when later options are equally or more appropriate.\u003c/p\u003e\n\u003cp\u003eIn practical terms, this means that a selector prompt like:\u003c/p\u003e","title":"Primacy Bias in LLM Style Selection"},{"content":"Slack MCP Ideas:\nUsing Slack MCP monitor for automation opportunities within the org. Using Slack MCP identify duplicated efforts in the org. ","permalink":"https://knowledged.to/notes/devops/slack-mcp-ideas/","summary":"\u003cp\u003eSlack MCP Ideas:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eUsing Slack MCP monitor for automation opportunities within the org.\u003c/li\u003e\n\u003cli\u003eUsing Slack MCP identify duplicated efforts in the org.\u003c/li\u003e\n\u003c/ul\u003e","title":"Slack MCP Ideas"},{"content":"ELO Scoring for AI Models ELO scoring for AI models works the same way it does in chess — it\u0026rsquo;s a method for ranking competitors based on head-to-head outcomes, where your rating rises or falls depending on whether you beat or lose to opponents of known strength.\nHow it works The core idea: Every model starts with a baseline rating. When two models are compared, the system predicts the expected outcome based on the rating gap. If the actual result matches the prediction, ratings barely move. If an underdog wins, ratings shift dramatically.\nThe expected score formula:\nFor model A vs model B:\nE_A = 1 / (1 + 10^((R_B - R_A) / 400))\nIf A has rating 1200 and B has 1000, A is heavily favored. If A still wins, it gains few points. If B wins, B gains many.\nThe update rule:\nR_new = R_old + K × (Actual − Expected)\nK is a sensitivity constant — higher K means ratings move faster after each match.\nHow this applies to LLMs For AI models, the \u0026ldquo;match\u0026rdquo; is a human preference vote. 
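The expected-score formula and update rule above can be sketched in Python. This is a minimal illustration, not the code any leaderboard actually runs: the function names are hypothetical, and K = 32 is an assumed value (the note only says K is a sensitivity constant).

```python
def expected_score(r_a: float, r_b: float) -> float:
    # E_A = 1 / (1 + 10^((R_B - R_A) / 400))
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def updated_rating(r_old: float, actual: float, expected: float, k: float = 32.0) -> float:
    # R_new = R_old + K * (Actual - Expected); k = 32 is an assumed default.
    return r_old + k * (actual - expected)


# The 1200 vs 1000 example from the text.
e_a = expected_score(1200, 1000)   # favorite, roughly 0.76
e_b = expected_score(1000, 1200)   # underdog, roughly 0.24

gain_if_favorite_wins = updated_rating(1200, 1.0, e_a) - 1200
gain_if_underdog_wins = updated_rating(1000, 1.0, e_b) - 1000
```

Under these assumptions the favorite gains only a few points for an expected win, while the underdog gains roughly three times as many for an upset, which is the asymmetry the text describes.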
Platforms like Chatbot Arena (LMSYS) show users two anonymous model responses to the same prompt and ask: which is better? That vote is the outcome — win, loss, or tie.\nAggregating thousands of these votes produces an ELO leaderboard. The beauty is that you don\u0026rsquo;t need every model to face every other model directly — ELO transitivity fills in the gaps.\nStrengths Handles sparse comparisons — models don\u0026rsquo;t need to be directly compared to be ranked against each other Continuously updatable — new models slot in naturally as votes accumulate Human-grounded — rankings reflect actual human preference, not just benchmark scores Weaknesses Prompt distribution matters — ratings reflect performance on whatever prompts users happen to submit, which may not be representative Voter bias — humans may prefer verbose, confident, or stylistically pleasing answers regardless of correctness Non-stationarity — models get updated, but their ELO history persists, creating staleness Gaming — knowing which prompts end up in Arena could theoretically let labs optimize for them Ties are messy — LLM comparisons often result in \u0026ldquo;both good\u0026rdquo; or \u0026ldquo;both bad,\u0026rdquo; which ELO handles less cleanly than chess In practice Chatbot Arena is the most prominent example, maintaining an ELO leaderboard across dozens of models. 
It\u0026rsquo;s become a widely cited signal for overall model quality precisely because it captures something benchmark suites miss: whether real users actually prefer one model over another.\n","permalink":"https://knowledged.to/notes/ml/elo-scoring-for-ai-models/","summary":"\u003ch1 id=\"elo-scoring-for-ai-models\"\u003eELO Scoring for AI Models\u003c/h1\u003e\n\u003cp\u003eELO scoring for AI models works the same way it does in chess — it\u0026rsquo;s a method for ranking competitors based on head-to-head outcomes, where your rating rises or falls depending on whether you beat or lose to opponents of known strength.\u003c/p\u003e\n\u003ch2 id=\"how-it-works\"\u003eHow it works\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eThe core idea:\u003c/strong\u003e Every model starts with a baseline rating. When two models are compared, the system predicts the expected outcome based on the rating gap. If the actual result matches the prediction, ratings barely move. If an underdog wins, ratings shift dramatically.\u003c/p\u003e","title":"ELO Scoring for AI Models"},{"content":"Distillation in AI (also called knowledge distillation) is a model compression technique where a smaller \u0026ldquo;student\u0026rdquo; model is trained to mimic the behavior of a larger, more capable \u0026ldquo;teacher\u0026rdquo; model.\nHow it works\nInstead of training the student on hard labels (e.g., \u0026ldquo;this image is a cat\u0026rdquo;), the student learns from the teacher\u0026rsquo;s soft outputs — the probability distribution the teacher assigns across all classes. These soft outputs carry richer information. For example, knowing a model thinks an image is 70% cat, 20% leopard, and 10% tiger tells the student more about the underlying structure than just \u0026ldquo;cat.\u0026rdquo;\nWhy it matters\nLarge models are expensive to run. 
Distillation lets you compress their \u0026ldquo;knowledge\u0026rdquo; into a smaller model that:\nIs faster and cheaper to serve Uses less memory Often performs surprisingly close to the original Common applications\nEdge deployment — running models on phones or IoT devices SpecDecoding — a large model verifies outputs from a smaller draft model to speed up inference LLM training — newer, smaller models trained on outputs from larger frontier models (e.g., DeepSeek\u0026rsquo;s R1 distilled variants were trained on reasoning traces from a larger model) Task-specific compression — fine-tuning a general large model into a small specialist A nuance worth knowing\nThere\u0026rsquo;s a distinction between distillation from logits (the raw probability outputs) versus distillation from reasoning traces or chain-of-thought — the latter is more common in modern LLM work, where the student learns to replicate the teacher\u0026rsquo;s step-by-step reasoning rather than just final token probabilities.\nIn short: distillation is how the AI field takes big expensive models and squeezes their capabilities into small, deployable ones.\n","permalink":"https://knowledged.to/notes/ml/knowledge-distillation/","summary":"\u003cp\u003eDistillation in AI (also called \u003cstrong\u003eknowledge distillation\u003c/strong\u003e) is a model compression technique where a smaller \u0026ldquo;student\u0026rdquo; model is trained to mimic the behavior of a larger, more capable \u0026ldquo;teacher\u0026rdquo; model.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHow it works\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eInstead of training the student on hard labels (e.g., \u0026ldquo;this image is a cat\u0026rdquo;), the student learns from the teacher\u0026rsquo;s \u003cem\u003esoft outputs\u003c/em\u003e — the probability distribution the teacher assigns across all classes. These soft outputs carry richer information. 
For example, knowing a model thinks an image is 70% cat, 20% leopard, and 10% tiger tells the student more about the underlying structure than just \u0026ldquo;cat.\u0026rdquo;\u003c/p\u003e","title":"Knowledge Distillation"},{"content":"Training-Free GRPO — One-Page Summary Paper: Training-Free Group Relative Policy Optimization By: Youtu-Agent Team Publication date: October 9, 2025\nThe Problem Fine-tuning LLMs with reinforcement learning (RL) to improve agent performance in specialized domains is expensive, data-hungry, prone to overfitting, and kills cross-domain generalization. Most RL approaches are limited to sub-32B models due to compute constraints.\nThe Core Idea Instead of updating model parameters (gradient-based RL), Training-Free GRPO updates model context — building an evolving library of experiential knowledge that gets injected into the prompt. The model weights stay frozen throughout.\nHow It Works The method mirrors vanilla GRPO\u0026rsquo;s structure but replaces gradient updates with context updates:\nRollout — For each training query, generate G parallel outputs using the frozen LLM conditioned on the current experience library E Reward — Score each output with a reward model (same as standard GRPO) Semantic Advantage — Instead of computing a numerical advantage for gradient ascent, the LLM summarizes each trajectory, then compares winners vs. losers to extract natural-language \u0026ldquo;lessons learned\u0026rdquo; — the semantic advantage Optimization — The experience library E is updated via Add / Delete / Modify / Keep operations based on these lessons. In the next epoch, the enriched E guides better outputs This repeats for 3 epochs over ~100 training samples.\nResults Applied to DeepSeek-V3.1-Terminus (671B) on AIME math benchmarks and WebWalkerQA web search:\nMethod AIME24 AIME25 Cost ReAct baseline 80.0% 67.9% — + Training-Free GRPO 82.7% 73.3% ~$18 ReTool (RL-trained 32B) 67.0% 49.3% ~$10,000 Key Advantages\nCost: ~$18 vs. 
~$10,000 for comparable RL fine-tuning Data: 100 samples vs. 17,000+ Generalization: Swapping in a different experience library gives strong performance across both math and web search simultaneously — something parameter-tuned specialists can\u0026rsquo;t do No infrastructure: Works with any frozen API-based model, no dedicated GPU cluster needed Limitations Effectiveness depends on the underlying model\u0026rsquo;s baseline capability — results on weaker models (e.g., QwQ-32B on web tasks) were mixed or negative, suggesting a capable base model is a prerequisite.\nTL;DR Training-Free GRPO shows that for sufficiently capable LLMs, you can get RL-like performance gains by teaching the prompt rather than the parameters — at a fraction of the cost.\n","permalink":"https://knowledged.to/notes/ml/training-free-grpo/","summary":"\u003ch1 id=\"training-free-grpo--one-page-summary\"\u003eTraining-Free GRPO — One-Page Summary\u003c/h1\u003e\n\u003cp\u003ePaper: Training-Free Group Relative Policy Optimization\nBy: Youtu-Agent Team\nPublication date: October 9, 2025\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eThe Problem\u003c/strong\u003e\nFine-tuning LLMs with reinforcement learning (RL) to improve agent performance in specialized domains is expensive, data-hungry, prone to overfitting, and kills cross-domain generalization. Most RL approaches are limited to sub-32B models due to compute constraints.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eThe Core Idea\u003c/strong\u003e\nInstead of updating model \u003cem\u003eparameters\u003c/em\u003e (gradient-based RL), Training-Free GRPO updates model \u003cem\u003econtext\u003c/em\u003e — building an evolving library of experiential knowledge that gets injected into the prompt. 
The model weights stay frozen throughout.\u003c/p\u003e","title":"Training-Free GRPO"},{"content":"Attention in AI Attention is a mechanism that allows a model to focus on the most relevant parts of its input when producing an output — much like how humans pay more attention to certain words or objects in a scene than others.\nThe Core Idea Instead of treating all parts of the input equally, attention assigns weights to different elements, so the model can dynamically decide what\u0026rsquo;s important for each step of its task.\nA Simple Example Consider translating: \u0026ldquo;The cat sat on the mat\u0026rdquo; → French.\nWhen generating the word for \u0026ldquo;cat\u0026rdquo;, the model should focus heavily on \u0026ldquo;cat\u0026rdquo; and less on \u0026ldquo;mat\u0026rdquo;. Attention lets it do exactly that.\nHow It Works (Self-Attention) For each token (word/piece) in a sequence, attention computes three vectors:\nQuery (Q) — \u0026ldquo;What am I looking for?\u0026rdquo; Key (K) — \u0026ldquo;What do I contain?\u0026rdquo; Value (V) — \u0026ldquo;What information do I provide?\u0026rdquo; The attention score between tokens is computed as:\nAttention(Q, K, V) = softmax(QKᵀ / √d) · V\nThis produces a weighted sum of values, where tokens most relevant to each other get higher weights.\nTypes of Attention Type Description Self-attention Each token attends to all other tokens in the same sequence Cross-attention Tokens in one sequence attend to tokens in another (e.g., encoder → decoder) Multi-head attention Multiple attention operations run in parallel, each learning different relationships Causal/masked attention Tokens can only attend to previous tokens (used in GPT-style models) Why It Matters Attention is the foundation of the Transformer architecture (introduced in the 2017 paper \u0026ldquo;Attention Is All You Need\u0026rdquo;), which powers virtually all modern AI language models — GPT, Claude, Gemini, BERT, etc.\nBefore attention, models like RNNs 
had to compress entire sequences into a single vector, losing long-range context. Attention solved this by letting every token directly relate to every other token, regardless of distance.\nIn Short Attention = a learned, dynamic weighting system that tells the model where to look and what matters most at each step of reasoning or generation.\n","permalink":"https://knowledged.to/notes/ml/attention-mechanism/","summary":"\u003ch2 id=\"attention-in-ai\"\u003eAttention in AI\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eAttention\u003c/strong\u003e is a mechanism that allows a model to focus on the most relevant parts of its input when producing an output — much like how humans pay more attention to certain words or objects in a scene than others.\u003c/p\u003e\n\u003ch3 id=\"the-core-idea\"\u003eThe Core Idea\u003c/h3\u003e\n\u003cp\u003eInstead of treating all parts of the input equally, attention assigns \u003cstrong\u003eweights\u003c/strong\u003e to different elements, so the model can dynamically decide what\u0026rsquo;s important for each step of its task.\u003c/p\u003e","title":"Attention Mechanism"},{"content":"The Transformer architecture, introduced in the 2017 paper \u0026ldquo;Attention Is All You Need\u0026rdquo; by Vaswani et al., revolutionized AI by replacing recurrent networks with a purely attention-based design. Here\u0026rsquo;s a breakdown of how it works:\nCore Idea: Self-Attention Instead of processing sequences step-by-step (like RNNs), Transformers process all tokens in parallel and learn relationships between every pair of tokens simultaneously. 
This is done via self-attention.\nFor each token, three vectors are computed:\nQuery (Q) — what this token is looking for Key (K) — what this token offers Value (V) — the actual content to pass forward Attention is computed as: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V\nThe scaling by √d_k prevents dot products from growing too large in high dimensions.\nMulti-Head Attention Rather than one attention computation, the model runs multiple attention heads in parallel — each learning different types of relationships (e.g., syntactic, semantic, co-reference). Their outputs are concatenated and projected.\nPositional Encoding Since there\u0026rsquo;s no recurrence, the model has no inherent sense of order. Positional encodings (sinusoidal functions or learned embeddings) are added to token embeddings to inject sequence position information.\nThe Encoder-Decoder Structure The original Transformer had two stacks:\nEncoder (used in models like BERT):\nInput embeddings + positional encoding Multi-head self-attention Feed-forward network (FFN) Layer norm + residual connections around each sub-layer Decoder (used in models like GPT):\nSame as encoder, but adds masked self-attention (tokens can only attend to past tokens) Cross-attention layer — attends to the encoder\u0026rsquo;s output FFN + layer norm + residuals Modern LLMs like GPT are decoder-only; models like BERT are encoder-only.\nFeed-Forward Network (FFN) After attention, each position passes through a small 2-layer MLP independently: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂\nThis adds non-linearity and expands the model\u0026rsquo;s representational capacity.\nWhy Transformers Won Property RNNs/LSTMs Transformers Parallelism Sequential Fully parallel Long-range dependencies Struggles Handles natively Training speed Slow Fast (on GPUs/TPUs) Scalability Limited Scales to billions of params Key Variants BERT — Encoder-only, trained with masked language modeling; great for classification and understanding tasks. 
GPT — Decoder-only, trained autoregressively; great for generation. T5 / BART — Full encoder-decoder; great for seq2seq tasks like translation and summarization. Vision Transformer (ViT) — Applies the same architecture to image patches instead of text tokens. The Transformer\u0026rsquo;s combination of parallelism, expressiveness, and scalability is what enabled the modern LLM era.\n","permalink":"https://knowledged.to/notes/ml/transformer-architecture/","summary":"\u003cp\u003eThe Transformer architecture, introduced in the 2017 paper \u003cem\u003e\u0026ldquo;Attention Is All You Need\u0026rdquo;\u003c/em\u003e by Vaswani et al., revolutionized AI by replacing recurrent networks with a purely attention-based design. Here\u0026rsquo;s a breakdown of how it works:\u003c/p\u003e\n\u003ch2 id=\"core-idea-self-attention\"\u003eCore Idea: Self-Attention\u003c/h2\u003e\n\u003cp\u003eInstead of processing sequences step-by-step (like RNNs), Transformers process all tokens in parallel and learn \u003cem\u003erelationships between every pair of tokens\u003c/em\u003e simultaneously. 
This is done via \u003cstrong\u003eself-attention\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eFor each token, three vectors are computed:\u003c/p\u003e","title":"Transformer Architecture"},{"content":"RNN stands for Recurrent Neural Network — a type of neural network designed to work with sequential data.\nUnlike standard feedforward networks, RNNs have a \u0026ldquo;memory\u0026rdquo; mechanism: they pass information from one step to the next, making them well-suited for tasks where order and context matter, like text, speech, or time-series data.\nThe key idea is that at each step, the network takes both the current input and a hidden state from the previous step, producing an output and an updated hidden state.\nCommon variants:\nLSTM (Long Short-Term Memory) — solves the vanishing gradient problem, better at capturing long-range dependencies GRU (Gated Recurrent Unit) — a simpler, faster alternative to LSTM Typical use cases: language modeling, machine translation, speech recognition, sentiment analysis, and time-series forecasting.\nThat said, RNNs have largely been superseded by Transformers in most NLP tasks, since Transformers handle long-range dependencies more effectively and parallelize much better during training.\n","permalink":"https://knowledged.to/notes/ml/recurrent-neural-networks/","summary":"\u003cp\u003eRNN stands for \u003cstrong\u003eRecurrent Neural Network\u003c/strong\u003e — a type of neural network designed to work with sequential data.\u003c/p\u003e\n\u003cp\u003eUnlike standard feedforward networks, RNNs have a \u0026ldquo;memory\u0026rdquo; mechanism: they pass information from one step to the next, making them well-suited for tasks where order and context matter, like text, speech, or time-series data.\u003c/p\u003e\n\u003cp\u003eThe key idea is that at each step, the network takes both the current input \u003cem\u003eand\u003c/em\u003e a hidden state from the previous step, producing an output and an updated hidden 
state.\u003c/p\u003e","title":"Recurrent Neural Networks (RNNs)"},{"content":"RLHF and DPO: Aligning AI to Human Preferences Both techniques address the same core problem: after pre-training on raw text, a language model needs to be steered toward responses that are helpful, safe, and aligned with what humans actually want. They\u0026rsquo;re two different approaches to the same goal.\nRLHF — Reinforcement Learning from Human Feedback The idea: Train a separate model to predict what humans prefer, then use that model as a reward signal to fine-tune the LLM via RL.\nThe pipeline:\nSupervised Fine-Tuning (SFT): Start with the base LLM and fine-tune it on a curated set of high-quality prompt-response pairs to get a reasonable baseline. Reward Model Training: Human annotators are shown pairs of model responses and asked which one is better. These preferences train a separate \u0026ldquo;reward model\u0026rdquo; (RM) that learns to score any response. RL Optimization: The LLM is then optimized using PPO (Proximal Policy Optimization) — an RL algorithm — to generate responses that maximize the reward model\u0026rsquo;s score, while a KL-divergence penalty keeps it from drifting too far from the SFT baseline. Strengths:\nProven at scale (used by InstructGPT, early ChatGPT, Gemini) Can capture nuanced human preferences Weaknesses:\nComplex, brittle pipeline — three separate models to train PPO is notoriously unstable and compute-intensive Reward hacking: the LLM can learn to \u0026ldquo;game\u0026rdquo; the reward model without actually being better Requires significant infrastructure and careful tuning DPO — Direct Preference Optimization The idea: Skip the reward model entirely. Mathematically reformulate the RLHF objective so the LLM itself is the reward model — optimized directly from preference data.\nThe insight: The optimal policy under the RLHF objective has a closed-form relationship to the reward function. 
DPO (Rafailov et al., 2023) exploits this to derive a loss function that works directly on preference pairs (chosen vs. rejected responses), without ever training a separate RM or running RL.\nThe pipeline:\nSFT baseline (same as RLHF) Preference data — the same human-labeled pairs (chosen/rejected), but fed directly into a classification-style loss on the LLM The loss function intuitively: Increase the relative likelihood of the chosen response over the rejected one, scaled by how confidently the reference model (SFT baseline) distinguishes them.\nStrengths:\nMuch simpler — one training stage, standard supervised loss More stable training, lower compute cost No reward hacking (no separate RM to game) Increasingly competitive with RLHF at smaller scales Weaknesses:\nRequires high-quality offline preference data (no online exploration) Can underperform RLHF on very complex tasks where exploration matters Sensitive to the quality and distribution of the preference dataset Side-by-Side Comparison RLHF DPO Reward model Explicit, separately trained Implicit, baked into LLM loss Training algorithm PPO (RL) Supervised cross-entropy-style loss Complexity High (3 models, RL loop) Low (1 model, 1 loss) Stability Fragile Stable Compute cost High Lower Online exploration Yes No (offline only) Used by InstructGPT, early ChatGPT Llama 3, Mistral, many open models Where Things Stand DPO has become the dominant approach in open-source alignment (Llama 3, Mistral, Phi, etc.) because of its simplicity. However, frontier labs (OpenAI, Google DeepMind) still use variants of RLHF — often combining offline DPO-style methods with online RL for the best of both worlds. 
Hybrid approaches like RLHF with DPO initialization or online DPO are active research areas.\n","permalink":"https://knowledged.to/notes/ml/rlhf-and-dpo/","summary":"\u003ch2 id=\"rlhf-and-dpo-aligning-ai-to-human-preferences\"\u003eRLHF and DPO: Aligning AI to Human Preferences\u003c/h2\u003e\n\u003cp\u003eBoth techniques address the same core problem: after pre-training on raw text, a language model needs to be \u003cem\u003esteered\u003c/em\u003e toward responses that are helpful, safe, and aligned with what humans actually want. They\u0026rsquo;re two different approaches to the same goal.\u003c/p\u003e\n\u003chr\u003e\n\u003ch3 id=\"rlhf--reinforcement-learning-from-human-feedback\"\u003eRLHF — Reinforcement Learning from Human Feedback\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eThe idea:\u003c/strong\u003e Train a separate model to \u003cem\u003epredict what humans prefer\u003c/em\u003e, then use that model as a reward signal to fine-tune the LLM via RL.\u003c/p\u003e","title":"RLHF and DPO: Aligning AI to Human Preferences"},{"content":"Instruction tuning is a fine-tuning technique where a pre-trained language model is further trained on a dataset of (instruction, response) pairs to make it better at following natural language instructions.\nHow it works A base language model trained on raw text is good at predicting the next token, but not necessarily at being helpful. Instruction tuning bridges that gap by showing the model thousands to millions of examples like:\nInstruction: \u0026ldquo;Summarize this article in 3 bullet points.\u0026rdquo; Response: \u0026ldquo;• Point 1 …\u0026rdquo; The model learns to map user intent → useful output.\nKey ideas Dataset construction — Examples cover a wide range of tasks: summarization, translation, Q\u0026amp;A, coding, reasoning, creative writing, etc. 
Diversity is crucial so the model generalizes rather than overfits to a narrow task type.\nFormat — Each example typically has a system prompt, a user instruction, and the expected assistant response. This is why models respond well to the chat-style format you\u0026rsquo;re using right now.\nScale matters — Research (e.g., FLAN, InstructGPT) showed that even a relatively small number of high-quality instruction examples can dramatically improve a model\u0026rsquo;s ability to generalize to unseen instructions.\nVariants worth knowing Technique What it adds RLHF (Reinforcement Learning from Human Feedback) Human raters rank responses; a reward model is trained on those rankings and used to further fine-tune RLAIF Same idea but using AI feedback instead of human raters Direct Preference Optimization (DPO) Skips the reward model; optimizes preferences directly, simpler to train Why it matters Before instruction tuning, getting useful output from a large model required careful prompt engineering and the model still often \u0026ldquo;completed\u0026rdquo; your prompt rather than \u0026ldquo;answering\u0026rdquo; it. Instruction tuning is what makes models feel like assistants rather than autocomplete engines.\nGPT and Claude are a product of this kind of training pipeline — constitutional AI and RLHF-style techniques built on top of a pre-trained base model.\n","permalink":"https://knowledged.to/notes/ml/instruction-tuning/","summary":"\u003cp\u003eInstruction tuning is a fine-tuning technique where a pre-trained language model is further trained on a dataset of (instruction, response) pairs to make it better at following natural language instructions.\u003c/p\u003e\n\u003ch2 id=\"how-it-works\"\u003eHow it works\u003c/h2\u003e\n\u003cp\u003eA base language model trained on raw text is good at predicting the next token, but not necessarily at being \u003cem\u003ehelpful\u003c/em\u003e. 
Instruction tuning bridges that gap by showing the model thousands to millions of examples like:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eInstruction:\u003c/strong\u003e \u0026ldquo;Summarize this article in 3 bullet points.\u0026rdquo;\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eResponse:\u003c/strong\u003e \u0026ldquo;• Point 1 …\u0026rdquo;\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe model learns to map user intent → useful output.\u003c/p\u003e","title":"Instruction Tuning"},{"content":"Perplexity in Language Models Perplexity measures how well a probability model predicts a sample of text. Intuitively, it captures how \u0026ldquo;surprised\u0026rdquo; or \u0026ldquo;perplexed\u0026rdquo; a model is when it encounters new text — a lower perplexity means the model found the text more predictable, i.e., it\u0026rsquo;s a better model.\nThe Core Idea A language model assigns a probability to every sequence of words. Given a test sentence, the model predicts the probability of each next word given all preceding words:\nP(w₁, w₂, \u0026hellip;, wₙ) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · \u0026hellip; · P(wₙ|w₁,\u0026hellip;,wₙ₋₁)\nA good model assigns high probability to real, natural text.\nThe Formula Perplexity (PPL) is defined as the exponentiated average negative log-likelihood per token:\nPPL = exp(-(1/N) · Σ log P(wᵢ | w₁, \u0026hellip;, wᵢ₋₁)) 
Where:\nN is the number of tokens in the test set P(wᵢ | \u0026hellip;) is the model\u0026rsquo;s predicted probability for token i given its context The sum is over all tokens You can think of it as the geometric mean of the inverse probabilities the model assigned to each token — essentially, the effective vocabulary size the model is \u0026ldquo;choosing from\u0026rdquo; at each step.\nIntuition with an Extreme Example Scenario PPL Model always predicts the next word perfectly 1 (ideal) Model assigns uniform probability over a 50k vocab 50,000 (random) A well-trained GPT-class model on English ~10–30 A perplexity of 10 means the model behaves as if it\u0026rsquo;s choosing uniformly among 10 equally likely words at each step.\nWhy It\u0026rsquo;s Useful Comparable across runs: It normalizes for sequence length, so you can compare models evaluated on the same test set. Cheap to compute: No human raters needed — just run the model over a held-out corpus. Sensitive to model improvements: Small gains in log-likelihood show up clearly as PPL reductions. Key Caveats Vocabulary dependence: Perplexity is not directly comparable across models with different tokenizers or vocabularies. A model with a larger vocabulary can have a higher per-token PPL even if it\u0026rsquo;s subjectively better. Doesn\u0026rsquo;t capture all quality dimensions: A model can have low perplexity but still produce factually wrong, incoherent, or harmful outputs. PPL measures fluency/predictability, not truthfulness or usefulness. Test set matters: PPL is only meaningful on data the model hasn\u0026rsquo;t seen during training. Evaluating on training data gives artificially low (overfit) numbers. Compression perspective: Minimizing perplexity is mathematically equivalent to minimizing cross-entropy loss, which is exactly what language model training optimizes. So PPL on a held-out set is essentially a measure of how well training generalized. 
Relationship to Cross-Entropy Perplexity and cross-entropy loss H are directly related:\n\nPPL = 2^H (H in bits) or PPL = e^H (H in nats)\nThis is why training loss curves and perplexity curves have the same shape — one is just an exponentiation of the other. Reporting PPL instead of raw loss is simply a more interpretable scale for humans.\nIn short, perplexity gives you a single number that summarizes how confidently and accurately a model predicts real text — making it a standard first-pass benchmark, even though it\u0026rsquo;s always supplemented by task-specific evaluations (MMLU, HumanEval, etc.) for a fuller picture.\n","permalink":"https://knowledged.to/notes/ml/perplexity-in-language-models/","summary":"\u003ch2 id=\"perplexity-in-language-models\"\u003ePerplexity in Language Models\u003c/h2\u003e\n\u003cp\u003ePerplexity measures how well a probability model predicts a sample of text. Intuitively, it captures how \u0026ldquo;surprised\u0026rdquo; or \u0026ldquo;perplexed\u0026rdquo; a model is when it encounters new text — a lower perplexity means the model found the text more predictable, i.e., it\u0026rsquo;s a better model.\u003c/p\u003e\n\u003ch3 id=\"the-core-idea\"\u003eThe Core Idea\u003c/h3\u003e\n\u003cp\u003eA language model assigns a probability to every sequence of words. Given a test sentence, the model predicts the probability of each next word given all preceding words:\u003c/p\u003e","title":"Perplexity in Language Models"},{"content":"Model quantization is the process of reducing the numerical precision of a neural network\u0026rsquo;s weights (and sometimes activations) to make models smaller and faster, with acceptable loss in accuracy.\nThe core idea Neural networks store parameters as floating-point numbers — typically 32-bit floats (float32). Quantization maps these to lower-precision representations like 16-bit floats, 8-bit integers, or even 4-bit integers. 
Fewer bits per number means less memory and faster arithmetic.\nCommon precision levels Format Bits Typical use float32 32 Training baseline bfloat16 / float16 16 Training \u0026amp; inference on GPUs int8 8 Efficient inference int4 / int3 / int2 4 or less Aggressive compression (LLMs) How it works Post-training quantization (PTQ) takes a trained model and converts its weights after the fact. It\u0026rsquo;s fast and simple but can hurt accuracy at very low bit depths.\nQuantization-aware training (QAT) simulates low-precision arithmetic during training, so the model learns to be robust to quantization error. This produces better accuracy but requires a full training run.\nThe mapping process works roughly like this: given a range of float values, you find the min/max, divide the range into discrete steps, and map each float to the nearest step. A scale factor and zero-point are stored per tensor (or per channel) to reverse the mapping during computation.\nWhy it matters for LLMs Large language models have billions of parameters. A 70B parameter model in float32 would require ~280 GB of memory — far beyond a single GPU. Quantizing to int4 brings that down to ~35 GB, making local inference feasible. Techniques like GGUF (used by llama.cpp) and GPTQ/AWQ are purpose-built for LLM quantization with minimal perplexity degradation.\nThe trade-offs Memory — fewer bits means the model fits in less RAM/VRAM Speed — integer arithmetic is faster than floating point on most hardware; also more data fits in cache Accuracy — lower precision introduces rounding error; some layers (like the first and last) are more sensitive and are often kept at higher precision Outliers — transformer activations can have extreme outlier values that make quantization harder; methods like SmoothQuant and GPTQ account for this A mental model Think of it like image compression: a raw photo has full color depth, but a compressed JPEG still looks fine at a fraction of the size. 
Quantization does the same for model weights — you\u0026rsquo;re trading a bit of fidelity for a lot of practical efficiency.\n","permalink":"https://knowledged.to/notes/ml/model-quantization/","summary":"\u003cp\u003eModel quantization is the process of reducing the numerical precision of a neural network\u0026rsquo;s weights (and sometimes activations) to make models smaller and faster, with acceptable loss in accuracy.\u003c/p\u003e\n\u003ch2 id=\"the-core-idea\"\u003eThe core idea\u003c/h2\u003e\n\u003cp\u003eNeural networks store parameters as floating-point numbers — typically 32-bit floats (\u003ccode\u003efloat32\u003c/code\u003e). Quantization maps these to lower-precision representations like 16-bit floats, 8-bit integers, or even 4-bit integers. Fewer bits per number means less memory and faster arithmetic.\u003c/p\u003e\n\u003ch2 id=\"common-precision-levels\"\u003eCommon precision levels\u003c/h2\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eFormat\u003c/th\u003e\n          \u003cth\u003eBits\u003c/th\u003e\n          \u003cth\u003eTypical use\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e\u003ccode\u003efloat32\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003e32\u003c/td\u003e\n          \u003ctd\u003eTraining baseline\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e\u003ccode\u003ebfloat16\u003c/code\u003e / \u003ccode\u003efloat16\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003e16\u003c/td\u003e\n          \u003ctd\u003eTraining \u0026amp; inference on GPUs\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e\u003ccode\u003eint8\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003e8\u003c/td\u003e\n          \u003ctd\u003eEfficient inference\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          
\u003ctd\u003e\u003ccode\u003eint4\u003c/code\u003e / \u003ccode\u003eint3\u003c/code\u003e / \u003ccode\u003eint2\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003e4 or less\u003c/td\u003e\n          \u003ctd\u003eAggressive compression (LLMs)\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003ch2 id=\"how-it-works\"\u003eHow it works\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003ePost-training quantization (PTQ)\u003c/strong\u003e takes a trained model and converts its weights after the fact. It\u0026rsquo;s fast and simple but can hurt accuracy at very low bit depths.\u003c/p\u003e","title":"Model Quantization"},{"content":"gcloud Commands To login gcloud auth login Application Default Login gcloud auth application-default login Project Set gcloud config set project buddyhq-prd GKE cluster context set dev:\ngcloud container clusters get-credentials web-dev --zone us-central1-f --project buddyhq prd:\ngcloud container clusters get-credentials cntrlplane-gke-sz-prd --zone us-central1-c --project buddyhq-prd","permalink":"https://knowledged.to/notes/devops/gcloud-quick-reference/","summary":"\u003ch1 id=\"gcloud-commands\"\u003egcloud Commands\u003c/h1\u003e\n\u003ch2 id=\"to-login\"\u003eTo login\u003c/h2\u003e\n\n\n\n\u003cdiv class=\"goat svg-container \"\u003e\n  \n    \u003csvg\n      xmlns=\"http://www.w3.org/2000/svg\"\n      font-family=\"Menlo,Lucida Console,monospace\"\n      \n        viewBox=\"0 0 144 25\"\n      \u003e\n      \u003cg transform='translate(8,16)'\u003e\n\u003ctext text-anchor='middle' x='0' y='4' fill='currentColor' 
style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'\u003eh\u003c/text\u003e\n\u003ctext text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'\u003eg\u003c/text\u003e\n\u003ctext text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'\u003ei\u003c/text\u003e\n\u003ctext text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'\u003en\u003c/text\u003e\n\u003c/g\u003e\n\n    \u003c/svg\u003e\n  \n\u003c/div\u003e\n\u003ch2 id=\"application-default-login\"\u003eApplication Default Login\u003c/h2\u003e\n\n\n\n\u003cdiv class=\"goat svg-container \"\u003e\n  \n    \u003csvg\n      xmlns=\"http://www.w3.org/2000/svg\"\n      font-family=\"Menlo,Lucida Console,monospace\"\n      \n        viewBox=\"0 0 304 25\"\n      \u003e\n      \u003cg transform='translate(8,16)'\u003e\n\u003ctext text-anchor='middle' x='0' y='4' fill='currentColor' 
style='font-size:1em'\u003eg\u003c/text\u003e\n\u003ctext text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'\u003eh\u003c/text\u003e\n\u003ctext text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'\u003ep\u003c/text\u003e\n\u003ctext text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'\u003ep\u003c/text\u003e\n\u003ctext text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'\u003ei\u003c/text\u003e\n\u003ctext text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='160' 
y='4' fill='currentColor' style='font-size:1em'\u003ei\u003c/text\u003e\n\u003ctext text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='176' y='4' fill='currentColor' style='font-size:1em'\u003en\u003c/text\u003e\n\u003ctext text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'\u003e-\u003c/text\u003e\n\u003ctext text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'\u003ee\u003c/text\u003e\n\u003ctext text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'\u003ef\u003c/text\u003e\n\u003ctext text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='224' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='232' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='240' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='256' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='264' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='272' y='4' fill='currentColor' style='font-size:1em'\u003eg\u003c/text\u003e\n\u003ctext text-anchor='middle' x='280' y='4' fill='currentColor' style='font-size:1em'\u003ei\u003c/text\u003e\n\u003ctext text-anchor='middle' x='288' y='4' fill='currentColor' style='font-size:1em'\u003en\u003c/text\u003e\n\u003c/g\u003e\n\n    \u003c/svg\u003e\n  \n\u003c/div\u003e\n\u003ch2 id=\"project-set\"\u003eProject Set\u003c/h2\u003e\n\n\n\n\u003cdiv class=\"goat svg-container \"\u003e\n  \n    \u003csvg\n      
xmlns=\"http://www.w3.org/2000/svg\"\n      font-family=\"Menlo,Lucida Console,monospace\"\n      \n        viewBox=\"0 0 304 25\"\n      \u003e\n      \u003cg transform='translate(8,16)'\u003e\n\u003ctext text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'\u003eg\u003c/text\u003e\n\u003ctext text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'\u003en\u003c/text\u003e\n\u003ctext text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'\u003ef\u003c/text\u003e\n\u003ctext text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'\u003ei\u003c/text\u003e\n\u003ctext text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'\u003eg\u003c/text\u003e\n\u003ctext text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'\u003es\u003c/text\u003e\n\u003ctext text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'\u003ee\u003c/text\u003e\n\u003ctext text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'\u003ep\u003c/text\u003e\n\u003ctext 
text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'\u003er\u003c/text\u003e\n\u003ctext text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'\u003ej\u003c/text\u003e\n\u003ctext text-anchor='middle' x='176' y='4' fill='currentColor' style='font-size:1em'\u003ee\u003c/text\u003e\n\u003ctext text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'\u003eb\u003c/text\u003e\n\u003ctext text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='224' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='232' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='240' y='4' fill='currentColor' style='font-size:1em'\u003ey\u003c/text\u003e\n\u003ctext text-anchor='middle' x='248' y='4' fill='currentColor' style='font-size:1em'\u003eh\u003c/text\u003e\n\u003ctext text-anchor='middle' x='256' y='4' fill='currentColor' style='font-size:1em'\u003eq\u003c/text\u003e\n\u003ctext text-anchor='middle' x='264' y='4' fill='currentColor' style='font-size:1em'\u003e-\u003c/text\u003e\n\u003ctext text-anchor='middle' x='272' y='4' fill='currentColor' style='font-size:1em'\u003ep\u003c/text\u003e\n\u003ctext text-anchor='middle' x='280' y='4' fill='currentColor' style='font-size:1em'\u003er\u003c/text\u003e\n\u003ctext text-anchor='middle' x='288' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003c/g\u003e\n\n    \u003c/svg\u003e\n  \n\u003c/div\u003e\n\u003ch2 
id=\"gke-cluster-context-set\"\u003eGKE cluster context set\u003c/h2\u003e\n\u003cp\u003e\u003ccode\u003edev\u003c/code\u003e:\u003c/p\u003e","title":"GCloud Quick Reference"},{"content":"Kubernetes Port Forward Use this command:\nkubectl port-forward svc/temporal-web 8080:8080 The first port is the localhost port, the second the service port.\n","permalink":"https://knowledged.to/notes/devops/kubernetes-port-forward/","summary":"\u003ch1 id=\"kubernetes-port-forward\"\u003eKubernetes Port Forward\u003c/h1\u003e\n\u003cp\u003eUse this command:\u003c/p\u003e\n\n\n\n\u003cdiv class=\"goat svg-container \"\u003e\n  \n    \u003csvg\n      xmlns=\"http://www.w3.org/2000/svg\"\n      font-family=\"Menlo,Lucida Console,monospace\"\n      \n        viewBox=\"0 0 384 25\"\n      \u003e\n      \u003cg transform='translate(8,16)'\u003e\n\u003ctext text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'\u003ek\u003c/text\u003e\n\u003ctext text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'\u003eu\u003c/text\u003e\n\u003ctext text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'\u003eb\u003c/text\u003e\n\u003ctext text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'\u003ee\u003c/text\u003e\n\u003ctext text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'\u003ep\u003c/text\u003e\n\u003ctext text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='80' y='4' fill='currentColor' 
style='font-size:1em'\u003er\u003c/text\u003e\n\u003ctext text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'\u003e-\u003c/text\u003e\n\u003ctext text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'\u003ef\u003c/text\u003e\n\u003ctext text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'\u003er\u003c/text\u003e\n\u003ctext text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'\u003ew\u003c/text\u003e\n\u003ctext text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'\u003er\u003c/text\u003e\n\u003ctext text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'\u003ed\u003c/text\u003e\n\u003ctext text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'\u003es\u003c/text\u003e\n\u003ctext text-anchor='middle' x='176' y='4' fill='currentColor' style='font-size:1em'\u003ev\u003c/text\u003e\n\u003ctext text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'\u003ec\u003c/text\u003e\n\u003ctext text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'\u003e/\u003c/text\u003e\n\u003ctext text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'\u003et\u003c/text\u003e\n\u003ctext text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'\u003ee\u003c/text\u003e\n\u003ctext text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'\u003em\u003c/text\u003e\n\u003ctext text-anchor='middle' x='224' y='4' fill='currentColor' style='font-size:1em'\u003ep\u003c/text\u003e\n\u003ctext text-anchor='middle' 
x='232' y='4' fill='currentColor' style='font-size:1em'\u003eo\u003c/text\u003e\n\u003ctext text-anchor='middle' x='240' y='4' fill='currentColor' style='font-size:1em'\u003er\u003c/text\u003e\n\u003ctext text-anchor='middle' x='248' y='4' fill='currentColor' style='font-size:1em'\u003ea\u003c/text\u003e\n\u003ctext text-anchor='middle' x='256' y='4' fill='currentColor' style='font-size:1em'\u003el\u003c/text\u003e\n\u003ctext text-anchor='middle' x='264' y='4' fill='currentColor' style='font-size:1em'\u003e-\u003c/text\u003e\n\u003ctext text-anchor='middle' x='272' y='4' fill='currentColor' style='font-size:1em'\u003ew\u003c/text\u003e\n\u003ctext text-anchor='middle' x='280' y='4' fill='currentColor' style='font-size:1em'\u003ee\u003c/text\u003e\n\u003ctext text-anchor='middle' x='288' y='4' fill='currentColor' style='font-size:1em'\u003eb\u003c/text\u003e\n\u003ctext text-anchor='middle' x='304' y='4' fill='currentColor' style='font-size:1em'\u003e8\u003c/text\u003e\n\u003ctext text-anchor='middle' x='312' y='4' fill='currentColor' style='font-size:1em'\u003e0\u003c/text\u003e\n\u003ctext text-anchor='middle' x='320' y='4' fill='currentColor' style='font-size:1em'\u003e8\u003c/text\u003e\n\u003ctext text-anchor='middle' x='328' y='4' fill='currentColor' style='font-size:1em'\u003e0\u003c/text\u003e\n\u003ctext text-anchor='middle' x='336' y='4' fill='currentColor' style='font-size:1em'\u003e:\u003c/text\u003e\n\u003ctext text-anchor='middle' x='344' y='4' fill='currentColor' style='font-size:1em'\u003e8\u003c/text\u003e\n\u003ctext text-anchor='middle' x='352' y='4' fill='currentColor' style='font-size:1em'\u003e0\u003c/text\u003e\n\u003ctext text-anchor='middle' x='360' y='4' fill='currentColor' style='font-size:1em'\u003e8\u003c/text\u003e\n\u003ctext text-anchor='middle' x='368' y='4' fill='currentColor' style='font-size:1em'\u003e0\u003c/text\u003e\n\u003c/g\u003e\n\n    \u003c/svg\u003e\n  \n\u003c/div\u003e\n\u003cp\u003eThe first port is the localhost 
port, the second the service port.\u003c/p\u003e","title":"Kubernetes Port Forward"}]