LLM | knowledged.to

LoRA (Low-Rank Adaptation) in AI

A new section for Fine-Tuning Techniques is created to hold the LoRA document, and the LLM Architecture section is de-duplicated.

Function Calling Support in LLM Models

Explains function calling (tool use) in LLMs: how models emit structured requests to invoke external functions, the request-execute-return loop, provider support, and practical reliability notes.

What is Speculative Decoding?

Explains speculative decoding, which pairs a small draft model with a large target model to accelerate LLM inference without changing outputs.

Mixture of Experts in AI

Explains sparse Mixture-of-Experts (MoE) architecture with conditional computation, router/gate mechanisms, load balancing, and trade-offs vs. dense models.

Local + Frontier Model Collaboration Patterns in Open Source Harnesses

As local LLMs improve, harnesses are learning to pair them with frontier models. A look at the four collaboration patterns already shipping in open source.

RLVR vs. the Agent Loop: Training-Time vs. Inference-Time

Distinguishes RLVR as training-time weight updates from inference-time agent verification loops.

The Modern LLM Training Pipeline

Explains the four-stage modern LLM training pipeline from pre-training through verifiable-reward RL.

Jailbreaking LLMs: A Security Researcher's Field Guide

Field guide to LLM jailbreaking attack surfaces, threat modeling, defenses, and responsible disclosure.

Adaptive Thinking in Claude Models

Explains Claude adaptive thinking, effort levels, fixed-budget deprecation, and hidden reasoning display.

AdapTime: Adaptive Temporal Reasoning in LLMs

Paper summary of AdapTime, an adaptive planner for temporal reasoning in LLMs.

Agent Harness Six Components (E, T, C, S, L, V) — Survey Summary

Summarizes the six-component agent harness model: execution (E), tools (T), context (C), state (S), hooks (L), and evaluation (V).

TurboQuant

Explains TurboQuant, a rotation-based vector quantization method for KV-cache compression and vector search.

ARIS: Multi-Agent Reliability Patterns

Patterns from ARIS for reliable multi-agent research using adversarial review, audits, and persistent memory.

CompactRAG

Explains CompactRAG, a multi-hop RAG method using offline atomic QA pairs and fixed two-call inference.

Graph-based Agent Memory

Explains graph-based memory for LLM agents, including taxonomy, GAM consolidation, and hybrid retrieval.

LLM Thinking Token Budgets

Explains thinking-token budget parameters, provider naming, cost-latency tradeoffs, and completion-cap interactions.

LLM Prompt Cache Options Across Providers

Compares prompt/KV cache TTLs, controls, pricing, scope, and strategies across major LLM providers.

LLM Prompt Caching: Implicit vs Explicit

Explains implicit vs explicit LLM prompt caching, prefix constraints, provider support, and when to use each.

Why LLM Caching Is Only for Input Tokens

Explains why LLM prompt caching applies to reusable input-token prefill, not sequential output decoding.