Top-K in RAG Search
In Retrieval-Augmented Generation (RAG), top-k is the number of most relevant document chunks the retriever returns from the vector store for a given query. The “k” is literally just a number — top-3, top-5, top-10, etc.
How it works
- Embed the query into a vector
- Run a similarity search (cosine, dot product, etc.) against indexed chunks
- The retriever ranks chunks by similarity score (exactly on small indexes, approximately when the store uses an ANN index)
- Top-k says “give me the k highest-scoring ones”
- Those chunks get stuffed into the LLM’s context as grounding material before generation
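The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular vector store implements it: the 3-dimensional vectors are invented stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names. It uses an exact brute-force scan, whereas production stores typically use an approximate index.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k):
    """Return the k (score, text) pairs with the highest similarity.

    `chunks` is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy "embeddings", made up for illustration only.
chunks = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("return window", [0.8, 0.2, 0.1]),
    ("company history", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]  # pretend this is the embedded user query

results = top_k(query, chunks, k=2)

# The selected chunks are joined into the grounding context for the prompt.
context = "\n\n".join(text for _, text in results)
```

With k=2, only the two chunks closest to the query vector end up in `context`; raising k would pull in the weaker matches as well.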
Choosing k — the tradeoff
Too low (k=1, 2):
- Risk missing relevant context
- If the answer is split across multiple chunks, or the best chunk wasn’t ranked #1, the model never sees what it needs
Too high (k=20+, all passed straight to the LLM):
- Dilutes the signal with marginally-relevant chunks
- Burns context window and tokens
- Can actually hurt answer quality: irrelevant chunks distract the model, and in long prompts LLMs tend to use information in the middle less reliably than information at the start or end (the “lost in the middle” problem)
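The token cost side of this tradeoff is simple arithmetic. The sketch below assumes every chunk is roughly the same size; the chunk size and budget figures are made-up examples, and both function names are hypothetical.

```python
def context_tokens(k, tokens_per_chunk):
    """Approximate tokens consumed by k retrieved chunks."""
    return k * tokens_per_chunk


def max_k_for_budget(budget_tokens, tokens_per_chunk):
    """Largest k whose chunks still fit inside the token budget."""
    return budget_tokens // tokens_per_chunk


# Example figures only: 500-token chunks, 4000 tokens reserved for context.
cost_at_k20 = context_tokens(20, 500)
largest_k = max_k_for_budget(4000, 500)
```

Under these assumed numbers, k=20 costs about 10,000 tokens of context, while a 4,000-token budget caps k at 8.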
Typical values
- Defaults are usually k=3 to k=10, depending on chunk size and task
- Common pattern: pair with a reranker
  - Stage 1: retrieve top-k=20 with cheap vector similarity (high recall)
  - Stage 2: rerank with a cross-encoder, keep top 3-5 for the final prompt (high precision)
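The two-stage pattern might look roughly like this. It is a sketch under stated assumptions: `retrieve_then_rerank` is a hypothetical helper, the stage-1 scores are invented, and `word_overlap` is a toy stand-in for a real cross-encoder, which would score each (query, chunk) pair with a model instead.

```python
import heapq


def retrieve_then_rerank(query, scored_chunks, rerank_score, first_k=20, final_k=3):
    """Two-stage retrieval over (text, cheap_score) pairs.

    `rerank_score(query, text)` is any callable returning a relevance score.
    """
    # Stage 1: keep a wide candidate set using the cheap similarity score.
    candidates = heapq.nlargest(first_k, scored_chunks, key=lambda pair: pair[1])
    # Stage 2: rescore each candidate with the more expensive scorer.
    reranked = [(text, rerank_score(query, text)) for text, _ in candidates]
    return heapq.nlargest(final_k, reranked, key=lambda pair: pair[1])


def word_overlap(query, text):
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# (text, vector similarity) pairs; the scores are invented for illustration.
scored_chunks = [
    ("our refund policy lasts 30 days", 0.71),
    ("shipping takes five days", 0.74),
    ("how to request a refund", 0.69),
    ("company founded in 2010", 0.20),
]

final = retrieve_then_rerank(
    "refund policy", scored_chunks, word_overlap, first_k=3, final_k=2
)
```

In this toy data the chunk with the highest vector score ("shipping takes five days") survives stage 1 but is dropped in stage 2, which is the point of reranking: the first stage only has to be good enough to get the right chunks into the candidate set.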
Related knob: similarity threshold
Some retrievers also expose a similarity threshold — drop anything below a score cutoff regardless of rank. Useful when “no relevant context” is a valid outcome and you don’t want to force k chunks when none are actually good.
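A threshold combined with top-k could be sketched as follows. The function name `select_chunks`, the 0.5 cutoff, and the scores are all illustrative assumptions; a sensible cutoff depends on the embedding model and similarity metric in use.

```python
def select_chunks(scored, k, min_score):
    """Return at most k (text, score) pairs whose score is >= min_score."""
    kept = [(text, score) for text, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Invented similarity scores for an on-topic and an off-topic query.
on_topic = [("refund policy", 0.82), ("return window", 0.78), ("shipping times", 0.31)]
off_topic = [("refund policy", 0.22), ("company history", 0.18)]

good = select_chunks(on_topic, k=3, min_score=0.5)
empty = select_chunks(off_topic, k=3, min_score=0.5)
```

For the off-topic query the result is an empty list rather than three weak chunks, so the caller can fall back to answering without context or tell the user nothing relevant was found.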
Quick reference
| k value | Use case |
|---|---|
| 1-2 | High-precision lookup, short context budgets |
| 3-5 | Most common production default |
| 10-20 | First stage before reranking |
| 20+ | Almost always pair with a reranker |