Top-K in RAG Search

In Retrieval-Augmented Generation (RAG), top-k is the number of most relevant document chunks the retriever returns from the vector store for a given query. The “k” is just an integer you choose: top-3, top-5, top-10, and so on.

How it works

Embed the query into a vector
Run a similarity search (cosine, dot product, etc.) against indexed chunks
The retriever ranks chunks by similarity score (exactly, or approximately when the store uses an approximate nearest-neighbor index)
Top-k says “give me the k highest-scoring ones”
Those chunks get stuffed into the LLM’s context as grounding material before generation
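
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular vector store's API: the names cosine_similarity, top_k, and index are hypothetical, the 2-d vectors are made-up stand-ins for real embeddings, and a brute-force scan replaces whatever index a real store would use.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the k highest-scoring (score, chunk_text) pairs, best first.

    `index` is a list of (chunk_text, chunk_vector) pairs.
    """
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy 2-d "embeddings", made up for illustration only.
index = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns and refunds", [0.8, 0.6]),
]
query_vec = [1.0, 0.0]  # pretend this came from the embedding model

results = top_k(query_vec, index, k=2)
# Join the selected chunks into the grounding text placed in the prompt.
context = "\n\n".join(text for _, text in results)
```

With k=2, the "shipping times" chunk is dropped and only the two refund-related chunks reach the prompt.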

Choosing k — the tradeoff

Too low (k=1, 2):

Risk missing relevant context
If the answer is split across multiple chunks, or the best chunk wasn’t ranked #1, you’re stuck

Too high (k=20+):

Dilutes the signal with marginally-relevant chunks
Burns context window and tokens
Can actually hurt answer quality: research shows LLMs get distracted by irrelevant context and use information buried in the middle of a long prompt less reliably (the “lost in the middle” problem)
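
The token cost of a larger k is simple arithmetic. The sketch below assumes equally sized chunks and a fixed overhead for the system prompt and question; the function names and all the numbers are hypothetical, chosen only to show how quickly the prompt grows.

```python
def prompt_tokens(k, tokens_per_chunk, overhead_tokens=0):
    """Rough prompt size: k equally sized chunks plus fixed overhead."""
    return k * tokens_per_chunk + overhead_tokens


def max_k(context_budget, tokens_per_chunk, overhead_tokens=0):
    """Largest k whose chunks still fit in the budget after overhead."""
    return max(0, (context_budget - overhead_tokens) // tokens_per_chunk)


# Hypothetical numbers: 400-token chunks, 300 tokens of instructions.
small = prompt_tokens(k=5, tokens_per_chunk=400, overhead_tokens=300)
large = prompt_tokens(k=20, tokens_per_chunk=400, overhead_tokens=300)
fits = max_k(context_budget=4000, tokens_per_chunk=400, overhead_tokens=300)
```

Under these assumptions, going from k=5 to k=20 grows the prompt from 2,300 to 8,300 tokens, and a 4,000-token budget fits at most 9 chunks.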

Typical values

Defaults are usually k=3 to k=10, depending on chunk size and task
Common pattern: pair with a reranker
- Stage 1: retrieve top-k=20 with cheap vector similarity (high recall)
- Stage 2: rerank with a cross-encoder, keep top 3-5 for the final prompt (high precision)
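
The two-stage pattern might look like the sketch below. The rerank_score argument is a hypothetical stand-in for a cross-encoder: here it is filled by a toy word-overlap function so the example runs without any model. The candidate chunks and their vector scores are invented for illustration.

```python
def retrieve_then_rerank(query, candidates, rerank_score, first_k=20, final_k=3):
    """Two-stage retrieval.

    `candidates` is a list of (chunk_text, vector_score) pairs.
    `rerank_score(query, chunk_text)` stands in for a cross-encoder.
    """
    # Stage 1: keep the first_k best by cheap vector score (favors recall).
    stage1 = sorted(candidates, key=lambda c: c[1], reverse=True)[:first_k]
    # Stage 2: rescore each survivor against the query, keep final_k
    # (favors precision).
    rescored = [(rerank_score(query, text), text) for text, _ in stage1]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in rescored[:final_k]]


def word_overlap(query, text):
    """Toy reranker: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


candidates = [
    ("shipping takes five days", 0.82),
    ("refund requests need a receipt", 0.80),
    ("our office is closed on sunday", 0.78),
    ("a refund takes ten days to process", 0.75),
]
top = retrieve_then_rerank(
    "how long does a refund take to process",
    candidates,
    word_overlap,
    first_k=4,
    final_k=2,
)
```

In this toy run, the chunk with the lowest vector score ("a refund takes ten days to process") ends up first after reranking, which is the reason for retrieving wide and then narrowing.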

Related knob: similarity threshold

Some retrievers also expose a similarity threshold: drop anything below a score cutoff regardless of rank. Useful when “no relevant context” is a valid outcome and you don’t want to force k chunks when none are actually good.
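
A threshold combined with top-k can be sketched as follows. The function name filter_by_threshold and the 0.75 cutoff are hypothetical; a sensible cutoff depends on the embedding model and similarity metric, so treat the value as a placeholder to tune.

```python
def filter_by_threshold(scored_chunks, k, min_score):
    """Keep at most k chunks, dropping any that score below min_score.

    `scored_chunks` is a list of (score, chunk_text) pairs, higher is better.
    Returns an empty list when nothing clears the cutoff.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [(s, t) for s, t in ranked[:k] if s >= min_score]


# Invented scores for illustration.
hits = [(0.91, "refund policy"), (0.42, "office hours"), (0.38, "careers page")]
kept = filter_by_threshold(hits, k=3, min_score=0.75)
none_good = filter_by_threshold(hits[1:], k=3, min_score=0.75)
```

Here kept holds only the "refund policy" chunk even though k=3, and none_good is empty, which lets the caller fall back to a "no relevant context" answer instead of padding the prompt.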

Quick reference

k value	Use case
1-2	High-precision lookup, short context budgets
3-5	Most common production default
10-20	First stage before reranking
20+	Almost always pair with a reranker
