Top-K in RAG Search

In Retrieval-Augmented Generation (RAG), top-k is the number of highest-scoring document chunks the retriever returns from the vector store for a given query. The “k” is just an integer you choose: top-3, top-5, top-10, and so on.

How it works

  1. Embed the query into a vector
  2. Run a similarity search (cosine, dot product, etc.) against the indexed chunks
  3. The retriever ranks chunks by similarity score (many vector stores use approximate nearest-neighbor search, so the ranking is close to, but not always identical to, an exhaustive scan)
  4. Top-k keeps the k highest-scoring chunks
  5. Those chunks are inserted into the LLM’s context as grounding material before generation
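
The steps above can be sketched in a few lines. This is a minimal brute-force illustration, not how a production vector store works: the function names (cosine_similarity, top_k), the 3-dimensional toy "embeddings", and the chunk texts are all made up for the example, and a real system would get its vectors from an embedding model.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k):
    """Return the k (score, text) pairs with the highest similarity, best first.

    `chunks` is a list of (text, vector) pairs. Every chunk is scored,
    which is fine for a toy example but too slow for a large index.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy vectors, invented for illustration only.
chunks = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping times by region", [0.1, 0.9, 0.0]),
    ("How to request a refund", [0.8, 0.2, 0.1]),
    ("Company history", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded query

results = top_k(query_vec, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With k=2 the two refund-related chunks come back, best first; raising k to 3 would also pull in the shipping chunk, which scores far lower.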

Choosing k — the tradeoff

Too low (k=1, 2):

  • Risk missing relevant context
  • If the answer is split across multiple chunks, or the best chunk wasn’t ranked #1, you’re stuck

Too high (k=20+):

  • Dilutes the signal with marginally-relevant chunks
  • Burns context window and tokens
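  • Can actually hurt answer quality: LLMs can be distracted by irrelevant passages, and they tend to use information in the middle of a long context less reliably than information near the start or end (the “lost in the middle” problem)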
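
The token cost of a larger k is simple arithmetic, sketched below. The helper names (prompt_tokens, max_k_for_budget) and all the numbers (500-token chunks, an 8,000-token budget, 1,000 tokens of overhead for instructions and the question) are hypothetical values chosen for illustration; plug in your own chunk size and model limits.

```python
def prompt_tokens(k, tokens_per_chunk, overhead_tokens=0):
    """Approximate tokens spent on retrieved context plus fixed overhead."""
    return k * tokens_per_chunk + overhead_tokens


def max_k_for_budget(budget_tokens, tokens_per_chunk, overhead_tokens=0):
    """Largest k whose chunks fit in the budget after fixed overhead."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    available = budget_tokens - overhead_tokens
    if available <= 0:
        return 0
    return available // tokens_per_chunk


# Hypothetical numbers: 500-token chunks.
print(prompt_tokens(5, 500))    # context cost at k=5
print(prompt_tokens(20, 500))   # context cost at k=20
print(max_k_for_budget(8000, 500, overhead_tokens=1000))
```

Under these assumptions, going from k=5 to k=20 quadruples the context cost (2,500 to 10,000 tokens), and an 8,000-token budget caps k at 14.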

Typical values

  • Defaults are usually k=3 to k=10, depending on chunk size and task
  • Common pattern: pair with a reranker
    • Stage 1: retrieve top-k=20 with cheap vector similarity (high recall)
    • Stage 2: rerank with a cross-encoder, keep top 3-5 for the final prompt (high precision)
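
A rough sketch of the two-stage pattern follows. Everything here is an assumption made for illustration: first_stage and rerank are hypothetical helpers, the 2-dimensional vectors are invented, and overlap_score is only a placeholder for a real cross-encoder (which would be a model that scores each query/chunk pair jointly). The point is the shape of the pipeline, not the scoring itself.

```python
def first_stage(query_vec, index, recall_k):
    """Cheap stage: rank chunks by dot product, keep recall_k candidates."""
    ranked = sorted(
        index,
        key=lambda item: sum(q * v for q, v in zip(query_vec, item[1])),
        reverse=True,
    )
    return [text for text, _ in ranked[:recall_k]]


def rerank(query, candidates, score_pair, final_k):
    """Expensive stage: rescore each (query, chunk) pair, keep final_k."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:final_k]]


def overlap_score(query, text):
    """Placeholder for a cross-encoder: fraction of query words in the chunk."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Toy index of (text, vector) pairs, invented for illustration.
index = [
    ("refunds are accepted within 30 days", [0.9, 0.1]),
    ("request a refund from the orders page", [0.8, 0.3]),
    ("our refund team won an award", [0.85, 0.2]),
    ("shipping takes 5 days", [0.1, 0.9]),
]
query = "how do i request a refund"
query_vec = [1.0, 0.0]  # stands in for the embedded query

candidates = first_stage(query_vec, index, recall_k=3)
final = rerank(query, candidates, overlap_score, final_k=2)
print(candidates)
print(final)
```

In this toy run the chunk about requesting a refund is only third after the vector stage, but the second-stage scorer moves it to first place. That is the reason to retrieve generously and then cut down: the first stage only has to get the right chunk somewhere in the candidate set.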

Some retrievers also expose a similarity threshold — drop anything below a score cutoff regardless of rank. Useful when “no relevant context” is a valid outcome and you don’t want to force k chunks when none are actually good.
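
Combining a score cutoff with top-k might look like the sketch below. The function name top_k_with_threshold, the scores, and the cutoffs of 0.5 and 0.9 are hypothetical; a sensible threshold depends on the embedding model and similarity metric, so it usually has to be tuned on real queries.

```python
def top_k_with_threshold(scored_chunks, k, min_score):
    """Keep at most k chunks, and only those scoring at least min_score.

    `scored_chunks` is a list of (score, text) pairs in any order.
    The result may be shorter than k, or empty.
    """
    kept = [pair for pair in scored_chunks if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Made-up similarity scores for four chunks.
scored = [(0.82, "A"), (0.41, "B"), (0.78, "C"), (0.12, "D")]

print(top_k_with_threshold(scored, k=3, min_score=0.5))
print(top_k_with_threshold(scored, k=3, min_score=0.9))
```

With a cutoff of 0.5 only two chunks survive even though k=3, and with a cutoff of 0.9 the result is empty, which the caller can treat as “no relevant context” instead of passing weak chunks to the LLM.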

Quick reference

k value   Use case
1-2       High-precision lookup, short context budgets
3-5       Most common production default
10-20     First stage before reranking
20+       Almost always pair with a reranker