LLM as Judge

Using a language model to evaluate the outputs of another model (or itself) instead of relying on humans or rigid automated metrics like BLEU/ROUGE/exact-match. Give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.

Why it exists

For open-ended generation (chat answers, code explanations, summaries, agent traces), string-overlap metrics don’t capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on a lot of tasks, so they’re now the default evaluator in MT-Bench, G-Eval, and many internal eval pipelines. Chatbot Arena is the human counterpart: its rankings come from crowdsourced pairwise human votes, and the MT-Bench paper used those votes to check how closely LLM judges agree with people.

It’s also the “AI feedback” half of RLAIF and Constitutional AI: the judge produces the preference signal that would otherwise come from a human, which then trains a reward model or feeds directly into DPO. A judge score can likewise serve as the reward in GRPO. This is why it shows up next to Elo: Chatbot Arena collects pairwise human votes and turns them into a ranking with Elo-style ratings, and the same machinery applies when the pairwise votes come from an LLM judge.
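
To make the Elo step concrete, here is a minimal sketch of the standard Elo update applied to a list of pairwise outcomes. The K-factor of 32, the starting rating of 1000, and the function names are illustrative assumptions, not values taken from Chatbot Arena or any specific leaderboard.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k=32 is an assumed K-factor, chosen only for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_models(battles: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Fold a sequence of (model_a, model_b, score_a) verdicts into ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = elo_update(ra, rb, score_a, k)
    return ratings
```

Because this update is sequential, the final ratings depend on the order of the battles; shuffling and averaging over several passes is one common way to reduce that sensitivity.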

Common shapes

  • Single-answer grading — judge scores one response against a rubric (1–10, pass/fail, criterion-wise).
  • Pairwise — judge picks A or B given the same prompt. Cleaner signal, pairs well with Elo.
  • Reference-based vs reference-free — with or without a gold answer to compare against.
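
The shapes above differ mostly in how the prompt is assembled and how the verdict is parsed. The sketch below shows one possible way to do both; the prompt wording, the 'Score:' and 'Winner:' output conventions, and all function names are assumptions made for illustration, and the actual call to a judge model is left out.

```python
import re
from typing import Optional


def single_answer_prompt(question: str, answer: str, rubric: str,
                         reference: Optional[str] = None) -> str:
    """Single-answer grading prompt; pass `reference` for reference-based grading."""
    parts = [f"Rubric: {rubric}", f"Question: {question}"]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append(f"Response: {answer}")
    parts.append("Explain briefly, then end with a line 'Score: <1-10>'.")
    return "\n\n".join(parts)


def pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise prompt; the judge is asked to name a winner or a tie."""
    return "\n\n".join([
        f"Question: {question}",
        f"Answer A: {answer_a}",
        f"Answer B: {answer_b}",
        "Explain briefly, then end with 'Winner: A', 'Winner: B', or 'Winner: tie'.",
    ])


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'Score: n' in the output, or None if missing or out of range."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None


def parse_winner(judge_output: str) -> Optional[str]:
    """Return 'A', 'B', or 'tie' from the last 'Winner: ...' line, or None."""
    matches = re.findall(r"Winner:\s*(A|B|tie)\b", judge_output, flags=re.IGNORECASE)
    if not matches:
        return None
    last = matches[-1].lower()
    return "tie" if last == "tie" else last.upper()
```

Returning None on an unparseable verdict, rather than guessing, keeps malformed judge outputs visible so they can be counted or retried.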

Known biases

  • Position bias — favors the first option in pairwise.
  • Verbosity bias — longer answers score higher.
  • Self-preference bias — a model rates its own outputs higher.
  • Style over substance — confident, well-formatted, wrong answers beat hedged correct ones.
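
Position bias is the easiest of these to measure directly: run every pair in both orders and see how often the verdict follows the slot rather than the answer. The sketch below assumes a hypothetical judge callable that takes (question, first_answer, second_answer) and returns "first", "second", or "tie"; the function name and report keys are made up for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical judge interface: returns "first", "second", or "tie".
PairJudge = Callable[[str, str, str], str]

_FLIP = {"first": "second", "second": "first", "tie": "tie"}


def position_bias_report(judge: PairJudge,
                         items: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Judge each (question, answer_x, answer_y) in both orders and summarize.

    first_rate / second_rate: share of all verdicts that went to each slot.
    consistency: share of items where both orders prefer the same answer.
    """
    n = first_picks = second_picks = consistent = 0
    for question, ans_x, ans_y in items:
        v1 = judge(question, ans_x, ans_y)  # x shown first
        v2 = judge(question, ans_y, ans_x)  # y shown first
        n += 1
        first_picks += (v1 == "first") + (v2 == "first")
        second_picks += (v1 == "second") + (v2 == "second")
        # A consistent judge flips its slot when the answers swap slots.
        consistent += (v1 == _FLIP.get(v2))
    if n == 0:
        raise ValueError("no items to judge")
    return {
        "first_rate": first_picks / (2 * n),
        "second_rate": second_picks / (2 * n),
        "consistency": consistent / n,
    }
```

An unbiased judge should show roughly equal first and second rates and a consistency near 1.0; a first_rate well above second_rate points to a first-slot preference.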

Mitigations

  • Swap positions and average across both orderings.
  • Use a judge that’s stronger than the models being evaluated.
  • Force the judge to produce chain-of-thought reasoning before the score.
  • Ensemble multiple judges.
  • Calibrate against a small human-labeled set so you know how much to trust the numbers.
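
The first and fourth mitigations combine naturally. A rough sketch, assuming the same hypothetical judge interface (a callable taking question, first answer, second answer and returning "first", "second", or "tie"): each judge votes in both orderings, and an ensemble takes the majority of the per-judge verdicts. Function names are illustrative.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: returns "first", "second", or "tie".
PairJudge = Callable[[str, str, str], str]


def debiased_verdict(judge: PairJudge, question: str,
                     answer_a: str, answer_b: str) -> str:
    """Judge in both orders and combine; returns "A", "B", or "tie".

    If the two orderings contradict each other, the votes cancel
    and the result is a tie.
    """
    v1 = judge(question, answer_a, answer_b)  # A shown first
    v2 = judge(question, answer_b, answer_a)  # B shown first
    a_votes = (v1 == "first") + (v2 == "second")
    b_votes = (v1 == "second") + (v2 == "first")
    if a_votes > b_votes:
        return "A"
    if b_votes > a_votes:
        return "B"
    return "tie"


def ensemble_verdict(judges: Sequence[PairJudge], question: str,
                     answer_a: str, answer_b: str) -> str:
    """Majority vote over position-debiased verdicts; ties count as abstentions."""
    votes = Counter(debiased_verdict(j, question, answer_a, answer_b) for j in judges)
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "tie"
```

A stricter variant would declare a winner only when both orderings agree outright; this version also counts a win in one ordering plus a tie in the other as a win. Either way the cost is two judge calls per pair per judge.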

Mental model

LLM-as-judge is a cheap, scalable, biased estimator of human preference. Useful as a gradient during iteration; dangerous if you treat its absolute scores as ground truth.
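
One way to put a number on "how biased" is to compare judge labels with a small human-labeled set. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) by hand; the function name is illustrative, and what counts as an acceptable kappa is a judgment call that depends on the task.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def agreement_and_kappa(judge_labels: Sequence[Hashable],
                        human_labels: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own marginal rates.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1.0 - expected)
```

For example, judge labels A, A, B, B against human labels A, B, B, B agree on 3 of 4 items (0.75), but chance agreement is already 0.5, so kappa is 0.5.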

Related

  • Elo scoring for AI evaluation (Chatbot Arena ranking)
  • RLAIF, Constitutional AI
  • DPO, GRPO (consume the preference or reward signals the judge produces)
  • MT-Bench, G-Eval