LLM as Judge

LLM-as-judge means using a language model to evaluate the outputs of another model (or its own) instead of relying on humans or rigid automated metrics like BLEU/ROUGE/exact-match. You give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.

Why it exists

For open-ended generation (chat answers, code explanations, summaries, agent traces), string-overlap metrics don’t capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on a lot of tasks, so they’re now the default evaluator in MT-Bench, G-Eval, and most internal eval pipelines. Chatbot Arena itself ranks models from crowdsourced human votes; LLM judges are the cheaper stand-in for that kind of signal.

It’s also the “AI feedback” half of RLAIF and Constitutional AI: the judge produces the preference signal that would otherwise come from a human, which then feeds into DPO or GRPO. This is why it shows up next to Elo: pairwise judgments, whether they come from an LLM judge or from human voters as in Chatbot Arena, can be converted into a ranking with Elo-style ratings.
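
As a rough illustration of how pairwise judgments become a ranking, here is a minimal sketch of the classic online Elo update. The model names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions, not values taken from any particular leaderboard, and production leaderboards may fit ratings differently.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Apply one judgment. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses what A gains


def rank_from_judgments(judgments, initial=1000.0, k=32.0):
    """judgments: iterable of (model_a, model_b, outcome) tuples."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in judgments:
        update_elo(ratings, model_a, model_b, outcome, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical judgments, e.g. produced by a pairwise judge.
judgments = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 1.0),
    ("model_x", "model_z", 1.0),
    ("model_x", "model_y", 0.5),
]
ranking = rank_from_judgments(judgments)
```

Online Elo depends on the order in which judgments arrive, so with few comparisons it is worth shuffling the judgments and checking that the ranking is stable.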

Common shapes

Single-answer grading: the judge scores one response against a rubric (1–10, pass/fail, or criterion by criterion).
Pairwise: the judge picks A or B (optionally a tie) for the same prompt. Cleaner signal, pairs well with Elo.
Reference-based vs reference-free: with or without a gold answer to compare against.
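
The shapes above mostly differ in how the prompt is assembled and how the judge's reply is parsed. The sketch below shows one possible layout; the section labels, the `Score:` / `Verdict:` output convention, and all function names are assumptions made for illustration, not a standard format. The actual call to the judge model is left out on purpose.

```python
import re


def build_single_prompt(question, answer, rubric, reference=None):
    """Single-answer grading; passing `reference` makes it reference-based."""
    parts = [
        f"[Question]\n{question}",
        f"[Answer]\n{answer}",
        f"[Rubric]\n{rubric}",
    ]
    if reference is not None:
        parts.append(f"[Reference answer]\n{reference}")
    parts.append("Rate the answer from 1 to 10. End with a line 'Score: <n>'.")
    return "\n\n".join(parts)


def build_pairwise_prompt(question, answer_a, answer_b, reference=None):
    """Pairwise comparison of two answers to the same question."""
    parts = [
        f"[Question]\n{question}",
        f"[Answer A]\n{answer_a}",
        f"[Answer B]\n{answer_b}",
    ]
    if reference is not None:
        parts.append(f"[Reference answer]\n{reference}")
    parts.append(
        "Decide which answer is better. "
        "End with a line 'Verdict: A', 'Verdict: B', or 'Verdict: tie'."
    )
    return "\n\n".join(parts)


def parse_score(reply):
    """Return the 1-10 score from the final line, or None if missing/invalid."""
    match = re.search(r"Score:\s*(\d+)\s*$", reply.strip())
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def parse_verdict(reply):
    """Return 'A', 'B', or 'tie' from the final line, or None if missing."""
    match = re.search(r"Verdict:\s*(A|B|tie)\s*$", reply.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    verdict = match.group(1).lower()
    return "tie" if verdict == "tie" else verdict.upper()
```

Returning `None` for unparseable replies, rather than guessing, keeps malformed judge outputs visible so they can be counted or retried.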

Known biases

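Position bias: the verdict depends on the order the options are shown in, often favoring the first option in pairwise.
Verbosity bias: longer answers score higher, even when the extra length adds nothing.
Self-preference bias: a model rates its own outputs higher than others’ outputs of similar quality.
Style over substance: confident, well-formatted, wrong answers beat hedged correct ones.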
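
Two of these biases can be roughly checked from judgments you have already collected. The sketch below assumes pairwise verdicts are recorded as "A", "B", or "tie" with A always shown first and the assignment of models to slots randomized, and that single-answer scores are numeric; the function names are hypothetical. A length/score correlation is only a hint, since longer answers can also be genuinely better.

```python
import math


def first_position_win_rate(verdicts):
    """Share of non-tie verdicts that picked the first-shown answer ('A').

    With randomized slot assignment, a value far from 0.5 suggests position bias.
    """
    decided = [v for v in verdicts if v in ("A", "B")]
    if not decided:
        raise ValueError("no non-tie verdicts")
    return sum(v == "A" for v in decided) / len(decided)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses, scores):
    """Correlation between response length (in words) and judge score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)
```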

Mitigations

Swap positions and average across both orderings.
Use a judge that’s stronger than the models being evaluated.
Force the judge to produce chain-of-thought reasoning before the score.
Ensemble multiple judges.
Calibrate against a small human-labeled set so you know how much to trust the numbers.
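
A minimal sketch of three of the mitigations above: averaging over both orderings, ensembling judges, and checking agreement with human labels. The `Judge` signature (a callable returning "first", "second", or "tie") is a hypothetical interface chosen for this example; in practice it would wrap a call to the judge model. Cohen's kappa is used here as one common chance-corrected agreement measure, not the only option.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: (question, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Score for answer_a in [0, 1], averaged over both presentation orders."""

    def score_for_a(verdict: str, a_is_first: bool) -> float:
        if verdict == "tie":
            return 0.5
        return 1.0 if (verdict == "first") == a_is_first else 0.0

    forward = score_for_a(judge(question, answer_a, answer_b), a_is_first=True)
    swapped = score_for_a(judge(question, answer_b, answer_a), a_is_first=False)
    return (forward + swapped) / 2


def ensemble_score(judges: Sequence[Judge], question: str, answer_a: str, answer_b: str) -> float:
    """Mean order-averaged score for answer_a across several judges."""
    scores = [judge_both_orders(j, question, answer_a, answer_b) for j in judges]
    return sum(scores) / len(scores)


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("need two non-empty, equal-length label sequences")
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both sides used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy judges for demonstration only.
def always_first(question, first, second):
    return "first"  # pure position bias


def prefers_longer(question, first, second):
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

A judge with pure position bias, like `always_first`, ends up at 0.5 after order averaging, so its bias cancels out instead of producing a spurious winner.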

Mental model

LLM-as-judge is a cheap, scalable, biased estimator of human preference. Useful as a relative signal during iteration (is this version better than the last one?); dangerous if you treat its absolute scores as ground truth.

Related

Elo scoring for AI evaluation (Chatbot Arena ranking)
RLAIF, Constitutional AI
DPO, GRPO (consume preference signals the judge produces)
MT-Bench, G-Eval
