LLM as Judge
Using a language model to evaluate the outputs of another model (or itself) instead of relying on humans or rigid automated metrics like BLEU/ROUGE/exact-match. Give the judge model a response (or a pair of responses) plus a rubric or question, and it returns a score, a label, or a winner.

Why it exists

For open-ended generation (chat answers, code explanations, summaries, agent traces), string-overlap metrics don't capture quality, and human eval is slow and expensive. Once frontier LLMs got good enough, they became decent proxies for human raters on many tasks, so they're now the default evaluator in MT-Bench, G-Eval, and most internal eval pipelines. Chatbot Arena itself ranks models from human pairwise votes; the MT-Bench paper used those votes to check how well LLM judges agree with people. ...
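The basic loop described above (response plus rubric in, score or winner out) can be sketched roughly as follows. This is a minimal illustration, not any benchmark's actual implementation: the prompt wording, the `Score:` / `Winner:` reply formats, the 1-10 scale, and every function name here are assumptions made for the example. The judge model is passed in as a plain callable (`JudgeFn`) so that no particular vendor API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical type: any function that sends a prompt to a judge model and
# returns its raw text reply. Plug in whatever client you actually use.
JudgeFn = Callable[[str], str]


def build_pointwise_prompt(question: str, response: str, rubric: str) -> str:
    """Prompt asking the judge to score a single response against a rubric."""
    return (
        "You are grading a response.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Reply with 'Score: <integer 1-10>'."
    )


def build_pairwise_prompt(question: str, a: str, b: str, rubric: str) -> str:
    """Prompt asking the judge to pick the better of two responses."""
    return (
        "You are comparing two responses.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Response A: {a}\n"
        f"Response B: {b}\n"
        "Reply with 'Winner: A', 'Winner: B', or 'Winner: tie'."
    )


def parse_score(reply: str, lo: int = 1, hi: int = 10) -> Optional[int]:
    """Extract an in-range integer score; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def parse_winner(reply: str) -> Optional[str]:
    """Extract 'a', 'b', or 'tie' from the reply; None if unparseable."""
    match = re.search(r"Winner:\s*(A|B|tie)\b", reply, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def judge_pointwise(judge: JudgeFn, question: str, response: str,
                    rubric: str) -> Optional[int]:
    return parse_score(judge(build_pointwise_prompt(question, response, rubric)))


def judge_pairwise(judge: JudgeFn, question: str, a: str, b: str,
                   rubric: str) -> Optional[str]:
    return parse_winner(judge(build_pairwise_prompt(question, a, b, rubric)))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "Score: 7" if "Reply with 'Score:" in prompt else "Winner: A"


if __name__ == "__main__":
    rubric = "Correct, concise, and directly answers the question."
    print(judge_pointwise(stub_judge, "What is 2+2?", "4", rubric))
    print(judge_pairwise(stub_judge, "What is 2+2?", "4", "5", rubric))
```

Returning `None` for unparseable or out-of-range replies is a deliberate choice in this sketch: a judge reply that does not follow the requested format is treated as a failed judgment to be retried or logged, rather than being silently coerced into a score.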