AI Glossary

LLM as a Judge

LLM-as-a-judge is using one language model to score the outputs of another against a rubric you define — quality, relevance, safety, correctness. It scales evaluation to volumes humans can't review by hand, trading some reliability for enormous reach.

Also known as: LLM-as-a-judge, model-graded evaluation, AI judge


AI Evaluation & Reliability

You can’t hand-review ten thousand chatbot responses, but you also can’t write a regex that knows whether an answer is good. LLM-as-a-judge splits the difference: give a capable model the output, a rubric, and often the question and reference, and have it return a score or verdict. It’s the workhorse of modern AI evaluation precisely because it scales to production volumes.
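To make that loop concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than any particular vendor's API: the prompt template, the 1-5 scale, the `SCORE:` reply format, and the `call_model` callable (a stand-in for whatever client you use to reach the judge model) are all assumptions.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are grading a chatbot answer.

Rubric: {rubric}
Question: {question}
Reference answer: {reference}
Answer to grade: {answer}

Reply with one line in the form "SCORE: <integer 1-5>" followed by a one-sentence reason."""


def build_judge_prompt(question: str, answer: str, rubric: str,
                       reference: str = "(none)") -> str:
    """Fill the template with the output, the rubric, the question, and the reference."""
    return JUDGE_TEMPLATE.format(
        rubric=rubric, question=question, reference=reference, answer=answer
    )


def parse_score(reply: str) -> Optional[int]:
    """Pull the integer score out of the judge's reply; None if it is missing or malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(call_model: Callable[[str], str], question: str, answer: str,
          rubric: str, reference: str = "(none)") -> Optional[int]:
    """Send the filled prompt to the judge model and return its parsed score."""
    prompt = build_judge_prompt(question, answer, rubric, reference)
    return parse_score(call_model(prompt))


def fake_model(prompt: str) -> str:
    """Canned stand-in for a real model call, so the sketch runs offline."""
    return "SCORE: 4\nMostly correct but omits the refund window."


score = judge(
    fake_model,
    question="What is the refund policy?",
    answer="You can get a refund if you ask.",
    rubric="Is the answer accurate and complete relative to the reference?",
    reference="Refunds are available within 30 days of purchase.",
)
```

Returning `None` for an unparseable reply, rather than guessing, lets you count and inspect the cases where the judge ignored the requested format.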

The catch is that the judge is itself a fallible model: it can be inconsistent, it can be biased toward longer or more confident answers, and it is only as good as the rubric you give it. The fixes are practical: write specific criteria, calibrate the judge against a human-labeled set, and rely on it for relative comparisons (is A better than B?) more than for absolute verdicts.
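Calibrating against a human-labeled set can be as simple as running the judge over examples people have already graded and measuring how often the two agree. The sketch below computes raw agreement and Cohen's kappa, which discounts the agreement you would expect by chance. The function names and the sample labels are invented for illustration, and what counts as "good enough" agreement is a decision for your team, not something this code settles.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of examples where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined here,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels for eight examples, purely to show the arithmetic.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]

raw = agreement_rate(human_labels, judge_labels)   # 6 of 8 match: 0.75
kappa = cohens_kappa(human_labels, judge_labels)   # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

Here the judge matches the humans on six of eight examples, but half of that agreement would be expected by chance with balanced labels, so kappa comes out at 0.5.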

Used well, it’s how teams turn evaluation from a manual bottleneck into a continuous signal. Used blindly, it just launders one model’s errors through another — which is why builders pair it with human spot-checks rather than trusting it wholesale.
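One way to build in those human spot-checks is to pull a reproducible random sample of judged items for a person to review. This is a sketch under assumptions: the `sample_for_review` helper, the 5% default rate, and the record shape are all hypothetical, and the right sampling rate depends on your volume and risk tolerance.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_for_review(items: Sequence[T], rate: float = 0.05, seed: int = 0) -> List[T]:
    """Pick a random subset of judged items for a human to double-check.

    rate is the fraction to review (the 5% default is an arbitrary assumption);
    seed makes the selection reproducible so reviewers can be re-sent the same batch.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not items:
        return []
    k = max(1, round(len(items) * rate))  # always review at least one item
    return random.Random(seed).sample(list(items), k)


# Hypothetical judged records: an id plus the judge's score.
judged = [{"id": i, "judge_score": 1 + i % 5} for i in range(200)]
review_batch = sample_for_review(judged, rate=0.05, seed=1)
```

Comparing the human verdicts on `review_batch` with the judge's scores gives you an ongoing check that the judge has not drifted since it was calibrated.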
