LLM as a Judge
LLM-as-a-judge means using one language model to score the outputs of another against a rubric you define, covering criteria such as quality, relevance, safety, or correctness. It scales evaluation to volumes humans can't review by hand, trading some reliability for enormous reach.
Also known as: LLM-as-a-judge, model-graded evaluation, AI judge
You can’t hand-review ten thousand chatbot responses, but you also can’t write a regex that knows whether an answer is good. LLM-as-a-judge splits the difference: give a capable model the output, a rubric, and often the question and reference, and have it return a score or verdict. It’s the workhorse of modern AI evaluation precisely because it scales to production volumes.
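To make the shape of that loop concrete, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: the rubric wording, the 1-5 scale, the `SCORE:` reply format, and the `call_model` callable, which stands in for whatever model client you actually use. The stub `fake_model` exists only so the example runs without a network call.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; a real one would be specific to your product.
RUBRIC = "Score 1-5 for how directly and correctly the answer addresses the question."


def build_judge_prompt(
    question: str, answer: str, rubric: str, reference: Optional[str] = None
) -> str:
    """Assemble the text the judge model sees: rubric, question, optional reference, answer."""
    parts = [
        "You are grading a model's answer.",
        f"Rubric: {rubric}",
        f"Question: {question}",
    ]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append(f"Answer to grade: {answer}")
    parts.append("Reply with a line of the form 'SCORE: <integer 1-5>'.")
    return "\n".join(parts)


def parse_score(reply: str) -> Optional[int]:
    """Pull the score out of the judge's reply; None if the reply is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(
    call_model: Callable[[str], str],
    question: str,
    answer: str,
    rubric: str = RUBRIC,
    reference: Optional[str] = None,
) -> Optional[int]:
    """Send the prompt to the judge model and return its parsed score."""
    prompt = build_judge_prompt(question, answer, rubric, reference)
    return parse_score(call_model(prompt))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same verdict."""
    return "The answer matches the reference.\nSCORE: 4"


score = judge(
    fake_model,
    question="What is the capital of France?",
    answer="Paris.",
    reference="Paris",
)
```

Returning `None` for a reply that doesn't match the expected format, rather than guessing, keeps malformed judge outputs visible so they can be counted or retried.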
The catch is that the judge is itself a fallible model — it can be inconsistent, biased toward longer or more confident answers, and only as good as the rubric you give it. The fixes are practical: write specific criteria, calibrate the judge against a human-labeled set, and use it for relative comparisons (is A better than B?) more than absolute truth.
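One way to approach the calibration step is to run the judge over examples humans have already labeled and measure how often the two agree. The sketch below computes raw agreement and Cohen's kappa, which discounts the agreement you would expect by chance. The choice of kappa, the pass/fail labels, and the sample data are illustrative assumptions, not a prescribed method.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of examples where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]

rate = agreement_rate(human_labels, judge_labels)   # 6 of 8 match: 0.75
kappa = cohens_kappa(human_labels, judge_labels)    # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

If agreement is low, the disagreeing examples are usually the most useful thing to read: they tend to show where the rubric is ambiguous.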
Used well, it’s how teams turn evaluation from a manual bottleneck into a continuous signal. Used blindly, it just launders one model’s errors through another — which is why builders pair it with human spot-checks rather than trusting it wholesale.
Go deeper
From the conversation
- Explaining Eval Engineering | Galileo's Vikram Chatterji
- Architecting AI Agents: The Shift from Models to Systems | Aishwarya Srinivasan
- Mindset Over Metrics: How to Approach AI Engineering | Hamel Husain
- Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang
- Building an AI-Native Startup | GrowthX's Marcel Santilli
- Inside IBM's watsonx: Building Enterprise AI That Ships | Dr. Maryam Ashoori