AI Glossary

AI Evaluation

AI evaluation is how you measure whether an AI system actually works — scoring its outputs against what good looks like, systematically and repeatably, instead of eyeballing a few demos. For non-deterministic systems like LLMs and agents, it's the discipline that separates a thing that demos well from one you can ship.

Also known as: evals, AI evals, model evaluation, LLM evaluation

Traditional software is largely deterministic (same input, same output), so you can test it with exact assertions. An LLM or agent can give a different, plausible-sounding answer on every run, so exact-match assertions stop working. AI evaluation fills the gap: a repeatable way to score outputs against criteria such as accuracy, relevance, safety, format, and task completion, so you know whether a change made the system better or worse.
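To make "score outputs against criteria" concrete, here is a minimal sketch in Python. The two criteria (`is_valid_json` for format, `mentions_refund` for relevance) and the sample outputs are hypothetical stand-ins chosen for illustration; a real eval would use criteria that match your own task.

```python
import json
from typing import Callable, Dict, List

# Hypothetical criteria for illustration: each takes one model output
# and returns True (pass) or False (fail).
def is_valid_json(output: str) -> bool:
    """Format check: the output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def mentions_refund(output: str) -> bool:
    """Crude relevance check: the output must mention a refund."""
    return "refund" in output.lower()

CRITERIA: Dict[str, Callable[[str], bool]] = {
    "format": is_valid_json,
    "relevance": mentions_refund,
}

def score_output(output: str) -> Dict[str, bool]:
    """Score a single output against every criterion."""
    return {name: check(output) for name, check in CRITERIA.items()}

def pass_rate(outputs: List[str]) -> float:
    """Fraction of outputs that pass all criteria."""
    if not outputs:
        return 0.0
    passed = sum(all(score_output(o).values()) for o in outputs)
    return passed / len(outputs)

# Three different answers to the same prompt, as a non-deterministic
# system might produce (made-up samples, not real model output).
samples = [
    '{"answer": "Your refund was issued today."}',
    'Sure! Your refund is on its way.',      # relevant, but not JSON
    '{"answer": "Please contact support."}',  # JSON, but off-topic
]
```

Instead of asserting that every run returns one exact string, you track `pass_rate(samples)` across runs and versions. The number moving up or down is what tells you whether a change helped.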

In practice it spans offline evals on a fixed test set (to catch regressions before you ship) and online evals on live traffic (to catch what only production surfaces). The hard part is defining “good” for open-ended outputs, which is why LLM-as-a-judge and human review both show up, and why observability and evaluation are used together but are not the same thing: observability shows what the system did, evaluation judges whether it was good.
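An offline eval used as a regression gate could look roughly like the sketch below. Everything in it is an assumption made for illustration: the three-item `TEST_SET`, the substring grader `contains_expected`, and the two stub systems that stand in for real model calls. For open-ended outputs you would likely swap the grader for an LLM-as-a-judge call or human review.

```python
from typing import Callable, List, Tuple

# Fixed test set of (input, expected substring) pairs. Made-up examples.
TEST_SET: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "paris"),
    ("Opposite of hot?", "cold"),
]

def contains_expected(output: str, expected: str) -> bool:
    """Simple grader: case-insensitive substring match."""
    return expected.lower() in output.lower()

def run_offline_eval(system: Callable[[str], str]) -> float:
    """Run the system over the fixed test set and return its accuracy."""
    hits = sum(contains_expected(system(q), exp) for q, exp in TEST_SET)
    return hits / len(TEST_SET)

def is_regression(baseline: float, candidate: float,
                  tolerance: float = 0.0) -> bool:
    """True if the candidate scores worse than baseline beyond tolerance."""
    return candidate < baseline - tolerance

# Stub systems standing in for the current and the changed version.
def baseline_system(question: str) -> str:
    answers = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "Paris is the capital.",
        "Opposite of hot?": "Cold.",
    }
    return answers.get(question, "I don't know.")

def candidate_system(question: str) -> str:
    answers = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "It is Lyon.",  # the change broke this case
        "Opposite of hot?": "Cold.",
    }
    return answers.get(question, "I don't know.")

baseline_score = run_offline_eval(baseline_system)
candidate_score = run_offline_eval(candidate_system)
should_block_release = is_regression(baseline_score, candidate_score)
```

Because the test set is fixed, the two scores are directly comparable, and that is what lets the gate catch the regression before it ships. Online evals apply the same kind of grading to sampled live traffic, where there is usually no expected answer to compare against.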

It’s a recurring theme on the show because it’s where most AI projects quietly fail: teams ship on vibes, can’t tell why quality drifts, and have no way to improve systematically. Evaluation is the feedback loop that makes iteration possible — without it you’re guessing.
