How do you test an AI system when the output isn't deterministic?
You stop expecting one exact answer and start testing properties. Because the same input can produce different valid outputs, traditional assert-equals tests don't fit. Instead you build a dataset of inputs, each paired with the characteristics a good answer must have, and check each output against them: is it grounded, does it follow the instruction, does it avoid the unsafe thing. Scoring is usually done by a rubric or an LLM judge. You run that suite on every change, the way you'd run unit tests, so a regression shows up before users see it.
Why normal tests don’t fit
A unit test asserts that an input produces one exact output. AI breaks that assumption: the same prompt can yield several different answers that are all correct, so an exact-match assertion such as assertEqual fails good outputs that are merely worded differently, and a looser match written to compensate can pass bad ones. Testing AI isn’t about matching a string; it’s about checking whether each output has the qualities you require.
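To make the mismatch concrete, here is a minimal sketch. The question, the golden string, and the two outputs are invented for illustration; they stand in for two runs of the same prompt.

```python
# Hypothetical example: two correct answers to "What is the capital of France?"
GOLDEN = "The capital of France is Paris."
outputs = [
    "The capital of France is Paris.",
    "Paris is the capital of France.",
]


def exact_match(output: str) -> bool:
    """Traditional assertion: pass only if the string equals the golden one."""
    return output == GOLDEN


def names_paris(output: str) -> bool:
    """Property check: the answer must name Paris, whatever the wording."""
    return "paris" in output.lower()


exact_results = [exact_match(o) for o in outputs]
property_results = [names_paris(o) for o in outputs]

print("exact match:", exact_results)
print("property check:", property_results)
```

The exact-match test rejects the second answer even though it is correct, while the property check accepts both.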
Test on properties, against a dataset
Build a dataset of representative inputs paired with what a good answer must do: stay grounded in the source, follow the instruction, stay in scope, avoid the unsafe response. Then score each output against those properties rather than against one golden string. Use programmatic checks where you can write them, and a rubric or an LLM judge where you can’t. The dataset is the asset; it’s what makes the test repeatable.
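One way to sketch property scoring with programmatic checks is shown below. The `Case` fields, the check names, and the refund-policy example are all assumptions made for illustration, and the grounding check is a deliberately crude proxy (numbers in the output must appear in the source). Properties that cannot be coded this way would go to a rubric or an LLM judge, which is not shown here.

```python
import re
from dataclasses import dataclass, field

NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass
class Case:
    """One dataset row: an input plus what a good answer must do (hypothetical schema)."""
    prompt: str
    source: str
    required: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)


def grounded(case: Case, output: str) -> bool:
    """Crude grounding proxy: every number in the output also appears in the source."""
    return set(NUMBER.findall(output)) <= set(NUMBER.findall(case.source))


def includes_required(case: Case, output: str) -> bool:
    """The output mentions every phrase the case requires."""
    text = output.lower()
    return all(phrase.lower() in text for phrase in case.required)


def avoids_forbidden(case: Case, output: str) -> bool:
    """The output contains none of the phrases the case forbids."""
    text = output.lower()
    return not any(phrase.lower() in text for phrase in case.forbidden)


CHECKS = {
    "grounded": grounded,
    "includes_required": includes_required,
    "avoids_forbidden": avoids_forbidden,
}


def score(case: Case, output: str) -> dict[str, bool]:
    """Score one output against every property instead of one golden string."""
    return {name: check(case, output) for name, check in CHECKS.items()}


case = Case(
    prompt="Summarize the refund policy.",
    source="Refunds are available within 30 days of purchase.",
    required=["30 days"],
    forbidden=["guarantee"],
)
good_scores = score(case, "You can get a refund within 30 days of buying.")
bad_scores = score(case, "We guarantee refunds within 90 days.")
print(good_scores)
print(bad_scores)
```

Each output gets a per-property result, so a failure says which quality was missing and not merely that the string differed.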
Run it like a test suite
The discipline matters more than any single check. Wire the suite into your workflow so it runs on every prompt change, model swap, and dependency bump, and so a drop in scores blocks the change. This is what turns evaluation from a one-time demo check into the AI equivalent of CI — the thing that catches a regression before it ships.
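A gate like this can be sketched in a few lines. The baseline numbers, the property names, and the 0.02 tolerance are assumptions chosen for illustration; in a real pipeline the current scores would come from running the suite, and the exit code would be passed to `sys.exit` so the CI job fails.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed one property check."""
    return sum(results) / len(results) if results else 0.0


def gate(current: dict[str, float], baseline: dict[str, float],
         tolerance: float = 0.02) -> list[str]:
    """Return the properties whose pass rate fell more than `tolerance` below baseline."""
    return [
        name
        for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    ]


# Hypothetical scores: baseline from the last accepted change, current from this one.
baseline = {"grounded": 0.95, "follows_instruction": 0.90}
current = {"grounded": 0.96, "follows_instruction": 0.80}

regressions = gate(current, baseline)
exit_code = 1 if regressions else 0  # in CI, hand this to sys.exit() to block the change

print("regressions:", regressions)
print("exit code:", exit_code)
```

The tolerance is a design choice: because outputs vary from run to run, a small allowance keeps noise from blocking every change while still catching a real drop.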
Start small and grow the set
You don’t need a thousand cases to begin. Start with the handful that represent your real traffic and the failures you’ve already seen, and add a case every time something breaks in production. The test set should grow toward your actual edge cases, not toward some abstract notion of coverage.
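Growing the set can be as simple as appending a row whenever something breaks. The sketch below assumes the dataset lives in a JSONL file; the file name, the field names, and the two example cases are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def add_case(path: Path, prompt: str, required: list[str], note: str) -> int:
    """Append one case to a JSONL dataset and return the new case count."""
    case = {"prompt": prompt, "required": required, "note": note}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def load_cases(path: Path) -> list[dict]:
    """Read every case back so the suite can iterate over them."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "eval_cases.jsonl"  # hypothetical file name

    # Seed with a case that represents real traffic.
    count_after_first = add_case(
        dataset, "Summarize the refund policy.", ["30 days"], "common request"
    )
    # Add a case after a production failure.
    count_after_second = add_case(
        dataset, "Can I get a refund after 45 days?", ["30 days"],
        "production failure: model invented an exception",
    )
    cases = load_cases(dataset)

print(count_after_first, count_after_second)
print(cases[-1]["note"])
```

Keeping a note on each case records why it exists, which helps later when deciding whether a failing case still reflects real traffic.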