Best Chain of Thought Episodes on AI Evaluation

Measuring whether AI actually works — evals, reliability, trust.

1. EP 57, Apr 29, 2026, 43 min
   Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI
   Alex Ratner, Snorkel AI
   Alex Ratner on the evaluation gap every team hits the moment an agent leaves the demo, and how to close it.
2. EP 46, Dec 19, 2025, 37 min
   Explaining Eval Engineering | Galileo's Vikram Chatterji
   Vikram Chatterji, Galileo
   Vikram Chatterji on treating evaluation as an engineering discipline, not a one-time check.
3. EP 32, Jul 16, 2025, 44 min
   The AI Agent Trust Gap: Bridging Risk to Reliability | Elastic’s Philipp Krenn
   Philipp Krenn, Elastic
   Philipp Krenn on bridging the gap from risky to reliable, and why trust is the real bar for shipping agents.
4. EP 13, Mar 5, 2025, 29 min
   AI in 2025: Agents & The Rise of Evaluation-Driven Development
   Vikram Chatterji & Andrew Zigler
   The case for evaluation-driven development: building the eval loop into how you ship, not bolting it on after.
5. EP 5, Dec 4, 2024, 48 min
   Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang
   Chip Huyen & Vivienne Zhang
   Chip Huyen and Vivienne Zhang on practical GenAI evaluation, the most-cited starting point for evals on the show.
