Best Chain of Thought Episodes on AI Evaluation

Measuring whether AI actually works — evals, reliability, trust.

  1. EP 57 (43 min) Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI
     Alex Ratner on the evaluation gap every team hits the moment an agent leaves the demo, and how to close it.
  2. EP 46 (37 min) Explaining Eval Engineering | Galileo's Vikram Chatterji
     Vikram Chatterji on treating evaluation as an engineering discipline, not a one-time check.
  3. EP 32 (44 min) The AI Agent Trust Gap: Bridging Risk to Reliability | Elastic’s Philipp Krenn
     Philipp Krenn on bridging the gap from risky to reliable, and why trust is the real bar for shipping agents.
  4. EP 13 (29 min) AI in 2025: Agents & The Rise of Evaluation-Driven Development
     Vikram Chatterji & Andrew Zigler on the case for evaluation-driven development: building the eval loop into how you ship, not bolting it on after.
  5. EP 5 (48 min) Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang
     Chip Huyen and Vivienne Zhang on practical GenAI evaluation, the most-cited starting point for evals on the show.
