AI Evaluation & Reliability

Measuring, testing, and trusting AI systems.

AI evaluation is the practice of measuring whether an AI system actually works, using evals, benchmarks, and reliability testing to catch hallucinations and regressions before they reach production.

35 episodes

The First Fully Autonomous AI Attack Is 18 Months Away | Kristin Lovejoy · Kristin Lovejoy, Kyndryl · Jun 11, 2026 · Transcript
We Built Agents, Nobody Built HR | Tyler Akidau, Redpanda · Tyler Akidau, Redpanda · May 27, 2026 · Transcript
How Superhuman Built AI Into a 100ms Product | Loïc Houssier · Loïc Houssier, Superhuman · May 22, 2026 · Transcript
Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI · Alex Ratner, Snorkel AI · Apr 29, 2026 · Transcript
Hallucinations Are a Data Architecture Problem | Sudhir Hasbe, Neo4j · Sudhir Hasbe, Neo4j · Apr 16, 2026 · Transcript
Why LLMs Are Plausibility Engines, Not Truth Engines | Dan Klein · Dan Klein, Scaled Cognition · Apr 8, 2026 · Transcript
How Intercom Cut $250K/Month by Ditching GPT for Qwen · Fergal Reid, Intercom · Feb 26, 2026 · Transcript
Explaining Eval Engineering | Galileo's Vikram Chatterji · Vikram Chatterji, Galileo · Dec 19, 2025 · Transcript
Beyond Transformers: How Liquid AI Is Rethinking LLM Architecture | Maxime Labonne · Maxime Labonne, Liquid AI · Nov 12, 2025 · Transcript
Architecting AI Agents: The Shift from Models to Systems | Aishwarya Srinivasan · Aishwarya Srinivasan, Fireworks AI · Oct 8, 2025 · Transcript
Vercel's Playbook for AI Agents: From Vibe Check to Production | Malte Ubl · Malte Ubl, Vercel · Sep 10, 2025 · Transcript
From Demo to Defensibility: How to Build an AI Business that Lasts | Aurimas Griciūnas · Aurimas Griciūnas, SwirlAI · Aug 27, 2025 · Transcript
Mindset Over Metrics: How to Approach AI Engineering | Hamel Husain · Hamel Husain, Parlance Labs · Aug 20, 2025 · Transcript
Mastering Multi-Agent Systems | MongoDB’s Mikiko Chandrasekhar · Mikiko Chandrasekhar, MongoDB · Jul 23, 2025 · Transcript
The AI Agent Trust Gap: Bridging Risk to Reliability | Elastic’s Philipp Krenn · Philipp Krenn, Elastic · Jul 16, 2025 · Transcript
Architecting Reliable Agentic AI | Cisco’s Giovanna Carofiglio on the AGNTCY Collective · Giovanna Carofiglio, Cisco · Jul 9, 2025 · Transcript
The Emerging AI Agent Stack | CrewAI’s João Moura · João Moura, CrewAI · Jun 25, 2025 · Transcript
Your Key to AI Success is Hiding in Plain Sight | Cohesity's Greg Statton · Greg Statton, Cohesity · Jun 11, 2025 · Transcript
The 2025 AI Shift: From Chat to Task Completion & Reliable Action | Galileo Founders · Vikram Chatterji & Atindriyo Sanyal · May 28, 2025 · Transcript
Amplitude's AI Playbook: How Wade Chambers Builds for the Agentic Future · Wade Chambers, Amplitude · May 21, 2025 · Transcript
AI's Two Extremes – Foundations & The Frontier | Databricks’ Denny Lee · Denny Lee, Databricks · May 7, 2025 · Transcript
Why Enterprises Need a Different Approach to AI Agents | Lyzr’s Siva Surendira · Siva Surendira, Lyzr · Apr 30, 2025 · Transcript
Breaking the Language Barrier: Smartling's AI Translation Pipeline | Olga Beregovaya · Olga Beregovaya, Smartling · Apr 23, 2025 · Transcript
Low-Code AI: From Requirements to Apps in Minutes | OutSystems' Rodrigo Coutinho · Rodrigo Coutinho, OutSystems · Apr 16, 2025 · Transcript
Inside IBM's watsonx: Building Enterprise AI That Ships | Dr. Maryam Ashoori · Maryam Ashoori, IBM · Apr 2, 2025 · Transcript
Using AI to Modernize Your Legacy Applications | MongoDB’s Rachelle Palmer · Rachelle Palmer, MongoDB · Mar 12, 2025 · Transcript
AI in 2025: Agents & The Rise of Evaluation-Driven Development · Vikram Chatterji & Andrew Zigler · Mar 5, 2025 · Transcript
The Making of Gemini 2.0: DeepMind's Approach to AI Development and Deployment | Logan Kilpatrick · Logan Kilpatrick, Google DeepMind · Feb 12, 2025 · Transcript
How DeepSeek Changed the AI Race Overnight · Atindriyo Sanyal, Galileo · Feb 5, 2025 · Transcript
AI in 2025: Agents & The Rise of Evaluation Driven Development · Yash Sheth & Atindriyo Sanyal · Jan 15, 2025 · Transcript
Beyond Chatbots: How Twilio Uses AI to Strengthen Human Connection | Vinnie Giarrusso · Vinnie Giarrusso, Twilio · Dec 18, 2024 · Transcript
The Enterprise AI Deployment Playbook | ServiceTitan, Indeed & Twilio · Mehmet Murat Ezbiderli, Vinnie Giarrusso, Grant Ledford & Atindriyo Sanyal · Dec 11, 2024 · Transcript
Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang · Chip Huyen & Vivienne Zhang · Dec 4, 2024 · Transcript
GenAI Predictions for 2025 | Databricks & Cohere · Sara Hooker, Craig Wiley & Vikram Chatterji · Nov 20, 2024 · Transcript
The State of AI: Open-Source Models & Enterprise Trust | May Habib · May Habib, Writer · Nov 6, 2024 · Transcript

Guests on this topic

Kristin Lovejoy · Tyler Akidau · Loïc Houssier · Alex Ratner · Sudhir Hasbe · Dan Klein · Fergal Reid · Vikram Chatterji · Maxime Labonne · Aishwarya Srinivasan · Malte Ubl · Aurimas Griciūnas · Hamel Husain · Mikiko Chandrasekhar · Philipp Krenn · Giovanna Carofiglio · João Moura · Greg Statton · Atindriyo Sanyal · Wade Chambers · Denny Lee · Siva Surendira · Olga Beregovaya · Rodrigo Coutinho · Maryam Ashoori · Rachelle Palmer · Andrew Zigler · Logan Kilpatrick · Yash Sheth · Vinnie Giarrusso · Mehmet Murat Ezbiderli · Grant Ledford · Chip Huyen · Vivienne Zhang · Sara Hooker · Craig Wiley · May Habib