AI Evaluation & Reliability
Measuring, testing, and trusting AI systems.
AI evaluation is the practice of measuring whether an AI system actually works, using evals, benchmarks, and reliability testing to catch hallucinations and regressions before they reach production.
26 episodes
- How Superhuman Built AI Into a 100ms Product | Loïc Houssier
- Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI
- Hallucinations Are a Data Architecture Problem | Sudhir Hasbe, Neo4j
- Why LLMs Are Plausibility Engines, Not Truth Engines | Dan Klein
- Explaining Eval Engineering | Galileo's Vikram Chatterji
- Beyond Transformers: How Liquid AI Is Rethinking LLM Architecture | Maxime Labonne
- Architecting AI Agents: The Shift from Models to Systems | Aishwarya Srinivasan
- Vercel's Playbook for AI Agents: From Vibe Check to Production | Malte Ubl
- Mindset Over Metrics: How to Approach AI Engineering | Hamel Husain
- AI's Trillion-Dollar Healthcare Bet | Corti's Andreas Cleve
- Mastering Multi-Agent Systems | MongoDB’s Mikiko Chandrasekhar
- The AI Agent Trust Gap: Bridging Risk to Reliability | Elastic’s Philipp Krenn
- Architecting Reliable Agentic AI | Cisco’s Giovanna Carofiglio on the AGNTCY Collective
- The Emerging AI Agent Stack | CrewAI’s João Moura
- The 2025 AI Shift: From Chat to Task Completion & Reliable Action | Galileo Founders
- AI's Two Extremes – Foundations & The Frontier | Databricks’ Denny Lee
- Why Enterprises Need a Different Approach to AI Agents | Lyzr’s Siva Surendira
- Breaking the Language Barrier: Smartling's AI Translation Pipeline | Olga Beregovaya
- Inside IBM's watsonx: Building Enterprise AI That Ships | Dr. Maryam Ashoori
- Information Symmetry: DevRev's Bet on AI-Driven Enterprise Decisions | Manoj Agarwal
- AI in 2025: Agents & The Rise of Evaluation-Driven Development
- How DeepSeek Changed the AI Race Overnight
- AI in 2025: Agents & The Rise of Evaluation Driven Development
- The Enterprise AI Deployment Playbook | ServiceTitan, Indeed & Twilio
- Practical Lessons for GenAI Evals | Chip Huyen & Vivienne Zhang
- Got Agents? Agentic Workflows & Architecture | Weaviate, Unstructured & CrewAI
Guests on this topic
- Loïc Houssier
- Alex Ratner
- Sudhir Hasbe
- Dan Klein
- Vikram Chatterji
- Maxime Labonne
- Aishwarya Srinivasan
- Malte Ubl
- Hamel Husain
- Andreas Cleve
- Mikiko Chandrasekhar
- Philipp Krenn
- Giovanna Carofiglio
- João Moura
- Atindriyo Sanyal
- Denny Lee
- Siva Surendira
- Olga Beregovaya
- Maryam Ashoori
- Manoj Agarwal
- Yash Sheth
- Andrew Zigler
- Mehmet Murat Ezbiderli
- Vinnie Giarrusso
- Grant Ledford
- Chip Huyen
- Vivienne Zhang
- Brian Raymond
- Bob van Luijt