How should engineering leaders decide which AI use cases to actually build?

Galileo CEO Vikram Chatterji warns that an open-ended hackathon will generate a hundred ideas given how broad AI is — text, images, agentic task completion. The discipline is to take those back to product and business owners and prioritize on two axes: what the business actually needs, and which use cases you can get out the door very quickly. Then apply operational rigor to try a small number of prioritized ideas fast, rather than drowning in possible solutions.

Why do banks adopt AI more cautiously than companies like DoorDash?

Chatterji frames it as fault tolerance. A bank deals with people’s money, so a misfiring consumer-facing AI can derail the bank’s reputation and put it out of business quickly — pushing enterprises toward a careful crawl-walk-run approach. A DoorDash, Instacart, Airbnb, or Twilio has a lower bar: a chatbot mistake is bad but it’s “dealing with hungry people,” not unauthorized transactions. The stakes set how experimental the company can afford to be.

What does it mean that AI agents act as “smart routers”?

Instead of asking the LLM to just generate text, you wrap functions as tools and let the agent decide which tool to call to finish a task — acting like a smart router. Chatterji contrasts this with old deterministic code where you’d “literally say this is exactly what you need to do.” Now it’s “I just want this done, you figure it out,” creating a leader-worker relationship between the engineer and the agent.
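As a rough illustration of the "smart router" idea, the sketch below wraps two functions as tools and lets a routing step pick one. Everything here is hypothetical: the tool names, the `TOOLS` registry, and especially `choose_tool`, which is a keyword heuristic standing in for the decision a real agent would delegate to an LLM.

```python
from typing import Callable, Dict


# Hypothetical tools: plain functions wrapped so an agent can call them by name.
def book_hotel(request: str) -> str:
    return f"booked hotel for: {request}"


def add_file(request: str) -> str:
    return f"added file for: {request}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "book_hotel": book_hotel,
    "add_file": add_file,
}


def choose_tool(request: str) -> str:
    """Stand-in for the LLM's routing decision (assumption: a real agent
    would ask a model to pick one of the names registered in TOOLS)."""
    text = request.lower()
    if "hotel" in text or "room" in text:
        return "book_hotel"
    return "add_file"


def run_agent(request: str) -> str:
    tool_name = choose_tool(request)   # the "smart router" step
    return TOOLS[tool_name](request)   # dispatch to whichever tool was chosen


result = run_agent("Book a hotel room in Lisbon")
```

The caller states the goal and never names the tool, which is the shift from "this is exactly what you need to do" to "you figure it out."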

How do you evaluate an AI agent versus a simple chatbot?

A chatbot is query-response, query-response. An agent runs a sequence of actions you don’t directly observe — you only see the final output. Chatterji says evaluation has to open that box: did it choose the right tool, was the tool called correctly, did it plan the task properly, and did it actually complete the task — and how do you even measure quality of completion? Galileo surfaces traces, spans, and explainable metrics so engineers can see how decisions were made and optimize wasteful API calls.
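The questions above can be turned into checks over a recorded trace. The sketch below is not Galileo's API; `ToolCall`, `Trace`, and `evaluate_trace` are hypothetical names, and "completed" is approximated crudely as "the agent produced a non-empty final output," which a real evaluation would replace with a proper quality metric.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    tool: str               # name of the tool the agent invoked
    args: Dict[str, Any]    # arguments the agent passed
    ok: bool                # True if the call returned without error


@dataclass
class Trace:
    task: str
    calls: List[ToolCall] = field(default_factory=list)
    final_output: str = ""


def evaluate_trace(trace: Trace, expected_tool: str,
                   required_args: List[str]) -> Dict[str, Any]:
    matching = [c for c in trace.calls if c.tool == expected_tool]
    chose_right_tool = len(matching) > 0
    # "Called correctly" here means: no error and all required arguments present.
    called_correctly = chose_right_tool and all(
        c.ok and all(name in c.args for name in required_args) for c in matching
    )
    # Count exact repeats of the same (tool, args) pair as wasteful calls.
    seen = set()
    redundant_calls = 0
    for c in trace.calls:
        key = (c.tool, repr(sorted(c.args.items())))
        if key in seen:
            redundant_calls += 1
        seen.add(key)
    return {
        "chose_right_tool": chose_right_tool,
        "called_correctly": called_correctly,
        "completed": bool(trace.final_output.strip()),  # crude proxy only
        "redundant_calls": redundant_calls,
    }


trace = Trace(
    task="Book two nights in Lisbon",
    calls=[
        ToolCall("search_hotels", {"city": "Lisbon"}, True),
        ToolCall("search_hotels", {"city": "Lisbon"}, True),  # repeated call
        ToolCall("book_hotel", {"city": "Lisbon", "nights": 2}, True),
    ],
    final_output="Booked 2 nights in Lisbon.",
)
report = evaluate_trace(trace, "book_hotel", ["city", "nights"])
```

In this example the task succeeds, but the report still flags one redundant search call, the kind of wasted API call the answer above mentions.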

What is a “compound system” in AI engineering?

Chatterji credits Databricks’ Matei Zaharia with coining “compound systems” for what Galileo calls your AI app. As agentic scenarios add function calls, the system grows more complex and starts to resemble classical software engineering — it’s “all about how good are your functions and how well are you managing it all.” That’s why engineers now ask what a unit test or regression test even looks like for an agent, which is where eval comes in.
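One possible reading of "regression test for an agent" is a gate that compares the current eval scores against a stored baseline. The sketch below assumes scores between 0 and 1 and uses hypothetical metric names and a hypothetical tolerance; it is an illustration of the idea, not a description of any particular product.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return the metrics that dropped by more than `tolerance`
    (or disappeared) relative to the baseline run."""
    regressions = []
    for name, base_score in baseline.items():
        score = current.get(name)
        if score is None or score < base_score - tolerance:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical scores from two evaluation runs of the same agent.
baseline = {"tool_selection": 0.90, "task_completion": 0.80}
current = {"tool_selection": 0.91, "task_completion": 0.70}

failed = find_regressions(baseline, current)
```

A CI job could fail the build whenever `failed` is non-empty, in the same way a classical regression suite blocks a merge.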

S2 E13

AI in 2025: Agents & The Rise of Evaluation-Driven Development

Mar 5, 2025 · Vikram Chatterji, Galileo; Andrew Zigler, Dev Interrupted · 29 min

AI Agents · AI Evaluation & Reliability · AI Engineering

Key takeaways

Vikram Chatterji argues AI is “another tool in your arsenal,” not a mandate. The right first question isn’t “how do we use AI” but “what’s your use case, and is this a good fit?” — and engineering leaders need the machinery to push back and say “maybe I don’t need to use AI at all.”
Risk tolerance should set the pace. Chatterji contrasts a bank — where a misfiring consumer AI can “derail your entire bank’s reputation” because it deals with people’s money — against a DoorDash or Instacart, where a chatbot slip is “dealing with hungry people,” not bank accounts. Enterprises go crawl-walk-run; digital natives move fast and break things.
Even fast-moving teams need forecasting before they build. Chatterji’s sequence: forecast how many use cases (10, 15), provision enough compute (he name-checks A100 GPUs) so engineers aren’t asking “where’s my GPU at,” invest in tooling to cut compute cost, then add eval guardrails for an “AI CI/CD process” before letting teams loose.
Agents turn the LLM from a generator into a “smart router” that picks tools to complete a task — replacing deterministic code where “you would literally say this is exactly what you need to do” with “I just want this done, you figure it out.” That unlock creates new failure modes: did it pick the right tool, call it correctly, plan the task, and actually complete it?
Agentic apps are “compound systems” — Chatterji credits Databricks’ Matei Zaharia for the term — that look increasingly like classical software engineering built from function calls. Galileo’s Luna evaluation metrics aim to score them without ground truth via a TypeScript/JavaScript SDK, surfacing explainable metrics and automatic insights as “a co-pilot for your AI application development.”
Asked the most valuable skill for engineering leaders right now, Chatterji names two: keep building to stay upskilled — if you expect your team to ship AI apps, build simple apps on the side yourself — and stay plugged into the community of other eng leaders, because “you can’t wait to make those mistakes yourselves” when everything moves this fast. (His interviewer Andrew Zigler condensed it as “build, read, and communicate.”)
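The forecasting step in the third takeaway is back-of-the-envelope arithmetic. The sketch below shows one way to do it; the function name, the per-use-case GPU hours, and the utilization factor are illustrative assumptions, not figures from the episode.

```python
import math


def forecast_gpus(use_cases: int,
                  gpu_hours_per_use_case_per_week: float,
                  hours_per_gpu_per_week: float = 168.0,
                  utilization: float = 0.6) -> int:
    """Estimate how many GPUs to provision for a forecast number of use cases.

    All inputs are assumptions to be replaced with your own measurements.
    """
    demand = use_cases * gpu_hours_per_use_case_per_week
    usable_hours_per_gpu = hours_per_gpu_per_week * utilization
    return math.ceil(demand / usable_hours_per_gpu)


gpus_for_10 = forecast_gpus(10, 40)   # 400 GPU-hours of weekly demand
gpus_for_15 = forecast_gpus(15, 40)   # 600 GPU-hours of weekly demand
```

Running the estimate for both 10 and 15 use cases gives a range to provision against before teams start asking where their GPU is.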

Chapters

00:00 Introduction and Special Announcement
01:14 Welcome to Dev Interrupted
01:42 Challenges in AI Adoption
03:16 Balancing Business Needs and AI
06:15 Crawl, Walk, Run Approach
10:52 Building Trust and Prototyping
13:07 AI Agents as Smart Routers
13:50 Galileo's Role in AI Development
16:25 Evaluating AI Systems
25:36 Skills for Engineering Leaders
27:35 Conclusion

Show notes

This week, we're sharing a special episode courtesy of 'Dev Interrupted.' Our co-host, Galileo CEO Vikram Chatterji, recently joined the Dev Interrupted team for an engaging discussion on AI strategy. We were so impressed by the conversation that we wanted to share it with our audience, and they were kind enough to let us. We hope you enjoy it!

From Dev Interrupted:

"Vikram Chatterji joins Dev Interrupted’s Andrew Zigler to discuss how engineering leaders can future-proof their AI strategy and navigate an emerging dilemma: the pressure to adopt AI to stay competitive, while justifying AI spending and avoiding risky investments.

To accomplish this, Vikram emphasizes the importance of establishing clear evaluation frameworks, prioritizing AI use cases based on business needs and understanding your company's unique cultural context when deploying AI."

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Transcript

Conor Bronsdon 0:01 Hey, everyone. It is your host, Conor Bronsdon here. This week, we're doing something a little different, and we are showcasing an interview from a different podcast. As some of you may know, I previously hosted the Dev Interrupted podcast, and our friends over there, Andrew and Ben at Dev Interrupted, graciously invited my fellow Chain of Thought cohost and Galileo CEO, Vikram Chatterji,

Conor Bronsdon 0:22 to their show to discuss how engineering leaders should position their AI strategy. Andrew and Vikram had a great discussion, and we are excited to share it with our audience here at Chain of Thought this week. If you enjoy it, we definitely recommend considering adding Dev Interrupted to your podcast listening feed as well. They have a bunch of great content. You'll hear me on a lot of the back catalog.

Conor Bronsdon 0:44 And they discussed in their conversation with Vikram how to balance the pressure from business leaders and from the stock market to adopt AI with the need to justify spend while avoiding risky investments. It's a delicate balance, and we think this episode is a must listen for anyone working in or leading engineering teams. We'll be back next week with our regularly scheduled programming. We've got some great interviews lined up. I think you'll love some of the folks who are gonna be joining us. But for now, I'll let the hosts over at Dev Interrupted take it from here.

Speaker 1:14 Hey, everyone. Welcome back to Dev Interrupted. I'm your host, Andrew Zigler, developer advocate at LinearB. And joining me today is Vikram Chatterji, co-founder and CEO of Galileo. Vikram has been on the front lines of the AI revolution for many years, from leading product management at Google during the birth of transformers to building tools that help engineering teams confidently evaluate and deploy AI systems.

Speaker 1:40 Here's the crux of today's conversation. Engineering leaders are stuck between a rock and a hard place. They know they need to experiment with AI to stay competitive, but they're under immense pressure to justify those costs. All the while, AI evolves so rapidly that today's wrong move could cost them tomorrow's opportunity. Vikram, welcome to the show. Thank you, Andrew. Super excited to be here. Likewise. Let's jump right in. Starting with the biggest challenge currently facing engineering leaders, AI being that moving target and experimenting

Speaker 2:17 feeling risky, it's a big barrier to adoption, and the fear of making the wrong investment is ever present. How do you think leaders can take the first steps without putting themselves or their teams in a bad position? I've always thought about AI as, you know, it's another tool in your arsenal. Right? Even before generative AI became a very big thing, with classical machine learning and with NLP,

Speaker 2:43 it was never about, like, hey. You have to use this thing. It was more about what's your use case, and based on that, is this a good fit for your use case? Now, I guess the difference is that, with a lot of engineering leaders that we talk to, there's a lot of top-down pressure to just use AI for the sake of it. To get to your point about being between a rock and a hard place, it's very important for

Speaker 3:04 engineering leaders to think about what are the heuristics, right, that are gonna help me figure out, you know, is this something that we should be even going ahead with? And that includes things like, what does the business need? Because what I've seen is folks just do a massive hackathon within their org, and you're gonna get a hundred ideas just given how open-ended AI can be right now, right, in terms of whether it's generation of text, whether it's generation of images or completion of a task with agents. You get a hundred use cases, but it's very, very important for them to then go back to product

Speaker 3:38 and business owners to figure out which one of them should they prioritize. And also on the back of that, which one of them can you actually get out the door very quickly? And that's kind of where the operational rigor has to kick in of like trying out x number of ideas very, very fast. So you have to have that machinery in place and the ability to push back and say that, hey, maybe I don't need to use AI at all. But when I do, here's how I need to do it. And at that point, we can talk about this more, but we have to think it through in terms of if I succeed, what does that mean for me in terms of the number of engineers I need for this, the evals,

Speaker 4:13 the cost of productionizing this thing at scale. So there's a bunch of things that people have to think about and a lot of trade-offs at the outset itself. So it sounds like in balance with that proliferation of ideas, you really need an evaluation framework or a way to understand and extract scenarios that maybe have a higher ROI or a higher impact on your business, and focus on those. Because I think that's kind of part of the problem too, is you're drowning in possible solutions and everyone can come up with maybe a way to integrate it in some way. But is that the most effective way? Is that where we should focus our attention?

Speaker 4:48 And the more attention you put on something, it can then skew how the rest of your organization is using and thinking about AI. So those decisions, especially early on, you know, they're really impactful. How would you advise or what habits do you think make for somebody to be able to evaluate and derisk that experimentation? Like, what are those tools that those kinds of people are always using again and again to do that well? Yeah. It's a great question. I will say it depends a lot on the organization, and you have to just know your organization well. So if you're a big bank, right, the amount of harm that can happen if you put something out there in the consumer world with AI and it misfires is very, very large. It can literally derail your entire bank's reputation. For a commoditized entity like a bank,

Speaker 5:37 you're gonna be out of business very quickly. On the other hand, if you're, let's say, a DoorDash or an Instacart, the bar is probably a bit lower, not to say that they have a low bar in general, but if something goes wrong with their chatbot or something like that, it's not gonna be the end of the world because they're not dealing with people's money. They're dealing with hungry people, which is bad, but they're not dealing with people's money. Maybe I didn't get my burger, but, you know, my bank account didn't have, like, an unauthorized transaction or something. They're totally different stakes. They're totally different stakes. And so what I've seen as a result of that is when you talk to these large enterprises, they're very, very excited about generative AI, they're very excited to add agents and everything else. But they're taking a very, very careful crawl, walk, run approach to it, which I think is good. What I'm seeing with the other companies, where they are earlier stage, they're tech first, let's say as an example, like a DoorDash or like an Instacart

Speaker 6:29 or like an Airbnb or a Twilio, they're taking a much more experimental approach to this. Right? Like let's try things out. Let's see how it goes. Let's see how we feel. It's very much on the lines of, you know, build fast, break things, learn quickly. And that's really led to them kind of figuring things out as they go. And it is also useful, as this industry is moving super quickly. So based on the organization, I've seen like the barrier to entry from a

Speaker 6:55 fault tolerance perspective is different. Now, within that, if you're an engineering leader at a faster moving company, if you're all about experimentation and going fast, then the question becomes, how do you plan well? And there is a certain crawl, walk, run there as well, to be honest, Andrew, because what we've seen there is you have to think about, you know, is it going to be 15 use cases, 10 use cases? Like you have to have some forecasting there because, based on the forecasting, you have to do a couple of things. You first think about the compute costs and then staff yourself with enough, I don't know, A100s

Speaker 7:26 and have that available for you so that everyone can just build. Otherwise, everyone's gonna come back to you and say like, hey, where's my GPU at? So you have to have like X amount of compute and give that out very judiciously. You then have to think about how can you optimize the compute and you know, start to invest in tooling at that layer to minimize the cost of compute.

Speaker 7:47 And then comes the eval piece, which is kind of what Galileo does, where you have to think about how do you create the right kind of guardrails in place for like an AI CI/CD process. And then basically go to the teams and say, awesome. Alright. You wanna launch things? You wanna experiment with things? Here's the stack that you can use. Go knock yourself out. Right? And some people will use the entire stack, some won't, but you have to create that enablement almost within the team before they can, like, go crazy.

Speaker 8:15 That's a big unlock. So let me try to unpack this playbook because I think there's a lot of really interesting tips in here. One of them being, first and foremost, understanding your company, your company culture, their risk tolerance and what they're doing. And understanding that there's a big difference between a traditional enterprise and a digital native company

Speaker 8:34 experimenting with AI right now, especially with different levels of risk within what they're working on. So it's about understanding your own company, your own environment, the level of risk and the tolerance for experimentation. But then getting into that, you know, crawl, walk, run loop that's going to get you up and get you moving on this process. It's about creating

Speaker 8:56 the actual resources to enable those teams to make effective tools. And so if you're going to create a way for them to grow within your company, you have to also think about resourcing them and prioritizing them based upon, again, you know, maybe the profile of your company culture. So it really does start with understanding your culture. Exactly. And you hit on a good point. It's the culture. It's the people.

Speaker 9:22 Again, it comes down to like also the business that you're running, whether you're an Instacart or a DoorDash. Again, going back to that example, they have many different ways of instituting generative AI, but maybe there's a different company that really doesn't have that many use cases and that's okay. You kind of have to look at all of those different angles and then figure out how you want to act and how quickly you want to act. But it does stem from that. As an engineering leader, I would say step one is always

Speaker 9:46 just that. Right. And part of company culture, and it's kind of what I want to focus on next, is also, you know, fears and anxieties around making the right or wrong decisions or creating tools that are doing jobs that traditionally people did within their org and understanding how people need to reprioritize their time to better use these tools. And in doing all of this, I think we're all trying to make systems that are future proof. We don't want to rebuild these AI workflows again and again and again. We want to make it once and evaluate it and iterate on it.

Speaker 10:21 So that's part of justifying too the ROI and going back to creating the resources within your team for that to grow. So how can an engineering leader, perhaps someone who's in a company culture that's maybe more on the digital native side, have an appetite for risk and experimentation? Maybe they even have a little bit of resources. How can they shift those conversations

Speaker 10:42 with nontechnical leadership from immediate gains to creating future proof solutions that are gonna help the company in maybe like a year or five years from now? It comes down to building trust. I think the first thing becomes, like, as an engineering leader, I've talked to a lot of different leaders in this space, and it all comes down to, number one, first, having a gut instinct reaction,

Speaker 11:05 a gut instinct on their own side, personally. Like, kind of a trust that, you know, these use cases are great, and this use case is actually gonna be a good one to start with. The second is staff that. Just go very, very fast and build out a prototype and start to see, does that actually add value? And then shop that around with others in your business. They could be leaders in the business, depending on how large the company is. And you can now, with GenAI, the unlock is that you can build a prototype pretty quickly. And so you can at least start to get a sense for what is the appetite here. And that's typically what I've seen happen.

Speaker 11:39 And then from there, you can start to figure out what the KPIs are. Right? Because the main thing is you want to be able to ship something pretty quickly. You want to go to production quickly with the right checks and balances in place. So you start with at least one to two use cases as quickly as you possibly can with the right checks and balances and guardrails in place, and then start to see how that looks. Like, roll it out to 1%, 5%, 10%. Start to see what you learn. And then with that playbook, you can go very fast with everything else.
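A staged rollout like the 1%, 5%, 10% sequence mentioned here is often implemented by hashing each user into a stable bucket. The sketch below shows that general technique; the function name, the salt, and the bucket count are illustrative assumptions, not details from the conversation.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "ai-feature") -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hashing makes the assignment stable: the same user always lands in the
    same bucket, so raising `percent` only ever adds users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000   # bucket in the range 0..9999
    return bucket < percent * 100          # e.g. 5% -> buckets 0..499


users = [f"user-{i}" for i in range(10000)]
stage_1 = {u for u in users if in_rollout(u, 1)}
stage_5 = {u for u in users if in_rollout(u, 5)}
stage_10 = {u for u in users if in_rollout(u, 10)}
```

Because every user in `stage_1` is also in `stage_5` and `stage_10`, nobody loses the feature as the rollout widens, which keeps what you learn at each stage comparable.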

Speaker 12:09 This is exactly what we've seen with the largest banks in the world, as well as the digital natives. It's just that the digital natives are just moving much, much faster than the largest banks. But this playbook is kinda similar on both sides. It makes a lot of sense in terms of how those teams can get started. It's also about evaluating, going back to the evaluative frameworks from earlier, and having measurements in place to look at the results week after week, and that creates within the company culture a healthy socialization and understanding of the tools that people are building and we're using, and that's what cultivates the trust. And it's also something really hard to find in an LLM based world. You know, LLMs are stochastic, they don't want to repeat themselves. So when you put them into an environment where you want a repeatable workflow,

Speaker 12:57 where you have an AI agent evaluating things on the fly, there's so much to consider. And that's kind of part of, you know, the dauntingness that I was alluding to earlier. In our initial conversation, you know, there was something that you mentioned to me that's really resonated with me since. You mentioned how AI agents, in the future or even now, can act as smart routers or kind of like load balancers

Speaker 13:21 for workflows and help optimize them over time. And that was really fascinating to me. I'm wondering if maybe you could dive into that a little more. Just on a technical level, now that we've talked about introducing these products, or these projects, within your company, if you are somebody who's revolutionizing a workflow right now with AI, how should they best kind of think about it?

Speaker 13:44 What I'll say is in terms of agents in particular, how that works. So for context for folks, Galileo is a leading provider of evaluation tooling for any AI developer out there that's building an AI app, right? And what that means is an eval basically includes not just the metrics, but also like the dataset that you're working with. And it includes like a workflow so you can really understand what your failure modes are, build out evals around that, and then use that in your CI/CD process. So you know exactly if you're shipping a good product, and once you ship, you can check for regressions. That's the net net of how evals work. Now with agents, what's interesting is we've kind of moved from people building like just chatbots. That was like, it almost felt like that's just as far as the imagination could go in terms of use cases. To now it's like, you can complete any task. What's the task that your product does? So, you know, we started seeing this with Operator's launch as well, which is a good example of an agent where all of a sudden I started to see Booking.com's

Speaker 14:41 folks talking about like, hey, now you can book a hotel room with just natural language. And then Box.com's CEO started talking about how you can, like, add files and folders. So it just unlocked all these use cases. And that's kinda what we're seeing with agents coming. Are these apps perfect? No. And here's why. Because what's happening with agents is it's essentially a way to say like, hey, LLM,

Speaker 15:05 instead of just generating something, why don't you just act towards choosing the right kind of tool? Or here's a function, I've wrapped it in a certain manner, do x for me. Right? So making it do specific things. The tool piece of this especially is interesting because now the API can act as, as you mentioned, as a smart router to figure out, you know, which tool should I use for finishing this task, which is I think the biggest unlock right now. Because earlier, you would just code all of that. You would like literally, deterministically say, this is exactly what you need to do. Versus now, it's almost like, I just want this done. You figure it out. You figure it out for me. Which is also why there's almost like this leader worker relationship almost with an

Speaker 15:46 agent that's happening. And so that's been a very big unlock because now these agents can just find out the right tool and go ahead. However, it does lead to different kinds of failure modes, right? Like, did it choose the right tool or not? Did the right tool get called in the right way or not? Even if it did choose the right tool, what happened after that? Can I see the entire flow of how all of that worked out? At the end of the day, did it complete the task? What do you mean by complete the task? How do you measure quality of task completion? Did it plan the whole task properly or not? So there's a bunch of these different kinds of

Speaker 16:22 qualitative stuff along the way that you need. They need to get evaluated in the evaluation system that we've been talking about. So all of those things need to get looked at. Because when you're in this environment where it's sitting as a router or it's making smart decisions, it's no longer like a chatbot where it's query response, query response. Instead, it's request or, you know, demand or like, you know, you need it to do something and it's then a sequence

Speaker 16:48 of actions. And those actions lack visibility to you. You're not staring at it making the decisions or the actions or making the API calls. You just see what it tells you at the end. So Yep. That's kind of where I and that goes back to building trust as well. Yep. Because you need to understand what's happening in those stages. Yeah. Exactly. It's it's funny how similar this is to working with a human being. Yeah. Exactly. That's what's coming to mind for me too. I really liked what you said about leaders

Speaker 17:17 and contributors, but it kind of being this balance where you as an engineer are overseeing the output or the work of the agent, and you're responsible for its success, just like how a manager is responsible for their IC's, you know, success. You have to give it the right tools, the right environment, and the right context. So I think it's a big unlock. Are you kind of seeing that from folks that are starting to engage in these workflows?

Speaker 17:42 We are because that's exactly the kind of question that they're asking around, like, hey. How do I make sure that everything worked fine? And also, if it did work fine and the answer is correct, can I see what the route was that it took? Because maybe it's just making unnecessary API calls that it doesn't have to in the first place. Can I optimize this even further? So there's a big question around

Speaker 18:03 what are the failure modes, but also can I have a visualization into like how the entire AI app like actually made its decisions and how it was planning so I can maybe tweak things here and there? The system itself, I think the folks at Databricks, Matei Zaharia coined the term compound systems for what we at Galileo basically call your AI app. The compound system is becoming more complex because of these agentic scenarios where essentially people are adding function calls.

Speaker 18:32 So it's fascinating because now it's even closer to classical software engineering and it's all about how good are your functions and how well are you managing it all. And those engineers come to us and they ask us, great, this is fun, but what is a unit test for me now? And what is a regression test for me now? And that's where eval has come in. Right. I'd like to understand a little more about how does Galileo open that box so that you can look in to understand

Speaker 19:00 how those tools are working and how somebody would use something like that to build confidence. Because what you're describing is like a whole new category to me of how we think about tools. When you're in an environment where you're defining a novel category, you really have to do a lot of, like, definition setting and understanding. So we're all for the first time opening the box and looking inside the workflows that we're building now for the first time. What does Galileo provide? What are the things that people should be looking for?

Speaker 19:32 What Galileo tries to do, our end goal, is to help you build high-quality AI apps fast. That's our end goal. So we win if you're building those apps 10x faster and those apps are 10x better. That's our goal. Now in order to do that, that includes what I think of as the visualization layer, meaning, like, you could just see your traces and spans. I feel like that's the easy part. It's good for them, but it's highly commoditizable. Right? Anybody can build that thing. So we obsess about the user experience at that layer. But then beyond that, what we've been seeing is we initially gave the developers the ability to just build their own metrics as well and score their agents. And what we saw was developers

Speaker 20:13 struggled with that because they kind of had an idea about like, hey, I need to see if it's planned this thing out well enough. But then in order to build that metric, they almost had to build a very complex prompt. They had to figure out the instructions. They had to figure out how do you optimize this. Do I use a GPT-4 model for this, something else? So that's the layer where we basically realized that, wait, this is actually ripe for a lot of research. So we have a fairly large staff of AI researchers that are constantly working on what we call our Luna evaluation metrics,

Speaker 20:43 where it's not just the prompt instructions and things of that nature, but we also focus a lot on how can we optimize the cost and latency of these different metrics. So what does this mean for the user? What this means is I've built my agent, or I'm building my agent. You could just use Galileo's TypeScript SDK or JavaScript SDK to be able to start logging your application. And then on the other side, without any ground truth being needed, you basically magically see not just the visualization layer, but you also start to see two things. One, you'll see these very, very highly explainable metrics show up with an explanation for exactly what went wrong and why. The second thing that you see are automatic insights as well, which are fairly easy to understand. Because what we want to do essentially, and what the developer wants as well, is: I'm just trying to build this agent out. I cobbled together a few things. I'm trying to run a quick experiment.

Speaker 21:34 Which part of this complex compound system should I focus on? Should I focus on this API, that one, the prompt, or something else? And so we basically dumb it all down for them as much as we can and tell them, you know, we have all your logs. We also have these metrics that we've built out based on all of this. Here's what we think you should do. So it's almost like a copilot for your AI application development. That's the journey we wanna go with them on as they're going from, like, building towards scaling.
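To make the "which part should I focus on" question concrete, here is a minimal sketch of ranking the components of a compound system from logged spans. It is an assumption about how one might do this by hand, not a description of Galileo's insights engine. The `Span` shape, `rank_components`, and the component names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One logged step of an agent run (hypothetical shape)."""
    component: str      # e.g. an API, a tool, or the prompt/LLM step
    latency_ms: float
    ok: bool


def rank_components(spans):
    """Rank components by failure rate, then by mean latency, worst first."""
    stats = defaultdict(lambda: {"calls": 0, "failures": 0, "latency_ms": 0.0})
    for s in spans:
        entry = stats[s.component]
        entry["calls"] += 1
        entry["failures"] += 0 if s.ok else 1
        entry["latency_ms"] += s.latency_ms
    ranked = [
        (name, e["failures"] / e["calls"], e["latency_ms"] / e["calls"])
        for name, e in stats.items()
    ]
    ranked.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return ranked


# Made-up spans from a couple of agent runs.
spans = [
    Span("search_api", 120.0, True),
    Span("search_api", 480.0, False),
    Span("prompt", 900.0, True),
    Span("prompt", 1100.0, True),
    Span("crm_api", 80.0, True),
]

ranked = rank_components(spans)
for name, failure_rate, mean_latency in ranked:
    print(f"{name}: failure_rate={failure_rate:.2f}, mean_latency_ms={mean_latency:.0f}")
```

Sorting by failure rate first and latency second is a design choice for this sketch. A real system would likely weigh quality metrics and cost as well.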

Speaker 22:02 That's very fascinating to me because when someone's going to look at all of these different variables, what Galileo is doing is it's helping you isolate those variables and find the ones that are going to have the biggest impact for you to focus on, which kind of resonates with like the whole top line objective of folks having to, or being in a position where they are experimenting with AI or building workflows as they're trying to evaluate and justify

Speaker 22:25 where they should be spending their time because it's moving so quickly that their time always needs to be spent on the most impactful part of the project, or anything else you're working on is likely just adding risk to the project because you're doing things that are probably going to be outdated by the time they're really in practice or really in use. Yep. And that requires like a whole mental, I think, flip about how we build

Speaker 22:47 and how we look at them. So it helps you isolate the variables and make those tools a little more natural to develop in a classical way where you can evaluate and understand. Yep. Yep. No. I agree with you. That's exactly right. And when someone is maybe building and managing a bunch of these AI bots or tools or workflows, you know, are you seeing that people are almost like doing performance evaluations

Speaker 23:12 on their agents? Like maybe how somebody would do for someone who's getting ramped up as a BDR or as a customer success manager, and there's, you know, basic tenets that you want them to do across the board, and you're evaluating their interactions with customers or whatnot, because you understand what is good and what is bad in the environment of your company's culture. Are you seeing that that's how people are using and evolving

Speaker 23:35 as they build these tools? Yeah. It's similar. It's very similar because to take your example for a second, let's say you hire an SDR. The SDR is mostly focused on outbounding, and then you've got to see the quality of their outbound, who they outbounded to, what the result of that was, did they book a call or not, and all sorts of stuff. There's an entire funnel there. Imagine all of that's being done by an AI agent.

Speaker 24:00 That AI agent's basically gonna have to do a bunch of different kinds of API calls to make sure all of that happens. And now the question becomes, great, if it's an AI agent that's not sleeping at night, then how do I make sure that there's some sense of potential failure modes that can happen? You would probably do this with the SDR as well, the human SDR, where you'd probably wanna have, like, some kind of inspection. Right? You have an expectation setting. I want you to book 10 calls this week. That's my expectation. And then with the human you have some level of inspection around,

Speaker 24:29 great. Like, how many outbound calls did you do today? And then I'm gonna start to look at a funnel. It's similar. You come up with that mental model of what those potential failure modes would be as the person who's building the app, and then you have to build out the guardrails accordingly. And then what happens is as you go through the motion of interacting with that human SDR or the AI agent, you start to figure out more failure modes because you go deeper and deeper and deeper. You're like, oh, shit. This can go wrong. That can go wrong. And then on the fly, you're gonna have to create more evals and more metrics, and you also start generating more data

Speaker 25:03 around, like, ah, you know, when it's trying to reach out to this specific kind of person with this specific cue, that's when it's failing. So maybe I should isolate this data a little bit. That's when you start to create, like, this dataset that you wanna test against and these metrics. So that's the flow that people have been going down of, like, exploring and creating these models and making it part of their process. What really stands out to me about that is it can get very complex, and it sounds like a whole new set of skills even.
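The loop described above (notice where the agent fails, isolate that data, then keep it as a dataset to test against) can be sketched roughly as follows. This is a hedged illustration only. The record fields `segment` and `passed`, the thresholds, and the function names are assumptions made for the example, not anything taken from Galileo.

```python
from collections import Counter


def failing_slices(results, min_cases=2, threshold=0.5):
    """Return segments whose failure rate is at or above the threshold.

    Segments with fewer than `min_cases` records are skipped so that a
    single bad run does not flag a whole slice.
    """
    totals, fails = Counter(), Counter()
    for r in results:
        totals[r["segment"]] += 1
        if not r["passed"]:
            fails[r["segment"]] += 1
    return sorted(
        seg for seg, n in totals.items()
        if n >= min_cases and fails[seg] / n >= threshold
    )


def regression_set(results, segments):
    """Collect the failing cases from the flagged segments for re-testing."""
    return [r for r in results if r["segment"] in segments and not r["passed"]]


# Made-up eval results for an outbound agent, tagged by prospect type.
results = [
    {"id": 1, "segment": "enterprise_cfo", "passed": False},
    {"id": 2, "segment": "enterprise_cfo", "passed": False},
    {"id": 3, "segment": "enterprise_cfo", "passed": True},
    {"id": 4, "segment": "startup_founder", "passed": True},
    {"id": 5, "segment": "startup_founder", "passed": True},
    {"id": 6, "segment": "startup_founder", "passed": True},
    {"id": 7, "segment": "agency", "passed": False},
]

flagged = failing_slices(results)
dataset = regression_set(results, flagged)
print(flagged, [r["id"] for r in dataset])
```

Each newly discovered failure mode would add records and checks to a set like this, so the dataset grows alongside the agent.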

Speaker 25:33 From your perspective, what do you think is the most powerful habit or skill that an engineer or an engineering leader should be picking up right now to stay ahead of this kind of curve? They have to build. I think the best engineering leaders that I've seen, they're doing two things right now. They're building. They're keeping themselves upskilled. I think everyone has to do that. I certainly do that all the time. Everyone has to do that. And if you're expecting your team to build out these kinds of AI apps, you have to get very, very familiar with it. That's one very large part of it. And the second piece is just being in touch with the community, like learning from each other. Cause I've noticed that organizations that are moving really fast are the ones where they have eng leaders who are also talking to other eng leaders about what they have learned really, really quickly, because everything's moving so quickly. You can't wait to make those mistakes yourselves.

Speaker 26:22 Right? So I'm seeing folks where there's like a good cross pollination amongst leaders that they're moving much, much faster. It could be that, you know, an engineer might tell somebody else that, Hey, you should get Galileo because without evals, it's gonna be really bad. And the other one's gonna probably learn that the hard way. It has happened with a lot of engineering leaders before.

Speaker 26:39 So I feel like those two things are very, very important. Keep being at the forefront of learning by doing because you're an engineering leader. Dust off those skills and you can actually build simple apps on the side. And the second one is just being a big part of the community somehow, being in touch. Build, read, and communicate. Yeah. Three core defining traits.

Speaker 27:01 Yep. That's a really powerful takeaway. For me, what really stands out about that is that people who put in place small incremental changes over time, you know, they get ahead faster and faster. In an incremental world like AI, somebody who's building that tool yesterday is going to be much more ahead of you tomorrow if you didn't build it. So those engineers that are out there building and talking about what they're building and they're reading about what other people are building, those are the ones that are getting ahead and are staying on the forefront. And that's a really great takeaway.

Speaker 27:36 Vikram, this has been an incredible conversation. There's so much insight packed into what we talked about today. I wanna thank you for sharing and giving our listeners some actionable takeaways. Before we wrap up, where can our audience go to learn more about Galileo and the work that you're doing? Yeah, for sure. So we are at galileo.ai. You can also check us out on LinkedIn and on Twitter.

Speaker 27:59 We post a lot of content on our website. We put that in our blogs and the website. There's an entire research section. We publish our papers and everything else that we've worked on for AI evals. We've been around for four years, so there's a rich history of a large body of work there. So they can check it out there. We're also hiring right now for a lot of engineers

Speaker 28:18 who have built out their own AI apps. Excited to chat with anybody who's interested in being at the forefront of helping builders build. That's fantastic to hear, and I'll definitely include those in the show notes on Substack. So if you listen to this and you're interested in getting involved with Galileo or following up on anything that we talked about today, you know, please be sure to subscribe, share our episode. You've made it this far, and if you check out our Substack, there are even more insights from today's discussion.

Conor Bronsdon 28:48 Thank you everyone for tuning in this week, and a huge thank you to the Dev Interrupted team for letting us republish Vikram's interview. Again, make sure you go check out Dev Interrupted wherever you get your podcasts. We'll be back next week with an awesome Chain of Thought interview that we think you'll enjoy. Thanks for tuning in once again, and we'll see you soon.