Episodes · S2 E25

The 2025 AI Shift: From Chat to Task Completion & Reliable Action | Galileo Founders

May 28, 2025 · Vikram Chatterji (Galileo), Atindriyo Sanyal (Galileo) · 45 min

AI Agents · AI Evaluation & Reliability · AI Observability · Enterprise AI

Key takeaways

Vikram Chatterji frames 2025’s defining shift as generative AI moving from chat completion toward task completion and action. He argues this elevates the enterprise conversation beyond building chatbots to automating tasks, with the goal of higher ROI on OpEx and CapEx — though he concedes “agents” is “an overblown term.”
Vikram describes a “gold rush” among middleware providers racing to build orchestration and frameworks — citing OpenAI’s Agents SDK, Anthropic’s MCP, and Google’s A2A. He’s blunt that some “don’t mean anything” and are “shallow libraries,” but reads the collective effort as the system falling into place to make task-completion apps real.
Atin Sanyal pitches evaluation agents as composable functions that sit where the developer works — for example an eval agent installed as a tool inside Cursor that automatically fixes potential issues. He argues eval is expanding into a broader AI reliability story rather than the point-in-time activity it used to be, with MCP servers standardizing communication around the LLM.
Vikram positions Galileo’s LUNA metrics engine as the answer to AI’s measurement problem, which he calls the company’s north star. He notes generative AI lost the F1 score that classification had (and that F1 “wasn’t a very good metric” anyway) and describes LUNA as a factory for creating metrics, testing them, and tuning them for low latency at scale.
On vibe coding, Atin says it makes developers “10x more faster,” compressing weeks of work into an hour, but warns that blind vibe coding yields “a pile of crap” that isn’t deployable. He argues a limited set of high-quality design patterns from earlier software eras is re-emerging with the LLM in the mix, which sharpens the need for evaluation and reliability.
Vikram reports the biggest enterprise ROI so far is internal-facing use cases — telco outage detection, wealth-management report generation, customer support, accounting — where the accuracy bar is lower so teams experiment rapidly while OpEx savings are “massive.” He says enterprises now want a single pane of glass over risk vectors across a hundred-plus use cases.

Frequently asked questions

What does Vikram Chatterji mean by the shift from “chat to task completion”?

Vikram claims the defining 2025 theme is generative AI moving beyond chat completion and generation toward actually completing tasks and taking actions, which he calls agents while admitting the term is “overblown.” He argues this matters for enterprises because it raises the stakes from building chatbots to automating real work, targeting higher ROI from both OpEx and CapEx. He frames language models as becoming operating systems that power these actions, and says adoption for this use case is moving “very rapidly,” even if widespread task completion is still arriving in “fits and starts.”

How do Vikram and Atin describe Galileo’s approach to the AI “measurement problem”?

Vikram says the measurement problem has been Galileo’s north star since the company began: generative AI has no equivalent of the F1 score that NLP classification relied on, and even F1, in his view, “wasn’t a very good metric.” His pitch is that you can’t build a high-quality system you can’t measure. He describes Galileo’s LUNA metrics engine as a factory for creating, testing, and tuning low-latency metrics that auto-adapt at scale. Atin adds that there’s “no one-stop-shop metric,” which is why they baked auto-adaptation into LUNA. Both are the founders’ claims, not verified here.

What did the guests say about vibe coding and reliability?

Atin claims vibe coding makes developers “10x more faster,” turning weeks of work into an hour, but cautions that coding “completely blindly” produces “a pile of crap” that isn’t deployable: great for prototyping, not production. He argues a limited number of proven design patterns are re-emerging with LLMs in the mix. Vikram frames speed (vibe coding) and quality (reliability) as two different problems, both accelerating, and says moving from “one to production” still needs an expert in the loop plus real guardrails. These are the founders’ characterizations.

Which enterprise use cases did Vikram highlight as already delivering value?

Vikram claims the strongest early ROI is internal-facing: telcos building faster outage detection, wealth-management teams at large banks generating reports faster, customer support across nearly every enterprise, plus accounting and finance. He argues the lower accuracy bar on internal tools lets teams experiment fast while OpEx savings are “massive.” Some retailers are moving external-facing too, where the accuracy bar rises and real-time guardrails on millions of queries become essential. Conor adds that big organizations can aggregate value from simple fixes like helping people find docs.

How fast do the guests expect enterprise AI adoption to move versus cloud?

Both founders use the cloud-migration analogy. Vikram describes a diffusion-of-innovations pattern: a handful of innovators in finance or healthcare prove out the model-risk-management and ops layers, see large efficiency gains, and trigger a domino effect of FOMO across the industry, the same dynamic that played out with cloud, but happening a bit faster. Atin goes further, claiming adoption “will happen a 100 x faster than the cloud,” because the enabling technology already lives on the cloud and, with sufficient compliance, almost anyone can “write two lines of code” to bake in intelligence.

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Evaluation · Compound AI Systems · Model Context Protocol (MCP) · Model Risk Management · Vibe Coding · Accuracy · AI Agent · Retrieval-Augmented Generation (RAG) · Artificial General Intelligence (AGI) · Latency

Chapters

00:00 Welcome and Introductions
01:05 Generative AI and Task Completion
02:13 Middleware and Orchestration Systems
03:17 Enterprise Adoption and Challenges
05:55 Multimodal AI and Future Plans
08:37 AI Reliability and Evaluation
11:08 Complex AI Systems and Developer Challenges
13:45 Galileo's Vision and Product Roadmap
18:59 Modern AI Evaluation Agents
20:10 Galileo's Powerful SDK and Tools
21:24 The Importance of Observability and Robust Testing
22:27 The Rise of Vibe Coding
24:48 Balancing Creativity and Reliability in AI
31:26 Enterprise Adoption of AI Systems
36:59 Challenges and Opportunities in Regulated Industries
42:10 Future of AI Reliability and Industry Impact

Show notes

AI in 2025 promises intelligent action, not just smarter chat. But are enterprises prepared for the agentic shift and the complex reliability hurdles it brings?

Join Conor Bronsdon on Chain of Thought with fellow co-hosts and Galileo co-founders, Vikram Chatterji (CEO) and Atindriyo Sanyal (CTO), as they explore this pivotal transformation. They discuss how generative AI is evolving from a simple tool into a powerful engine for enterprise task automation, a significant advance driving the pursuit of substantial ROI. This shift is also fueling what Vikram observes as a "gold rush" for middleware and frameworks, alongside healthy skepticism about making widespread agentic task completion a practical reality.

As these AI systems grow into highly complex, compound structures—often incorporating multimodal inputs and multi-agent designs—Vikram and Atin address the critical challenges around debugging, achieving reliability, and solving the profound measurement problem. They share Galileo's vision for an AI reliability platform designed to tame these intricate systems through robust guardrailing, advanced metric engines like Luna, and actionable developer insights. Tune in to understand how the industry is moving beyond point-in-time evaluations to continuous AI reliability, crucial for building trustworthy, high-performing AI applications at scale.

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Follow Today's Guest(s)

Website: galileo.ai

Read: Galileo Optimizes Enterprise–Scale Agentic AI Stack with NVIDIA

Check out Galileo

Try Galileo

Agent Leaderboard

Transcript

Speaker 0:06 I'm delighted to welcome you back to Chain of Thought and to welcome two of my co-hosts back as well. We've got the co-founders of Galileo. Well, two out of three today, Vikram Chatterji, CEO, and Atindriyo Sanyal, CTO. Gentlemen, thank you so much for being here with us today and for sharing your insights with our listeners. Of course. Great to be here, Conor. Yeah. Likewise.

Speaker 0:26 And I think it's a particularly poignant time to have you both here because not only is there a lot going on for Galileo on the product transformation side as evaluation and reliability have become key themes for AI in 2025. But we're now a few months into the year. We're recording this right at the end of April, and the pace of AI innovation hasn't slowed. In fact, it's,

Speaker 0:50 maybe to no one's surprise, sped up quite a bit. Every week brings new models, new techniques, new conversations about where this technology is heading. Vikram, let's start with you. What key themes are you paying attention to thus far in 2025, and what are you seeing for the next few months? I think the big theme for 2025 has been the move from generative AI being this tool that was used for just, you know, chat completion and generation towards actual task completions

Speaker 1:18 and actions. And in my head, that means a lot for the enterprise because it elevates the conversation from being all about just building a bunch of chatbots to actually automating tasks in the enterprise or actually getting higher ROI overall from an OpEx and CapEx perspective. And, basically, I am talking about agents. It's an overblown term to some extent, and we can talk more about that. But the idea that these these language models are basically becoming, like, these operating systems that are powering these

Speaker 1:50 these actions are being adopted for this use case very rapidly. So number one, that's brought in a lot of excitement and interest, but that also in the beginning of the year brought in a lot of skepticism around, like, how real is this? Can these tasks actually be completed by these AI agents? What does this even mean? You're seeing a lot of definitions flying around by, you know, all sorts of founders around, like, this is my definition. This is my definition. But at the end of the day, it's basically task completion. What's really interesting from a secondary theme is because of this huge emphasis on wanting to move towards task completion, just given the amount of

Speaker 2:23 value that can be uncovered for the enterprise using this over the course of the next decade. There's a huge gold rush from the middleware providers around that: can we build the right orchestration system for this? Can we build out the right kind of frameworks for this? You're seeing this huge rush from the, you know, OpenAI came out with this Agents SDK. You have

Speaker 2:43 Anthropic came out with MCP because working with tools was tough. You have the, you know, the A2A from Google recently. Some of them, they don't mean anything. They're actually just shallow libraries. But from a marketing perspective, everyone just wants their stamp out there around, like, you know, we have something too from a middleware perspective. But what this means in my perspective is it's all coming. When everyone's working really hard, the entire industry is working very hard to make something happen, it means that the entire system is coming in place to make the

Speaker 3:13 the top level applications around task completion become more of a reality. That being said, the third observation has been that it's not there yet, but we're seeing this in fits and starts. So some startups have already managed to do this in a really effective way, and some enterprises, some of the largest banks I'm talking to, they've also managed to actually build out, you know, these

Speaker 3:35 AI applications powered by different kinds of tools. They've built their own internal tools. They've built their own MCP servers around that. It's becoming more of a reality, and that to me is a big, big shift, which by the end of the year is gonna become much more like how we talk about maybe RAG-based systems or prompt engineered solutions

Speaker 3:56 now, which is, you know, very much the norm. This is the next level up in that journey. And I think this is the biggest shift and biggest unlock from an AI value creation perspective. Atin, what's your perspective? Yeah. I kinda echo what Vikram is saying for sure. I think there's a lot of, you know, just overall high energy and excitement around emergent infrastructure,

Speaker 4:22 around the LLM, which was kind of the center of the solar system. But we've moved far beyond that, to Vikram's point, around actually doing things, doing tasks, taking action. But what's most exciting, I think, for me is that we kind of have a standard of data exchange and communication being set up. The advancements in MCP, with MCP servers, etcetera, are this attempt to sort of standardize communication around the LLMs.

Speaker 4:52 And we've already seen very early success with certain frameworks. There's already a lot of MCP servers, which are open source out there, and they host a whole bunch of these tools which are augmenting the LLM to actually take action. But then the question becomes what is the accuracy of the action and whether we actually achieved what we intended to do or not. So a lot of

Speaker 5:20 great advancements made just in the last four months, agentic frameworks, of course, being the biggest of them, but also, you know, a lot of multimodal launches with GPT-4V and DeepSeek-VL. It was a paper they wrote in 2024, but they recently launched it a couple months ago. So all this will kind of come together to just give a more enhanced sort of experience around

Speaker 5:43 language models in general, but also kind of set the standard for how to build good, robust, reliable generative AI applications in a more centralized manner. And multimodal evals are increasingly a topic of conversation as well, which is obviously something we are working on internally here. I'd love to get thoughts from both of you around the direction that Galileo is planning to go adapting to

Speaker 6:12 leveraging some of these new frameworks, whether that's A2A, AGNTCY's ACP that we're obviously partners with Cisco and others on, or MCP. Atin, I know you've been experimenting there. You're already talking to the team about it. And then, obviously, we have plans to launch more on the multimodal front coming soon. What are the key things that people should be looking out for as far as Galileo's product and

Speaker 6:39 the vision that the two of you have for it? Yeah. Maybe I can take a first stab at this. So multimodal will happen. I think there's enough proof of just very high quality agentic frameworks already being released that support multimodal. These models have a significantly better understanding of multimodal data, and we're already seeing, even in the enterprise,

Speaker 7:06 image-based Q&A systems sort of pop up in the prototypical phase. There'll be a lot of acceleration on this front in the remainder of the year. What it means for evals, I think it means a bunch of things. Firstly, how do you adapt to sort of this new landscape of how people are building end to end applications? That includes the IDE itself in which you are actually sitting and building the app. Right? There's already in the zero to one phase, there's a lot of potential for errors

Speaker 7:40 and performance issues and, you know, we'll probably talk about vibe coding in a bit, but there's a whole host of compounding issues that lead to the proclivity of errors in Gen AI apps right from when the developer is building. So from an eval standpoint, it's all about how do you adapt to this new pattern of development going beyond, say, standardized logging and getting insights on a UI.

Speaker 8:11 That's why I personally talk a lot about evaluation tools and evaluation agents. These are these composable sort of functions, if you will, that sit right where you are. You can install an, you know, eval agent as a tool in your IDE, in Cursor, that will automatically fix the potential issues for you. So eval kind of expands itself into this broader idea of AI reliability

Speaker 8:39 because in the erstwhile world, you talk about evaluations and observability. And with the era of agents, there's so many new components and you truly see end to end systems building. Evals and observability sounds like a point in time activity. And what you really need is a sort of a broader reliability story where these metrics and these little evals, they're kind of means to an end, which are,

Speaker 9:04 in the end, you want high quality end to end applications built. But the form factor it takes would be in the form of things like tools, things like agents, where you have a separate MCP server, for example, that only does evals, which you can point your app to and will automatically start fixing and taking action on your application. Absolutely agreed. And we're seeing this increasingly as these new roles begin to be established at different enterprises,

Speaker 9:32 whether it's partners we work with like HP and Databricks or elsewhere. We're seeing people who are focused on evals, who are focused on reliability as, you know, maybe even we're going to see AI reliability engineers. We're seeing AI PMs come to the forefront. And while observability and evaluation are crucial to how they're getting their job done, really what we're delivering with that is helping them to create effective applications,

Speaker 9:55 create continuous learning and continuous iteration loops, ideally ones that can be fed through automated systems, and have reliable applications that users can trust. So I think that's a great point, Atin. And, Vikram, I know that's something you're hearing a lot with the enterprises that you're talking with as well. Yeah. Because if you think about at the end of the day, what people

Speaker 10:16 care about, what AI leaders care about, AI engineers care about is they just wanna be able to build out AI applications, whether it's agentic or otherwise, at scale in a very reliable way. To your question before about multimodal, whether it's speech, video, image, or even if it's with basic text, I think what's happening right now is the systems are, to use the word from Databricks, right, the compound AI systems

Speaker 10:43 that are being developed. These compound AI systems are getting more and more complex. But Dave from Databricks had coined that phrase of compound AI systems maybe about a year and a half ago, and that was when we were in the world of prompt engineering, and we just had a model and a prompt, and RAG came about, and he's like, this is a compound system now. But, you know, if he looks back at this right now, it's compounded even more. So the systems are becoming more and more complex. And the reason I bring this up is because from a developer's perspective or from an AI leader's perspective in the enterprise,

Speaker 11:16 they're looking at this compound system and thinking of it as comprising a bunch of different things all in one. Right? Where you have a multimodal model there, which is doing some kind of image to text generation, looking at a PDF, and the next one is gonna be taking that text and kind of summarizing that, and then there's a third and a fourth one. So there's a series of different kinds of steps that are being taken, and there might be some multimodality in there. There might not be. There might be some RAG in there. There might not be. There'll be multi-turn applications from a chat perspective. There might not be. So the systems are extremely complex.

Speaker 11:50 But to your point about multi agentic systems and how do we look into those, I see that as, like, you know, one aspect of this entire compound system, which is why, from a developer's perspective and from a leader's perspective, at the end of the day, they care about how do you build these reliable, super complex compound AI systems now, which includes all of these different pieces,

Speaker 12:15 which is very hard to debug, very hard to observe, very hard to understand, and there are maybe 40 different failure modes here. One aspect of that is real time experimentation, real time monitoring. One aspect is, like, can I have real time protections? Can I have complete cost transparency? Can I have a governance engine in place here? So there's a bunch of different stuff here, like, which comes in place, which is why the overarching narrative from an enterprise perspective has always been, like, what's the AI reliability platform I can use, which can be a partner for the long term for these super complex LLM systems that my team is building and make sure that this works at scale.

Speaker 12:57 That's where Galileo sits, and that's where we've been helping some of the largest enterprises in the world, and we'll continue to do that. Absolutely. And it seems clear that as enterprises create multi agent systems, compound systems across different multimodal use cases, there's this increasing need, as you've mentioned, and as Atin's highlighted, for not just evals as a point-in-time solution,

Speaker 13:21 but as a mechanism that works throughout the AI development life cycle and whereby you are causing improvement flywheels to occur, and you're creating systems that work better than they did three months before or four months before. And I love that product vision for the company as this reliability platform for AI engineers, AI PMs, AI builders, whatever their names may be.

Speaker 13:46 Are there particular parts of that product vision that you would wanna highlight that you expect to see come true for Galileo over the next six, nine months? Yeah. I'm gonna take a quick stab at this. Like, there's some core beliefs we have, Conor, just based on what we're seeing in the enterprise and what we're seeing across our customers. And those beliefs typically translate into products

Speaker 14:09 eventually. One of them is this notion of, you know, just overall AI reliability in production, and that aspect pertains more to the enterprises that already are in production that require real time scalable guardrailing. They require a way such that, in regulated industries especially, every single query that's coming into the system as well as every response that's coming out and everything in between needs to be guardrailed based

Speaker 14:37 on specific kinds of guardrails that are built for their specific use case and do that at millisecond latencies and a low cost. So there's definitely that aspect, which is one core belief that we have there. You know, AI reliability in production is gonna be a really, really big thing, and it's gonna become even harder and harder as a problem to solve for enterprises

Speaker 14:57 as we go along. So that's one big aspect of things. The other thing that we've been thinking a lot more about from a belief perspective is there is a big measurement problem in the world of AI. We've always thought about that as our north star since the very beginning of the company. And I mention this every single time I talk to an enterprise. Like, there is no F1 score, like we had in the NLP world for classic classification tasks, in the world of generative AI, and even the F1 score wasn't a very good metric.

Speaker 15:24 Right now, with generative AI, all of that is gone. So how do you actually build a high quality system if you can't measure it properly? So a very large part of what we'll be building out is gonna revolve around tripling down on our LUNA metrics engine that we've built out, which is basically, like, this entire factory where you can, like, create metrics. You can test metrics. You can make them low latency very easily and make sure that they're high quality at scale and auto-adapt. So there's a huge emphasis around just R&D and algorithms

Speaker 15:54 and infrastructure on the metric side, and all of that collectively is our LUNA piece just to solve this measurement problem of AI. Right? And all of that is gonna power, to your point, the SDLC of the AI application life cycle all the way from, like, offline experimentation to CI/CD, to online, to real time. All of that is powered by this core platform. That's the core of what we're trying to solve. That is, solve the measurement problem at scale. That's the second belief that we have. The third belief that we have is these complex AI systems need to be tamed. Now what does that mean? Like, if you look at most

Speaker 16:28 systems today, the focus of the industry is becoming more and more on the output. Like, is the output good? Is the output bad? Is the system fine? Is it hallucinating? What's happening now is because these systems are getting so complex and compounded, it's also becoming very important to help the developer understand the shape of their system and understand where the failure modes are. If you think of a multi agentic system where you have, like, hundreds of different traces,

Speaker 16:51 dozens of potential tools to be called, no one's gonna go trace by trace by trace and try to look for where things are going wrong. It's just not gonna happen. It's a fool's errand to look for that. So the question becomes, like, can you start to build algorithms and build systems in place such that you can tame these very large complex systems. And we're working on a lot of very interesting solutions there from an algorithmic perspective to just automatically

Speaker 17:14 take in these millions of traces, millions of different input signals all the way from the code to the output generation, etcetera, to automatically give the developer an overarching, you know, bird's eye signal of exactly where the failure modes are and then take action on those signals and actually tell the developer what the actions should be just to truly uplevel the nature of their development life cycle. And I think that's very necessary with multi agentic systems because if you think of it that way, and I'll pause right after this, but if you think of this, what's happening is agents are like workers in a room. Right? You have the managers, you have the workers, and then now they're all going and doing things,

Speaker 17:51 and there are these tools that they can use. But, you know, if you ask somebody about, like, where are things going wrong? It's not gonna be, like, you're gonna stop a worker and be like, show me everything you've done over time. And now that we figured this out, you need a summary overall and some kind of a metrics dashboard and an overall summarization dashboard of here's what's going wrong quantitatively. Here's what's going wrong qualitatively. And that's the only way you can then start a root cause analysis. I think all of that's gonna be necessary

Speaker 18:15 to tame the complex system. You're saying we need a Jira for agents? Because that's kind of what I'm hearing here. Hopefully, Jira plus plus plus. Yeah. True. I had a couple more points to what Vikram was saying. Certainly, the measurement problem, I think we discovered this right at the onset of the company, which was years ago, which is, you know, in general AI has a measurement problem.

Speaker 18:38 The other thing that we realized was there is no one stop shop metric that fixes all your pains. And this was a realization which we made early on in our journey, which is why we've gone down the route of baking in things like auto adaptation into platforms like Luna that we've built. And in the modern world, it's kind of taken this form of

Speaker 19:04 evaluation agents really because they're adapting, which means they're doing the task of getting better, figuring out mistakes on their own, and adapting to your data. So that's one aspect of it, the adaptability, which makes them agentic in nature. The second aspect to the point about taming a complex system, it is true that there's a million different parts to

Speaker 19:27 an end-to-end GenAI app, but it's also that it can be built in a million different ways. There's no limited set of 12 ways that you can build an end-to-end app, especially with the advent of A2A and any kind of agentic framework that you take. What they're doing is giving the power back to the users. That's inversion of control. So now the developer has a lot of composability power to be able to construct

Speaker 19:56 a system the way they want. What does that mean for evals though? For us, it's not only to adapt our metrics to, you know, be accurate over time and constantly be accurate. That's one aspect of it. But also, I think, just to point to one feature of Galileo, we have a very powerful SDK, which not only takes the shape and form of Python libraries or TypeScript libraries,

Speaker 20:23 but also, of course, APIs, but also tools and agents there, which can fit into the modern workflow of, say, if you're building in Cursor or any kind of IDE which gives you this Copilot-like experience which is automatically fixing your code. How do we take all the goodness of things like Luna, etcetera, and shape it in a form which is composable, which meets the user where they are, and takes action

Speaker 20:50 at every stage of the life cycle. When it's zero to one, where they're building the app, we want them to not make the early-level mistakes, so you want to use our agents there. Then there's the CI/CD bit where your application's moving from dev to staging to prod. And then there's in prod, where you want more real-time guardrailing and observability, but it's the same sort of ingredients behind the scenes which are sort of part of this one powerful platform, and that's what Galileo gives the user. In an era of vibe coding and increasing

Speaker 21:21 LLM generation of software, it feels like this need for observability, this need for robust testing and evaluation is more important than ever to ensure reliability and to ensure trust in applications, particularly as we're building these multi agentic systems. I have to say, I'm really excited by the fact that people can also now try the platform for free and can have enterprise grade evals,

Speaker 21:49 can test it out, can give us feedback over at galileo.ai. I think there's a massive opportunity for us to help developers to increase the reliability of their applications while also getting the feedback we need to continue to hone in on how we solve this measurement problem. I wonder if either of you is seeing, maybe Atin, if you want to start, any sort of increase in the need for these reliability systems,

Speaker 22:17 whether it's the observation piece, the evaluation piece, or otherwise, based off of the shift to LLM-driven or at least assisted codegen? What vibe coding has done is it's made developers 10x more faster at getting to a particular outcome. You know, it would take you, you know, a few weeks to build something that you can build in an hour. So props to, you know, all the innovation that has happened towards it, even though vibe coding has a bit of a negative connotation.

Speaker 22:48 It is true that if you completely blindly vibe code, you will end up with a pile of crap which is not deployable. It will be a suboptimal application. Please nobody here look at my GitHub. Yeah. I mean, same here. So it's great for prototyping, but there's certain standard sort of thoughtful design patterns. I always say that we've kind of gone back to the era of 2019

Speaker 23:15 or even 2009 or even fifty years ago, where we discovered there's a limited number of ways where you can build very high quality applications. And the same sort of design patterns are reemerging in this new era with the LLM in the mix. So vibe coding allows you to be efficient at the lowest layer of coding, which is just brass tacks, writing boilerplate code. And

Speaker 23:39 any developer worth their salt will probably tell you that. Over time, people have discovered ways to efficiently vibe code or thoughtfully vibe code by prompting the LLM back and making sure it doesn't make the same mistake, etcetera, etcetera. Even then, there's only a limit to how much you can scale a very high quality application with vibe coding, which really

Speaker 24:04 accentuates the need for evaluation and reliability in general. And that's where we kind of come in. I think for us, it becomes more important to, like I've been saying, adapt to the new sort of world in which we are in, which is developers are using Copilots. We need to be, you know, in line with them and meet them where they are. But the fundamental problem,

Speaker 24:31 like Vikram said earlier, is still the same whether you ask a developer or a leader, which is that I just want to build a high quality application that is deterministic, which is reliable, which gives me outputs that I can expect, and an efficient way to root cause it and take action whenever things go rogue. How should developers or technical leaders who are building AI systems

Speaker 24:54 think about balancing seeking reliability while also leaving in the magic that is hallucinations and iterations that the AI is creating? Because it's a fine line to walk depending on the type of application you're developing and, you know, with the agents that you're putting loose in the world. But at the same time, if we completely try to eliminate this creativity,

Speaker 25:19 we are also getting rid of some of the secret sauce behind AI. So, Vikram, I'm curious from your perspective how you're seeing enterprises that we're working with adapt to this problem and solve for reliability while also keeping the upside of leveraging nondeterministic systems. If you talk about vibe coding along with the AI reliability problem, I do think they're trying to solve two different things. One is for

Speaker 25:47 speed of building AI systems, and the other one is quality of the AI systems. And I think both of those are going to be important, and both of those are going to be continuously accelerated from an industry perspective. I think the idea of vibe coding is gonna come from the idea, like, you know, you have to build these AI apps faster. A lot of the context for building out these AI applications

Speaker 26:10 turns out is, as has always been, right, the business logic side of things has always been a very big part of software development, right, in different shapes and forms. It's interesting how, like, now the business logic is coming up even closer to the, you know, the actual coding layer by using vibe coding, where, Atin is, like, incredibly technical, but maybe for you and me, you and I can actually build out

Speaker 26:32 apps much faster now, just using our knowledge of the world and just, like, figuring things out quickly using, I don't know, Level and a bunch of other tools. So that's interesting from a speed perspective. But that's separate, to Atin's point, from thinking about this as, like, a production AI application out in the world where you have to have the right kind of guardrails in place. You have to have the right kind of safeguards in place for making sure that it's actually reliable.

Speaker 26:59 As an example, with agentic systems, right, it's much more about just better software engineering principles all over again. You just have to make the right kind of tool calls. You have to have the right kind of, you know, functions and different kinds of conditional statements to make sure that you're calling the right stuff, and now increasingly making sure that you're giving the right kind of instructions to the model to choose the right kind of prompts.
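
The point about making the right tool calls and guarding them with ordinary conditional statements can be illustrated with a small dispatcher. This is a minimal sketch, assuming a hypothetical tool registry and a model-proposed call expressed as a plain dict; none of these names come from a real agent framework.

```python
def get_weather(city: str) -> str:
    # Stand-in tool; a real one would call an external service.
    return f"Sunny in {city}"


def add(a: float, b: float) -> float:
    return a + b


# Hypothetical registry: tool name -> (callable, required argument names)
TOOLS = {
    "get_weather": (get_weather, {"city"}),
    "add": (add, {"a", "b"}),
}


def dispatch(call: dict) -> dict:
    """Validate a model-proposed tool call before executing it."""
    name = call.get("name")
    args = call.get("args", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    func, required = TOOLS[name]
    if set(args) != required:
        return {
            "ok": False,
            "error": f"expected args {sorted(required)}, got {sorted(args)}",
        }
    return {"ok": True, "result": func(**args)}


print(dispatch({"name": "add", "args": {"a": 1, "b": 2}}))
print(dispatch({"name": "delete_everything", "args": {}}))
```

Rejecting unknown tools and mismatched arguments before execution is plain defensive programming; the only new element is that the caller is a model rather than a person.
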

Speaker 27:21 And so, you know, from a reliability perspective, it gets back into what are the best practices that you're using over there. If you're just vibe coding, then are you missing out on some of the robustness that you might need in the enterprise to actually make sure that this can see the light of day as an actual AI product? So I do think they're two different problems. As much as vibe coding can help in accelerating going from zero to one,

Speaker 27:47 that one to production is still gonna be something where, to Atin's point, you need an expert in the loop who can actually help get to the other side. But that doesn't preclude the idea that you absolutely need AI reliability. To the second part of your question, Conor, which is about, like, nondeterministic systems, that era is very much upon us, right, at this point. I think about a year and a half ago, you could have asked,

Speaker 28:10 is this like Bitcoin or Web3, which never made any sense to me, still doesn't as a market. But is AI like that? It's absolutely not. It's actually showing a bunch of value in the enterprise. Has it absolutely arrived in the enterprise and everybody's seeing a bunch of ROI? Not yet. A lot of OpEx reduction, not a lot of increase in revenue from a use case perspective. I think that's gonna change, especially with agents now.

Speaker 28:35 Given that nondeterministic systems and nondeterministic software is already a here and now thing, where the puck has moved to is this question of, as an enterprise leader, should I or should I not adopt AI? Right? And the ones who are saying I'll not adopt AI are gonna go the way of the rotary phone. They all know that. So given that they're all gonna be adopting AI at some point, now the puck's moved to this question of how do I make sure it's reliable?

Speaker 29:01 And that's kind of where these platforms like Galileo come in, and it's gonna get more and more important for us to be able to think of solving their problem in a very holistic way. I think I'd love to give my two cents on this one because it's an interesting question. I know creativity versus determinism was one of the first questions that was asked when ChatGPT happened and LLMs were upon us.

Speaker 29:26 I think in the agentic era, that question is a bit dated. It makes less sense in terms of, number one, the innovation on the model side. On the LLM side, they're actually much better at language. And, like, if you, you know, rewind the clock maybe twelve to fifteen months, someone could easily detect that, oh, the output has come out of OpenAI versus Anthropic. They were distinctly different. And now these models are a lot better, in that you can ask them to write in any style, and, you know, they will adapt to your style. So there's a very marked improvement

Speaker 30:03 in the ability for these models to spit out language. But in the agentic era, it's less about the language itself. It's more about the actions and the outcomes of the application. And what people are really trying to do, the modern developer, is build an end-to-end system which is primarily software, and you wanna inject LLMs at the right places to be able to do the right things for you,

Speaker 30:30 which were not possible erstwhile. And in the end, your system is compound in the sense that it has traditional software components, it has LLMs in the mix doing just the right things where you don't want hallucinations, etcetera. And the outcome of the end-to-end system can be language if it's a Q and A system, but it can also be something else. You have many long-running agents where the output is not a piece of text. It's something else and it's more about actions.

Speaker 30:57 So in this new era, I think the creativity question kind of becomes a bit myopic in the sense of, yes, if the application, say, is a marketing writing tool, yes, you want that element of creativity, which is actually a lot more controllable and deterministic, and you can actually rely on the LLM itself to be creative in language. But the problem is entirely different for the vast majority of the more end-to-end applications,

Speaker 31:21 which are sort of this mix of traditional software and LLMs. What are the specific use cases or problems that you're hearing enterprises you talk to come up with and say, hey, we need help solving x? So, like, we have this broader reliability challenge. We have, you know, evaluations that we need throughout this process. But are there specific use cases, Vikram, that you're hearing more of this about? Or are there other needs that you're hearing from these enterprise customers? If you think of the shape of an enterprise, it's typically like a 20 to 30,000

Speaker 32:00 person organization or oftentimes even larger than that. In the world of retail, manufacturing, aerospace, banking, telco, right, massive number of people, massive number of operations to actually keep that alive and thriving. So you have sales, marketing, all of these are the size of like 100 different startups in the Bay Area, right? Each of these organizations is. So a large number of their use cases have been internal facing,

Speaker 32:25 which we sitting over here might poo poo, that it's internal facing, it's not external facing, who gives a shit. That's not how they're thinking of that, right? A lot of these enterprises, they've built out dozens and dozens of these AI applications and agentic systems, because number one, the bar for the need for it to be super high accuracy is a little bit lower because it's internal facing.

Speaker 32:51 So they can experiment really rapidly, but at the same time, the flip side of that, the ROI from an OpEx perspective is massive. So there are a number of these enterprises that have been building out AI-based tools and applications for their entire sales fleet on the ground. A number of them are building out stuff; like, telcos are building out these systems such that

Speaker 33:14 outage detection can happen faster so that their analysts can work better. Wealth management teams at large banks are building out ways that they can generate reports faster for their analysts. Customer support is obviously a massive use case across every enterprise globally. Don't even get me started on accounting and then the finance department, and it's massive. So you have dozens and dozens of these. So the question mark in the enterprises becomes,

Speaker 33:37 I have 100 different use cases. I want to centralize this in one singular place so that the entire platform, the shape of my entire platform is centralized, maybe with one team. But then now can I get a single pane of glass view of exactly what the risk vectors are across the entire enterprise, wherever the models are being used? For, let's say, Conor, you're sitting in accounting at a large aerospace company and you start asking questions, which just breaks the system.

Speaker 34:01 Some centralized AI platform team should know about that, that, hey, this is a risk vector and this could break the system for others. Or somehow your question is like, generate this massive report for me, and it starts to take up a huge number of tokens in the output, and now there's a huge cost spike from where you're sitting. So now they need to know about that. Or they might want to guardrail the response so that you don't get the response.
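
The cost-spike scenario described here, where one oversized response is either flagged to a platform team or blocked before it reaches the user, can be sketched as a simple output guardrail. This is a minimal sketch under stated assumptions: `estimate_tokens` is a crude whitespace count rather than a real tokenizer, and `GuardrailResult`, `output_budget_guardrail`, and the default budget of 50 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str
    text: str  # empty when the response is blocked


def estimate_tokens(text: str) -> int:
    # Crude whitespace-based estimate; real tokenizers count differently.
    return len(text.split())


def output_budget_guardrail(response: str, max_tokens: int = 50) -> GuardrailResult:
    """Block responses whose estimated size exceeds a per-request budget."""
    used = estimate_tokens(response)
    if used > max_tokens:
        reason = f"estimated {used} tokens exceeds budget of {max_tokens}"
        return GuardrailResult(False, reason, "")
    return GuardrailResult(True, "within budget", response)


short = output_budget_guardrail("Q3 revenue summary is attached.")
huge = output_budget_guardrail("word " * 200)
print(short)
print(huge)
```

In practice the blocked case would also emit an event to whatever central team monitors risk and cost, which is the "they need to know about that" half of the scenario.
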

Speaker 34:22 That's kind of what we were starting to hear a lot more about, that internal operations, massive, massive use case, and some of these organizations, especially more on the retail side, they started to move more external facing as well, but the bar for accuracy is even higher, and that's where real time guardrails and real time systems become much, much more of

Speaker 34:44 a need at scale. Right? Like, every single query, from millions and millions of queries, has to go through a bunch of different customized guardrails that they've built out, which fit their use case. This makes total sense because we've already seen teams not only go through major code transformations and change their customer support features, but also

Speaker 35:05 limit the intake of tickets on customer support, and all these other options that are happening at the enterprise level. So thank you, Vikram, for highlighting several of these ways that enterprises are already seeing value and already pursuing this. I think it's really interesting to consider this because so often when we talk about AI, we talk specifically about the exciting things coming out of fast-shipping startups

Speaker 35:27 without considering that for companies that have thousands of employees, tens of thousands of employees, you can aggregate such significant value from something as simple as helping people find docs internally. You can save hundreds of hours a week, if not more, across the organization from some of these simple fixes that can be enabled by agents or chatbots and a variety of other workflows.

Speaker 35:51 And in particular, I think it's interesting, Vikram, to hear about some of the industries that you're talking about here. There's multiple highly regulated industries you've mentioned here, healthcare, finance, and others that are already seeing value. And a lot of naysayers externally, I think, were initially looking at these highly regulated industries as areas where

Speaker 36:15 transformational impact wasn't necessarily gonna be a huge driver. And it seems like that's not the case, with growing needs for AI reliability engineering, AI observability engineering to correlate with things like security teams, governance, platform teams internally. It very much feels like there's an opportunity for every industry to realize these benefits.

Speaker 36:37 And I would almost say, and I'm curious if you agree with this, that companies that already have a history of successfully dealing with regulations, dealing with, you know, higher customer trust bars are more prepared to actually take on what you need to do in order to have highly reliable AI systems and leverage them internally. Yes. There is also a direct correlation with, like, the pre-AI era, with what those industries were doing. As an example,

Speaker 37:10 a lot of the healthcare companies have been very slow to adopt these newer models, which are over a billion parameters. A billion parameters is small, but like, a lot of organizations kind of cut the model size for small versus large at that point right now. We're seeing that with financial services as well, right? But model risk management teams used to audit different kinds of

Speaker 37:36 models, like their different versions of BERT models, and they used to say, This is fine. But now they're living in an era where if they give acceptance on a very specific model, the teams are gonna come back the next week and say, There are five new models that you need to look at. So it's almost like MRM needs to move almost on a real-time basis, so they're adapting really quickly.

Speaker 37:57 And so what we're seeing with financial services, with healthcare, and some other regulated industries like that, is if there are 100 of these organizations, it's maybe five of them that are at the forefront right now, because they've been able to adapt, because they already have the compute layer, the data layer figured out. And so now they just have to get out of their own way by figuring out the ops layer internally,

Speaker 38:21 and their MRM teams are moving really fast, and we're working directly with them to figure out what the future shape of AI adoption in those companies looks like. And they're very, very open to working with us on that side. And those companies, just looking at the general, what's it called, the diffusion of innovations graph, if you will, like where... The early adopters versus early majority, etcetera. Yeah. Exactly.

Speaker 38:47 Yeah. So you have the innovators and the early adopters, and these folks are definitely on the innovator side. And what happens is like one person in the financial services side of things, they figure out how the MRM should work. They figure out how all this other stuff should work, and they see this massive amount of impact, and then that just has this domino effect across the entire enterprise. And we're seeing the same thing happening now with some healthcare businesses

Speaker 39:10 where all the others are having this massive amount of FOMO around like, Oh crap, they figured this out. They're seeing massive amounts of optimization efficiencies. We're going to be left behind. We used to do data science. We can do AI. We have developers here. And the same thing's happening in retail, in manufacturing, in aerospace. And so that's what typically takes a while. And it's exactly the same thing that happened with cloud as well, right? Where everyone was like, there's no way, as an aerospace business, we're not gonna put all of our data on Amazon servers. Some folks are still doing it, so it's a little bit... Yeah,

Speaker 39:43 exactly. And so it takes a little bit of time, but then they'll look at the next aerospace company, and they're like, Oh, they did it and they're moving faster now, so we got to do this too now. And then there'll be a new CIO and they'll just do it. So it's the same thing that's happening here. It's happening a little bit faster than the cloud, but I do think there's a little bit of the whole, like, which industry is seeing the innovators already come about and how much are those innovators advertising themselves, and then that just leads to this domino effect. I love it. Thanks for these great insights, Vikram. Atin, I am curious if you have anything to add on this front. I mean, I'll spare the kind

Speaker 40:15 of... Vikram covered pretty much the entire gamut of the use cases in the industries, so nothing more to add on that particular front. But I guess I would say that, yeah, I mean, the cloud analogy is pretty spot on, except that the access to this technology is so much easier. Like, to port an application to the cloud, say, even if you rewind the clock to 2015, it's a lot. Right? You have to figure out fundamental infrastructure and

Speaker 40:45 too many impediments to say. But now that, you know, there's already enough appetite for the cloud, a lot of the fundamental technology, which is powered by the LLMs, is on the cloud. As long as you have enough compliance for that, I think the rest of it is just at your fingertips. Pretty much anyone can, you know, write two lines of code and get a lot of intelligence baked into their system.

Speaker 41:09 So I would say that this adoption will happen 100x faster than the cloud. It's interesting you bring that up because I do think one of the big differences we're seeing is that inherently consumer-facing LLM products have led this wave where, I mean, ChatGPT went extremely viral, has continued to do so regularly, gaining massive market share,

Speaker 41:35 over 5% of the world, Sam's claiming 10 points, are using it with some regularity. So we have this major, you know, moment where people are simply aware of this technology. They can try it out for themselves. They can see that it may give them better information than a Google search in some instances, at the most, like, basic use case. And there are simple use cases that individuals can solve. And so the comfort with the technology

Speaker 42:00 is often quite high because people are just able to use it day to day if they want to. So it's gonna be really interesting to see how that and the infrastructure that's already been built out accelerates this trend. Vikram, thank you so much for this conversation. It's been fantastic getting a chance to hear what you guys are hearing from the enterprise.

Speaker 42:18 I'm curious if there are any customers or particular folks you wanna shout out here as we wrap up because I know people always love to hear that they're being impactful in these conversations. There's a whole bunch of customers that I would love to thank and talk about, but it's mostly in these Fortune 50 telcos and banks and others, as well as

Speaker 42:38 a lot of others in the fast-moving AI space. Like, there's the amazing folks at Twilio, at HP, and many, many more that we'd love to thank. So we have dozens of these folks that we are closely working with, and many more who we are working very closely with that we can talk more about maybe in the next one or two months. But it's amazing to build with them, to see how our markets move really fast and actually

Speaker 43:02 make sure that we can add a lot of value because their systems need to be much more reliable than they are right now. Things still break, and when things do, you know, Galileo has to just be there to make sure that they can sleep at night knowing that their AI systems are nondeterministic, but reliable at the same time. That's the only way this entire industry is actually gonna see

Speaker 43:23 not just the light of day, but actually start to shine. Well said. And we'll certainly be talking to you both a lot more in the coming weeks and months as we continue to shape the future of reliability for AI systems. Atin, Vikram, thank you so much for helping us navigate the complex and fast moving currents of AI in 2025. For all of you builders and leaders who are listening

Speaker 43:44 and looking to stay ahead of the curve, I highly recommend subscribing to the Galileo YouTube channel. It's full of fantastic content, including demos of our reliability platform, our observability and evaluation features, how tos, and so much more, along with discussions like this episode that will help keep you informed and inspired. Plus check out Atin and Vikram on LinkedIn.

Speaker 44:03 They post a lot of great stuff there. Atin, in particular, I wanna shout out for constantly sharing his insights and kind of thoughts on what's moving forward in AI and some great papers that I certainly enjoy reading. You can also always find more info at galileo.ai/blog for everything we're writing. Gents, thanks again for coming on the show today. We'll have you back again soon for another check-in on all that's happening in the world of AI. Thanks, Conor.

Speaker 44:28 Appreciate it. And thanks, everyone, for listening. We'll see you next week.