What does Elastic actually do in the AI stack, according to Philipp Krenn?

Krenn positions Elastic as the data layer. Its classic use is retrieval-augmented generation: you store your data, do the retrieval first to get the right context, then generate the output with an LLM. Elastic can also cache results so a similar question reuses an earlier answer for a faster, cheaper response. On the evaluation side, because Elastic does OpenTelemetry and the ELK stack handles logging, it collects performance, cost, and quality telemetry. Krenn is explicit that Elastic stores the results but leaves the evaluation itself to Galileo.
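
The retrieve-first, generate-second flow with answer caching can be sketched in a few lines. This is a toy illustration, not Elastic's implementation: `RagPipeline`, `fake_llm`, the word-overlap similarity, and the 0.8 cache threshold are all assumptions made for the example, and a real system would use a search engine and an actual LLM call.

```python
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercase word set; a stand-in for a real analyzer or embedding."""
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0


def fake_llm(question: str, context: list[str]) -> str:
    """Hypothetical stand-in for the partner-supplied LLM call."""
    return f"Answer to {question!r} using {len(context)} document(s)"


@dataclass
class RagPipeline:
    documents: list[str]
    cache_threshold: float = 0.8  # assumed cutoff for "similar enough"
    cache: list[tuple[set[str], str]] = field(default_factory=list)
    llm_calls: int = 0

    def retrieve(self, question: str, k: int = 2) -> list[str]:
        """Step 1: rank stored documents by overlap and keep the top k."""
        q = tokens(question)
        ranked = sorted(
            self.documents, key=lambda d: jaccard(q, tokens(d)), reverse=True
        )
        return ranked[:k]

    def answer(self, question: str) -> str:
        q = tokens(question)
        # Reuse an earlier answer when a cached question is similar enough.
        for cached_q, cached_answer in self.cache:
            if jaccard(q, cached_q) >= self.cache_threshold:
                return cached_answer
        # Otherwise retrieve context first, then generate.
        context = self.retrieve(question)
        self.llm_calls += 1
        result = fake_llm(question, context)
        self.cache.append((q, result))
        return result
```

Asking the same question twice with different casing or punctuation hits the cache, so the (expensive) generation step runs only once.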

How do Galileo and Elastic integrate technically?

The integration runs on OpenTelemetry. Krenn explains that Elastic acts as the data store that keeps the telemetry, then uses OTLP — the wire protocol for OpenTelemetry — to move that observability data so it can be aggregated and evaluated. He draws a parallel: MCP is becoming the protocol for AI, and OpenTelemetry is the protocol for telemetry data; standardizing on shared protocols is what lets Elastic, Galileo, and others partner more easily. Because both sides speak the OpenTelemetry standard, the same data feeding Elastic can feed Galileo for evaluation.
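
The point about a shared protocol can be illustrated with a small sketch: encode a telemetry record once and hand the identical payload to more than one receiver. Everything here is hypothetical, including `InMemoryBackend`, the field names, and the use of plain JSON; real OTLP has its own schema and transports, and this is not how either product is wired.

```python
import json
from typing import Protocol


class Backend(Protocol):
    def receive(self, payload: bytes) -> None: ...


class InMemoryBackend:
    """Hypothetical receiver; a real one would be an OTLP endpoint."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: list[dict] = []

    def receive(self, payload: bytes) -> None:
        self.records.append(json.loads(payload))


def encode(record: dict) -> bytes:
    """Serialize once in the shared format (JSON here, not real OTLP encoding)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def export(record: dict, backends: list[Backend]) -> None:
    """Send the same encoded payload to every backend."""
    payload = encode(record)
    for backend in backends:
        backend.receive(payload)


store = InMemoryBackend("data-store")
evaluator = InMemoryBackend("evaluator")
export(
    {"name": "llm.generate", "duration_ms": 840, "input_tokens": 512, "output_tokens": 96},
    [store, evaluator],
)
```

Because both receivers parse the same format, neither needs to know about the other, which is the partnering benefit Krenn attributes to standardizing on a protocol.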

What small models does Elastic build, and why not just use a general-purpose LLM?

Krenn describes two areas. For re-ranking, Elastic built a model that runs over a retrieved subset — for example, re-ranking the top thousand documents out of a million. For embeddings and inference, Elastic heavily uses E5, a multilingual model, in the dense vector space, plus a custom model for the sparse vector space that does keyword expansion to find related keywords. He says a general-purpose large language model would be too expensive and too slow for these tasks, so the right specialized models are needed instead.
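
The two-stage pattern Krenn describes, a cheap first pass over everything followed by a costlier model over a small subset, might look like the following sketch. The scoring functions are toy stand-ins chosen for the example (word overlap for retrieval, a length-normalized score with a phrase bonus for the re-ranker); `CountingReranker` and `search` are hypothetical names, not Elastic APIs.

```python
import heapq


def cheap_score(query: str, doc: str) -> int:
    """Fast first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


class CountingReranker:
    """Toy stand-in for a re-ranking model; counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def score(self, query: str, doc: str) -> float:
        self.calls += 1
        q = query.lower()
        words = doc.lower().split()
        overlap = len(set(q.split()) & set(words))
        phrase_bonus = 1.0 if q in doc.lower() else 0.0
        return overlap / max(len(words), 1) + phrase_bonus


def search(
    query: str,
    corpus: list[str],
    reranker: CountingReranker,
    first_stage_k: int = 1000,
    final_k: int = 10,
) -> list[str]:
    # Stage 1: cheap scoring over the whole corpus, keep the top candidates.
    candidates = heapq.nlargest(
        first_stage_k, corpus, key=lambda d: cheap_score(query, d)
    )
    # Stage 2: the expensive model only ever sees the narrowed candidate set.
    reranked = sorted(
        candidates, key=lambda d: reranker.score(query, d), reverse=True
    )
    return reranked[:final_k]


corpus = [
    "vector search in elasticsearch",
    "search",
    "elasticsearch vector search tutorial with many extra filler words",
    "cooking pasta",
    "gardening tips",
]
reranker = CountingReranker()
results = search("vector search", corpus, reranker, first_stage_k=3, final_k=2)
```

With `first_stage_k=3`, the re-ranker is invoked three times rather than once per document in the corpus, which is what keeps the slower model affordable.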

What is Philipp Krenn’s advice for developers trying to take AI to production?

Krenn says it’s very easy to get started and build, but getting to production is the hard part. Guardrails are a very important piece, and you need the evaluation side to confirm the system is actually doing what you want, within expectations for cost and performance. Drawing on his observability background, he warns that observability is often treated as an afterthought and expects the same mistake with AI — people will eventually realize it shouldn’t be. He adds that AI may even raise the stakes: hallucinations and wrong answers create a stronger business drive to get the right answers.

How does Philipp Krenn see AI changing customer-facing work?

Krenn frames it through analogy and caution. He compares AI’s trajectory to self-driving cars: hyped years ago, slow to arrive, but now he rides in a Waymo frequently — suggesting AI may take longer to reach the “this is what happens in reality” phase. He hopes reliable agents mean never having to call a hotline or wait on a chat system again, freeing people doing repetitive call-center work for harder background problems. On job loss he is measured: automation fears have recurred since the industrial revolution without playing out, so he expects work to shift rather than disappear.

Episode S2 E32

The AI Agent Trust Gap: Bridging Risk to Reliability | Elastic’s Philipp Krenn

Jul 16, 2025 · Philipp Krenn, Elastic · 44 min

AI Agents · RAG & Retrieval · AI Evaluation & Reliability · AI Observability


Key takeaways

Elasticsearch quietly powers search across much of the internet. Philipp Krenn notes that Wikipedia and Stack Overflow run Elasticsearch behind their search boxes, and that almost everything on GitHub is cached or powered by Elasticsearch in the background — usage most people never see.
MCP is changing how systems are accessed, not just what data they hold. Krenn frames it as a shift in interaction mode: rather than writing against one specific REST API, you let the LLM figure out the MCP connection, fetch the right data, and see what actions it can run — including descriptive asks like “build me a Kibana dashboard.”
Elastic does not build large language models. Krenn says the models Elastic builds are for inference and re-ranking, and it relies on partners to supply the LLM that generates the answer — or, in Galileo’s case, the Luna models on the evaluation side. He describes the stack as an “AI lasagna” of layers where each player partners with the others.
Re-ranking is a small-model pattern that wasn’t practical before. Krenn walks through it: from a million documents you retrieve the first thousand, then run a more expensive, slightly slower re-ranker over just that top-thousand subset. A fast, efficient small language model makes running that costlier model on a narrowed set feasible.
Latency tolerance for AI is a temporary novelty effect. Krenn recalls that a 200-millisecond Elasticsearch query once felt “unacceptably slow,” yet a five-second LLM answer is currently treated as “perfectly fine.” He expects that forgiveness to fade as AI becomes standard and users demand faster, more real-time responses.
A chatbot’s promises can legally bind the company behind it. Krenn cites a Canadian court case where an airline lost after its chatbot told a customer something it shouldn’t have; because it was the company’s agent, the result had to stick. Many companies then pulled LLMs back from customer-facing roles, keeping them internal until guardrails could be trusted.


Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Agent · Latency · Retrieval-Augmented Generation (RAG) · Model Context Protocol (MCP) · Inference · Embeddings · Prompt Injection · Accuracy · AI Alignment · AI Evaluation

Chapters

00:00 Introduction
01:09 Galileo's AI Reliability Platform
01:43 Challenges in AI Agent Reliability
06:17 Insights Engine and Its Importance
11:00 Luna 2: Small Language Models
14:42 Custom Metrics and Agent Leaderboard
19:16 Galileo's Integrations and Partnerships
21:04 Philipp Krenn from Elastic
24:47 Optimizing LLM Responses
25:41 Galileo and Elastic: A Powerful Partnership
28:20 Challenges in AI Production and Trust
30:02 Guardrails and Reliability in AI Systems
32:17 The Future of AI in Customer Interaction

Show notes

The age of ubiquitous AI agents is here, bringing immense potential - and unprecedented risk.

Hosts Conor Bronsdon and Vikram Chatterji open the episode by discussing the urgent need for building trust and reliability into next-generation AI agents. Vikram unveils Galileo's free AI reliability platform for agents, featuring Luna 2 SLMs for real-time guardrails and its Insights Engine for automatic failure mode analysis. This platform enables cost-effective, low-latency production evaluations, significantly transforming debugging. Achieving trustworthy AI agents demands rigorous testing, continuous feedback, and robust guardrailing—complex challenges requiring powerful solutions from partners like Elastic.

Conor welcomes Philipp Krenn, Director of Developer Relations at Elastic, to discuss their collaboration in ensuring AI agent reliability, including how Elastic leverages Galileo's platform for evaluation. Philipp details Elastic's evolution from a search powerhouse to a key AI enabler, transforming data access with Retrieval-Augmented Generation (RAG) and new interaction modes. He discusses Elastic's investment in SLMs for efficient re-ranking and embeddings, emphasizing robust evaluation and observability for production. This collaborative effort aims to equip developers to build reliable, high-performing AI systems for every enterprise.

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Follow Today's Guest(s)

Connect with Philipp on LinkedIn

Learn more about Elastic

Check out Galileo

Try Galileo

Agent Leaderboard

Transcript


Conor Bronsdon 0:04 Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon. And today, we're tackling one of the most urgent challenges in our industry, how we build trust and reliability into the next generation of AI agents. And this is a topic you've probably heard myself talk about. You've probably heard Galileo CEO, Vikram Chatterji, who's joining us in a second, talk about.

Conor Bronsdon 0:24 And it's definitely something we're gonna cover in our conversation and interview that will air in the back half of this episode with Philipp Krenn, director of developer relations at Elastic, as this topic is very close to all of our hearts. But first, to set the stage, Vikram, it's great to see you. Welcome back to Chain of Thought. Thank you so much, Conor. It's good to be back here again.

Conor Bronsdon 0:45 And I mean, our listeners here aren't able to access this part, but I have to say I've been really enjoying your internal chain of thoughts that you've been putting out for Galileo team members internally. Maybe we'll leak a couple minutes of some of those at some point, but it's been a lot of fun hearing your insights and your perspective. And I think that's what I value so much about these conversations is obviously the team at Galileo has been hard at work. We've just launched our free AI reliability platform for improving agents. Go check it out at galileo.ai.

Conor Bronsdon 1:16 There's so much more information there. We think there's incredible tools that every developer and AI builder is gonna love. But we wanna highlight a few of the biggest announcements and talk about why they matter because we're using these terms reliability and trust to talk about why our suite of Luna 2 small language models matter for real-time guardrails and scalable evaluations.

Conor Bronsdon 1:36 We have our new insights engine, and this all means what exactly for engineering teams everywhere? So at the core of that conversation is agent reliability. Vikram, why is Galileo, and why are you so laser focused on agent reliability?

Speaker 1:53 Overall, we're focused on the concept of reliability, right, because that's the number one problem that's plaguing most enterprise AI teams today. They it's easy to get from zero to POC, going from POC to production. It has always been an age old problem in data science, and it's no different now, especially with developers and data science teams kind of almost merging together with this and becoming this new

Speaker 2:19 concept of an AI engineer, the need for understanding how reliability can actually occur, and they can sleep at night knowing that things are going to be safe for this non-deterministic system is extremely important. The way we look at this is when it comes to agent reliability, it's two things. One is the reliability problem just becomes 100x more crucial because agents are more complex as a system. Like when we talk to developers in these organizations that have gone through, you know, prompt engineering two years ago, and that being the end all and be all of

Speaker 2:52 application development to kind of RAG-based architectures, and then more complex RAG-based architectures with multi-turn chat systems, to now, you know, using these models as essentially smart routers to plan tasks and call different kinds of models and perform actions, has done something really interesting in the industry. That is, you move from chat completion to task completion,

Speaker 3:16 and essentially because of that, the ideas that organizations have around what the use cases can be have just exploded. They've gone out of the realm of just a chat interface now, where to do literally completely any task. So there are billions of dollars of OpEx at stake here to reduce within organizations. Right? And then, of course, there's all the other side of revenue increase. So given the criticality

Speaker 3:40 of these agentic systems when it comes to the future of running businesses, as well as the criticality of the reliability problem overall, and the merging of these things together, agent reliability, we think is going to be an extremely crucial part of the future of AI, and that's why we've been focused on it for the last year and a half, frankly, from some earliest developers who were working on agentic systems, when they weren't even going this, they were just shocked

Speaker 4:08 by I don't think I can name them here, but they were just shocked by the idea that, hey, we just use this model to call the right kind of model. We used this model router. They weren't there was no concept of an MCP based tool protocol for calling tools in the right way. There was no concept of tools that we were all galvanized on. But as that started to mature more and more, we've gotten to this point where we feel like

Speaker 4:27 2025 is the year where the protocols and everything else is gonna come in place. But by the end of the year and early next, it's gonna be cats out of the bag, and everyone's gonna be building out agents everywhere. So it's extremely, extremely critical problem to solve. Completely agreed. And I think we see this need for reliability in AI everywhere,

Conor Bronsdon 4:44 both from a standpoint of building trust with customers and users, and also major examples. I mean, we just saw last week, Grok 4 ran wild on X- That's right. After new alignment changes were made, and they had to shut the whole thing down and kind of reassess. And I mean, that's exactly the kind of problem that we're looking to solve, and that paradigm needs to change for AI builders. That's right. And if you talk to anybody who's been building out

Speaker 5:09 applications where there is some kind of massive amounts of tool calling and different kinds of paths that the application can take, which tends to happen as soon as you're creating any kind of a, like a multi-turn based task overall. And what happens there when you talk to the developers is they'll tell you very quickly that debugging and understanding where things went wrong and understanding what the failure modes are is extremely hard.

Speaker 5:35 We've all faced this firsthand if you're building out applications, but when you're doing this in production, and the stakes are pretty high, you can't make mistakes. So yes, what happened with Grok is a really interesting tidbit, and then, you know, when you look at this at scale across these organizations where they have to adopt AI and they have to adopt agentic systems and they're going all in,

Speaker 5:55 they can't do that unless they have reliability protocols in place.

Conor Bronsdon 5:59 Completely agreed. And Galileo's previous work developing a data and measurement layer for enterprise grade evaluations at companies like Comcast and Reddit and many others has really enabled a lot of our new announcements here, such as the Insights Engine, which we were super excited to unveil to the public yesterday. Can you tell us a bit about that Insights layer? Yeah, so, I mean, at the core of what we do at Galileo is we've always been focused on the measurement problem of AI. We've not built the world's hundred-and-first

Speaker 6:28 orchestration system, and it's, you know, all of that stuff. But the focus for us has always been on building out the measurement, solving the measurement problem. At the core of that lies the developer experience, where they don't want to get to necessarily a UI where they're plowing through hundreds of metrics to make sense of it all, right? So when you talk to developers, they wanna shift left as much as possible. They wanna move towards not having to go through each and every single

Speaker 6:57 node to understand what's going wrong, but instead they wanna figure out what the failure modes are early on so that they can try to fix it. Further shifting left from there is taking all of these different insights, and then actually putting them inside of the code itself, and inside of the browser, and feel like we're all gonna start moving in that direction,

Speaker 7:13 and that's what RCA for these kinds of agent application development is going to look like. So this is one big step in that direction where we figure out that, hey, if you want to eventually build out self learning or self healing agents, you need to have some kind of an engine that's constantly observing all the logs, constantly understanding all the different metrics, custom or otherwise, that have been built out, and really figure out from there using a reasoning engine and reasoning model

Speaker 7:41 what the potential failure modes are, and give that back to the developer. So that's essentially what we've done. We've taken all the logs and all the data that we sit on, and we've built out this reasoning-engine-based insights engine, which helps with automatically surfacing these kinds of failure modes. And what we've seen in this, in the early beta of this application, is it really quickly becomes

Speaker 8:05 very hard to live without, because now you don't have to kind of go into the weeds of every single node. Instead, you're just getting a bunch of insights out of the box, and the game of whack-a-mole becomes much easier, because now you know exactly what's going wrong. You can start to fix it and see whether that actually worked or not. So the main key components of the Insights Engine is automatically providing failure mode analysis and giving that back to the developer, number one. Number two,

Speaker 8:30 letting them know how often that occurred and whether that's a one off or it's a frequently occurring problem. Number three is not just telling them here's a problem, but actually telling them what's the solution, and then moving towards what they need to specifically fix as a next step. So it's extremely action oriented.

Conor Bronsdon 8:46 Completely think that is the right direction. As we've thought about this, it's really clear that, yes, like we have added new agent metrics and we can talk about that. Those are, I think, really exciting. They're helpful. Yes, we've added new observability views where you can interface with what's happening with your agent and your AI systems in different ways. Those are important too.

Conor Bronsdon 9:06 But it's this automation to the insights layer, the insights engine that we've built, that now can take those inputs and just automatically identify failure points, help you with that root cause analysis, let you debug faster, and just give you improvement opportunities that you can then apply, and then watch for these failures in production with guardrails through our Luna 2 SLM models. I think this is where we see this holistic reliability story coming together.

Conor Bronsdon 9:32 And to me, it seems super clear that the insights engine is what's gonna power our next level of like, okay, now we have evaluation agents that are working for you full time,

Speaker 9:43 and we're moving away from the static evaluation layer. Yep, that's exactly right. It's a natural evolution, I think, in the world of evaluations, and this is something which has been very core to us because we've always thought of our DNA being very much about the applied data science side of things, the infrastructure side of things, and this fits exactly in that bucket, right? It's not about just the software layer and having a UI where you can see graphs and charts. That's really not what the developers are looking for. They want the algorithms. They want the infrastructure aspect of this abstracted away. Like in order to provide these insights with millisecond latencies inside of our UI or through our API,

Speaker 10:22 took, takes a lot of work from an infrastructure perspective. So that's exactly what we've we've been working on, and we're super excited to launch this, and this is the first step towards this direction of making sure that all these agents can be self healing and self improving over time. 100%. We want developers to be able to spend more time improving the product and less time digging through logs.

Speaker 10:42 Yep, exactly. And they don't wanna do that. They don't wanna dig through the logs all the time. So this engine is great. It provides real time analytics, helps you identify root causes,

Conor Bronsdon 10:53 evaluations at scale, but it requires some serious horsepower to do this as you try to bring this to enterprise systems. We've built Luna 2, a family of small language models to help power that, to help power real-time guardrails. Yep. What makes these small language models, these SLMs, the right choice for this job versus using an LLM like GPT-4 for every evaluation?

Speaker 11:16 So just to tie the two things together, Insights and Luna, they're at different parts of the SDLC life cycle, right? So there's the building side of things and the observability side of things, which is where the main pain point that developers have is what's going wrong? What should I do next? And I tried to fix it. Did it fix the problem? That's kind of where they sit, and that's where the insights engine is trying to set them up for success by making that entire process 10x faster.

Speaker 11:46 Now, once you actually do identify some kind of a failure mode, let's say the insights engine says that here are three different tools that are kind of doing the same thing, especially this one tool over here seems to be messing up in this specific way. That's an indicator to the developer that they might want to just quickly create a metric, a bespoke metric to detect that, to see if that's happening over and over again. Let's say they've done that. Now what happens when you actually go into production and at scale, you want to be able to see if that's happening at scale, and if that exact tool malfunctions in that exact way, which the insights engine has already told you that it did.

Speaker 12:19 You want to take an action on that, and that's kind of where now Galileo is in the mix of the user experience, because before the task gets completed, if the tool malfunctions, we want to be able to block that action completely, or take a different kind of action. Now, in order to do this, you had to provide this kind of, the metric computation and inference has to happen with millisecond latencies, and with a fraction, a massive fraction of the cost of a large language model.

Speaker 12:46 And when we surveyed the ecosystem about a year and a half ago, and we embarked on our Luna journey, we realized that what people were doing in production was just using an OpenAI model or a very large model to do this, and the question simply was, this is not sustainable. It's going to cost you millions of dollars at scale, as long as your product is at that scale,

Speaker 13:07 and it's going to be super, super slow. And that's kind of where our research came in around using some of the smaller language models, but also there, Conor, we realized that just using, let's say, a Llama model or something like that, we experimented with a bunch of them, wasn't gonna be good enough, right? You have to, because they're built for reasoning. It's kind of like using a really large tank, whereas you could just use a small pistol, and so instead of

Speaker 13:33 just using these models out of the box, we had to dramatically work on some of these open source models to make them a bespoke single token generator evaluation focused version of itself, which can be easily fine-tuned with data which is built for evaluations, such that they can be extremely adaptive to any kind of use case. And that's the Luna 2 set of models that we have now come out in the market with, which is much more generalizable

Conor Bronsdon 14:03 than Luna 1, which is based on the BERT models at its very, very core. And I think an important point of this, and what you're alluding to when you talk about stripping this model down and focusing it, is how much cheaper it becomes. This is something we've heard from major customers who are looking to scale to millions of traces a day. That gets really expensive if you're doing a GPT model for all your evals. Whereas Luna can not only enable the custom metrics you need, the lower latencies that you need, but also do it at much reduced cost. So you can actually have evaluations and guardrails in production.

Conor Bronsdon 14:35 And it's also enabling us to create new sophisticated agent metrics like flow adherence, action advancement, action completion. That's right. Why do you think both out of the box and custom metrics are so critical for ensuring agent reliability

Speaker 14:51 as we go into these larger multi agent systems? The way we think of the out of the box metrics is, imagine you're driving a car, and the car stops functioning at some point. There are certain attributes of the car that are just common across most cars. They'll always have an engine, they'll always have like windshields, etcetera, etcetera. So at the very minimum, at any given point in time, you need to know the health of those specific pieces, and if it starts going south, then you want to be able to know what's gone wrong.

Speaker 15:20 Why we, we would rather not build this. So the reason we did this is because out of the box, you need to know what's going wrong as a developer. You need to know what's going wrong in an unsupervised way because you have no ground truth at that point, right? So that zero to one problem is really, really key, and that's why we built these out. The second reason we did this is because going back to the car analogy,

Speaker 15:40 not everyone's a mechanic. Not everyone can go in and actually build the world's best engine quality metric for the cars, and then start detecting it constantly, build the right sensors and all of that stuff. So we realized that when it comes to RAG systems, when it comes to even agentic systems, there are certain things that you would want as a developer, can be really, really cool to just have out of the box in an unsupervised way. And the reason why we also did this and we own this versus just, you know, pawning it off to a third party metric system,

Speaker 16:10 is because we realized that out of the box, the quality and accuracy of these metrics have to be really high. And so we have, as our listeners probably know at this point, we have a very large applied data science team, and we always have since day one because and they've been focused on the problem of measurement. And so their whole role is focused on the idea of how do you make these out of the box metrics even more

Speaker 16:31 high performing? Should we use different kinds of models? Should we use different kinds of algorithms? We've published research papers on this, and that's the reason we keep pushing on out of the box metrics, but then at the same time, you don't just have the engine problems, right, in the cars. You also might have some use case specific problems. You're driving in the night, driving in the wintertime,

Speaker 16:49 you might want to have some extra precautionary measures, which we have no idea about. It's very bespoke to our use case, and so that's where we've come up with this idea, this notion of making it pretty, very easy for any of our developer users, our product manager users, our subject matter expert users to do their own custom metrics for agents as well. And I think this idea of, look, everyone still has a car. We can apply certain these metrics across domains

Conor Bronsdon 17:13 is something that's really shown well and demonstrated effectively in our agent leaderboard. And I'll say, anyone who's listening to this episode, day of release, there's a little sneak peek here. Our new agent leaderboard v2 is actually live at galileo.ai/agentleaderboard, and it includes average action completion as well as tool selection quality metrics across multiple domains. So you can actually slice and dice based off of banking, health care, insurance, investment, telecom.

Conor Bronsdon 17:42 We're gonna add more domains over time. We're gonna keep on adding more of our out of the box metrics. But I'll give a sneak peek and say that GPT model did top that chart for action completion, and Qwen actually made the top four. So super interesting data there. We're going to be releasing a lot more about that in the following days and weeks. Go check that out and learn more about how we approach our agent metrics and how different

Conor Bronsdon 18:04 LLMs actually effectively evaluate with those metrics and how they perform in real world agentic scenarios. And we're going to keep on pushing on this because as you point out, Vikram, yes, like there's stuff we can do out of the box. There are things we can evaluate to make sure we actually understand the effectiveness of these different models, these different systems, especially cross domains, which I think is really crucial for companies that are listening that are maybe in health care and they say, hey, I need something that actually works really well in this domain specifically.

Conor Bronsdon 18:32 But if we just take a cookie cutter approach, we're never truly going to solve the problems of enterprises that have unique customer sets, unique data sets. And that's where I think the customization element and the scaling element that's enabled by Luna-2 really

Speaker 18:49 comes in for us, why it's such a crucial piece of how we're approaching agent and AI reliability more broadly. Exactly. I'm excited about the results from the agent leaderboard, and it's also frankly extremely crucial and important for a lot of our enterprise customers to have a Switzerland that's kind of looking at this and has high quality data to look at it. And all the data, the measurement, everything else is out there

Conor Bronsdon 19:13 on GitHub for people to see, so nothing's behind closed doors for this. Yeah, and for anyone who wants to start leveraging Galileo models or our platform for observability and evaluations, I'll say another key piece of the story is our integrations approach, where we want to be agnostic of whatever type of agents you may be building, whether it's CrewAI, whether you're doing something with

Conor Bronsdon 19:34 LangChain, whether you're working with LlamaIndex, or whether you're bringing in RAG components from Pinecone or Weaviate or MongoDB or Elastic. We can help. We work with all these companies. We're partners with folks like NVIDIA and many of the people I just named. And our goal is to be truly agnostic of whatever you want to bring to the table and help you customize your system in whatever ways you need.

Speaker 19:57 Vikram, are there any other closing thoughts you have about our agent reliability announcements that you want to share with the audience? I'll just say we're just getting started on this. And a lot of this is very exciting. It's a culmination of a lot of work that's been going on for the last many, many months, but then there's a lot of other very exciting stuff that's in the

Speaker 20:14 hopper right now that we will be coming out to market with. So very excited for developers to just try Galileo out for free. If you're building out agents, no matter who you are, where you are, you go to galileo.ai and just sign up for free, and you could start using our platform. If you want help with anything, please reach out at any point in time. We'd love to hear from you. But there's a lot more coming out, which has never been seen before, so I'm very excited about that. We make some bold moves all the time, and so we're excited about what's coming up soon. And hopefully, developers love it, and we would love feedback. 100% agree. Definitely give us that feedback. We'd love to hear from you on social media, whether it's about this episode, about the features, or anything else. Vikram, thank you so much for setting the stage and sharing the vision behind the agent reliability platform. Of course. Thank you. Thanks, Conor. And when we come back, you'll hear my conversation with Philipp Krenn, head of developer advocacy at Elastic and a key partner of ours in that platform. Stay with us.

Conor Bronsdon 21:06 For decades, the search box was our primary window to information. Now, AI agents are becoming our proactive partners in discovery and action. But for an agent to be trusted and accurate as a partner, it must be reliable. The link between the information an agent retrieves and the action it takes is where the promise of AI meets production risk. And that's why I'm excited to have Philipp Krenn, Director of Developer Relations at Elastic, here with me today. Philipp, thank you so much for joining me. Thanks for having me, Conor. Yeah, I'm excited to have a conversation about how Elastic is enabling

Conor Bronsdon 21:42 incredible innovation in the AI space, how we're working together with Galileo, and so much more about building, observing, and absolutely nailing these agentic and other AI systems. Let's start with, you know, who are Elastic? Like, what are y'all doing in the AI space? People may know you as a company that's been public since 2018. What are you up to now?

Speaker 22:04 So, we've been doing search for a long time. So, if you search anywhere on the Internet, there's a good chance that you use Elasticsearch in the background without even knowing. So, my classic examples are if you search in Wikipedia or Stack Overflow, behind the search box, we're doing the search for you. If you do anything on GitHub, almost everything on GitHub is cached or powered by Elasticsearch in the background, so they use us very heavily. I didn't know that. And many, many other places.

Speaker 22:28 So now, I feel like all the excitement has shifted over to AI. Yeah. Where data is still important, you need to bring your private data or you need to bring more up to date data. So we're very happy to also participate in that and bring you your data like we did before. I love it. What's next for Elastic as you continue to grow in this AI space? Right. So I feel like for a long time we've been kind of like the data in the background. But with AI, the interaction mode almost changes. Like, one of the hype topics right now is MCP.

Speaker 23:03 So historically, we've always had REST APIs. But now you don't want to work against one specific REST API of your data store or whatever other systems you have. But you will let your LLMs just figure it out and say, this is the MCP connection that you need. Get the right data, see what actions you can run. So just switching that interaction mode for us is a big thing right now, having a proper MCP server,

Speaker 23:29 And it's not just fetching data as the first step, but also interacting with it. So you could just say, like, build me a Kibana dashboard that looks a certain way. Things like that are more descriptive. You don't need to know the tool inside out to write the queries yourself, but you can just talk to it, or also automate away little stupid problems like generating test data. Now you could just tell the LLM, like, oh, look at this mapping.

Speaker 23:57 Generate me 100 documents with this type of data in there. I think there are a lot of little things that we can do today to make us actually much better overall, even though each one of those might not look like a huge improvement. It's all of these combined that really move us forward, plus everything that we've been doing with RAG lately, so that search is just a very different game than it used to be before. So let's focus on that. If I am a builder who is creating something with AI today, whether that's an agent, some sort of compound system,

Conor Bronsdon 24:27 or a workflow, anything else. How would I leverage Elastic?

Speaker 24:31 So we're generally best used to keep your data for your LLM, like a classic RAG application, where you do the retrieval first, you get the right context, and then you generate the output with your LLM. We could also be used, for example, to cache results. If somebody has a very similar question, you might be able to skip regenerating with the LLM, and to get a faster and also cheaper answer, you just reuse the answer to a similar question from before.
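The retrieve-then-generate flow and the answer cache described here can be sketched in a few lines. This is a minimal, standard-library-only illustration, not Elastic's implementation: `DOCUMENTS`, `retrieve`, `generate`, and `CachedRag` are hypothetical names, the word-overlap retrieval stands in for a real search query, `generate` stands in for an LLM call, and the string-similarity threshold stands in for whatever similarity measure a real cache would use.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory stand-ins: a real setup would query a search
# index for retrieval and call an LLM for generation.
DOCUMENTS = [
    "Refunds are processed within five business days.",
    "Flights can be rebooked once without a change fee.",
    "Checked bags must weigh under 23 kilograms.",
]

def retrieve(question, k=2):
    """Rank documents by word overlap with the question (toy retrieval)."""
    q_words = set(question.lower().split())
    scored = sorted(
        DOCUMENTS,
        key=lambda d: len(q_words & set(d.lower().rstrip(".").split())),
        reverse=True,
    )
    return scored[:k]

def generate(question, context):
    """Placeholder for the LLM call; it only echoes the top context."""
    return f"Based on our records: {context[0]}"

class CachedRag:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.cache = []      # list of (question, answer) pairs
        self.llm_calls = 0   # counts how often generation actually ran

    def answer(self, question):
        # Reuse an earlier answer when a cached question is similar enough.
        for cached_q, cached_a in self.cache:
            ratio = SequenceMatcher(None, question.lower(), cached_q.lower()).ratio()
            if ratio >= self.threshold:
                return cached_a
        # Otherwise: retrieve first, then generate with the context.
        context = retrieve(question)
        self.llm_calls += 1
        result = generate(question, context)
        self.cache.append((question, result))
        return result

rag = CachedRag()
first = rag.answer("How long do refunds take?")
second = rag.answer("How long do refunds take")  # near-duplicate, served from cache
```

The second call never reaches `generate`, which is where the speed and cost saving comes from; the threshold controls how aggressively earlier answers get reused.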

Speaker 24:58 And then, of course, there is the evaluation side, since we are also doing OpenTelemetry and the classic ELK stack has been doing logging, and it's doing a lot more than classic logging nowadays with OpenTelemetry. We can collect all of the data. So we go from keeping your data and driving your answers

Conor Bronsdon 25:16 to then potentially seeing how they are performing in terms of quality, but also performance, cost, all of those. And developers, if you are using the OpenTelemetry standard to work with Elastic, guess what? You can also use it to work with Galileo, and bring a lot of that observability data in, and begin to evaluate it, which is why I'm super excited that we are partnered and integrated with Elastic.

Conor Bronsdon 25:38 It's been so fun getting to know your team and working with you over the last couple months. How do Galileo and Elastic work together?

Speaker 25:45 Yeah. It's great to see that you're bringing that aggregation side, and we're happy to be the data store to kind of keep all of that. But pulling out the information the right way and then using OTLP, the wire protocol for OpenTelemetry, to integrate all of that, that's great. And I feel like, maybe this is a bad comparison,

Speaker 26:07 MCP is kind of like the protocol for AI and OpenTelemetry is the protocol for all telemetry data more or less. So in a similar way, we standardize on these protocols, and then it just allows us and everybody else to partner much better together.
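To make the OTLP point concrete, here is a small sketch of what one span looks like in OTLP's JSON encoding, built with the standard library only. The helper names (`otlp_attr`, `build_trace_payload`), the service name, the span name, and the model name are illustrative assumptions; the `gen_ai.*` attribute keys follow OpenTelemetry's GenAI semantic conventions as I understand them, and nothing here is taken from either vendor's integration code.

```python
import json
import secrets
import time

def otlp_attr(key, value):
    """Encode one attribute in OTLP/JSON form (string or int values only)."""
    if isinstance(value, int):
        # 64-bit integers are carried as decimal strings in OTLP/JSON.
        return {"key": key, "value": {"intValue": str(value)}}
    return {"key": key, "value": {"stringValue": str(value)}}

def build_trace_payload(service_name, span_name, attributes, duration_ms):
    """Build a one-span OTLP/JSON trace export request body."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),  # 16 random bytes, hex encoded
        "spanId": secrets.token_hex(8),    # 8 random bytes, hex encoded
        "name": span_name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [otlp_attr(k, v) for k, v in attributes.items()],
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [otlp_attr("service.name", service_name)]},
            "scopeSpans": [{"scope": {"name": "manual-sketch"}, "spans": [span]}],
        }]
    }

payload = build_trace_payload(
    "support-agent",   # hypothetical service name
    "llm.generate",    # hypothetical span name
    {"gen_ai.request.model": "example-model", "gen_ai.usage.output_tokens": 128},
    duration_ms=850,
)
body = json.dumps(payload)
```

Because the body is plain OTLP, the same bytes could be sent to any OTLP/HTTP receiver (conventionally a POST to the `/v1/traces` path with `Content-Type: application/json`), which is the interoperability being described. In practice an OpenTelemetry SDK and exporter would build and send this for you; the specific endpoints for either product are not given in this conversation and are left out.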

Conor Bronsdon 26:21 And how do metrics for agents with Galileo work with agents that are leveraging Elastic as their, you know, RAG store and search option?

Speaker 26:33 Galileo helps developers put guardrails around your data, your LLM-generated data. So, we're happy to store the results of that and then kind of be part of the evaluation of how things have been going. But we're not doing the evaluation itself. We're happy to leave that up to Galileo to figure that out. What are you excited about in this new world of AI agents and

Speaker 26:57 the onrush of all these AI applications? Yeah. I always need to preface this. I'm European, so we are maybe slightly less excited.

Conor Bronsdon 27:05 But I'd

Speaker 27:07 so I think You're laughing. You're excited. Yeah. We are excited. No. No. We are excited. AI is exciting. I think it's the classic problem that we're often over-focused on today. It's the classic story. My example is always self-driving cars. They were a big thing, or everybody had high hopes, a couple of years ago, and it didn't quite happen back then. Yeah. But nowadays, they're on the street. Like, I'm driving with a Waymo pretty frequently. It's just doing its thing. I think AI is a similar idea that

Speaker 27:36 we have a lot of work to do and it's still early days, but there will be a lot of that coming out, even though it might take a bit longer until we get to that production, "this is what happens in reality" phase, and we're trying to do a lot of things. Maybe it takes a little longer to actually get to the end goal, but we're on that journey. So it's definitely exciting to be part of that.

Speaker 27:56 But I'm also sometimes cautioning people that it might just take a little longer to get to that final result, even if you don't see it today, or it's more like, this is an interesting thing. It might not happen in reality yet, but there will be some major changes coming. Yeah. What about Galileo and Elastic together? How can we continue to build out what we're already doing? What can we grow into, do you think? Yeah. So I think along the journey of LLMs becoming a more integral part of the workflow and everything, I think that evaluation side matters, so that you don't have too many hallucinations or you can figure out how good the quality is.

Speaker 28:34 It will also have other aspects around like performance. I feel like initially we were very forgiving to LLMs like, it would take like five seconds to generate an answer. Maybe that will change that people will require faster answers. Cost is always a question, though I think the cost of LLMs actually is decreasing quite rapidly. So that the initial fear that it would be hugely expensive has

Speaker 28:58 become a bit less of a fear, but cost management is still a thing. Totally. For performance, my favorite story there is always that I remember a couple of years ago when people were using Elasticsearch and running a query, they would say, oh, it takes two hundred milliseconds to get the response back. That's unacceptably slow. And now it's oftentimes, oh, the LLM takes five seconds to give me an answer. That's perfectly fine. So I think we're still in this early excitement

Speaker 29:23 where we're a bit more forgiving, but that will also change. Like having fast answers and faster response times, having it more real time, I think that will be something that will become more and more important, and then the evaluations are great for that. Yeah.

Conor Bronsdon 29:38 I think for me, we're seeing a lot of the same stuff here, where the challenge now is more about how do we create enough trust for enterprises to go to production with their AI. Right. Because like, at first you have the novelty and like everything is new and exciting,

Speaker 29:53 but as it becomes the standard in the workplace or also like in your private life, the standards will rise or like you need to rely on this, otherwise people will not accept it. Yeah. And that's why we're talking about this phrase like reliability

Conor Bronsdon 30:06 or, you know, predictability, because obviously the non-determinism is the magic here, but we need to have an understanding of the bounds of when certain things will happen, particularly with agents and multi-agent systems. There's so much opportunity to solve knowledge work problems, to solve physical problems when linked with robots. But if it goes completely off task, there are risks to that. And it's hard for

Conor Bronsdon 30:34 major customers like financial service providers or other very consequential use cases to go to production with AI when we're not able to provide them guardrails, when we're not able to help them have a more reliable system. That's a big part of why I know we're focused on reliability and why Elastic is focused on data reliability as well to back end all of this.

Speaker 30:52 And I feel like we have almost gone through different cycles with that already. I feel like initially when LLMs came out, like many companies put them on their website as like, oh, you have this chat interaction. And then everybody could talk the LLM into saying something stupid. Like, it's like, oh, yeah. You can cancel your flight for free. I think there was even a court case in Canada or so where somebody You lie. Yeah.

Speaker 31:13 Where the airline lost that case because it was their agent, and then, well, the result kind of had to stick. And I feel like then in the next phase, a lot of companies kind of took the LLMs back and did not let them have these interactions, because they could just be talked into doing the wrong stuff too easily. So I think today, a lot of the interactions are often more inside companies, because there you don't have malicious actors, but you have your employees who hopefully try to just do their job and not talk the LLM into doing something weird. But I think if we have the right guardrails in place, the LLMs will take a more front and center, customer-facing seat again, if they can be trusted.

Conor Bronsdon 31:52 So let's, I mean, let's assume that Galileo and Elastic together will be able to help these LLMs be trusted, guardrailed, and set up right. Oh, you know, we'll get back to that. Let's assume we do. What's that picture of the future look like? What does it look like when we have reliable production grade agents and systems that are in place? How do you think our day to day will change?

Speaker 32:17 I hope I will never have to call any hotlines again. Oh, please. Or have to wait on some chat system or some agent to answer my questions. So I think the interaction mode will become nicer and easier. And we don't have to have people on some phone hotline being abused, taking one angry customer call after the other. So I think there is something for everybody to win.

Speaker 32:47 And then I hope to free these people up to actually, I don't know, solve the problem in the background rather than just taking the call and then forwarding it. Some people are very pessimistic and are like, oh, people will lose their jobs and things will change. I feel like we've always had that. Since the industrial revolution, people have been saying, oh, automation is bad and everybody will lose their job. Doesn't seem to have happened yet. I don't think it will happen now.

Speaker 33:10 I just hope we can shift our work a bit. Like initially you had to do a lot of manual labor, then that shifted. I think now there are a lot of very repetitive tasks that can be automated away, but there is still a lot of other work to do in the background. So I'm not afraid for people to lose all their jobs, but just to move us to the next way of interaction and how things are going.

Conor Bronsdon 33:37 Speaking of background work, a lot of that is happening through things we've already talked a bit about, like guardrailing, evaluations, observability, so that you can actually ensure the reliability and trustworthiness of these systems that everyone's building. What would be your advice to developers who are building with agent frameworks, are experimenting with AI, who are trying to go to production about how to approach

Conor Bronsdon 33:59 all the work that goes into making that magic happen.

Speaker 34:03 Yeah. It's interesting because I feel like it's very easy to get started and build, but really getting it to production is a hard problem. Yeah. And I think guardrails are a very important part, and you really need that evaluation side of saying, okay, it is actually doing what I want it to do. It is also within the expectation of cost and performance

Speaker 34:26 because nobody wants to put out a system that gives you a bad user experience. Totally. So I think going to production is like you need to have the right tools. Having been in the observability space for a while, it's like oftentimes it's an afterthought. It will be the same for AI, but people will figure out sooner or later that it shouldn't be an afterthought and you need to actively work on that.

Speaker 34:49 Maybe AI even puts you in a better position there because like with the hallucinations and giving wrong answers, like you have an even stronger business impact. Yeah. Or like business drive to give the right answers.

Conor Bronsdon 35:01 Are there particular areas about what's coming with AI or what's happening today that you think people aren't paying enough attention to?

Speaker 35:08 I think evaluations have been one of these areas that have been underserved. So last year, sorry, last week, there was AI Engineer, the big conference around everything AI. And they had a track on evaluations, and it was one of the areas where people were excited and very interested, I would say, just because we've been building for a while. And the building initially was great, but now the question really is, how are things improving or changing? The same for search that you can

Speaker 35:38 you have, like, with vector search and with the LLMs generating the output, you can do a lot of things, but is it really improving the answers overall? And oftentimes people don't have any system in place to figure that out. You just throw something out, you hope it's better, but you don't really know. And as we are progressing, I think that feedback loop will need to become better because you have now so many options.

Speaker 36:01 But options also force you to make the right choice. And finding or making better decisions is an important part of that. I've been passionate about technology for a long time. Yeah. I think now it's an exciting phase. Totally. Being at Elastic, or working on Elasticsearch, search was oftentimes, like, not an afterthought, but it was not a fancy problem anymore, even though search was never solved. But it wasn't, like, as front and center as it is nowadays.

Speaker 36:31 Search is suddenly interesting again and also has budget. Yeah. Which almost surprised us. Or no. It doesn't surprise us, but it's like we always thought it should have the budget that it suddenly now has, but people have this expectation that you can get much more out of search again and the data that you have. So that's a good change. And I think just as AI gets into other areas,

Speaker 36:54 it has a ripple effect into many other things, like observability, just interacting with your data and figuring out what you have in all the observability data that you have collected. It just gives you great new options. And I always hope that we can get out of some of the more boring work and figure out more of the exciting work. I think there's always this saying in data science, you spend 80% of your time cleaning up data. Yep. I think LLMs have actually added some interesting tools to make that a lot better and faster,

Speaker 37:23 so we can focus more on the exciting stuff rather than the boring stuff. I hope that the same will happen to observability data in general, or just to give you better tools to do things that were hard before or were, not necessarily even hard, but boring.

Conor Bronsdon 37:39 Well, you brought this up earlier, which is this idea of change and how constant it is and the shift from so many folks working in farming and fields and hard physical labor to where we are today. And we're seeing this new paradigm shift occur, but there's always little paradigm changes. So I'm with you on A, like I think this broad systemic change is going to continue to happen and we have to continue to adapt and like,

Conor Bronsdon 38:04 that's just a constant. We can't expect the job will stay the same. I'm also optimistic that many of the drudgery tasks are going to be automated out. I mean, you brought up one earlier, which is like, I don't really write tests anymore. Like I might edit them, but I don't love writing tests. Also, let's be honest, I've said this on the show before, I'm not the best dev out there. There's a reason I talk about things for a living. I'm not always building things for a living and working in production environments all the time. So it's nice to not have to spend as much time on that. And I think there's plenty of other examples of this that folks are using. Oh, yeah. I think there have been many boring tasks that were hard to get rid of in the past.

Speaker 38:42 But it's like, yeah, before we had the right machines, we had to do the manual labor. There was no way around it. Now we hopefully can do more of the interesting stuff. I think this is a great example of the partnership that Elastic and Galileo have. What models does a company build? And at Elastic, we're not building large language models. The models that we build are about inference and then re-ranking. And then we rely on our partners to provide either a large language model to generate the answer or, like in Galileo's case, the new Luna models on the evaluation side.

Speaker 39:15 So I think that is one of the interesting aspects of these partnerships, that you can build so many things. And as we're building more, the small language model is actually an exciting space because of cost and latency, so you want to build more of these. I think that is a big area that is a bit underserved right now, or where we're exploring more, like how can we just go from these huge general purpose models to something that is more

Speaker 39:43 specialized, easier and cheaper to run, and has a much lower latency to get you to the results that you want to have? Because, yeah, a while ago people were very forgiving about latency because of this novelty, but that might not be the case anymore, and of course we want fast answers. So building more specialized models that give you what you want is a great space to explore. I completely agree, and we're really excited about launching our Luna-2 models to enable real time guardrailing and evaluations for things like private information leakage,

Conor Bronsdon 40:15 toxicity, and prompt injection attacks. This real time guardrailing is so crucial to help create reliable, trustworthy AI systems. And it's really exciting to see Elastic also investing in small language models, and the clear opportunity for these to dovetail and work together in building reliable multi-agent AI systems is very exciting.

Conor Bronsdon 40:40 Can you tell us more about Elastic's small language models and how developers can leverage them? Yeah. So

Speaker 40:46 we have built something for Elastic re-ranking. You retrieve a larger set of data, and then you have a model that is more expensive than the regular ranking to re-rank a subset. So we can run something a bit more expensive on a smaller subset of all the data. So let's assume you have a million documents, and in the first step you retrieve the first thousand, and then you run the re-ranker, which is more expensive and a bit slower per document, on that top thousand document list. That was a use case that was not really possible before, but now with a fast and efficient small language model for re-ranking,

Speaker 41:19 we give you that option. The same thing for creating the embeddings or doing the inference. Right now, we heavily build on E5, which is a multilingual model, to do that in the dense vector space. We also have a custom model for the sparse vector space, which is basically keyword expansion to find related keywords. So there's a lot of excitement about doing these, and you don't need a general purpose large language model. They would be too expensive, too slow

Speaker 41:46 to do that anyway. So you need to find the right models. It's actually almost surprising. I feel like we haven't found that much in the more tuned or specific area for language models. So, for example, if you have observability data, I remember people were already a year ago talking about how we'll have something that evaluates that observability data and gives you the right output for that. I think to a large part, companies are still using general purpose large language models because they can do that and that's what's available.
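The two-stage pattern described a moment ago (retrieve a larger candidate set cheaply, then run a costlier re-ranker only on those candidates) can be sketched as follows. This is a toy under stated assumptions: `cheap_score`, `expensive_score`, and `search` are hypothetical names, and both scoring functions are simple word-matching stand-ins, not Elastic's ranking or its re-ranking model.

```python
import heapq

def cheap_score(query, doc):
    """First stage: fast keyword overlap, a stand-in for regular ranking."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def expensive_score(query, doc):
    """Second stage: a slower per-document scorer standing in for the
    re-ranking model. Here: matches weighted by how early they appear."""
    q_words = set(query.lower().split())
    words = doc.lower().split()
    return sum(1.0 / (i + 1) for i, w in enumerate(words) if w in q_words)

def search(query, corpus, first_stage_k=1000, final_k=3):
    # Stage 1: score every document cheaply and keep the top candidates.
    candidates = heapq.nlargest(
        first_stage_k, corpus, key=lambda d: cheap_score(query, d)
    )
    # Stage 2: run the costlier scorer only on that subset.
    reranked = sorted(
        candidates, key=lambda d: expensive_score(query, d), reverse=True
    )
    return reranked[:final_k]

corpus = [
    "notes on vector search and ranking quality",
    "ranking search results with a second stage model",
    "a guide to baking bread at home",
    "search ranking basics",
]
top = search("search ranking", corpus, first_stage_k=3, final_k=2)
```

The design point is that `expensive_score` runs `first_stage_k` times rather than once per document in the corpus, so with a million documents and a first-stage cutoff of a thousand, the slow model only sees a thousand of them.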

Speaker 42:16 I guess we'll have to see when people build more specialized models, or maybe it's just too expensive to build more specialized models and we keep relying on the large ones. But that's an exciting space. But yeah, for something like the evaluation side, your latency requirements, your cost requirements will be pretty

Conor Bronsdon 42:35 constrained, I guess. Yes. Yeah. That's why you need to build your own models to do that. Totally. And we've worked with a lot of enterprise customers on this, and that's exactly what they come to us for. They say, hey, look, you know, this LLM-as-a-judge concept, we're using it, it doesn't scale to millions of traces necessarily. That gets really expensive. Latency becomes a real issue if we want to go into production.

Conor Bronsdon 42:53 So I love that Elastic's also investing in small language models to help with key effectiveness and tasks. I think it's very... In our specific area,

Speaker 43:02 that's where all the partnerships kinda are coming from. I always describe this as almost like a lasagna, like you have these little layers in the AI lasagna. Yeah. So you have these layers and everybody kind of partners with the other layers. So we're happy to do the storing of the data and then having the retrieval there.

Speaker 43:20 But then you need the others, maybe for the evaluation side and for the different frameworks. And there are so many layers in this AI lasagna right now. There are a lot of components. Yeah. So it's great to be able to build that lasagna

Conor Bronsdon 43:36 or cake. Amazing. Well, thank you so much for joining us. Really enjoyed chatting with you. It's always good catching up with you, and I'm so excited to continue to build with Elastic. Everyone out there who wants to, you should check out our documentation, see some of the examples we have built out, and give it a try with Galileo's new free AI reliability platform. There's so much you can build using Elastic for your search, for your RAG, and it's a huge opportunity ahead, and we can't wait to build more together. Sounds great. Thanks, Philipp.