Why do most enterprise AI agents fail to reach production?

On the show, Lyzr’s Siva Surendira says “95% of all AI work today done by enterprises remain as proof of concepts” and that these projects don’t move into production. When his team dug into why, the primary concerns were safe AI and responsible AI: enterprises worry about getting sued, or about mistakenly approving a $100,000 refund when it should have been $10. He frames the obstacles as hallucination, toxicity, and prompt injection, which is why Lyzr built guardrails “natively to the core agent architecture” rather than as a bolted-on third-party API.
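
To make the refund example concrete, here is a minimal Python sketch of a guardrail that sits inside the agent's action path instead of being called as an external service. It is not Lyzr's implementation: `RefundRequest`, `GuardrailPolicy`, `check_refund`, and the 50.0 auto-approve limit are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    order_total: float      # amount the customer actually paid
    proposed_refund: float  # amount the agent (LLM) wants to refund


@dataclass(frozen=True)
class GuardrailPolicy:
    # Hypothetical limit, defined once and shared by every agent.
    auto_approve_limit: float = 50.0


def check_refund(req: RefundRequest, policy: GuardrailPolicy) -> str:
    """Return 'approve', 'escalate', or 'reject' using deterministic rules."""
    if req.proposed_refund <= 0 or req.proposed_refund > req.order_total:
        return "reject"    # a refund can never exceed what was paid
    if req.proposed_refund > policy.auto_approve_limit:
        return "escalate"  # hand off to a human instead of auto-approving
    return "approve"


policy = GuardrailPolicy()
# The hallucinated $100,000 refund on a $10 order is blocked.
print(check_refund(RefundRequest(order_total=10.0, proposed_refund=100_000.0), policy))
# The correct $10 refund passes.
print(check_refund(RefundRequest(order_total=10.0, proposed_refund=10.0), policy))
```

Because the policy object is defined in one place, the same check applies whether one agent or many call it, which is the spirit of enforcing guardrails centrally.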

What are the three types of multi-agent orchestration Lyzr uses?

Siva describes three. First, managerial orchestration: a master agent that calls worker agents at will to complete a task; he credits AutoGen as the first to do this. Second, DAG (directed acyclic graph), a step-by-step workflow that he says LangGraph was the first to launch from an agent point of view. Third, Lyzr’s hybrid flow: rather than being restricted to a DAG or managerial setup, you can also bring in deep-learning, ML, code, and SQL agents, connected to Azure ML Studio or Amazon SageMaker, which he says makes the system “far more deterministic” for high-complexity cases like a customer-refund or retirement-planning agent.
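
The difference between the first two modes can be sketched in a few lines of Python. This is an illustrative toy, not the API of AutoGen, LangGraph, or Lyzr: the worker functions, `simple_manager`, and the `EDGES` graph are all hypothetical, and the "LLM" steps are plain functions. The deterministic `cap_refund` step stands in for the kind of code agent the hybrid flow adds.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, Optional

State = dict
Worker = Callable[[State], State]


def lookup_order(state: State) -> State:
    return {**state, "order_total": 10.0}          # stand-in for a database call


def draft_refund(state: State) -> State:
    return {**state, "proposed_refund": 100_000.0}  # stand-in for a faulty LLM step


def cap_refund(state: State) -> State:
    # Deterministic code step: never refund more than the order total.
    return {**state, "refund": min(state["proposed_refund"], state["order_total"])}


WORKERS: Dict[str, Worker] = {
    "lookup_order": lookup_order,
    "draft_refund": draft_refund,
    "cap_refund": cap_refund,
}


def simple_manager(state: State) -> Optional[str]:
    """Toy 'master agent': decides which worker to call next, or None when done."""
    for key, worker in [("order_total", "lookup_order"),
                        ("proposed_refund", "draft_refund"),
                        ("refund", "cap_refund")]:
        if key not in state:
            return worker
    return None


def run_managerial(state: State, choose_next: Callable[[State], Optional[str]]) -> State:
    # Managerial mode: the master picks workers at will until it is satisfied.
    while (name := choose_next(state)) is not None:
        state = WORKERS[name](state)
    return state


# DAG mode: each step maps to the set of steps it depends on.
EDGES = {"draft_refund": {"lookup_order"}, "cap_refund": {"draft_refund"}}


def run_dag(state: State, edges: Dict[str, set]) -> State:
    # Steps run in a fixed order derived from the graph, not chosen at runtime.
    for name in TopologicalSorter(edges).static_order():
        state = WORKERS[name](state)
    return state


print(run_managerial({}, simple_manager)["refund"])
print(run_dag({}, EDGES)["refund"])
```

In both runs the deterministic step caps the refund at the order total, regardless of what the simulated LLM step proposed.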

What are the biggest bottlenecks to enterprise agent adoption?

Siva names three. First, integration: for any enterprise above $100M (mid-market) or $1B (large) in revenue, custom in-house apps “completely outweigh” standard systems like Salesforce and SAP, and these ten-to-twenty-year-old apps lack documentation, so the actions an agent can take are undefined. Second, data readiness: enterprise data isn’t “LLM ready”; it’s scattered across systems and formats, which is the gap tools like Glean address, though many orgs aren’t even ready to feed data to Glean yet. Third, the skills to build, since many enterprise agent platforms are closed and no-code-only, limiting you to low-complexity use cases.

Where does Siva see the biggest opportunity for agents — replacing SaaS or replacing human work?

He’s skeptical the main prize is replacing SaaS. He does note agents eating into the software layer, citing a large media org that chose Moveworks over ServiceNow’s own agents, but he says “the biggest unlock is always going to be agents replacing the mundane work that humans do.” His example: a food-delivery company with 15,000+ employees that hired about 30 contract HR staff every June for two months of reviews; agents his team built in three weeks now handle that at scale for 15,000 employees. He expects agents to start with low-skill follow-up work and move toward high-skill work in 2026 to 2027.

How does Siva think about evals and choosing between local and frontier models?

He describes two kinds: agent-level accuracy, groundedness, and hallucination, plus test cases for how well the agent solves its business problem. He sees a shift to local open-source models, which let enterprises run thousands of experiments with data in-house at “like 100 times cheaper” than GPT or Claude — and fine-tuned local models become the enterprise’s own IP. On AWS Bedrock, he names NVIDIA NIM and Meta Llama as the #1 and #2 local models for banks and insurers. He recommends a two-layered setup: Lyzr’s built-in guardrails plus an eval platform like Galileo as the “antivirus” layer.
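
The second kind of eval, test cases for the business problem, can be sketched as a tiny Python harness. This is an assumption-laden illustration, not Galileo's or Lyzr's tooling: `stub_agent`, the sample questions, and the exact-match scoring are hypothetical. Real evals for groundedness or hallucination need richer scoring than string equality.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]  # (question, expected answer)


def stub_agent(question: str) -> str:
    # Hypothetical stand-in for a real agent call.
    answers = {"refund for order A-1?": "10.00"}
    return answers.get(question, "unknown")


def pass_rate(agent: Callable[[str], str], cases: List[TestCase]) -> float:
    """Fraction of test cases where the agent's answer matches exactly."""
    passed = sum(1 for question, expected in cases if agent(question) == expected)
    return passed / len(cases)


cases = [
    ("refund for order A-1?", "10.00"),
    ("refund for order B-2?", "25.00"),
]
print(pass_rate(stub_agent, cases))  # one of two cases passes
```

Running the same case list against different models is one simple way to compare a local model with a frontier model on the task that actually matters to the business.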

Episodes · S2 E21

Why Enterprises Need a Different Approach to AI Agents | Lyzr’s Siva Surendira

Apr 30, 2025 · Siva Surendira, Lyzr · 39 min

AI Agents · Multi-Agent Systems · AI Evaluation & Reliability · Enterprise AI

Key takeaways

Most enterprise AI never ships. Siva Surendira says “95% of all AI work today done by enterprises remain as proof of concepts” and never reach production. The blockers are safe-AI and responsible-AI concerns — hallucination, toxicity, prompt injection — like an agent approving a $100,000 refund where it should have been $10.
Lyzr builds guardrails into the core, not as an add-on. Because the team wrote the agent framework themselves, they built safe/responsible-AI guardrails “natively to the core agent architecture.” The result is agents that are “responsible by design,” not a third-party API bolted on, with guardrails enforceable centrally “whether you are building 10 agents or 10,000,000 agents.”
Orchestration isn’t either/or — it’s hybrid. Siva names three modes: managerial (a master agent calling workers, which he credits Autogen with first), DAG popularized by LangGraph, and Lyzr’s hybrid flow. For high-stakes cases like a refund agent, hybrid flow mixes in deep-learning, ML, code, and SQL agents to make the workflow “far more deterministic.”
Integration and data, not models, are the real bottleneck. For any enterprise above $100M revenue, custom in-house apps “completely outweigh” systems like Salesforce and SAP. These ten-to-twenty-year-old apps have no documentation, so agent actions can’t be defined, and the data isn’t “LLM ready.” The third bottleneck: the skills to build, since many platforms are closed and no-code-only.
Go after the human layer, not SaaS. Siva’s advice: target “the mundane work that teams are doing.” His example — a food-delivery company of 15,000+ employees that hired ~30 contract HR staff for two months of annual reviews — replaced that with agents his team built and deployed in three weeks, now running at scale for 15,000 employees.
Self-healing agents aren’t here yet — and Siva has the receipt. Testing an automated feedback-and-conflict-resolution loop across a multi-agent setup, “such loop cost us $300” in LLM calls and still didn’t fix the problem, so they stopped it. He pushes local open-source models instead: thousands of in-house experiments at “like 100 times cheaper” than GPT or Claude.

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Agent · AI Evaluation · Vibe Coding · Frontier Model · Accuracy · AI Hallucination · Faithfulness · AI Benchmark · Explainability · Model Context Protocol (MCP)

Chapters

00:22 Introduction and Guest Welcome
00:52 Enterprise Agent Framework
02:48 Building Enterprise-Friendly AI Frameworks
04:56 Enterprise Concerns with Vibe Coding
09:23 Safe and Responsible AI Implementation
11:05 Multi-Agent Orchestration
14:13 Challenges in Multi-Agent Systems
14:22 Enterprise Integration Bottlenecks
17:37 The Role of Low-Code and No-Code Solutions
19:55 Inter-Agent Communication Standards
21:49 Future of AI Agents in Enterprises
29:37 Evaluating AI Agents
36:34 Conclusion and Final Thoughts

Show notes

Agentic AI exploded in 2025, but how do businesses move beyond prototypes to deploy reliable, valuable agents at scale?

Join host Conor Bronsdon and Lyzr AI CEO Siva Surendira as they discuss the complexities of building and managing AI agents for enterprises. Siva shares his journey creating Lyzr, focusing on making powerful agent frameworks accessible and trustworthy for enterprise developers. They discuss the critical hurdles businesses face, including productionization challenges, ensuring responsible AI, and bridging the gap between rapid innovation and the stringent requirements of regulated industries.

Listen as Siva explains Lyzr's approach to embedding safety guardrails natively and learn about the nuances of multi-agent orchestration, including managerial, DAG, and hybrid flows. Siva also offers insights into the limitations of "vibe coding" for enterprise use cases and stresses the crucial role of robust evaluation (evals) and choosing the right models—from local open-source options to frontier LLMs. Explore the bottlenecks hindering adoption, like custom application integration and data readiness, and learn why Siva believes the biggest opportunity for agent companies may not lie in replacing SaaS platforms but rather in automating the mundane work currently performed by humans.

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Follow Today's Guest(s)

Website: lyzr.ai

LinkedIn: Siva Surendira

Check out Galileo

Try Galileo

Agent Leaderboard

Transcript

Siva Surendira 0:00 And I started seeing vibe coding as, the analogy would be, a fast-food thing where, okay, it's fantastic. You can eat something really quickly. You can get hold of your burger or sandwich in a matter of minutes, or even seconds. But then it has to be done in moderation. It has to be used carefully.

Conor Bronsdon 0:24 Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon. And today, we are joined by Siva Surendira, CEO at Lyzr AI. Siva, welcome to the show. Thanks, Conor. Happy to be here. I'm excited to have you here in particular because 2025 has rapidly become the year of agentic AI, or at least the year so far. So many of our guests here on Chain of Thought are now either experimenting with agents

Conor Bronsdon 0:53 or already building with them. And you might be ahead of the game there, as Lyzr is an enterprise agent framework that helps businesses build AI agents. It seems like the perfect moment to be leading a company that's an agent builder. So I gotta ask, did you foresee agents becoming the thing in AI here in 2025?

Siva Surendira 1:13 I always saw the power that an agent had, like what it could do, for example, a task repeatedly, over and over again, which is exactly what a lot of us humans do in our daily work life, so that is something agents could replace. I was able to see that. But I never thought it was going to blow up the way it has now. We had concepts like RAG and fine-tuning, which got popular. They had their own hype cycles.

Siva Surendira 1:41 But I think the way agents became mainstream is more like how databases and cloud became mainstream. I never thought it would have this level of adoption, but I cannot complain, because we were one of the early movers and we are sitting pretty on this whole so-called agent train.

Conor Bronsdon 2:02 Absolutely. It's great timing, certainly, and very cool to see the success Lyzr is having. What's the core problem that you see Lyzr solving for developers and enterprises who are looking to implement AI agents?

Siva Surendira 2:16 Now, I've spent the last ten, twelve years of my life working with enterprises, working for enterprises, working for GSIs, also known as global system integrators. Like, if you look at companies like Accenture and IBM Services, they all come out of that bucket. And the developers there are not the ones who would normally spend their time on a framework, on a command line interface, and go through Python packages and modules.

Siva Surendira 2:46 So given my background, I thought, okay, this is very powerful. The open source frameworks are very powerful. But how do we make them enterprise friendly, or rather, more importantly, enterprise developer friendly? How do we get the developers in these enterprises and these GSIs to build agents and manage agents at scale? So that was the line of thought

Siva Surendira 3:09 which led me to build a framework back in July 2023. And for the whole of 2024, we were just a framework company, comparable with AutoGen, CrewAI, and LangGraph. But then we realized that, again, we were getting into the same trap, because developers are not necessarily building on frameworks. They need a UI. They need a Vercel-like platform, an AWS-like platform, which led us to build the Lyzr Agent Studio,

Siva Surendira 3:40 which went live in December 2024. So that's been the journey. The primary line of thought was this: how do we bring the agent framework experience to an enterprise developer via a beautiful UI for them to build and manage agents?

Conor Bronsdon 3:57 I think it's really interesting to see this different approach that you're taking to the agent market because, as you noted, a lot of frameworks have gone broad. And you've made this decision to say, hey, look, we're going to focus on enterprises now. We're going to focus on enabling trustworthy AI, as some folks would put it, or responsible AI, as I saw one of your modules calling it. I'm curious how that dovetails

Conor Bronsdon 4:25 with this incredible innovation that's happening around vibe coding, where people are able to enable builders around the world to create applications. Even bad devs like myself can build really cool things now. But it's become a lightning rod because, obviously, there are all kinds of positives and negatives to this. Yes. We're creating a lot more code. We're building a lot more applications. We can build agents way faster,

Conor Bronsdon 4:53 but they're also often sloppier. There may not be the best CI/CD processes. And a lot of enterprises in particular have highly consequential use cases where, if their customers have an outage, there's major money lost, or even lives, in the case of health care providers and other folks. They have different standards. So how do you set up an agent framework, which I think most people see as something that's like, oh, quick moving, let's go, let's go, let's go, to,

Conor Bronsdon 5:26 you know, face this enterprise challenge, particularly as these low-code or almost no-code builds are simply done by, you know, ChatGPT or Claude or whoever's doing the coding for you, Cursor, Windsurf, or whatever. How do you reframe agents for this new world of speed, risk, and opportunity?

Siva Surendira 5:52 Yeah. There are two parts to this. Right? Let me handle the vibe coding thing first, and then I'll go back to the other problem that we started seeing in the market, especially with enterprises. So on the vibe coding part, it's fantastic. Like you mentioned, I don't know React. I know Python. I know databases. I was a Teradata developer. I know all the back-end systems. React, I never had the time to learn it. Same. But, yes, thanks to platforms like Bolt, Lovable, Cursor, Windsurf, you don't have to know React now. We can just build the whole thing via vibe coding. But

Siva Surendira 6:32 can I go back and edit? Can I go back and make it production grade? Can I go back and make it something that can scale up to, say, a million users? No. And I started seeing vibe coding as, the analogy would be, a fast-food thing where, okay, it's fantastic. You can eat something really quickly. You can get hold of your burger or sandwich in a matter of minutes, or even seconds.

Siva Surendira 6:57 But then it has to be done in moderation. It has to be used carefully. Andrej Karpathy had a very good take on this. He was the one who coined the term vibe coding in the first place. And then I think he came back and made a pretty interesting tweet. He said, I mean, even for you to vibe code and build something, you need to know the architecture of what you're building. You need to have a plan in place. You need to make sure the modules are still in your control. You need to first plan the modules out so that you can use vibe coding in moderation.

Siva Surendira 7:34 Now, that will be a strategy that helps you move farther, rather than trying to build the whole thing via vibe coding. I don't know if you came across the Twitter user who actually built and launched an app purely via vibe coding, and then I think he started getting attacks from all over the place.

Conor Bronsdon 8:00 Yes. As he told the Twitter universe that he had vibe coded it, some lovely folks out there were like, oh, let's see if you're exposing any endpoints. Let's see how your data is set up here. Let's see what's going on. Good. I mean, enterprises are not going to take this risk at all. No. Certainly not.

Siva Surendira 8:15 I have been speaking with enterprises. We are working with some of the biggest banks in the world, some of the biggest insurance organizations in the world, some of the biggest airlines in the world. But mostly, the focus for Lyzr is the financial institutions, because they are the most regulated. They are the most worried in terms of data and AI. So we started going to the deep end of the problem on day one instead of trying to do all the other stuff. And there,

Siva Surendira 8:43 we did speak to them about using Cursor and Bolt and Lovable. They have very few subscriptions, and those only for the product teams and some marketing teams to build prototype UIs. They are not allowed to export the code anywhere. They can only take screenshots of whatever they build. They do allow people to go and be creative, but you cannot push that code to production. You're not allowed to take that code to production. So that is how enterprises are reacting to vibe coding. It's a fantastic, I

Siva Surendira 9:18 would say, tool for startups and for SMBs, who can get a lot of things done, but enterprises have a very different take on vibe coding. And so, going back to the second aspect, it's the exact same mindset enterprises have towards open source agent frameworks. The other problem was the productionization of these agents. 95% of all AI work today done by enterprises remain as proof of concepts. They don't move into production.

Siva Surendira 9:50 And when we started digging into it, aspects around safe AI and responsible AI were the primary concerns. Enterprises are worried. They don't want to get sued by their customer. They don't want to, by mistake, approve a $100,000 refund to their customer where it was just a $10 refund. So the hallucination issue, the toxicity issue, the prompt injections: these are the issues that bother them and prevent them from taking their agents to production.

Siva Surendira 10:20 And that's the other problem that we started tackling. How did we tackle it? Since we wrote the core framework ourselves, we were able to incorporate the safe AI and responsible AI guardrails natively to the core agent architecture. So Lyzr agents are responsible by design. The guardrails are not an afterthought. They are not a third-party API that you integrate. And that allowed us to go to enterprises and say, hey, now you have a viable alternative

Siva Surendira 10:48 where the agents have built-in guardrails. And you don't have to worry whether you are building 10 agents or 10,000,000 agents. You can centrally enforce the guardrails across the entire organization. So that is how we started tackling problem number two, which is how do you get your agents to production?

Conor Bronsdon 11:09 And one of the key things that may impact that is gonna be multi-agent orchestration. Obviously, as you alluded to at the beginning of this conversation, a lot of the first agents we saw were very simple. And now we're starting to move towards multi-agent systems that can potentially solve much more complex tasks, but most of those aren't in production. To actually deliver value, it's critical that enterprises have complex, multi-agent orchestrated workflows.

Conor Bronsdon 11:39 What are the key challenges in achieving effective multi-agent orchestration? And what are some promising approaches you may be seeing to address them?

Siva Surendira 11:49 Yeah. So there are three types of multi-agent orchestration that we provide to our customers. Option one is the managerial orchestration, which is quite popular. I think CrewAI made it popular initially. Not Crew. I think AutoGen was the first one to do that, the managerial orchestration, where you have an agent as a master agent and you have worker agents. The master agent can call the other agents at will based on what it is supposed to deliver, and then it completes the whole job. The master agent is responsible for completing the whole task.

Siva Surendira 12:20 So that's the manager agent concept. Then you have the DAG. DAG as a concept has always been around, but LangGraph was the one to launch it from an agent point of view: a directed acyclic graph, which allows you to have a workflow and execute it step by step. I think Flowise is a beautiful GUI where you can see how agents can stack one after the other. And Flowise was not calling them agents initially. It was just prompt chaining, basically.

Siva Surendira 12:53 But that is where DAG comes into the picture. Now, the more we started working with enterprises, the more we realized that it's not an either/or. It is a combination: a managerial orchestrator agent plugged into a larger DAG orchestration of a multi-agent system. But still, it was not solving the problem in very critical use cases. We realized that you cannot go to production

Siva Surendira 13:17 for high-complexity use cases or scenarios like a customer refund agent, or a customer onboarding agent, or, say, a retirement planning assistant agent, etcetera. You cannot, especially when your agents have to transact with or speak to a database system or an internal system of record, etcetera. We realized that we may have to bring deep learning agents, machine learning agents, code agents, or SQL agents into the mix.

Siva Surendira 13:45 That improved the deterministic nature of the entire workflow. So that's where Lyzr came up with the hybrid flow approach, where you're not restricted by a DAG or a managerial orchestration. You can now also bring in your deep learning, machine learning, and code agents, and SQL agents as well. So we built those connectors to platforms like Azure ML Studio or Amazon SageMaker,

Siva Surendira 14:10 so you can run your ML models and DL models there, and integrate. That allows you to have a far more deterministic system.

Conor Bronsdon 14:18 What about limitations? Where are there still challenges that you're seeing with the current approaches to multi-agent orchestration?

Siva Surendira 14:26 Integration is still a very big challenge. There are these fantastic API platforms, the Merges, Composios, and Pipedreams of the world. They have some integrations available. But if you go to any enterprise above $100 million, be it a mid-market enterprise above $100 million in revenue or a large enterprise above $1 billion in revenue,

Siva Surendira 14:49 the number of custom applications that they have completely outweighs the standard ServiceNows of the world, SAPs of the world, Oracles of the world. So integrating with these custom applications is a bottleneck today because there is no documentation. These are all ten-, twenty-year-old applications running in these organizations. Without any documentation, you'll have to sit and figure out what is happening. You can build an API integration pretty fast. I mean, you have the OpenAPI standard. You have MCP. You have Cursor. Throw everything into Cursor. It'll do something. That's okay. But

Siva Surendira 15:27 it's about actions. What actions can the agents take via the API on that software? Someone has to define those actions. For platforms like Salesforce and SAP, those are already defined. Easy. But when it comes to these custom ones, they're not defined. So that becomes bottleneck number one. Bottleneck number two is the availability of tribal data for these agents to work on, because the data has to be LLM ready in a way. It is not LLM ready. It is scattered all over the place. It is of various

Siva Surendira 16:02 types. So now you're thinking about building a semantic model on top, which is what Glean is doing pretty well, to be honest. They are able to bring all the data to one single place and build a semantic model. But organizations are not even there yet, not even ready to supply data to platforms like Glean. So that becomes bottleneck number two, data readiness for agents to really work on. So we've seen a lot of organizations

Siva Surendira 16:27 starting with agents that don't require a lot of internal data. So they're starting with those use cases rather than trying to clean up their data. In a way, it's kind of a fresh start, the beginning of a new tech debt season all over again with AI agents coming in. But these are the two major bottlenecks that we see. The third one, and the last one, is the capability or skills required to build these.

Siva Surendira 16:53 Because most of the enterprise agent platforms, as you may see today, are closed platforms. They are not developer friendly. Either the agents are built by the startups themselves or it is pure no code. And when it's a no-code builder, you can only tackle low-complexity stuff, because you cannot build high-complexity agents like you would on LangGraph, for example. So that is where

Siva Surendira 17:17 the skill set and the training come into the picture. And we've been focusing on that. For all the customers that we work with, we mandatorily do technical sessions for their engineering team, for their system design team, etcetera, so that they can do things themselves. So these are the three major bottlenecks we see with enterprise adoption.

Conor Bronsdon 17:41 It also sounds like you don't think low code or no code is gonna necessarily unlock the next level of agents.

Siva Surendira 17:50 Low code, yes. 100%, yes. For example, if you want to add, say, short-term memory and long-term memory modules to an agent, you don't have to sit and code your way through the whole thing. You should be able to just enable memory in one click. And that's what you do on Lyzr today. If you want to enable groundedness as a fact-checking mechanism,

Siva Surendira 18:15 you don't have to sit and integrate and build the whole thing. Just enable it in one click. So low code specifically is definitely a yes, because it takes away all the unnecessary repeat tasks and helps you focus on the business logic, because you get an API endpoint for the agent, which you can stitch together with others into solving the business logic with a multi-agent system.

Siva Surendira 18:36 No code has its place. I won't say no. I think make.com is fantastic. n8n is fantastic. We have a customer who has chosen two technologies as their go-to automation platforms, make.com and Lyzr. Make.com is for all the no-code builders within their organization, which automates a lot of marketing, CRM, enrichment, etcetera. And Lyzr is for their core business automations. They are an analyst firm, so we are looking at automating research and analysis, including

Siva Surendira 19:09 structured data analysis and so on, and generating a final draft, etcetera. I think this is how organizations will move forward. They'll have a combination of Make plus n8n on the no-code side, then probably an open source side like LangGraph, AutoGen, AG2, or CrewAI, and then an enterprise platform like Lyzr. That is what even I recommend, because

Siva Surendira 19:34 it's horses for courses. For anything deep and highly complex, super complex, go ahead and build your own thing from scratch. Don't even build on any of these frameworks. Just build it from scratch. Get some of the best engineers and build it from scratch. But otherwise, we are looking at pockets. These are the pockets that would be there in larger enterprises as we move forward.

Conor Bronsdon 19:54 How are you thinking about inter-agent communication? Obviously, the standards here are evolving. Transparently, we're involved in an effort with LangChain and Cisco and Glean and a few others around AGNTCY, trying to establish inter-agent communication mechanisms. I'm curious how you're thinking about that.

Siva Surendira 20:17 Yeah. No, I've been following the Cisco effort. I think Cisco initiated the whole thing, and we're also getting in touch with them. That is definitely one standard. But I can tell you this: all the big hyperscalers are going after this market. Yes. We've been in touch with AWS. We've been in touch with Google. They are bringing their own standards now to the market. Microsoft, as you know, will always be slow in these things.

Siva Surendira 20:37 But AWS and Google have already started. There are a few announcements coming out shortly. And there is an angle to this, and I'll bring this out here. When you go to a large enterprise and say, hey, where do you want your API integration to run? Are you gonna run it on a startup, or do you wanna run it on AWS or Azure or Google Cloud? They always prefer the hyperscalers because of the built-in compliance and security measures that are in place. If at all they want their data to hop through a channel,

Siva Surendira 21:07 be it agent to agent or agent to system or agent to human, they want to rely on a platform that is far more reliable and has been around for a long time. I think AWS and Google have picked up on that, and they are now doubling down by building their own agent-to-agent communication protocols. As for Cisco, I think that project is also quite interesting, because

Siva Surendira 21:31 it needs a lot of traction, a lot of adoption from the open-source world, as you may know, for it to become mainstream. But I think these are a few interesting options that are coming out in the market. Time will tell which agent-to-agent communication standard will win, but it's a very good start.

Conor Bronsdon 21:52 What's coming next as we enable agents to work better together? What are the new problems that are gonna start getting solved?

Siva Surendira 22:00 I mean, you see Jason Lemkin of SaaStr speak about agents replacing SaaS to some extent. He speaks on both sides: sometimes he says they won't, sometimes he says they will. But there is, I think, a much wider conversation that, okay, agents are going to replace SaaS to a large extent. And we've seen that in pockets as well. There was a deal that we were working on, probably a few months back, even before we launched our studio.

Siva Surendira 22:27 It was a very large media organization, and they wanted to have ServiceNow agents. Interestingly, they decided to go with Moveworks and not ServiceNow themselves. Those are good examples of how agents eat into the software layer. But to me, the biggest unlock is always going to be agents replacing the mundane work that humans do. I'll give you another example. This is a large

Siva Surendira 23:00 food delivery company in the world. They have 15,000-plus employees. And every June season, they had to do performance reviews for 15,000 employees. So just for those two months, they would hire a bunch of contract HR staff, about 30 of them, just to do this job: go speak to the employees, collect their self-assessments, go to their managers, collect their assessments,

Siva Surendira 23:29 compare them, and come up with a detailed report. So now, today, agents, which took us three weeks to build and deploy, are able to operate at scale for 15,000 employees. It's a mundane job, and I would say the biggest opportunity for any agent company here is not to go after the SaaS platforms but to go after the human layer, the mundane work that teams are doing.

Siva Surendira 23:58 And the next stage for these agents is to become a lot better at reasoning and start moving up the ladder towards some high-skill work. Currently, they'll go after the low-skill work, which is just doing these follow-ups, etc. But stage two, around 2026 or 2027, will be more about taking it up a notch and going after the high-skill work. So, this is something that we are seeing,

Siva Surendira 24:21 glimpses of it already happening, where agents are doing some high-skill work as well. And as we move further along, the human in the loop normally goes down and out, and agents would end up running 80% of an organization over a period of time as the maturity matrix improves.

Conor Bronsdon 24:41 As we rapidly expand this agent infrastructure, as we have organizations that are more and more reliant on agents, how do some of the concepts we've talked about already in this conversation, like low-code and no-code approaches to agent building and agent orchestration, come into play, especially if you're trying to do things like ensuring governance, reliability,

Conor Bronsdon 25:09 transparency for AI agent development and deployment?

Siva Surendira 25:13 I think for the first time, enterprises are spoiled for choice. There's never been a scenario like this before. If you think of CRMs, yeah, there are a few CRMs in the market. If you think of databases, you would always choose among, say, MySQL or Postgres or SQL Server or Oracle. You know your choices. It's not a big deal. But on the agent side, everything is still, at the end of the day, a prompt call to an LLM, in and out.

Siva Surendira 25:40 So organizations like us have tons of opportunities to build complexity over and above that. To give you an idea, one of the largest banks came to us with a problem: how do you handle agent entitlements? What if, by mistake or by intention, a developer connects a marketing agent with the finance agent, and the marketing agent goes and writes a blog post

Siva Surendira 26:09 wherein it uses all the financial projections of the company, which are internal data? How do you prevent the finance agent from sharing this data with the marketing agent, even though the developer, who is the creator and has all the admin rights, made that connection? That led us to build what we call an agent entitlement policy. Now, when you enable that, agents are context aware. So their context-aware

Siva Surendira 26:36 access control kicks in: they understand which agent is on the other side and ask, can I share the data with that agent? These are the newer problems that we are tackling already. The other big problem that we're tackling is what happens when you share feedback, be it human feedback or system feedback, with a multi-agent setup. Think of this multi-agent setup

Siva Surendira 27:00 which has DAG and, I would say, managerial orchestration, etc. How can a feedback agent interpret the feedback, send individual updates to the respective agents, and then automatically run all the test cases? It has to pass all the test cases, and then the agents get updated to the new version. But if test cases fail, now you need a conflict resolution

Siva Surendira 27:26 agent to kick in, and it has to go and check why there is a conflict, how to resolve it, and change the feedback appropriately. So this whole loop keeps running several times. You won't believe it: one such loop cost us $300, just from the number of LLM calls it had to make to complete the whole loop. Did it fix the problem? The answer is no. After spending $300, we had to stop the whole loop. We wanted to see if there is a self-healing or auto-healing capability, and we are not there yet. So these are the newer challenges that we are already facing as we get into multi-agent systems: feedback handling, agent entitlement policy, and so on.
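The feedback loop Siva describes (apply feedback, rerun the test cases, resolve conflicts, repeat) can be sketched with an explicit spending cap so it stops on its own rather than being halted by hand. This is a minimal illustration, not Lyzr's implementation: `LoopResult`, `run_feedback_loop`, and the two callbacks are hypothetical names, and the per-round cost is a made-up stub.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    fixed: bool       # did the test suite pass before a limit was hit?
    rounds: int       # feedback rounds actually executed
    spent_usd: float  # total LLM spend across those rounds


def run_feedback_loop(
    apply_feedback: Callable[[int], float],
    tests_pass: Callable[[], bool],
    budget_usd: float,
    max_rounds: int,
) -> LoopResult:
    """Repeat 'apply feedback, then run tests' until tests pass or a limit is hit.

    apply_feedback(round_no) stands in for one round of agent updates,
    including any conflict resolution, and returns that round's cost in USD.
    """
    spent = 0.0
    for round_no in range(1, max_rounds + 1):
        spent += apply_feedback(round_no)
        if tests_pass():
            return LoopResult(True, round_no, spent)
        if spent >= budget_usd:
            # Budget exhausted without a passing suite: stop and escalate to a human.
            return LoopResult(False, round_no, spent)
    return LoopResult(False, max_rounds, spent)


# Stubbed run (hypothetical numbers): every round costs $25 and the tests
# never pass, so the loop halts once the $300 budget is used up.
result = run_feedback_loop(lambda n: 25.0, lambda: False, budget_usd=300.0, max_rounds=50)
print(result.fixed, result.rounds, result.spent_usd)
```

Capping both rounds and spend turns an open-ended self-healing attempt into a bounded experiment whose failure is reported rather than discovered on the invoice.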

Siva Surendira 28:10 And that's where the opportunity is. I see that organizations will end up choosing multiple frameworks, multiple technologies for various use cases. There's not going to be one thing that fits all. Lyzr, like I said, is going after the deep end of the problem. We are starting there because if we can solve that, obviously we can solve everything else. But the n8ns and

Siva Surendira 28:38 make.com will be very prominent, if you ask me, because that gives power to a lot more business users to build automations on the fly. So, yeah, that's the world that we see evolving over a period of time.

Conor Bronsdon 28:48 Yeah, it definitely feels like there is this amazing opportunity for self-learning or self-healing agents, however you want to define them. But creating that continuous learning loop is still challenging without human feedback right now, because often they'll veer off in the wrong direction and the actual result may not be complete. And that's something we've done with our own metrics here around agents: tracking

Conor Bronsdon 29:15 things like action completion: did the agent actually finish what you needed it to do? And sometimes the answer is no. So it's always interesting to have that kind of visibility, and I would argue it's crucial to enable an observability layer and start to set up those metrics so you can effectively solve these self-healing and continuous learning challenges. How are you approaching

Conor Bronsdon 29:41 evaluation with agents today?

Siva Surendira 29:44 Yes. Very good point. This opens up another major trend that we are seeing: the usage of local open-source models versus third-party frontier models like GPT and Claude.

Conor Bronsdon 29:59 Absolutely. Much cheaper in many cases. That's for sure. Much cheaper.

Siva Surendira 30:03 And it gives an enterprise the option to run thousands of loops if you want, thousands of experiments, and, more importantly, the data still remains within your organization. And your cost of experimentation is like 100 times cheaper compared to running it on the GPTs or Claudes of the world. We're seeing this big shift happening. For example, we're working very closely with Amazon,

Siva Surendira 30:29 the AWS team, the Bedrock team, where NVIDIA NIM models and Meta's Llama models are the number one and number two models preferred by banks and insurers at this point in time when it comes to local models. And enterprises are also realizing that the local models are the ones that will be their IP as they move forward, because you continue to fine-tune them and they become your IP, which is not going to be available with frontier models.

Siva Surendira 30:56 Or you don't want to share your data with frontier models to fine-tune those models, actually. These two aspects are what allow the eval part to come out really well. There are two types of evals here. The first is, obviously, accuracy, groundedness, and the hallucination index: evaluating that part of an agent. The second is test cases, the entire business test case itself, because all these agents are solving a business problem. How well are they solving the business problem? That is the second type of eval.
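The two types of evals Siva separates here can be sketched as two independent gates that must both pass before an agent is promoted. This is an illustration under stated assumptions: the class names, the thresholds, and the stub refund agent are all hypothetical, and the quality scores are assumed to come from whatever eval tooling is already in use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QualityScores:
    accuracy: float       # 0..1, higher is better
    groundedness: float   # 0..1, higher is better
    hallucination: float  # 0..1, lower is better


@dataclass
class BusinessCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # does the agent's output solve the business problem?


def quality_ok(s: QualityScores, min_accuracy: float = 0.9,
               min_groundedness: float = 0.9, max_hallucination: float = 0.05) -> bool:
    """First eval type: agent-level metrics against (hypothetical) thresholds."""
    return (s.accuracy >= min_accuracy
            and s.groundedness >= min_groundedness
            and s.hallucination <= max_hallucination)


def business_failures(agent: Callable[[str], str], cases: List[BusinessCase]) -> List[str]:
    """Second eval type: run every business test case, return the names that fail."""
    return [c.name for c in cases if not c.check(agent(c.prompt))]


def ready_for_production(scores: QualityScores, agent: Callable[[str], str],
                         cases: List[BusinessCase]) -> bool:
    return quality_ok(scores) and not business_failures(agent, cases)


# Stub agent and a single business case, for illustration only.
def refund_agent(prompt: str) -> str:
    return "Refund approved: $10"


cases = [BusinessCase("small refund stays small", "Refund order 1",
                      lambda out: "$10" in out and "$100,000" not in out)]
print(ready_for_production(QualityScores(0.95, 0.93, 0.01), refund_agent, cases))
```

Keeping the gates separate makes it clear which kind of failure blocked a release: a drop in model-level quality or a business rule the agent no longer satisfies.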

Siva Surendira 31:28 And the cost of experimentation comes down when you use local models, number one. This actually strengthens the case for companies like us, because we want to push agents to production. We want the customer to get comfortable running as many evals as possible, both at the agent-accuracy level and at the business level, so that they can take it to production with higher confidence. It's going to be an evolving and most important aspect as we move forward, because without evals, you are not going to push to production tomorrow. You don't want to be that developer or engineering manager who pushes something, then something crashes, and you get questioned and even fired. You don't want that.

Conor Bronsdon 32:09 That relates directly to something we just released with Galileo, which is our updated agent leaderboard, looking at different datasets for different LLMs, open-source versus private models, looking at which ones are most effective for agents, and particularly most effective in consequential use cases like healthcare, finance, and banking, but across a broad set with multiple metrics, multiple datasets, and multiple ways to slice and dice that. And I think it's really important that we continue to have open conversations about this, because otherwise it's so easy to default to,

Conor Bronsdon 32:46 you know, a Claude 3.7 Sonnet, which is a great model, a fantastic reasoning model that does a lot of stuff really well. But if you're doing thousands or tens of thousands or hundreds of thousands of calls to it in a couple of days, it gets really expensive. And so it's important to actually identify what your use case is, how intensive it's going to be, and figure out: do you need a custom-trained SLM?

Conor Bronsdon 33:12 Do you need that Frontier LLM because you're trying to push through and do some really heavy reasoning? Is an open source model maybe the right approach for you? It can really depend, and I appreciate that nuance that you're bringing to this because, particularly what we've seen in our research, and I'll link the Hugging Face leaderboard we have up. If folks haven't seen it, definitely worth checking out.

Conor Bronsdon 33:33 There is just such a variety in which models perform well, depending on which circumstances. And especially as we get to the customization piece that you're talking about with enterprise, it's so crucial that in order to build trust, in order to make them cost effective at scale, you really think through those architectural decisions and don't just default to, oh, this is the model that I expect because,

Conor Bronsdon 33:55 you know, Gemini 2.5 Flash came out, it's so much more powerful. I should just go use this.

Siva Surendira 34:00 It is. And it's good that the benchmark report is coming out from your end, because that will help organizations choose wisely. They will not end up blaming the LLM; rather, they will know that it is their design, and not the LLM, that is to blame. And interestingly, the way I put it to our enterprise customers is, hey, look at eval platforms like Galileo, for example, as the antivirus

Siva Surendira 34:25 equivalent for your system, because you need to have these checks and balances. If you don't have them, there is no way for you to explain when something goes absolutely wrong. Now, obviously, enterprises do ask us, okay, but you guys have safe and responsible AI built in. And we do cover the core, the basic and very important aspects of it. But think of a MacBook having antivirus software on top, which is good. The MacBook is secure by nature,

Siva Surendira 34:53 but having an additional antivirus doesn't hurt you, because it's going to catch things that the core agent probably doesn't catch, which is exactly the kind of architecture that enterprises are looking at. They don't want anything to be taken for granted. We are working with another very large organization automating their marketing functions all the way from blog post generation, ebook generation,

Siva Surendira 35:20 white paper writing, etcetera. It's a very custom mega workflow that they are building. A smaller SMB will not think twice. They will build something with ChatGPT, copy-paste, etcetera, and move on. A large enterprise cannot do that. They need their keywords to be there. They cannot allow competitor names to be there. All of these aspects come into the picture. So in that scenario,

Siva Surendira 35:43 they are better off having a two-layered approach. Okay, Lyzr handles some of it, and then Galileo handles the rest on top of it. So the architecture is designed better. And in addition to that, like you mentioned, if they have the right LLMs to choose from for the right workloads, and if they go and fine-tune some of them, it just increases the overall,

Siva Surendira 36:07 I would say, safety and security of the output, and the quality of the output also increases a lot. So this is the architecture that we recommend as of now, and we'll have to wait and see how it moves from here on. But definitely, evals are here. Evals are going to stay. Evals are how AI engineers can keep their jobs safe, because you have explainability at the end of the day. You can go and explain

Siva Surendira 36:34 what happened, why it happened with eval logs, basically.

Conor Bronsdon 36:37 Well, I don't think there's a better point to end it than on that wonderful note. For folks who are listening, you can absolutely find lyzr.ai's entire suite on their website. They've got some amazing stuff on there. And I have to say, Siva's a great LinkedIn follow as well. So highly recommend it. I will check it out. We'll link that in the show notes. And,

Conor Bronsdon 36:57 well, as Siva has said here, evals are crucial. They'll help you keep your job. They'll help you get your next job. So maybe check out galileo.ai. I'll take the shameless plug. Let's do it. Maybe it is.

Siva Surendira 37:09 We've been following Galileo AI for a long time, and I really like what you guys do. You're one of the three stacks I always recommend to our enterprise customers as well when they think about evals. So, yes, I would recommend the audience check it out, and you can integrate and play around and try and break it. Right? That's the best way. Yes. Try and break it and see, and get your confidence going that way.

Conor Bronsdon 37:38 I mean, with this kind of support, you're welcome back anytime, Siva. It's been fantastic having you on. Thank you so much for coming on the show. Thanks, Conor. Thanks for having me. Awesome. Well, we will include links to everything in the show notes. And as always, listeners, watchers, if you're watching on YouTube, thank you so much for listening to Chain of Thought. If you are only listening, you really should consider checking out our YouTube. There's so much more content there, including great how-tos, more webinars, and so much more,

Conor Bronsdon 38:02 over on the Galileo YouTube channel. Episodes drop every Wednesday at the same time they drop in your podcast feed. And if you prefer this audio format and you're just enjoying Spotify or Apple Podcasts or wherever you happen to get your podcasts, you know what? Drop a comment. Say hi on LinkedIn. Drop a comment in the Spotify feed, whatever floats your boat. We just love hearing from our listeners,

Conor Bronsdon 38:23 and we couldn't thank you enough for everything you do for us. It is so much fun creating this show for y'all and having incredible guests like Siva on. So, Siva, thank you so much for today. It's been a distinct pleasure, and that's all for this week, everyone.

Siva Surendira 38:36 Thanks, Conor. Take care.