S2 E14

Using AI to Modernize Your Legacy Applications | MongoDB’s Rachelle Palmer

Mar 12, 2025 · Rachelle Palmer, MongoDB · 44 min

AI Evaluation & Reliability · AI Engineering · AI Coding

Key takeaways

Rachelle runs modernization “incubators” for organizations captive to legacy code — teams that outsourced “the care and feeding” of their codebase to contractors and now find themselves “paying millions of dollars for the maintenance of their own applications” with no in-house knowledge left.
An 80/20 split governs the work: LLMs handle roughly 80% (documentation, comments, equivalence tests, business-logic conversion) while ~20% stays “purely manual” — and the LLM share runs on a machine timescale of “minutes or hours or days” instead of the months or years a manual engineering effort would take.
ROI is measured by old-fashioned benchmarking: a senior or staff engineer does the task manually first, then the same task is run with AI. Modernizations that would take “the magnitude of five plus years” collapse to “months or a year,” and “the team is five people instead of 40.”
High-risk domains demand a different bar. “None of us are doctors… it could say you have tuberculosis and I don’t know” — so you pair engineering with subject-matter experts to proof outputs, fine-tune on domain data, and run “repair loops where we have an LLM play different roles and evaluate the outputs.”
Rachelle’s team built a “chain of repair”: generated code is scored against a custom eval metric and looped — “fix the code, make it better” — until it clears the bar. “No human looks at the code until it hits that score,” and a human only intervenes after a set number of failed iterations.
Forward deployed AI engineering (a model “coined by Palantir”) puts engineers on the ground to build “whatever is useful and works right then as quickly as possible.” It follows a “steel thread” — bones, then meat, then skin, then hair — rather than a PM’s vision of the “fully fledged person two years later.”
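
The "chain of repair" takeaway above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the team's actual implementation: `score_fn` and `repair_fn` are hypothetical stand-ins for the custom eval metric and the "fix the code, make it better" LLM call, and the `pass_score` and `max_iterations` defaults are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RepairResult:
    code: str
    score: float
    iterations: int
    needs_human: bool  # True when the loop gave up before reaching the bar


def chain_of_repair(
    code: str,
    score_fn: Callable[[str], float],        # hypothetical custom eval metric
    repair_fn: Callable[[str, float], str],  # hypothetical LLM repair call
    pass_score: float = 0.9,                 # assumed quality bar
    max_iterations: int = 5,                 # assumed cap before a person steps in
) -> RepairResult:
    """Score the code, ask for repairs until it clears the bar or the cap is hit."""
    score = score_fn(code)
    iterations = 0
    while score < pass_score and iterations < max_iterations:
        code = repair_fn(code, score)
        score = score_fn(code)
        iterations += 1
    # Only code that never reached the bar is flagged for human attention.
    return RepairResult(code, score, iterations, needs_human=score < pass_score)
```

A caller would forward the result to testing only when `needs_human` is false, which matches the description that no person looks at the code until it hits the score.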

Frequently asked questions

How much faster does AI-powered modernization actually make this work?: Modernizations that conventionally run “on the magnitude of five plus years” compress to “months or a year,” executed by “five people instead of 40” — a return on investment Rachelle calls “really, really insane.”
What parts of legacy modernization can LLMs actually handle?: On a million-line codebase, LLMs map the vertical slices of functionality, write missing documentation and comments, and generate equivalence tests to prove a replacement matches the legacy app. That covers about 80% of the work; the remaining ~20% stays purely manual.
What is forward deployed AI engineering?: A model coined by Palantir: rather than a product manager scoping a feature from afar and handing over a “fully fledged person two years later,” you embed engineers with the customer to build whatever is useful and works right then, as fast as possible — starting from a “steel thread” of core functionality and building outward.
How does the “chain of repair” keep AI-generated code quality high?: Generated code is scored against a custom evaluation metric and looped — “fix the code, make it better” — until it clears a set score. No human reviews it until it passes; only after a fixed number of failed iterations does a person step in. Code is never sent to test until it has already cleared the quality bar.
Why should engineers read academic research before building with AI?: Unlike engineers, “the academic community documents everything,” so the field already records how LLMs behave and what works. A day of reading saves lessons learned — for example, papers may show a task’s practical ceiling is around 75% accuracy, so there is no point chasing 100% and the rest will be manual.

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Benchmark · AI Alignment · AI Hallucination · Latency

Chapters

00:00 Introduction and Host Welcome
00:58 Challenges in Modernizing Legacy Applications
02:52 Real-World Examples of Code Modernization
04:00 The Role of LLMs in Code Modernization
08:01 Measuring Success in AI-Powered Modernization
12:28 The Future of AI in Engineering
16:17 Evaluating Modernization Success
21:12 Returning to Your Startup Roots
29:07 Forward Deployed AI Engineers
35:36 Importance of Academic Research in AI
42:10 Conclusion and Farewell

Show notes

Imagine cutting your legacy code modernization timeline from years to months. It’s no longer science fiction and this week’s guest is here to tell us how.

Rachelle Palmer, Director of Product Management at MongoDB, joins hosts Conor Bronsdon and Atindriyo Sanyal for a discussion on the groundbreaking ways AI is modernizing legacy applications.

At MongoDB, Rachelle's forward-deployed AI engineering team is tackling the challenge of transforming complex, outdated codebases, freeing developers from technical debt. She details how LLMs are automating tasks like improving documentation, test generation, and even business logic conversion, dramatically reducing modernization timelines from years to months. What once demanded teams of dozens can now be achieved with a small, highly efficient team.

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Follow Today's Guest(s)

Rachelle Palmer
MongoDB Application Modernization Factory

Check out Galileo

Try Galileo

Transcript

Speaker 0:00 Even if you believe that the output of an LLM is not perfect, I have bad news for everybody. Your output's not perfect either. And so let's not hold the technology to standards that we ourselves, our flawed human selves, can't even meet.

Conor Bronsdon 0:22 We are back on Chain of Thought. I am your host, Conor Bronsdon, and back today with me is my cohost, Atindriyo Sanyal, CTO and co-founder at Galileo. Atin, great to jump back in with you once again. Yeah. Great to be here again. And we are delighted to be joined by Rachelle Palmer, director of product management at MongoDB and running a forward deployed AI engineering team, which is a weird sounding role, but is really cool, and we're gonna talk all about that. Rachelle, welcome to the show. Thank you. I'm excited to be here and chat about AI.

Conor Bronsdon 0:56 It's gonna be a lot of fun. Thanks so much for joining us. Rachelle, you lead a team focused on a fascinating challenge, using AI to modernize legacy applications, making life easier for developers who are wrestling with decades-old code bases. And when we were chatting before the episode, I got the sense that the work your team performs is almost like a public service. It's so beneficial for teams.

Conor Bronsdon 1:18 Let's start there. What sort of challenges are you facing and overcoming with your team?

Speaker 1:26 It really depends on the organization that we are doing these in. We call them incubators. You might be looking at an account team, organizational team, because sometimes they call them project teams. Sometimes they have special interest teams that maybe had a domain specific language that they built their application in twenty years ago, and now they have nobody who knows the app. You might be looking at a team that's sort of looking to move from an old front end to a new front end. You might have them using an old Java framework, and they wanna move to a newer Java

Speaker 2:00 framework like Quarkus, or they've only used Spring because they wanna use Spring AI, but they committed to Hibernate two decades ago. So it really depends. It depends on the tech stack. It depends on the organization. We find actually that a lot of the time they have outsourced the care and feeding, if you will, of their code base to external contractors, so they don't actually have knowledge anymore or maintain their code base anymore. Then they're sort of captive to paying millions of dollars for the maintenance of their own applications. And they wanna get away from it, but they don't know how.

Speaker 2:39 And so that's where we kind of start when we conduct this incubator.

Conor Bronsdon 2:47 Atin, I see you nodding along. It sounds like this is a pain you felt before.

Speaker 2:52 Oh, absolutely. If I turn time back to some of the earlier days at Uber when I was there, we moved the entire code base from Python, which was the prototypical language that we had chosen to build the earlier versions of Uber all the way to Golang, and eventually it became the largest Go shop in the world. And for obvious reasons, it was just a lot tighter on the wire, and it was just

Speaker 3:16 way faster, and it reduced tech debt. And this is a problem we've all faced in every organization, in every team: the challenge of tech debt and also keeping, you know, up with the times with modern languages. And, like, this is an era of Rust and Go, for legit reasons: they run really tight on the wire. They're optimized better for CPU than, say, erstwhile

Speaker 3:41 you know, frameworks around Java, which over the years, as systems have scaled, have shown major issues like tail latency problems. And you can only improve a language linearly, but you can get exponential benefits if you can move from one language to a more modern one quickly. And combined with the fact that LLMs have proven their worth

Speaker 4:05 quite a bit on the coding side of things. It's one of the areas where it has truly shown its value. So it's a very exciting application, Rachelle, that you're sort of dealing with because personally, it's one of my favorites because it allows your engineering tech stack to cope with the times without the investment of, you know, your best engineers and your best teams for over a year to build something, you know, in a cave and then do a prototype to the CEO. They're like, alright. We're gonna move the entire ship to this new ship.

Speaker 4:36 So personally, it's one of my favorite applications. Yeah. I think it's really exciting. And to be clear, like, the example I'm about to give, it's not one of our incubators. But a few years ago, you know, Coinbase moved from Ruby to Golang, and I remember watching that from afar and just chewing my nails, being like, Flynn, Flynn, do not lose all my Bitcoin and my Ripple and my Ethereum

Speaker 5:01 in your move to Golang. And I think, you know, watching that happen, it's really easy to stand here and sort of retrospectively be like, why would you ever choose that technology? But you have to remember that what's you know, you choose a technology when you're first getting started. You don't know if your application is going to have five users or 5,000,000,

Speaker 5:22 and so you're just trying to get something out the door. Maybe at the time, we made the best choice at the time, but then as technology evolves, it's no longer the right fit for you, but there's so much work to do to move from a front end, from the middleware, what I call a middleware layer, which is like your frameworks, your ORMs, your whatever,

Speaker 5:44 and then also your database layer, in addition to all of the things that access your APIs, your business intelligence and data warehouse layer. It's actually such a huge amount of effort, and not many of us have the time or the talent, the number of engineers you would have to have to make that successful. Yeah. Absolutely. You know, I think this is one of the most underrated problems

Speaker 6:08 for any company because the zero to 10 of a company and the 10 to 100, they're very different companies. And the technical choices that you make to go from zero to 10 are very different from 10 to 100. Most companies get stuck in this chasm of how do we move brick by brick from the zero to 10 set of decisions that we made to the new world where we need to scale our application or scale our business. And a lot of business opportunity

Speaker 6:36 is lost in this chasm. So LLMs can bridge the gap pretty well. So I'm very optimistic and very excited about, you know, the initiatives you're taking, Rachelle. For sure. And I think, you know, last week, I was in London for a series of meetings, and we were talking about this. And someone said to me sort of off the cuff, like, wow. You sound like you're really drinking the Kool-Aid. And I was like, I'm more than drinking the Kool-Aid. I am a Kool-Aid mermaid right now because I can see,

Speaker 7:03 you know, they're not perfect, but certainly the capabilities of an LLM in the hands of senior engineers is nothing short of awe inspiring. And I feel like a lot of us, when we look at what's possible, we're seeing sort of this zero to one, right, like you said, which is like generate a MERN app. And that's great, but that's actually not what most senior engineers need

Speaker 7:27 and not what most organizations even need. And so, like, I just think it's really exciting what their capabilities are when you have an experienced engineering team and you allow them to use LLMs,

Conor Bronsdon 7:40 actually. It's really clear, I think, to anyone who's been in engineering for a while, particularly if you've built engineering teams in multiple places, that this is a major challenge, this modernization. You both have highlighted varieties of that challenge, different constraints we work under. But now we have this huge opportunity with LLMs to accelerate this process.

Conor Bronsdon 8:01 Rachelle, I'd love to understand more about how you're leveraging AI in the modernization process.

Speaker 8:07 I think it's best with some examples of the kind of things that we have done. For instance, if you have a really old code base that is massive, so I'm talking about a million lines of code, one of the problems that you're probably going to face is that you don't have good documentation on said code base. And when you try to onboard new engineers, nobody knows where anything is. Like no one knows where the vertical slices of functionality in that code base are. Okay, that's a thing that can be solved by LLMs and analysis.

Speaker 8:39 Another example is your code's not commented, right? You can just have LLMs add comments to code. Another example is that you don't have test coverage of your code base. And so for us, in order to prove equivalence between a legacy application and a replacement application, you actually need to generate effectively equivalent tests between your legacy app and your new modernized app.

Speaker 9:05 That can be done with LLMs instead of with human beings. Now, I say instead, and I mean that kind of loosely because I do believe the eighty twenty rule applies here, which is that you can get LLMs to assist with 80%, but you still have this remaining 20% that's like purely manual effort. That being said, the 80% being done by LLMs is not on a human time scale, right? So the 80% lift done by LLMs can be done in minutes or hours or days, what would take an

Speaker 9:34 engineering team's manual effort months or even years. So those are just some examples of things that we have done. Certainly other areas that are interesting are like sort of transforming the front end, as well as being able to do, like, conversion of business logic into code, whether that's documented business logic or even actual coded business logic, for sure.
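
The equivalence testing described here can be illustrated with a small harness. This is a minimal sketch under stated assumptions, not the tooling used in the incubators: `find_mismatches`, `legacy_fee`, and `modern_fee` are hypothetical names, and the fee rule is invented purely to have two implementations to compare.

```python
from typing import Any, Callable, Iterable, List, Tuple


def _outcome(fn: Callable[..., Any], args: Tuple[Any, ...]) -> Tuple[str, Any]:
    """Return ("ok", value) or ("error", exception name) so failures are compared too."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_mismatches(
    legacy: Callable[..., Any],
    modern: Callable[..., Any],
    cases: Iterable[Tuple[Any, ...]],
) -> List[Tuple[Tuple[Any, ...], Any, Any]]:
    """Run both implementations on every case and collect any disagreements."""
    mismatches = []
    for args in cases:
        expected = _outcome(legacy, args)
        actual = _outcome(modern, args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


# Hypothetical legacy rule: 3% fee with a 50-cent minimum.
def legacy_fee(amount_cents: int) -> int:
    fee = amount_cents * 3 // 100
    if fee < 50:
        fee = 50
    return fee


# Hypothetical rewrite of the same rule.
def modern_fee(amount_cents: int) -> int:
    return max(50, amount_cents * 3 // 100)
```

An empty list from `find_mismatches(legacy_fee, modern_fee, cases)` means the two agree on every supplied case. It says nothing about inputs outside `cases`, which is why generating broad test inputs matters.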

Conor Bronsdon 10:01 Totally. And, Atin, I see you nodding along here to, like, everything Rachelle's saying. It seems like you're seeing similar opportunities to leverage both AI coding tools and agents in these kinds of processes.

Speaker 10:14 Yeah. Absolutely. I've seen, I think, pretty much what Rachelle said in other enterprises as well. One interesting thing I've observed also, sort of connecting the dots, watching different advanced sort of use cases across different kinds of businesses, is number one, the immersive experience that LLMs are providing in terms of writing code and code understanding, like Rachelle was saying,

Speaker 10:40 where you have modern IDEs where, while you're in the process of writing code, you can essentially get rid of the 80% boilerplate, like doing left joins and, you know, basic for loops. That kind of stuff can pretty much, I think, be fully automated out, where you can literally write 80% of the code through prompts, and that leaves the 20% where you can focus on more abstract things like design.

Speaker 11:06 But one other thing I've noticed is the application of agents, and this is a little more recent, but that's also equally exciting, where you're actually trying to automate workflows using agents around code generation and code understanding. For example, you can easily code up a GitHub bot that would automatically write unit tests for you,

Speaker 11:28 and because unit test writing is a little less abstract, it's more deterministic, and there are fewer, you know, deviations in the patterns that developers have historically applied to writing tests, that is pretty exciting where you can get pretty much a lot of test coverage out of the box, which obviously enhances the quality of your applications. But also you can get advanced code reviews done. This is kind of like taking code understanding to the next level where a developer puts a PR out and you automatically get

Speaker 12:03 comments and suggestions on what you can change. And I've seen these couple of more agentic applications emerge in the recent past, but this will only get more and more advanced as newer types of agents come into the mix and more complex agentic architectures come. So there's a lot to sort of tap into the potential of LLMs. Yeah, totally. And if I could just whinge on that for a second. Like,

Speaker 12:32 in a way, right now, this technology is so new, so a lot of engineers don't really understand well how it works, and business folks, legal folks, engineers as well, are a little bit uncomfortable with it. But as somebody who works in this environment all the time, I don't see it as being terribly different than Dependabot, right? Dependabot looks for security vulnerabilities inside of your repository and it suggests changes to you, and most of us just click okay and merge,

Speaker 13:03 Right? That's your new world that we're gonna be looking at with AI, I think. We're a little far from it, maybe. Maybe we're still a little far from that future world, but I don't think we're years out from that being how it is by default, actually. I absolutely agree. We're not years out at all. Pretty similar to Dependabot, like you said. It's a little more advanced kind of a Dependabot, which Right. Which has a brain of its own and has

Speaker 13:32 more understanding of the surrounding context. I think we lose a lot of efficiency in coding. Like you said, a new developer comes and just the understanding of the certain vertical business logic is missing. And I think LLMs can automatically gain a lot of that. So we're not years out at all. In fact, I wouldn't be surprised if within this year we have some very good high quality integrations with, you know, IDEs like Cursor and GitHub Actions

Speaker 14:03 especially with, you know, it's getting easier and easier to create agents. So we will see a lot more innovation on this front this year.

Conor Bronsdon 14:13 For sure. I think that's true as well. And it's not just agents either. You both have referenced the IDE opportunity. And as many folks have probably seen from Latent Space lately, evidently Cursor is the fastest ever SaaS to grow from $1 million to $100 million ARR. They have a small team. They really just built an incredible product where a developer can leverage it to rapidly speed up what they're doing.

Conor Bronsdon 14:37 And I'm curious because I think one of the challenges here is how do you then measure the success of what you're doing? Like, okay, great. You're shipping more code, but how are you tracking the success of your AI powered modernization efforts? What's the approach you're taking, Rachelle?

Speaker 14:55 Level of effort, and time. Right? So we do a kind of an old fashioned scoping exercise where we say, like, okay, what are we doing? Here's the task. How long would this take a human manually to do? And sometimes we'll even do the task just to benchmark ourselves and we don't throw an intern at it. We have a senior or a staff engineer do the task. Maybe even in some ways it's a little unfair. And then we do the same task again with AI, right? And so the results on that are crazy, by the way.

Speaker 15:24 And even if you believe that the output of an LLM is not perfect, I have bad news for everybody, your output's not perfect either. And so let's not hold the technology to standards that we ourselves, our flawed human selves can't even meet. In which case, we now have an effective official benchmark per task level. And so across many tasks, the list of tasks that you would have to perform to actually modernize an application, that number

Speaker 15:55 just becomes really, really insane. So you're looking at the ability to modernize legacy applications in months or a year, whereas normally it would take you on the magnitude of five plus years. And the team is five people instead of 40. And so that is really, really a large return on investment, honestly.
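
As a rough worked example of the figures just quoted, the effort can be compared in person-years. This is only back-of-the-envelope arithmetic, assuming "five plus years" is read as 5 and "months or a year" as 1; `person_years` is a hypothetical helper, and real projects will not scale this cleanly.

```python
def person_years(team_size: int, years: float) -> float:
    """Total effort as headcount multiplied by duration."""
    return team_size * years


# Figures quoted in the conversation, read at their simplest.
conventional = person_years(team_size=40, years=5)
ai_assisted = person_years(team_size=5, years=1)

print(conventional)                # 200
print(ai_assisted)                 # 5
print(conventional / ai_assisted)  # 40.0
```

Under those assumptions the AI-assisted effort is about one fortieth of the conventional one, which is the scale of return being described.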

Conor Bronsdon 16:17 Atin, how does this correlate with your approach to evaluating success with AI applications, particularly around modernization, etcetera?

Speaker 16:25 You know, I think the reason why LLMs have shown a lot of success on the coding front is simply because code is a much narrower sort of problem space than, say, natural languages. Anyone who studied theory of finite automata would say that it's a finite state machine. There's a limited set of things that you can do, there's for loops, while loops, you can list them out, there's a couple dozen things that you can do in code, and that's really it, and all of the code in the world is some manifestation of some of these concepts,

Speaker 17:00 so that's the reason why we've seen so much impact that an LLM can have, and this combined with the fact that Microsoft Research had published the paper "Textbooks Are All You Need," which showed just the high quality of LLM output that you can achieve if you train it on high quality data, which they did with Stack Overflow code, etcetera, and there's a lot of open source repos out there. So that's the reason why it's successful.

Speaker 17:24 Now coming to sort of applicability, I think number one, it comes down to kind of what, you know, if you take the examples of Cursor and Windsurf, what they've done is they've met the user where they were prior to this whole revolution, which is in the IDE. So you're not changing user behavior that much, and instead kind of providing a more immersive experience

Speaker 17:47 in an existing, in a familiar setting. And this principle should apply, I think, to anyone who's building innovative applications with LLMs and trying to up the ante on the quality of some user experience with LLMs. They need to meet the user where they are. So that's where I think these companies like Cursor, etcetera, have done really well. I think to take it to the next level,

Speaker 18:14 it comes down to good evaluations because, like Rachelle said, humans are flawed, these LLMs are flawed, they hallucinate, they throw wrong output. And in the context of coding, the line is very thin between good and bad. You know, a little hallucination, you know, a missing import statement can cause the entire system to fail. Right? Whereas in language, you know, the lines are kind of blurred. Here, the lines are very thin and very, very distinct.

Speaker 18:44 So you need robust evaluation. You need human intervention, but also, I think this is kind of our Galileo territory, where we've been tasked with, you know, the responsibility of figuring out the Six Sigma evaluation experience. And when it comes to coding, it really comes down to customization because no one metric will solve your problem. It depends on what you're building, what you're coding.

Speaker 19:11 So you want to provide a layer of customization where you are allowed to build a metric that will work for you, and then the metric itself will evolve as your application evolves. That is, I think, one of the fundamental things. And in light of that, I guess, to do a shameless plug here, I guess what Galileo kind of offers is this ability for a developer in this case to

Speaker 19:36 write their own metric. Like they want to build a code quality metric and here are their five criteria, and they define them in natural language, and it goes through our system and out pops a metric on the other side, which measures the quality of their code, and they are allowed to validate it and centralize it and share it with others. This evaluation workflow is critical to eliminating

Speaker 20:00 any flaws that an LLM can output in terms of code. Yeah. No. And I think actually, I think in some ways it makes engineers in a better position than, like, many businesses who wanna use AI. And the reason why is because I do think some of that subjectivity is removed. Right? Like, in the end, the code either builds or it doesn't, and it either passes your test or it doesn't.

Speaker 20:25 And there's this extra layer of like, well, maybe we have coding conventions, and like, some organizations care a lot about performance. Some of them it's like, meh. So there's this area of quite a lot of leeway and the ability to define your own snowflake, if you will. But in the end, for engineers at least, it either builds or it doesn't and it passes tests or it doesn't. If it doesn't, then you can just tell the LLM to

Speaker 20:56 correct it until it does, which is also nice. That's a good point you bring up about the coding style and each organization has its own sort of standards that they define and certainly LLMs can take that standard into consideration.
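
The mix of objective gates (it builds or it doesn't) and organization-specific conventions discussed here can be sketched as a small custom metric. This is an illustrative assumption, not Galileo's product or API: `Criterion`, `builds`, and `score_code` are hypothetical names, and the example checks Python source only because Python's built-in `compile` makes the "does it build" gate easy to show.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # True when the code satisfies the criterion
    weight: float = 1.0
    hard_gate: bool = False       # a failed hard gate forces the score to 0.0


def builds(code: str) -> bool:
    """Objective gate: does the source at least compile as Python?"""
    try:
        compile(code, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def score_code(code: str, criteria: List[Criterion]) -> float:
    """Weighted share of soft criteria met, or 0.0 if any hard gate fails."""
    soft_total = 0.0
    soft_met = 0.0
    for criterion in criteria:
        passed = criterion.check(code)
        if criterion.hard_gate:
            if not passed:
                return 0.0
        else:
            soft_total += criterion.weight
            if passed:
                soft_met += criterion.weight
    return soft_met / soft_total if soft_total else 1.0


# Hypothetical house rules: one hard gate plus two weighted conventions.
criteria = [
    Criterion("builds", builds, hard_gate=True),
    Criterion("has_docstring", lambda src: '"""' in src, weight=2.0),
    Criterion("no_print_calls", lambda src: "print(" not in src, weight=1.0),
]
```

Each organization would swap in its own criteria and weights. The hard gate captures the point that some checks are not a matter of taste, while the weighted part leaves room for the "snowflake" conventions.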

Conor Bronsdon 21:13 Rachelle, you alluded earlier in the conversation to this idea of AI enabling teams to go back to their startup roots and how that's particularly something that you're applying as a mentality within MongoDB. Can you expand on what you mean by this and how you see AI facilitating this and other changes for your team?

Speaker 21:33 Firstly, like, when you're a startup, it's anybody who can do the work, please do the work because there's so much work to be done. And then when you get to be a very large organization, you get into kind of this hyper specialization where everybody has their own little square of turf and they own the thing, and where it becomes challenging to move quickly is that you have to consult everybody's square, right? You're the United States, only there's a thousand states and each one of them has to say yes before you can build anything.

Speaker 22:04 And one of the things that's quite frankly awesome, I want to curse here because it's like, this is needed, is that the reason we have all of these processes, we have program management that makes us write a PRD, they make us write a scope, they make us write functional requirements and technical designs, is because you have to get buy in from all of the states.

Speaker 22:28 And you have to give everybody ample opportunity to understand what it is you're going to build, and you have to give them ample opportunity to say no to whatever you want to build because maybe there's some other need that might be more important that might arise sometime. And so we impose all these processes around our engineering teams. And the reality is, is that in a new world with AI,

Speaker 22:53 why do you need all that if you can write a whole application in two days? And if you don't like it, we can throw it away and we'll write a new one. And I actually believe even if you wanna keep all the same processes, you can also use AI with those processes. So on my team, we've used AI to write a PRD in minutes, as opposed to weeks. We've used it to write functional requirements and technical designs, and then evaluate our technical designs, and then change those technical designs based on things we as humans might even forget.

Speaker 23:26 And so all of that is just so much faster, and that just assumes you even want to keep the same processes, which I'm not sure that you do, but we're not yet in the era of figuring out what processes do we need to facilitate this level of speed that we're now capable of generating code and generating tests and generating new features so quickly. The processes that we have grown up with as engineers are

Speaker 23:55 so, I think we would all agree they're soul crushingly slow anyway, but they are definitely soul crushingly slow in the era of AI, in which case we need a whole rethink about how processes facilitate engineering work in this new mental model and working model.

Conor Bronsdon 24:17 I completely agree. And I see Atin wants to chime in here as well, but it's such a huge opportunity to speed up our experimentation and iteration process. And I mean, we've talked about agile and engineering for decades now, but like we can be so much more agile because we're being enabled with these tools and it's like, we'll just try more things.

Speaker 24:35 Yeah. Absolutely. I mean, I agree with you both, and just to add this one additional point, coming back to the earlier example I used of code migration from Python to Go at Uber. Like one of the reasons why that was really successful was not because a thousand, you know, opinion docs were written about, you know, the advantages of goroutines and multithreading.

Speaker 24:57 It was simply because we prototyped a version of Uber on Go and showed 10x improvement. Right? Proof is in the pudding and you can go a lot further if you just show the proof. So I've personally been moving our engineering at Galileo to this prototype driven model where, you know, stop writing docs, just build it in two hours, just build an MVP, just show what you're trying to say, and show me the numbers

Speaker 25:23 and we'll move to that. Because at the end of the day, why do we write docs? Like they're a means to assisting what the eventual end is, which is the code and the application. So anything that can get us to those decisions on why we wanna do it, nothing better than a quick prototype, and LLMs can really disrupt that. Yeah. And I think what's really amazing, like, if you're sitting where I'm sitting at, you're looking at potentially,

Speaker 25:51 like, true infrastructure of these companies of the world, like trading platforms, insurance platforms, government platforms that provide assistance to the needy. And when you can prototype a newer version that's more performant, that's on modern technology within weeks, people are truly astounded because they're used to this time scale that's so, so long to see anything.

Speaker 26:21 And I just think it's a really great thing.

Conor Bronsdon 26:24 I completely agree with you both that there is this huge opportunity to change our processes in a positive way that enables more experimentation, enables more problem solving, and ends up in better products faster. However, there are challenges or considerations that come with this increased velocity of code creation and this increased agility that we're trying to insert in our org, and it's major changes for some orgs. How are you both thinking through

Conor Bronsdon 26:55 these actual, like, org changes and how you're changing the org charts or how you're changing the processes or how you're changing roles? Rachelle, I know you have a bunch of thoughts on this.

Speaker 27:04 There's a couple of areas where I think you see a lot of change, and let's talk about, like, what I think is really needed. And I can give you an example that we are not involved in yet, but it's all moving on the horizon. Right? And it's the top of mind example, I think, for everybody with AI, which is your sort of worst case, which is medical records, right?

Speaker 27:27 So I know and have read about AI being involved in diagnosing patients and giving recommendations, for instance. In my opinion, what is needed in these areas of very high risk is a tight partnership between engineering that's building the applications and the subject matter experts who can proof the outputs of an LLM. Because if our engineering team is building a medical app,

Speaker 27:57 none of us are doctors. We have no idea, right? It could say you have tuberculosis and I don't know, okay, I mean, that is a thing you can have, check mark, success. Whereas, you really need a much better partnership between your subject matter expert, in this case a doctor or a nurse or a physician's assistant, to prove that the application that you've built is accurate. And basically, you need to invent ways

Speaker 28:24 to have the LLM act as a judge against itself with that subject matter expertise. Right? So that may mean fine tuning a model with medical information. It may mean what we use as repair loops, where we have an LLM play different roles and evaluate the outputs. But definitely, like, effort and work has to happen here to be able to be sure that what the LLM is producing actually

Speaker 28:58 will meet a quality bar. Definitely.

Conor Bronsdon 29:01 I know you've had to kind of adjust how your team approaches at times. And one of the ways you've done that is this new terminology that we're seeing happen and which is highly involved in your role, which is forward deployed engineers, forward deployed AI engineers. Can you explain what this means and how this differs from previous iterations of these roles?

Speaker 29:25 So before running this team, I was a product leader at MongoDB for developer experience. This is an outward facing role where it owns the responsibility of our client libraries, so our drivers, and our framework integrations, including AI/ML frameworks, and then also our developer education, docs and academia, so textbooks and things like that. How we would develop a product or a feature is basically someone like me

Speaker 29:58 in that old role would think of it, and then we would write a bunch of documentation and we'd pass them around and we'd get alignment and we'd scope and then we'd start writing code and then we'd build the thing and then we would show a beta to some users and hopefully they would reply and give us their feedback, and then eventually we GA. In the forward deployed engineering model, which was really coined by Palantir,

Speaker 30:22 you put engineers on the ground with the customer or the organization, and their mission is to build whatever is useful and works right then as quickly as possible. So you have this time to minimally viable thing going on that's so fast. And with AI, it's only faster. And in that way, you can actually take what, I think, in academia they refer to as the steel thread, which is that if you think about how a human being is constructed, you kind of like start with the bones and then you add meat and then you add skin, and then you add hair. So you start with this core functionality and build outward as opposed to having somebody,

Speaker 31:04 usually a product manager, stand back and from afar envision a person, and then what they attempt to hand to the customer is the fully fledged person two years later. So it's a very different way of developing software, but I think it allows for really rapid innovation in a way that suits AI very well. Hopefully, that all makes sense. I don't think I've ever had to explain forward deployed engineering to me. I'd love to add here that, you know, this concept of forward deployed engineers,

Speaker 31:37 the main reason is to, like Rachelle said, get to an MVP faster, iterate faster, move faster, just be more agile. But, of course, you know, velocity can lead to disarray. So, you know, how do you curb the disarray is kind of the question. And from my vantage point, what I've seen is, kind of going back to what Rachelle described, some of these architectures

Speaker 32:01 or responsibilities of forward deployed engineers. The paradigms that they follow are kind of the early goings of agentic architectures, like she described. And you can automate a bunch of that by, you know, building simple agents and they can assist with forward deployed engineering a bit. But where we've seen a lot of value come through, at least in still early days, but we are seeing a lot of value in kind of centralizing

Speaker 32:28 evaluations as well as, like how Rachelle described, app simulation where there's, you know, subject matter experts who want to test apps. So the centralizing of all this can lead to less disarray because now you have a system of record and protocols in place on how you want to simulate an app and how do you take something from prototype to deploy it on the cloud for your subject matter experts to test and, you know, change the prompts and change the parameters and, you know, suggest information back to the developers who can then again,

Speaker 33:02 you know, change their agents, and then add back to the system of record. So this centralizing of evaluations is part of this. So centralizing metrics, centralizing, you know, custom evaluation methodologies, and really providing like a central kind of one stop shop. But for all these, just sort of, it takes a village to productionize an application, and the goal is to do it fast and move iteratively.

Speaker 33:27 The central management system, which we are trying to build here at Galileo, of like the central agentic evaluation system, LLM assisted, of course, that can go a long way in kind of ensuring that, you know, the disarray doesn't happen and it potentially doesn't lead to all the pitfalls of, you know, moving too fast.

Speaker 33:53 Yeah. I think that's totally true. I mean, for evaluation purposes, we get asked for so much advice and one of them is like, which LLM do we use? Which embedding model do we use? And actually just having benchmarks of like, well, this LLM and this embedding model, this is the benchmark for producing Golang code. And just being able to see that in front of them enables engineering teams to decide on which LLM and which embedding model to use because they have no idea.

Speaker 34:20 Sometimes I actually think benchmarking is really important, or evaluation tools are really important at the beginning of the process, just knowing what to pick, what to choose for yourself. Because if you know that this LLM and embedding model is 73% accurate and this one is 91, there's no reason to pick the 73 or whatever. So actually evaluation could be really important before you even get started.
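
To make the benchmark-first selection described here concrete, a minimal sketch follows. The model names, the `pick_best` helper, and the table layout are all hypothetical; only the 73% and 91% figures echo the conversation, and the real benchmarks discussed are not public.

```python
from typing import Dict, Tuple

# Hypothetical benchmark table: (LLM, embedding model) -> accuracy on one task,
# for example generating Golang code. Names and layout are assumptions.
Benchmarks = Dict[Tuple[str, str], float]


def pick_best(benchmarks: Benchmarks) -> Tuple[Tuple[str, str], float]:
    """Return the highest-scoring (LLM, embedding model) pair and its accuracy."""
    if not benchmarks:
        raise ValueError("no benchmark results to choose from")
    best_pair = max(benchmarks, key=lambda pair: benchmarks[pair])
    return best_pair, benchmarks[best_pair]


results: Benchmarks = {
    ("llm-a", "embed-x"): 0.73,
    ("llm-b", "embed-y"): 0.91,
}

best_pair, best_accuracy = pick_best(results)
print(best_pair, best_accuracy)
```

In practice the table would come from running each candidate pair against the same task-specific test set before any application code is written.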

Conor Bronsdon 34:48 Completely agreed. And Rachelle, I really liked how you talked earlier about the benchmarking process that you approach these modernizations with as far as having engineers deliver specific discrete parts of the modernization process to then benchmark against the LLM deployed piece of it as well, because this clearly sets you up to not only have this like source of ground truth of like, okay, here's how fast we can do a certain unit of this task, let's now break it down and really understand the breadth of it, but then you can kind of apply that across your evaluation of metrics approach.

Conor Bronsdon 35:23 And I know part of how you've come to this perspective is that you have done a lot of review of kind of the academic research of the perspectives out there before diving into AI tool development. Can you talk a bit about why you think it's crucial for engineers to be paying attention to the academic side as well and how it can help teams avoid potential pitfalls?

Speaker 35:48 For sure. And, yeah, this is so important. When you are on an engineering team that wants to start out with using AI, it is very tempting and super easy to just maybe Google it and read a Medium post or a Stack Overflow post or a blog, or not even do that and just kind of wing it and get started with just, I'm gonna throw some stuff in an LLM and then see how it responds.

Speaker 36:16 And that's mostly because our way of working as engineers over the last fifteen to twenty years has been exactly that, right? There's nothing wrong with that. However, we have this additional benefit, which is that AI and ML was born in the academic community. Unlike engineers, the academic community documents everything. So you actually have a huge wealth of knowledge in terms of how LLMs operate,

Speaker 36:50 how you can construct software chains of actions together to get better and better results, and that's been studied by computer science, people getting PhDs in computer science for years and years. So if you just take even one day to research that and then read some of those papers, you actually will go into your development cycle with a lot more knowledge than even the average engineer,

Speaker 37:20 and you can make much more effective tooling. I think this is just a new way of working for software engineers. We don't typically look at, like, academic white papers written by university students before we, like, build apps. But if you do that, you will really unlock your productivity, and you will save yourself some lessons learned because it's already been studied and documented. It also gives you an idea of what to expect because it could be that, you know, six academic teams all around the world have studied this. They benchmark the performance of an LLM at a specific task, and there's no point in you trying to get to 100% because the best you're ever going to get, even with

Speaker 38:01 every single optimization that's been thought of so far, the best you may ever get is 75, and the rest is going to be manual. It's good for you to know that and be able to see what variations people have tried already so that you don't have to go through that again.

Conor Bronsdon 38:16 And, Aten, I know you've also taken this perspective of diving into the research. It's actually something we've done quite a bit at Galileo, is creating our own research as well. Do you have any thoughts you want to add here?

Speaker 38:30 Yeah, I mean, just to second what Rachelle said. It is really this new muscle that should be part of modern day software evaluation tooling. Cause I think I've always taken this hot take that in the next three years, every single software application will have some kind of AI footprint, either it would have been written partially by AI or it would have AI features in it. But in this new world of developing software,

Speaker 38:55 the gap is really what Rachelle was highlighting, which is the modern way of evaluating and testing, and so far it's been a lot more deterministic, unit tests, integration tests, smoke tests, what have you, but now we have benchmarking and evaluation practices and A/B comparisons and the practices which researchers are typically used to, so as far as bringing that into a modern day evaluation

Speaker 39:25 tooling is concerned, our take was that there is a need for new ways to look at, you know, software evaluations and new methods to evaluate it, because even from a machine learning standpoint, just taking a dataset and splitting it into two and having ground truth and running an F1 score on it, the entire hypothesis of Galileo, the underlying reason we started the company was we saw those methods as

Speaker 39:54 kind of myopic and necessary but not sufficient, and they would treat all errors the same way, and it was very, very limiting in its approach towards determining whether an AI system is good or not. So we set out on this journey to create new methods to evaluate it, and now in this modern era of LLM driven software development, we are bringing a lot of data research

Speaker 40:17 into tooling and building a platform which is conducive for the user, meets the user where they are, and is customizable so that they can create their own metrics and do this kind of evaluation and benchmarking in a way where it's not an entirely new thing which can shock them, but kind of eases them in. You know, as I said earlier, you need to meet them where they are, so the ability to do that in a unit test or in an integration test is kind of the right way to go about it. Rachelle, what are your thoughts on this approach that Aten's talking about? Absolutely.
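
One way to picture running an evaluation inside an ordinary unit test, as mentioned just above, is the sketch below. It is not Galileo's tooling: `fake_model`, `exact_match`, the golden set, and the 0.6 bar are all stand-ins invented for illustration, with a lookup table in place of a real LLM call.

```python
import unittest
from statistics import mean

# Hypothetical golden set of (prompt, expected answer) pairs.
GOLDEN_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call; deliberately gets one answer wrong."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return answers[prompt]


def exact_match(expected: str, actual: str) -> float:
    """A simple custom metric: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def evaluate(model, dataset) -> float:
    """Average the metric over the whole dataset."""
    return mean(exact_match(expected, model(prompt)) for prompt, expected in dataset)


class EvaluationTest(unittest.TestCase):
    MIN_SCORE = 0.6  # assumed quality bar, chosen only for this example

    def test_model_meets_quality_bar(self):
        self.assertGreaterEqual(evaluate(fake_model, GOLDEN_SET), self.MIN_SCORE)


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EvaluationTest)
outcome = unittest.TextTestRunner().run(suite)
```

The difference from a deterministic unit test is that the assertion is on an aggregate score against a threshold rather than on one exact output, so a single wrong answer does not fail the build on its own.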

Speaker 40:52 And I can give you an example of a tool that we have and how we use that sort of evaluation. So at MongoDB, one of the things we worry about is the code quality generated by an LLM. I think everybody worries about that. When you have it generate tests, the tests are actually just code also. And so what we have built is this chain of repair where it looks at the evaluation metric, which is a custom evaluation metric that we have come up with over time,

Speaker 41:20 and until it hits a certain score, we repair, right? And we just say fix the code, make it better, fix the code, make it better. And of course, we have a lot of ways in which we explain to the LLM how to fix it and make it better, but it continues to loop through that until it hits a certain score. And no human looks at the code until it hits that score. And we go through a certain number of iterations of that, and if we can't hit the score, then a human intervenes. But otherwise, that allows us to just sort of have this more

Speaker 41:49 automated code generating machine, right? Where we can be assured that the code passes a quality check before we even send it to test. Because there's no reason to send it to test if it'll always fail because it's not high quality code. Right? And so that's an example of how we use these custom evaluation metrics to yield good results and speed.

Conor Bronsdon 42:07 I love it. Rachelle, thank you so much for coming on the show. It's been an absolute pleasure having you with us today. Where can folks learn more about the work you are doing and the work you're doing at MongoDB?

Speaker 42:20 Nowhere. No. But seriously, nowhere. So we've been really quiet about the modernization factory, and forward deployed engineering and field engineering. So I guess message me on LinkedIn, and that's your method. There's, I think, one blog post on mongodb.com.

Conor Bronsdon 42:39 So I'll have to try to find that. But, thanks so much for giving us a little bit of a sneak peek into your thought process. It's been a ton of fun, and, thanks for joining us for this episode of Chain of Thought. If folks wanna dive deeper into everything we've discussed today, we will have that all in the show notes. As always, check it out for all the links and resources. And hey, while you're at it, go ahead and subscribe to Chain of Thought wherever you get your podcasts. It helps more folks discover the show. We're on YouTube as well. Plus, if you're really loving the episodes we produce, if you thought Rachelle was fantastic,

Conor Bronsdon 43:08 you could let us know on social media. We'd love to hear from you. Or consider leaving us a rating or review on Spotify or Apple Podcasts. It's a great way to help other folks discover the show. Aten, Rachelle, thanks so much for joining me today. Yes. Thanks so much for having me. Pleasure.