What does Malte Ubl mean by the “vibe check” when building an agent?

Malte Ubl describes the vibe check as a non-programming exercise to find out whether the LLM is anywhere up to the task — before you write any code. His example: if you have a set of writing rules, go to ChatGPT, paste your text, and ask it to check the text against those rules. If it delivers all kinds of false positives, that sucks; if it doesn’t find the problems, that sucks too — but maybe it does. You quickly learn whether you’re giving the AI a task it isn’t ready for, in which case you might need to wait a year or do something else. If it seems to work, you go turn it into a program.

How does Malte Ubl recommend structuring a production agent?

Malte Ubl encourages the simplest architecture — the one behind why Claude Code works as magically as it does. You have an LLM, a trigger to start work, input data and a prompt, and a set of relatively simple, atomic tools — for a coding agent, read files, list files, edit files, create pull requests; for deep research, maybe a search over your Glean or a generic SQL tool hitting Snowflake. You describe what each tool is for and the goal, tell it it gets up to ten turns, and let it cook. He stresses tools should be atomic rather than high-level — uncomfortable, but it just works.

Why does Malte Ubl say prompt injection can’t be solved like SQL injection?

Malte Ubl explains that with SQL injection, escaping user inputs per best practices gets you to essentially zero risk — in principle it’s solved. Prompt injection is different: a tool’s response becomes part of the prompt, with nothing jailing it. You can sanitize inputs, but an attacker can phrase the injection in Spanish, so you can’t “100% ensure” safety the way you guarantee escaped SQL. His fix moves security to a different layer: hard-coding a query’s tenant conditions inside the tool rather than passing a user ID the model can be tricked into changing.

What does the Vercel AI Gateway do, according to Malte Ubl?

Malte Ubl says the AI Gateway is directly integrated with the AI SDK and lets you use every AI model from any provider without any API keys. To give a new model a vibe check, you literally just change the model string — no going to a provider’s website to get a key, which he notes can be especially hard for some models. During development the rate limits are very low and it’s free, on the logic that you’d never need much yourself; when you go to production you can access the same models and Vercel bills you at market rate. He frames it as a frictionless way to actually try AI.

How does Malte Ubl suggest teams without AI experience get started?

Malte Ubl’s main advice is to try it in a non-pressure setting, because this is a new kind of software most people don’t have intuition for and some feel anxiety about. Vercel ran a one-week agent hackathon for all engineers; beyond the “pretty awesome” outcomes, the bigger win was that everyone had now built one agent, so they’d approach real business tasks having done it before. He also points to AI SDK examples — including his colleague Nico’s well-received session building a coaching agent from scratch with a not-smartest model and three tools — to show the technique transfers to other use cases.

Episodes · S2 E39 ← Prev Next →

Vercel's Playbook for AI Agents: From Vibe Check to Production | Malte Ubl

Sep 10, 2025 · Malte Ubl , Vercel · 54 min

AI Agents AI Evaluation & Reliability Enterprise AI AI Security AI Coding

Listen on any app

Key takeaways

Malte Ubl reframes agents not as something weird but as “software that we always wanted to write” — the daily-task automation that needs a little flexibility and was historically hard to model with exhaustive if-this-then-that blocks. Agents do really well on exactly that kind of software, so suddenly it’s super easy to write.
Vercel built what Malte reluctantly calls a “DevSecOps agent” — it takes anomaly detection from a firewall and goes fishing for what happened. Hand a modern frontier model a few tools to query the data stream and it does a pretty good job from scratch; he says you spend an afternoon and have software that used to be incredibly hard to write.
On why coding agents lead, Malte is precise: software engineering has source code you can validate and run unit tests on, so you can reinforcement-learn on this valuable, open-ended problem in a way that’s harder elsewhere. He treats that as a signal the effectiveness most likely generalizes — not proof it already has.
Malte’s recommended architecture is the uncomfortable-but-it-works simple one — the reason Claude Code feels magical. An LLM, a trigger, input and prompt, and a set of relatively simple, atomic tools (read files, list files, edit files, create PRs). Tell it to make up to ten turns toward the goal, then “just let it cook.”
Prompt injection demands a different fix than SQL injection, Malte argues. Tool responses become part of the prompt, period — nothing jails them, so you can’t escape them the way you escape SQL. The fix moves security to a different layer: hard-code a query’s tenant conditions instead of passing a user ID the model can be tricked into changing.
The under-discussed pattern, in Malte’s view, is “AI in the loop” — the inverse of human in the loop. The human does the work and AI reviews it. He’s a fan because AI code review is relentless, doesn’t forget your past mistakes, doesn’t go to bed or wake up tired, and stays sharp on tedious checks where humans go low on quality the fifteenth time.

Frequently asked questions

What does Malte Ubl mean by the “vibe check” when building an agent?: Malte Ubl describes the vibe check as a non-programming exercise to find out whether the LLM is anywhere up to the task — before you write any code. His example: if you have a set of writing rules, go to ChatGPT, paste your text, and ask it to check the text against those rules. If it delivers all kinds of false positives, that sucks; if it doesn’t find the problems, that sucks too — but maybe it does. You quickly learn whether you’re giving the AI a task it isn’t ready for, in which case you might need to wait a year or do something else. If it seems to work, you go turn it into a program.
How does Malte Ubl recommend structuring a production agent?: Malte Ubl encourages the simplest architecture — the one behind why Claude Code works as magically as it does. You have an LLM, a trigger to start work, input data and a prompt, and a set of relatively simple, atomic tools — for a coding agent, read files, list files, edit files, create pull requests; for deep research, maybe a search over your Glean or a generic SQL tool hitting Snowflake. You describe what each tool is for and the goal, tell it it gets up to ten turns, and let it cook. He stresses tools should be atomic rather than high-level — uncomfortable, but it just works.
Why does Malte Ubl say prompt injection can’t be solved like SQL injection?: Malte Ubl explains that with SQL injection, escaping user inputs per best practices gets you to essentially zero risk — in principle it’s solved. Prompt injection is different: a tool’s response becomes part of the prompt, with nothing jailing it. You can sanitize inputs, but an attacker can phrase the injection in Spanish, so you can’t “100% ensure” safety the way you guarantee escaped SQL. His fix moves security to a different layer: hard-coding a query’s tenant conditions inside the tool rather than passing a user ID the model can be tricked into changing.
What does the Vercel AI Gateway do, according to Malte Ubl?: Malte Ubl says the AI Gateway is directly integrated with the AI SDK and lets you use every AI model from any provider without any API keys. To give a new model a vibe check, you literally just change the model string — no going to a provider’s website to get a key, which he notes can be especially hard for some models. During development the rate limits are very low and it’s free, on the logic that you’d never need much yourself; when you go to production you can access the same models and Vercel bills you at market rate. He frames it as a frictionless way to actually try AI.
How does Malte Ubl suggest teams without AI experience get started?: Malte Ubl’s main advice is to try it in a non-pressure setting, because this is a new kind of software most people don’t have intuition for and some feel anxiety about. Vercel ran a one-week agent hackathon for all engineers; beyond the “pretty awesome” outcomes, the bigger win was that everyone had now built one agent, so they’d approach real business tasks having done it before. He also points to AI SDK examples — including his colleague Nico’s well-received session building a coaching agent from scratch with a not-smartest model and three tools — to show the technique transfers to other use cases.

Concepts in this episode

AI terms discussed here — each links to a plain-language definition.

AI Agent Prompt Injection Jailbreaking Prompt Engineering Frontier Model Human in the Loop AI Evaluation Latency Precision and Recall

Show notes

What’s the first step to building an enterprise-grade AI tool?

Malte Ubl, CTO of Vercel, joins us this week to share Vercel’s playbook for agents, explaining how agents are a new type of software for solving flexible tasks. He shares how Vercel's developer-first ecosystem, including tools like the AI SDK and AI Gateway, is designed to help teams move from a quick proof-of-concept to a trusted, production-ready application.

Malte explores the practicalities of production AI, from the importance of eval-driven development to debugging chaotic agents with robust tracing. He offers a critical lesson on security, explaining why prompt injection requires a totally different solution - tool constraint - than traditional threats like SQL injection. This episode is a deep dive into the infrastructure and mindset, from sandboxes to specialized SLMs, required to build the next generation of AI tools.

Connect with Chain of Thought host Conor Bronsdon:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

Follow Today's Guest(s)

Connect with Malte on LinkedIn

Follow Malte on X (formerly Twitter)

Learn more about Vercel

Check out Galileo

⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Try Galileo⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Agent Leaderboard

Transcript

127 segments

Malte Ubl 0:00 On the agent in particular, this framing that I have where it's a new kind of software, but not software that's like weird, but software that we always wanted to write. Where I'm coming from is that there is a type of software that automates something that we do on a daily basis, where there is a little bit of flexibility that's needed, where that's actually quite difficult to model in software, where traditionally because you would have written these if this and that type of blocks, it's just difficult to make that exhaustive,

Malte Ubl 0:27 but agents do really well on that same software, so suddenly it's super easy to write.

Conor Bronsdon 0:36 Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsden. And today, I'm joined by Malta Ubel, CTO at Vercel. Malta, it's great to see you again. Great to be here. Super awesome. Yeah. It's it's been a lot of fun chatting with you over, I guess, the last year. It was the last time we had a chance to sit down and kind of dive deep on everything happening with AI. So it's been fun to watch a couple of your talks

Conor Bronsdon 0:59 as you've continued to be a prominent voice in the AI space, often striking a balance between excitement and pragmatism and the ability to be actionable, which I think is something Vercel does a really good job of. And you've even described yourself as an anti hype guy at times, something which was on full display at your recent Vercel ship twenty twenty five talk.

Conor Bronsdon 1:21 And you had some fantastic takeaways there. In particular, you talked about the FOMO that people are feeling right now. And as someone who talks to leaders like yourself every week who are all doing exciting things in AI, I feel this FOMO vehemently. Every week I'm going, Oh God, here's a new thing that I haven't experimented enough with yet. And it's very common, I think, for folks who are building in a space to experience this. It's almost the default emotion of

Conor Bronsdon 1:48 AI builders as perpetual FOMO, because there's so much happening, there's so much hype, there's so many new and exciting things that are actually being achieved, and the space is moving faster than I think many of us have ever experienced. You've posited that AI agents in particular are a new kind of software and a new paradigm, that is something we've always wanted to create, but couldn't for economic and,

Conor Bronsdon 2:15 you know, energy and, frankly reasoning reasons. Could you expand on how you see

Malte Ubl 2:23 agents and AI more broadly shifting the paradigm of how technology is working today? Yeah, totally. And I think the way I qualify anti hype guy is that I'm actually I'm pretty hype y about this stuff, but I try to ground it in reality. Like I think there's always this moment in when tech comes out that people kind of feel compelled, like they feel the FOMO, but they don't really quite know what the thing is. And so like, you don't really know where to start, right? And what

Malte Ubl 2:51 definitely is different in AI is that there is value though. Because I, you know, it felt like almost the entire Web3 hype cycle was like, just perpetually there. But you can never find, you know, the thing that actually would be the value was never really discovered, but this is so different, right? Like, it's, you can go try it out and then it actually happens to work.

Malte Ubl 3:17 So so we're coming from a different perspective. And the the yeah, the the on on on the agent in particular, the this this framing that I have where it's a new kind of software, but not software that's weird, but software that we always wanted to write. Where I'm coming from is that there is a title software that automates something that we do on a daily basis,

Malte Ubl 3:45 where there's a little bit of flexibility that's needed, where that's actually quite difficult to model in software. Where like traditionally because you would have written these like if this and that type of blocks, it's just difficult to make that exhaustive. But agents do really well on that same software, so suddenly it's super easy to write. I could give a few example,

Malte Ubl 4:10 and just from the top of my head, so we've been working on, I don't like this category, DevSecOps agent. It's basically just kind of takes anomaly detection from a firewall and goes fishing for like, what happened? Like and it could be anything. It could be many, many things, and and and it can be all of these things, like, combined in novel ways. And so writing software that would do that analysis would be, I mean, at least difficult and a lot of work, and

Malte Ubl 4:44 if you just take a modern frontier model and you give it a few tools to say, here's how you can query our data stream, what happened? Right? It will do a pretty good job just from the scratch. And then you can do it better over time, but it actually is able to answer that question in a way that is valuable, and you spend you know, an afternoon on it. And now you have a piece of software, again, like that used to be so hard to write and and and,

Conor Bronsdon 5:14 you know, the now it's now it's something we can actually do. I completely agree. It's it's interesting too, because it feels like, and maybe this speaks to your anti hyppositioning, that we all overestimate the ability to go from zero to 100% of solving the problem, and maybe underestimate how much effort is required in that final you know, 20% of solving it. Because to your point, it's really easy to have a, you know, simple agent, that's working off a frontier model, you know, take your pick depending on the task, and solve a lot of this problem right off the bat. And then

Conor Bronsdon 5:50 we maybe are distracted and don't go through all the needed tweaks to get it to a 100% of our task. Yet at the same time, that that's incredible. We in an hour or two have solved most of this problem. And I wonder if part of why there is this, bit of a hype gap there is that while most developers are thinking about agents, not all of them are actually building them yet. And, you know, I I referenced your talk at Forsell's Ship 2025,

Conor Bronsdon 6:25 and one of the first things you asked the audience was how many of you are actually building with agents today? How many of you have actually tried it out? And, I believe you said about 5% of that audience actually had. Whereas, I know you think it's gonna be more like 100% when you come back and ask that same question next year. Right.

Malte Ubl 6:43 Yeah. And the reason just is that I think we, like in our finding, like we tried actually quite a few things, and they all work. Unreasonably effective. And so I feel it generalizes to multiple categories. Now, will give one caveat, which is that the reason why and obviously, we're talking to like a software development audience who is using AI agents today in substantial fashion.

Malte Ubl 7:14 Right? Maybe cursor, but I mean, certainly Cloud Code just fits that bill and is working and it's fascinating. And so the the I think the the agreed upon reason why you have these like very effective agents in software engineering and you don't have them in other disciplines, is that in the software engineering area, you can reinforcement learn on this very valuable, open ended problem in a way that's more difficult for

Malte Ubl 7:49 things that don't have source code that you can validate and run unit tests on and so forth, right? So it's a little bit easier to make these agents very good at software engineering than it is to make them really good at anything, but having said that, I think that is basically just pointing towards something that is going to be true more generally, and that's why I feel that

Malte Ubl 8:09 what we see as signal today in our own discipline does most likely generalize further out. Because I am working on DevTools, I only have DevTools examples, right? Actually, let me give you one example that I thought was really cool. Please. So I live in Alameda, and specifically on Bay Farm Island, which is the same peninsula that has the Oakland Airport. And I don't hear airplanes

Malte Ubl 8:36 except for the trijets. And so I I don't love them, and they they and they, you know, not obviously no no passenger airline anymore flies them. But the but the cargo airlines do. And so I was out I wanted to wondered when are they phasing them out? And the Internet doesn't know that, so I I sent out deep research, and what JetGPT did, it looked at all the flight data of every day from the Oakland Airport that it publishes, and figured out what is the over time change in tri jet patterns. Oh.

Malte Ubl 9:10 Obviously I could have done that myself, but it was like a hell of a lot of work, So I wouldn't have done that myself, and now you have this thing that does it. So that just worked, out of the box deep research task. It created novel data that was technically there, but obviously it would have been incredibly difficult to surface. We actually have a product under development that is very similar,

Malte Ubl 9:33 which is we have an agent that does essentially the same as Git bisect for Vercel preview deployments. It's because Vercel has immutable deployments, you can essentially bisect them to find out where a regression occurred, and so this bot, can teach it how to reproduce a problem, and then it just does the bisect. It's a similar thing where, I mean, you can do that as a human, but it's a horrible task. We happen to be bad at it, even with a bisect helper,

Malte Ubl 10:08 going to go up, down. Humans are so bad at this, it's very easy to program that thing, and so it perfectly fits into an agent. This agent, this is already different from a Clock code, right, because you're in this, in a world that is kind of outside of coding, that again just works. With relatively small effort, you have this incredibly valuable tool that is able to navigate more uncertainty than a traditional computer program have been able to. Yeah.

Conor Bronsdon 10:41 That's a really great example. And I know we we all know like, it's fantastic to have a grid pattern search, and we we wanna approach things that way. But God, when you have to, you know, bisect things yourself and recreate a problem, it can be tedious. Yep. So I love that as an example. And it sounds like you're identifying other unlocked opportunities for businesses that are embracing AI agents today. Know, obviously coding and dev tools are

Conor Bronsdon 11:08 clear examples here where we have a ton of public data to train off of. We have a lot of solved problems. We have a ton of people who are working on these problems who can help, you know, fine tune and improve this. But what are the other use cases that you think businesses should be embracing today for agents? Yeah. I think the clear

Malte Ubl 11:28 big one is support agents. That's where we also see kind of this entire ecosystem of kind of, let's like, call them SaaS businesses that act specifically in this space, like Decagon, Sierra, etcetera. Right? And I think generally speaking, there's this because it's now cheaper to make software, I wonder how much you're actually gonna buy software packages like that versus you're able to buy these yourself, have the perfect tailored support experience. So I think that's a that's a big one.

Malte Ubl 12:03 And I mean, I just gave this deep research example. So deep research as a basically just like generalized way of, you know, going through data. That's useful, and whatever public AI is there does not have access to your company's data. So I think one one really concrete thing that you can probably work for your company is that you've built a deep research tool

Malte Ubl 12:32 that has access to your private data, right, that you don't share with anyone else. How would you suggest that

Conor Bronsdon 12:39 engineering leaders or product leaders who are currently evaluating their own AI use cases think through these opportunities? So it sounds like one, you know, what can you unlock with your private data? I think that's a great example. Two, you know, what tasks are simply challenging to solve today that you can at least get mostly solved much faster with an agent or AI solution?

Conor Bronsdon 13:05 What other ways would you encourage other leaders to be thinking about solving problems with AI? Yeah, think the, I mean,

Malte Ubl 13:13 overall, it's all about being creative, right, and knowing your business. I think besides the categories that I've been talking about, it's worth whenever you have a workflow, it's worth looking into, and often in like, especially bigger companies, you have a process, there is some kind of routing phase in there where it's difficult to write this algorithm, like what's next, right? And that's I think where AI models excel, like finding,

Malte Ubl 13:45 where you basically just write down the business rules in pros, and you basically just have the LLM execute your business rules. Right? That that works really well. The other one, the other pattern I think that is under discussed, in my opinion, is AI in the loop, because people love to talk about human in the loop. Human in the loop, the AI does the work, and then you have a reviewer. That's obviously what usually should happen. Right? It's

Malte Ubl 14:12 so obviously true that I don't even think it's necessary discuss this, but the opposite also works really well. So if the human does the work, then AI is an incredible reviewer. I think what we're seeing now is that AI core review, first of all, is amazing. It has this relentless list. It doesn't forget about your past mistakes. It checks again if you make a million. Stuff like that. So I'm a fan. And so I think that is almost

Malte Ubl 14:43 like there's no way that doesn't generalize. Right? Like I think validating work product against about like about hard to like again, the the rules you wanna validate against are really difficult to put in algorithm. Right? Mean, often with human work product, it's a natural language, it's very hard to write lint rules against human output, right? But you can basically think about AI doing a similar thing where

Malte Ubl 15:14 like let's say you're in the marketing department and you have a certain policy around how language should be used, like that is something that an LLM can enforce really well, and again, it doesn't go to bed, it wake up tired, right? So it's good at these very tedious tasks that humans, the fifteenth time they do it, go low on quality. So I think those are kind of the areas that are

Malte Ubl 15:43 ripe for ideas. And again, these are just examples of so you have you have an idea in your head. The the next step then to, like, turn this into something more practical is that what I call the vibe check, where you actually do a non programming exercise to find out if the LM is anywhere up to the task. Some of the examples I just gave, they're almost trivial

Malte Ubl 16:11 to check-in that way, right? Like if I have a set of writing rules, and I wanna understand if the AI is doing a good job at forcing them, I can do that right now, right? I go to chatjibidi.com, I paste my text, I ask it, can you please check this text against these rules and see what happens, right? And if it delivers all kinds of false positives, then that sucks, and if it doesn't find the problems, that also sucks, but maybe it does. And you really quickly can understand,

Malte Ubl 16:42 am I giving the AI a task that it's just not up to, and maybe I need to wait a year, or maybe I need to do something else, or you're saying, okay, though, actually this does seem to kind of work, and great, now I can go and turn it into a program, right? Because obviously, the next day, ideally, we don't have to paste it in the JGBT

Conor Bronsdon 17:04 and write a prompt from scratch, but that part is like the easy part. I I feel like this is almost a a simpler version of the kind of MVP approach that a lot of folks are encouraging when it comes to building out new product features, Where they're saying, oh, well, just, you know, vibe code a quick version of that feature, see how it looks. Or, you know, of, you know, vibe code what you want the new UI to look like, and, you know, let's let's have something to work off of.

Conor Bronsdon 17:29 And you're saying, hey, you know, that's a great step. But before we even get to that, with let's just really simply check the use case. Is this something that an LLM is kind of equipped to handle right now, or are we gonna have to add more structure and more context, or simply wait till better set up for? A 100%. Yeah. I think you're right in that it

Malte Ubl 17:49 fulfills a similar role in the process. I mean, I'm generally the biggest fan of rapid prototyping and using an early version of the app as a discussion platform, rather than the fully blown thing. Obviously, we're making V0, which is like, that's the whole point, but I think that great for when your idea is that you're gonna ship this product, right, and it has

Malte Ubl 18:16 an experience and you wanna experience it. So that's one thing. The other thing is that, now, if that product has an AI at heart, obviously, it might not, but if it does, like, you know, because you're building an agent, then it really helps to just kind of prompt your, you know, your way to seeing if it's if it's working, because it it just removes all these steps that are essentially enterprise process automation, right? You know, and maybe that's like only five minutes in your startup,

Malte Ubl 18:50 but if there's an SAP and Oracle and you know, whatever. Right? Like, it might be a lot of stuff. And like writing code against those doesn't hurt, but like you also you don't you know, that's I think we're all kinda confident that we could do it if we you know, if the AI at the heart kind of did what we wanted. What would be your next step? So you've started to do this iteration.

Conor Bronsdon 19:19 You've kind of validated that, hey, yeah, this is a real use case we could apply this to. What should builders do next to

Malte Ubl 19:27 turn this into a fully fledged agent or system? Yeah. So hopefully we we found out that the AI somehow magically actually, you know, does what we want. Then, yeah, then I mean that's the step where you go and put it in the software, right? Like that's when we attach it to the, our business systems that have the data that we wanna process. Yeah, again, it could be all kinds of different services depending on what we wanna build. But it's like, at this stage,

Malte Ubl 19:57 what we're really doing is effectively the just plain old workflow integration, right? So that step, it should like, if you're a backend engineer, you're right at home. Like, you're not learning anything new. That's what you're already doing day to day. Nothing's nothing's weird. The other stuff is that at this point, you build up the actual agent, and we actually haven't talked too much about what that means, so maybe now is a good time. Please.

Malte Ubl 20:23 There there are different architectures, but the one that feels uncomfortable, but works that which is the architecture, which is why, for example, Cloud Code works as magical as it does, and and the one I would encourage using is the most simple one, which is where you have an LLM, and you you know, obviously you have some kind of trigger for why you start working, have some kind of input data and prompt,

Malte Ubl 20:48 and you give it a set of tools. And these tools should be relatively simple. So for a coding agent, would be read files, list files, edit files, create pull requests. Nothing crazy, right? Really just to kind of go figure out what's in my repository. Maybe there's a grab files in there or something like that, right? If you're you're building any form of like

Malte Ubl 21:18 deep research agent, you'll give it as like and you wanna attach it to your company database. Maybe you give it a search functionality that hits your glean or whatever you're using, right? And maybe you give it like a generic SQL execution tool that gives it access to your Snowflake. Or maybe you're doing something more specific, but that's the type of tools you give it. And you describe to the LM, this is what these tools are good for, this is your goal,

Malte Ubl 21:47 and you tell the LM, okay, you get to make up to 10 turns using a tool that's towards your goal, and then you just let it cook. So you literally tell it, go figure it out, right? And then, okay, says, okay, this is my task. Okay, I have this tool. Okay, that seems useful. Execute the tool. It gets the response back and sees, okay, this is what happened, okay, does this help me, does it not, maybe I need to call this tool again with a different query, or no, this is actually right. So it figures out this entire sequence

Malte Ubl 22:19 of kind of iterating towards the result in a fully autonomous fashion. And the magic is that that works. And you have to, it does feel a little bit uncomfortable. I think we all, what we've been finding is that you actually really want to make these tools somewhat atomic, rather than more high level, more specific to the task, and that just works. You have to give it a try, but that's

Malte Ubl 22:50 the architecture, and then at the end, that LM produces some kind of output, and there you go. If folks are not so familiar with how these LMs work, you have multiple turns, and technically this almost like a immutable function in the sense that the when you call it again, you just give it the previous output. And so it acts like it would be continuing on previous interactions,

Malte Ubl 23:24 the previous two calls, obviously it's processing a new tool return, but really it always goes from scratch and just does the token completion, but given the new data, right? And so that actually works. Yeah, at the end, get some good output, there was something you were going to do, right? So you do that. That's where you're back to the plain old workflow automation,

Malte Ubl 23:52 and there you have it. That's the whole program. How do you then approach optimization

Conor Bronsdon 23:58 of this to ensure you move from 90% of great to a 100% or whatever it is you're looking to truly solve? Yeah. I think I mean, even maybe slightly before that,

Malte Ubl 24:11 because we have this chaotic software now, right? The core kind of business logic was kind of only expressed in words, and the LLM is figuring out its own path, given the rules you gave it. So that is, again, that's a new kind of software. It's it's very different. So we do have to put in all kinds of logging and tracing to understand what it does. Right? And if, know, if you're professional software engineers, were probably already like super deep in that topic, but if we skipped it for now,

Malte Ubl 24:46 you need tracing now. Because the trace tells you what the thing did. It's not always the same trace. It really depends. So you gotta figure out what's going on, and from there you can go and do the optimization. The optimization itself, folks probably have heard the term prompt engineering, which is basically really golfing out prompt towards being more successful.

Malte Ubl 25:19 I think one thing that isn't as intuitive is that if you give a tool to the LM, that is also an exercise in prompt engineering. So you need to explain to the LM what the tool is for, what that kind of I mean, obviously, the the signature of the arguments, but also, like, let's say you have a string argument. You need to tell it what goes in there, like, what the format is, if there's anything that cannot be expressed in the time system. Like, you have to be really expressive.

Malte Ubl 25:47 The the way I think we're essentially past this step for in g p d 3.5 step. You needed to do all these tricks with prompt engineering. Right. It's largely not needed anymore, but you really have to treat my advice is to treat it like a junior engineer who isn't the smartest kid on the block, but has incredible patience with you. So you can tell it how things work,

Malte Ubl 26:24 then it always considers it. Sometimes it ignores it a little bit and stuff, the education that you give it is actually considered every time the program runs from scratch, because it's previously stateless, it's not a learning system for now, but that also means that it's always top of mind. Your business rules are always top Right? Of So, yeah, you have to write them down.

Malte Ubl 26:51 I know, as an engineer, we don't love writing docs, that's our job now, because we're writing docs for the machine. But it feels better because the thing that listens to you doesn't feel like writing documents because other engineers also don't love reading them that they don't read. These things like read you a Yeah, at least this one, it listens you a little bit. Yeah, so that's kind of the thing.

Malte Ubl 27:12 So that's how you optimize things. I don't wanna talk about fine tuning models. I think it's, obviously it does help with cost, maybe that's, but it's a whole, it's own topic. You also, I mean, there's this whole topic of evolves, which you do need to dive into to make sure your thing stays good over time, and to kind of optimize it at the edges. So, but that's kind of the loop that you have to do with the Rider guy. Initially there will be some low hanging fruit on the prompts,

Malte Ubl 27:42 then you'll test it with users, with yourself, you see kind of with those special cases that don't work, there will be some low hanging ones, and then eventually you do have to write evalts, both kind of that show you how it works in production, so basically just in production, implementation of your system, and also these type of, like they're almost like unit tests,

Malte Ubl 28:09 where you effectively just run your AI, give it a known task with known data, and you'll

Conor Bronsdon 28:17 essentially write assertions that it does the right recommendations at the end. Right? Yeah. This concept of eval driven development is really starting to become talked about more similar to, you know, test driven development has been talked And, I mean, you referenced it earlier, this idea of using LMs as judges, I think, is obviously very common now within certain tasks, like writing, for example, and having it be an editor for you.

Conor Bronsdon 28:43 But in evals in particular, there's you know, an entire field and concept around it. We just wrote a large ebook about it, actually, talking about, like, basically how to effectively do LM as a judge. And I'm curious how you're applying AI as a judge to evaluate outputs during this optimization phase, and then also within your evaluations.

Malte Ubl 29:04 Yeah, you can use it in two ways. So one way how you can do it is actually not in the eval phase, but in the actual agent. So I mentioned how for your because your agent effectively is a loop, have to write some kind of exit condition, and it's some What I would start with is to just give it a number of turns, but it doesn't need to exhaust them. Obviously, if the AI says, I'm done here,

Malte Ubl 29:30 then great, right? But what you could do is that you could say, okay, my exit condition is actually just another LLM that I give the task and the output, and and maybe the intermediate steps or maybe not, and I basically just ask it like, what do you think about this? And I think we've all had this experience ourselves that if you use Chachibati or Clotter or some other chat AI like that,

Malte Ubl 29:58 if it that tells you something in high confidence and then you're like asking, are you sure? And it's like, oh no, actually, in hindsight, right? And obviously that is because they're doing the next token prediction and

Conor Bronsdon 30:09 they're literally in hindsight saying this is all like, doesn't make any sense. Well, it's always fun too to pit Claude against GPT or something. Right. Like, oh yeah, GPT

Malte Ubl 30:18 did this, what do you think? Yeah, and so you can put this inside into software, right? That works. And also because it's not bullshit, right? It's actually working, as in the model now has more data, and so that can actually make a better call than the next token prediction, which brings forward the error that it made 100 token before, where the data wasn't there. It kind of gets biased, and now it gives the wrong answer,

Malte Ubl 30:45 whereas then if you give it everything at once, says, okay, obviously this was a mistake. So you can put that into your core agent loop, but also very similarly, if you write an eval where you say, okay, given this input, produce the output, you can then ask the LLM to make sure that, or like to get an eval output, whether that's a good answer or a bad answer, And the reason why that's better than doing it in the main program loop, is that you

Malte Ubl 31:16 avoid kind of the extra cost, extra latency from the LLM call, and you just don't need it. If you get your eval to be so good that it always does the right thing in the initial turn, then you just don't need the LLMS charge kind of in your in your primary eval loop.

Conor Bronsdon 31:30 For teams that are maybe not doing concerted AI development yet, or at least don't have a system that they're applying, Let's say they get really excited. They're like, Malta's really got this idea down in these phases. Where would you advise them to focus within this framework as they begin to build their first agents? Where should they be spending the most time and effort? Yeah, think the

Malte Ubl 31:56 main thing, and like, it's, so first of all, it's new software, new type of software. So it's clear that most of us don't have a good intuition for it. Most of it, like if you have a slightly larger organization, there's gonna be a subset who have maybe anxiety or just overall don't feel sure about it, and the best thing to do against that is just to try it out in a non

Malte Ubl 32:24 pressure setting. So we, for example, just did an agent hackathon for just a week, all engineers in the company, and you know, I mean, actually it had pretty good outcomes and pretty awesome stuff, but like the bigger outcome is that now everyone in the agency has built one agent in their life. And when the business tasks come, they come a perspective that they've done it before.

Malte Ubl 32:47 We do have for the AI SDK which we're building, we do have a set of examples. We you referenced my talk, my colleague Nico had a had a session about building a coaching agent from scratch, which was super well received, and which kind of is a tutorial session that kind of goes, walks you through building an agent, and it kind of shows some of the fascinating stuff that I've been mentioning today,

Malte Ubl 33:14 where he's using not the smartest model on the block, he's giving the thing three tasks, and it's not as good as Cloud Code, but it's incredibly good. And now you have built that thing yourself. There's no ingredients, right? There's no, and I think that kind of builds the confidence that you can apply the same technique to other use cases. Because you're really not, you know, the thing is used like magic,

Malte Ubl 33:39 and I think I, you know, at least the magic in me creates this kinda little bit of fear response, little bit of respect for like, I'm sure they were super smart people. You know, obviously, Anthropic and and OpenAI and Google, they hire like the the smartest kids on the block. But but yeah, this turns out to be, you know, doing pretty well if you if you, you know, just do a little bit of coding. Yeah. Speaking of things that are, I think, going well, it seems very clear that Vercel has done

Conor Bronsdon 34:12 a lot of cool stuff with AI already, and is on the cusp, I think, a lot more. I know plenty of our audience are familiar with your work, but I'd love to maybe give a bit of an overview before we dive a bit more into structure and security of AI agents, of what you see as the next stages of enabling developers around the world to build with AI. What what do you see as, you know, kind of where Vercel's

Conor Bronsdon 34:43 AI SDK is going, and and what what's the plan, I guess. Yeah. No, totally. I think we

Malte Ubl 34:49 so definitely it's the AI SDK at the core. We're shipping version five. So it probably has g eight when this podcast comes out. There's a very late beta right now already released. You can just search X, it's very well received, people love it. It's breaking change, but of the type that people love, because we really just listen to what people were struggling with and making it better.

Malte Ubl 35:16 It really kind of drives down and kind of making the things that people want to build for agents be easier to model with with TypeScript and and and so forth. So that's at the at the core, and I again, like I would recommend using an existing example to kind of go from. We also publish the so called chat SDK, which is really a full blown chattypt like chat app. So if that's what you're building, right, if you actually have a UI,

Malte Ubl 35:46 I agree there's a bunch of funded startups that are really just forks of this template, and I think we have our first unicorn in the books, so that's a good starting point if you want to build a chat UI because that's kind of tedious, and they're all kind of you kind of want them to look the same, so they're familiar. Right? So this has all the features built in.

Malte Ubl 36:05 What we are working on is like more infrastructure offer sell great for doing this type of asynchronous programming that kind of comes up in this core of the agent, because it does it runs for quite a while, it does all these like tool calls, so you need what you would call like very reliable, very persistent compute, where obviously if this if your fifteen minute running agent fails after five, you need to you

Malte Ubl 36:41 need to have a way to continue from from your last checkpoint. And so this is something we're heavily investing in. So we are investing in in a queue product directly on Vercel on into a workflow management, etcetera. What people should definitely check out then is our AI gateway. So that's also directly integrated with the AI SDK, and the idea is that if you use the AI gateway

Malte Ubl 37:06 and you have it configured, you can use every AI model from any provider without any API keys. So if you want to, you know, communicate to comes out, guys give it a vibe check with your existing applications, that's literally just you just change the string from whatever model you have in there right now to a new one, and you don't have to get, you know, go to the Chinese website and and check it out, they get an API key.

Malte Ubl 37:34 Same with Gemini, it's probably even harder, but like, you know, so it's it's really just a convenience, and that's during development, like, we we we had this like playground, we gave everyone the like, low rate rate limits, and we thought we could just give everyone this for development as well. Rate limits are super low at this point, right, but that's also where it's free, because how much would you ever need yourself?

Malte Ubl 37:54 But then if you wanna go to production, you can also access all of the same models and we bill you for market rate here. So it's really just our way to get people to really try AI in a frictionless way, because it's so difficult sometimes to get to these models. And then the final point is that we do work on we have our marketplace, so there's lots of things that agents wanna do from

Malte Ubl 38:23 from workflows, memory, browser use, etcetera, that we are all gonna build, but we make available through the agent. As a bit of a special case, we also ship to sandbox. So this is for the case where you I mean, either you're actually building a coding agent, or what happens is that your agent says, well, actually, I'm really bad at math, but I just generated

Malte Ubl 38:45 some source code that would solve this problem, and so you need a place to run it because it's like, know, might be very insecure code, so that's what sandboxes are for, so that you can just essentially run this like one off code in a in a in a safe manner where there's no access to any of your secrets or any of that stuff. Speaking of security, there

Conor Bronsdon 39:05 are plenty of valid concerns around everything from prompt injection to personal identification leakage. How do you think developers should be considering security and privacy

Malte Ubl 39:20 as they build AI tools? Yeah. Yeah. I mean, thanks for mentioning it. It's a it's a really important topic. Like the, there are several threat vectors, problem injection is the most concerning one. Folks probably have heard about this, like it's because it goes through the news when people jailbreak JGBT, or make some model do something that it wasn't designed to.

Malte Ubl 39:48 I think these jailbreaking use cases are, they're usually referring to someone who had access to the model type prompting themselves, having relatively full access. That's kind of the last generation. Now prompt injection happens through you controlling the eyes fully in the back end, but some user control input gets in there. Right. And what we have to understand is that

Malte Ubl 40:16 there's the the I talked to them, these tools, right? Tools maybe reading from your database. When the responses go into the prompt, and the only thing stopping the model to literally take that that data as gospel for what it's supposed to do, that you kind of tell it not to. But obviously, as we know from jailbreaking, people find ways around that. That's like, we are really, it's in a way mind blowing. So the response of my tools become part of the prompt, period.

Malte Ubl 40:46 They're not in any way like jailed from, or treated in any special way. From our perspective, this is just like the rest of the prompt. And so it's really like, it's a similar situation to a SQL injection, but with SQL injection, I think we're in a world where if you follow best practices, it's not a problem. You escape the user inputs and there's zero risk. Now, obviously people get it wrong and so that's why they're still SQL injection attacks, but in principle it's a soft thing. Versus

Malte Ubl 41:22 on prompt injection, there's nothing you can do. It just becomes part of the prompt. Okay, I shouldn't say there's nothing you can do, but you cannot make it safe as in I 100% ensure that I escaped MySQL content. It's not like that. And so we unfortunately have to put security at a different layer. So, you can sanitize the inputs, right? You can check them against malicious stuff. Very difficult, right? Because you can prompt the checks in Spanish or whatever, right? This is actually not how, but you know, I would still encourage you to do it. But then

Malte Ubl 42:03 the other thing is, what is the worst thing that can happen? Right? Is an important question. And so like, I'll give you an example. If I if I give my ALM a tool to make a SQL query, and I and it's really meant to only search within a single tenant in my database or a single user. Right? But the part where that's enforced is part of the tool, like it's just an argument that's supposed to put in the user ID,

Malte Ubl 42:41 then I can trick that tool into making arbitrary queries. If instead I give the LM a tool that has hard coded what the conditions of the query is, then I cannot escape that. So I need to do stuff like that where I go and make my tools so constrained that even if I consider everything that is kind of part of the AI generated part of the query, if I consider all of that attacker controlled,

Malte Ubl 43:10 the attacker shouldn't get anything that they shouldn't get. Right? So so I do have to think about these like in-depth situations rather than being able to say, like, you know, this is trusted software and I can I can I I always trust everything it does? Are there other critical security and privacy considerations

Conor Bronsdon 43:31 that builders should be prioritizing when creating and deploying AI agents or other AI systems, especially ones that are interacting with sensitive data, like, for example, the customer support agents

Malte Ubl 43:42 Yeah. Or performing other actions within enterprise systems. Yeah. The other big one is just broadly data exfiltration. So this is where you try to make the model reveal data to a third party that it wasn't intended to do. Both GitLab and GitHub actually had essentially the same vulnerability in short order, where and it's actually a good example for what I was describing, right? Basically, people just put,

Malte Ubl 44:13 I think, pull requests to public repositories, which are a tech controlled, right? And into these pull requests, they just put prompt instructions, and because the you know, now becomes part of the, what the AI does, right? And so next step is, what, you know, now this AI, let's say there was code review tool, can't do that much, and first thing you can do is make a comment.

Malte Ubl 44:37 Well, what they did is they said, okay, the comments are marked down. I get to put images, and I get to put query strings in the images, and I get to put the entire source code of the repository into the query string or whatever, right? It's something like that. And so the lesson is that you have to treat the output of the LMS and trust it. We've had to do exactly this with Galileo, where

Conor Bronsdon 45:06 we have these Luna two small language models we developed to enable real time guardrailing, where we will essentially apply guardrail metrics against both the inputs coming in and then the outputs as well. Yeah. Because to your point, it's you know, if something gets missed, you need to also see, hey. Is is this leaking personal identifying information? You know, is data being exfiltrated here?

Conor Bronsdon 45:28 And you need to be able to block that risky content on the fly, or else there is major user risks around this and potential compliance vulnerabilities that you can run into if you're, you know, a regulated company. So Yeah. I love that you're thinking about this, because I think it's really crucial that you, you know, refine the rules for your AI systems without having to constantly redeploy code,

Conor Bronsdon 45:51 or else you put your whole system at risk. And, you know, the the magic and the beauty of an LLM is it can do things you don't expect and can solve problems without you telling it exactly what to do, but we have to account for that in the other direction as well.

Malte Ubl 46:07 No. I I honestly agree. It's like it's and, you know, it is it's really early days. Right? Like this is gonna be, I mean, across all dimensions, right? Like we'll learn much better how to build these programs, we'll learn more about security properties. I think we but we know enough today that it's certainly clear that it's something you have to pay attention to. What haven't we covered so far that

Conor Bronsdon 46:39 is, you know, burning in your mind right now around agents or AI or or what builders should be doing? Yeah. I think we we we did a good overview.

Malte Ubl 46:47 My like, I mean, as as a Vercel employee, what I pay a lot of attention to is kind of just what are people building, and how can I make that better at scale? The way we do that is that we go look at the older adopters and get their feedback, and then we try to build abstractions that make doing the thing that people maybe did by hand something that everyone else can do with a few lines of code. And so we're definitely in that phase where

Malte Ubl 47:19 that's what's going on, but but that's also that's always that's that's part of the fun part. And and in parallel, think we, like, you know, we're collecting kind of use cases. Right? Like, right now, I think we have this phase where things are pretty advanced in the codec space and other other verticals are kind of, you know, interested, but like not as advanced. And so I'm I'm I'm definitely spending a lot of time finding patterns, and and happy to very excited to share those in the future. Are there particular companies that you see doing particular let me rephrase this. Are there particular companies that you think are doing excellent work, or have

Conor Bronsdon 48:08 use cases that are especially exciting?

Malte Ubl 48:11 Yeah, this this is a good question. I think the way the space is, I think, set up right now is that, you know, they have the the coding space, which I think, obviously is exciting, and I think there's more innovation to come, right? I don't think we're in any way of a steady state. No. I already mentioned that I am personally very excited about AI based code review,

Malte Ubl 48:37 and we talked about security, think that's particularly interesting insight there. So I think we'll live in a world where I don't have a winner takes all market for a review agent because I I want like the security expert. And I want security expert on this topic, and then I want the world class team on this other topic, and I also want their stuff. And like, the

Malte Ubl 49:05 marginal cost for me having two is very low, right? So I really see us being in a world where you hire quote unquote all these experts, so that's gonna be a thing. Obviously, we are quite far along on the support agent side, and I think we that's certainly the most mature side on the Internet. I'm personally super excited and investing a lot of time in agentic commerce,

Malte Ubl 49:35 which I think is gonna be the first use case in the kind of consumer side that's gonna take off beyond the the chatbots, like the the the big chatbots, right, where there there is just like so much creativity to to be unlocked, for example, in in finding like, in doing coherent closing styles between various brands and so forth, where, like, AI really really really can help in maybe that's just me. I I would love to have someone buy my clothes for me.

Conor Bronsdon 50:09 No. I I think there are some exciting opportunities in ecommerce. No question. And speaking of exciting opportunities with AI, I'd love to give our audience the opportunity to learn a bit more about Vercel and what you're building. Where can our listeners go to keep up with the truly innovative work that Vercel is doing in the AI space? Yeah. I think we so we

Malte Ubl 50:31 we have a few kind of things in in the space that I think people should check out. I talked a lot about the AI SDK. That's probably it's the most tangible thing. Right? Okay. And I think, you know, we have the new version coming out. If you start today, do with the beta. If you listen to this podcast when it actually comes out, probably version five is NGA, so aisdk.dev,

Malte Ubl 50:52 that's the place to go. I also mentioned chat SDK, which is like the specific, I'm gonna build a chatbot and I'm starting with one that has already all the features. So if that's what you're going for, I definitely encourage people to go. We did drop mention of v zero a couple times in this conversation, so v zero dot dev is our software that knows how to take your idea into a working full stack app,

Malte Ubl 51:22 and yeah, I think that has been revolutionizing how people collaborate in early stages of a project, and for some folks actually bring it all the way into production, but I think what we are really focusing on that is in this in this ecosystem of of kind of folks who are quote unquote tech adjacent and people who are writing code that there's a really smooth transition between them.

Malte Ubl 51:46 And yeah, otherwise, you can always find me on the x platform, the everything app, and you know, or LinkedIn if that's your vibe. Like, we're trying to be super accessible answering questions, especially if you like my number one priority is that you guys are having a easy time deploying agents to Vercel. If that doesn't work, please ping me, DM me, etcetera, because that's our number one priority that this new type of software actually has a good place to run-in the future. Yeah. I believe it's at cramforce,

Malte Ubl 52:21 if I recall correctly. Yeah. I don't wanna I don't love it, but it's unfortunately way too late to change. It is memorable, so that's good. It is memorable.

Conor Bronsdon 52:30 Well, Malte, so good to chat with you. Fantastic to catch up, and thanks so much for joining us on that podcast today. It's been a ton of fun. Thanks for having me. It was super fun. That's a wrap for today's episode. Malta, thanks again for coming on. And as we continue season two of Chain of Thought, we're committed to bringing you even more valuable conversations

Conor Bronsdon 52:49 with the brightest minds in AI like Malta, even if they are occasionally throwing a little water on the fire that is our burning enthusiasm. I think that's a really important element of these conversations. And it's one of the reasons why I was particularly excited to have Malta on is that only Malta are and the team at Vercel building incredible things like their AI SDK,

Conor Bronsdon 53:11 vZero, to enable developers and builders and nontechnical folks around the world to build with AI, but they're also really thinking through, know, what do you need to do to observe, to evaluate, to improve your AI applications? And our our goal is to empower builders around the world, whether they're full time engineers, whether you're someone who's learning, with the the knowledge and insights you need to build the future of AI, to build your website, to build whatever it is you want, because that is the exciting

Conor Bronsdon 53:41 part of these AI tools is the creativity that it can unlock and the speed with which it can enable you to get to that MVP. And the best way that you can help us reach more leaders like Malta is by leaving a quick rating or review on your favorite podcasting app, you know, giving us a like on YouTube if you're watching on there. It really does make a difference. It helps us have more incredible conversations like this one. And with that, thank you so much for tuning in, and we'll catch you next week with another episode of Chain of Thought. Malte, thanks again. It's been a pleasure. Thank you.