Why won’t most websites get APIs for AI agents?

Dhruv Batra argues infrastructure evolves slowly and the long tail has no incentive to rebuild. School districts (20,000–40,000 in the US), government offices, and hundreds of thousands of e-commerce sites were built for humans and won’t expose agent APIs overnight. Much of the resistance is socio-political, not technical, so coding agents alone can’t solve it.

What does “pixels in, clicks out” mean?

It is Batra’s framing for how agents should act on the web. The source of truth is what a human sees on screen, not the DOM or accessibility tree. If a machine can perceive the pixels, click the buttons, and report that it completed the task, that capability is effectively the API — no re-architecting required.

What is Yutori’s Navigator and how is it different?

Navigator is Yutori’s in-house, post-trained browser and computer-use model, named after Netscape Navigator. It perceives pixels natively and, in version N1.5, writes custom JavaScript to shorten task trajectories. On browser benchmarks it edges out Opus 4.7 and GPT-5.5 on accuracy while running 2–3× faster and 4–5× cheaper, because it is small and specialized rather than general-purpose.

How does Yutori train browser agents without cloning websites?

Yutori’s agents learn by interacting with live websites and real traffic. Rather than building synthetic copies — you can’t reproduce a site’s backend — it uses privileged signals such as URL query parameters that appear after a filter is selected as verifier information the agent itself doesn’t see, making rewards easy to check while the agent still acts only on pixels.

Why are specialized small AI models becoming more important?

Batra points to scarce inference compute: a three-to-four-year-old H100 costs more than it did a year ago, and enterprises are burning annual AI spend in a month. Throwing the largest model at every task stops penciling out, so the pressure is toward smaller, cheaper, task-specific models, including ones that run on device for cost and privacy.

What does the company name Yutori mean?

Yutori is a Japanese word for the sense of well-being that comes from living with mental spaciousness. Batra frames the point of AI as giving people breathing room by taking the mundane and repetitive work off their plates, not making them run faster on a treadmill.

S3 E63

Most of the Web Will Never Get APIs for AI Agents | Dhruv Batra

Name: Most of the Web Will Never Get APIs for AI Agents | Dhruv Batra
Uploaded: 2026-06-18T10:00:00.000Z
Duration: 55 min
Description: Dhruv Batra, co-founder and chief scientist of Yutori and former head of embodied AI at Meta's FAIR lab, argues that most of the web will never expose APIs for AI agents. He explains why Yutori trains specialized browser agents to perceive pixels and click buttons the way people do, and why they run faster and cheaper than frontier models.

AI Agents

Key takeaways

Most of the web will never expose APIs for AI agents. The long tail — tens of thousands of school district sites, government offices, hundreds of thousands of e-commerce pages — was built for humans and won’t re-architect itself for years. Batra’s model for action: pixels in, clicks out.
The web is a shared roadway. Just as roads are shared between human drivers and self-driving cars rather than given dedicated autonomous-vehicle lanes, agents will operate the human-built web the way people do instead of waiting for it to be rebuilt for machines.
Yutori’s Navigator edges out Opus 4.7 and GPT-5.5 on browser-benchmark accuracy, but the gap Batra emphasizes is speed and cost: it runs 2–3× faster and 4–5× cheaper because it is a small, specialized model built for browser and computer use rather than a general-purpose frontier model.
Navigator N1.5 writes custom JavaScript on the fly to shorten task trajectories — filling multiple form fields in one action, or jumping a date picker forward instead of clicking “next” twelve times — while still treating the pixels a human sees as the source of truth rather than parsing the DOM.
Yutori increasingly trains on live websites rather than relying on cloned synthetic ones. It uses signals like URL query parameters as privileged information for verifiers that the agent itself never sees, since you cannot reliably clone a site’s backend.
Specialized, smaller models are becoming an economic necessity. Batra notes a three-to-four-year-old H100 now costs more than when he signed his contracts, and enterprises are burning annual AI budgets in a month, pushing models smaller, cheaper, and onto devices.

Show notes

Most of the web will never get APIs for AI agents. School district sites, small business pages, government offices, and the long tail of e-commerce were built for humans, and they will keep working that way for years. So how do agents actually get things done across the web?

Dhruv Batra is co-founder and chief scientist of Yutori, the company building specialized browser and computer-use agents. He previously led embodied AI at Meta's FAIR lab, training robots in simulation and shipping the image question-answering model on Ray-Ban Meta glasses. His bet: the web is a shared roadway, much like roads split between human drivers and self-driving cars, and agents will be built to use it the way people already do.

Pixels in, clicks out. That is the API.

In this conversation:

Why the long tail of the web won't re-architect itself for agents
How Yutori's Navigator perceives pixels and writes JavaScript on the fly to shorten task trajectories
Why Navigator runs 2-3x faster and 4-5x cheaper than Opus 4.7 and GPT-5.5 on browser tasks
Learning from live websites, and using URL query parameters as privileged verifiers instead of cloning sites
What the shift from American to Chinese open-weight models means for startups
How smart glasses and robots share the same perception-action loop
Why demand for inference compute is pushing models smaller and onto devices

Chapters:

(00:00) Pixels in, clicks out
(01:37) Why most of the web will never get APIs
(08:47) Aggregation, specialization, and human friction
(11:39) Digital niches and specialized models
(16:41) The web's heavy tail and where browser agents win
(20:40) Inside Yutori's Navigator and Scouts
(24:08) N1.5: writing JavaScript to cut trajectory length
(27:45) Training on live websites
(33:29) Open source: FAIR's legacy and the Chinese frontier
(37:22) Agent frameworks: OpenClaw, Hermes, heartbeats
(40:57) How non-technical users adopt agents
(44:25) Smart glasses, robotics, and embodied AI
(50:57) Compute demand and smaller on-device models
(53:12) Why the company is called Yutori

Connect with Dhruv Batra:

LinkedIn: https://www.linkedin.com/in/dhruv-batra-dbatra/
X/Twitter: https://x.com/DhruvBatra_
Yutori: https://yutori.com

Connect with Conor:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

More episodes: https://chainofthought.show

Transcript

40 segments

Dhruv Batra 0:00 Digital surfaces like the web are going to be shared roadways for a long time to come. Some drivers of action are going to be humans. The web was primarily designed for human consumption. If a machine can perceive pixels and click buttons and do the things and report to you that yes, I'm able to do it. There, that's your API.

Conor Bronsdon 0:22 Most of the web is not getting APIs for AI agents. School district sites, small business pages, the long tail that actually runs the internet, all the useful things that drive the day-to-day of our children's lives and much of our own, they look like they were built in 2002, maybe 2008, and they're going to keep looking that way for years. So how do agents actually navigate and take action across the web? We're going to discuss it today. Welcome back to the Chain of Thought, everyone. I am your host, Conor Bronsdon. My guest today argues that the entire premise of agentifying the web is backwards. The web isn't going to be rapidly re-architected for agents, or at least much of it will not be. Agents are going to be built to use the web the way humans already do. Pixels in, clicks out. As humans and agents share the digital roadways, Dhruv Batra, co-founder and chief scientist of Yutori, is here to talk us through his approach. Previously, he led embodied AI at Meta's Fundamental AI Research lab, FAIR, training robots in simulation and deploying them on Boston Dynamics' Spot, plus much more. Dhruv, welcome to Chain of Thought. It's great to see you.

Dhruv Batra 1:34 Thank you for having me, Conor. Wonderful to be here.

Conor Bronsdon 1:37 I'm really excited to have this conversation because the more I dig into what you're doing at Yutori, the more interesting I find it. Yutori's agents are in use around the world, with Yutori Navigator N1.5 beating out Opus 4.7 and GPT-5.5 with 89.7% average accuracy across 300 everyday web tasks for your NaviBench V2. They're looking great on cost per task and latency. And it's clear that your different thinking about specialization and the agentic web is driving this. If most of the web is never going to expose APIs to agents, what does the actual infrastructure for agentic action on the web look like?

Dhruv Batra 2:20 Yeah, a wonderful question and getting to the heart of the matter. Okay, so a few thoughts. I think at a high level, it is beginning to be clear to people that the drivers of action on the web and digital surfaces generally are going to be agents, not humans. And maybe, you know, in certain circles it was clear a few months ago, maybe in certain circles it's not clear. It certainly wasn't clear two years ago when we started Yutori. This is pre even the coding agent revolution, so it was certainly not clear that that was going to happen. Today, if you talk to a software engineer, I'm sure they can relate to the feeling of one day telling someone, did we really type every character in every line of code that we used to write by hand? And I hope in a few years, we are asking that same question about operating UIs, whether that's on browsers or other surfaces. Did we really, you know, go to websites, fill out forms, typing our names, clicking on buttons, dragging sliders along? Why don't we have agents do it? When people think about agentifying the web, we are right now quite influenced by tool-calling models and coding agents. We think that that's the way to solve this problem: that suddenly, overnight, websites will just reinvent themselves and offer APIs that they have not offered before, and it'll just be a CLI or an API call that lets agents act. That it's an intelligence problem: we've solved the tool calling and now the infrastructure just has to get there. Unfortunately, that is not how the world works. Infrastructure takes extremely long to evolve and change. I live in San Francisco. The Golden Gate Park tennis court reservation system website was written by some indie developer. That website has a perfectly functioning calendar view reservation system. It's not going to reinvent itself overnight. Government offices do not reinvent themselves overnight. 
There are hundreds of thousands of e-commerce websites. The head of that distribution, if the infrastructure is built by the Shopifys of the world, will perhaps expose APIs, but the long tail does not reinvent itself overnight. So I'll give you an example. Say you ask a simple question: I have a product, and that product is sold on multiple vendor websites, you know, hundreds of e-commerce websites. Could you just tell me how much it costs to ship this product to my house? Not the listed price, by the way, just the shipping price. There is no API available for this. E-commerce websites expose that information after you add the item to cart and put in your zip code information, and then you'll see a listing of two or three options. Answering that question requires a computer-use or a browser-use model, or an agent that does the same process that a human does, which is go to that website, add the item to cart, put in the zip code, and then that information is available to you. There are something like 20,000 to 40,000 school districts in the US. They each have their own associated website identifying the leadership of that school district, identifying curriculum, identifying what they currently need. If you want to ask a simple question like, is one of them requesting proposals right now for purchasing laptops? Answering that question is extremely hard. We live in a world where we still sign on tablets. Signatures are still a thing, yet people imagine that somehow overnight infrastructure will reinvent itself and, you know, MCP-ify or API-ify itself. The analogy that you were referring to that I really like is we should think of the physical infrastructure that exists in front of us, like roads. 
Roadways: if we wanted to solve autonomous driving and make it really simple, we could do that by just making dedicated lanes for autonomous cars, embedding them with sensors that make things easy for autonomous cars and, you know, having perfect autonomy in there. However, we realized that reinventing infrastructure, roadway infrastructure, is quite expensive. It's not going to happen overnight. And what we have in practice is shared autonomy. That roadway is shared between human drivers and autonomous cars. And what I would like people to think about is that digital surfaces like the web are going to be shared roadways for a long time to come. Some drivers of action are going to be humans, and that's who the roadway was designed for. The web was primarily designed for human consumption, and we've spent 30 years of development of browsers and websites and all of that ecosystem making it extremely efficient to display rich visual information to humans, because that is a high-bandwidth input signal into the human brain, and to get information from humans with clicks, typing, and GUI interaction. And that is the roadway that we're going to have to share with digital robots that are going to be operating it like humans. And in some sense, agentification of the web is going to happen through machines that are acting like humans, perceiving pixels, clicking buttons, and as a consequence of that, offering APIs. Because, you know, if a machine can perceive pixels and click buttons and do the things and report to you that, yes, I'm able to do it, there, that's your API.

Conor Bronsdon 8:47 So this is an interesting frame because I don't disagree with you. I think we are overestimating how rapidly some of this infrastructure build-out is going to happen. And yet in the roadway example, physical infrastructure typically takes much longer than digital. So, I mean, we could deploy a bunch of coding agents to rewrite all the school district websites. We could also deploy specialized agents whose job is to aggregate all of this content and then provide an API to most other agents. So only certain ones need to be really good at navigating these, you know, gravel road sites, so to speak, if we think of certain ones as much better paved with an API, with a CLI, whatever. Do you think we're going to see a mix of that behavior, where we get this aggregation effect in certain areas and just more specialization around how certain agents are deployed to aggregate the information for agents that aren't as strong at navigating the web? Or do you think we're going to see a coalescence around, hey, agents have to be excellent at web navigation outside of the paved path?

Dhruv Batra 9:59 Yeah. So two thoughts again. I think at the top level I completely agree with the pushback that, you know, the physical world is slow for a reason. It has to abide by the laws of physics, and there's only so fast you can go. You can copy bits, you can't copy atoms. Construction takes time, which is not the same in the digital world. Agreed with that distinction. And that is actually one of the reasons why we are seeing so much progress in coding agents and software agents and digital agents before we see progress in robotics. And you alluded to this at the top of your podcast, but the reason why I keep appealing to robotics and physical spaces is because that's what I used to do and spend time thinking about. And so this is an analogy that I reach for quite frequently. 
But the second point: I see what you're saying, that yes, some change in agentification will be driven by access to coding agents. Yes, it's absolutely true that if the bottleneck was infrastructural change or adoption of new libraries or frameworks, then coding agents can do it. But some of the change, some of the resistance, is just socio-political human organization, and coding agents don't solve that.

Conor Bronsdon 11:17 Totally.

Conor Bronsdon 11:22 Is my principal going to hire someone to actually rebuild the website and put an API on it? Probably not a priority.

Dhruv Batra 11:29 Yeah, rollout takes time across human, distributed human organizations. And, you know, we are seeing that it's like,

Dhruv Batra 11:39 if tomorrow you could have an agent that can rewrite complete operating systems, is it really the case that every single laptop will ship with that new operating system? It's not gonna happen. Even if you could write an operating system from scratch, the rollout takes time. So with that in mind, I think the other thing to keep in mind is there are actually certain analogies to laws of nature, if you will. I also find myself appealing to analogies to the biological world and evolution. In the biological world, there are biological niches that select for specialization of bodies and brains. Is it not the case that we have organisms on Earth that are highly specialized for certain niches? Flying, the evolution of wings, happened multiple times. The streamlined bodies and echolocation of underwater animals have happened multiple times. I think there are analogies from that to digital niches that select for models. If you are operating in a problem domain like math or coding, where really it's a reasoning problem, you can take a lot of time in thinking because you're operating in asynchronous mode. There's a certain kind of model that appears. Think about a domain like audio interaction with another human being. You have a hundred-millisecond latency constraint. If you are slower than that, then you're going to end up causing unnatural pauses or gaps in the conversation. And so a certain kind of model needs to be built, and vertically on top of that, certain kinds of companies and products get built. You have to specialize into an audio-processing, full-duplex model. You then build voice agents, which are then serving customer support. And I would make the analogy to UI interaction as a digital niche. We have gone ahead and plastered screens over our physical world. The reason why we've done that is because that's a high-bandwidth communication channel to the human brain. Those screens are going to be mixed-autonomy roadways. 
And in order to operate those screens, you're going to need specialized brains or models that do that. And the constraints there are: you have to perceive pixels natively. You have to operate extremely cheaply because this thing is going to be running in a for loop. Because you have to perceive, act, perceive, act, not quite at the same rate as a robot does. You don't need 30 frames per second, 60 frames per second. Your trajectory lengths are going to be long. And so this has to be cheap, otherwise the tasks that you want to execute don't make economic sense. And it has to be low-latency enough. And that I think is an entirely different ballgame.
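
Editor's sketch: the "running in a for loop" point above can be illustrated with a toy perceive-act loop. This is a minimal sketch under stated assumptions, not Yutori's implementation: `ToyPage`, `run_episode`, and the per-step cost are hypothetical names and numbers, and a dictionary stands in for the pixels a real agent would receive.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    """Hypothetical stand-in for a browser tab: reach `target` by clicking 'next'."""
    position: int = 0
    target: int = 5

    def screenshot(self) -> dict:
        # A real agent would receive pixels; this toy returns the visible state.
        return {"position": self.position, "done": self.position == self.target}

    def click_next(self) -> None:
        self.position += 1


def run_episode(page: ToyPage, cost_per_step: float, max_steps: int = 50) -> tuple[int, float]:
    """Alternate perceive and act until the task is done or the step budget runs out."""
    steps = 0
    for _ in range(max_steps):
        observation = page.screenshot()  # perceive
        if observation["done"]:
            break
        page.click_next()                # act
        steps += 1
    # Total cost grows linearly with trajectory length (assumed flat cost per action).
    return steps, steps * cost_per_step


steps, cost = run_episode(ToyPage(target=5), cost_per_step=0.25)
print(steps, cost)  # prints: 5 1.25
```

Because cost scales with the number of loop iterations, a cheaper per-step model or a shorter trajectory both lower the cost of a task, which is the economic constraint described in the segment above.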

Conor Bronsdon 14:42 You brought up a couple of really interesting concepts. So one, specialization. You know, we alluded to it earlier: okay, we're going to see aggregation of some of this content. There are going to be business opportunities, people who are like, oh, let me help your agents more easily access parts of the web that are not, you know, paved path for agents. And the other thought that I loved you bringing up is evolution, this idea of digital niches. And I think we're actually seeing that emergent behavior with coding agents today, where coding agents with very little human guidance are evolving to fill certain digital niches. And we're seeing that the code that is output from them is solving certain problems that agents are having for themselves. And so there is this kind of self-evolving factor that is occurring. And you can see it really clearly when you look at something like the Hermes Agent or OpenClaw communities, where they are encouraging this behavior. They may churn 80% of a code base over a couple of months, and they're just going to keep going through it because they have access to so many tokens that they can just keep evolving on that code. And we're seeing this behavior happen with a lot of different code bases. I think it's interesting to think about that evolving pattern, this kind of new development pattern that's occurring, alongside the highly planned approach of, we're building some infrastructure. You know, what does the construction crew look like for it? Who's funding the rebuild, and which parts of the web get rebuilt first? And there are obviously incentives for this, right? Like the Fortune 500, you know, they will happily charge you a per-query toll for an agent-native API once volume justifies it, and they're all moving in this direction. How do you see this build-out kind of almost fracturing out based off of the kind of self-evolving side of it versus the very planned out? 
What incentives are you seeing at play?

Dhruv Batra 16:41 Yeah. I think, if we think about the distribution of websites out there and digital surfaces, like, you know, in some sense, what is a website? It is a front end that lets me interact with a database on the back end. And, you know, of what exists out there, the head of the distribution has enough incentives, is organized enough, is mature enough that it will respond to the incentives and provide structured pathways, like digital roadways. If you are holding on to either enterprise or customer data, that customer or that enterprise wants to interact with that data with an AI agent, which is how we are going to be interacting with the digital services. If you are big enough and sophisticated enough, you just build out those agentic use cases on your end, or you just provide toll bridges where you are charging for access and things like that. I think that's absolutely where the head of the distribution, the top 10, 50, 100, 150, 500, will end up being. But I think people forget the heavy-tail nature of the web. The long and heavy tail does not have individual value, but it has cumulative value. And we've seen enough aggregation businesses built out that have displayed that. I mean, after all, what is an advertising business if not showing you the value of that heavy tail? And that heavy tail has to sort of become accessible to agents without having the investment to have done that on their end in the first place. And that is where I think browser agents, which is what we specialize in, are providing real value. Fundamentally,

Dhruv Batra 18:42 the way to think about it is, if you as a human can look at a screenshot and look at a browser screen and know which button to click, that is the operation that a browser agent has to do. We can automate that process. We can scale it by hundreds of thousands of parallel agents. We can make it asynchronous. We can make possible entirely new products that we couldn't have built before because it just didn't make sense. You were never going to pay a human developer or teams of human developers to write custom scripts for 500,000 websites to do extremely bespoke tasks and then pay them to manage and update those scripts. You can argue that coding agents can write scripts. Yes, but the websites change, they're dynamic. They're sometimes hostile. The web is laid out for human consumption. We recently put out, not recently, a few months ago, put out a blog post called The Bitter Lesson of Web Agents. People, I think, until they look under the hood, don't realize how messy websites are. There is no central owner of the web. You know, people propose standards. In fact, people propose 14 different standards and then there's the 15th standard proposed, which means there's no consistency. Sometimes certain buttons are labeled one way, sometimes another way. People think that accessibility trees are consistently enforced. They're not. In some sense, the reason why all of that is happening is because that's not the source of truth. The source of truth is what the human sees. And that's what you can automate and act on.

Conor Bronsdon 20:28 How does all of this inform the decisions you've made at Yutori about how you're building your agents and your Scouts?

Dhruv Batra 20:40 Yeah, so first of all, the sort of core expertise that Yutori has and the core product that we build are our own in-house models for computer use and browser use. We've recently put out a comparison; three weeks ago we released the agent, and the model is called Navigator. It's actually a reference to Netscape Navigator from the 90s, which was the first browser that went mainstream for humans. And we make a reference that Netscape Navigator was for humans, Yutori Navigator is for machines, and we're reinventing that interaction. It's a model that we post-train in-house. The most recent comparisons were to Opus 4.7 and GPT-5.5 on standard browser-use benchmarks. We are slightly more accurate. I wouldn't make a big deal about that gap in accuracy. What I would make a big deal about is we are somewhere between 2 and 3x faster and 4 and 5x cheaper. And that is because it is not a general-purpose model. It is a specialized, small, light-footprint model optimized for browser and computer use. What we've described is, you know, everything in the stages of post-training, from SFT and rejection sampling to on-policy asynchronous RL on synthetic and real websites. So our agents are learning from interactions with real websites and real traffic going out into the world. And we use that to get access to the long tail of the web on our consumer-facing products. One of them is called Scouts, where it's a monitoring agent. It's deep research in a cron job. So you can ask things like, hey, let me know whenever my favorite artist comes into town, or let me know when this product price drops below a threshold, or, you know, if you're a sales team and you're monitoring a particular set of customers, let me know when they announce something. And it's a multi-agent architecture. Whenever there are APIs, we will use them. Web search, whatever. But the long tail of the web doesn't have APIs and for that we launched navigators to Patcher 7. 
So, in essence, to answer your question, we made a bet early on that having an in-house, lightweight, cheap, fast browser-use and computer-use model was necessary for unlocking operations and workflows on the web. And we're seeing that play out in the sense that we offer that both on a consumer side and on an API side, packaging it up at scale for enterprises.
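
Editor's sketch: the monitoring pattern described above (use an API when one exists, fall back to a browser agent for the long tail, alert when a price drops below a threshold) might look roughly like this. It is a hedged illustration only: `fetch_price`, `check_price_alert`, and the fetcher callables are hypothetical, and nothing here reflects how Scouts is actually built.

```python
from typing import Callable, Optional

# Hypothetical type: any callable that returns a price for a product URL.
PriceFetcher = Callable[[str], float]


def fetch_price(url: str, api_fetcher: Optional[PriceFetcher],
                browser_fetcher: PriceFetcher) -> tuple[float, str]:
    """Prefer a structured API when one exists; otherwise fall back to a browser agent."""
    if api_fetcher is not None:
        return api_fetcher(url), "api"
    return browser_fetcher(url), "browser"


def check_price_alert(url: str, threshold: float, api_fetcher: Optional[PriceFetcher],
                      browser_fetcher: PriceFetcher) -> Optional[str]:
    """Return an alert message when the observed price is below the threshold, else None."""
    price, route = fetch_price(url, api_fetcher, browser_fetcher)
    if price < threshold:
        return f"{url}: price {price:.2f} is below {threshold:.2f} (via {route})"
    return None


# Assumed example: this site exposes no API, so the browser route is used.
print(check_price_alert("https://example.com/item", 20.0, None, lambda url: 18.5))
```

A scheduler such as cron would call `check_price_alert` on an interval; the scheduling itself is left out of the sketch.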

Conor Bronsdon 23:30 I love the specificity of what you're building for here, because I do think there is a little too much generalization at times as far as how people are deploying agents and models. And you have a very clear viewpoint on what you think is going to happen around the web agentification project and what we're seeing on the infra side there. Walk me through the technical side of building your new Navigator N1.5

Conor Bronsdon 23:59 model. What have you done to make it different and more successful on these tasks? What can you tell us about how it's constructed?

Dhruv Batra 24:08 Yeah. So the bet that we took on the construction side of the architecture was related to our observation. So a lot of the founding team comes from the post-training team for Llama 3.2. So a lot of us come from Meta, some from FAIR, some from Gen AI; some of us also come from Google and DeepMind and other places. The bet that we took in the construction of the model was that this would have to be a natively multimodal model. It has to perceive pixels. Our first version of the model, N1, was perceiving only pixels and producing only human-like actions: button clicks and type and scroll and so on. In N1.5, which went live a couple of weeks ago, we added the capability for the model to write custom JavaScript code on the fly. And that is a fairly different and unique thing that is out there. What that does is that when the model perceives opportunities for speeding up and doing things that can be done faster through code, it does so. So if it's filling out a form, and it sees in its view that there are five fields to be filled out or there are multiple checkboxes to be selected, instead of manually going and filling each one of them out with a for loop around the model, the model will just write an action, which is, I would like to execute JavaScript, and it will write that JavaScript function, which will have a for loop in it. And those selectors would just get selected. It can also do things like, if you're asking for information from the webpage, and that information is spread across multiple tabs or clicks. This commonly happens on e-commerce websites where you're asking for product variants and sizes. And so you'll see different buttons for sizes of the product and variants of the product. The model will just write a custom JavaScript function. This also happens in date picker interactions. 
If you have to pick a date that is 12 months out, the pixel-only models are sitting there clicking next 12 times to go to the next year, whereas this one can just write a custom interaction with it. The reason this is important is that it cuts the trajectory length, because ultimately what makes agentic tasks hard is long-horizon trajectories, and a compound action of code cuts the trajectory length. But the pixels serve as the source of truth. No human is involved in pre-designing how to read document object models or HTML or accessibility trees. We're not feature engineering. We're not parsing the DOM through manual expertise. We're just handing over that capability to the model. So the model will start out, look at the pixels, and write custom code to read, write, and manipulate the DOM. And that's what ends up being quite valuable in these cases.
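To make the trajectory-length point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Yutori's implementation: the `Click` and `ExecuteJs` action types, the helper names, and the CSS selectors inside the JavaScript string are all hypothetical. The only thing taken from the conversation is the date-picker example, where a pixel-only agent clicks "next" 12 times and a code-writing agent emits one compound action.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Click:
    """A human-like click on a visible target (named here by a label)."""
    target: str


@dataclass(frozen=True)
class ExecuteJs:
    """A compound action: a single step that runs model-written JavaScript."""
    source: str


Action = Union[Click, ExecuteJs]


def pixel_only_date_pick(months_ahead: int) -> List[Action]:
    """Click 'next month' once per month, then click the day."""
    steps: List[Action] = [Click("next-month") for _ in range(months_ahead)]
    steps.append(Click("day-cell"))
    return steps


def compound_date_pick(months_ahead: int) -> List[Action]:
    """Emit one JavaScript action whose loop performs the same clicks."""
    # Hypothetical selectors; a real page would use different ones.
    js = (
        f"for (let i = 0; i < {months_ahead}; i++) "
        "document.querySelector('.next-month').click(); "
        "document.querySelector('.day-cell').click();"
    )
    return [ExecuteJs(js)]


pixel_steps = pixel_only_date_pick(12)
code_steps = compound_date_pick(12)
print(len(pixel_steps), len(code_steps))  # prints: 13 1
```

Each entry in the list stands for one model step, so the compound action turns a 13-step trajectory into a single step. In this sketch the agent would still look at the pixels afterwards to confirm the date was actually selected.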

Conor Bronsdon 27:13 And I know your team wrote about the original Navigator model, which I think you released in November of last year: that it was initialized from Qwen3-VL, and that you were basically focusing on mid-training, supervised fine-tuning, and reinforcement learning to get it to where you wanted it to be on the Online-Mind2Web evaluation and these other web evals. What was the approach you took on this new Navigator 1.5 that helped you make the leaps you saw?

Dhruv Batra 27:45 Yeah, we haven't put out an associated blog post describing some of those things, but you can imagine a lot of these ideas carry through across model families. One of the interesting things that we do, that we haven't seen a lot of other people do, is learning by interaction with live websites. Very early on in the journey of our company, we went through the process of hiring human annotators to solve tasks by having them install Chrome extensions and click buttons while we recorded what they were doing. That does not scale. That approach just doesn't work. And ultimately the lesson here, which is not particularly unique, is that the more you can get out of the way of the model and scale data collection in an agentic way, the better it goes. Where I think we contribute a unique bit of information is to step aside even on the creation of synthetic websites. In many of these cases, we don't create synthetic websites to serve as containers, as RL environments to do RL in. You can actually go out into the world, and here's a trick that we've described and put out. The trick is that there are some websites that, after you select filters, will modify the URL and put in that filter choice as a query parameter. That observation, the fact that the URL has been modified and the query parameter now shows up in the URL, is information you can expose to a critic or a verifier while not feeding it to your actor, your model, your agent. That difference in privileged information means that you can write verifiers that are extremely easy to check, that are looking at privileged information, while your agent does not have access to that privileged information. It is still operating based on pixels. And that, you know, does not generalize.
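The URL trick described above can be sketched in a few lines of Python. This is a sketch under assumptions, not Yutori's code: the `BrowserState` and `Observation` types, the `actor_view` and `verify_filter` helpers, and the `shop.example` URL are hypothetical names introduced for illustration. It shows only the split Batra describes: the verifier reads the query parameter, while the agent receives pixels and nothing else.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class Observation:
    """What the agent (actor) is allowed to see: pixels only."""
    screenshot: bytes


@dataclass(frozen=True)
class BrowserState:
    """Full state; the URL is privileged and goes only to the verifier."""
    screenshot: bytes
    url: str


def actor_view(state: BrowserState) -> Observation:
    """Strip privileged fields before anything reaches the agent."""
    return Observation(screenshot=state.screenshot)


def verify_filter(state: BrowserState, param: str, expected: str) -> float:
    """Return reward 1.0 if the final URL carries the expected filter value."""
    query = parse_qs(urlparse(state.url).query)
    return 1.0 if expected in query.get(param, []) else 0.0


# Hypothetical end-of-episode state for the task "filter shoes to size 10".
final_state = BrowserState(
    screenshot=b"<png bytes>",
    url="https://shop.example/search?q=shoes&size=10",
)
print(verify_filter(final_state, "size", "10"))  # prints: 1.0
```

A check like this only works on sites that mirror filter choices into the URL, which matches the caveat in the conversation that the specific verifier does not generalize even though the idea of privileged information does.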
In fact, you first notice that you don't actually need verifiers to generalize; you just need your agents to generalize. But there is a nugget of that idea that does generalize: there are sources of privileged information out there that help you write verifiers. And that is more important than cloning websites. You don't need to write synthetic websites. In fact, there are many websites you cannot create synthetic versions of, because while coding agents may be fantastic at reproducing the front end, you do not have enough information to reproduce the backend, which is an obvious observation in hindsight. Like, how do you know how search on a particular website is implemented at the backend? So you're going to implement it differently, which means it's a different environment.

Conor Bronsdon 30:41 Super interesting. I'd love to understand more about your approach in post-training. Obviously you are doing a lot around leveraging live websites, which is super smart given the constraints you just described. I know you haven't shared all of this about N1.5 yet, but what can you tell us about other aspects of your approach to training the model so that it is excellent at the tasks that you are calling it to do?

Dhruv Batra 31:11 I think one observation that, at a high level of abstraction, perhaps seems obvious in hindsight is the value of data. It's not a unique thing to say, but I think it's worth repeating that if you want to build good models and good agents, you have to invest in that. Our agents go out into the world across an extremely wide variety of websites. And that, I think, is an extremely valuable source of data that gives us unique leverage. When people talk to our agents and say, I would like to monitor this piece of information, really it's a multi-agent system where there is a top-level orchestrator that is launching navigators. The consequence of that system is that many navigators are launched, and the prompts are long, detailed, and machine-written. All of those are agentic trajectories that we have access to. The breadth of the product, the fact that the product does not place any constraints on where you can send the model and the agents, means that we end up hitting an extremely large number of websites. We have an API product where we actually don't control what website you send our models to. So that, again, is the value of that diversity. Which is a long-winded way of saying it's the data.

Conor Bronsdon 32:51 Yeah, I think it's really interesting to see how many companies are now training models or post-training Chinese open source models. Obviously you haven't shared at this point what Navigator 1.5 is based on, but I can make some assumptions given that Qwen 3 was the base for Navigator N1. What's your perspective on this bifurcation we're seeing between Western or American open source and what is almost the open source frontier, the open-weight Chinese models?

Dhruv Batra 33:29 Yeah. This is a topic that is actually close to my heart because I was at FAIR for many years. The mission was, and is to a certain extent, though I think we can all agree it's not the same place anymore, basic fundamental research, advancing open source, shared freely with the world. And for a period of time, we created that world. For a period of time, there were major industry-unlocking innovations that were shared freely. The common example from my colleagues: PyTorch was invented for internal consumption [34:17 Conor Bronsdon: Totally.], and now the industry runs on it. My computer vision colleagues invented Segment Anything and the world's best object detectors, and that's used at a bunch of places. In our own small way, not at the same level of impact, we built the world's fastest 3D simulator, Habitat, and that was used for training virtual robots internally, but also in academia and in other industry labs. And so it was a unique time in a unique world.

Dhruv Batra 34:52 Today that mantle has been picked up by Chinese startups and companies. I think the fact that there are strong open source models available makes a certain kind of startup possible. They allow startups to quickly build things out on the post-training side; if there weren't any and you had to invest in pre-training, the level of investment required would work as a gatekeeping and thresholding mechanism.

Dhruv Batra 35:30 I'm not generally someone who thrives on fear mongering. I am generally positive and optimistic towards technology. I think we should be extremely skeptical when concerns about the future are interleaved with the sharing of knowledge. Security through obscurity has never been a good strategy. I think that is not an argument we should appeal to.

Conor Bronsdon 36:13 I mean, there are some major frontier labs that are attempting some regulatory capture, for sure. It just is what it is. And it's an understandable economic strategy for them, even if I don't personally like it. And it is really interesting to see how we have a wave of Silicon Valley startups that are being fueled by open-weight Chinese models, because that's where the frontier on the open-weight side is. That's not to say that Gemma 4 isn't a very good model, because it is, but you can look at Kimi K2.5, Qwen, DeepSeek; there are so many examples. This shift has been fascinating to see, and I'm sure you have plenty more thoughts on it. The other one that I think is interesting is the rise of some of these agent frameworks that are more generalized: OpenClaw, Hermes, others. What are you seeing at the frameworks layer on top of both paid frontier models and, obviously, open models as well?

Dhruv Batra 37:22 Yeah, I think the biggest innovation in the last year or so, in some sense, has been OpenClaw, which is just packed with four or five really fantastically good ideas. That's not to say there aren't other nice innovations, but there are a couple of really good ideas in there, just bangers. The fact that the agent has internal visibility into its own inner workings means that you just talk to it to make changes to it. That certainly exposes it to a broader audience. Then there's the memory system, the fact that it maintains a rich set of file-based memories. And there's the notion of a heartbeat, that it can schedule itself. When we launched Scouts, which was last year, somewhere around June, the notion of a cron job was central to Scouts as well: you run deep research in a cron job and you have to deduplicate against previous runs. We thought that was the clear differentiator and value proposition, which it was for a while. And then it obviously got copied at a bunch of places. The idea of routines and running cron jobs is where things have moved on to. And in some sense that has exploded the inference capacity needs as well. It's very interesting to see that play out. Which is all to say, to come back to your question, I think we're seeing really fantastic innovation at the intersection of the framework, harness, and application layers. The challenge, as you alluded to earlier as well, is this: one risk with exposing inner workings and adjusting the framework through natural language for non-sophisticated users is that you end up breaking the system and not realizing how to fix it. If you talk to OpenClaw long enough, you can ask it to do things, and if you're not sophisticated enough, you will end up breaking the system. You can find yourself needing another coding agent to go and fix this agent and figure out where the configs are wrong.
Some people can do that. I think some other people will struggle with that, which isn't to say this is an unsolvable problem; perhaps there is a good solution for this as well. Reinstall, start from scratch, import memories: you can come up with solutions around that. But there is a risk in this layer.

Conor Bronsdon 40:13 Yeah. How do you see the less technical layer of folks who are coming on board with OpenClaw as their solution? Maybe they're going to start using Yutori for their web search needs in the future. What do you see us needing to do to enable this kind of shift? At the frontline it's often engineers and relatively technical folks who are using agents, often for coding, and now we're seeing them spread throughout society. We're seeing infrastructure start to get rebuilt in some areas, as you said earlier. What does that shift look like, particularly for folks who maybe are going to be less aware of how these agents may impact their systems long term?

Dhruv Batra 40:57 Yeah. Making predictions about the future is hard. But one thing that comes to mind is, in some sense, the crypto analogy. And I realize there's a bunch of grift in crypto; let's leave that aside. Anytime a new technology is built, the early adopters tend to be technologically savvy, forward-looking people who want to make things work and are sophisticated. And then as it propagates, what we're seeing now in the crypto space is adoption by extremely large incumbents, centralized players. What you present as a value prop to an unsophisticated user is not the how. Why do I care what a blockchain is? My 80-year-old father is not going to care about what the blockchain is. But if the value proposition to him is, well, you can send money to someone and the cost of transfer is lower, okay, that's something he understands. He can compare it to other service providers for that. And I think it's going to play out similarly in the agent space.
Initially, it's all extremely tech-savvy, extremely sophisticated users. It's actually a surprise that the unsophisticated users have been sort of driven into the space. But ultimately, they're not going to care. If it's an agent, what they care about is: I can talk to it. It responds to messages on my behalf. It can book things on my behalf. It can do things on my behalf. Fantastic. Do you really care whether it's a browser-use model or an API or a CLI or an MCP? No, none of this stuff matters to an end user.

Conor Bronsdon 42:57 How is my problem solved?

Dhruv Batra 42:59 Yeah.

Conor Bronsdon 43:00 Interesting. Yeah. I do think we spend a lot of time arguing: MCP is dead, RAG is dead, CLIs are the future, no, APIs are the way to go, we're not going to build enough APIs. And in the end, the people who will end up using these agents are not going to care for the most part. They just simply won't. I think you pulled on a thread here that is really important, which is this idea that everything's becoming agentified. We're seeing backlash against AI, and yet the same people who are part of that backlash are often starting to use AI-enabled systems without even necessarily realizing that they are using agentified systems. And increasingly, we're just going to see this become more and more commonplace, whether it's Google Shopping agents, Yutori's new delegate agents that I saw you put out, or anything else. But we've talked a lot about agents. We also briefly touched on something that I know is very near and dear to your heart, which is hardware, and enabling hardware and AI together. You spent years leading embodied AI at FAIR, at Meta, training robots in simulation and getting them deployed in the real world. How do you see agents in today's AI interacting with the physical world over the next couple of years?

Dhruv Batra 44:25 Yeah, this is a topic that is dear to my heart. As you mentioned, I led FAIR Embodied AI. That was an interesting group in the sense that it was atypical to have a group doing both AI for robotics and AI for smart glasses. The reason for our existence, and the vision, was that you have to solve the same egocentric understanding and perception-action loops in robotics as you do for smart glasses. Whether a camera is attached to the face of a robot or sitting on the glasses that a human is wearing, you have to solve similar problems. There are, of course, unique differences. One of them has actuation through physical hardware, and in the other one you're just assisting a human. So yes, there are differences, but there are enough similarities, and that was what led us to approach this problem in a cohesive way. When I left Meta, I had posted on Twitter that I'm leaving, I'm going to do something else. One of the things that I said was that I continue to be excited about a future in which, and I had phrased it pretty boldly, every single interaction with the digital and physical world is mediated by an AI assistant.

Dhruv Batra 45:53 You can imagine how a statement like that goes on Twitter. Today, hopefully, people at least see the part about every single interaction with the digital world being mediated by an AI assistant as non-controversial. You talk to your phone, you talk to a browser; every app wants to stick a talk-to-our-AI-agent button on it. And we do. It's perhaps not controversial that you will talk to computers in the future and they will click buttons, right, call whatever, automate things. Hopefully that part is not controversial. I do imagine a future in which the interaction with the physical world is augmented and superpowered through AI. Intelligence is a lever. It is one of the greatest levers we have come up with. I think glasses are an extremely interesting form factor. At FAIR, one of my teams had built the first version of the image question answering model that shipped on the Ray-Ban Meta sunglasses. Of course, that was in collaboration with the product teams, since we were a research team, but you could invoke it by saying, hey, Meta, can you take a picture and describe this monument and give me more detail? That is a simple image question answering task. Incidentally, in 2025, my colleagues and I received the Mark Everingham Prize for introducing a problem called visual question answering in 2015. In 2015, we wrote a paper titled Visual Question Answering. We introduced the problem of answering questions about images, and we introduced a dataset, an evaluation metric, and a method, and that really blossomed into an entire subfield of what was known as vision and language research or multimodal research. And so seeing that journey through to the shipping of actual products, where you just talk to an assistant that's running on your glasses, was quite rewarding.
I think the idea, of course with all privacy considerations built in, is that you should have superpowers that come from the right devices, devices that can observe the world from your perspective and tell you things or even proactively take on tasks on your behalf. If I see the world from your perspective and I notice, and this was the common example, that there's an item that is out in your fridge, if I just see you toss an empty milk container in your trash, I can proactively reorder that item or come up with other things. Or if you meet a new person: today every single meeting recording software is introducing the meeting notes or briefs feature where, at the start of the meeting, it will tell you who you're meeting and what this meeting is about. Why doesn't that feature exist for live interactions, where in front of every interaction I know who this person is that I'm meeting? That's perhaps a bit dystopian, but you can see versions of that idea making sense. I really do think that the future is with superpowers. The idea of all of us being equipped with superpowers is an exciting future.

Conor Bronsdon 49:44 It's interesting you phrase it that way, because I think we've seen this conversation about AI being magic or superpowers on your computer already, where it's like, oh, I can do things with coding agents that I just simply couldn't before. I can deploy my army of robots. And that is starting to come to the physical world. As we've talked about this whole time, enabling this kind of mass transformation, and I think you and I both believe we are at the precipice, and maybe a bit over it, of this insane re-architecting of our world, even though parts are going to take longer than we expect, comes down in part to having enough compute. So we're seeing this massive data center build-out. We're seeing investments in edge models. Our conversation with Maxime Labonne over at Liquid AI, which we did last fall, is a great one focused on edge models. We're going to do a lot more in the future here. How do you see the next couple of years of the physical AI build-out, as far as the data center layer and the other infra layers that have to accompany all the work happening on the web layer, etc.?

Dhruv Batra 50:57 Just as an indicator of where the market demand is today: who could have seen it coming that an H100 chip that is three or four years into its life cycle is worth more? As a startup founder, I'm realizing that I'm paying more for it today than I did a year ago when I signed some contracts. So we live in a world with an almost insatiable demand for inference compute. What are the obvious consequences of that demand? Enterprises are waking up and realizing that they're burning through their annual spend commitments in a month. That does not seem sustainable to me. And so the one obvious consequence is that people are going to realize, maybe not suddenly, that there is a need for cost-efficient models, smaller, lighter-footprint models, and that you're not going to throw the biggest, beefiest thing at every small little task that you're going to do. The fact that we want to run these models on device means that they have to be smaller. Privacy constraints and considerations mean that, for some things, you want to keep your data on the device and not send it to a remote provider, which means that the models have to get smaller and run on device.

Dhruv Batra 52:22 I think the world is heterogeneous and interesting in the future. We've seen a lot of progress with massive models, with remote, API-based access, served in massive data centers. My bet on the future, and maybe I'm repeating myself, is digital niches. The world is heterogeneous; the same variety of flora and fauna that we see in the natural world is what we're going to see in the digital and the physical world.

Conor Bronsdon 53:01 I think that's a fantastic through line for this conversation. Dhruv, it's been such a pleasure chatting with you. Always great to get your insights. Any closing words for our audience as we wrap up?

Dhruv Batra 53:12 It's been fantastic chatting as well. I think these conversations really end up invigorating me. In terms of where people can learn more about us, it's yutori.com. I think the last note that I would share with people is that we chose the name Yutori because it's a Japanese word for the sense of well-being one experiences by living with mental spaciousness. The point of AI is not to drive anxiety and make us feel like we're on a treadmill, constantly running faster. The point of AI is to give us the breathing room to focus on the things that are really important to us and to take off our plates the mundane and the dull and the dreary and the routine and the mindless. So that's what I leave your audience with.

Conor Bronsdon 54:07 I love it. An inspirational, optimistic vision of the future, which is definitely something we try to drive towards on this show. Listeners, thank you so much for joining me today. Thanks for listening to Dhruv and me as we've had this conversation. Make sure that you have subscribed, liked, rated, reviewed, whatever you need to do on your podcasting app of choice, on YouTube, wherever you're listening. We appreciate you so much. And Dhruv, thank you again for coming on the show. It's been fantastic.

Dhruv Batra 54:31 Thank you brother. Thanks for having me.

Most of the Web Will Never Get APIs for AI Agents | Dhruv Batra
