AI, decoded

How do you cut the cost of running an AI agent?

Most agent cost is hidden in the steps you can't see: redundant model calls, an oversized model doing a small job, bloated context sent on every turn, and retries from failures nobody caught. You cut it by first making the costs visible with tracing, then attacking the big drivers — route easy steps to a smaller or cheaper model, trim and cache context, cut needless tool calls and loops, and fix the failure modes that cause expensive retries. You can't optimize what you can't see, so observability comes first.

· Chain of Thought

Enterprise AIAI Observability

Find the cost before you cut it

Agent bills are rarely one big line item; they’re a thousand small calls. Without tracing, teams guess at where the money goes and optimize the wrong thing. Start by instrumenting the agent so you can see cost per step, per task, and per model call. The expensive patterns — a loop that retries silently, a step that calls the model three times when once would do — only show up when you can see the trace.

Right-size the model

The most common waste is sending every step to the biggest, most expensive model. Many steps don’t need it. Routing easy or structured sub-tasks to a smaller or cheaper model, and reserving the frontier model for the hard reasoning, can cut cost sharply with little quality loss. Intercom’s move from a general frontier model to a tuned smaller one is the extreme version of this: pick the cheapest model that still passes your evals.

Trim and cache context

Every token you send is a token you pay for, on every turn. Agents accumulate context fast, and much of it is dead weight by the third step. Compacting what’s in the window, dropping what the agent no longer needs, and caching stable context instead of resending it are some of the highest-leverage cost cuts available.

Kill the failure tax

Failures are expensive twice: the wasted call that failed, and the retry. An agent that loops, picks the wrong tool, or recovers badly burns budget on work that produces nothing. Fixing the top failure modes is a cost optimization as much as a reliability one — and latency work, like Superhuman’s, often cuts cost in the same pass.

From the conversation

This explainer is drawn from these episodes — each carries its full transcript.