How do you evaluate an AI agent?

The 3 Levels of Evaluating an AI Agent

You check it at three levels: the step (did it pick the right tool), the turn (did it do the steps in the right order), and the session (did the whole thing reach the right result). A single accuracy score hides all three, which is why agents that look fine in a demo fail in production.

1. Step level: did it choose the right tool?

At each individual point in the workflow, did the agent pick the correct tool and call it correctly, with the right parameters? Most agent failures start here: the agent calls the wrong tool, or calls the right tool the wrong way.
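
A minimal sketch of a step-level check, assuming you log each tool call and have a reference call to compare it against. The `ToolCall` type, the `score_step` function, and the flight-booking tool names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One logged tool invocation (hypothetical shape)."""
    tool: str
    params: dict = field(default_factory=dict)


def score_step(actual: ToolCall, expected: ToolCall) -> dict:
    """Check one step: was it the right tool, called with the right parameters?"""
    tool_ok = actual.tool == expected.tool
    # Parameters only count as correct if the tool itself was correct.
    params_ok = tool_ok and actual.params == expected.params
    return {"tool_ok": tool_ok, "params_ok": params_ok}


# Hypothetical example: right tool, wrong destination parameter.
expected = ToolCall("search_flights", {"origin": "SFO", "dest": "JFK"})
actual = ToolCall("search_flights", {"origin": "SFO", "dest": "LGA"})
result = score_step(actual, expected)
```

Reporting the tool choice and the parameters separately tells you which of the two failure modes you are looking at. Exact-match comparison on parameters is an assumption here; your use case may need something looser.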

2. Turn level: did it do the steps in the right order?

Zoom out one level. Were the steps performed in the correct sequence to reach the correct conclusion? An agent can make every individual call correctly and still fail because it made them out of order or looped without stopping.
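
One way to sketch a turn-level check, assuming you have the ordered list of tool names the agent called and a reference order to compare against. The function names, the `max_repeats` threshold, and the tool names are hypothetical.

```python
from itertools import groupby


def in_expected_order(actual: list[str], expected: list[str]) -> bool:
    """True if the expected tools appear in `actual` in the same relative order."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator, so each match must come
    # after the previous one.
    return all(tool in remaining for tool in expected)


def longest_repeat(actual: list[str]) -> int:
    """Length of the longest run of the same tool called back to back."""
    return max((len(list(run)) for _, run in groupby(actual)), default=0)


def score_turn(actual: list[str], expected: list[str], max_repeats: int = 3) -> dict:
    """Check sequence and looping for one turn."""
    return {
        "order_ok": in_expected_order(actual, expected),
        "loop_free": longest_repeat(actual) <= max_repeats,
    }


# Hypothetical trace: books before searching, then searches four times in a row.
bad_trace = ["book_flight"] + ["search_flights"] * 4
bad = score_turn(bad_trace, ["search_flights", "book_flight"])
good = score_turn(["search_flights", "book_flight"], ["search_flights", "book_flight"])
```

Every call in the bad trace could pass a step-level check on its own; only looking at the sequence shows the problem. The repeat threshold is an assumption you would tune for your own workflow.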

3. Session level: was the final result actually right?

The whole workflow, end to end. Did the agent deliver the outcome the user actually wanted? This is the level that matters to the person using it, and the one a step-level score can completely miss.
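
A rough sketch of a session-level check, assuming the outcome can be captured as a final state and the user's goal as the fields they care about. The `score_session` function and all the data below are hypothetical.

```python
def score_session(final_state: dict, goal: dict) -> bool:
    """True only if every field the user cared about ended up with the wanted value."""
    return all(final_state.get(key) == value for key, value in goal.items())


# Hypothetical session: the user wanted a booked flight to JFK.
goal = {"booked": True, "dest": "JFK"}
final_state = {"booked": True, "dest": "LGA", "price": 320}

# Hypothetical step-level results: every individual call looked fine.
step_scores = [True, True, True]
step_accuracy = sum(step_scores) / len(step_scores)

session_ok = score_session(final_state, goal)
```

Here the step-level accuracy is perfect and the session still fails, because the flight was booked to the wrong airport. That gap is what a single score hides.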

The trap to avoid

Chip Huyen’s warning: a “correct” answer is not always a “good” answer. A job-fit bot that tells a candidate “you are a terrible fit” can be technically correct and completely useless. So evaluating an agent is less about one number and more about defining, at each level, what good actually looks like for your use case.

Why it matters

Only a small share of teams building agents are evaluating them well. The ones who ship reliably are the ones who stopped treating evaluation as a single score and started checking step, turn, and session separately.