Latency

Latency is how long an AI system takes to respond. For LLMs it splits into time-to-first-token (how fast output starts) and total generation time, and it's a first-class product metric — a more accurate model that's too slow can still be the wrong choice.

Jun 17, 2026 · Chain of Thought

Latency is response time, and for language models it has structure worth separating. Time-to-first-token is how long until the model starts streaming output, which users experience as “is it working?” Total latency is how long until the response finishes. Both depend on model size, prompt length, and how much the model reasons before answering.
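To make the two numbers concrete, here is a minimal sketch of measuring them over any streamed response. It assumes the response arrives as a Python iterable of text chunks; `LatencyReport`, `measure_latency`, and `fake_stream` are hypothetical names introduced for illustration, and `fake_stream` merely simulates a model with `time.sleep` rather than calling a real API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class LatencyReport:
    time_to_first_token: Optional[float]  # seconds; None if nothing was streamed
    total_latency: float                  # seconds until the stream finished
    chunk_count: int


def measure_latency(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> LatencyReport:
    """Consume a stream of chunks and time the first chunk and the whole response."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for _ in stream:
        if first is None:
            # The first chunk has arrived: this is time-to-first-token.
            first = clock() - start
        count += 1
    total = clock() - start
    return LatencyReport(first, total, count)


def fake_stream(
    chunks: Iterable[str],
    first_delay: float = 0.05,
    per_chunk_delay: float = 0.01,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response (not a real API)."""
    time.sleep(first_delay)  # simulated wait before any output appears
    for chunk in chunks:
        yield chunk
        time.sleep(per_chunk_delay)  # simulated per-chunk generation time


if __name__ == "__main__":
    report = measure_latency(fake_stream(["Hel", "lo", "!"]))
    print(report)
```

Because the generator is lazy, the simulated wait happens inside the timed loop, so the start time approximates the moment the request is sent. With a real client, the timer should likewise start just before the request is issued, not when the first chunk is read.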

Latency belongs next to accuracy in any real evaluation, because it’s a product decision as much as a technical one. A 100-millisecond interaction and a ten-second one are different products even at identical quality, which is why teams building interactive use cases routinely trade a little accuracy for a lot of speed by choosing a smaller model, trimming context, or caching.
