Inference

Inference

Inference is running a trained model to produce output — the part that happens every time a user sends a prompt. It's distinct from training (teaching the model in the first place): training is a big one-time cost, inference is the recurring cost you pay on every request, forever.

Also known as: inference vs training

Jun 17, 2026 · Chain of Thought

There are two phases in a model’s life. Training is the expensive, one-time process of learning from data — huge compute, done by whoever builds the model. Inference is what happens afterward: you feed in a prompt and the model computes an answer. Every chat message, every agent step, every API call is an inference.

For most teams building on existing models, inference is the cost that matters, because it recurs on every request and scales with usage — while training (or fine-tuning) is occasional. That’s why so much practical AI engineering is inference-cost work: picking a smaller model, quantizing, trimming context, caching. Understanding the split is the first step to controlling spend: you rarely pay to train, but you pay for inference every single time someone uses the thing.

Related terms