Perplexity

Perplexity measures how surprised a language model is by a piece of text — lower means the model found it more predictable. It's a quick intrinsic gauge of how well a model fits a dataset, but it says little about whether the model is actually useful or correct.

Jun 16, 2026 · Chain of Thought

Perplexity scores how well a model predicts a sample of text. It is computed from the probabilities the model assigned to the actual next tokens: specifically, it is the exponential of the average negative log-probability per token. A lower perplexity means the text was less surprising to the model. It’s cheap to compute and handy for comparing language models on the same dataset or tracking a model during training.
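As a rough sketch of the calculation, perplexity can be computed as the exponential of the average negative log-probability per token. The function name `perplexity` and the sample probabilities below are made up for illustration and do not come from a real model.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each actual next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, in nats (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to get back to a "number of equally likely choices" scale.
    return math.exp(avg_nll)

# Hypothetical probabilities, for illustration only.
confident = [0.5, 0.5, 0.5, 0.5]   # model gave each actual token a 50% chance
uncertain = [0.1, 0.1, 0.1, 0.1]   # model gave each actual token a 10% chance

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
```

Here the first sequence comes out at roughly 2 and the second at roughly 10. One way to read the number: the model was about as uncertain as if it were choosing uniformly among that many options at each step.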

Its limit is that predicting text well isn’t the same as being helpful, truthful, or good at your task. A model can have low perplexity and still hallucinate, miss instructions, or fail your use case — and perplexity isn’t comparable across different tokenizers or datasets. So treat it as an intrinsic health check, not a measure of quality. For “is this good enough to ship,” you need task-level evaluation, not perplexity.
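To illustrate why perplexity isn’t comparable across tokenizers, the sketch below assumes two models assign the same total log-probability to the same text but split it into different numbers of tokens. The helper `perplexity_from_total`, the total of -12 nats, and the token counts are all hypothetical values chosen for the example.

```python
import math

def perplexity_from_total(total_log_prob, num_tokens):
    """Per-token perplexity given the summed log-probability (in nats) of a text."""
    return math.exp(-total_log_prob / num_tokens)

# Hypothetical: both models assign the same overall probability to the same text,
# but their tokenizers split it into different numbers of tokens.
total_log_prob = -12.0

ppl_a = perplexity_from_total(total_log_prob, num_tokens=4)  # coarser tokenizer
ppl_b = perplexity_from_total(total_log_prob, num_tokens=6)  # finer tokenizer
```

Under these assumptions the model with the finer tokenizer reports the lower per-token perplexity, even though both models found the text equally likely overall. The difference comes from how the text was split, not from how well it was predicted.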
