BLEU Score
BLEU scores machine-generated text by how much its word sequences overlap with one or more human reference texts. It was built for machine translation, ranges from 0 to 1 (often reported as 0 to 100), and is fast and cheap to compute, but it rewards surface word-matching, not meaning.
Also known as: BLEU
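BLEU (Bilingual Evaluation Understudy) compares a model’s output to one or more human reference texts by counting the short word sequences (n-grams, usually 1 to 4 words long) they share. Each n-gram is credited at most as many times as it appears in a reference, the per-length precisions are combined as a geometric mean, and a brevity penalty lowers the score of outputs that are shorter than the reference. It became the default for machine translation because it’s automatic, cheap, and correlates reasonably with human judgment when averaged over a large test set.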
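The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `ngrams` and `sentence_bleu` are made up for this example, tokenization is a plain whitespace split, n-grams up to length 4 are weighted equally, and no smoothing is applied, so a candidate with no matching 4-gram scores 0. Established toolkits add standardized tokenization, smoothing, and corpus-level aggregation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU for whitespace-tokenized strings (illustrative)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # assumption: no smoothing, so one zero precision zeroes the score
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: compare against the reference length closest to the candidate's.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Calling `sentence_bleu("the cat sat on the", ["the cat sat on the mat"])` gives a perfect n-gram precision at every length, so the score is set entirely by the brevity penalty for the missing final word.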
Its weakness is that it scores form, not meaning. A correct paraphrase that uses different words gets a low BLEU, and a fluent-but-wrong output that reuses reference words can score high. So BLEU is useful for tracking regressions on translation-style tasks where references exist, but it’s a poor judge of open-ended generation. That gap is why embedding-based metrics like BERTScore and LLM-as-a-judge evaluation exist alongside it.
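A small experiment makes the form-versus-meaning gap concrete. The sketch below computes only BLEU's building block, clipped n-gram precision, for two candidates against one reference. The helper `clipped_precision` and the example sentences are invented for this illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference (counts clipped)."""
    cand = candidate.split()
    ref = reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


reference = "the patient should take the medicine twice a day"
# Same meaning, different words.
paraphrase = "patients ought to have their medication two times daily"
# Opposite meaning, nearly identical words.
negated = "the patient should not take the medicine twice a day"

for label, candidate in [("paraphrase", paraphrase), ("negated", negated)]:
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, 5)]
    print(label, [round(p, 3) for p in precisions])
```

The negated sentence matches 9 of its 10 unigrams and 3 of its 7 four-grams, so it would earn a high BLEU despite reversing the instruction. The paraphrase shares no n-grams with the reference at any length and would score 0.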