AI Glossary

Cohen's Kappa

Cohen's Kappa measures how much two raters agree beyond what you'd expect from random chance. It matters for AI because it's how you check whether your human labels are consistent enough to trust as ground truth, and whether an LLM judge agrees with humans closely enough to stand in for them.

Also known as: kappa, inter-rater agreement

AI Evaluation & Reliability

When two people label the same data, some of their agreement is just luck: with two equally likely options, raters picking at random would still agree half the time, and the chance rate climbs higher when one label dominates. Cohen’s Kappa corrects for that, scoring agreement on a scale where 0 is chance-level and 1 is perfect; negative values mean the raters agree less often than chance would predict. It’s the standard check on whether a labeling scheme is reliable or whether the raters are interpreting it differently.
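The correction is usually written as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of items the raters agree on and p_e is the agreement expected by chance given how often each rater uses each label. A minimal sketch in plain Python follows; the function name `cohen_kappa` and the spam/ham labels are made up for illustration and do not come from any particular library.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability the raters match if each labels
    # independently, following their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)

# Illustrative data: 10 items, each rater marks 4 as spam and 6 as ham.
rater_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham"]
rater_b = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")  # kappa = 0.583
```

Here the raters agree on 8 of 10 items (p_o = 0.8), but their label frequencies alone would produce 0.52 agreement by chance, so the raw 80% shrinks to a kappa of about 0.58.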

For AI it shows up in two places. First, your evaluation is only as trustworthy as your labels, so low kappa between annotators means your “ground truth” is shaky. Second, it’s how you validate an LLM-as-a-judge: compute the kappa between the judge’s labels and human raters’ labels on the same items, and if it comes out low, the automated judge isn’t ready to replace people on that task.
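Both checks can be sketched together. The example below is only illustrative: the pass/fail labels are invented, and the rule of comparing the judge's average kappa with humans against the human-to-human kappa is one possible assumption, not an established threshold.

```python
from collections import Counter

def kappa(x, y):
    """Compact Cohen's kappa; undefined (division by zero) when p_e == 1."""
    n = len(x)
    p_o = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    p_e = sum(cx[k] * cy[k] for k in cx) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for the same 8 model outputs.
human_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
human_2 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge   = ["pass", "fail", "fail", "pass", "fail", "fail", "pass", "pass"]

# Check 1: do the human annotators agree with each other?
human_kappa = kappa(human_1, human_2)

# Check 2: does the LLM judge agree with each human?
judge_kappas = [kappa(judge, human_1), kappa(judge, human_2)]
mean_judge_kappa = sum(judge_kappas) / len(judge_kappas)

# Assumed decision rule: the judge should agree with humans at least
# as well as the humans agree with each other.
judge_ready = mean_judge_kappa >= human_kappa

print(f"human vs human: {human_kappa:.2f}")       # human vs human: 0.75
print(f"judge vs humans: {mean_judge_kappa:.3f}")  # judge vs humans: 0.375
print(f"judge ready: {judge_ready}")               # judge ready: False
```

Using the human-to-human kappa as the bar gives the judge a task-specific target instead of a fixed cutoff, though a real validation would use far more than 8 items.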