AI Glossary

Knowledge Distillation

Knowledge distillation trains a small 'student' model to imitate a larger 'teacher' model, transferring much of the teacher's capability into a model that's cheaper and faster to run. It's one of the main ways a strong but expensive model becomes small enough to ship.

Also known as: distillation, model distillation

AI Engineering

Distillation takes a large, capable “teacher” model and uses its outputs to train a much smaller “student.” Instead of learning only from raw labels, the student learns from the teacher’s richer signal, typically its full probability distribution over possible answers, which also shows how plausible the teacher finds each wrong answer. As a result, the student often performs better than its size would suggest, keeping much of the teacher’s quality at a fraction of the cost and latency.
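To make the "richer signal" concrete, here is a minimal sketch of one common way to write a distillation loss for a single classification example. It is an illustration under stated assumptions, not a method this entry specifies: the function names (`softmax`, `kl_divergence`, `distillation_loss`), the `temperature` and `alpha` parameters, and the example logits are all hypothetical, and the blend of a soft-target term with a hard-label term follows a widely used formulation rather than the only possible one.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term.

    alpha and temperature are illustrative hyperparameters (assumed values).
    """
    # Soft targets: compare softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 is a common convention that keeps the
    # soft-term gradients on a similar scale as the temperature changes.
    soft_term = kl_divergence(p_teacher, p_student) * temperature ** 2

    # Hard labels: ordinary cross-entropy against the true class at T = 1.
    hard_term = -math.log(softmax(student_logits)[true_label])

    return alpha * soft_term + (1 - alpha) * hard_term


# Hypothetical logits for a 3-class problem where class 0 is correct.
teacher = [4.0, 2.5, 0.1]
close_student = [3.5, 2.0, 0.0]   # roughly agrees with the teacher
far_student = [0.0, 0.5, 4.0]     # disagrees with the teacher

print(distillation_loss(close_student, teacher, true_label=0))
print(distillation_loss(far_student, teacher, true_label=0))
```

The student that roughly agrees with the teacher gets a much lower loss than the one that disagrees. In a real training loop the same quantity would be computed over batches with an autodiff framework, and the student's weights would be updated to reduce it.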

It’s one of the main techniques behind capable small models, alongside quantization and good fine-tuning. The limits: the student rarely matches the teacher exactly, it inherits the teacher’s biases and blind spots, and distilling someone else’s model can run into licensing and terms-of-use questions. Used well, it’s how a frontier-grade capability gets cheap enough to run at scale.