Quantization
Quantization shrinks a model by storing its weights at lower numerical precision — say 4-bit integers instead of 16-bit floats. The model gets smaller and faster to run, usually with little quality loss, which is what lets large models fit on smaller hardware.
Also known as: model quantization
A model’s weights are numbers, typically stored as 16-bit or 32-bit floating-point values. Quantization rounds them to fewer bits (8-bit or 4-bit instead of 16-bit), so the model takes far less memory and often runs faster, since there is less data to move for every step of inference. Because neural networks are robust to small numerical noise, a well-quantized model often performs nearly as well as the full-precision original.
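To make the rounding concrete, here is a minimal sketch of one simple scheme, symmetric per-tensor quantization, in plain Python. The weights, the function names `quantize` and `dequantize`, and the single shared scale factor are illustrative assumptions; production quantizers usually work per channel or per group and use more careful calibration.

```python
def quantize(weights, bits):
    """Map floats to signed integers of the given bit width (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one float scale shared by all weights
    # Round each weight to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


# Made-up weights, for illustration only.
weights = [0.12, -0.53, 0.91, -0.07, 0.34, -0.88]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q} max rounding error={max_err:.4f}")
```

Each weight ends up at most half a step (`scale / 2`) away from its original value. Going from 8 bits to 4 makes the step much coarser, which is where the quality risk comes from.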
It’s one of the main techniques that puts capable models on constrained hardware — a laptop, a phone, a cheaper GPU — and that cuts serving cost at scale. The trade-off is quality: push the precision too low and accuracy degrades, with the damage uneven across tasks. So quantization is a dial you tune against your evals, not a free win — find the lowest precision that still passes, the same way you’d pick the smallest model that works.
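The "lowest precision that still passes" rule can be sketched as a small selection function. Everything here is hypothetical: the task names, the scores, and the `max_drop` tolerance are made-up placeholders standing in for your own eval suite and acceptance criteria.

```python
def lowest_passing_precision(scores_by_bits, baseline_bits, max_drop):
    """Return the smallest bit width whose score on EVERY task stays within
    max_drop of the baseline, or None if no candidate qualifies."""
    baseline = scores_by_bits[baseline_bits]
    passing = [
        bits
        for bits, scores in scores_by_bits.items()
        if all(baseline[task] - scores[task] <= max_drop for task in baseline)
    ]
    return min(passing) if passing else None


# Illustrative, made-up eval scores (not measurements), keyed by bit width.
# The damage is deliberately uneven: arithmetic suffers more than summarization.
scores_by_bits = {
    16: {"summarization": 0.80, "arithmetic": 0.70},
    8: {"summarization": 0.80, "arithmetic": 0.69},
    4: {"summarization": 0.79, "arithmetic": 0.61},
}

choice = lowest_passing_precision(scores_by_bits, baseline_bits=16, max_drop=0.02)
print(f"lowest passing precision: {choice}-bit")
```

Checking every task rather than an average matters because the damage is uneven: in this made-up data the 4-bit model barely moves on summarization but fails the tolerance on arithmetic, so an averaged score could hide the regression.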