FlashAttention
FlashAttention is an optimized way to compute a transformer's attention that's far more memory-efficient, by avoiding writing the huge intermediate attention matrix to memory. It makes longer context windows and faster training practical without changing the model's results.
Standard attention computes a score between every pair of tokens, which means memory use grows with the square of the sequence length — the bottleneck that makes long contexts expensive. FlashAttention reorders the computation so it never materializes that full score matrix in slow memory, processing it in tiles that stay in fast on-chip memory instead. The math result is identical; the resource cost is much lower.
Why it matters to builders even if you never touch it: it’s a big part of why recent models can offer long context windows and train faster on the same hardware. It’s an implementation-level optimization, not a new architecture — one of the quiet engineering advances (alongside things like mixture-of-experts) that pushed model capability and efficiency forward at once.