Transformer

Transformer

The transformer is the neural-network architecture behind almost every modern large language model. Its key idea is attention: each token can look at every other token and weigh which ones matter, which is what lets the model handle context and long-range meaning.

Also known as: transformer architecture

Jun 16, 2026 · Chain of Thought

The transformer, introduced in 2017, is the architecture nearly all current LLMs are built on. Its central mechanism is attention: for each token, the model computes how much every other token in the context should influence it, so meaning can flow across long distances in the text. That ability to weigh the whole context at once is what made today’s models possible.

It has a known cost: standard attention scales with the square of the sequence length, so very long contexts get expensive in compute and memory. That’s the pressure behind a wave of alternatives and tweaks — state-space models like Mamba, efficiency-focused designs, and architectures explored by teams like Liquid AI — all trying to keep the transformer’s strengths while cutting its scaling cost. For now, “transformer” and “LLM” are close to synonymous, but that’s starting to loosen.

From the conversation

Related terms