The attention mechanism: how a machine learned to choose what to look at
One idea flipped the whole field of AI in 2017 and now powers nearly every large language model: letting each word look directly at every other and decide what matters. Here is how attention works, and why it convergently rediscovered something the brain already knew.
Read this sentence: “The glass fell off the table, but it did not break.” What does the word “it” refer to? The glass, obviously. Not the table. You knew it instantly, effortlessly. But pause on what just happened. In an eleven-word sentence, your mind knew that to understand “it” you had to go and look at “glass” and ignore “table,” “fell,” “but.” It distributed its attention. That is precisely the problem a machine must solve to understand language, and the solution is called, fittingly, the attention mechanism. It is the single idea that flipped the whole field of AI in 2017, and it is the engine inside nearly every large language model in use today.
The old problem: a memory that frays
Before 2017, machines read text the way you unspool a reel: word after word, in order, keeping a small mental “summary” they updated at each step (these were the recurrent networks, RNNs and LSTMs). The flaw is intuitive: by the time you reach “it” late in the sentence, the word “glass” is already far behind, half crushed under everything read since. The further away the useful information, the more diluted it becomes. The machine had the memory of a stressed goldfish: the start of the sentence fades as it advances.
Worse: because it read in series, word by word, there was no way to parallelise. Each word had to wait for the previous one to be digested. On a graphics card built to do a thousand things at once, that is waste. Two walls at the same time: a memory that frays, and the slowness of the sequential.
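The fraying memory can be made concrete with a toy model. The sketch below is not a real RNN or LSTM (those learn their update rule); it assumes a fixed, hypothetical `keep` factor of 0.7 and simply tracks how much of each word survives in the running summary as later words are folded in. The loop also shows the second wall: each step needs the result of the previous one, so nothing can run in parallel.

```python
def summary_weights(n_words, keep=0.7):
    """Share of each word left in the running summary after n_words steps.

    Toy assumption: at every step the old summary is scaled by `keep`
    and the new word fills the remaining (1 - keep).
    """
    weights = []
    for _ in range(n_words):  # strictly one word after another
        weights = [w * keep for w in weights]  # everything older fades
        weights.append(1.0 - keep)             # the newest word arrives fresh
    return weights


sentence = "The glass fell off the table but it did not break".split()
# State of the summary at the moment the machine reads "it" (the 8th word).
weights = summary_weights(sentence.index("it") + 1)
for word, w in zip(sentence, weights):
    print(f"{word:>6}: {w:.3f}")
```

Under these assumptions, “glass” has shrunk to about 0.035 of the summary by the time “it” arrives, against 0.3 for “it” itself.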
The breakthrough: stop reading in line, look at everything at once
In 2017, a paper from Google with a provocative title, “Attention Is All You Need,” knocked the lock off. Its proposal was radical: drop the in-line reading. Lay all the words of the sentence on the table at the same time, and let each word look directly at every other, regardless of distance, and decide for itself which ones matter to it. “It” can reach straight for “glass” without crossing the five words in between. Distance no longer costs anything. This is the Transformer architecture, and the “T” in GPT comes directly from it (Generative Pre-trained Transformer).
The shift in one sentence: we moved from a machine that reads (in series, forgetting) to a machine that consults (everything, in parallel, weighting). Attention is the right of each word to interrogate the whole sentence and build its meaning out of what it finds there.
The heart of the engine: query, key, value
Here is the mechanism. It comes down to three roles each word plays at once, and the cleanest analogy is a kind of search engine internal to the sentence. The query is what the word is looking for. For “it,” the query is roughly “I am a pronoun, I am looking for a recent singular noun to attach to.” The key is the label each word holds up — “here is what I am and what I offer.” “Glass” holds up a key like “I am a noun, an object, singular.” The value is the actual content the word passes to whoever cares about it — the meaning that will be poured across.
The computation is then clear. The word “it” takes its query and compares it to the key of every other word. Where the match is strong (the query of “it” against the key of “glass”), the link is powerful; where it is weak (against the key of “table”), the link is faint. Then “it” collects a blend of the values of all the words, dosed by those match strengths: a lot of the value of “glass,” almost none of “table.” The result: the meaning of “it,” inside the machine, is enriched by everything “glass” means. The pronoun has “understood” its antecedent.
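Here is that computation for the single word “it,” as a minimal sketch. All the vectors are hypothetical two-number stand-ins written by hand; a trained model learns its queries, keys and values and uses far longer vectors. The sketch also leaves out the scaling step that appears in the full formula.

```python
import math

# Hypothetical 2-number vectors, chosen by hand for illustration only.
query_it = [2.0, 1.0]
keys = {"glass": [2.0, 1.0], "table": [1.0, 0.0], "fell": [0.0, 0.0]}
values = {"glass": [1.0, 0.0], "table": [0.0, 1.0], "fell": [0.0, 0.0]}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# 1. Compare the query of "it" with the key of every other word.
scores = {word: dot(query_it, key) for word, key in keys.items()}

# 2. Turn the match strengths into shares that sum to 1 (a softmax).
total = sum(math.exp(s) for s in scores.values())
shares = {word: math.exp(s) / total for word, s in scores.items()}

# 3. Blend the values, dosed by those shares.
new_it = [sum(shares[w] * values[w][j] for w in values) for j in range(2)]

print(shares)
print(new_it)
```

With these made-up numbers, roughly 95% of the attention of “it” lands on “glass,” so the blended vector sits almost on top of the value of “glass.”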
The canonical formula one meets everywhere is this one. No need to memorise it — the plain-language translation follows immediately:
Attention(Q, K, V) = softmax( Q·Kᵀ / √d ) · V
Four symbols, four images. Q·Kᵀ (queries times keys) is the grid of compatibilities: a table with the words down the rows and the same words across the columns, each cell scoring “how much does this one care about that one.” It is a map of who-looks-at-whom. The ÷ √d, where d is the number of values in each query and key vector, is a simple volume control: without it the scores grow with the size of the vectors, run to extremes, and the machine becomes jittery, all-or-nothing, so we divide to calm things down. It is an engineering detail rather than a concept, and you can ignore it. The softmax turns those raw scores into percentages of attention that sum to 100%: 92% on “glass,” 5% on “table,” 3% on the rest. It is the gesture of dividing one cake of relevance into slices. And the · V (times values) is the final pour: those percentages are used to blend the contents, so the new “it” is 92% made of the meaning of “glass.” In one image: each word poses a question to the whole room, listens for who answers best, and feeds mostly on those. That is all of attention. The rest is plumbing.
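For readers who prefer code to symbols, the formula can be written out in a few lines of plain Python. This is a sketch for reading, not for speed: real systems run the same arithmetic as matrix operations on a GPU. The three example words and their vectors are invented, and `d` is taken to be the length of each key vector.

```python
import math


def softmax(scores):
    """Turn raw scores into positive shares that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow, result unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d = len(K[0])  # length of each query/key vector
    weights = []
    for q in Q:  # one row of the who-looks-at-whom grid per word
        row = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(row))
    # Final pour: each output row is a weighted blend of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Three words with made-up 2-number vectors (illustrative, not learned).
Q = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
K = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one word's attention budget, and each row sums to 1 whatever the inputs.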
The loop that closes: attention is precision-weighting
There is a deeper rhyme here. In predictive accounts of the brain, attention is defined as precision-weighting — giving more weight to the signals you trust. Look at what the machine just did: it assigned 92% of the weight to the “glass” signal and crushed the others. It is exactly the same idea. The biological brain and the Transformer converge on the same deep definition of attention: not a spotlight you aim, but an allocation of weight — a budget of relevance you distribute. Biology found it through evolution; engineering rediscovered it through computation. When two independent paths land on the same structure, it is often a sign you are touching something real.
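The structural parallel can be put side by side in a few lines. The sketch below assumes the usual statistical reading of precision as the inverse of a signal's variance; the signals, variances and match scores are all invented for illustration. It claims nothing about how neurons compute, only that both recipes end in the same gesture: non-negative weights that sum to 1 and decide how much each signal contributes.

```python
import math


def normalise(raw):
    """Scale positive numbers so they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]


def weighted_blend(weights, signals):
    return sum(w * s for w, s in zip(weights, signals))


signals = [10.0, 20.0, 30.0]

# Brain-style: weight = precision = 1 / variance (hypothetical variances).
variances = [0.1, 2.0, 5.0]
precision_weights = normalise([1.0 / v for v in variances])

# Transformer-style: weight = softmax of match scores (hypothetical scores).
scores = [5.0, 2.0, 0.0]
attention_weights = normalise([math.exp(s) for s in scores])

print([round(w, 3) for w in precision_weights],
      round(weighted_blend(precision_weights, signals), 2))
print([round(w, 3) for w in attention_weights],
      round(weighted_blend(attention_weights, signals), 2))
```

In both cases the first signal takes more than 90% of the budget and the blend lands close to it.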
Why “multi-head”: several gazes in parallel
One essential and very intuitive refinement. A single attention mechanism captures only one kind of relationship at a time. But “glass” holds several links at once: grammatical (subject of “fell”), referential (antecedent of “it”), semantic (a fragile object). So we do not use one attention but several in parallel, the “heads” of multi-head attention. Roughly speaking, one head may pick up grammar, another references, another nearby meaning. These roles are learned during training rather than assigned, and they are not always so tidy, but each head reads the sentence with a different question, and their answers are then fused. It is the equivalent of re-reading a text several times, hunting for one specific thing on each pass, except the machine makes all the passes at the same time.
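A bare-bones sketch of several heads working side by side follows. Each head gets its own small projection matrices that decide which part of a word's vector it looks at. Here those matrices are hypothetical and written by hand; a real model learns them, and also applies a final learned mixing step after the heads are glued together, which this sketch omits.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(X, W):
    """Multiply a list of row vectors X by a matrix W (given as a list of rows)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]


def attend(Q, K, V):
    """One attention pass: scores, softmax, blend."""
    d = len(K[0])
    out = []
    for q in Q:
        shares = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(s * v[j] for s, v in zip(shares, V)) for j in range(len(V[0]))])
    return out


def multi_head(X, heads):
    """Run one attention per head, then glue the results together word by word."""
    per_head = [
        attend(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
        for Wq, Wk, Wv in heads
    ]
    return [[x for head in per_head for x in head[i]] for i in range(len(X))]


# Hypothetical hand-written projections; a real model learns them.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # reads numbers 1-2
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # reads numbers 3-4
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]

X = [[1.0, 0.0, 0.0, 2.0],   # three words, four numbers each
     [0.0, 1.0, 2.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
fused = multi_head(X, heads)
print(len(fused), len(fused[0]))
```

The heads never talk to each other while they work, which is why all the passes can run at the same time.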
The price: the quadratic wall
This power has a cost, and it is the engineering subject of the moment. If every word must look at every other, then for a sentence of N words the number of looks is N × N. Double the length of the text and you quadruple the computation. This is quadratic complexity.
“Quadratic” sounds frightening; the image is trivial. A meeting where everyone has to shake hands with everyone. With ten people, fine. With a hundred, it is not ten times more handshakes — it is a hundred times more. The room seizes up. That is why a language model “labours” and costs more when you hand it a three-hundred-page document: you are making every word shake hands with every other word. The whole art of recent engineering is to avoid the useless handshakes without losing any of the meaning.
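The arithmetic is short enough to check by hand or in a few lines. The sketch below only counts cells in the who-looks-at-whom grid for a single attention pass; it ignores the number of heads, the number of layers and everything else a real model does, so read the figures as a growth pattern, not as a cost estimate.

```python
def attention_scores(n_words):
    """Cells in the who-looks-at-whom grid: every word scores every word."""
    return n_words * n_words


for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} words -> {attention_scores(n):>12,} scores")

# Doubling the text quadruples the work.
ratio = attention_scores(2_000) / attention_scores(1_000)
print(ratio)
```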
Keep that picture of the “quadratic wall” in mind, because it explains most of what is happening at the frontier of the field. Three threads, in particular, are worth watching. One line of work keeps the exact same result but reorganises the computation so the giant handshake grid is never written out to slow memory in full: same answer, smarter plumbing, which is part of why assistants are faster and cheaper than they were a couple of years ago. A second line shrinks the model’s working memory of the words it has already processed, which helps make it possible to run a capable model on modest hardware. And a third, more radical line questions the 2017 bet altogether: state-space architectures can be trained in parallel, much as Transformers are, but their cost grows linearly with the length of the text instead of quadratically, so they never hit the quadratic wall. “Attention is all you need” is now, openly, being contested.
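The first thread, “same answer, smarter plumbing,” can be illustrated for a single query. One way to avoid storing a full row of scores is to walk through the keys once while keeping only a running maximum, a running sum and a running blend, rescaling them whenever a larger score turns up. The sketch below is an assumption-laden illustration of that streaming idea in plain Python, handling one key at a time; production GPU kernels work on whole tiles of the grid and are far more involved.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_grid_row(q, K, V):
    """Reference: compute every score first, then softmax, then blend."""
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in K]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]


def streaming_row(q, K, V):
    """Same result, but only a running max, sum and blend are kept."""
    d = len(q)
    top = -math.inf
    total = 0.0
    blend = [0.0] * len(V[0])
    for k, v in zip(K, V):
        s = dot(q, k) / math.sqrt(d)
        new_top = max(top, s)
        shrink = math.exp(top - new_top)  # rescales what was gathered so far
        e = math.exp(s - new_top)
        total = total * shrink + e
        blend = [b * shrink + e * x for b, x in zip(blend, v)]
        top = new_top
    return [b / total for b in blend]


q = [1.0, 2.0]
K = [[0.5, 1.0], [3.0, -1.0], [2.0, 2.0], [-1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
print(full_grid_row(q, K, V))
print(streaming_row(q, K, V))
```

The two functions agree up to floating-point rounding, which is the whole point: nothing about the meaning changes, only how much has to be held in memory at once.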
That last point is the most interesting. A single idea reorganised an entire field, gave us the architecture behind every chatbot, and convergently rediscovered a principle the brain had already found. And barely a decade later, the field is busy asking whether it really was all you need. That is not a failure of the idea; it is what a living field looks like — building on a breakthrough while already searching for the next seam.
Further reading
- Vaswani et al., “Attention Is All You Need” (2017) — the founding paper that introduced the Transformer.
- Jay Alammar, “The Illustrated Transformer” — the classic illustrated walkthrough of query-key-value, at your own pace.
- 3Blue1Brown, “Attention in transformers, step-by-step” — the best visual explanation available, one example carried all the way through.