
Understanding Transformers: from intuition to the math

Massimo Gollo

The paper Attention Is All You Need (Vaswani et al., 2017) redefined the field of deep learning for natural language. It introduced the Transformer, an architecture that completely removes recurrent and convolutional networks, replacing them with a single mechanism: self-attention. This article walks through the intuition behind the Transformer — starting from the problem it solves, through the math of the mechanism, and into the implications for training and inference.


Conceptual prerequisites

Before getting into the Transformer, three foundational concepts.

Word embeddings: words as vectors

A language model doesn’t operate on text strings, but on numerical vectors. Each token (word or sub-word) is mapped to a point in \(\mathbb{R}^d\) via an embedding matrix learned during training. In the original Transformer, \(d_{\text{model}} = 512\).

The intuition is geometric: words with similar meanings occupy nearby regions of vector space. Operations like \(\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})\) show that directions in the space encode semantic relationships.
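
As a toy illustration of that arithmetic, the sketch below uses hand-built 2-dimensional vectors (one axis loosely standing for "royalty", one for "gender"). The values are invented for the example and are not learned embeddings; real models learn hundreds of dimensions from data, and the analogy only holds approximately there.

```python
import math

# Hand-made 2-D "embeddings": [royalty, gender]. Purely illustrative values.
EMBEDDINGS = {
    "king":     [0.9,  0.8],
    "queen":    [0.9, -0.8],
    "prince":   [0.7,  0.8],
    "princess": [0.7, -0.8],
    "man":      [0.1,  0.8],
    "woman":    [0.1, -0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))

result = analogy("king", "man", "woman")   # "queen" for these toy vectors
```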

The sequence-to-sequence problem

Many NLP tasks (translation, summarization, question answering) require transforming an input sequence into an output sequence. Before the Transformer, the dominant architecture was the RNN-based encoder-decoder: an encoder that compresses the input sequence into a fixed representation, and a decoder that generates the output one token at a time. The Transformer keeps this encoder-decoder structure but radically changes the internal mechanism.

The RNN bottleneck

Recurrent networks (RNNs, LSTMs, GRUs) process sequences one element at a time, in order. At each position \(t\), the model computes a hidden state \(h_t\) as a function of the previous state and the current input:

$$h_t = f(h_{t-1}, x_t)$$

This serial dependency has two critical consequences:

  1. No parallelism across time steps: to compute \(h_t\) you must have completed \(h_{t-1}\). GPUs get their throughput from running many operations at once, so a strictly serial chain leaves most of the hardware idle.
  2. Signal degradation: information from the first token has to travel through the entire chain to reach the last. Even though LSTM and GRU gates mitigate the vanishing gradient, the path length between two distant positions remains \(O(n)\).
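
Both consequences show up even in a minimal sketch. The scalar RNN below is an assumption made for illustration (a single weight per connection, `tanh` as \(f\), and arbitrary weight values); it is not a model from the paper.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t), with h_0 = 0."""
    h = 0.0
    states = []
    for x_t in inputs:   # step t cannot start until step t-1 has finished
        h = math.tanh(w_h * h + w_x * x_t)
        states.append(h)
    return states

# Only the first input is non-zero, so every later state carries
# nothing but the trace of that first token.
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
```

With these toy weights, each entry of `states` is smaller than the one before it: the only way the first input reaches step 4 is through three intermediate states, and its trace shrinks at each hop.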

[Interactive visualization: sequential RNN processing, showing step by step how each hidden state \(h_t\) depends on \(h_{t-1}\) being complete, starting from the initial state \(h_0 = 0\).]

The Transformer architecture: overview

The Transformer replaces recurrence with self-attention: a mechanism that lets every position in the sequence directly access all the others, in a single computational step.

The architecture keeps the encoder-decoder structure:

  • Encoder (6 identical layers): each layer contains a multi-head self-attention block followed by a feed-forward network. Every sub-layer is wrapped in a residual connection and layer normalization.
  • Decoder (6 identical layers): like the encoder, but with an additional cross-attention block that attends over the encoder output, plus a masked self-attention that prevents the decoder from “looking ahead” during generation.

The depth (6+6 layers) and width (\(d_{\text{model}} = 512\)) are the main hyperparameters that set the capacity of the base Transformer.


Self-attention: the core mechanism

From O(n) to O(1): comparison with RNNs

The key structural advantage of self-attention is path length: the maximum number of steps a signal has to traverse between any two positions drops from \(O(n)\) (RNN) to \(O(1)\) (self-attention). Information from the first token can influence the last one directly, without being relayed through intermediate hidden states.

[Interactive visualization: RNN vs Transformer, comparing the information path between two tokens in a sequential RNN (chain of hidden states) and in a parallel Transformer (self-attention), with the number of steps and the complexity for each.]

Query, Key, Value: the intuition

The three vectors Q, K, V are the heart of self-attention. The most immediate analogy is a library:

  • The Query (\(Q\)) is the question a position asks: “what information do I need?”
  • The Key (\(K\)) is the label every position exposes: “this is what my position contains”
  • The Value (\(V\)) is the actual information content extracted when there’s a match

Q, K, and V are not defined by hand — they’re computed via learned linear projections:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where \(W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}\) and \(W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}\) are weight matrices. Training optimizes these so that:

  • \(W^Q\) produces vectors that “ask the right question” for each position
  • \(W^K\) produces vectors that “describe the content” in a way compatible with the queries
  • \(W^V\) produces vectors carrying the useful information to extract

The dot product \(Q_i \cdot K_j\) measures the compatibility between position \(i\)’s query and position \(j\)’s key: aligned vectors in the space produce high scores, orthogonal vectors produce zero.
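
A minimal sketch of the projections and the resulting compatibility scores, using plain Python lists to stay dependency-free. The input `X` and the matrices `W_Q` and `W_K` are hand-picked toy values (with \(d_{\text{model}} = 4\) and \(d_k = 2\)); in a real model the weights are learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Three token embeddings with d_model = 4 (toy values).
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hand-picked projection matrices of shape d_model x d_k (d_k = 2).
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
W_K = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

Q = matmul(X, W_Q)   # one query per token, shape 3 x 2
K = matmul(X, W_K)   # one key per token, shape 3 x 2

K_T = [list(col) for col in zip(*K)]
scores = matmul(Q, K_T)   # scores[i][j] = Q_i . K_j, shape 3 x 3
```

Here `Q[0]` is `[1.0, 0.0]` and `K[1]` is `[0.0, 2.0]`, so `scores[0][1]` is zero (orthogonal vectors), while `scores[1][1]` is the largest entry because `Q[1]` and `K[1]` point the same way.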

[Interactive visualization: the roles of Query, Key, and Value, showing for each token its "question" (Q), its "label" (K), and its "content" (V), and how the attention weights distribute information.]

Scaled dot-product attention: the math

The full self-attention formula, as presented in the paper, is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Let’s unpack it step by step.

Step 1: Dot product \(QK^T\)

We compute the dot product between every query vector and every key vector, producing an \(n \times n\) matrix of raw scores. The score \(s_{ij} = Q_i \cdot K_j\) measures how much position \(i\) should “pay attention” to position \(j\).

Step 2: Scaling by \(\sqrt{d_k}\)

The factor \(\frac{1}{\sqrt{d_k}}\) keeps the scores in a range where the softmax still has useful gradients. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has mean 0 and variance \(d_k\), so typical scores grow in magnitude as \(d_k\) grows. Large scores push the softmax into saturation regions where gradients are nearly zero, which hinders training. With \(d_k = 64\), dividing by \(\sqrt{64} = 8\) brings the score variance back to 1 under that assumption.
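
The variance argument can be checked empirically. The sketch below assumes every component of \(q\) and \(k\) is drawn independently from a standard normal distribution, which is an idealization of what trained projections produce; the trial count and seed are arbitrary.

```python
import random
import statistics

def dot_variance(d_k, scale, trials=2000, seed=0):
    """Empirical variance of (q . k) / scale for random q and k.

    Assumes each component is drawn independently from N(0, 1).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        samples.append(dot / scale)
    return statistics.pvariance(samples)

raw = dot_variance(d_k=64, scale=1.0)            # close to d_k = 64
scaled = dot_variance(d_k=64, scale=64 ** 0.5)   # close to 1
```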

Step 3: Softmax

The softmax normalizes each row of the score matrix into a probability distribution:

$$\alpha_{ij} = \frac{\exp(s_{ij} / \sqrt{d_k})}{\sum_{l=1}^{n} \exp(s_{il} / \sqrt{d_k})}$$

The weights \(\alpha_{ij}\) sum to 1 for every row \(i\). A high weight indicates that position \(j\) is strongly relevant to position \(i\).

Step 4: Weighted average of Values

The output for each position is a weighted average of the Value vectors, with the weights from the softmax:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} V_j$$

Each position gets a representation that mixes the contents of all positions (itself included), in proportion to the attention weights \(\alpha_{ij}\).
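
Putting the four steps together, a minimal single-head sketch might look like this. It uses plain lists rather than a tensor library, takes already-projected `Q`, `K`, `V` as input, and the example matrices are toy values.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for matrices given as lists of rows.

    Returns (weights, output): weights is n x n, output is n x d_v.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Steps 1-2: scaled scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1.
        weights.append(softmax(scores))
    # Step 4: each output row is a weighted average of the value rows.
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return weights, output

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention(Q, K, V)
```

In this example each query matches its own key best, so each output row leans toward its own value vector while still blending in the other one.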

[Interactive visualization: scaled dot-product attention, step by step, for a selectable query token.]


Multi-head attention

A single attention head produces one set of weights per position, so it tends to average different kinds of relations together. Multi-head attention runs \(h\) heads in parallel, each with its own projection matrices, and concatenates the results:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

In the base Transformer: \(h = 8\) heads, each with \(d_k = d_v = d_{\text{model}} / h = 64\). Because each head works in a reduced dimension, the total compute cost is similar to that of a single full-dimension head, but the model can attend to different relations at once: one head might capture subject-verb agreement, another coreference, another syntactic structure.

The final projection \(W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}\) maps the concatenation of the \(h\) heads back into the \(d_{\text{model}}\)-dimensional space.
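
The cost comparison can be sanity-checked by counting projection weights. This sketch counts weights only (no biases, no FLOPs), using the shapes given above; the function name is made up for the example.

```python
def attention_weight_count(d_model, h):
    """Number of projection weights (biases ignored) in one multi-head block."""
    d_k = d_v = d_model // h
    # W_i^Q, W_i^K and W_i^V for a single head.
    per_head = d_model * d_k + d_model * d_k + d_model * d_v
    # W^O maps the concatenated heads back to d_model.
    output_proj = (h * d_v) * d_model
    return h * per_head + output_proj

base = attention_weight_count(d_model=512, h=8)     # 8 heads of width 64
single = attention_weight_count(d_model=512, h=1)   # 1 head of width 512
```

Both calls give \(4 \cdot 512^2 = 1{,}048{,}576\) weights: splitting into heads redistributes the same budget across several smaller subspaces instead of adding to it.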


Positional encoding

Self-attention on its own has no notion of order: permuting the input tokens simply permutes the outputs in the same way (it treats the sequence as a set), so the Transformer needs an explicit position signal. The paper uses sinusoidal functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Each pair of embedding dimensions receives a sinusoidal signal at a different frequency. The authors chose sinusoids on the hypothesis that they let the model learn to attend by relative position: for any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\).

The positional encoding is added to the token embedding before entering the first layer.
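
The sketch below computes the encoding for a single position and checks the linearity property numerically. The helper `shift` is an illustration written for this article, not something from the paper: it applies, to each (sin, cos) pair, a rotation whose angle depends only on the offset \(k\), which is exactly the angle-addition identity.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding vector for one position (d_model must be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) with a rotation that depends only on k."""
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(theta) + c * math.sin(theta))   # sin(a + theta)
        out.append(c * math.cos(theta) - s * math.sin(theta))   # cos(a + theta)
    return out

d_model = 8
pe_5 = positional_encoding(5, d_model)
pe_8 = positional_encoding(8, d_model)
pe_5_shifted = shift(pe_5, 3, d_model)   # matches pe_8 up to rounding
```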


Feed-forward network and residual connections

Every Transformer layer contains, after the attention block, a position-wise feed-forward network:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Two linear transformations with a ReLU in between. The inner dimension is \(d_{ff} = 2048\) (4× the model dimension). This network is applied independently at each position — this is where the model “reasons” over a single representation after enriching it with context via attention.

Every sub-layer (attention or FFN) is wrapped by:

  1. Residual connection: \(x + \text{SubLayer}(x)\) — lets gradients flow directly through the layers, stabilizing training of deep networks.
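  2. Layer normalization: rescales each position's activation vector to zero mean and unit variance (followed by a learned gain and bias), which keeps activation scales stable from layer to layer.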

The full sub-layer scheme is: \(\text{LayerNorm}(x + \text{SubLayer}(x))\).
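
A compact sketch of the feed-forward network and the wrapper around it, for a single position. The sizes are toy values (\(d_{\text{model}} = 2\), \(d_{ff} = 3\)), the weights are hand-picked, and the learned gain and bias of layer normalization are left out for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance.

    The learned gain and bias of full layer normalization are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def sublayer_block(x, sublayer):
    """Wrapper from the text: LayerNorm(x + SubLayer(x))."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]   # d_model x d_ff
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_ff x d_model
b2 = [0.0, 0.0]

x = [2.0, 1.0]
ffn_out = ffn(x, W1, b1, W2, b2)                          # [2.5, 0.5]
y = sublayer_block(x, lambda v: ffn(v, W1, b1, W2, b2))   # zero-mean, unit-variance
```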


The Transformer at inference time

Encoder: a single parallel pass

At inference, the encoder processes the entire input sequence in a single forward pass. Every token “sees” all the others through self-attention, and the output is a sequence of contextualized representations — every position contains information about the whole sentence.

Decoder: autoregressive generation

The decoder produces the output one token at a time, autoregressively:

  1. Receives the start-of-sequence token (e.g. <sos>)
  2. Produces a probability distribution over the vocabulary for the next token
  3. Selects the token (greedy, beam search, or sampling)
  4. Appends it to the decoder input and repeats
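
The four steps above can be sketched as a greedy decoding loop. The function `toy_next_token_probs` is a hypothetical stand-in for a trained decoder: it looks only at the last token and returns a hard-coded distribution, whereas a real decoder conditions on the whole prefix and on the encoder output. The `<eos>` stop token and `max_len` cutoff are also assumptions of this example.

```python
def toy_next_token_probs(prefix):
    """Hypothetical model stub: next-token distribution given the prefix."""
    table = {
        "<sos>": {"the": 0.7, "cat": 0.2, "<eos>": 0.1},
        "the":   {"cat": 0.8, "the": 0.1, "<eos>": 0.1},
        "cat":   {"<eos>": 0.9, "the": 0.05, "cat": 0.05},
    }
    return table[prefix[-1]]

def greedy_decode(next_token_probs, max_len=10):
    tokens = ["<sos>"]                      # step 1: start-of-sequence token
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)    # step 2: distribution over the vocabulary
        token = max(probs, key=probs.get)   # step 3: greedy selection
        tokens.append(token)                # step 4: append and repeat
        if token == "<eos>":
            break
    return tokens

decoded = greedy_decode(toy_next_token_probs)
# decoded == ["<sos>", "the", "cat", "<eos>"]
```

Swapping step 3 for beam search or sampling changes only how `token` is chosen; the loop structure stays the same.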

At each step, the masked self-attention prevents the decoder from looking at future positions. For position \(t\), the attention scores toward positions \(t+1, t+2, \ldots\) are set to \(-\infty\) before the softmax, zeroing their weight. This is essential: without the mask, the model would “see the answer” and never learn to predict.
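
A minimal sketch of the mask, applied to a toy \(3 \times 3\) score matrix (the scores are arbitrary values chosen so that the effect is easy to see):

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    n = len(scores)
    weights = []
    for t in range(n):
        # Future positions get -inf, so their exponential is exactly 0.
        masked = [scores[t][j] if j <= t else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [2.0, 2.0, 2.0]]
weights = causal_softmax(scores)
# weights[0] == [1.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0]
```

The large scores toward future positions (5.0 and 9.0) have no effect: the first position can only attend to itself, and the second splits its weight between the two positions it is allowed to see.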

The decoder also contains a cross-attention block that operates on the encoder output: the Queries come from the decoder, while the Keys and Values come from the encoder. This lets every generated position “consult” the entire input sequence.


The Transformer at training time

Teacher forcing

During training, the decoder doesn’t generate autoregressively — it uses teacher forcing: it receives the correct target sequence (right-shifted by one position) as input and predicts every token in parallel. The masked attention still ensures position \(t\) doesn’t see future tokens.

This makes it possible to compute the loss across all tokens in a single forward pass, fully exploiting the parallelism of self-attention.
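
The right shift is easy to show concretely. In this sketch the token strings and the `<eos>` end marker are assumptions made for the example:

```python
def teacher_forcing_pairs(target, sos="<sos>"):
    """Build decoder inputs and per-position labels from a target sequence."""
    decoder_input = [sos] + target[:-1]   # target shifted right by one position
    labels = target                       # what each position must predict
    return decoder_input, labels

target = ["the", "cat", "sat", "<eos>"]
decoder_input, labels = teacher_forcing_pairs(target)
# decoder_input: ["<sos>", "the", "cat", "sat"]
# labels:        ["the",   "cat", "sat", "<eos>"]
```

Position \(t\) of the decoder input holds the correct token from step \(t-1\), and the causal mask keeps it from seeing anything later, so all four predictions can be computed at once.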

Loss function

The loss is the cross-entropy between the model’s predicted distribution and the correct token:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t | y_{<t}, X)$$

where \(y_t\) is the target token and \(y_{<t}\) are the previous tokens. Vaswani et al. also use label smoothing (\(\epsilon = 0.1\)): instead of assigning probability 1 to the correct token, they distribute a small amount of probability mass over the others. This penalizes the model’s overconfidence and improves generalization.
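
For a single position, the effect of label smoothing on the loss can be sketched as follows. This uses one common variant (the correct token keeps \(1 - \epsilon\) and the remaining \(\epsilon\) is split evenly over the other tokens); some implementations instead spread \(\epsilon\) uniformly over the whole vocabulary. The predicted distribution is a made-up example.

```python
import math

def smoothed_targets(correct, vocab_size, eps=0.1):
    """Target distribution: 1 - eps on the correct token, eps spread over the rest."""
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == correct else off for i in range(vocab_size)]

def cross_entropy(target_dist, predicted):
    """Cross-entropy -sum_i q_i * log(p_i) between target q and prediction p."""
    return -sum(q * math.log(p) for q, p in zip(target_dist, predicted) if q > 0)

predicted = [0.7, 0.2, 0.1]   # model output over a 3-token vocabulary
hard = cross_entropy([1.0, 0.0, 0.0], predicted)            # equals -log(0.7)
smooth = cross_entropy(smoothed_targets(0, 3), predicted)   # larger than `hard`
```

With smoothed targets the loss no longer goes to zero as the model puts all its mass on the correct token, which is what discourages overconfident predictions.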

Optimization

The optimizer is Adam (\(\beta_1 = 0.9\), \(\beta_2 = 0.98\), \(\epsilon = 10^{-9}\)) with a learning-rate warm-up schedule:

$$lr = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup_steps}^{-1.5})$$

The learning rate grows linearly for the first warmup_steps (4000), then decays proportionally to the inverse square root of the step number. This schedule avoids instability in the early phase of training, when parameters are still far from a good region.
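
The schedule is a direct transcription of the formula. The function name is chosen for this example; the defaults are the base-model values quoted above.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

lr_early = transformer_lr(100)      # linear warm-up region
lr_peak = transformer_lr(4000)      # the two branches of min() meet here
lr_late = transformer_lr(100_000)   # inverse-square-root decay region
```

With these values the peak learning rate, reached at step 4000, is \((512 \cdot 4000)^{-0.5}\), roughly \(7 \times 10^{-4}\).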

Regularization

In addition to label smoothing, the Transformer uses dropout (\(P_{drop} = 0.1\)) applied to:

  • The output of every sub-layer (before the residual connection)
  • The sum of embedding + positional encoding
  • The attention weights themselves (common in implementations; the paper's text names only the first two)

Computational complexity comparison

| | Self-attention | RNN | Convolution |
|---|---|---|---|
| Per-layer complexity | \(O(n^2 \cdot d)\) | \(O(n \cdot d^2)\) | \(O(k \cdot n \cdot d^2)\) |
| Sequential operations | \(O(1)\) | \(O(n)\) | \(O(1)\) |
| Maximum path length | \(O(1)\) | \(O(n)\) | \(O(\log_k n)\) |

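Here \(n\) is the sequence length, \(d\) the representation dimension, and \(k\) the convolution kernel size. Self-attention pays a quadratic cost in sequence length (\(n^2\)), but every operation is parallelizable. Per layer it is cheaper than recurrence whenever \(n < d\), which was usually the case for sentence-level inputs at the time of publication. For very long sequences, variants like sparse attention reduce the complexity.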
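
Plugging numbers into the leading-order terms makes the trade-off concrete. The sequence lengths below are illustrative choices, and constant factors are ignored, so the results are orders of magnitude rather than real FLOP counts.

```python
def per_layer_cost(n, d):
    """Leading-order operation counts per layer (constants dropped)."""
    return {"self_attention": n * n * d, "rnn": n * d * d}

short = per_layer_cost(n=50, d=512)     # sentence-length input
long = per_layer_cost(n=4096, d=512)    # long-document input

# short: self-attention 1,280,000 vs RNN 13,107,200
# long:  self-attention 8,589,934,592 vs RNN 1,073,741,824
```

The two costs are equal at \(n = d\): below that point self-attention is the cheaper layer, above it the quadratic term dominates.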


Results and impact

The base Transformer (65M parameters, 6+6 layers, \(d_{\text{model}} = 512\)) reached a BLEU score of 27.3 on English-to-German translation (WMT 2014) after about 12 hours of training on 8 P100 GPUs. The big Transformer (213M parameters, \(d_{\text{model}} = 1024\)), trained for 3.5 days on the same hardware, reached 28.4 BLEU on the same task, state of the art at the time of publication.

But the real impact of the paper goes far beyond machine translation. The Transformer architecture became the foundation of:

  • BERT (Devlin et al., 2019): encoder-only, bidirectional pre-training with masked language modeling
  • GPT (Radford et al., 2018; Brown et al., 2020): decoder-only, autoregressive pre-training, scaled to hundreds of billions of parameters
  • T5 (Raffel et al., 2020): encoder-decoder, every NLP task framed as text-to-text
  • Most current large language models (Llama, Claude, Gemini, etc.)

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1706.03762
  2. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT. arXiv:1810.04805
  3. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
  4. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS. arXiv:2005.14165
  5. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. arXiv:1409.0473 (The paper that introduced the attention mechanism for RNNs, direct precursor of self-attention.)
  6. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450
  7. Alammar, J. (2018). The Illustrated Transformer. jalammar.github.io (Excellent visual guide, complementary to this article.)

Article written with the assistance of Claude (Anthropic). The interactive visualizations were developed during a study session on the original paper.