
Understanding Transformers: The Architecture Behind Modern AI

A deep dive into the Transformer architecture, exploring attention mechanisms, positional encodings, and why this architecture revolutionized natural language processing.

Machine Learning · Deep Learning · NLP · Transformers

The Transformer architecture, introduced in the seminal paper "Attention Is All You Need" by Vaswani et al. (2017), has fundamentally changed how we approach sequence modeling tasks. In this post, we'll explore the key components that make Transformers so powerful.

Figure: The Transformer architecture, showing the encoder and decoder stacks.

The Core Idea: Self-Attention

At the heart of the Transformer lies the self-attention mechanism. Unlike recurrent networks that process sequences step-by-step, self-attention allows the model to look at all positions in the input simultaneously.

The attention function can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weights are determined by the compatibility between the query and keys.

Mathematical Formulation

The scaled dot-product attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

  • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices
  • $d_k$ is the dimension of the keys
  • The scaling factor $\frac{1}{\sqrt{d_k}}$ prevents the dot products from growing too large (a quick numerical check follows below)
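To see why the scaling matters, here is a quick, illustrative check (my own sketch, not from the paper): for random unit-variance vectors, the standard deviation of the raw dot product grows roughly as $\sqrt{d_k}$, while the scaled scores stay near 1, keeping the softmax out of its saturated regime.

import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    # 10,000 independent query/key pairs with unit-variance components
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)          # one dot product per row
    scaled = raw / d_k ** 0.5
    print(f"d_k={d_k:3d}  std(raw)={raw.std().item():.1f}  std(scaled)={scaled.std().item():.2f}")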

Multi-Head Attention

Rather than performing a single attention function, Transformers use multi-head attention to allow the model to jointly attend to information from different representation subspaces:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Positional Encoding

Since Transformers don't have any recurrence or convolution, we need to inject information about the position of tokens in the sequence. The original paper uses sinusoidal positional encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

This encoding has the nice property that for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
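A compact way to build this table in code is sketched below. It follows the formula above; the module name and the max_len default are my own choices, not taken from the paper's reference implementation.

positional_encoding.py
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Precomputes the sinusoidal position table from the formula above."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
        # 10000^(-2i/d_model) for each even index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)            # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)            # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> add the first seq_len rows of the table
        return x + self.pe[: x.size(1)]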

Implementation Example

Here's a simplified implementation of the attention mechanism in Python:

attention.py
import torch
import torch.nn as nn
import torch.nn.functional as F
 
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k: int):
        super().__init__()
        self.scale = d_k ** 0.5
    
    def forward(self, Q, K, V, mask=None):
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        # Apply mask (optional)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Compute output
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
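A quick sanity check of the module on random tensors (the shapes here are arbitrary, chosen only for illustration):

# Sanity check: batch=2, seq_len=5, d_k=64
attn = ScaledDotProductAttention(d_k=64)
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)
out, weights = attn(Q, K, V)
print(out.shape)      # torch.Size([2, 5, 64])
print(weights.shape)  # torch.Size([2, 5, 5])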

The full multi-head attention implementation would look like this:

multi_head_attention.py
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention(self.d_k)
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # Linear projections
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        output, attention_weights = self.attention(Q, K, V, mask)
        
        # Concatenate heads and apply final linear
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)
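And a smoke test for the multi-head module, again with arbitrary shapes. Note that if a mask is supplied, it must broadcast against the (batch, num_heads, seq_len, seq_len) score tensor, e.g. by including an extra head dimension.

# Self-attention: the same tensor serves as query, key, and value
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = mha(x, x, x)
print(out.shape)              # torch.Size([2, 10, 512])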

Why Transformers Work So Well

Several key properties make Transformers effective:

  1. Parallelization: Unlike RNNs, all positions can be processed simultaneously
  2. Long-range dependencies: Self-attention can directly connect distant positions
  3. Constant path length: Information flows in $O(1)$ operations between any two positions
  4. Compositionality: Multiple layers of attention create rich, hierarchical representations

"The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution." — Vaswani et al., 2017

Complexity Analysis

| Model | Complexity per Layer | Sequential Operations | Maximum Path Length |
|-------|----------------------|-----------------------|---------------------|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
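To make the table more concrete, here is a rough back-of-the-envelope count for a single layer at n = 1024, d = 512, k = 3 (constants ignored; the numbers are purely illustrative):

n, d, k = 1024, 512, 3   # sequence length, model width, conv kernel width
print(f"self-attention: n^2 * d     = {n**2 * d:>13,}")      # ≈ 537M
print(f"recurrent:      n * d^2     = {n * d**2:>13,}")      # ≈ 268M
print(f"convolutional:  k * n * d^2 = {k * n * d**2:>13,}")  # ≈ 805M

Self-attention pays a quadratic price in sequence length, which is what motivates the efficient attention mechanisms mentioned at the end of this post, but it wins decisively on sequential operations and maximum path length.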

Conclusion

The Transformer architecture represents a paradigm shift in how we approach sequence modeling. By replacing recurrence with self-attention, we gain:

  • Better parallelization for faster training
  • Stronger modeling of long-range dependencies
  • More interpretable attention patterns

This foundation has led to breakthroughs like BERT, GPT, and countless other models that continue to push the boundaries of what's possible with language understanding.


In future posts, we'll explore specific variants like Vision Transformers (ViT), efficient attention mechanisms, and the training dynamics of large Transformer models.