
Understanding Transformers: The Architecture Behind Modern AI

A deep dive into the Transformer architecture, exploring attention mechanisms, positional encodings, and why this architecture revolutionized natural language processing.

Machine Learning · Deep Learning · NLP · Transformers

The Transformer architecture, introduced in the seminal paper "Attention Is All You Need" by Vaswani et al. (2017), has fundamentally changed how we approach sequence modeling tasks. In this post, we'll explore the key components that make Transformers so powerful.

Figure: The Transformer architecture, showing the encoder and decoder stacks.

The Core Idea: Self-Attention

At the heart of the Transformer lies the self-attention mechanism. Unlike recurrent networks that process sequences step-by-step, self-attention allows the model to look at all positions in the input simultaneously.

The attention function can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weights are determined by the compatibility between the query and keys.

Mathematical Formulation

The scaled dot-product attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

  • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices
  • $d_k$ is the dimension of the keys
  • The scaling factor $\frac{1}{\sqrt{d_k}}$ prevents the dot products from growing too large (a quick numerical check follows below)
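To see why the scaling matters, here is a quick, illustrative check (my own sketch, not from the paper): for random unit-variance vectors, the standard deviation of the raw dot product grows roughly as $\sqrt{d_k}$, while the scaled scores stay near 1, keeping the softmax out of its saturated regime.

import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    # 10,000 independent query/key pairs with unit-variance components
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)          # one dot product per row
    scaled = raw / d_k ** 0.5
    print(f"d_k={d_k:3d}  std(raw)={raw.std().item():.1f}  std(scaled)={scaled.std().item():.2f}")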

Multi-Head Attention

Rather than performing a single attention function, Transformers use multi-head attention to allow the model to jointly attend to information from different representation subspaces:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Positional Encoding

Since Transformers don't have any recurrence or convolution, we need to inject information about the position of tokens in the sequence. The original paper uses sinusoidal positional encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

This encoding has the nice property that for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
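A compact way to build this table in code is sketched below. It follows the formula above; the module name and the max_len default are my own choices, not taken from the paper's reference implementation.

positional_encoding.py
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Precomputes the sinusoidal position table from the formula above."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
        # 10000^(-2i/d_model) for each even index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)            # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)            # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> add the first seq_len rows of the table
        return x + self.pe[: x.size(1)]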

Implementation Example

Here's a simplified implementation of the attention mechanism in Python:

attention.py
import torch
import torch.nn as nn
import torch.nn.functional as F
 
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k: int):
        super().__init__()
        self.scale = d_k ** 0.5
    
    def forward(self, Q, K, V, mask=None):
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        # Apply mask (optional)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Compute output
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
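A quick sanity check of the module on random tensors (the shapes here are arbitrary, chosen only for illustration):

# Sanity check: batch=2, seq_len=5, d_k=64
attn = ScaledDotProductAttention(d_k=64)
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)
out, weights = attn(Q, K, V)
print(out.shape)      # torch.Size([2, 5, 64])
print(weights.shape)  # torch.Size([2, 5, 5])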

The full multi-head attention implementation would look like this:

multi_head_attention.py
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention(self.d_k)
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # Linear projections
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        output, attention_weights = self.attention(Q, K, V, mask)
        
        # Concatenate heads and apply final linear
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)
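And a smoke test for the multi-head module, again with arbitrary shapes. Note that if a mask is supplied, it must broadcast against the (batch, num_heads, seq_len, seq_len) score tensor, e.g. by including an extra head dimension.

# Self-attention: the same tensor serves as query, key, and value
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = mha(x, x, x)
print(out.shape)              # torch.Size([2, 10, 512])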

Why Transformers Work So Well

Several key properties make Transformers effective:

  1. Parallelization: Unlike RNNs, all positions can be processed simultaneously
  2. Long-range dependencies: Self-attention can directly connect distant positions
  3. Constant path length: Information flows in $O(1)$ operations between any two positions
  4. Compositionality: Multiple layers of attention create rich, hierarchical representations

"The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution." — Vaswani et al., 2017

Complexity Analysis

| Model | Complexity per Layer | Sequential Operations | Maximum Path Length |
|-------|----------------------|-----------------------|---------------------|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
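To make the table more concrete, here is a rough back-of-the-envelope count for a single layer at n = 1024, d = 512, k = 3 (constants ignored; the numbers are purely illustrative):

n, d, k = 1024, 512, 3   # sequence length, model width, conv kernel width
print(f"self-attention: n^2 * d     = {n**2 * d:>13,}")      # ≈ 537M
print(f"recurrent:      n * d^2     = {n * d**2:>13,}")      # ≈ 268M
print(f"convolutional:  k * n * d^2 = {k * n * d**2:>13,}")  # ≈ 805M

Self-attention pays a quadratic price in sequence length, which is what motivates the efficient attention mechanisms mentioned at the end of this post, but it wins decisively on sequential operations and maximum path length.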

Conclusion

The Transformer architecture represents a paradigm shift in how we approach sequence modeling. By replacing recurrence with self-attention, we gain:

  • Better parallelization for faster training
  • Stronger modeling of long-range dependencies
  • More interpretable attention patterns

This foundation has led to breakthroughs like BERT, GPT, and countless other models that continue to push the boundaries of what's possible with language understanding.


In future posts, we'll explore specific variants like Vision Transformers (ViT), efficient attention mechanisms, and the training dynamics of large Transformer models.