Learning to Adapt at Test Time (Titans/MIRAS)
A deep dive into Titans and MIRAS architectures that enable LLMs to memorize and adapt at inference time using neural memory modules.
Table of Contents
- Introduction: The Problem Titans/MIRAS Solves
- Part 1: Titans Architecture - Deep Neural Memory
- The Core Innovation: Deep Memory Modules
- Test-Time Memorization
- The Surprise Metric
- Momentum and Forgetting Mechanisms
- Part 2: MIRAS Framework - Theoretical Unification
- Four Key Design Choices
- Memory Architecture
- Attentional Bias
- Retention Gate
- Memory Algorithm
- Beyond Mean Squared Error
- Part 3: MIRAS Variants
- YAAD: Robust to Outliers
- MONETA: Generalized Norms
- MEMORA: Probability Map Constraints
- Part 4: Experimental Results
- Language Modeling Performance
- Extreme Long-Context Recall (BABILong)
- Efficiency Comparisons
- Scaling Properties
- Part 5: Technical Deep Dive
- Mathematical Formulations
- Architecture Overview
- Code Examples
- Comparison with Existing Methods
- Implications and Future Directions
- References
Introduction: The Problem Titans/MIRAS Solves
Transformers revolutionized sequence modeling with attention mechanisms that allow models to look back at earlier inputs and prioritize relevant information. But there's a fundamental limitation: computational cost scales quadratically with sequence length. This makes it prohibitively expensive to scale Transformer-based models to extremely long contexts—the kind needed for full-document understanding, genomic analysis, or codebase-wide reasoning.
The research community explored alternatives: efficient linear recurrent neural networks (RNNs) and state space models (SSMs) like Mamba-2. These models offer fast, linear scaling by compressing context into a fixed-size state. However, this fixed-size compression has a critical weakness: it cannot adequately capture the rich, nuanced information in very long sequences. It's like trying to summarize a novel in a single sentence: something important always gets lost.
In two groundbreaking papers, Titans and MIRAS, Google Research introduces an architecture and theoretical framework that combine the speed of RNNs with the accuracy of Transformers. Titans is the specific architecture—the practical tool. MIRAS is the theoretical blueprint—the unified framework that generalizes these approaches. Together, they advance the concept of test-time memorization: the ability of an AI model to maintain long-term memory by incorporating powerful "surprise" metrics (unexpected, important information) while the model is running, without dedicated offline retraining.
The MIRAS framework, as demonstrated by Titans, introduces a meaningful shift toward real-time adaptation. Instead of compressing information into a static state, this architecture actively learns and updates its own parameters as data streams in. This crucial mechanism enables the model to incorporate new, specific details into its core knowledge instantly—like a human who can learn and remember new facts during a conversation, not just during study sessions.
Part 1: Titans Architecture - Deep Neural Memory
The Core Innovation: Deep Memory Modules
An effective learning system requires distinct yet interconnected memory modules, mirroring the human brain's separation of short-term and long-term memory. While attention mechanisms excel for precise, short-term memory, Titans introduces a novel neural long-term memory module that fundamentally differs from traditional approaches.
Unlike the fixed-size vector or matrix memory in traditional RNNs, Titans uses a deep neural network (specifically, a multi-layer perceptron) as its memory module. This provides significantly higher expressive power, allowing the model to summarize large volumes of information without losing important context. The model isn't simply taking notes; it's understanding and synthesizing the entire story.
The architecture consists of three key components:
- Contextual Memory (Learning): The deep neural memory module that learns and updates during processing
- Core (In-Context Learning): The main transformer-like component that processes current context
- Persistent Memory (Fixed Weights): The pre-trained backbone that provides foundational knowledge
The contextual memory compresses past data into a summary, which is then incorporated into the context and passed to attention. Attention can then decide whether it needs to attend to the summary of the past or focus on recent tokens. This creates a hierarchical memory system where recent information gets precise attention, while distant information is intelligently summarized.
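As a rough illustration of this read path (a minimal sketch, not the authors' exact interface; `recent_tokens`, `memory_summary`, and the single `nn.MultiheadAttention` layer are my own simplifications), the compressed summary can simply be prepended to the recent-token window before attention:

```python
import torch
import torch.nn as nn

def attend_with_memory(recent_tokens: torch.Tensor,
                       memory_summary: torch.Tensor,
                       attn: nn.MultiheadAttention) -> torch.Tensor:
    """Let attention choose between the compressed past and precise recent context.

    recent_tokens:  [batch, window, d_model]      - short-term, high-precision context
    memory_summary: [batch, num_summary, d_model] - read-out of the long-term memory
    attn: an nn.MultiheadAttention built with batch_first=True
    """
    # Keys/values span both the memory summary and the recent tokens,
    # so each query can attend to either source as needed.
    context = torch.cat([memory_summary, recent_tokens], dim=1)
    out, _ = attn(recent_tokens, context, context, need_weights=False)
    return out

# Example wiring (hypothetical sizes):
# attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
# out = attend_with_memory(torch.randn(2, 128, 512), torch.randn(2, 16, 512), attn)
```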
Test-Time Memorization
The revolutionary aspect of Titans is test-time memorization—the ability to learn and update memory while the model is actively running, not just during training. Traditional models are frozen after training: they can only use what they learned during the training phase. Titans breaks this limitation.
During inference, as new tokens stream in, Titans continuously updates its long-term memory module. This isn't just storing raw data—it's learning representations, relationships, and conceptual themes that connect tokens across the entire input. The model becomes an active learner, adapting its understanding in real-time.
This capability is crucial for handling extremely long contexts. Imagine processing a 2-million-token document: you can't keep everything in active memory, but you also can't afford to forget important details. Test-time memorization allows Titans to selectively learn what matters most, creating a compressed but rich representation of the entire sequence.
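A minimal sketch of the inference loop this implies, assuming a hypothetical `memory` object with `read()` and `update()` methods (not an API from the papers) and a `model` that conditions on the memory's read-out:

```python
def stream_infer(model, memory, token_chunks):
    """Test-time memorization sketch: the memory is written as tokens stream in."""
    outputs = []
    for chunk in token_chunks:             # successive segments of a very long input
        summary = memory.read()            # compressed view of everything seen so far
        outputs.append(model(chunk, summary))
        memory.update(chunk)               # test-time write: adapt the memory to this chunk
    return outputs
```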
The Surprise Metric
A key aspect of Titans' learning ability is what the researchers call the "surprise metric". In human psychology, we know we quickly forget routine, expected events but remember things that break the pattern—unexpected, surprising, or highly emotional events. Titans implements a mathematical equivalent of this principle.
The surprise metric is the model detecting a large difference between what it currently remembers and what the new input is telling it. Formally, this is measured using gradients—the internal error signal that indicates how much the model's current state differs from the new information.
Low surprise: If the new word is "cat" and the model's memory state already expects an animal word, the gradient (surprise) is low. The model can safely skip memorizing the word "cat" in its permanent long-term state because it's consistent with existing knowledge.
High surprise: If the model's memory state is summarizing a serious financial report, and the new input is a picture of a banana peel (the unexpected event), the gradient (surprise) will be very high. This signals that the new input is important or anomalous, and it must be prioritized for permanent storage in the long-term memory module.
The model uses this internal error signal (the gradient) as a mathematical equivalent of saying, "This is unexpected and important!" This allows the Titans architecture to selectively update its long-term memory only with the most novel and context-breaking information, keeping the overall process fast and efficient.
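In symbols (my notation, previewing the fuller formulation in Part 5), the momentary surprise for a new token $x_t$ is the gradient of the memory's loss with respect to the memory parameters, and it is the magnitude of this gradient that the model thresholds:

$$S_t = \nabla_{\theta}\, \mathcal{L}\bigl(M_{\theta_{t-1}};\, x_t\bigr), \qquad \text{surprise}_t = \lVert S_t \rVert$$

A large $\lVert S_t \rVert$ means the token is inconsistent with what the memory currently encodes and should be written; a small one means it can safely be skipped.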
Momentum and Forgetting Mechanisms
Titans refines the surprise mechanism by incorporating two critical elements:
Momentum: The model considers both "momentary surprise" (the current input) and "past surprise" (the recent context flow). This ensures relevant subsequent information is also captured, even if those tokens are not individually surprising. For example, if a surprising event occurs, the next few tokens that provide context about that event should also be remembered, even if they're not surprising themselves.
Forgetting (Weight Decay): To manage the finite capacity of the memory when dealing with extremely long sequences, Titans employs an adaptive weight decay mechanism. This acts as a forgetting gate, allowing the model to discard information that is no longer needed. The decay is adaptive—more aggressive for less important information, gentler for critical knowledge.
Together, momentum and forgetting create a balanced memory system: momentum ensures continuity and context preservation, while forgetting prevents memory overflow and maintains focus on what matters most.
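Schematically, and using the same notation as the formulation in Part 5 (Titans makes these gates data-dependent; this is a simplified rendering), momentum smooths the surprise signal before it is written, while an adaptive decay term $\alpha_t$ forgets stale content:

$$\tilde{S}_t = \beta\, \tilde{S}_{t-1} + (1 - \beta)\, S_t, \qquad \theta_t = (1 - \alpha_t)\,\theta_{t-1} - \eta_t\, \tilde{S}_t$$

Here $\beta$ weights past surprise against the current one, $\eta_t$ is the surprise-dependent learning rate, and $\alpha_t \in [0, 1]$ is the adaptive forgetting gate.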
Part 2: MIRAS Framework - Theoretical Unification
Four Key Design Choices
MIRAS (Memory-Informed Robust Associative Sequence modeling) provides a unified theoretical framework that reveals a profound insight: every major breakthrough in sequence modeling—from modern transformers to lightning-fast linear RNNs—is essentially the same thing under the hood: a highly complex associative memory module.
What makes MIRAS both unique and practical is the way it views AI modeling. Instead of seeing diverse architectures as fundamentally different, it sees different methods of solving the same problem: efficiently combining new information with old memories without letting essential concepts be forgotten.
MIRAS defines a sequence model through four key design choices:
- Memory Architecture: The structure that stores information (e.g., a vector, matrix, or a deep multi-layer perceptron, like in Titans)
- Attentional Bias: The internal learning objective the model optimizes that determines what it prioritizes
- Retention Gate: The memory regularizer. MIRAS reinterprets "forgetting mechanisms" as specific forms of regularization that balance new learning against retaining past knowledge
- Memory Algorithm: The optimization algorithm used to update the memory
These four dimensions create a design space where existing architectures are just specific points. Transformers, Mamba, RWKV, and others all fall within this framework—they're just different combinations of these four choices.
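One informal way to picture this design space (my own rendering, not an API from the MIRAS paper) is as a four-field configuration; each existing architecture is one particular instantiation:

```python
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn as nn

@dataclass
class SequenceModelDesign:
    """Informal rendering of the four MIRAS design choices."""
    memory_architecture: nn.Module                                           # vector, matrix, or deep MLP
    attentional_bias: Callable[[torch.Tensor, torch.Tensor], torch.Tensor]   # internal objective (MSE, Huber, l_p, ...)
    retention_gate: Callable[[torch.Tensor], torch.Tensor]                   # regularizer balancing new vs. old information
    memory_algorithm: str                                                    # e.g. "gradient_descent", "gd_with_momentum"

# A Titans-like point in this space would pair a deep MLP memory with a
# surprise-weighted squared-error bias, adaptive weight decay, and momentum updates.
```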
Memory Architecture
The memory architecture determines how information is stored. Traditional approaches use:
- Vectors: Simple fixed-size arrays (like in basic RNNs)
- Matrices: Two-dimensional structures (like in some attention mechanisms)
Titans introduces deep neural networks (multi-layer perceptrons) as memory modules. This provides substantially more expressive power. A vector can store individual values. A matrix can store pairwise relationships. A deep MLP can learn arbitrary functions mapping inputs to compressed representations, encoding far more distinct patterns than either.
The depth of the memory architecture is crucial. Ablation studies show that deeper memory modules consistently achieve lower perplexity in language modeling and exhibit better scaling properties, maintaining performance as sequence length increases significantly.
Attentional Bias
The attentional bias determines what the model prioritizes when updating memory. This is the learning objective—what should the model optimize for?
Most existing models use mean squared error (MSE) or dot-product similarity as their bias. This works well for average cases but can be problematic:
- Sensitivity to outliers: A single typo or anomaly can disproportionately affect the model
- Limited expressive power: MSE assumes a Gaussian distribution of errors, which may not match real-world data distributions
- Uniform weighting: All errors are treated equally, regardless of their importance
MIRAS allows exploring richer bias functions. The Titans surprise metric is one example—it prioritizes unexpected information. But MIRAS opens the door to many other possibilities: robust losses (Huber, quantile), information-theoretic objectives, or task-specific biases.
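To make the alternatives concrete, here is a small sketch (illustrative only, not code from the papers) of interchangeable attentional-bias functions, where `pred` is the memory's read-out and `target` the value it should reproduce:

```python
import torch
import torch.nn.functional as F

def mse_bias(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Standard squared-error bias: simple, but sensitive to outliers.
    return ((pred - target) ** 2).mean()

def huber_bias(pred: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Quadratic for small residuals, linear for large ones (robust to outliers).
    return F.huber_loss(pred, target, delta=delta)

def quantile_bias(pred: torch.Tensor, target: torch.Tensor, q: float = 0.5) -> torch.Tensor:
    # Pinball loss: asymmetric penalty targeting the q-th quantile of the residuals.
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()
```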
Retention Gate
The retention gate manages the balance between learning new information and retaining existing knowledge. In MIRAS, forgetting mechanisms are reinterpreted as specific forms of regularization.
Traditional approaches use a simple, fixed decay of the form $M_t = \gamma\, M_{t-1} + (\text{new information})$ with a constant $0 < \gamma < 1$. This treats all information equally. MIRAS allows more sophisticated retention strategies:
- Adaptive decay: Different decay rates for different types of information
- Selective forgetting: Forgetting less important information more aggressively
- Stability constraints: Ensuring memory updates don't destabilize the system
Titans uses adaptive weight decay that considers the importance of information—critical knowledge decays slowly, routine information decays quickly.
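A toy sketch of importance-dependent decay (the gating function and constants here are hypothetical, chosen only to illustrate the idea):

```python
import torch

def adaptive_decay(memory: torch.Tensor, importance: torch.Tensor) -> torch.Tensor:
    """Per-component forgetting: components scored as important decay slowly.

    memory:     [batch, memory_dim] current memory state
    importance: [batch, memory_dim] scores in [0, 1]; 1 = critical, 0 = routine
    """
    # Routine components shrink by ~10% per step, critical ones barely at all.
    decay = 0.9 + 0.1 * importance
    return decay * memory
```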
Memory Algorithm
The memory algorithm is the optimization method used to update memory. Most models use simple gradient descent or its variants. MIRAS provides a framework for exploring more sophisticated algorithms:
- Momentum-based updates: Considering past gradients, not just current ones
- Adaptive learning rates: Different rates for different memory components
- Second-order methods: Using curvature information for more efficient updates
Titans uses gradient-based optimization with momentum, allowing it to capture not just immediate surprises but also the context around surprising events.
Beyond Mean Squared Error
Virtually all successful existing sequence models rely on mean squared error (MSE) or dot-product similarity for both their bias and retention. This reliance can make models sensitive to outliers and limit their expressive power.
MIRAS transcends this limitation by providing a generative framework to explore a richer design space informed by optimization and statistics literature. This allows for the creation of novel architectures with non-Euclidean objectives and regularization.
The framework enables exploration of:
- Robust losses: Less sensitive to outliers (Huber loss, quantile loss)
- Information-theoretic objectives: Maximizing mutual information, minimizing entropy
- Task-specific biases: Optimizing for specific downstream tasks
- Non-Euclidean geometries: Using different distance metrics and norms
This theoretical flexibility is what enables the MIRAS variants—YAAD, MONETA, and MEMORA—each exploring different points in this design space.
Part 3: MIRAS Variants
Using the MIRAS framework, researchers created three specific attention-free models, each exploring different design choices:
YAAD: Robust to Outliers
YAAD (Yet Another Attention-free Architecture with Robustness) is designed to be less sensitive to major errors or "outliers" (like a single typo in a large document). It uses a gentler mathematical penalty (Huber loss) for mistakes, so it doesn't overreact to one-off issues.
Key Innovation: Instead of MSE, YAAD uses the Huber loss:

$$\ell_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta\bigl(|r| - \tfrac{1}{2}\delta\bigr) & \text{otherwise} \end{cases}$$

where $r$ is the residual between the memory's prediction and the target. For small errors (within the threshold $\delta$), it behaves like MSE. For large errors (outliers), it applies a linear penalty instead of a quadratic one, making it robust to anomalies.
This makes the model more robust when input data is messy or inconsistent—exactly what you need when processing real-world, imperfect data at scale.
MONETA: Generalized Norms
MONETA explores the use of more complex and strict mathematical penalties (called generalized norms). It investigates whether using these more disciplined rules for both what the model attends to and what it forgets can lead to a more powerful and stable long-term memory system overall.
Key Innovation: MONETA uses $\ell_p$-norms and other generalized distance metrics in place of the squared error, e.g. penalizing the memory's residual as $\lVert r \rVert_p^p$ rather than $\lVert r \rVert_2^2$.

By varying $p$, MONETA can explore different geometries:
- $p = 2$: standard Euclidean distance (like MSE)
- $p = 1$: Manhattan distance (more robust to outliers)
- $p \to \infty$: Chebyshev distance (focuses on the maximum error)
MONETA applies these norms to both the attentional bias (what to prioritize) and the retention gate (what to forget), creating a more disciplined memory system.
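A minimal sketch of such a generalized penalty (illustrative; the exact objective MONETA optimizes may differ):

```python
import torch

def lp_bias(pred: torch.Tensor, target: torch.Tensor, p: float = 1.5) -> torch.Tensor:
    """Generalized l_p attentional bias: p=2 recovers MSE-like behavior, p=1 is more outlier-robust."""
    return (pred - target).abs().pow(p).mean()
```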
MEMORA: Probability Map Constraints
MEMORA focuses on achieving the best possible memory stability by forcing its memory to act like a strict probability map. By using this constraint, it ensures that every time the memory state is updated, the changes are controlled and balanced.
Key Innovation: MEMORA constrains memory updates to maintain probability distribution properties:
- Non-negativity: Memory values must be non-negative
- Normalization: Memory states sum to 1 (or integrate to 1 for continuous cases)
- Monotonicity: Updates preserve ordering relationships
This guarantees a clean, stable process for integrating new information. The memory behaves like a probability distribution over possible states, making it interpretable and ensuring updates don't create invalid states.
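One way to picture such a constraint (a sketch under my own assumptions, not MEMORA's actual update rule) is a multiplicative, exponentiated-gradient style write followed by renormalization, which keeps the memory on the probability simplex:

```python
import torch

def probability_map_update(memory: torch.Tensor, update: torch.Tensor,
                           rate: float = 0.1) -> torch.Tensor:
    """Write to a memory constrained to remain a probability distribution.

    memory: [batch, memory_dim], non-negative, summing to 1 along the last dim
    update: [batch, memory_dim], raw (unconstrained) write signal
    """
    new_memory = memory * torch.exp(rate * update)             # stays strictly positive
    return new_memory / new_memory.sum(dim=-1, keepdim=True)   # restores the sum-to-one constraint
```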
Part 4: Experimental Results
Language Modeling Performance
Titans and MIRAS variants were rigorously compared against leading architectures, including Transformer++, Mamba-2, and Gated DeltaNet. Across standard language modeling datasets (C4, WikiText) and zero-shot reasoning tasks (HellaSwag, PIQA), the models consistently demonstrated:
- Higher accuracy: Better performance on downstream tasks
- Lower perplexity: Less "surprise" when looking at text, indicating better language modeling
- Efficient training: Maintains parallelizable training despite RNN-like inference
- Fast inference: Linear scaling with sequence length
The novel MIRAS variants (MONETA, YAAD, MEMORA) also achieved improved performance compared to baselines, validating the benefit of exploring robust, non-MSE optimization mechanisms.
The Power of Deep Memory
Ablation studies clearly show that the depth of the memory architecture is crucial. When comparing long-term memory modules of the same size but different depths:
- Deeper memories achieve lower perplexity: More layers in the memory MLP lead to better compression and understanding
- Better scaling properties: Deeper memories maintain performance as sequence length increases significantly
- Consistent across model sizes: The benefit holds for both 360M and 760M parameter models
This validates the core innovation: using deep neural networks as memory modules provides exponentially more expressive power than fixed-size vectors or matrices.
Extreme Long-Context Recall
The most significant advantage of these new architectures is their ability to handle extremely long contexts. This is highlighted in the BABILong benchmark, a task requiring reasoning across facts distributed in extremely long documents.
In this challenging setting:
- Titans outperforms all baselines, including extremely large models like GPT-4, despite having many fewer parameters
- Scales effectively to 2M+ tokens: Demonstrates capability far beyond traditional context windows
- Maintains accuracy: Performance doesn't degrade as context length increases
The ability to memorize and retrieve information from 2-million-token contexts opens new possibilities for:
- Full-document understanding: Processing entire books, legal documents, or codebases
- Genomic analysis: Analyzing entire genomes or large genomic datasets
- Long-term reasoning: Maintaining context across extended conversations or analysis sessions
Efficiency Comparisons
Despite their powerful capabilities, Titans and MIRAS variants maintain efficient computation:
- Linear inference: $O(n)$ time complexity in sequence length, the same as RNNs
- Parallelizable training: Can still be trained efficiently, unlike sequential RNNs
- Memory efficient: Deep memory modules are compact compared to full attention matrices
The models achieve the best of both worlds: Transformer-like accuracy with RNN-like efficiency.
Part 5: Technical Deep Dive
Mathematical Formulations
The core of Titans' memory update mechanism can be formalized as follows. Let $M_{\theta}$ be the deep memory module with parameters $\theta_t$ at time $t$ (the parameters are the memory state), and let $x_t$ be the new input token.

The surprise metric is computed as the gradient of the loss with respect to the memory:

$$S_t = \nabla_{\theta}\, \mathcal{L}\bigl(M_{\theta_{t-1}};\, x_t\bigr)$$

where $\mathcal{L}$ is the loss function comparing the model's prediction with the expected output.

The memory update incorporates surprise, momentum, and forgetting:

$$\theta_t = (1 - \alpha_t)\,\theta_{t-1} - \eta_t\, \tilde{S}_t$$

where:
- $\eta_t$ is the adaptive learning rate based on surprise
- $M_{\theta}$ is the deep neural network memory module (parameterized by $\theta$)
- The forgetting term $(1 - \alpha_t)$ implements adaptive weight decay

The momentum mechanism considers recent context:

$$\tilde{S}_t = \beta\, \tilde{S}_{t-1} + (1 - \beta)\, S_t$$

This ensures that surprising events and their immediate context are both captured.
Architecture Overview
The Titans architecture can be conceptualized as:
```
Input Sequence → [Contextual Memory (Learning)] → Summary
                          ↓
          [Core (In-Context Learning)] → Attention
                          ↓
          [Persistent Memory (Fixed)] → Output
```

The contextual memory compresses past tokens into a summary representation. This summary is then:
- Incorporated into the current context
- Passed to the attention mechanism
- Used alongside recent tokens for prediction
The attention mechanism can decide dynamically whether to focus on:
- Recent tokens (high precision, local context)
- Memory summary (compressed, global context)
- Both (hybrid approach)
This creates a hierarchical memory system where precision and efficiency are balanced.
Code Examples
Here's a simplified implementation of the Titans memory update mechanism:
```python
import torch
import torch.nn as nn


class TitansMemory(nn.Module):
    def __init__(self, d_model: int, memory_dim: int, num_layers: int = 3):
        super().__init__()
        self.d_model = d_model
        self.memory_dim = memory_dim

        # Deep neural network memory module (an MLP rather than a flat vector/matrix)
        layers = [nn.Linear(d_model, memory_dim), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers.append(nn.Linear(memory_dim, memory_dim))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(memory_dim, memory_dim))
        self.memory_net = nn.Sequential(*layers)

        # Surprise threshold and momentum coefficient
        self.surprise_threshold = 0.1
        self.momentum_alpha = 0.9

        # Running (momentum-smoothed) surprise, initialized lazily
        self.prev_surprise = None

    def forward(self, x: torch.Tensor, memory_state: torch.Tensor):
        """
        x:            [batch, d_model]    - current token embedding
        memory_state: [batch, memory_dim] - current memory state
        """
        # Gradients are needed even at inference time to measure surprise
        with torch.enable_grad():
            memory_state = memory_state.detach().requires_grad_(True)

            # Candidate memory content produced from the current input
            memory_update = self.memory_net(x)

            # Simplified loss: in practice, this would be the model's actual
            # associative-memory objective
            loss = torch.mean((memory_update - memory_state) ** 2)

            # Surprise = magnitude of the gradient with respect to the memory
            grad = torch.autograd.grad(loss, memory_state)[0]
            surprise = torch.norm(grad, dim=-1, keepdim=True)  # [batch, 1]

        # Adaptive learning rate based on surprise
        lambda_t = torch.sigmoid(surprise - self.surprise_threshold)

        # Momentum over surprise: past surprise keeps context around surprising events
        if self.prev_surprise is None:
            self.prev_surprise = surprise.detach()
        surprise_momentum = (
            self.momentum_alpha * self.prev_surprise
            + (1 - self.momentum_alpha) * surprise
        )
        self.prev_surprise = surprise_momentum.detach()

        # Adaptive forgetting: write more aggressively when surprise is high
        lambda_adaptive = torch.clamp(lambda_t * (1 + surprise_momentum), 0, 1)

        # Update memory state: old content decays, surprising content is written
        new_memory = (
            (1 - lambda_adaptive) * memory_state.detach()
            + lambda_adaptive * memory_update
        )
        return new_memory, surprise.squeeze(-1)
```

This simplified implementation shows the key components; a short usage sketch follows the list below:
- Deep neural network memory module
- Surprise computation via gradients
- Adaptive learning rate based on surprise
- Momentum mechanism
- Adaptive forgetting (weight decay)
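A quick usage sketch (hypothetical shapes and values) showing how the module above would be driven over a token stream:

```python
import torch

batch, d_model, memory_dim = 2, 512, 256

memory = TitansMemory(d_model=d_model, memory_dim=memory_dim, num_layers=3)
memory_state = torch.zeros(batch, memory_dim)

# Tokens arrive one at a time; the memory is updated as they stream in.
for _ in range(10):
    token_embedding = torch.randn(batch, d_model)
    memory_state, surprise = memory(token_embedding, memory_state)
    # `surprise` can be inspected to see which tokens triggered strong memory writes.
```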
Key Advantages of Titans/MIRAS:
- Expressive Memory: Deep neural networks can represent exponentially more patterns than fixed-size vectors
- Active Learning: Test-time memorization allows continuous adaptation
- Selective Memorization: Surprise metric ensures only important information is stored
- Efficient: Linear complexity with parallelizable training
- Scalable: Handles 2M+ token contexts effectively
Trade-offs:
- More complex: Deep memory modules require more parameters than simple vectors
- Training complexity: Need to handle test-time updates during training
- Memory overhead: Deep networks require more memory than fixed-size states
However, these trade-offs are justified by the significant gains in long-context performance and the ability to handle real-world, imperfect data robustly.
Implications and Future Directions
The introduction of Titans and the MIRAS framework marks a significant advancement in sequence modeling. By employing deep neural networks as memory modules that learn to memorize as data streams in, these approaches overcome the limitations of fixed-size recurrent states.
Key Implications:
- Unified Theory: MIRAS reveals that all sequence models are variations of associative memory, providing a unified lens for understanding and designing architectures
- Beyond Euclidean: Moving beyond MSE opens new possibilities for robust, task-specific, and information-theoretic objectives
- Real-Time Adaptation: Test-time memorization enables models that continuously learn, adapting to new information without retraining
- Practical Long Context: The ability to handle 2M+ token contexts opens new applications in document understanding, genomics, and long-term reasoning
Future Directions:
- Hybrid Architectures: Combining Titans-style memory with other efficient attention mechanisms (like Infini-Attention or Ring Attention)
- Task-Specific Biases: Designing attentional biases optimized for specific downstream tasks (e.g., code understanding, scientific reasoning)
- Multi-Modal Memory: Extending the framework to handle images, audio, and other modalities in the memory system
- Theoretical Analysis: Deeper understanding of the expressivity and limitations of deep memory modules
- Efficiency Improvements: Further optimization of the memory update mechanisms for even faster inference
- Robustness: Exploring more robust loss functions and retention mechanisms for noisy, real-world data
The research opens the door to a new generation of sequence models that combine the efficiency of RNNs with the expressive power needed for the era of long-context AI. As we move toward models that can understand entire codebases, analyze full genomes, or maintain context across extended conversations, architectures like Titans and frameworks like MIRAS will be essential.
References
- Behrouz, A., Zhong, P., & Mirrokni, V. (2024). Titans: Learning to Memorize at Test Time. arXiv preprint. arXiv:2501.00663
- Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization (the MIRAS framework). arXiv preprint. arXiv:2504.13173
- Google Research. (2025). Titans + MIRAS: Helping AI have long-term memory. Google Research Blog. https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint. arXiv:2312.00752
- Peng, B., Alcaide, E., Anthony, Q., et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. Findings of EMNLP. arXiv:2305.13048
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1706.03762