Learning to Adapt at Test Time (Titans/MIRAS)
A deep dive into Titans and MIRAS architectures that enable LLMs to memorize and adapt at inference time using neural memory modules.
Table of Contents
- Introduction: The Problem Titans/MIRAS Solves
- Part 1: Titans Architecture - Deep Neural Memory
- The Core Innovation: Deep Memory Modules
- Test-Time Memorization
- The Surprise Metric
- Momentum and Forgetting Mechanisms
- Part 2: MIRAS Framework - Theoretical Unification
- Four Key Design Choices
- Memory Architecture
- Attentional Bias
- Retention Gate
- Memory Algorithm
- Beyond Mean Squared Error
- Part 3: MIRAS Variants
- YAAD: Robust to Outliers
- MONETA: Generalized Norms
- MEMORA: Probability Map Constraints
- Part 4: Experimental Results
- Language Modeling Performance
- Extreme Long-Context Recall (BABILong)
- Efficiency Comparisons
- Scaling Properties
- Part 5: Technical Deep Dive
- Mathematical Formulations
- Architecture Overview
- Code Examples
- Comparison with Existing Methods
- Implications and Future Directions
- References
Introduction: The Problem Titans/MIRAS Solves
Transformers revolutionized sequence modeling with attention mechanisms that allow models to look back at earlier inputs and prioritize relevant information. But there's a fundamental limitation: computational cost scales quadratically with sequence length. This makes it prohibitively expensive to scale Transformer-based models to extremely long contexts—the kind needed for full-document understanding, genomic analysis, or codebase-wide reasoning.
The research community explored alternatives: efficient linear recurrent neural networks (RNNs) and state space models (SSMs) like Mamba-2. These models offer fast, linear scaling by compressing context into a fixed-size state. However, this fixed-size compression has a critical weakness: it cannot adequately capture the rich, nuanced information in very long sequences. It's like trying to summarize a novel in a single sentence: something important always gets lost.
In two groundbreaking papers, Titans and MIRAS, Google Research introduces an architecture and theoretical framework that combine the speed of RNNs with the accuracy of Transformers. Titans is the specific architecture—the practical tool. MIRAS is the theoretical blueprint—the unified framework that generalizes these approaches. Together, they advance the concept of test-time memorization: the ability of an AI model to maintain long-term memory by incorporating powerful "surprise" metrics (unexpected, important information) while the model is running, without dedicated offline retraining.
The MIRAS framework, as demonstrated by Titans, introduces a meaningful shift toward real-time adaptation. Instead of compressing information into a static state, this architecture actively learns and updates its own parameters as data streams in. This crucial mechanism enables the model to incorporate new, specific details into its core knowledge instantly—like a human who can learn and remember new facts during a conversation, not just during study sessions.
Part 1: Titans Architecture - Deep Neural Memory
The Core Innovation: Deep Memory Modules
An effective learning system requires distinct yet interconnected memory modules, mirroring the human brain's separation of short-term and long-term memory. While attention mechanisms excel for precise, short-term memory, Titans introduces a novel neural long-term memory module that fundamentally differs from traditional approaches.
Unlike the fixed-size vector or matrix memory in traditional RNNs, Titans uses a deep neural network (specifically, a multi-layer perceptron) as its memory module. This provides significantly higher expressive power, allowing the model to summarize large volumes of information without losing important context. The model isn't simply taking notes; it's understanding and synthesizing the entire story.
The architecture consists of three key components:
- Contextual Memory (Learning): The deep neural memory module that learns and updates during processing
- Core (In-Context Learning): The main transformer-like component that processes current context
- Persistent Memory (Fixed Weights): The pre-trained backbone that provides foundational knowledge
The contextual memory compresses past data into a summary, which is then incorporated into the context and passed to attention. Attention can then decide whether it needs to attend to the summary of the past or focus on recent tokens. This creates a hierarchical memory system where recent information gets precise attention, while distant information is intelligently summarized.
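As a rough illustration of this read path (a minimal sketch, not the authors' exact interface; `recent_tokens`, `memory_summary`, and the single `nn.MultiheadAttention` layer are my own simplifications), the compressed summary can simply be prepended to the recent-token window before attention:

```python
import torch
import torch.nn as nn

def attend_with_memory(recent_tokens: torch.Tensor,
                       memory_summary: torch.Tensor,
                       attn: nn.MultiheadAttention) -> torch.Tensor:
    """Let attention choose between the compressed past and precise recent context.

    recent_tokens:  [batch, window, d_model]      - short-term, high-precision context
    memory_summary: [batch, num_summary, d_model] - read-out of the long-term memory
    attn: an nn.MultiheadAttention built with batch_first=True
    """
    # Keys/values span both the memory summary and the recent tokens,
    # so each query can attend to either source as needed.
    context = torch.cat([memory_summary, recent_tokens], dim=1)
    out, _ = attn(recent_tokens, context, context, need_weights=False)
    return out

# Example wiring (hypothetical sizes):
# attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
# out = attend_with_memory(torch.randn(2, 128, 512), torch.randn(2, 16, 512), attn)
```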
Test-Time Memorization
The revolutionary aspect of Titans is test-time memorization—the ability to learn and update memory while the model is actively running, not just during training. Traditional models are frozen after training: they can only use what they learned during the training phase. Titans breaks this limitation.
During inference, as new tokens stream in, Titans continuously updates its long-term memory module. This isn't just storing raw data—it's learning representations, relationships, and conceptual themes that connect tokens across the entire input. The model becomes an active learner, adapting its understanding in real-time.
This capability is crucial for handling extremely long contexts. Imagine processing a 2-million-token document: you can't keep everything in active memory, but you also can't afford to forget important details. Test-time memorization allows Titans to selectively learn what matters most, creating a compressed but rich representation of the entire sequence.
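A minimal sketch of the inference loop this implies, assuming a hypothetical `memory` object with `read()` and `update()` methods (not an API from the papers) and a `model` that conditions on the memory's read-out:

```python
def stream_infer(model, memory, token_chunks):
    """Test-time memorization sketch: the memory is written as tokens stream in."""
    outputs = []
    for chunk in token_chunks:             # successive segments of a very long input
        summary = memory.read()            # compressed view of everything seen so far
        outputs.append(model(chunk, summary))
        memory.update(chunk)               # test-time write: adapt the memory to this chunk
    return outputs
```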
The Surprise Metric
A key aspect of Titans' learning ability is what the researchers call the "surprise metric". In human psychology, we know we quickly forget routine, expected events but remember things that break the pattern—unexpected, surprising, or highly emotional events. Titans implements a mathematical equivalent of this principle.
The surprise metric is the model detecting a large difference between what it currently remembers and what the new input is telling it. Formally, this is measured using gradients—the internal error signal that indicates how much the model's current state differs from the new information.
Low surprise: If the new word is "cat" and the model's memory state already expects an animal word, the gradient (surprise) is low. The model can safely skip memorizing the word "cat" in its permanent long-term state because it's consistent with existing knowledge.
High surprise: If the model's memory state is summarizing a serious financial report, and the new input is a picture of a banana peel (the unexpected event), the gradient (surprise) will be very high. This signals that the new input is important or anomalous, and it must be prioritized for permanent storage in the long-term memory module.
The model uses this internal error signal (the gradient) as a mathematical equivalent of saying, "This is unexpected and important!" This allows the Titans architecture to selectively update its long-term memory only with the most novel and context-breaking information, keeping the overall process fast and efficient.
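In symbols (my notation, previewing the fuller formulation in Part 5), the momentary surprise for a new token $x_t$ is the gradient of the memory's loss with respect to the memory parameters, and it is the magnitude of this gradient that the model thresholds:

$$S_t = \nabla_{\theta}\, \mathcal{L}\bigl(M_{\theta_{t-1}};\, x_t\bigr), \qquad \text{surprise}_t = \lVert S_t \rVert$$

A large $\lVert S_t \rVert$ means the token is inconsistent with what the memory currently encodes and should be written; a small one means it can safely be skipped.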
Momentum and Forgetting Mechanisms
Titans refines the surprise mechanism by incorporating two critical elements:
Momentum: The model considers both "momentary surprise" (the current input) and "past surprise" (the recent context flow). This ensures relevant subsequent information is also captured, even if those tokens are not individually surprising. For example, if a surprising event occurs, the next few tokens that provide context about that event should also be remembered, even if they're not surprising themselves.
Forgetting (Weight Decay): To manage the finite capacity of the memory when dealing with extremely long sequences, Titans employs an adaptive weight decay mechanism. This acts as a forgetting gate, allowing the model to discard information that is no longer needed. The decay is adaptive—more aggressive for less important information, gentler for critical knowledge.
Together, momentum and forgetting create a balanced memory system: momentum ensures continuity and context preservation, while forgetting prevents memory overflow and maintains focus on what matters most.
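Schematically, and using the same notation as the formulation in Part 5 (Titans makes these gates data-dependent; this is a simplified rendering), momentum smooths the surprise signal before it is written, while an adaptive decay term $\alpha_t$ forgets stale content:

$$\tilde{S}_t = \beta\, \tilde{S}_{t-1} + (1 - \beta)\, S_t, \qquad \theta_t = (1 - \alpha_t)\,\theta_{t-1} - \eta_t\, \tilde{S}_t$$

Here $\beta$ weights past surprise against the current one, $\eta_t$ is the surprise-dependent learning rate, and $\alpha_t \in [0, 1]$ is the adaptive forgetting gate.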
Part 2: MIRAS Framework - Theoretical Unification
Four Key Design Choices
MIRAS (Memory-Informed Robust Associative Sequence modeling) provides a unified theoretical framework that reveals a profound insight: every major breakthrough in sequence modeling—from modern transformers to lightning-fast linear RNNs—is essentially the same thing under the hood: a highly complex associative memory module.
What makes MIRAS both unique and practical is the way it views AI modeling. Instead of seeing diverse architectures as fundamentally different, it sees different methods of solving the same problem: efficiently combining new information with old memories without letting essential concepts be forgotten.
MIRAS defines a sequence model through four key design choices:
- Memory Architecture: The structure that stores information (e.g., a vector, matrix, or a deep multi-layer perceptron, like in Titans)
- Attentional Bias: The internal learning objective the model optimizes that determines what it prioritizes
- Retention Gate: The memory regularizer. MIRAS reinterprets "forgetting mechanisms" as specific forms of regularization that balance new learning against retaining past knowledge
- Memory Algorithm: The optimization algorithm used to update the memory
These four dimensions create a design space where existing architectures are just specific points. Transformers, Mamba, RWKV, and others all fall within this framework—they're just different combinations of these four choices.
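One informal way to picture this design space (my own rendering, not an API from the MIRAS paper) is as a four-field configuration; each existing architecture is one particular instantiation:

```python
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn as nn

@dataclass
class SequenceModelDesign:
    """Informal rendering of the four MIRAS design choices."""
    memory_architecture: nn.Module                                           # vector, matrix, or deep MLP
    attentional_bias: Callable[[torch.Tensor, torch.Tensor], torch.Tensor]   # internal objective (MSE, Huber, l_p, ...)
    retention_gate: Callable[[torch.Tensor], torch.Tensor]                   # regularizer balancing new vs. old information
    memory_algorithm: str                                                    # e.g. "gradient_descent", "gd_with_momentum"

# A Titans-like point in this space would pair a deep MLP memory with a
# surprise-weighted squared-error bias, adaptive weight decay, and momentum updates.
```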
Memory Architecture
The memory architecture determines how information is stored. Traditional approaches use:
- Vectors: Simple fixed-size arrays (like in basic RNNs)
- Matrices: Two-dimensional structures (like in some attention mechanisms)
Titans introduces deep neural networks (multi-layer perceptrons) as memory modules. This provides substantially more expressive power. A vector can store individual values. A matrix can store pairwise relationships. A deep MLP can learn arbitrary functions mapping inputs to compressed representations, encoding far more distinct patterns than either.
The depth of the memory architecture is crucial. Ablation studies show that deeper memory modules consistently achieve lower perplexity in language modeling and exhibit better scaling properties, maintaining performance as sequence length increases significantly.
Attentional Bias
The attentional bias determines what the model prioritizes when updating memory. This is the learning objective—what should the model optimize for?
Most existing models use mean squared error (MSE) or dot-product similarity as their bias. This works well for average cases but can be problematic:
- Sensitivity to outliers: A single typo or anomaly can disproportionately affect the model
- Limited expressive power: MSE assumes a Gaussian distribution of errors, which may not match real-world data distributions
- Uniform weighting: All errors are treated equally, regardless of their importance
MIRAS allows exploring richer bias functions. The Titans surprise metric is one example—it prioritizes unexpected information. But MIRAS opens the door to many other possibilities: robust losses (Huber, quantile), information-theoretic objectives, or task-specific biases.
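To make the alternatives concrete, here is a small sketch (illustrative only, not code from the papers) of interchangeable attentional-bias functions, where `pred` is the memory's read-out and `target` the value it should reproduce:

```python
import torch
import torch.nn.functional as F

def mse_bias(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Standard squared-error bias: simple, but sensitive to outliers.
    return ((pred - target) ** 2).mean()

def huber_bias(pred: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Quadratic for small residuals, linear for large ones (robust to outliers).
    return F.huber_loss(pred, target, delta=delta)

def quantile_bias(pred: torch.Tensor, target: torch.Tensor, q: float = 0.5) -> torch.Tensor:
    # Pinball loss: asymmetric penalty targeting the q-th quantile of the residuals.
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()
```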
Retention Gate
The retention gate manages the balance between learning new information and retaining existing knowledge. In MIRAS, forgetting mechanisms are reinterpreted as specific forms of regularization.
Traditional approaches use a simple, fixed decay of the form $M_t = \gamma\, M_{t-1} + (\text{new information})$ with a constant $0 < \gamma < 1$. This treats all information equally. MIRAS allows more sophisticated retention strategies:
- Adaptive decay: Different decay rates for different types of information
- Selective forgetting: Forgetting less important information more aggressively
- Stability constraints: Ensuring memory updates don't destabilize the system
Titans uses adaptive weight decay that considers the importance of information—critical knowledge decays slowly, routine information decays quickly.
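A toy sketch of importance-dependent decay (the gating function and constants here are hypothetical, chosen only to illustrate the idea):

```python
import torch

def adaptive_decay(memory: torch.Tensor, importance: torch.Tensor) -> torch.Tensor:
    """Per-component forgetting: components scored as important decay slowly.

    memory:     [batch, memory_dim] current memory state
    importance: [batch, memory_dim] scores in [0, 1]; 1 = critical, 0 = routine
    """
    # Routine components shrink by ~10% per step, critical ones barely at all.
    decay = 0.9 + 0.1 * importance
    return decay * memory
```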
Memory Algorithm
The memory algorithm is the optimization method used to update memory. Most models use simple gradient descent or its variants. MIRAS provides a framework for exploring more sophisticated algorithms:
- Momentum-based updates: Considering past gradients, not just current ones
- Adaptive learning rates: Different rates for different memory components
- Second-order methods: Using curvature information for more efficient updates
Titans uses gradient-based optimization with momentum, allowing it to capture not just immediate surprises but also the context around surprising events.
Beyond Mean Squared Error
Virtually all successful existing sequence models rely on mean squared error (MSE) or dot-product similarity for both their bias and retention. This reliance can make models sensitive to outliers and limit their expressive power.
MIRAS transcends this limitation by providing a generative framework to explore a richer design space informed by optimization and statistics literature. This allows for the creation of novel architectures with non-Euclidean objectives and regularization.
The framework enables exploration of:
- Robust losses: Less sensitive to outliers (Huber loss, quantile loss)
- Information-theoretic objectives: Maximizing mutual information, minimizing entropy
- Task-specific biases: Optimizing for specific downstream tasks
- Non-Euclidean geometries: Using different distance metrics and norms
This theoretical flexibility is what enables the MIRAS variants—YAAD, MONETA, and MEMORA—each exploring different points in this design space.
Part 3: MIRAS Variants
Using the MIRAS framework, researchers created three specific attention-free models, each exploring different design choices:
YAAD: Robust to Outliers
YAAD (Yet Another Attention-free Architecture with Robustness) is designed to be less sensitive to major errors or "outliers" (like a single typo in a large document). It uses a gentler mathematical penalty (Huber loss) for mistakes, so it doesn't overreact to one-off issues.
Key Innovation: Instead of MSE, YAAD uses the Huber loss:

$$\ell_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta\bigl(|r| - \tfrac{1}{2}\delta\bigr) & \text{otherwise} \end{cases}$$

where $r$ is the residual between the memory's prediction and the target. For small errors (within the threshold $\delta$), it behaves like MSE. For large errors (outliers), it applies a linear penalty instead of a quadratic one, making it robust to anomalies.
This makes the model more robust when input data is messy or inconsistent—exactly what you need when processing real-world, imperfect data at scale.
MONETA: Generalized Norms
MONETA explores the use of more complex and strict mathematical penalties (called generalized norms). It investigates whether using these more disciplined rules for both what the model attends to and what it forgets can lead to a more powerful and stable long-term memory system overall.
Key Innovation: MONETA uses $\ell_p$-norms and other generalized distance metrics in place of the squared error, e.g. penalizing the memory's residual as $\lVert r \rVert_p^p$ rather than $\lVert r \rVert_2^2$.

By varying $p$, MONETA can explore different geometries:
- $p = 2$: standard Euclidean distance (like MSE)
- $p = 1$: Manhattan distance (more robust to outliers)
- $p \to \infty$: Chebyshev distance (focuses on the maximum error)
MONETA applies these norms to both the attentional bias (what to prioritize) and the retention gate (what to forget), creating a more disciplined memory system.
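A minimal sketch of such a generalized penalty (illustrative; the exact objective MONETA optimizes may differ):

```python
import torch

def lp_bias(pred: torch.Tensor, target: torch.Tensor, p: float = 1.5) -> torch.Tensor:
    """Generalized l_p attentional bias: p=2 recovers MSE-like behavior, p=1 is more outlier-robust."""
    return (pred - target).abs().pow(p).mean()
```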
MEMORA: Probability Map Constraints
MEMORA focuses on achieving the best possible memory stability by forcing its memory to act like a strict probability map. By using this constraint, it ensures that every time the memory state is updated, the changes are controlled and balanced.
Key Innovation: MEMORA constrains memory updates to maintain probability distribution properties:
- Non-negativity: Memory values must be non-negative
- Normalization: Memory states sum to 1 (or integrate to 1 for continuous cases)
- Monotonicity: Updates preserve ordering relationships
This guarantees a clean, stable process for integrating new information. The memory behaves like a probability distribution over possible states, making it interpretable and ensuring updates don't create invalid states.
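One way to picture such a constraint (a sketch under my own assumptions, not MEMORA's actual update rule) is a multiplicative, exponentiated-gradient style write followed by renormalization, which keeps the memory on the probability simplex:

```python
import torch

def probability_map_update(memory: torch.Tensor, update: torch.Tensor,
                           rate: float = 0.1) -> torch.Tensor:
    """Write to a memory constrained to remain a probability distribution.

    memory: [batch, memory_dim], non-negative, summing to 1 along the last dim
    update: [batch, memory_dim], raw (unconstrained) write signal
    """
    new_memory = memory * torch.exp(rate * update)             # stays strictly positive
    return new_memory / new_memory.sum(dim=-1, keepdim=True)   # restores the sum-to-one constraint
```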
Part 4: Experimental Results
Language Modeling Performance
Titans and MIRAS variants were rigorously compared against leading architectures, including Transformer++, Mamba-2, and Gated DeltaNet. Across standard language modeling datasets (C4, WikiText) and zero-shot reasoning tasks (HellaSwag, PIQA), the models consistently demonstrated:
- Higher accuracy: Better performance on downstream tasks
- Lower perplexity: Less "surprise" when looking at text, indicating better language modeling
- Efficient training: Maintains parallelizable training despite RNN-like inference
- Fast inference: Linear scaling with sequence length
The novel MIRAS variants (MONETA, YAAD, MEMORA) also achieved improved performance compared to baselines, validating the benefit of exploring robust, non-MSE optimization mechanisms.
The Power of Deep Memory
Ablation studies clearly show that the depth of the memory architecture is crucial. When comparing long-term memory modules of the same size but different depths:
- Deeper memories achieve lower perplexity: More layers in the memory MLP lead to better compression and understanding
- Better scaling properties: Deeper memories maintain performance as sequence length increases significantly
- Consistent across model sizes: The benefit holds for both 360M and 760M parameter models
This validates the core innovation: using deep neural networks as memory modules provides exponentially more expressive power than fixed-size vectors or matrices.
Extreme Long-Context Recall
The most significant advantage of these new architectures is their ability to handle extremely long contexts. This is highlighted in the BABILong benchmark, a task requiring reasoning across facts distributed in extremely long documents.
In this challenging setting:
- Titans outperforms all baselines, including extremely large models like GPT-4, despite having many fewer parameters
- Scales effectively to 2M+ tokens: Demonstrates capability far beyond traditional context windows
- Maintains accuracy: Performance doesn't degrade as context length increases
The ability to memorize and retrieve information from 2-million-token contexts opens new possibilities for:
- Full-document understanding: Processing entire books, legal documents, or codebases
- Genomic analysis: Analyzing entire genomes or large genomic datasets
- Long-term reasoning: Maintaining context across extended conversations or analysis sessions
Efficiency Comparisons
Despite their powerful capabilities, Titans and MIRAS variants maintain efficient computation:
- Linear inference: $O(n)$ time complexity in sequence length, the same as RNNs
- Parallelizable training: Can still be trained efficiently, unlike sequential RNNs
- Memory efficient: Deep memory modules are compact compared to full attention matrices
The models achieve the best of both worlds: Transformer-like accuracy with RNN-like efficiency.
Part 5: Technical Deep Dive
Mathematical Formulations
The core of Titans' memory update mechanism can be formalized as follows. Let $M_{\theta}$ be the deep memory module with parameters $\theta_t$ at time $t$ (the parameters are the memory state), and let $x_t$ be the new input token.

The surprise metric is computed as the gradient of the loss with respect to the memory:

$$S_t = \nabla_{\theta}\, \mathcal{L}\bigl(M_{\theta_{t-1}};\, x_t\bigr)$$

where $\mathcal{L}$ is the loss function comparing the model's prediction with the expected output.

The memory update incorporates surprise, momentum, and forgetting:

$$\theta_t = (1 - \alpha_t)\,\theta_{t-1} - \eta_t\, \tilde{S}_t$$

where:
- $\eta_t$ is the adaptive learning rate based on surprise
- $M_{\theta}$ is the deep neural network memory module (parameterized by $\theta$)
- The forgetting term $(1 - \alpha_t)$ implements adaptive weight decay

The momentum mechanism considers recent context:

$$\tilde{S}_t = \beta\, \tilde{S}_{t-1} + (1 - \beta)\, S_t$$

This ensures that surprising events and their immediate context are both captured.
Architecture Overview
The Titans architecture can be conceptualized as:
```
Input Sequence → [Contextual Memory (Learning)] → Summary
                          ↓
          [Core (In-Context Learning)] → Attention
                          ↓
          [Persistent Memory (Fixed)] → Output
```

The contextual memory compresses past tokens into a summary representation. This summary is then:
- Incorporated into the current context
- Passed to the attention mechanism
- Used alongside recent tokens for prediction
The attention mechanism can decide dynamically whether to focus on:
- Recent tokens (high precision, local context)
- Memory summary (compressed, global context)
- Both (hybrid approach)
This creates a hierarchical memory system where precision and efficiency are balanced.
Code Examples
Here's a simplified implementation of the Titans memory update mechanism:
```python
import torch
import torch.nn as nn


class TitansMemory(nn.Module):
    def __init__(self, d_model: int, memory_dim: int, num_layers: int = 3):
        super().__init__()
        self.d_model = d_model
        self.memory_dim = memory_dim

        # Deep neural network memory module (an MLP rather than a flat vector/matrix)
        layers = [nn.Linear(d_model, memory_dim), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers.append(nn.Linear(memory_dim, memory_dim))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(memory_dim, memory_dim))
        self.memory_net = nn.Sequential(*layers)

        # Surprise threshold and momentum coefficient
        self.surprise_threshold = 0.1
        self.momentum_alpha = 0.9

        # Running (momentum-smoothed) surprise, initialized lazily
        self.prev_surprise = None

    def forward(self, x: torch.Tensor, memory_state: torch.Tensor):
        """
        x:            [batch, d_model]    - current token embedding
        memory_state: [batch, memory_dim] - current memory state
        """
        # Gradients are needed even at inference time to measure surprise
        with torch.enable_grad():
            memory_state = memory_state.detach().requires_grad_(True)

            # Candidate memory content produced from the current input
            memory_update = self.memory_net(x)

            # Simplified loss: in practice, this would be the model's actual
            # associative-memory objective
            loss = torch.mean((memory_update - memory_state) ** 2)

            # Surprise = magnitude of the gradient with respect to the memory
            grad = torch.autograd.grad(loss, memory_state)[0]
            surprise = torch.norm(grad, dim=-1, keepdim=True)  # [batch, 1]

        # Adaptive learning rate based on surprise
        lambda_t = torch.sigmoid(surprise - self.surprise_threshold)

        # Momentum over surprise: past surprise keeps context around surprising events
        if self.prev_surprise is None:
            self.prev_surprise = surprise.detach()
        surprise_momentum = (
            self.momentum_alpha * self.prev_surprise
            + (1 - self.momentum_alpha) * surprise
        )
        self.prev_surprise = surprise_momentum.detach()

        # Adaptive forgetting: write more aggressively when surprise is high
        lambda_adaptive = torch.clamp(lambda_t * (1 + surprise_momentum), 0, 1)

        # Update memory state: old content decays, surprising content is written
        new_memory = (
            (1 - lambda_adaptive) * memory_state.detach()
            + lambda_adaptive * memory_update
        )
        return new_memory, surprise.squeeze(-1)
```

This simplified implementation shows the key components; a short usage sketch follows the list below:
- Deep neural network memory module
- Surprise computation via gradients
- Adaptive learning rate based on surprise
- Momentum mechanism
- Adaptive forgetting (weight decay)
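A quick usage sketch (hypothetical shapes and values) showing how the module above would be driven over a token stream:

```python
import torch

batch, d_model, memory_dim = 2, 512, 256

memory = TitansMemory(d_model=d_model, memory_dim=memory_dim, num_layers=3)
memory_state = torch.zeros(batch, memory_dim)

# Tokens arrive one at a time; the memory is updated as they stream in.
for _ in range(10):
    token_embedding = torch.randn(batch, d_model)
    memory_state, surprise = memory(token_embedding, memory_state)
    # `surprise` can be inspected to see which tokens triggered strong memory writes.
```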
Key Advantages of Titans/MIRAS:
- Expressive Memory: Deep neural networks can represent exponentially more patterns than fixed-size vectors
- Active Learning: Test-time memorization allows continuous adaptation
- Selective Memorization: Surprise metric ensures only important information is stored
- Efficient: Linear complexity with parallelizable training
- Scalable: Handles 2M+ token contexts effectively
Trade-offs:
- More complex: Deep memory modules require more parameters than simple vectors
- Training complexity: Need to handle test-time updates during training
- Memory overhead: Deep networks require more memory than fixed-size states
However, these trade-offs are justified by the significant gains in long-context performance and the ability to handle real-world, imperfect data robustly.
Implications and Future Directions
The introduction of Titans and the MIRAS framework marks a significant advancement in sequence modeling. By employing deep neural networks as memory modules that learn to memorize as data streams in, these approaches overcome the limitations of fixed-size recurrent states.
Key Implications:
- Unified Theory: MIRAS reveals that all sequence models are variations of associative memory, providing a unified lens for understanding and designing architectures
- Beyond Euclidean: Moving beyond MSE opens new possibilities for robust, task-specific, and information-theoretic objectives
- Real-Time Adaptation: Test-time memorization enables models that continuously learn, adapting to new information without retraining
- Practical Long Context: The ability to handle 2M+ token contexts opens new applications in document understanding, genomics, and long-term reasoning
Future Directions:
- Hybrid Architectures: Combining Titans-style memory with other efficient attention mechanisms (like Infini-Attention or Ring Attention)
- Task-Specific Biases: Designing attentional biases optimized for specific downstream tasks (e.g., code understanding, scientific reasoning)
- Multi-Modal Memory: Extending the framework to handle images, audio, and other modalities in the memory system
- Theoretical Analysis: Deeper understanding of the expressivity and limitations of deep memory modules
- Efficiency Improvements: Further optimization of the memory update mechanisms for even faster inference
- Robustness: Exploring more robust loss functions and retention mechanisms for noisy, real-world data
The research opens the door to a new generation of sequence models that combine the efficiency of RNNs with the expressive power needed for the era of long-context AI. As we move toward models that can understand entire codebases, analyze full genomes, or maintain context across extended conversations, architectures like Titans and frameworks like MIRAS will be essential.
References
- Behrouz, A., Zhong, P., & Mirrokni, V. (2024). Titans: Learning to Memorize at Test Time. arXiv preprint. arXiv:2501.00663
- Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization (the MIRAS framework). arXiv preprint. arXiv:2504.13173
- Google Research. (2025). Titans + MIRAS: Helping AI have long-term memory. Google Research Blog. https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint. arXiv:2312.00752
- Peng, B., Alcaide, E., Anthony, Q., et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. Findings of EMNLP. arXiv:2305.13048
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1706.03762