Hybrid Associative Memory (HAM) layers allow the KV cache to grow dynamically based only on information that an internal RNN cannot predict.
March 25, 2026
Original Paper
Hybrid Associative Memories
arXiv · 2603.22325
The Takeaway
HAM layers address the KV cache memory explosion by storing only selected tokens. Users can tune a single threshold to trade memory usage against performance, offering a far more flexible alternative to fixed-window or purely linear-attention models.
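A minimal sketch of the idea behind selective storage, assuming a "surprise"-gated cache (the function name, the L2-norm surprise measure, and the toy data are illustrative, not the paper's exact criterion): each token's key/value pair enters the cache only when the internal RNN's prediction of that token misses by more than the user-set threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_kv_update(kv_cache, rnn_pred, token_emb, key, value, threshold):
    # Store the token's key/value only when the recurrent state failed
    # to predict it (high "surprise"); otherwise trust the RNN's
    # compressed fixed-size state to carry the information.
    surprise = float(np.linalg.norm(token_emb - rnn_pred))
    if surprise > threshold:
        kv_cache.append((key, value))
    return surprise

# Toy run: tokens the RNN predicts exactly are skipped; novel ones are kept.
cache = []
d = 4
for step in range(6):
    emb = rng.normal(size=d)
    pred = emb if step % 2 == 0 else rng.normal(size=d)  # half predictable
    selective_kv_update(cache, pred, emb, key=emb, value=emb, threshold=0.1)

print(len(cache))  # only the unpredictable tokens were cached
```

Raising the threshold shrinks the cache toward a pure RNN; lowering it toward zero recovers full attention over every token.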
From the abstract
Recurrent neural networks (RNNs) and self-attention are both widely used sequence-mixing layers that maintain an internal memory. However, this memory is constructed via two orthogonal mechanisms: RNNs compress the entire past into a fixed-size state, whereas self-attention stores every past time step, growing its state (the KV cache) linearly with the sequence length. This results in complementary strengths and weaknesses. Self-attention layers excel at retrieving information in the context…