Trained Persistent Memory for Frozen Decoder-Only LLMs
arXiv:2603.22329v1

Abstract: Decoder-only language models are stateless: hidden representations are discarded after every forward pass and nothing persists across sessions. Jeong (2026a) showed that trained memory adapters give a frozen encoder-decoder backbone persistent latent-space memory, building on the lateral-memory framework of Jeong (2026b,c). Here we ask whether the same principle transfers to the decoder-only setting, where no cross-attention pathway exists and memory must enter through self-attention alone. We adapt six methods -- prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, and slot-based sparse write -- to a frozen GPT-2, training only a small adapter $\theta_{mem}$. The write rule is shared; only the read injection changes from decoder cross-attention to self-attention KV prefix or parallel branch. On LoCoMo we find a striking inductive-bias dichotomy: at $1\times$ capacity, three methods with strong architectural priors -- cross-attention (M.2), Hebbian (M.4), and slot write (M.6) -- achieve retained-memory scores of $7-18\%$ and knowledge gains $\Delta K$ of $7-10$, while the other three fail ($< 0.4\%$). At $10\times$ capacity all six converge, showing the gap is architectural, not fundamental. Together with the encoder-decoder results of Jeong (2026a) and the brain-inspired modules of Jeong (2026b,c), these findings establish persistent latent-space memory as a general paradigm spanning major transformer families.
Executive Summary
This study asks whether persistent latent-space memory, previously demonstrated for frozen encoder-decoder backbones, transfers to decoder-only language models, which discard all hidden state after each forward pass. The authors adapt six memory-injection methods to a frozen GPT-2, training only a small adapter, and find a clear dichotomy at 1× capacity: the three methods with strong architectural priors (parallel cross-attention, Hebbian memory, and slot-based sparse write) achieve meaningful retained-memory scores and knowledge gains, while the other three fail. At 10× capacity all six methods converge, indicating that the gap reflects architectural design choices rather than a fundamental limitation. These findings have implications for building more efficient transformer-based memory systems and highlight the importance of architectural priors in achieving persistent memory.
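The read-injection idea summarized above can be sketched in a few lines: trained memory slots are prepended as extra key/value pairs so that ordinary self-attention attends over both memory and context. The following is an illustrative single-head NumPy sketch, not the paper's implementation; all names, shapes, and the absence of causal masking are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_kv_prefix(q, k, v, mem_k, mem_v):
    """Self-attention where memory slots are prepended as extra
    key/value pairs (hypothetical sketch of a KV-prefix read path;
    single head, no causal mask, illustrative names)."""
    k_all = np.concatenate([mem_k, k], axis=0)    # (M+T, d)
    v_all = np.concatenate([mem_v, v], axis=0)    # (M+T, d)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])   # (T, M+T)
    return softmax(scores) @ v_all                # (T, d)

# toy shapes: 4 context tokens, 2 memory slots, d = 8
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
mem_k, mem_v = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
out = attention_with_kv_prefix(q, k, v, mem_k, mem_v)
print(out.shape)  # (4, 8)
```

The key point the sketch captures is that the frozen backbone's attention mechanism is reused unchanged; only the prepended `mem_k`/`mem_v` tensors would be produced by the trained adapter.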
Key Points
- ▸ The study demonstrates the feasibility of persistent memory in decoder-only language models.
- ▸ Architectural priors play a crucial role in the success of memory injection methods.
- ▸ The gap between successful and unsuccessful methods closes at 10× capacity, indicating it stems from architectural design choices rather than a fundamental limitation.
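As a concrete illustration of one strong architectural prior named above, a Hebbian memory can be realized with an outer-product rule: associations are accumulated as a sum of value-key outer products and read back by matrix-vector projection. This is a minimal sketch under assumed shapes and learning rate; the paper's shared write rule may differ in detail.

```python
import numpy as np

def hebbian_write(M, key, value, eta=0.1):
    """One Hebbian update: strengthen the key -> value association
    with an outer-product write (eta and shapes are assumptions,
    not the paper's exact rule)."""
    return M + eta * np.outer(value, key)

def hebbian_read(M, key):
    """Read by projecting the stored associations onto a query key."""
    return M @ key

d = 8
M = np.zeros((d, d))
k1, v1 = np.eye(d)[0], np.eye(d)[1]   # orthonormal toy key/value pair
M = hebbian_write(M, k1, v1, eta=1.0)
recalled = hebbian_read(M, k1)
print(np.allclose(recalled, v1))  # True
```

With orthonormal keys the read recovers the stored value exactly; with correlated keys the recalled vector is a similarity-weighted blend, which is the usual capacity trade-off of associative memories.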
Merits
Strength in Architectural Understanding
The study isolates the role of architectural priors by sharing the write rule across all six methods and varying only the read-injection path, showing that cross-attention, Hebbian, and slot-write priors succeed at 1× capacity where weaker priors fail. This provides valuable insight into how memory enters a frozen transformer and highlights the need for a deeper understanding of the underlying architecture.
Methodological Innovation
The study adapts six memory-injection methods to a frozen GPT-2, training only a small adapter $\theta_{mem}$ while leaving the backbone untouched, demonstrating that persistent memory can be retrofitted to an existing model at low parameter cost.
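A slot-based sparse write, one of the adapted methods, can be illustrated as routing each new key/value pair to only the most similar slots while leaving the rest untouched. The slot count, similarity score, and blend factor below are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def slot_sparse_write(slots, key, value, top_k=1):
    """Sparse write: blend the new (key, value) pair into only the
    top-k most similar memory slots (illustrative sketch; routing
    score and blend factor are assumptions)."""
    scores = slots["keys"] @ key                  # similarity per slot
    idx = np.argsort(scores)[-top_k:]             # winning slot indices
    slots["keys"][idx] = 0.5 * slots["keys"][idx] + 0.5 * key
    slots["values"][idx] = 0.5 * slots["values"][idx] + 0.5 * value
    return idx

rng = np.random.default_rng(1)
slots = {"keys": rng.normal(size=(4, 8)), "values": rng.normal(size=(4, 8))}
written = slot_sparse_write(slots, rng.normal(size=8), rng.normal(size=8))
print(written.shape)  # one slot updated
```

Sparsity of this kind is what gives the slot method its strong prior: each write perturbs only a small part of the memory, so earlier associations are largely preserved.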
Demerits
Limited Generalizability
The experiments are limited to GPT-2, a relatively small decoder-only model; whether the findings hold at larger scale or for transformer families beyond those studied remains to be confirmed by further research.
Dependence on Architectural Priors
The success of memory-injection methods depends heavily on their architectural priors at 1× capacity, which may limit their applicability in practice and underscores the importance of careful design choices.
Expert Commentary
The findings are significant: they demonstrate that persistent latent-space memory is feasible in decoder-only language models and show that architectural priors, not fundamental limits, determine which injection methods succeed at low capacity. That dependence on priors, together with the restriction to GPT-2, limits how far the results can be applied in practice without further validation on other architectures. The findings also connect to the brain-inspired memory modules of the earlier Jeong work, supporting the view of persistent memory as a general paradigm across transformer families.
Recommendations
- ✓ Further research is needed to confirm the results and explore their applicability to other architectures.
- ✓ Careful design choices are necessary to achieve persistent memory in decoder-only language models, highlighting the importance of understanding the underlying architecture.
Sources
Original: arXiv - cs.LG