Did You Forget What I Asked? Prospective Memory Failures in Large Language Models
arXiv:2603.23530v1 Announce Type: new Abstract: Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
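The abstract emphasizes that all results are scored with deterministic programmatic checkers rather than an LLM-as-judge. As a rough illustration of what such checkers could look like, here is a minimal Python sketch with a terminal constraint (an action required at the response boundary), an avoidance constraint, and a joint-compliance check for stacked constraints; the specific constraint choices (end marker, forbidden comma, word cap) are assumptions for the example, not the paper's actual constraint set.

```python
import re

def terminal_constraint(response: str, marker: str = "END_OF_RESPONSE") -> bool:
    """Terminal constraint: the response must finish with a required marker."""
    return response.rstrip().endswith(marker)

def avoidance_constraint(response: str, forbidden: str = ",") -> bool:
    """Avoidance constraint: the response must never contain a forbidden token."""
    return forbidden not in response

def length_constraint(response: str, max_words: int = 120) -> bool:
    """Length constraint: the response must stay within a word budget."""
    return len(re.findall(r"\S+", response)) <= max_words

def joint_compliance(response: str, checkers) -> bool:
    """Stacked constraints: compliant only if every checker passes."""
    return all(check(response) for check in checkers)

if __name__ == "__main__":
    answer = "The train's speed is 80 km/h END_OF_RESPONSE"
    checks = [terminal_constraint, avoidance_constraint, length_constraint]
    print(joint_compliance(answer, checks))  # True for this toy answer
```

Because the checkers are deterministic, compliance rates can be computed exactly over thousands of prompts without any model-based judging.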
Executive Summary
This article examines prospective memory failures in large language models (LLMs): cases where a model fails to carry out a formatting instruction while simultaneously performing a demanding task. Using a lens borrowed from cognitive psychology, the study finds that compliance drops by 2-21% under concurrent task load across three model families. Vulnerability is highly type-dependent, with terminal constraints degrading the most, while a salience-enhanced format recovers much of the lost compliance. Interference is also bidirectional: formatting constraints can reduce task accuracy. These findings have significant implications for the development and deployment of LLMs in applications such as text generation and summarization.
Key Points
- ▸ Large language models often fail to satisfy formatting instructions when performing concurrent tasks.
- ▸ Vulnerability to forgetting is highly type-dependent, with terminal constraints degrading the most.
- ▸ A salience-enhanced format can recover lost compliance and restore performance to 90-100% in many settings (a sketch of such a prompt follows after this list).
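The abstract describes the salience-enhanced format as explicit instruction framing plus a trailing reminder. The following is a minimal sketch of how such a prompt might be assembled; the paper's exact templates are not reproduced here, so the framing text and reminder phrasing are illustrative assumptions.

```python
# Hypothetical contrast between a plain prompt and a salience-enhanced prompt
# (explicit instruction framing plus a trailing reminder). The wording is
# illustrative; it is not the paper's actual template.

def plain_prompt(task: str, constraint: str) -> str:
    """Constraint stated once, before the task."""
    return f"{constraint}\n\n{task}"

def salience_enhanced_prompt(task: str, constraint: str) -> str:
    """Constraint framed explicitly up front and repeated as a trailing reminder."""
    return (
        "IMPORTANT FORMATTING INSTRUCTION:\n"
        f"{constraint}\n\n"
        f"{task}\n\n"
        f"Reminder: {constraint}"
    )

if __name__ == "__main__":
    task = "Solve: a train travels 60 km in 45 minutes. What is its speed in km/h?"
    constraint = "End your answer with the exact string END_OF_RESPONSE."
    print(salience_enhanced_prompt(task, constraint))
```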
Merits
Strength
The study employs a controlled paradigm and a prospective-memory-inspired lens from cognitive psychology, providing a novel and insightful perspective on the limitations of LLM instruction following.
Demerits
Limitation
The study is limited to a specific set of model families and datasets, which may not be representative of all LLMs and applications.
Expert Commentary
This study provides a timely and insightful perspective on the limitations of large language models. The prospective-memory-inspired lens from cognitive psychology offers a novel and effective way to understand why models drop instructions under concurrent task load. The findings highlight the importance of accounting for such memory-like failure modes in the design and deployment of AI systems. While the study has some limitations, its implications are significant and warrant further investigation. The development of task-management strategies and salience-enhanced formats can help mitigate forgetting and improve the compliance of LLMs in practical applications.
Recommendations
- ✓ Developers and researchers should prioritize the development of task management strategies and salience-enhanced formats to improve the performance and compliance of LLMs.
- ✓ LLMs should be designed and deployed with consideration for the consequences of instruction forgetting, particularly for terminal constraints, which degrade the most under load.
Sources
Original: arXiv - cs.CL