Multi-Agent Debate with Memory Masking

arXiv:2603.20215v1 Announce Type: new Abstract: Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, multi-agent debate (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, multi-agent debate with memory masking (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.

Executive Summary

The article introduces a novel framework, Multi-Agent Debate with Memory Masking (MAD-M²), to enhance the robustness of multi-agent debate (MAD) in large language model (LLM) reasoning. While MAD leverages multiple LLMs for iterative reasoning via memory access, the authors identify a critical vulnerability: erroneous memories can persist and degrade performance. Theoretically, the authors establish a dependency between MAD’s efficacy and memory quality, prompting the development of MAD-M², which enables agents to mask erroneous memories at the start of each round. Empirical evaluations on mathematical and logical reasoning benchmarks confirm that MAD-M² outperforms MAD by preserving informative content and filtering out distortions. This advancement addresses a fundamental limitation in current MAD-based reasoning architectures by introducing a proactive, context-aware filtering mechanism.

Key Points

  • MAD’s reliance on previous memories introduces vulnerability to erroneous content
  • Erroneous memories negatively impact reasoning accuracy and consistency
  • MAD-M² introduces memory masking to selectively preserve accurate memories and discard erroneous ones
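The masking step summarized above can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: the `Memory` record, the `is_erroneous` predicate (standing in for whatever error-detection signal the agents use), and the `respond` callable are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Memory:
    """One agent utterance carried forward as debate context."""
    agent: str
    content: str
    round_no: int


def mask_memories(memories: List[Memory],
                  is_erroneous: Callable[[Memory], bool]) -> List[Memory]:
    """Drop memories flagged as erroneous before the next debate round,
    preserving the informative ones as clean context."""
    return [m for m in memories if not is_erroneous(m)]


def debate_round(agents: List[str],
                 memories: List[Memory],
                 respond: Callable[[str, List[Memory]], str],
                 is_erroneous: Callable[[Memory], bool],
                 round_no: int) -> List[Memory]:
    """One MAD-M2-style round: mask first, then let each agent respond
    against the polished context."""
    kept = mask_memories(memories, is_erroneous)
    new = [Memory(a, respond(a, kept), round_no) for a in agents]
    return kept + new
```

For example, with a toy predicate that flags any memory containing the token `"ERROR"`, a round over two agents keeps the clean memory, discards the flagged one, and appends one fresh response per agent, mirroring the "preserve informative, discard erroneous" behavior the abstract describes.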

Merits

Theoretical Insight

The authors provide a clear theoretical foundation linking memory quality to MAD performance, establishing a dependency that justifies intervening on the memory stream.

Empirical Validation

Extensive benchmark testing demonstrates tangible improvements in reasoning accuracy when using MAD-M², validating the effectiveness of the proposed solution.

Demerits

Implementation Complexity

While MAD-M² is described as ‘simple,’ the integration of memory masking may introduce computational overhead or require additional tuning in real-world deployment scenarios.

Expert Commentary

This work represents a significant step forward in the evolution of multi-agent reasoning architectures. The authors correctly identify a subtle but pervasive flaw in MAD: the persistence of erroneous memories acts as a systemic bias, undermining the credibility of iterative reasoning outputs. The introduction of memory masking as a preemptive filter is both elegant and pragmatic. Unlike prior approaches that rely on post-hoc correction or iterative refinement alone, MAD-M² operates at the source—during memory ingestion—making it a more robust, scalable solution. Importantly, the framework preserves the core MAD innovation—multi-round debate—without introducing radical architectural changes, thereby maintaining compatibility with existing systems. The empirical validation on rigorous benchmarks adds credibility to the claim of improved performance. This paper should be considered a foundational contribution to the field of AI reasoning, particularly for applications where accuracy is paramount, such as legal argumentation or scientific inquiry.

Recommendations

  • Researchers should adopt MAD-M² as a baseline framework for future MAD-based reasoning experiments to improve reproducibility and accuracy.
  • Evaluators of LLM reasoning systems should incorporate memory integrity assessment protocols—such as memory fidelity scoring or error-detection benchmarks—to better gauge system reliability.

Sources

Original: arXiv - cs.CL