
Context-Length Robustness in Question Answering Models: A Comparative Empirical Study


Trishita Dhara, Siddhesh Sheth

arXiv:2603.15723v1 (Announce Type: new)

Abstract: Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this, robustness to growing context length remains poorly understood across different question answering tasks. In this work, we present a controlled empirical study of context-length robustness in large language models using two widely used benchmarks: SQuAD and HotpotQA. We evaluate model accuracy as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. This allows us to isolate the effect of context length from changes in task difficulty. Our results show a consistent degradation in performance as context length increases, with substantially larger drops observed on multi-hop reasoning tasks compared to single-span extraction tasks. In particular, HotpotQA exhibits nearly twice the accuracy degradation of SQuAD under equivalent context expansions. These findings highlight task-dependent differences in robustness and suggest that multi-hop reasoning is especially vulnerable to context dilution. We argue that context-length robustness should be evaluated explicitly when assessing model reliability, especially for applications involving long documents or retrieval-augmented generation.

Executive Summary

This study investigates the context-length robustness of large language models on question answering tasks, particularly in scenarios where relevant information is embedded in long, noisy contexts. The authors present a controlled empirical study on two benchmarks, SQuAD and HotpotQA, evaluating model accuracy as irrelevant context is added while the answer-bearing signal is preserved. Performance degrades consistently as context length grows, with substantially larger drops on the multi-hop reasoning task (HotpotQA) than on single-span extraction (SQuAD). These task-dependent differences suggest that multi-hop reasoning is especially vulnerable to context dilution, and the authors argue that context-length robustness should be evaluated explicitly when assessing model reliability, particularly for applications involving long documents or retrieval-augmented generation.

Key Points

  • The study evaluates context-length robustness of large language models on question answering tasks
  • Results show consistent degradation in performance as context length increases
  • Multi-hop reasoning tasks exhibit larger drops in accuracy than single-span extraction tasks: HotpotQA shows nearly twice the degradation of SQuAD under equivalent context expansions
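
The dilution protocol described in the abstract can be sketched roughly as follows. The paper's exact padding procedure is not given here, so the function below is an illustrative assumption: it pads an answer-bearing ("gold") passage with irrelevant distractor paragraphs up to a target word count, keeping the gold passage intact at a random position, so that only context length varies while the answer-bearing signal is preserved.

```python
import random

def dilute_context(gold_passage, distractors, target_length):
    """Pad a gold (answer-bearing) passage with irrelevant distractor
    paragraphs until the combined context reaches roughly target_length
    words. The gold passage is kept whole at a random position."""
    context = [gold_passage]
    length = len(gold_passage.split())
    for d in distractors:
        if length >= target_length:
            break
        # Insert each distractor at a random slot so the gold passage
        # does not always sit at the start of the context.
        context.insert(random.randrange(len(context) + 1), d)
        length += len(d.split())
    return " ".join(context)

# Sweep over target lengths to chart accuracy vs. context size
# (distractor text here is a stand-in for real irrelevant paragraphs).
gold = "The Eiffel Tower was completed in 1889."
noise = ["Pandas eat bamboo for most of the day."] * 50
for target in (50, 200, 800):
    ctx = dilute_context(gold, noise, target)
    print(target, len(ctx.split()))
```

A model's accuracy at each target length can then be compared against its accuracy on the undiluted passage, isolating the effect of context length from task difficulty.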

Merits

Strength

The study provides a controlled empirical evaluation of context-length robustness, isolating the effect of context length from changes in task difficulty by adding only irrelevant context.

Strength

The use of two benchmarks, SQuAD (single-span extraction) and HotpotQA (multi-hop reasoning), enables a direct comparison of task-dependent differences in robustness.

Strength

The results have practical implications for the development and deployment of language models in real-world applications.

Demerits

Limitation

The study focuses on two specific benchmarks, limiting the generalizability of the findings to other question answering tasks.

Limitation

The evaluation relies on a single, simple dilution protocol (adding irrelevant context), which may not capture the full complexity of real-world long-context scenarios.

Expert Commentary

The study's findings highlight the importance of context-length robustness in language models, particularly in scenarios with long and noisy contexts. While the results are clear and consistent, they underscore the need for further research into the mechanisms behind context dilution. More comprehensive evaluation frameworks and additional benchmarks are essential steps toward developing more robust language models. The implications for deploying language models in real-world applications, especially retrieval-augmented generation over long documents, are significant, and practitioners should prioritize evaluation frameworks that assess context-length robustness.

Recommendations

  • Developers should incorporate more robust evaluation frameworks that assess context-length robustness in language models.
  • Researchers should investigate the underlying mechanisms driving context dilution in language models.
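
One simple statistic such an evaluation framework could report is the relative accuracy drop between short and long contexts, which makes degradation comparable across tasks with different baseline accuracies. The accuracies below are purely hypothetical illustrations, not the paper's results:

```python
def degradation(acc_short: float, acc_long: float) -> float:
    """Relative accuracy drop when moving from short to long contexts."""
    return (acc_short - acc_long) / acc_short

# Purely hypothetical accuracies for illustration (not from the paper):
squad_drop = degradation(0.80, 0.72)   # single-span extraction
hotpot_drop = degradation(0.70, 0.56)  # multi-hop reasoning
print(f"SQuAD drop: {squad_drop:.0%}, HotpotQA drop: {hotpot_drop:.0%}")
print(f"Ratio: {hotpot_drop / squad_drop:.1f}x")
```

Reporting a normalized drop like this, rather than raw accuracy alone, would let benchmark results surface the kind of task-dependent gap the abstract describes.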
