ABD: Default Exception Abduction in Finite First Order Worlds

arXiv:2602.18843v1 Announce Type: new Abstract: We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.

Serafim Batzoglou
Executive Summary

This article introduces ABD, a benchmark for default-exception abduction in finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must produce a first-order formula defining the exceptions, one that restores satisfiability while keeping the set of exceptional elements sparse. The authors formalize three observation regimes (closed-world, existential completion, universal completion) and verify every candidate formula exactly with an SMT solver. On 600 instances across ten frontier LLMs, the best models achieve high validity, but gaps in parsimony remain, and holdout evaluation reveals distinct generalization failure modes across regimes. The study thus offers a precise picture of the strengths and limitations of current LLMs on abduction tasks and motivates further research in this area.
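To make the setup concrete, here is a minimal sketch of the verification step on a toy finite world. All names here (the bird/penguin world, the candidate formula) are illustrative assumptions, not the benchmark's actual instances, and the check uses brute-force enumeration rather than the exact SMT verification the paper describes; enumeration is sound on finite domains and suffices for illustration.

```python
# Toy finite first-order world (hypothetical, for illustration only).
# Default rule with abnormality predicate Ab:
#   forall x. Bird(x) and not Ab(x) -> Flies(x)
# The model's job: output a formula defining Ab that makes the
# observations satisfiable while keeping Ab sparse.

DOMAIN = {"tweety", "polly", "opus"}
BIRD = DOMAIN                      # every element is a bird
PENGUIN = {"opus"}                 # observed penguins
FLIES = {"tweety", "polly"}        # observed fliers (closed-world reading)

def candidate_ab(x):
    """Candidate exception formula: Ab(x) := Penguin(x)."""
    return x in PENGUIN

def default_satisfied(ab):
    """Check the default rule holds in the world under this Ab."""
    return all((x not in BIRD) or ab(x) or (x in FLIES) for x in DOMAIN)

def exception_count(ab):
    """Parsimony metric: how many elements the formula marks abnormal."""
    return sum(1 for x in DOMAIN if ab(x))

assert default_satisfied(candidate_ab)      # validity: rule restored
assert exception_count(candidate_ab) == 1   # parsimony: one exception
```

A trivially valid but non-parsimonious answer, `Ab(x) := True`, would also satisfy the rule but mark all three elements abnormal, which is exactly the validity-versus-parsimony gap the benchmark measures.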

Key Points

  • Introduction of the ABD benchmark for default-exception abduction in finite first-order worlds
  • Formalization of three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification
  • Evaluation of ten frontier LLMs on 600 instances, revealing high validity but parsimony gaps and distinct generalization failure modes

Merits

Contribution to the field of abduction

The ABD benchmark and its formalized observation regimes give the field of abduction an exactly verifiable yardstick, enabling more rigorous evaluation and comparison of LLMs than informal judging allows.

Insights into LLM limitations

The study shows that even the best current LLMs, while achieving high validity, fall short on parsimony, a concrete gap that future research and development can target.

Demerits

Limited generalizability

The results may not generalize to other domains or abduction tasks, limiting the applicability of the findings.

Need for further research

The study highlights the need for further research in developing more effective LLMs for abduction tasks and addressing the observed generalization failure modes.

Expert Commentary

This study makes a significant contribution to the field of abduction by pairing a well-defined task with exact verification. Notably, validity and parsimony come apart: the best models can repair satisfiability but do not find minimal exception definitions, and the failure modes differ by observation regime, which suggests the models are not learning a regime-independent notion of exception. Benchmarks like ABD make such distinctions measurable, and the findings are likely to inform the design of AI systems that must reason with defaults and exceptions in real-world applications.

Recommendations

  • Develop more effective LLMs for abduction tasks by addressing the observed generalization failure modes
  • Introduce more benchmarks and evaluation regimes to enable more accurate comparison of LLMs