The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning

Sahar Vahdati, Andrei Aioanei, Haridhra Suresh, Jens Lehmann

arXiv:2603.13372v1

Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) has become a key benchmark for fluid intelligence in AI. This survey presents the first cross-generation analysis of 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions. Our central finding is that performance degradation across versions is consistent across all paradigms: program synthesis, neuro-symbolic, and neural approaches all exhibit 2-3x drops from ARC-AGI-1 to ARC-AGI-2, indicating fundamental limitations in compositional generalization. While systems now reach 93.0% on ARC-AGI-1 (Opus 4.6), performance falls to 68.8% on ARC-AGI-2 and 13% on ARC-AGI-3, as humans maintain near-perfect accuracy across all versions. Cost fell 390x in one year (o3's $4,500/task to GPT-5.2's $12/task), although this largely reflects reduced test-time parallelism. Trillion-scale models vary widely in score and cost, while Kaggle-constrained entries (660M-8B) achieve competitive results, aligning with Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement loops emerge as critical success factors, while compositional reasoning and interactive learning remain unsolved. ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, confirming that reasoning remains knowledge-bound. This first release of the ARC-AGI Living Survey captures the field as of February 2026, with updates at https://nimi-ai.com/arc-survey/
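The abstract identifies test-time refinement loops as a critical success factor. The idea can be sketched in a few lines: enumerate candidate programs, keep only those consistent with every training demonstration, and apply a survivor to the test input. This is a minimal toy sketch, not any competitor's actual method; the three-function candidate set and the helper names are illustrative assumptions, and real systems search vastly larger program spaces.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]   # ARC grids are 2-D arrays of colors 0-9
Pair = Tuple[Grid, Grid]  # one (input, output) demonstration pair

def identity(g: Grid) -> Grid:
    return [list(row) for row in g]

def transpose(g: Grid) -> Grid:
    return [list(row) for row in zip(*g)]

def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]

# A tiny stand-in for a program-synthesis search space.
CANDIDATES: List[Callable[[Grid], Grid]] = [identity, transpose, flip_h]

def refine(train: List[Pair], test_input: Grid) -> Grid:
    """Keep only candidates consistent with all training pairs,
    then apply the first survivor to the test input."""
    consistent = [f for f in CANDIDATES
                  if all(f(x) == y for x, y in train)]
    if not consistent:
        return test_input  # fall back to identity when nothing fits
    return consistent[0](test_input)

# Toy task whose hidden rule is "mirror the grid left-right":
train = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
print(refine(train, [[3, 4], [5, 6]]))  # -> [[4, 3], [6, 5]]
```

The loop's filtering step is the essence of test-time adaptation: the model commits to nothing until the demonstrations have eliminated inconsistent hypotheses.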

Executive Summary

This article surveys the Abstraction and Reasoning Corpus (ARC-AGI) benchmark, analyzing 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions. The survey finds that performance degrades consistently across all paradigms from ARC-AGI-1 to ARC-AGI-2, pointing to fundamental limitations in compositional generalization. Although per-task costs have fallen sharply and some models approach human-level accuracy on ARC-AGI-1, the steep drop on later versions highlights the need for better compositional reasoning and interactive learning. The results underscore the difficulty of achieving Artificial General Intelligence (AGI) and the importance of continued research in this area.

Key Points

  • The ARC-AGI benchmark experiences consistent performance degradation across paradigms, indicating fundamental limitations in compositional generalization.
  • Significant cost reductions have been achieved, with some models demonstrating near-human accuracy on ARC-AGI-1.
  • Trillion-scale models vary widely in score and cost, while Kaggle-constrained entries achieve competitive results.

Merits

Comprehensive Analysis

The study presents a thorough analysis of 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions, providing a comprehensive understanding of the current state of the field.

Identification of Key Challenges

The study highlights the need for improved compositional reasoning and interactive learning, underscoring the complexity of achieving Artificial General Intelligence (AGI).

Demerits

Limited Generalizability

The study's findings may not be generalizable to other benchmarks or tasks, highlighting the need for further research to confirm these results.

Lack of Clear Solutions

The study does not propose concrete solutions to the challenges it identifies, offering researchers a diagnosis of the difficulties involved in achieving AGI rather than a path forward.

Expert Commentary

The study offers a critical analysis of the current state of the field. Its central result, consistent performance degradation across all paradigms, underscores the complexity of achieving AGI. While per-task costs have fallen dramatically, the authors emphasize that compositional reasoning and interactive learning remain unsolved; these open challenges are ones both researchers and policymakers should weigh when assessing the technology's potential benefits and risks.

Recommendations

  • Researchers should prioritize the development of compositional reasoning and interactive learning approaches, recognizing the need for improved generalization and adaptability in AGI systems.
  • Policymakers should consider prioritizing research funding for AGI-related initiatives, recognizing the potential benefits and challenges associated with this technology.
