GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
arXiv:2603.19252v1 Announce Type: cross Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
Executive Summary
The article introduces GeoChallenge, a novel dataset of 90K automatically generated multiple-choice geometry proof problems that require multi-step reasoning over aligned textual descriptions and diagrams. Experiments reveal a clear performance gap between large language models (LLMs) and humans, highlighting three common failure patterns of LLMs: exact match failures, weak visual reliance, and overextended reasoning without convergence. The GeoChallenge dataset provides a valuable tool for evaluating the symbolic reasoning of LLMs, but its limitations, such as a narrow focus on geometry, must be acknowledged. The findings have significant implications for the development of more robust LLMs and highlight the need for further research in geometric reasoning and visual grounding.
Key Points
- ▸ GeoChallenge is a novel dataset for evaluating the symbolic reasoning of LLMs
- ▸ The dataset consists of 90K automatically generated multiple-choice geometry proof problems
- ▸ Experiments reveal a significant performance gap between LLMs and humans
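The gap cited above is measured with exact match: in a multi-answer multiple-choice setting, a response earns credit only when the selected option set equals the gold set exactly, with no partial credit for overlap. A minimal sketch of such a scorer (the function name and signature are illustrative, not taken from the paper):

```python
def exact_match(predicted, gold):
    """Return 1.0 iff the predicted option set equals the gold set exactly.

    Comparing as sets makes the score order-insensitive, but any missing
    or extra option drops the score to 0.0 -- there is no partial credit.
    """
    return 1.0 if set(predicted) == set(gold) else 0.0


# Example: a model that picks only one of two correct options scores 0.
print(exact_match(["A", "C"], ["C", "A"]))  # 1.0 (order does not matter)
print(exact_match(["A"], ["A", "C"]))       # 0.0 (missing an option)
```

This all-or-nothing scoring is one reason the abstract lists "exact match failures under the multiple-choice setting" as a distinct failure pattern: a model can identify most of the correct options and still receive no credit.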
Merits
Strength: Comprehensive Evaluation Tool
GeoChallenge provides a comprehensive evaluation tool for LLMs, enabling researchers to assess their symbolic reasoning capabilities in a controlled setting.
Strength: Fine-Grained Complexity Ratings
The dataset includes fine-grained complexity ratings, allowing researchers to tailor their evaluation to specific tasks and models.
Strength: Visual Grounding
GeoChallenge requires multi-step reasoning over aligned textual descriptions and diagrams, providing a more nuanced evaluation of LLMs' visual grounding abilities.
Demerits
Limitation: Narrow Focus on Geometry
The GeoChallenge dataset focuses exclusively on geometry, limiting its applicability to other domains and areas of symbolic reasoning.
Limitation: Limited Generalizability
The findings may not generalize to other languages or cultural contexts, highlighting the need for further research and validation.
Limitation: Potential for Overfitting
The automated generation of problems may lead to overfitting, which can compromise the validity and reliability of the results.
Expert Commentary
The article presents a novel and relevant contribution to AI research, underscoring the importance of both visual and symbolic reasoning in LLMs. However, the dataset's limitations must be acknowledged and addressed through further research. Developing more robust and versatile LLMs is critical for real-world applications, and the findings carry implications for policy-making and investment in AI research.
Recommendations
- ✓ Develop more comprehensive and diverse datasets that cover a broader range of domains and areas of symbolic reasoning.
- ✓ Invest in research and development of more robust and versatile LLMs that can handle complex geometric reasoning tasks.
Sources
Original: arXiv - cs.AI