Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models

arXiv:2604.00890v1. Abstract: Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have taken a neural, symbolic or neuro-symbolic approach. But this solves only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in existing models, this paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on ablation subset). We provide our code and data in an anonymous repository: https://anonymous.4open.science/r/MARS-GPS-DE55.

Executive Summary

This article proposes MARS-GPS, an approach to geometric problem solving (GPS) in large language models. Where existing methods typically limit logical inference to a single chain-of-thought (CoT), MARS-GPS generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. With 8 parallel rollouts, it achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art. The approach could strengthen mathematical reasoning in large language models, with possible relevance to applications such as education and robotics.

Key Points

  • MARS-GPS generates multiple parallel reasoning rollouts, each augmented with Python code execution for numerical verification
  • Token-level entropy serves as a confidence signal for ranking rollouts
  • A multi-stage voting and self-verification pipeline aggregates the answers
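The ranking and voting steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep` parameter, the use of mean entropy per rollout, and the toy probability distributions are all assumptions made for illustration, and the paper's multi-stage self-verification step is omitted entirely.

```python
import math
from collections import Counter


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average token-level entropy over every generated token of one rollout."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


def rank_and_vote(rollouts, keep):
    """Keep the `keep` lowest-entropy rollouts, then take a plurality vote.

    `rollouts` is a list of (answer, step_distributions) pairs (hypothetical format).
    """
    ranked = sorted(rollouts, key=lambda r: mean_entropy(r[1]))
    votes = Counter(answer for answer, _ in ranked[:keep])
    # Ties resolve to the answer counted first, i.e. the lowest-entropy rollout.
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): three rollouts, each with two token distributions.
rollouts = [
    ("12", [[0.9, 0.1], [0.8, 0.2]]),  # confident rollout
    ("15", [[0.5, 0.5], [0.5, 0.5]]),  # maximally uncertain rollout
    ("12", [[0.7, 0.3], [0.6, 0.4]]),  # in between
]
print(rank_and_vote(rollouts, keep=2))
```

Lower entropy is treated here as higher confidence, so the most uncertain rollout is dropped before voting. In practice the distributions would come from the model's per-token output probabilities.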

Merits

Strength in Addressing Logical Inference

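Existing neural, symbolic and neuro-symbolic methods concentrate on diagrammatic understanding and symbolic manipulation, and usually limit logical inference to a single chain-of-thought. MARS-GPS targets that gap directly by sampling multiple reasoning rollouts and reconciling them through voting and self-verification.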

Improved Accuracy

With 8 parallel rollouts, MARS-GPS achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art.

Scalability

Accuracy scales consistently as the number of rollouts increases from 1 to 16, with a reported gain of +6.0% on the ablation subset.
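One intuition for why voting over more rollouts can raise accuracy comes from an idealized model, sketched below. It assumes rollouts are independent and each is correct with the same probability `p`; the value 0.7 is hypothetical and not taken from the paper. Real rollouts from one model are correlated, so actual gains would typically be smaller. Odd rollout counts are used to avoid ties under a strict-majority rule.

```python
from math import comb


def strict_majority_prob(n, p):
    """P(more than half of n independent rollouts are correct), each correct with prob. p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


# Hypothetical per-rollout accuracy of 0.7, evaluated at odd rollout counts.
for n in (1, 3, 5, 9, 15):
    print(n, round(strict_majority_prob(n, 0.7), 3))
```

Strict majority is a conservative stand-in for plurality voting, since wrong answers that disagree with each other split their votes.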

Demerits

Computational Complexity

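Generating several rollouts per problem, executing Python code within them, and then running a multi-stage voting and self-verification pipeline is likely to cost considerably more at inference time than a single chain-of-thought. This could limit large-scale or latency-sensitive applications.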

Data Requirements

The abstract does not specify how much data the approach needs for training or validation. If the requirement is substantial, it could limit applications with limited data availability.

Expert Commentary

The article makes a meaningful contribution to geometric problem solving in large language models. MARS-GPS targets the logical-inference component that single-CoT methods leave underdeveloped, and the reported 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art, supports that design choice. The main open concerns are the added inference cost of running many rollouts and any data requirements. Even so, the results make MARS-GPS a promising direction for future research.

Recommendations

  • Future research should investigate the scalability of MARS-GPS for large-scale applications and explore ways to reduce computational complexity and data requirements.
  • The approach should be evaluated on a broader range of geometric problems and datasets to assess its generalizability and robustness.

Sources

Original: arXiv - cs.AI