FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
arXiv:2604.03893v1 Announce Type: new Abstract: Breakthroughs in frontier theory often depend on combining concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI's capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompassing over 100 distinct types and more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.
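To make the amplitude-construction task concrete, consider a standard textbook example (illustrative only, not drawn from the paper): tree-level e-mu- -> e-mu- scattering via single photon exchange. Applying the QED Feynman rules in Feynman gauge gives:

```latex
% Illustrative textbook example, not taken from FeynmanBench itself:
% tree-level e^- mu^- -> e^- mu^- via t-channel photon exchange,
% QED Feynman rules, Feynman gauge.
i\mathcal{M}
  = \bar{u}(p_3)\,(-ie\gamma^{\mu})\,u(p_1)\,
    \frac{-i g_{\mu\nu}}{q^2}\,
    \bar{u}(p_4)\,(-ie\gamma^{\nu})\,u(p_2),
  \qquad q = p_1 - p_3 .
```

Each factor maps one-to-one onto a diagram element: the two vertex factors come from the electron-photon and muon-photon vertices, and the photon propagator takes this simple form only in Feynman gauge, which is why the benchmark pins down convention and gauge choices.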
Executive Summary
FeynmanBench is a new benchmark that evaluates multimodal large language models (MLLMs) on multistep diagrammatic reasoning in physics. It tests a model's ability to enforce conservation laws, identify graph topology, convert between diagrammatic and algebraic representations, and construct scattering amplitudes. Built on an automated pipeline for large-scale, reproducible evaluation, the benchmark comprises more than 2000 tasks spanning the electromagnetic, weak, and strong interactions of the Standard Model. Experiments on state-of-the-art MLLMs reveal systematic failure modes, underscoring the need for physics-grounded benchmarks. FeynmanBench thus provides a rigorous test of AI's capacity for scientific discovery in theoretical physics.
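To illustrate what the paper's "verifiable topological annotations" make possible, here is a minimal sketch, assuming a hypothetical edge-list representation of a diagram, of two checks an automated verifier could run: charge conservation at each internal vertex, and the loop count of the underlying graph. This is not the authors' pipeline; the Edge layout and particle table are invented for illustration.

```python
"""Minimal sketch (not the authors' pipeline) of automatable diagram checks:
charge conservation at each internal vertex and the loop count of a
diagram's underlying graph. The Edge layout and particle table are
hypothetical, for illustration only."""
from dataclasses import dataclass

# Electric charges in units of e (hypothetical particle table).
CHARGE = {"e-": -1, "e+": +1, "mu-": -1, "mu+": +1, "photon": 0}

@dataclass(frozen=True)
class Edge:
    particle: str
    src: int  # vertex the line leaves; -1 marks an external leg
    dst: int  # vertex the line enters; -1 marks an external leg

def charge_conserved(edges: list[Edge]) -> bool:
    """Net electric charge flowing into every internal vertex must vanish."""
    flow: dict[int, int] = {}
    for e in edges:
        q = CHARGE[e.particle]
        if e.src >= 0:
            flow[e.src] = flow.get(e.src, 0) - q  # charge leaves src
        if e.dst >= 0:
            flow[e.dst] = flow.get(e.dst, 0) + q  # charge enters dst
    return all(net == 0 for net in flow.values())

def loop_count(edges: list[Edge]) -> int:
    """L = I - V + 1 for a connected diagram (I internal lines, V vertices)."""
    internal = sum(1 for e in edges if e.src >= 0 and e.dst >= 0)
    vertices = {v for e in edges for v in (e.src, e.dst) if v >= 0}
    return internal - len(vertices) + 1

# Tree-level e- mu- -> e- mu-: an electron line through vertex 0, a muon
# line through vertex 1, and one internal photon connecting them.
diagram = [
    Edge("e-", -1, 0), Edge("e-", 0, -1),
    Edge("mu-", -1, 1), Edge("mu-", 1, -1),
    Edge("photon", 0, 1),
]
assert charge_conserved(diagram)
assert loop_count(diagram) == 0  # tree level: no loops
```

The loop count uses the standard topological identity L = I - V + 1 (internal lines minus vertices plus one) for a connected diagram, which is exactly the kind of global structural condition the abstract reports MLLMs violating.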
Key Points
- ▸ FeynmanBench is the first benchmark centered on Feynman diagram tasks for multimodal LLMs.
- ▸ Its tasks require multistep diagrammatic reasoning: enforcing conservation laws, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes.
- ▸ Experiments expose systematic failure modes in state-of-the-art MLLMs, highlighting the need for physics-grounded benchmarks.
Merits
- Strength: FeynmanBench provides a rigorous and comprehensive evaluation of AI's capacity for diagrammatic physics reasoning, addressing a significant gap in current benchmarks.
- Strength: The benchmark's large-scale and reproducible evaluation pipeline enables reliable assessment of AI's performance across a diverse range of tasks.
- Strength: FeynmanBench's focus on physics-grounded tasks highlights the importance of domain-specific knowledge in evaluating AI's scientific reasoning abilities.
Demerits
- Limitation: The benchmark's reliance on a single formalism (Feynman diagrams) may limit its generalizability to other areas of physics or other scientific domains.
- Limitation: The evaluation pipeline's complexity may pose challenges for researchers without expertise in physics or computational methods.
Expert Commentary
FeynmanBench represents a significant advance in evaluating AI's capacity for scientific reasoning in theoretical physics. By grounding its tasks in a formal notation with verifiable annotations and amplitude results, it shows how domain-specific structure can make AI evaluation rigorous. The trade-off is scope: performance on Feynman diagram tasks may not transfer to other scientific notations, and reproducing the evaluation pipeline requires some background in physics and computational methods. Even so, FeynmanBench is a valuable resource for researchers developing AI-powered tools for scientific domains.
Recommendations
- ✓ Develop AI-powered tools that incorporate physics-grounded knowledge and verifiable evaluation frameworks, improving the accuracy and reliability of AI-driven scientific discovery.
- ✓ Establish a community-driven initiative to build and maintain a suite of physics-grounded benchmarks, so that AI evaluation consistently prioritizes scientific accuracy and domain-specific knowledge.
Sources
Original: arXiv - cs.AI