
MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

arXiv:2603.11414v1 (Announce Type: new)

Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.
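
The abstract notes that models were evaluated via OpenAI APIs. A minimal sketch of how a MaterialFigBench-style problem could be posed to a multimodal model is shown below; the model name, file layout, and prompt wording are illustrative assumptions, not the authors' actual evaluation harness.

```python
# Sketch: send one figure plus a free-response question to a multimodal model
# through the OpenAI chat completions API. Model name and prompt are assumed.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_figure_question(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Return the model's free-response answer to a figure-based question."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical usage: a phase-diagram read-off question.
# answer = ask_figure_question(
#     "figures/fe_c_phase_diagram.png",
#     "Using the attached phase diagram, estimate the eutectoid temperature in degrees C.",
# )
```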

Executive Summary

MaterialFigBench introduces a benchmark dataset for assessing whether multimodal large language models can solve university-level materials science problems that hinge on interpreting figures such as phase diagrams, stress-strain curves, and diffraction patterns. Its 137 free-response problems, adapted from standard textbooks, make reliance on the figure the central design criterion, in contrast to prior benchmarks dominated by textual data. Expert-defined answer ranges mitigate the unavoidable ambiguity of reading numerical values from images. Evaluation of state-of-the-art multimodal LLMs, including ChatGPT and GPT variants, reveals a persistent gap: although accuracy improves with newer models, current systems often arrive at correct answers through memorized domain knowledge rather than genuine visual comprehension, and weaknesses remain in visual reasoning, numerical precision, and significant-digit handling. The dataset therefore serves as a diagnostic tool for identifying domain-specific limitations and for guiding future LLM development toward stronger figure-based reasoning in scientific domains.

Key Points

  • MaterialFigBench targets visual interpretation of materials science figures as a primary evaluation axis
  • 137 problems derived from textbooks cover diverse materials science topics
  • Expert-defined ranges reduce ambiguity in image-based numerical extraction (see the grading sketch after this list)
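
The abstract states that expert-defined answer ranges are provided where reading a value off a figure is inherently ambiguous. Below is a minimal sketch of how range-based grading could work; the record fields and the relative-tolerance fallback are assumptions for illustration, not the paper's published protocol.

```python
# Sketch of range-based grading for free-response numerical answers.
# Field names and the relative-tolerance fallback are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Problem:
    question: str
    exact_answer: Optional[float] = None                 # when the answer is unambiguous
    answer_range: Optional[Tuple[float, float]] = None   # expert-defined (lo, hi) for read-off answers

def grade(problem: Problem, model_answer: float, rel_tol: float = 0.02) -> bool:
    """Accept the answer if it falls in the expert range, or within a
    relative tolerance of the exact value when no range is defined."""
    if problem.answer_range is not None:
        lo, hi = problem.answer_range
        return lo <= model_answer <= hi
    assert problem.exact_answer is not None
    return abs(model_answer - problem.exact_answer) <= rel_tol * abs(problem.exact_answer)

# Hypothetical example: reading a eutectoid temperature off a phase diagram,
# where any value between 720 and 740 degrees C would be credited.
p = Problem(question="Estimate the eutectoid temperature (degrees C).",
            answer_range=(720.0, 740.0))
print(grade(p, 727.0))  # True
```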

Merits

Domain-Specific Innovation

MaterialFigBench fills a void by introducing a benchmark focused on visual-heavy problems in materials science, aligning evaluation with real-world academic demands.

Demerits

Limited Scope of Generalization

The performance improvements observed are tied to model updates, and the dataset does not probe multimodal reasoning beyond materials science or beyond figure-based content.

Expert Commentary

The introduction of MaterialFigBench represents a significant methodological advance in evaluating AI's capacity to engage with scientific content through visual media. Historically, benchmarks have privileged textual narratives, inadvertently masking the gap between linguistic fluency and quantitative visual comprehension. Materials science, with its rich interplay between symbolic diagrams and numerical data, demands a level of interpretive precision that current LLMs largely evade by leveraging surface-level memorization. The dataset's design, grounded in authentic textbook problems and validated by expert-defined answer ranges, provides a rigorous assessment framework. Its findings, while sobering, are instructive: they expose a persistent disconnect between LLMs' linguistic competence and their capacity to reason with visual information. Importantly, the improved performance observed in select categories signals incremental progress, suggesting that targeted architectural or training interventions, such as multimodal attention modules or fine-tuned visual embeddings, may yield measurable gains (a sketch of one such module follows). As such, MaterialFigBench does more than quantify limitations; it maps a roadmap for future research in multimodal AI, particularly in STEM fields where visual literacy is inseparable from analytical competence.
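
The commentary above names multimodal attention modules as one candidate intervention. A minimal PyTorch sketch of such a module is shown below, in which text tokens cross-attend to figure-patch embeddings; the dimensions and layer choices are illustrative assumptions, not an architecture proposed in the paper.

```python
# Illustrative cross-attention fusion block: text tokens attend over figure
# patch embeddings, pulling visual evidence into the text stream.
import torch
import torch.nn as nn

class FigureCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, text_tokens: torch.Tensor, figure_patches: torch.Tensor) -> torch.Tensor:
        # Queries come from the text; keys/values come from the figure patches.
        attended, _ = self.attn(text_tokens, figure_patches, figure_patches)
        x = self.norm1(text_tokens + attended)
        return self.norm2(x + self.ff(x))

# Smoke test with random embeddings: 16 text tokens, 196 figure patches.
block = FigureCrossAttention()
out = block(torch.randn(2, 16, 512), torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```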

Recommendations

  • Develop fine-grained multimodal architectures that integrate specialized visual encoders for scientific diagrams
  • Create annotated training corpora with aligned textual and visual annotations for materials science figures (a sketch of one possible record format follows)
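
One way to realize the second recommendation is a record format that pairs each figure with machine-readable annotations of its axes and expert-marked read-off points. The schema below is a hypothetical illustration, not a format defined by the paper; the numbers are consistent with an activation energy near 140 kJ/mol.

```python
# Hypothetical annotation record for a materials science figure; all field
# names are illustrative, not a schema defined by the paper.
import json

record = {
    "figure_id": "fig_0042",
    "image_file": "figures/fig_0042.png",
    "figure_type": "arrhenius_plot",
    "axes": {
        "x": {"label": "1/T", "unit": "1/K", "scale": "linear"},
        "y": {"label": "ln D", "unit": "ln(m^2/s)", "scale": "linear"},
    },
    # Expert-marked read-off points that ground numerical questions.
    "key_points": [
        {"x": 1.00e-3, "y": -27.1, "description": "diffusivity at 1000 K"},
        {"x": 1.25e-3, "y": -31.3, "description": "diffusivity at 800 K"},
    ],
    "question": "Estimate the activation energy for diffusion in kJ/mol.",
    "answer_range_kj_per_mol": [135.0, 145.0],
}

print(json.dumps(record, indent=2))
```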

Sources

  • arXiv:2603.11414v1