Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

arXiv:2603.02663v1. Abstract: Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should therefore measure their ability to integrate information across modalities. However, current benchmarks are filled with shortcut questions that can be solved using only a single modality, yielding unreliable rankings. For example, in vision-language settings, the correct answer to a shortcut question can be found from the text alone, without the image (or vice versa). These low-quality questions needlessly inflate the size and computational cost of benchmarks. We introduce a multimodal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates the cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality benchmark subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
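The abstract does not spell out the model, but a standard compensatory multidimensional IRT form suggests what such a decomposition could look like. The per-modality abilities, loadings, and difficulty terms below are illustrative assumptions, not the paper's published parameterization:

```latex
% Sketch of a compensatory MIRT response model with modality-decomposed
% ability and difficulty; the exact form used by M3IRT is an assumption here.
P(y_{ij} = 1) = \sigma\Bigl( \sum_{m \in \{\mathrm{img},\, \mathrm{txt},\, \mathrm{cross}\}}
    a_j^{m} \bigl( \theta_i^{m} - b_j^{m} \bigr) \Bigr)
```

Here y_ij indicates whether model i answers item j correctly, sigma is the logistic function, theta_i^m and b_j^m are the modality-specific ability and difficulty components, and a_j^m weights how strongly item j loads on each dimension. Under this reading, a shortcut question is one whose text-only (or image-only) component dominates, while a genuinely cross-modal item loads mainly on the cross term.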

Executive Summary

The article introduces M3IRT, a multimodal and multidimensional item response theory framework that addresses the persistent problem of unreliable benchmark rankings for multimodal large language models (MLLMs), caused by shortcut questions solvable from a single modality. M3IRT extends classical IRT by decomposing both ability and difficulty into image-only, text-only, and cross-modal components, enabling the identification and prioritization of genuinely cross-modal questions. Empirical results across 24 VLMs on three benchmarks show that M3IRT improves evaluation quality by filtering out low-quality items without compromising ranking fidelity, even when 50% of items are artificially generated low-quality questions. The framework offers a scalable, cost-effective tool for refining multimodal benchmarks and sharpening cross-modal reasoning assessment.

Key Points

  • M3IRT decomposes cross-modal ability and item difficulty into image, text, and cross-modal dimensions.
  • Current benchmarks are compromised by shortcut questions that undermine ranking reliability.
  • M3IRT preserves ranking fidelity while reducing evaluation cost through targeted filtering of low-quality items (see the selection sketch after this list).
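To make the filtering idea concrete, here is a minimal sketch of how a compact subset might be selected once per-item difficulty components have been estimated. The field names, the cross_modal_share heuristic, and the 0.5 threshold are all assumptions for illustration, not the authors' published procedure:

```python
# Illustrative sketch only: given fitted per-item difficulty components
# (b_img, b_txt, b_cross), keep items whose difficulty is dominated by
# the cross-modal component. Field names and threshold are hypothetical.

def cross_modal_share(b_img: float, b_txt: float, b_cross: float) -> float:
    """Fraction of an item's total difficulty mass that is cross-modal."""
    total = abs(b_img) + abs(b_txt) + abs(b_cross)
    return abs(b_cross) / total if total > 0 else 0.0

def select_cross_modal_items(items: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep items that likely require both modalities (i.e., not shortcuts)."""
    return [
        item for item in items
        if cross_modal_share(item["b_img"], item["b_txt"], item["b_cross"]) >= threshold
    ]

# A text-shortcut item (q1) vs. a genuinely cross-modal item (q2).
items = [
    {"id": "q1", "b_img": 0.1, "b_txt": 1.8, "b_cross": 0.2},
    {"id": "q2", "b_img": 0.3, "b_txt": 0.2, "b_cross": 1.5},
]
print([item["id"] for item in select_cross_modal_items(items)])  # -> ['q2']
```

Under this heuristic, the text-shortcut item q1 is dropped because most of its difficulty mass sits in the text-only component, while q2 is retained as genuinely cross-modal.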

Merits

Strength

M3IRT provides a rigorous, scalable framework for distinguishing genuine cross-modal reasoning from shortcut solutions, enhancing benchmark validity.

Demerits

Limitation

Implementation may require additional computational resources for decomposing modalities, potentially affecting scalability in large-scale benchmark evaluations.

Expert Commentary

This work represents a significant methodological advance in the evaluation of multimodal AI capabilities. The decomposition of ability into modality-specific components is both theoretically sound and practically effective. Empirical validation across multiple benchmarks lends credibility to the claims of improved ranking fidelity and reduced evaluation burden. Notably, the preservation of ranking integrity despite the injection of synthetic low-quality items is a compelling indicator of robustness. M3IRT fills a critical gap in the current landscape of multimodal assessment, where benchmark reliability has been compromised by superficial question design. Its potential for adoption in future evaluation frameworks is substantial, particularly as multimodal reasoning becomes a central pillar of AI development. The authors might consider extending M3IRT to temporal or interaction-based modalities for further generalization.

Recommendations

  • Adopt M3IRT as a standard tool for evaluating cross-modal reasoning in multimodal AI benchmarks.
  • Encourage interdisciplinary collaboration between cognitive science and AI evaluation teams to refine modality decomposition algorithms for broader applicability.
