CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

arXiv:2604.01634v1 (Announce Type: new)

Abstract: Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, ranging from natural images and videos to text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

Executive Summary

The article introduces CRIT, a novel dataset and benchmark designed to evaluate and enhance cross-modal multi-hop reasoning in Vision-Language Models (VLMs). By leveraging a graph-based automatic pipeline, CRIT generates complex, interleaved image-text tasks spanning diverse domains, addressing critical gaps in existing multimodal benchmarks where reasoning often relies on a single modality. The dataset includes a manually verified test set for rigorous evaluation. Empirical results demonstrate that state-of-the-art VLMs struggle with CRIT’s reasoning tasks, but models trained on CRIT exhibit significant improvements in cross-modal reasoning and generalize to standard benchmarks like SPIQA. This work underscores the need for more sophisticated, multimodal reasoning datasets and offers a scalable solution to bridge the training-evaluation gap in VLMs.

Key Points

  • CRIT addresses a fundamental limitation in multimodal benchmarks by generating tasks requiring *multi-hop reasoning* across interleaved image-text content, rather than relying on single-modality inference.
  • The graph-based pipeline enables automatic synthesis of complex, cross-modal reasoning tasks across diverse domains (natural images, videos, text-rich sources), ensuring scalability and domain diversity; a minimal sketch of this idea follows the list.
  • Empirical validation shows that even advanced VLMs struggle with CRIT’s tasks, highlighting the need for improved multimodal reasoning capabilities; however, training on CRIT yields measurable gains in reasoning and generalization to other benchmarks.
  • The inclusion of a manually verified test set ensures reliability in evaluation, addressing concerns about noise or bias in synthetic data.
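
To make the graph-based synthesis idea concrete, here is a minimal, hypothetical sketch of how an evidence graph might be walked to compose a cross-modal multi-hop question. The toy graph, the `Node`/`GRAPH` structures, and both helper functions are illustrative assumptions, not CRIT's actual schema or algorithm:

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class Node:
    entity: str    # entity the node describes
    modality: str  # "text" or "image": where the supporting evidence lives

# Toy evidence graph; every edge carries a relation label and a target node.
# Entities, relations, and modality tags are illustrative, not CRIT's schema.
GRAPH = {
    Node("Eiffel Tower", "image"): [("location", Node("Paris", "text"))],
    Node("Paris", "text"): [("country", Node("France", "image"))],
    Node("France", "image"): [("currency", Node("euro", "text"))],
}

def sample_cross_modal_chain(start, hops, rng):
    """Random walk of `hops` edges that must switch modality at every step,
    so no single modality suffices to answer the composed question."""
    chain, node = [], start
    for _ in range(hops):
        candidates = [e for e in GRAPH.get(node, [])
                      if e[1].modality != node.modality]  # force a modality switch
        if not candidates:
            return None  # dead end: caller discards and resamples
        relation, nxt = rng.choice(candidates)
        chain.append((node, relation, nxt))
        node = nxt
    return chain

def compose_question(chain):
    """Fold the relation chain into a nested multi-hop question template."""
    phrase = f"the {chain[0][0].entity}"
    for _, relation, _ in chain:
        phrase = f"the {relation} of {phrase}"
    return f"What is {phrase}?", chain[-1][2].entity  # question, gold answer

rng = random.Random(0)
chain = sample_cross_modal_chain(Node("Eiffel Tower", "image"), hops=3, rng=rng)
if chain is not None:
    question, answer = compose_question(chain)
    print(question, "->", answer)
    # What is the currency of the country of the location of the Eiffel Tower? -> euro
```

Because each hop is forced to alternate between image-grounded and text-grounded nodes, a model must combine both modalities to follow the chain, which is the property the abstract argues existing benchmarks lack.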

Merits

Innovative Dataset Design

CRIT’s graph-based synthesis pipeline represents a significant advancement in creating scalable, complex cross-modal reasoning tasks, addressing a critical gap in current multimodal benchmarks.

Rigorous Evaluation Framework

The manually verified test set ensures high-quality evaluation, mitigating risks of synthetic data artifacts or biases that could skew results.

Empirical Validation and Generalization

The study demonstrates that training on CRIT improves performance not only on CRIT-specific tasks but also on existing benchmarks like SPIQA, suggesting broad applicability of the approach.

Demerits

Synthetic Data Limitations

While the graph-based pipeline is scalable, synthetic data may lack the nuance or real-world variability of naturally occurring interleaved image-text content, potentially limiting generalization to unstructured, real-world scenarios.

Dependency on Graph Quality

The quality of CRIT’s tasks is contingent on the underlying graph structure and edge definitions; suboptimal graphs could lead to unnatural or trivial reasoning chains.
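
As an illustration of the kind of filters that could guard against such degenerate chains, the following hypothetical checks (reusing the chain representation from the sketch above; the thresholds and criteria are assumptions, not CRIT's actual validation) reject candidates that are too short, revisit an entity, or never cross modalities:

```python
def chain_is_valid(chain, min_hops=2):
    """Heuristic quality filters over a candidate chain, given as a list of
    (src, relation, dst) triples. Thresholds and criteria are illustrative
    assumptions, not the CRIT paper's actual checks."""
    if chain is None or len(chain) < min_hops:
        return False  # too short: answerable with a single lookup
    entities = [chain[0][0].entity] + [dst.entity for _, _, dst in chain]
    if len(set(entities)) != len(entities):
        return False  # revisited entity: the hops collapse into a cycle
    modalities = {chain[0][0].modality} | {dst.modality for _, _, dst in chain}
    if len(modalities) < 2:
        return False  # single-modality chain: no cross-modal hop required
    return True
```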

Manual Verification Overhead

The inclusion of a manually verified test set, while valuable, introduces scalability challenges for future expansions or updates to the dataset.

Expert Commentary

The introduction of CRIT marks a significant step toward addressing a longstanding challenge in multimodal AI: the lack of benchmarks that genuinely reflect the complexity of cross-modal reasoning. By leveraging graph-based synthesis, the authors have created a scalable and diverse dataset that forces VLMs to engage in *true* multi-hop reasoning, combining evidence from images, text, and their interactions, rather than relying on superficial correlations. This is a critical advancement, as it exposes the brittleness of current state-of-the-art models, which often perform well on existing benchmarks but fail when faced with tasks requiring deeper integration of modalities.

The empirical finding that training on CRIT improves performance on other benchmarks (e.g., SPIQA) is particularly noteworthy, suggesting that the skills learned are transferable rather than task-specific. However, the reliance on synthetic data introduces potential limitations. While the manual verification of the test set mitigates some concerns, the broader question of whether synthetic reasoning chains can fully capture the messiness of real-world data remains open. Future work should explore hybrid approaches that combine synthetic data with real-world interleaved content to test generalization more rigorously.

Additionally, the graph-based pipeline's dependency on predefined structures may inadvertently bias the reasoning tasks toward certain types of logical flows, which could limit the diversity of reasoning patterns captured. Nonetheless, CRIT represents a paradigm shift in how we evaluate and train VLMs, and its impact on the field is likely to be substantial.

Recommendations

  • Integrate CRIT into multimodal AI training pipelines as a standard component for evaluating and improving cross-modal reasoning, particularly for high-stakes applications where hallucinations could have severe consequences.
  • Develop hybrid datasets that combine synthetic reasoning chains (like those in CRIT) with real-world interleaved image-text data to better assess generalization and robustness.
  • Expand CRIT’s graph-based pipeline to include more diverse reasoning patterns (e.g., abductive, counterfactual) and domains (e.g., scientific diagrams, historical documents) to enhance its applicability across disciplines.
  • Establish collaborative efforts between academia and industry to standardize benchmarks for multimodal reasoning, ensuring that future VLMs are evaluated on tasks that reflect real-world complexity rather than artificial constraints.
  • Investigate the ethical implications of synthetic data in multimodal AI, including potential biases and unintended consequences, and develop frameworks for auditing such datasets before deployment in critical applications.

Sources

Original: arXiv - cs.LG