
VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought


Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim

arXiv:2603.11631v1 Announce Type: new Abstract: Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.

Executive Summary

The article introduces VisDoT, a framework designed to enhance visual reasoning in large vision-language models by grounding visual primitives through human-like interpretation and decomposing questions into perception and logic sub-questions. By formalizing four perceptual tasks aligned with graphical perception theory and introducing Decomposition-of-Thought (DoT) prompting, the authors address a critical bottleneck in chart-based reasoning: the lack of perceptual grounding. Empirical results demonstrate significant improvements across multiple benchmarks, including +11.2% on ChartQA and +33.2% on the newly introduced VisDoTQA, as well as surpassing GPT-4o on the more challenging ChartQAPro. The generalizability of the strategy is further supported by consistent zero-shot gains on open-domain VQA benchmarks. VisDoT represents a meaningful advance in aligning perceptual interpretation with semantic understanding in vision-language models.

Key Points

  • Formalization of perceptual tasks based on graphical perception theory
  • Introduction of DoT prompting to separate perception and logic sub-questions
  • Achievement of measurable performance gains on chart-specific and general VQA benchmarks
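To make the DoT idea concrete, the following is a minimal sketch of a two-stage prompt builder in the spirit of the perception-logic separation described above. The template wording and the `build_dot_prompts` helper are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of Decomposition-of-Thought (DoT) style prompting:
# stage 1 asks the model only to extract visual primitives from the chart;
# stage 2 reasons over those extracted values to answer the question.
# The templates below are hypothetical, not taken from the paper.

def build_dot_prompts(question: str) -> dict:
    """Split a chart question into a perception stage and a logic stage."""
    perception = (
        "Stage 1 (visual perception): Before answering, identify the visual "
        "primitives this question depends on (positions, lengths, axis "
        f"labels, legend entries) and read off their values for: {question}"
    )
    logic = (
        "Stage 2 (logical reasoning): Using only the values extracted in "
        f"Stage 1, answer the original question: {question}"
    )
    return {"perception": perception, "logic": logic}


prompts = build_dot_prompts(
    "Which country had the largest increase from 2010 to 2020?"
)
print(prompts["perception"])
print(prompts["logic"])
```

In practice the two prompts would be issued sequentially to the LVLM, with the stage-1 output fed into the stage-2 context, so perception errors can be inspected separately from reasoning errors.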

Merits

Empirical Validation

The reported performance improvements across multiple benchmarks provide strong evidence of the effectiveness of the VisDoT framework.

Conceptual Innovation

The integration of human-like perception grounding with structured prompting represents a novel approach to enhancing visual reasoning.

Demerits

Scalability Concern

While performance gains are notable, the article does not address potential computational overhead or scalability limitations of the framework in large-scale deployment.

Generalizability Caveat

Although the abstract reports zero-shot gains on open-domain VQA benchmarks, the effectiveness of the strategy on non-chart domains remains to be validated more broadly, beyond the newly introduced VisDoTQA benchmark.

Expert Commentary

VisDoT represents a sophisticated synthesis of perceptual theory and prompt engineering, addressing a persistent gap in vision-language model capabilities. The formalization of perceptual tasks and the application of Decomposition-of-Thought prompting demonstrate a level of methodological rigor that aligns with the demands of academic and industry research on visual reasoning. The reported benchmarks, particularly the +33.2% improvement on VisDoTQA, suggest that the framework taps into a substantial underutilized dimension of model performance. However, the absence of discussion on the scalability or cost implications of implementing human-like perception grounding in production environments warrants further scrutiny. Moreover, the lack of comparative analysis against competing prompting frameworks (e.g., Chain-of-Thought, Tree-of-Thought) limits the ability to assess the relative innovation. Overall, VisDoT is a compelling contribution that bridges a critical gap between human cognition and machine perception, and its impact on the broader field of multimodal AI is likely to be substantial.

Recommendations

  • Researchers and practitioners should evaluate VisDoT in their own chart-based reasoning pipelines to assess its applicability
  • Future work should extend validation of the DoT prompting strategy to non-chart VQA domains and compare it against alternative prompting architectures
