Can VLMs Reason Robustly? A Neuro-Symbolic Investigation
arXiv:2603.23867v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.
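The decoupling the abstract describes can be illustrated with a minimal sketch: a perception component maps an image to symbolic object concepts, and a fixed Boolean "circuit" compiled from the task rules evaluates the query exactly over those concepts. All names here (`recognize_concepts`, the example rule, the stubbed concept dictionary) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the neuro-symbolic split described in the abstract:
# perception (VLM) and reasoning (symbolic circuit) are separate stages.
# The concept vocabulary and rule are invented for illustration.

def recognize_concepts(image):
    """Stand-in for VLM-based concept recognition.

    A real system would query a fine-tuned VLM; here we return a fixed
    dictionary of object concepts so the sketch is runnable.
    """
    return {"circle": True, "red": True, "large": False}

def rule_circuit(concepts):
    """Task rules compiled into a fixed Boolean circuit.

    The circuit executes the rules exactly over the recognized concepts;
    no learned parameters are involved in this stage, which is what makes
    the reasoning step invariant under covariate shift. Example rule:
    the answer is "yes" iff the object is a red circle.
    """
    return concepts["circle"] and concepts["red"]

def answer_query(image):
    # Perception and reasoning composed: only perception sees the image.
    return "yes" if rule_circuit(recognize_concepts(image)) else "no"

print(answer_query(None))  # -> yes (for the stubbed concepts above)
```

Because only `recognize_concepts` depends on the input distribution, a shift in image style leaves the reasoning stage untouched; this is the intuition behind the robustness claims, though the paper's actual circuit compilation is more general than this two-concept conjunction.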
Executive Summary
This article investigates whether Vision-Language Models (VLMs) can reason robustly on visual deductive reasoning tasks. The authors find that VLMs fine-tuned through gradient-based end-to-end training achieve high in-distribution accuracy but fail to generalize under covariate shifts, where the perceptual input distribution changes while the prediction rules stay fixed. To address this, they propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning: task rules are compiled into a circuit that executes them exactly over the concepts the VLM recognizes. Experiments on three tasks show that VLC consistently performs well under covariate shifts. The study offers useful evidence on the limits of end-to-end fine-tuning and the benefits of decoupling perception from reasoning.
Key Points
- ▸ VLMs fine-tuned through gradient-based end-to-end training achieve high in-distribution accuracy but fail to generalize under distribution shifts.
- ▸ VLC, a neuro-symbolic method, combines VLM-based concept recognition with circuit-based symbolic reasoning and achieves strong performance under covariate shifts.
- ▸ The study highlights the importance of decoupling perception from reasoning in VLMs and the potential benefits of neuro-symbolic approaches.
Merits
Strength in Addressing Limitations of VLMs
The study provides a comprehensive analysis of the limitations of VLMs in robust reasoning and proposes a novel neuro-symbolic approach to address these limitations.
Methodological Rigor
The authors employ a rigorous methodology, including experiments on three visual deductive reasoning tasks, to evaluate the performance of VLC and VLMs.
Demerits
Limited Generalizability
The study focuses on visual deductive reasoning tasks and may not generalize to other domains or tasks.
Technical Complexity
The neuro-symbolic approach proposed in the study may be technically complex and challenging to implement in practice.
Expert Commentary
The study offers a timely analysis of why end-to-end fine-tuned VLMs fail to reason robustly under covariate shift, and proposes a neuro-symbolic remedy that decouples perception from reasoning. Evaluating on three visual deductive reasoning tasks with distinct rule sets adds credibility, and the circuit-based reasoning component has clear potential for AI systems that must reason reliably. That said, the focus on visual deductive reasoning limits how far the findings generalize to other domains, and compiling task rules into symbolic circuits adds engineering complexity that may be challenging to adopt in practice.
Recommendations
- ✓ Future studies should aim to generalize the findings of this study to other domains or tasks to further validate the potential benefits of neuro-symbolic approaches.
- ✓ Developers should consider incorporating neuro-symbolic approaches into AI systems that require robust reasoning capabilities to improve their performance and reliability.
Sources
Original: arXiv - cs.LG