Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

arXiv:2603.18656v1 Announce Type: new Abstract: Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long traces overshadow short but task-critical segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the reasoning segment, SCALe-SFT gradually shifts the focus from the reasoning segment to the answer segment throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.

Executive Summary

This article presents SCALe, a new approach to chain-of-thought training in vision-language models (VLMs). SCALe addresses token imbalance in reasoning data through dynamic, length-independent weighting that gradually shifts supervision from long, overshadowing reasoning segments toward short, task-critical answer segments over the course of training. The result is concise, well-grounded reasoning and consistently higher accuracy than vanilla supervised fine-tuning (SFT) across diverse benchmarks and architectures. Used alone, SCALe matches the full SFT + GRPO pipeline while requiring only about one-seventh of the training time; combined with GRPO, it achieves the best overall performance. SCALe is therefore valuable both as a lightweight standalone method and as a strong foundation for reinforcement refinement.

Key Points

  • SCALe proposes a novel approach to chain of thought training in VLMs
  • SCALe addresses the issue of token-imbalanced reasoning data through dynamic weighting
  • SCALe outperforms vanilla SFT on diverse benchmarks and architectures
  • SCALe is lightweight and efficient, requiring less training time than traditional pipelines

Merits

Strength in Addressing Token-Imbalanced Reasoning Data

SCALe's dynamic weighting mechanism effectively addresses the issue of token-imbalanced reasoning data, enabling more accurate and concise reasoning in VLMs.
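The core idea — per-segment loss averaging combined with a cosine-scheduled trade-off between reasoning and answer supervision — can be sketched as follows. This is a minimal illustration under assumed parameter names (`w_start`, `w_end`, the per-token loss inputs); the paper's exact schedule and weighting may differ.

```python
import math

def scale_weights(step, total_steps, w_start=0.9, w_end=0.1):
    """Cosine schedule for the reasoning-segment weight: starts near
    w_start and decays to w_end, shifting focus toward the answer.
    (Hypothetical endpoint values, for illustration only.)"""
    progress = step / total_steps
    w_reason = w_end + 0.5 * (w_start - w_end) * (1 + math.cos(math.pi * progress))
    return w_reason, 1.0 - w_reason

def scale_sft_loss(reason_token_losses, answer_token_losses, step, total_steps):
    """Length-independent weighting: each segment's cross-entropy is
    averaged over its own tokens, so a long reasoning trace cannot
    dominate the gradient simply by having more tokens."""
    w_r, w_a = scale_weights(step, total_steps)
    reason_loss = sum(reason_token_losses) / max(len(reason_token_losses), 1)
    answer_loss = sum(answer_token_losses) / max(len(answer_token_losses), 1)
    return w_r * reason_loss + w_a * answer_loss
```

Because each segment is normalized by its own length before weighting, the schedule alone controls the balance between reasoning and answer supervision, independent of how verbose the reasoning trace is.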

Efficient Training Time

SCALe requires significantly less training time than traditional SFT + GRPO pipelines, making it a more efficient and practical solution for VLM development.

Improved Performance

SCALe consistently outperforms vanilla SFT on diverse benchmarks and architectures, demonstrating its effectiveness in improving the performance of VLMs.

Demerits

Limited Evaluation on Real-World Applications

The article focuses primarily on benchmark evaluations, and further research is needed to assess the practical implications of SCALe in real-world applications.

Potential Overreliance on Reinforcement Learning

While SCALe achieves excellent results when combined with reinforcement learning, its reliance on this additional step may limit its applicability in scenarios where reinforcement learning is not feasible or desirable.

Expert Commentary

SCALe represents a meaningful advance in VLM training: by replacing the token-uniform SFT loss with dynamic, length-independent segment weighting, it prevents long reasoning traces from drowning out the short answer segments that determine accuracy. Particularly notable is the reported result that SCALe alone matches a full SFT + GRPO pipeline at roughly one-seventh of the training cost, lowering the barrier to training capable reasoning VLMs. Further research is needed to assess SCALe's practical implications in real-world applications, but the article provides a solid foundation for continued exploration of this promising approach.

Recommendations

  • Further evaluation of SCALe on real-world applications is necessary to assess its practical implications
  • Future research should focus on exploring the potential of SCALe in scenarios where reinforcement learning is not feasible or desirable
