
Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards


Yuxuan Zhu, Daniel Kang

arXiv:2603.16140v1

Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving performance similar to that of basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world human annotation errors causes 5-12% lower accuracy than training on clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.

Executive Summary

This study challenges the claim that reinforcement learning with verifiable rewards (RLVR) can learn effectively from noisy data. The authors show that a dataset previously claimed to be 100% noisy was in fact contaminated with clean data, and after rectifying it with a rigorous re-verification pipeline, they find that existing RLVR algorithm improvements fail to mitigate the impact of noise, performing no better than basic GRPO. Models trained on truly incorrect annotations score 8-10% lower on mathematical reasoning benchmarks than models trained on clean data, and real-world annotation errors in Text2SQL tasks cause a 5-12% accuracy drop. These results indicate that current RLVR methods are not robust to noise and that high-quality data remains essential.
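The failure mode is visible in the reward function itself: RLVR scores a response by checking its final answer against the reference annotation, so a wrong annotation inverts the training signal rather than merely weakening it. A minimal sketch, assuming a boxed-answer convention and exact string match (both illustrative assumptions, not the paper's implementation):

```python
import re

def verifiable_reward(response: str, annotation: str) -> float:
    """Binary verifiable reward: 1.0 iff the response's final boxed
    answer matches the reference annotation exactly."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == annotation.strip() else 0.0

# A correct annotation rewards the correct response:
print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
# A wrong annotation inverts the signal: the same correct response
# now scores 0.0, and a response reproducing the error would score 1.0.
print(verifiable_reward(r"... so the answer is \boxed{42}", "41"))  # 0.0
```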

Key Points

  • Existing RLVR algorithm improvements fail to mitigate the impact of noise, performing no better than basic GRPO (see the sketch after this list)
  • Training on truly incorrect annotations performs 8-10% worse than training on clean data on mathematical reasoning benchmarks
  • Real-world annotation errors in Text2SQL tasks cause 5-12% lower accuracy than clean data
  • Current RLVR methods are not robust to noise; high-quality data remains essential
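The group-relative advantage at the heart of GRPO shows how a flipped reward propagates: each sampled response's binary reward is normalized against the group's mean and standard deviation, so when a wrong annotation flips the rewards, the sign of every advantage flips with them. A minimal sketch with illustrative group rewards:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled responses scored against a correct annotation
# (1.0 = verified correct): correct responses get positive advantage.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
# The same responses scored against a wrong annotation: every reward
# flips, so the policy is pushed toward reproducing the error.
print(grpo_advantages([0.0, 1.0, 0.0, 1.0]))  # [-1.0, 1.0, -1.0, 1.0]
```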

Merits

Strength in methodology

The study uses a rigorous re-verification pipeline to rectify the dataset and ensure the validity of the results.
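The abstract does not detail the pipeline's internals, so the following is a hypothetical sketch of the contamination check it implies: any example in the "noisy" split whose annotation actually matches a trusted reference answer is clean data in disguise and must be relabeled before the split can be called 100% noisy. The `id` and `annotation` fields and the `reference_answers` lookup are illustrative assumptions:

```python
def reverify_noisy_split(examples: list[dict], reference_answers: dict[str, str]):
    """Separate a split claimed to be 100% noisy into truly noisy
    examples and 'contaminated' ones whose annotation is actually correct."""
    truly_noisy, contaminated = [], []
    for ex in examples:
        ref = reference_answers.get(ex["id"])
        if ref is not None and ex["annotation"].strip() == ref.strip():
            contaminated.append(ex)   # labeled noisy, but the annotation matches the reference
        else:
            truly_noisy.append(ex)    # annotation is genuinely wrong (or unverifiable)
    return truly_noisy, contaminated
```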

Strength in findings

The study provides a clear and significant demonstration of the impact of noise on RLVR, highlighting the importance of high-quality data.

Demerits

Limitation in generalizability

The study's findings may not generalize to other domains or tasks beyond mathematical reasoning and Text2SQL.

Limitation in scope

The study focuses primarily on the impact of noise on RLVR and does not explore other potential limitations of the approach.

Expert Commentary

This study makes a significant contribution to the field of RLVR by challenging the claim that existing methods can learn effectively from noisy data. The methodology, centered on a rigorous re-verification pipeline, is well-executed, and the resulting 8-10% and 5-12% accuracy gaps are clear and meaningful. The findings have important implications for large language model development: until RLVR methods become robust to noise, data quality cannot be traded away for algorithmic fixes. The limitations in generalizability and scope noted above should shape follow-up work, but the central conclusion, that high-quality data remains essential, is well supported.

Recommendations

  • Future research should prioritize the development of robust RLVR methods that can handle noisy data.
  • Researchers should explore alternative approaches to data annotation and verification that can ensure high-quality data; one illustrative possibility is sketched below.
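One way to act on the second recommendation, offered purely as an illustration rather than anything proposed in the paper, is to accept an annotation only when several independently produced answers agree, routing disagreements to manual review:

```python
from collections import Counter

def cross_verified_annotation(candidates: list[str], min_agreement: int = 2):
    """Keep an annotation only if independently produced answers agree;
    otherwise return None to flag the example for manual review."""
    answer, votes = Counter(a.strip() for a in candidates).most_common(1)[0]
    return answer if votes >= min_agreement else None

print(cross_verified_annotation(["42", "42", "41"]))  # '42'
print(cross_verified_annotation(["42", "41", "40"]))  # None -> manual review
```

A stricter `min_agreement` discards more borderline examples but keeps the surviving annotations cleaner, which is the trade-off the paper's results argue is worth making.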

Sources

  • Yuxuan Zhu, Daniel Kang. "Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards." arXiv:2603.16140v1.