Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design
arXiv:2603.12826v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.
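To make the reward-hacking risk concrete, a typical MCQ verifier in RLVR is just an exact match on the predicted option letter. The sketch below is an illustrative assumption about such a verifier (the extraction regex and function name are hypothetical, not taken from the paper); it shows why a uniform guesser over N options still collects an expected reward of 1/N, which weak distractors do nothing to suppress:

```python
import re

def mcq_reward(response: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted option letter matches the gold answer."""
    # Hypothetical extraction rule: take the last standalone A-D letter in the response.
    matches = re.findall(r"\b([A-D])\b", response)
    return 1.0 if matches and matches[-1] == gold else 0.0

# With 4 options, random guessing earns expected reward 0.25 per question,
# so a policy can accrue reward without any genuine reasoning.
```

Under this kind of verifier, shrinking to 2-way questions raises the guessing baseline to 0.5, which is why the paper's finding that strong distractors still enable effective 2-way RLVR training is notable.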
Executive Summary
This article explores the potential of Multiple-Choice Questions (MCQs) in Reinforcement Learning with Verifiable Rewards (RLVR) by examining the impact of option design on performance. The study reveals that mismatches in option counts between training and testing can degrade performance, while strong distractors can mitigate random guessing. The authors propose Iterative Distractor Curation (IDC), a framework for constructing high-quality distractors to promote deep reasoning. Experiments demonstrate significant gains in RLVR training using IDC.
Key Points
- Mismatches in option counts between training and testing degrade performance
- Strong distractors can mitigate random guessing and promote deep reasoning
- Iterative Distractor Curation (IDC) framework constructs high-quality distractors for effective RLVR training
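One plausible reading of an iterative curation loop like IDC is: propose candidate distractors, probe whether a model can eliminate each one without actually solving the question, and keep only the survivors. The sketch below is a hypothetical interpretation, not the authors' exact procedure; the `propose` and `eliminable` callables stand in for an LLM generator and an elimination probe, both assumptions:

```python
from typing import Callable

def curate_distractors(
    stem: str,
    gold: str,
    propose: Callable[[str, str, int], list[str]],  # e.g., an LLM proposing candidate distractors
    eliminable: Callable[[str, str], bool],         # True if a model can reject the option via shortcuts
    n_keep: int = 3,
    max_rounds: int = 5,
) -> list[str]:
    """Iteratively propose distractors, keeping only those that survive an elimination probe."""
    kept: list[str] = []
    for _ in range(max_rounds):
        for cand in propose(stem, gold, n_keep - len(kept)):
            # Discard duplicates, the gold answer, and candidates the probe can shortcut past.
            if cand != gold and cand not in kept and not eliminable(stem, cand):
                kept.append(cand)
        if len(kept) >= n_keep:
            break
    return kept[:n_keep]
```

The design intent is that every retained option forces the policy to reason about the stem rather than pattern-match away the wrong answers; the stopping criterion and round budget here are illustrative choices.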
Merits
Improved Performance
The proposed IDC framework enhances distractor quality, yielding significant gains in RLVR training over the original, uncurated data.
Demerits
Limited Generalizability
The study's findings may not generalize to all MCQ domains or RLVR setups; broader validation is needed.
Expert Commentary
The article presents a significant contribution to the field of RLVR, highlighting the importance of distractor design in MCQs. The proposed IDC framework offers a promising approach to constructing high-quality distractors, which can promote deep reasoning and mitigate reward hacking. However, further research is needed to fully explore the potential of MCQs in RLVR and to address the limitations of the current study. The implications of this research are far-reaching, with potential applications in various fields, including education and natural language processing.
Recommendations
- Further research should be conducted to explore the generalizability of the IDC framework to different types of MCQs and RLVR applications
- The development of more effective evaluation metrics is necessary to fully assess the performance of RLVR systems