GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models
arXiv:2603.14041v1
Abstract: The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement …
Zhijie Wang