MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning
arXiv:2603.16929v1
Abstract: Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing …
Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han