Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
arXiv:2604.01597v1
Abstract: Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that …
Dong Shu, Denghui Zhang, Jessica Hullman