Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
arXiv:2603.10279v1 Announce Type: new Abstract: Aligning generative recommender systems to user preferences via post-training is critical for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are ill-suited for production-scale systems: RLHF methods reward hack due to noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT with weights $w = \exp(r/\lambda)$ as uniquely suited to this setting, and provide the theoretical and empirical foundations that explain why. By optimizing directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards, showing that the gap scales only logarithmically with catalog size and remains informative even for large item catalogs. Crucially, we show that temperature $\lambda$ explicitly and quantifiably controls the robustness-improvement tradeoff, providing practitioners with a single interpretable regularization hyperparameter with theoretical grounding. Experiments on three open-source and one proprietary dataset against four baselines confirm that exponential reward weighting is simple, scalable, and consistently outperforms RLHF-based alternatives.
Executive Summary
This article presents a novel post-training method for generative recommender systems: exponential reward-weighted supervised fine-tuning (SFT), which optimizes directly on observed rewards without relying on learned reward models or propensity scores. The method is theoretically grounded and experimentally validated, outperforming alternatives based on Reinforcement Learning from Human Feedback (RLHF). A single interpretable regularization hyperparameter, the temperature λ, controls the tradeoff between robustness and policy improvement. The approach addresses the limitations of existing post-training methods, providing a scalable and robust solution for production-scale systems.
Key Points
- ▸ The proposed method, exponential reward-weighted SFT, is immune to reward hacking and requires no propensity scores.
- ▸ The method is fully offline, making it feasible for large item catalogs.
- ▸ The temperature λ explicitly controls the robustness-improvement tradeoff.
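The core idea is simple to state in code. The following is a minimal sketch, not the paper's implementation: the function name, the per-example log-likelihood inputs, and the weight normalization are illustrative assumptions; only the weighting rule w = exp(r/λ) comes from the abstract.

```python
import numpy as np

def reward_weighted_nll(log_probs, rewards, lam=1.0):
    """Exponential reward-weighted SFT objective (illustrative sketch).

    log_probs: per-example log-likelihood of the observed next item under the policy
    rewards:   observed rewards (e.g. clicks, watch time) -- no learned reward model
    lam:       temperature lambda controlling the robustness-improvement tradeoff
    """
    log_probs = np.asarray(log_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    weights = np.exp(rewards / lam)        # w = exp(r / lambda), per the abstract
    weights = weights / weights.sum()      # normalization assumed here for loss-scale stability
    return -np.sum(weights * log_probs)    # weighted negative log-likelihood
```

Because the weights depend only on logged rewards, there is no reward model to query, and hence nothing to reward-hack.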
Merits
Strength in addressing existing limitations
The proposed method effectively addresses the limitations of existing post-training methods, including reward hacking, reliance on propensity scores, and offline feasibility.
Theoretical grounding
The method is supported by theoretical guarantees, including policy improvement bounds that scale logarithmically with catalog size.
Scalability and robustness
Exponential reward-weighted SFT is shown to be simple, scalable, and robust, making it suitable for production-scale systems.
Demerits
Assumption on noisy rewards
The policy improvement guarantees rely on a specific model of reward noise, and real-world feedback noise may violate those assumptions, weakening the stated bounds.
Dependence on temperature λ
The method's performance is sensitive to the choice of temperature λ, which requires careful tuning.
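The sensitivity to λ can be made concrete: the ratio between the largest and smallest weights is exp((r_max − r_min)/λ), so small λ concentrates the loss on a few high-reward examples (amplifying noise), while large λ flattens the weights toward plain SFT. This is an illustrative computation; the function name and reward values are hypothetical.

```python
import numpy as np

def weight_spread(rewards, lam):
    """Ratio of largest to smallest exponential weight: exp((r_max - r_min) / lam)."""
    w = np.exp(np.asarray(rewards, dtype=float) / lam)
    return w.max() / w.min()

# With rewards in [0, 1]:
# small lambda -> aggressive reweighting, less robust to reward noise
# large lambda -> near-uniform weights, recovering ordinary SFT
```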
Expert Commentary
This article makes a significant contribution at the intersection of reinforcement learning and recommender systems. The proposed method addresses critical limitations of existing post-training approaches, providing a robust and scalable solution for large item catalogs, and its single temperature hyperparameter λ gives practitioners explicit, theoretically grounded control over the robustness-improvement tradeoff. However, the reliance on a particular reward-noise model and the sensitivity of performance to the choice of λ require careful consideration. Despite these limitations, the method has the potential to significantly influence recommender systems practice, making it a valuable contribution to the literature.
Recommendations
- ✓ Future research should explore the application of exponential reward-weighted SFT to other domains beyond Recommender Systems.
- ✓ Careful tuning of temperature λ is essential to achieving optimal performance, highlighting the need for further research on hyperparameter optimization.