Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
arXiv:2603.18258v1
Announce Type: new
Abstract: Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing …