Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
arXiv:2603.18258v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing …
Haocheng Luo, Zehang Deng, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le
6 views