M

Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

Articles by Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

Academic · 1 min

SteerRM: Debiasing Reward Models via Sparse Autoencoders

arXiv:2603.12795v1 Announce Type: new Abstract: Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses …

Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye
8 views