GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning
arXiv:2603.10243v1. Abstract: Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial …
Zhouxiang Fang, Jiawei Zhou, Hanjie Chen