Generalization Limits of Reinforcement Learning Alignment
arXiv:2604.02652v1 Announce Type: new Abstract: The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, …
Haruhi Shida, Koo Imai, Keigo Kansa