Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
arXiv:2603.14355v1 Announce Type: new Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models …
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty