Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
arXiv:2604.00228v1
Abstract: Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? …
Tanay Gondil