How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
arXiv:2604.00021v1 Announce Type: cross Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. …
Hiroki Fukui
5 views