Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
arXiv:2603.17199v1 Announce Type: new Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their …
Parsa Mirtaheri, Mikhail Belkin
11 views