Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations
Abstract (arXiv:2603.18353v1): Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods -- concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) -- for correcting false-negative triage errors using 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model's output sensitivity was only 45.1%, a 53-percentage-point knowledge-action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections, indistinguishable from random perturbation (p=0.84). SAE feature steering produced zero effect despite 3,695 significant features. TSV steering at high strength corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected. Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, with implications for AI safety frameworks that assume interpretability enables effective error correction.
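To make the two ingredients being compared concrete, here is a minimal sketch (not the authors' code) of a linear probe read from hidden activations and a separator-vector steering intervention of the kind the abstract describes. All data is synthetic; the hidden dimensionality, layer choice, and steering strength `alpha` are illustrative assumptions, and in practice the steered activations would be written back into the model's forward pass rather than used offline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64                                # assumed hidden-state dimensionality
n_hazard, n_benign = 144, 256         # class sizes from the paper's vignette set

# Synthetic stand-ins for last-token residual activations at some layer.
hazard_acts = rng.normal(loc=0.5, scale=1.0, size=(n_hazard, d))
benign_acts = rng.normal(loc=-0.5, scale=1.0, size=(n_benign, d))
X = np.vstack([hazard_acts, benign_acts])
y = np.array([1] * n_hazard + [0] * n_benign)

# (1) Linear probe: how separable is "hazard" in the representation?
probe = LogisticRegression(max_iter=1000).fit(X, y)
auroc = roc_auc_score(y, probe.predict_proba(X)[:, 1])
print(f"probe AUROC: {auroc:.3f}")

# (2) Separator-vector steering: push activations along the hazard direction.
separator = hazard_acts.mean(axis=0) - benign_acts.mean(axis=0)
separator /= np.linalg.norm(separator)
alpha = 4.0                           # steering strength (hypothetical value)
steered = X + alpha * separator       # applied inside the forward pass in practice
```

The paper's central finding is that high separability in step (1) does not guarantee that interventions like step (2) change the model's outputs in the intended way.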
Executive Summary
This article challenges the assumption that mechanistic interpretability methods can bridge the knowledge-action gap in language models. The study compares four such methods on a clinical triage task and finds that none reliably corrects false-negative errors: linear probes recover hazard information with near-perfect accuracy, yet steering interventions either disrupt previously correct detections or leave most errors uncorrected. The results indicate that current interpretability methods cannot translate internal knowledge into corrected outputs, with direct implications for AI safety frameworks that assume interpretability enables effective error correction.
Key Points
- ▸ Linear probes separate hazardous from benign cases with 98.2% AUROC, yet the model's output sensitivity is only 45.1%, a 53-percentage-point knowledge-action gap (see the sketch after this list)
- ▸ None of the four mechanistic interpretability methods tested reliably translates this internal knowledge into corrected outputs
- ▸ The knowledge-action gap remains a significant challenge for AI safety frameworks that assume interpretability enables error correction
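The headline gap can be read directly off the two figures reported in the abstract. The short sketch below simply juxtaposes them; the subtraction reproduces the roughly 53-percentage-point figure the paper cites, though the paper's exact definition of the gap may differ.

```python
# Figures taken from the abstract of arXiv:2603.18353v1.
probe_auroc = 0.982          # linear probe discrimination of hazard vs. benign
output_sensitivity = 0.451   # model's hazard sensitivity at the output

gap_pp = (probe_auroc - output_sensitivity) * 100
print(f"knowledge-action gap: {gap_pp:.1f} percentage points")  # ~53.1 pp
```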
Merits
Methodological rigor
The study employs a robust experimental design, evaluating four mechanistic interpretability methods on 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign) and benchmarking interventions against a random-perturbation control.
Relevance to AI safety frameworks
The findings bear directly on AI safety frameworks that assume interpretability enables effective error correction, and they sharpen the case for interventions that demonstrably change model behavior rather than merely read internal representations.
Demerits
Limited generalizability
The study focuses on a single task (clinical triage vignettes) and a small number of model architectures, which may limit the generalizability of its findings to other applications and models.
Methodological limitations
Anchoring the headline gap on a single probe metric (AUROC) may not fully capture the complexity of the knowledge-action gap, even though correction and disruption rates are also reported; a sketch of those complementary metrics follows below.
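For readers unfamiliar with the complementary metrics the abstract reports, here is a minimal sketch of the correction/disruption trade-off. The counts are hypothetical and chosen only to mirror the shape of the TSV-steering result (24% of missed hazards corrected, 6% of correct detections disrupted); only the rate definitions are meant to be informative.

```python
def correction_rate(missed_before: int, corrected: int) -> float:
    """Fraction of previously missed hazards fixed by the intervention."""
    return corrected / missed_before

def disruption_rate(correct_before: int, broken: int) -> float:
    """Fraction of previously correct detections broken by the intervention."""
    return broken / correct_before

# Hypothetical counts shaped like the reported TSV-steering result.
print(correction_rate(missed_before=100, corrected=24))   # 0.24
print(disruption_rate(correct_before=100, broken=6))      # 0.06
```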
Expert Commentary
These findings matter because they challenge a fundamental assumption in AI interpretability: that being able to read a model's internal knowledge implies being able to act on it. While mechanistic interpretability methods have shown promise in other settings, this study finds that near-perfect internal representations did not translate into reliable error correction, even under targeted steering interventions. The knowledge-action gap therefore remains an open problem for AI safety frameworks, and these results should inform both researchers developing intervention methods and policymakers who treat interpretability as a guarantee of controllability.
Recommendations
- ✓ Develop and evaluate interpretability methods that demonstrably translate internal knowledge into corrected outputs
- ✓ Re-evaluate AI safety frameworks that assume interpretability alone enables effective error correction