When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

Abhinaba Basu, Pavan Chakraborty

arXiv:2603.22816v1 Announce Type: new Abstract: Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B." If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access -- no model weights -- and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover "output rigidity": on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.

Executive Summary

This study investigates the authenticity of step-by-step reasoning in frontier AI models, revealing a critical flaw: most models generate decorative reasoning steps that do not influence their final outputs. Using a novel "step-level evaluation" methodology (removing individual reasoning sentences and observing whether the answer shifts), the authors demonstrate that across 10 leading models (e.g., GPT-5.4, Claude Opus) and multiple domains, removing a reasoning step alters the answer in fewer than 17% of cases. Notably, even on math, where one might expect genuine computational dependency, most frontier models are unaffected by step removal; only smaller models (0.8-8B) show genuine step dependence (55% necessity), indicating that frontier reasoning is often generated post hoc. The findings challenge the prevailing assumption that large models inherently produce faithful reasoning and expose a systemic gap in how AI transparency is evaluated. Two exceptions, MiniMax-M2.5 and Kimi-K2.5, show partial genuineness, but only on specific tasks, underscoring that faithfulness is neither universal nor inherent.
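The paper's core procedure can be summarized as a leave-one-step-out ablation: re-query the model with each reasoning sentence deleted in turn and count how often the final answer flips. The sketch below illustrates that loop; `query_model` is a hypothetical stand-in for whatever API call the evaluator uses (the real study works over chat-completion endpoints), stubbed here so the example runs offline.

```python
# Minimal sketch of step-level evaluation (leave-one-step-out ablation).
# `query_model` is a hypothetical placeholder for an API call; this stub
# pretends the model's answer depends on exactly one reasoning step, so
# the sketch is runnable without network access.

def query_model(question, steps):
    """Stub model: only the step "2 + 2 = 4" is actually load-bearing."""
    return "4" if "2 + 2 = 4" in steps else "unknown"

def step_necessity(question, steps):
    """Fraction of steps whose removal changes the final answer.

    High necessity = reasoning is genuinely used;
    near-zero necessity = the steps are decorative.
    """
    baseline = query_model(question, steps)
    changed = 0
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # drop step i only
        if query_model(question, ablated) != baseline:
            changed += 1
    return changed / len(steps)

steps = [
    "The question asks for a sum.",
    "2 + 2 = 4",
    "So the answer is 4.",
]
necessity = step_necessity("What is 2 + 2?", steps)
print(necessity)  # 1 of 3 steps is load-bearing in this stub
```

Under the paper's framing, a frontier model on a shortcut-prone task would score well under 0.17 on this metric, while a model that genuinely depends on its steps would score far higher (e.g., the reported 55% for small models on math).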

Key Points

  • Step-level evaluation reveals decorative reasoning in the majority of frontier models.
  • Removal of reasoning steps rarely alters answers (<17%), indicating post-hoc generation.
  • Faithfulness is task- and model-specific; exceptions exist but are inconsistent.
  • Output rigidity varies sharply: on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token.
  • Attention analysis corroborates the behavioral findings: CoT attention drops more in late layers on decorative tasks (33%) than on faithful ones (20%).

Merits

Methodological Innovation

Introduces a low-cost, scalable evaluation technique (step-level removal) requiring only API access, enabling replicable and objective assessment of reasoning authenticity without model weights.

Empirical Rigor

Tests across diverse domains (sentiment, math, classification, medical QA) with substantial sample sizes (N=376–500), lending credibility to generalized conclusions.

Demerits

Scope Limitation

The study does not explore the underlying causal mechanisms (e.g., training architecture, loss functions) that drive decorative reasoning, so its explanatory depth is limited.

Generalizability Concern

Results may not apply to future models with fundamentally different prompting or reasoning architectures, potentially limiting applicability beyond current frontier LMs.

Expert Commentary

This work represents a pivotal shift in how we assess AI reasoning, moving from qualitative interpretation to quantifiable, functional validation. The step-level removal test is elegant in its simplicity and powerful in its implications: if a model's reasoning is genuinely causal, removing a step should perturb the outcome. That this perturbation rarely occurs reveals a fundamental mismatch between the public perception of AI as "thoughtful" and its actual operational behavior. The discovery of output rigidity further enriches the analysis by linking surface behavior (step count) and internal structure (attention patterns) to functional outcomes (step influence). Importantly, the authors avoid conflating correlation with causation, instead presenting a clear, empirically grounded critique of the "black box" transparency narrative. This paper should become a canonical reference in AI ethics and evaluation literature, compelling both researchers and practitioners to reconsider how they validate claims of reasoning capability. The implications extend beyond academia into clinical decision support, legal AI, and financial forecasting, where reliance on AI explanations carries real-world consequences.

Recommendations

  • Adopt step-level evaluation as a standard metric in AI model audits, particularly for high-stakes applications.
  • Encourage reproducible research by open-sourcing the evaluation framework for broader adoption.

Sources

Original: arXiv - cs.CL