
Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Fan Huang, Haewoon Kwak, Jisun An

arXiv:2603.16017v1

Abstract: Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce *moral reasoning trajectories*, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories are 1.29× more susceptible to persuasive attacks (p = 0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r = 0.715, p < 0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
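For readers who want a concrete picture of the probing setup described in the abstract, the sketch below trains one linear probe per layer to predict the invoked ethical framework from hidden states and compares its KL divergence against a training-set prior baseline. All names, shapes, and labels here are illustrative assumptions; the paper's actual probe architecture and evaluation protocol may differ.

```python
# Minimal linear-probing sketch (illustrative; not the authors' implementation).
# Assumes `hidden_states` has shape (n_steps, n_layers, d_model), one hidden
# vector per reasoning step and layer, and `labels` holds integer framework ids.
# Also assumes every framework appears in the training split, so probe class
# columns align with label ids.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def mean_kl(probs: np.ndarray, y: np.ndarray) -> float:
    """KL(one-hot(y) || probs) averaged over examples (= mean negative log-likelihood)."""
    return float(-np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12)))

def probe_layers(hidden_states: np.ndarray, labels: np.ndarray) -> dict:
    """Fit one linear probe per layer; return the layer with the lowest probe KL,
    alongside the training-set prior baseline it is compared against."""
    n_steps, n_layers, _ = hidden_states.shape
    tr, te = train_test_split(np.arange(n_steps), test_size=0.2, random_state=0)
    n_classes = labels.max() + 1
    # Training-set prior baseline: always predict the empirical label distribution.
    prior = np.bincount(labels[tr], minlength=n_classes) / len(tr)
    prior_kl = mean_kl(np.tile(prior, (len(te), 1)), labels[te])
    results = []
    for layer in range(n_layers):
        clf = LogisticRegression(max_iter=1000).fit(hidden_states[tr, layer], labels[tr])
        probe_kl = mean_kl(clf.predict_proba(hidden_states[te, layer]), labels[te])
        results.append({"layer": layer, "probe_kl": probe_kl, "prior_kl": prior_kl})
    return min(results, key=lambda r: r["probe_kl"])
```

Comparing each layer's probe against the prior baseline is what gives meaning to the reported 13.8–22.6% reduction in KL divergence: the probe has to beat a predictor that never looks at the activations at all.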

Executive Summary

This article introduces the concept of moral reasoning trajectories to analyze how large language models (LLMs) navigate ethical frameworks across reasoning steps. The study finds that ethical decision-making in LLMs is characterized by frequent framework switches (55.4–57.7% of consecutive steps), indicating a complex, dynamic interplay among ethical perspectives. The findings also highlight a vulnerability: unstable trajectories (those with frequent framework changes) are more susceptible to persuasive attacks. Importantly, the paper introduces a novel metric, Moral Representation Consistency (MRC), which correlates strongly with LLM coherence ratings, offering a quantifiable indicator of ethical coherence. These insights advance the field by providing both theoretical depth and practical tools for evaluating ethical reasoning in AI systems.
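To make the trajectory statistics above concrete, the following sketch computes a switch rate and a framework-consistency flag from per-step framework labels. The labels and example trajectories are invented for illustration and are not drawn from the paper's data.

```python
# Illustrative computation of framework-switch rate and trajectory stability.
# A trajectory is the sequence of ethical-framework labels assigned to the
# intermediate reasoning steps of one model response (labels here are made up).
from typing import List

def switch_rate(trajectory: List[str]) -> float:
    """Fraction of consecutive step pairs whose framework label changes."""
    if len(trajectory) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
    return switches / (len(trajectory) - 1)

def is_consistent(trajectory: List[str]) -> bool:
    """A trajectory is framework-consistent if every step uses the same framework."""
    return len(set(trajectory)) <= 1

trajectories = [
    ["deontology", "utilitarianism", "deontology", "care"],   # unstable
    ["utilitarianism", "utilitarianism", "utilitarianism"],   # consistent
]
rates = [switch_rate(t) for t in trajectories]
print(f"mean switch rate: {sum(rates) / len(rates):.2f}")
print(f"consistent fraction: {sum(map(is_consistent, trajectories)) / len(trajectories):.2f}")
```

Under this kind of definition, the paper's figures (a 55–58% switch rate and only 16–18% consistent trajectories) mean most responses mix several frameworks rather than committing to one.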

Key Points

  • Moral reasoning trajectories identify systematic multi-framework deliberation patterns across LLMs.
  • Framework switches occur in over 55% of consecutive reasoning steps, undermining consistency.
  • Unstable trajectories increase susceptibility to persuasive attacks, and the MRC metric correlates strongly with coherence ratings (a hypothetical stand-in is sketched after this list).
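The abstract does not spell out how MRC is computed, so the following is only a hypothetical stand-in, not the paper's formula: it scores a response by the mean cosine similarity between framework-attribution vectors of consecutive reasoning steps, so that responses holding a stable framework mix score near 1. All values are invented.

```python
# Hypothetical consistency score in the spirit of MRC (the paper's exact
# definition is not given in the abstract; this is an illustrative stand-in).
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def consistency_score(attributions: np.ndarray) -> float:
    """Mean cosine similarity between framework-attribution vectors of consecutive
    reasoning steps; `attributions` has shape (n_steps, n_frameworks)."""
    pairs = zip(attributions[:-1], attributions[1:])
    return float(np.mean([cosine(a, b) for a, b in pairs]))

# Toy example: 3 steps over 4 frameworks (rows sum to 1, values invented).
steps = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.05],
])
print(f"consistency: {consistency_score(steps):.3f}")
```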

Merits

Strength in Conceptual Innovation

The introduction of moral reasoning trajectories and the MRC metric constitutes a significant conceptual advance, offering new tools for evaluating ethical reasoning in LLMs.

Empirical Robustness

The study leverages multiple models and benchmarks, providing statistically validated data on framework switching rates and attack susceptibility.

Demerits

Limited Generalizability Concern

The analysis focuses on specific models and benchmarks; findings may not fully extend to newer or niche LLM architectures without further validation.

Metric Complexity

While MRC is well-validated, its interpretability for non-technical stakeholders may require additional explanatory work.

Expert Commentary

The article represents a timely and substantive contribution to the intersection of ethics and AI reasoning. The concept of moral reasoning trajectories effectively captures the fluidity of ethical decision-making in LLMs, which has previously been underanalyzed. The empirical validation of framework switching rates, particularly the 1.29× increase in attack susceptibility for unstable trajectories, provides actionable insights for both researchers and practitioners. Moreover, the MRC metric's strong correlation with LLM coherence ratings, whose framework attributions are in turn validated by human annotators, suggests it may become a standard tool in AI ethics evaluation. However, the authors should consider expanding validation across additional model families (e.g., proprietary and multilingual LLMs) to enhance generalizability. The work bridges a critical gap between theoretical ethics and applied AI, and its impact is likely to be enduring in both academic and industrial domains.

Recommendations

  • Researchers should extend the moral reasoning trajectory framework to additional model families, including proprietary and multilingual LLMs, for broader applicability.
  • Industry stakeholders should integrate MRC into evaluation protocols for ethical AI deployment, especially in high-stakes domains.
