SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

arXiv:2604.02423v1. Abstract: Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior work has studied this issue and its impacts, rigorous computational linguistic metrics are needed to identify when models are being sycophantic. Here, we introduce SWAY, an unsupervised computational linguistic measure of sycophancy. We develop a counterfactual prompting mechanism that measures how much a model's agreement shifts under positive versus negative linguistic pressure, isolating framing effects from content. Applying this metric to benchmark six models, we find that sycophancy increases with epistemic commitment. Leveraging our metric, we introduce a counterfactual mitigation strategy that teaches models to consider what the answer would be if the opposite assumptions were suggested. While a baseline mitigation that explicitly instructs the model to be anti-sycophantic yields moderate reductions, a …
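As a rough illustration of the counterfactual idea in the abstract, the sketch below compares how strongly a model endorses the same claim under a supportive and an opposing user framing. It is not the paper's SWAY formula, which the abstract does not give. The prompt templates, the `agree_prob` callable, and the two toy models are all hypothetical stand-ins.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical framing templates (assumption: the paper's real prompts are not
# given in the abstract). Both ask the same question, so only the framing differs.
POSITIVE_FRAME = "{claim} I think this is true. Is the claim true?"
NEGATIVE_FRAME = "{claim} I think this is false. Is the claim true?"


def agreement_shift(claims: Sequence[str],
                    agree_prob: Callable[[str], float]) -> float:
    """Mean change in P(model says the claim is true) between the two framings.

    `agree_prob` is an assumed interface: it takes a prompt and returns a
    probability in [0, 1]. A framing-invariant model scores 0; larger positive
    values mean the model follows the user's stated stance more strongly.
    """
    shifts = [
        agree_prob(POSITIVE_FRAME.format(claim=c))
        - agree_prob(NEGATIVE_FRAME.format(claim=c))
        for c in claims
    ]
    return mean(shifts)


def steady_model(prompt: str) -> float:
    """Toy model that ignores the user's stance entirely."""
    return 0.7


def swayed_model(prompt: str) -> float:
    """Toy model that leans toward whatever the user asserted."""
    return 0.9 if "I think this is true" in prompt else 0.3


if __name__ == "__main__":
    claims = ["The Earth orbits the Sun.", "Water boils at 100 C at sea level."]
    print("steady:", agreement_shift(claims, steady_model))
    print("swayed:", agreement_shift(claims, swayed_model))
```

In this toy setup the steady model scores zero and the swayed model scores about 0.6. With a real model, `agree_prob` would have to be derived from its outputs, for example from the probability assigned to a "yes" answer.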

Joy Bhalla, Kristina Gligorić · April 6, 2026

#cs.CL #cs.CY

Sources

Original: arXiv - cs.CL
