
When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making


Abhinaba Basu, Pavan Chakraborty

arXiv:2603.18530v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
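The core idea of intervention consistency testing can be illustrated with a small sketch: swap a spurious feature (such as a name) in otherwise identical vignettes and measure how often the verdict flips. The `get_verdict` and `biased_verdict` functions below are hypothetical stand-ins, not the paper's code; in ICE-Guard the verdict would come from an LLM call.

```python
def apply_intervention(vignette: str, original: str, replacement: str) -> str:
    """Swap a spurious feature (e.g., a name) while holding all facts fixed."""
    return vignette.replace(original, replacement)

def flip_rate(vignettes, get_verdict, original, replacement):
    """Fraction of vignettes whose verdict changes under the intervention."""
    flips = 0
    for v in vignettes:
        before = get_verdict(v)
        after = get_verdict(apply_intervention(v, original, replacement))
        flips += before != after
    return flips / len(vignettes)

# Toy decision rule that leaks on the applicant's name, simulating
# the demographic bias the framework is designed to detect.
def biased_verdict(vignette: str) -> str:
    return "approve" if "Emily" in vignette else "deny"

vignettes = [f"Applicant Emily, income ${x}k, requests a loan." for x in (40, 60, 80)]
rate = flip_rate(vignettes, biased_verdict, "Emily", "Jamal")
# Every verdict flips under the name swap, so rate == 1.0
```

An unbiased decision rule would yield a flip rate of 0.0 under the same intervention, which is what makes the flip rate a direct measure of spurious feature reliance.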

Executive Summary

This study introduces ICE-Guard, a framework that uses intervention consistency testing to detect spurious feature reliance in large language models (LLMs). The researchers evaluated 11 LLMs from 8 families across 10 high-stakes domains and found that authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%). They demonstrate a detect-diagnose-mitigate-verify loop that achieves a cumulative 78% bias reduction via iterative prompt patching and validate their results against real COMPAS recidivism data. The study highlights the importance of testing for multiple types of bias in LLM decision-making, not just demographic bias, and provides a practical tool for mitigating them. The findings have significant implications for the development and deployment of AI systems in high-stakes applications.
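The structured decomposition described in the abstract — the LLM extracts features and a deterministic rubric decides — can be sketched as follows. Here `extract_features` is a hypothetical stand-in for the LLM extraction step (implemented as simple string parsing for illustration), and the rubric thresholds are invented, not taken from the paper.

```python
def extract_features(vignette: str) -> dict:
    # In ICE-Guard this step would be an LLM call returning structured
    # fields; here we parse the income figure directly for illustration.
    income = int(vignette.split("$")[1].split("k")[0])
    return {"income_k": income}

def rubric_decide(features: dict) -> str:
    # Deterministic rubric: the decision depends only on extracted
    # features, never on names, credentials, or framing in the raw text.
    return "approve" if features["income_k"] >= 50 else "deny"

# The verdict is identical under a name swap, since the name never
# reaches the rubric.
for name in ("Emily", "Jamal"):
    v = f"Applicant {name}, income $60k, requests a loan."
    assert rubric_decide(extract_features(v)) == "approve"
```

This separation is what drives the reported flip-rate reduction: spurious cues can still distort feature extraction, but they can no longer influence the decision rule itself.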

Key Points

  • ICE-Guard applies intervention consistency testing to detect spurious feature reliance in LLMs
  • Authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%)
  • A detect-diagnose-mitigate-verify loop achieves a cumulative 78% bias reduction via iterative prompt patching
  • Validation against real COMPAS recidivism data suggests the synthetic benchmark gives a conservative estimate of real-world bias

Merits

Comprehensive evaluation of LLM biases

The study evaluates 11 LLMs from 8 families across 10 high-stakes domains, providing a comprehensive understanding of the biases present in these models.

Development of a practical mitigation framework

The detect-diagnose-mitigate-verify loop provides a practical solution for mitigating biases in LLMs, making it a valuable tool for developers and deployers of AI systems.
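The loop's control flow can be sketched as below. The `measure_bias` and `patch_prompt` callbacks, the 5% threshold, and the toy halving model are all assumptions for illustration; the paper's actual loop measures flip rates with ICE-Guard and patches prompts iteratively.

```python
def ice_loop(prompt, measure_bias, patch_prompt, threshold=0.05, max_iters=5):
    """Repeat detect-diagnose-mitigate-verify until bias is below tolerance."""
    for _ in range(max_iters):
        rate = measure_bias(prompt)          # detect + diagnose: flip rate
        if rate <= threshold:                # verify: bias below tolerance
            break
        prompt = patch_prompt(prompt, rate)  # mitigate: patch the prompt
    return prompt, rate

# Toy model: each appended instruction halves the measured flip rate.
def toy_measure(prompt):
    patches = prompt.count("[ignore names]")
    return 0.20 / (2 ** patches)

def toy_patch(prompt, rate):
    return prompt + " [ignore names]"

final_prompt, final_rate = ice_loop("Decide the case.", toy_measure, toy_patch)
# Two patches bring the toy flip rate from 0.20 down to 0.05
```

The verify step is what distinguishes this from one-shot prompt engineering: each patch is re-tested against the same interventions before being accepted.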

Demerits

Limited generalization to real-world applications

While validation against COMPAS recidivism data suggests the benchmark gives a conservative estimate of real-world bias, the vignettes are largely synthetic, and the results may not generalize to all high-stakes applications.

Requires significant computational resources

The evaluation of multiple LLMs across 10 domains requires significant computational resources, which may be a limitation for some researchers or organizations.

Expert Commentary

The study's findings on LLM biases have significant implications for the development and deployment of AI systems in high-stakes applications. The ICE-Guard framework offers a practical way to detect and mitigate these biases, but applying it in real-world scenarios requires careful consideration of computational and data requirements. The study also underscores the need for policymakers to develop fairness and accountability standards for AI systems, including LLMs. Finally, the use of intervention consistency testing to detect spurious feature reliance gives practitioners a principled way to probe LLM behavior and improve its explainability and interpretability.

Recommendations

  • Developers and deployers of AI systems should prioritize the use of tools like ICE-Guard to detect and mitigate biases in LLMs.
  • Policymakers should prioritize the development of fairness and accountability standards for AI systems, including LLMs, to ensure their safe and responsible deployment in high-stakes applications.
