Academic

Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures

arXiv:2603.18729v1 Announce Type: new Abstract: Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge betw

arXiv:2603.18729v1 Announce Type: new Abstract: Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.

Executive Summary

This article presents a comprehensive analysis of linguistic stereotypes in single and multi-agent generative AI architectures, with a focus on dialect-sensitive stereotype generation in Large Language Model (LLM) outputs. The authors replicate existing analyses and investigate the effects of mitigation strategies, including prompt engineering and multi-agent architectures. The results show that stereotype-bearing differences emerge between Standard American English (SAE) and African-American English (AAE) outputs, with the strongest effects observed in adjective and job attribution. The study highlights the importance of model-specific validation of mitigation strategies and workflow-level controls in high-impact LLM deployments. The findings have significant implications for intersectionality-informed software engineering and fairness evaluation in AI development.

Key Points

  • Dialect-sensitive stereotype generation in LLM outputs is a persistent issue
  • Mitigation strategies, such as prompt engineering and multi-agent architectures, can reduce bias
  • Model-specific validation of mitigation strategies is crucial for fairness evaluation

Merits

Strength in methodology

The authors employ a rigorous methodology, including the use of an LLM-as-judge approach and multiple prompt templates, to evaluate dialect bias in LLM outputs.

Innovative application of mitigation strategies

The study investigates the effects of both prompt engineering and multi-agent architectures, providing a comprehensive understanding of their potential to reduce bias in LLM outputs.

Demerits

Limited scope and dataset

The current results are exploratory in nature and limited in scope, with the authors suggesting the need for extensions and replications with larger datasets and different languages or dialects.

Lack of generalizability

The findings may not be generalizable to other LLM models or domains, highlighting the need for further research and validation.

Expert Commentary

While the study provides valuable insights into the persistence of dialect-sensitive stereotype generation in LLM outputs, its limitations and lack of generalizability highlight the need for further research and validation. The findings suggest that a more comprehensive approach to fairness evaluation in AI development is necessary, one that takes into account the complex interactions between language, culture, and identity. As AI systems become increasingly ubiquitous, it is essential that we prioritize the development of fair and equitable AI that reflects the diversity of human experience.

Recommendations

  • Further research is needed to validate the findings and explore their generalizability to other LLM models and domains
  • AI developers and policymakers should prioritize the development of intersectionality-informed software engineering practices and fairness evaluation frameworks

Sources