LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
arXiv:2604.00259v1 (Announce Type: new)
Abstract: Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise keyword-based prompts generally outperform longer rubric-style prompts in multi-trait analytic scoring. To quantify the amount of data needed to detect these systematic deviations, we compute the minimum sample size at which a 95% bootstrap confidence interval for the mean bias excludes zero. This analysis shows that LOC bias is often detectable with very small validation sets, whereas Higher-Order Concern (HOC) traits typically require much larger samples. These findings support a bias-correction-first deployment strategy: instead of relying on raw zero-shot scores, systematic score offsets can be estimated and corrected using small human-labeled bias-estimation sets, without requiring large-scale fine-tuning.
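The abstract's minimum-sample-size analysis can be sketched as follows. The per-essay bias values, candidate sample sizes, and bootstrap settings below are illustrative assumptions, not the paper's actual data or code:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(bias, n_boot=2000, alpha=0.05):
    """95% bootstrap CI for the mean of per-essay bias values (model - human)."""
    means = np.array([rng.choice(bias, size=len(bias), replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def min_detectable_n(bias, candidate_ns=(5, 10, 20, 50, 100, 200)):
    """Smallest validation-set size whose 95% CI for mean bias excludes zero."""
    for n in candidate_ns:
        if n > len(bias):
            break
        lo, hi = bootstrap_ci(bias[:n])
        if lo > 0 or hi < 0:
            return n
    return None

# Synthetic LOC-like trait: strong negative bias (about -0.8 rubric points)
loc_bias = rng.normal(-0.8, 0.5, size=200)
print(min_detectable_n(loc_bias))
```

With a large, stable offset like the LOC case, the CI excludes zero at small n; a near-zero HOC-like bias would walk through the candidate sizes without triggering, matching the paper's observation that HOC traits need much larger samples.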
Executive Summary
This study evaluates how well Large Language Models (LLMs) score essays under both holistic and analytic rubrics, and reveals systematic biases in their scoring. Concise keyword-based prompts outperform longer rubric-style prompts in multi-trait analytic scoring, and models score Lower-Order Concern (LOC) traits, such as Grammar and Conventions, more harshly than human raters. The study proposes a bias-correction-first deployment strategy: rather than relying on raw zero-shot scores, systematic score offsets are estimated and corrected using small human-labeled bias-estimation sets. This approach has significant practical and policy implications for the use of LLMs in educational assessment.
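A minimal sketch of the bias-correction-first idea, assuming a numeric 1-5 rubric; the function names, score range, and toy scores are illustrative, not taken from the paper:

```python
import numpy as np

def estimate_offset(model_scores, human_scores):
    """Mean directional bias (model minus human) on a small labeled set."""
    return float(np.mean(np.asarray(model_scores) - np.asarray(human_scores)))

def correct(raw_scores, offset, lo=1, hi=5):
    """Subtract the estimated offset and clip to the rubric's score range."""
    return np.clip(np.asarray(raw_scores, dtype=float) - offset, lo, hi)

# Toy example: model scores a Grammar trait ~1 point harsher than humans
model = [2, 3, 2, 1, 3]
human = [3, 4, 3, 2, 4]
offset = estimate_offset(model, human)   # -1.0
print(correct([2, 3, 1], offset))        # [3. 4. 2.]
```

The key property is that only the small labeled set is needed to fit the offset; the correction then applies to all subsequent zero-shot scores without fine-tuning.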
Key Points
- ▸ LLMs achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this agreement does not transfer uniformly to analytic scoring.
- ▸ Models exhibit large and stable negative directional bias on LOC traits, such as Grammar and Conventions.
- ▸ Concise keyword-based prompts outperform longer rubric-style prompts in multi-trait analytic scoring.
Merits
Systematic evaluation of LLMs across multiple datasets
The study provides a comprehensive evaluation of LLMs across three open essay-scoring datasets, covering both holistic and analytic scoring.
Identification of bias in LLM scoring
The study reveals significant biases in LLM scoring, including large and stable negative directional bias on LOC traits.
Proposal of bias-correction-first deployment strategy
The study proposes a practical approach to mitigating bias in LLM scoring, involving the estimation and correction of systematic score offsets.
Demerits
Limited generalizability of findings
The study's findings may not generalize to other datasets or assessment contexts.
Need for larger-scale validation
The study suggests that Higher-Order Concern (HOC) traits may require larger samples to detect bias, highlighting the need for further research.
Technical complexity of bias correction
The proposed bias-correction approach may require technical expertise and computational resources.
Expert Commentary
This study provides a significant contribution to the field of AI in educational assessment, highlighting the need for careful consideration of bias and validation. The proposed bias-correction-first deployment strategy offers a practical approach to mitigating bias in LLM scoring, but further research is needed to fully understand the technical complexities of this approach. The study's findings have broader implications for the development and deployment of AI systems, emphasizing the need for ongoing monitoring and mitigation of bias.
Recommendations
- ✓ Further research is needed to explore the generalizability of the study's findings across different datasets and assessment contexts.
- ✓ Developers of LLMs should prioritize the development of bias-aware and validation-focused approaches to LLM scoring.
Sources
Original: arXiv - cs.CL