Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
arXiv:2603.20562v1 (Announce Type: new)
Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
Executive Summary
The article introduces PCFJudge, an innovative inference-time method designed to mitigate candidate-order sensitivity in listwise factuality evaluation by LLM judges. By rerunning the same prompt across multiple permutations of candidate answers and aggregating the results into a consensus decision, PCFJudge achieves up to a 7-point absolute improvement over direct judging on the RewardBench 2 Factuality benchmark. The authors demonstrate that the gains stem primarily from permutation consensus rather than additional arbitration mechanisms, highlighting a hitherto underappreciated source of instability in LLM-based evaluation—order-induced bias—which contributes meaningfully to factuality-judging errors. The study underscores the importance of robustness in LLM judging pipelines and offers a straightforward yet effective solution to enhance reliability in factuality assessments.
Key Points
- ▸ LLM judges exhibit instability in listwise factuality evaluation due to candidate-order sensitivity: the order in which candidates are presented can shift the judge's verdict even though the candidate set itself is unchanged.
- ▸ PCFJudge addresses this by aggregating judgments across multiple permutations of candidate answers, producing a consensus score that reduces order-induced noise.
- ▸ Empirical evaluation on RewardBench 2 Factuality shows up to a 7-point absolute improvement over direct judging, with ablations indicating that permutation consensus is the primary driver of performance gains.
- ▸ The method adds only repeated inference calls, avoiding heavier arbitration layers, which keeps the pipeline simple and makes it a practical route to more reliable LLM judging.
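The permutation-consensus idea behind the key points above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the names `permutation_consensus`, `judge_fn`, and the position-biased `biased_judge` are assumptions, and a simple mean of per-candidate scores stands in for the paper's richer aggregation of scores, ranks, and uncertainty signals.

```python
import itertools
import statistics

def permutation_consensus(candidates, judge_fn, max_perms=6):
    """Rerun a listwise judge over several orderings and average per-candidate scores.

    judge_fn takes an ordered list of candidate ids and returns a list of
    scores aligned with that ordering. Sketch only; not the paper's method.
    """
    scores = {c: [] for c in candidates}
    for perm in itertools.islice(itertools.permutations(candidates), max_perms):
        for cand, score in zip(perm, judge_fn(list(perm))):
            scores[cand].append(score)
    consensus = {c: statistics.mean(v) for c, v in scores.items()}
    return max(consensus, key=consensus.get), consensus

# Toy judge with an order bias: whichever answer is listed first gets +0.2.
QUALITY = {"A": 0.6, "B": 0.7, "C": 0.5}  # hypothetical true quality

def biased_judge(ordered):
    return [QUALITY[c] + (0.2 if i == 0 else 0.0) for i, c in enumerate(ordered)]
```

With a single direct call on the ordering `["A", "B", "C"]`, the position bias makes A score highest (0.8 vs. 0.7); averaging over all six orderings distributes the bias evenly, and the consensus correctly picks B.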
Merits
Methodological Rigor
The study rigorously isolates the effect of permutation consensus through controlled ablations, demonstrating that the observed improvements are attributable to the core innovation rather than ancillary factors.
Practical Relevance
PCFJudge is computationally lightweight and easily integrable into existing LLM judging pipelines, offering a deployable solution to a critical challenge in automated factuality evaluation.
Empirical Robustness
The use of a well-established benchmark (RewardBench 2 Factuality) and the consistency of the gains under development ablations strengthen the validity of the findings.
Demerits
Limited Generalizability
The study focuses exclusively on factuality evaluation, leaving open questions about whether permutation consensus would yield similar improvements in other domains (e.g., preference alignment, toxicity detection).
Scalability Concerns
While permutation consensus is efficient for small candidate sets, the number of possible orderings grows factorially with the number of candidates; even when only a fixed sample of orderings is judged, each additional ordering multiplies inference cost, potentially limiting applicability in high-throughput scenarios.
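A standard mitigation, assumed here rather than taken from the paper, is to judge only a fixed sample of distinct random orderings instead of enumerating all n! of them, so cost grows linearly in the sample size k:

```python
import math
import random

def sample_orderings(n_candidates, k, seed=0):
    """Sample k distinct orderings of n candidates without enumerating all n!.

    For n = 10 there are 3,628,800 orderings; judging a handful of sampled
    ones keeps inference cost linear in k rather than factorial in n.
    Illustrative helper, not part of the paper's described method.
    """
    rng = random.Random(seed)
    k = min(k, math.factorial(n_candidates))  # can't ask for more than exist
    base = list(range(n_candidates))
    seen = set()
    while len(seen) < k:
        perm = base[:]
        rng.shuffle(perm)
        seen.add(tuple(perm))
    return [list(p) for p in seen]
```

Each sampled ordering would then be fed to the listwise judge exactly as in full permutation consensus, trading exhaustiveness for a controllable inference budget.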
Dependence on LLM Stability
The method assumes that the underlying LLM judge is sufficiently stable across runs; if the LLM itself exhibits high variance in responses, permutation consensus may not fully mitigate the issue.
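One way to surface this failure mode, sketched here as an assumption rather than anything the paper describes, is to treat the spread of a candidate's scores across orderings as an instability signal: when the spread is large, averaging alone may not yield a trustworthy rank, and the item can be flagged for arbitration or human review.

```python
import statistics

def flag_unstable(scores_by_candidate, max_std=0.1):
    """Flag candidates whose scores vary too much across orderings.

    scores_by_candidate maps candidate id -> list of scores, one per ordering.
    A large population standard deviation suggests the judge itself is
    unstable for that candidate. Threshold and helper name are illustrative.
    """
    flags = {}
    for cand, scores in scores_by_candidate.items():
        spread = statistics.pstdev(scores) if len(scores) > 1 else 0.0
        flags[cand] = spread > max_std
    return flags
```

For example, a candidate scored `[0.70, 0.71, 0.69]` across orderings would pass, while one scored `[0.2, 0.9, 0.5]` would be flagged as judge-unstable.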
Expert Commentary
The authors present a compelling case for addressing order-induced instability in LLM-based factuality evaluation, a problem that has received limited attention despite its potential to undermine the reliability of automated judging systems. The introduction of PCFJudge is timely, as LLMs are increasingly deployed in high-stakes applications where factual accuracy is paramount. The empirical improvement of up to 7 absolute points on a widely used benchmark is noteworthy, particularly given that the gains are achieved with a computationally simple method. However, the study also raises important questions about the broader applicability of permutation consensus. For instance, while the method is effective for small candidate sets, its scalability to larger or more complex evaluation scenarios remains untested. Moreover, the dependence on the stability of the underlying LLM judge suggests that permutation consensus may not be a panacea for all forms of judging instability. That said, the work makes a significant contribution to the field by highlighting a critical, yet often overlooked, source of bias in LLM evaluation and proposing a pragmatic solution. Future research should explore the method's applicability to other domains and its potential integration with uncertainty-aware judging frameworks.
Recommendations
- ✓ Conduct further ablation studies to isolate the effects of permutation consensus across different LLM architectures and benchmarks, including those beyond factuality (e.g., preference alignment, reasoning tasks).
- ✓ Investigate the scalability of PCFJudge by testing its performance on larger candidate sets and in real-time evaluation scenarios to assess its practical deployment limits.
- ✓ Explore hybrid approaches that combine permutation consensus with other robustness techniques, such as ensemble judging or uncertainty quantification, to further enhance the reliability of LLM evaluations.
- ✓ Engage with standard-setting bodies and policymakers to advocate for the inclusion of order-sensitivity tests in formal evaluation frameworks for LLMs, ensuring that future benchmarks and regulations account for this form of bias.
- ✓ Develop open-source implementations of PCFJudge and integrate it into popular LLM evaluation toolkits (e.g., Hugging Face’s Evaluate, EleutherAI’s LM Evaluation Harness) to facilitate widespread adoption and community-driven validation.
Sources
Original: arXiv - cs.CL