Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable
arXiv:2603.20450v1 Abstract: A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But, are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI generated, potentially overstating the extent of policy violations.
Executive Summary
This study assesses whether policies prohibiting the use of Large Language Models (LLMs) in peer review, except for polishing, paraphrasing, and grammar correction, can actually be enforced. The authors assembled a dataset of peer reviews simulating multiple levels of human-AI collaboration and evaluated five state-of-the-art detectors, including two commercial systems. All detectors misclassified a non-trivial fraction of LLM-polished reviews as AI-generated, exposing the limitations of current detection methods and the risk of false accusations of academic misconduct. The study further suggests that recent public estimates of AI use in peer reviews may be overstated, because detectors tend to label mixed human-AI reviews as fully AI-generated. The findings carry significant implications for the academic community and policy makers, underscoring the need for more accurate detection methods and revised policies.
Key Points
- ▸ Current AI-text detectors cannot reliably distinguish human-written from LLM-polished peer reviews; all five evaluated detectors flag a non-trivial fraction of polished reviews as AI-generated (see the measurement sketch after this list).
- ▸ Peer-review-specific signals, such as access to the reviewed manuscript and the constrained domain of scientific writing, yield measurable gains in some settings, but the resulting methods still fall short of the required accuracy standards.
- ▸ Recent estimates of AI use in peer reviews may be overstated, because detectors misclassify mixed (collaborative human-AI) reviews as fully AI-generated.
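To make the first point concrete, below is a minimal sketch of how a detector's false-positive rate on LLM-polished (policy-compliant) reviews could be estimated. The detector interface, decision threshold, and toy data are illustrative assumptions, not the paper's actual evaluation code.

```python
# Minimal sketch: estimating how often a detector flags LLM-polished human
# reviews as AI-generated. The detector interface, threshold, and data layout
# are illustrative assumptions, not the paper's setup.
from typing import Callable, Sequence


def false_positive_rate(
    detector: Callable[[str], float],   # returns P(text is AI-generated) in [0, 1]
    polished_reviews: Sequence[str],    # human-written reviews after LLM polishing
    threshold: float = 0.5,             # decision threshold for flagging a review
) -> float:
    """Fraction of LLM-polished (policy-compliant) reviews flagged as AI-generated."""
    flagged = sum(detector(r) >= threshold for r in polished_reviews)
    return flagged / max(len(polished_reviews), 1)


if __name__ == "__main__":
    # Toy stand-in for a real classifier: longer text scores as "more AI-like".
    toy_detector = lambda text: min(len(text) / 2000, 1.0)
    reviews = ["This paper studies ...", "The experiments are thorough ..."]
    print(f"False-positive rate: {false_positive_rate(toy_detector, reviews):.2f}")
```

In this framing, a polished review flagged above the threshold corresponds to a potential false accusation under a polishing-only policy, which is the failure mode the paper highlights.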
Merits
Contribution to the field
The study provides a comprehensive evaluation of current LLM detection methods and highlights the need for more accurate approaches.
Methodological rigor
The authors assembled a dataset of peer reviews simulating multiple levels of human-AI collaboration and evaluated five state-of-the-art detectors, including two commercial systems.
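As a rough illustration of how such a dataset could be simulated, the sketch below polishes a human-written review with an LLM via the OpenAI Python SDK. The model name, prompt wording, and choice of client are assumptions for illustration only; the paper does not prescribe them.

```python
# Hypothetical sketch of generating "LLM-polished" reviews from human drafts.
# Model name, prompt, and client choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POLISH_PROMPT = (
    "Correct the grammar and lightly polish the wording of the following peer "
    "review. Do not add, remove, or change any technical content:\n\n{review}"
)


def polish_review(human_review: str, model: str = "gpt-4o-mini") -> str:
    """Return a grammar-corrected, lightly paraphrased version of a human review."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": POLISH_PROMPT.format(review=human_review)}],
    )
    return response.choices[0].message.content

# Each (original, polished) pair can then be labeled with its collaboration level
# (fully human, polished, fully AI) and scored by the detectors under test.
```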
Demerits
Limited generalizability
The study focuses on a single simulated dataset, and it is unclear whether the findings generalize to other venues, disciplines, or reviewing contexts.
Need for further research
The study demonstrates the limitations of current detection methods but does not offer an approach that meets the accuracy standards required for enforcement.
Expert Commentary
This study is a significant contribution, providing a comprehensive evaluation of current AI-text detection methods in the peer-review setting. Its limitations should nonetheless be acknowledged, including the uncertain generalizability of the findings and the need for further research. Beyond detection accuracy, the work raises questions about the ethics of AI use in academic settings and about how conferences and journals should word and enforce their policies. For the academic community and policy makers, the practical message is that current detectors cannot safely be used to adjudicate alleged violations of polishing-only policies.
Recommendations
- ✓ Develop and evaluate more accurate LLM detection methods that can distinguish between human-written and LLM-polished peer reviews.
- ✓ Until detectors meet the accuracy standards required for identifying AI use in peer reviews, revise policies and enforcement practices accordingly.
Sources
Original: arXiv - cs.CL