Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable
arXiv:2603.20450v1 Abstract: A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But, are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI generated, potentially overstating the extent of policy violations.
Executive Summary
This study assesses whether policies prohibiting the use of Large Language Models (LLMs) in peer review, except for polishing, paraphrasing, and grammar correction, can actually be enforced. The authors assembled a dataset of peer reviews simulating multiple levels of human-AI collaboration and evaluated five state-of-the-art detectors, including two commercial systems. All detectors misclassified a non-trivial fraction of LLM-polished reviews as AI-generated, exposing the limitations of current detection methods and the risk of false accusations of academic misconduct. The study further suggests that recent public estimates of AI use in peer reviews may be overstated, because detectors tend to label mixed human-AI reviews as fully AI-generated. The findings carry significant implications for the academic community and policy makers, underscoring the need for more accurate detection methods and revised policies.
Key Points
- ▸ Current AI-text detectors cannot reliably distinguish human-written from LLM-polished peer reviews; all five evaluated detectors flag a non-trivial fraction of polished reviews as AI-generated (see the measurement sketch after this list).
- ▸ Peer-review-specific signals, such as access to the reviewed manuscript and the constrained domain of scientific writing, yield measurable gains in some settings, but the resulting methods still fall short of the required accuracy standards.
- ▸ Recent estimates of AI use in peer reviews may be overstated, because detectors misclassify mixed (collaborative human-AI) reviews as fully AI-generated.
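To make the first point concrete, below is a minimal sketch of how a detector's false-positive rate on LLM-polished (policy-compliant) reviews could be estimated. The detector interface, decision threshold, and toy data are illustrative assumptions, not the paper's actual evaluation code.

```python
# Minimal sketch: estimating how often a detector flags LLM-polished human
# reviews as AI-generated. The detector interface, threshold, and data layout
# are illustrative assumptions, not the paper's setup.
from typing import Callable, Sequence


def false_positive_rate(
    detector: Callable[[str], float],   # returns P(text is AI-generated) in [0, 1]
    polished_reviews: Sequence[str],    # human-written reviews after LLM polishing
    threshold: float = 0.5,             # decision threshold for flagging a review
) -> float:
    """Fraction of LLM-polished (policy-compliant) reviews flagged as AI-generated."""
    flagged = sum(detector(r) >= threshold for r in polished_reviews)
    return flagged / max(len(polished_reviews), 1)


if __name__ == "__main__":
    # Toy stand-in for a real classifier: longer text scores as "more AI-like".
    toy_detector = lambda text: min(len(text) / 2000, 1.0)
    reviews = ["This paper studies ...", "The experiments are thorough ..."]
    print(f"False-positive rate: {false_positive_rate(toy_detector, reviews):.2f}")
```

In this framing, a polished review flagged above the threshold corresponds to a potential false accusation under a polishing-only policy, which is the failure mode the paper highlights.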
Merits
Contribution to the field
The study provides a comprehensive evaluation of current LLM detection methods and highlights the need for more accurate approaches.
Methodological rigor
The authors assembled a dataset of peer reviews simulating multiple levels of human-AI collaboration and evaluated five state-of-the-art detectors, including two commercial systems.
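As a rough illustration of how such a dataset could be simulated, the sketch below polishes a human-written review with an LLM via the OpenAI Python SDK. The model name, prompt wording, and choice of client are assumptions for illustration only; the paper does not prescribe them.

```python
# Hypothetical sketch of generating "LLM-polished" reviews from human drafts.
# Model name, prompt, and client choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POLISH_PROMPT = (
    "Correct the grammar and lightly polish the wording of the following peer "
    "review. Do not add, remove, or change any technical content:\n\n{review}"
)


def polish_review(human_review: str, model: str = "gpt-4o-mini") -> str:
    """Return a grammar-corrected, lightly paraphrased version of a human review."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": POLISH_PROMPT.format(review=human_review)}],
    )
    return response.choices[0].message.content

# Each (original, polished) pair can then be labeled with its collaboration level
# (fully human, polished, fully AI) and scored by the detectors under test.
```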
Demerits
Limited generalizability
The study focuses on a single simulated dataset, and it is unclear whether the findings generalize to other venues, disciplines, or reviewing contexts.
Need for further research
The study demonstrates the limitations of current detection methods but does not offer an approach that meets the accuracy standards required for enforcement.
Expert Commentary
This study is a significant contribution, providing a comprehensive evaluation of current AI-text detection methods in the peer-review setting. Its limitations should nonetheless be acknowledged, including the uncertain generalizability of the findings and the need for further research. Beyond detection accuracy, the work raises questions about the ethics of AI use in academic settings and about how conferences and journals should word and enforce their policies. For the academic community and policy makers, the practical message is that current detectors cannot safely be used to adjudicate alleged violations of polishing-only policies.
Recommendations
- ✓ Develop and evaluate more accurate LLM detection methods that can distinguish between human-written and LLM-polished peer reviews.
- ✓ Until detectors meet the accuracy standards required for identifying AI use in peer reviews, revise policies and enforcement practices accordingly.
Sources
Original: arXiv - cs.CL