FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
arXiv:2604.04074v1 Announce Type: new Abstract: Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at https://github.com/DEFENSE-SEU/Review-Assistant.
Executive Summary
FactReview addresses critical inefficiencies in machine learning peer review by introducing an evidence-grounded system that mitigates biases arising from manuscript presentation and limited reviewer time. The system extracts major claims, contextualizes them within related work, and verifies empirical claims through execution-based testing when code is available. In a case study of CompGCN, FactReview successfully reproduced results for link prediction and node classification but identified only partial support for the paper's broader performance claim: on MUTAG graph classification, the reproduced result (88.4%) fell short of the strongest baseline reported in the paper (92.6%). The authors argue that AI should augment, rather than replace, human reviewers by automating evidence gathering and enabling more rigorous, evidence-based assessments. This approach has significant potential to enhance the reliability and objectivity of peer review in high-volume research domains.
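The five-label scheme can be illustrated with a small sketch. The function name, tolerance thresholds, and decision rule below are hypothetical (the paper does not publish this logic); the sketch only shows how a reproduced metric, the paper's reported number, and the strongest reported baseline might map to a label:

```python
from enum import Enum
from typing import Optional

class Verdict(Enum):
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"  # manuscript-internal evidence only
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"

def label_claim(reported: float, reproduced: Optional[float],
                baseline: Optional[float] = None, tol: float = 1.0) -> Verdict:
    """Map one empirical claim to a verdict (hypothetical decision rule).

    reported   -- metric value the paper claims (e.g. accuracy in %)
    reproduced -- value from re-running the released code, or None if
                  execution was not possible within budget
    baseline   -- strongest competing number reported in the paper, when the
                  claim asserts superiority over baselines
    tol        -- tolerance for treating two runs as matching
    """
    if reproduced is None:
        return Verdict.INCONCLUSIVE  # no execution evidence either way
    if abs(reproduced - reported) <= tol:
        # The paper's own number reproduces; check any superiority claim.
        if baseline is not None and reproduced < baseline:
            return Verdict.PARTIALLY_SUPPORTED
        return Verdict.SUPPORTED
    if abs(reproduced - reported) > 5 * tol:
        return Verdict.IN_CONFLICT  # reproduction clearly contradicts the paper
    return Verdict.PARTIALLY_SUPPORTED
```

On the CompGCN case study's MUTAG numbers, `label_claim(reported=88.4, reproduced=88.4, baseline=92.6)` yields `PARTIALLY_SUPPORTED`: the reproduction matches the paper's own figure, but the superiority claim does not hold. The `SUPPORTED_BY_PAPER` label, used when only manuscript-internal evidence is available, falls outside this purely numeric sketch.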
Key Points
- ▸ FactReview automates evidence-grounded peer review by combining claim extraction, literature positioning, and execution-based verification.
- ▸ The system addresses limitations of LLM-based reviewers by reducing sensitivity to manuscript presentation and enabling verification of claims against external evidence, including released code.
- ▸ A case study on CompGCN demonstrates FactReview’s ability to reproduce results while identifying gaps in broader performance claims, highlighting its utility in evidence-based assessment.
Merits
Innovative Integration of Verification Methods
FactReview uniquely combines claim extraction, literature positioning, and execution-based verification, addressing key gaps in traditional LLM-based reviewing systems that rely solely on manuscript narrative.
Empirical Rigor and Reproducibility
The system’s execution-based verification, particularly when code is available, provides a robust mechanism for validating empirical claims, reducing reliance on potentially biased or incomplete reporting.
Scalability and Efficiency
By automating evidence gathering and preliminary analysis, FactReview mitigates the growing burden of peer review in high-volume fields like machine learning, where reviewer time is a limiting factor.
Demerits
Dependence on Code Availability and Quality
FactReview’s execution-based verification is contingent on the availability and quality of released code; many submissions provide no runnable code, limiting the system’s applicability in those cases.
Potential for Over-Reliance on Automated Assessments
While FactReview is designed as a tool to assist reviewers, there is a risk that its outputs could be misinterpreted or over-relied upon in decision-making processes, particularly if reviewers lack the expertise to contextualize its findings.
Limited Generalizability of Case Study Evidence
The case study on CompGCN provides valuable insights but may not fully capture the system’s performance across diverse machine learning subfields, where claim structures and evaluation protocols vary significantly.
Expert Commentary
FactReview represents a significant step forward in the evolution of AI-augmented peer review, offering a novel approach to evidence gathering and claim verification that addresses critical gaps in traditional reviewing systems. Its integration of literature positioning and execution-based verification is particularly commendable, as it moves beyond the limitations of LLM-based reviewers that often operate in a vacuum, disconnected from external evidence. The case study on CompGCN is insightful, demonstrating not only the system’s ability to reproduce results but also its capacity to identify nuanced gaps in broader performance claims, a level of granularity that human reviewers might overlook due to time constraints or cognitive biases.

However, the system’s dependence on code availability and quality is a notable limitation, as it may exclude many submissions from rigorous verification. Additionally, while FactReview is positioned as a tool to assist reviewers, the potential for over-reliance on automated outputs, even if unintentional, poses ethical and practical risks. The paper’s emphasis on AI as an augmentation tool rather than a replacement for human judgment is well-founded, but further research is needed to explore how such systems can be deployed responsibly in high-stakes academic environments.

Overall, FactReview is a valuable contribution to the field, with the potential to enhance the objectivity and efficiency of peer review, provided its limitations are carefully managed.
Recommendations
- ✓ Develop hybrid review workflows that integrate FactReview’s automated evidence gathering with human expert judgment to ensure robustness and accountability in decision-making.
- ✓ Expand the system’s capabilities to include broader forms of evidence verification, such as statistical re-analysis of reported results or cross-validation against external datasets, to further enhance its utility.
- ✓ Conduct large-scale evaluations across diverse machine learning subfields to assess FactReview’s generalizability and identify areas for improvement, particularly for submissions lacking code or data.
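As a concrete instance of the statistical re-analysis recommended above, a reviewer-side tool could bootstrap a confidence interval over per-seed scores before declaring a performance gap meaningful. The helper below is a hypothetical sketch, not part of FactReview, and the example scores are illustrative rather than taken from the paper:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the re-analysis is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative per-seed accuracies scattered around a reproduced 88.4% mean.
runs = [88.1, 88.6, 88.4, 88.2, 88.7]
lo, hi = bootstrap_ci(runs)
```

If a reported baseline such as 92.6% lies well outside the interval `[lo, hi]`, the observed gap is unlikely to be an artifact of seed-to-seed noise, which strengthens a "partially supported" verdict.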
Sources
Original: arXiv - cs.AI