GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
arXiv:2603.11442v1 Announce Type: new Abstract: Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors -- invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human--LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
Executive Summary
This article presents GPT4o-Receipt, a novel dataset and human study for AI-generated document forensics. The authors investigated the ability of both humans and machines to detect AI-generated financial documents. The results reveal a paradox: humans are better at perceiving AI artifacts, but worse at detecting AI documents. This discrepancy is attributed to the dominant forensic signals in AI-generated receipts being arithmetic errors, which are invisible to human visual inspection but verifiable by machines in milliseconds. The study highlights the need for a more nuanced evaluation framework, as simple accuracy metrics are insufficient for detector selection. The authors release their framework, dataset, and results publicly to support future research in AI document forensics.
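The arithmetic verification that machines perform "in milliseconds" can be illustrated with a short sketch. This is a hypothetical illustration, not the paper's implementation: the field layout, the `(quantity, unit_price, line_total)` tuple format, and the rounding tolerance are all assumptions.

```python
# Hypothetical sketch of receipt arithmetic verification -- the forensic
# signal the study identifies as dominant in AI-generated receipts.
# Field names and tolerance are illustrative assumptions, not the paper's code.
def verify_receipt_arithmetic(items, subtotal, tax, total, tol=0.01):
    """Return a list of arithmetic inconsistencies in a parsed receipt.

    items: list of (quantity, unit_price, line_total) tuples.
    """
    errors = []
    computed_subtotal = 0.0
    for i, (qty, unit_price, line_total) in enumerate(items):
        # Each line item must satisfy quantity * unit price = line total.
        if abs(qty * unit_price - line_total) > tol:
            errors.append(f"line {i}: {qty} x {unit_price} != {line_total}")
        computed_subtotal += line_total
    # The printed subtotal must equal the sum of the line totals.
    if abs(computed_subtotal - subtotal) > tol:
        errors.append(f"subtotal mismatch: {computed_subtotal:.2f} != {subtotal}")
    # The printed total must equal subtotal plus tax.
    if abs(subtotal + tax - total) > tol:
        errors.append(f"total mismatch: {subtotal} + {tax} != {total}")
    return errors
```

A nonempty error list flags a receipt whose numbers do not add up, exactly the cue that is invisible to visual inspection but trivial to check programmatically.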
Key Points
- ▸ GPT4o-Receipt is a benchmark dataset for AI-generated document forensics
- ▸ Humans are better at perceiving AI artifacts, but worse at detecting AI documents
- ▸ Arithmetic errors are the dominant forensic signals in AI-generated receipts
- ▸ Simple accuracy metrics are insufficient for detector selection
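The last point, that raw accuracy is insufficient for detector selection, can be made concrete with a toy example. The numbers below are invented for illustration and do not come from the paper: on an imbalanced split, a detector that never flags anything still posts high accuracy while its F1 on the AI class collapses to zero.

```python
# Illustrative (invented) example: why accuracy alone misleads detector
# selection on imbalanced data, motivating F1 and calibration metrics.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 10 receipts, 2 AI-generated (label 1). A degenerate detector that always
# answers "authentic" scores 80% accuracy yet 0.0 F1 on the AI class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_real = [0] * 10
print(accuracy(y_true, always_real))  # 0.8
print(f1(y_true, always_real))        # 0.0
```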
Merits
Innovative Dataset
GPT4o-Receipt is a novel and comprehensive dataset for AI-generated document forensics, providing a valuable resource for researchers and practitioners.
Insightful Results
The study's findings reveal a significant paradox in AI document forensics, highlighting the need for a more nuanced evaluation framework.
Demerits
Methodological Limitations
The study relies on a relatively small sample size of 30 annotators, which may limit the generalizability of the results.
Lack of Contextual Understanding
The study focuses primarily on visual and arithmetic aspects of AI-generated documents, neglecting potential contextual and semantic differences.
Expert Commentary
The GPT4o-Receipt study offers a significant contribution to AI document forensics, showing that how a detector succeeds matters as much as how often it does. However, its limitations, a 30-annotator pool and a focus on visual and arithmetic cues at the expense of contextual and semantic ones, should be addressed in future research. The findings carry practical and policy weight for the development and deployment of AI-generated content detection. As the field evolves, more accurate and reliable evaluation frameworks, together with regulatory frameworks that ensure responsible development and deployment, will be essential.
Recommendations
- ✓ Future research should focus on developing more sophisticated evaluation frameworks that account for contextual and semantic differences in AI-generated documents.
- ✓ Regulatory frameworks should be established to ensure the responsible development and deployment of AI-generated document forensics tools.