GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
arXiv:2603.11442v1 Announce Type: new Abstract: Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors -- invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human--LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
Executive Summary
This article presents GPT4o-Receipt, a novel dataset and human study for AI-generated document forensics. The authors investigated the ability of both humans and machines to detect AI-generated financial documents. The results reveal a paradox: humans are better at perceiving AI artifacts, but worse at detecting AI documents. This discrepancy is attributed to the dominant forensic signals in AI-generated receipts being arithmetic errors, which are invisible to human visual inspection but verifiable by machines in milliseconds. The study highlights the need for a more nuanced evaluation framework, as simple accuracy metrics are insufficient for detector selection. The authors release their framework, dataset, and results publicly to support future research in AI document forensics.
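The arithmetic verification that machines perform "in milliseconds" can be illustrated with a short sketch. This is a hypothetical illustration, not the paper's implementation: the field layout, the `(quantity, unit_price, line_total)` tuple format, and the rounding tolerance are all assumptions.

```python
# Hypothetical sketch of receipt arithmetic verification -- the forensic
# signal the study identifies as dominant in AI-generated receipts.
# Field names and tolerance are illustrative assumptions, not the paper's code.
def verify_receipt_arithmetic(items, subtotal, tax, total, tol=0.01):
    """Return a list of arithmetic inconsistencies in a parsed receipt.

    items: list of (quantity, unit_price, line_total) tuples.
    """
    errors = []
    computed_subtotal = 0.0
    for i, (qty, unit_price, line_total) in enumerate(items):
        # Each line item must satisfy quantity * unit price = line total.
        if abs(qty * unit_price - line_total) > tol:
            errors.append(f"line {i}: {qty} x {unit_price} != {line_total}")
        computed_subtotal += line_total
    # The printed subtotal must equal the sum of the line totals.
    if abs(computed_subtotal - subtotal) > tol:
        errors.append(f"subtotal mismatch: {computed_subtotal:.2f} != {subtotal}")
    # The printed total must equal subtotal plus tax.
    if abs(subtotal + tax - total) > tol:
        errors.append(f"total mismatch: {subtotal} + {tax} != {total}")
    return errors
```

A nonempty error list flags a receipt whose numbers do not add up, exactly the cue that is invisible to visual inspection but trivial to check programmatically.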
Key Points
- ▸ GPT4o-Receipt is a benchmark dataset for AI-generated document forensics
- ▸ Humans are better at perceiving AI artifacts, but worse at detecting AI documents
- ▸ Arithmetic errors are the dominant forensic signals in AI-generated receipts
- ▸ Simple accuracy metrics are insufficient for detector selection
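The last point, that raw accuracy is insufficient for detector selection, can be made concrete with a toy example. The numbers below are invented for illustration and do not come from the paper: on an imbalanced split, a detector that never flags anything still posts high accuracy while its F1 on the AI class collapses to zero.

```python
# Illustrative (invented) example: why accuracy alone misleads detector
# selection on imbalanced data, motivating F1 and calibration metrics.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 10 receipts, 2 AI-generated (label 1). A degenerate detector that always
# answers "authentic" scores 80% accuracy yet 0.0 F1 on the AI class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_real = [0] * 10
print(accuracy(y_true, always_real))  # 0.8
print(f1(y_true, always_real))        # 0.0
```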
Merits
Innovative Dataset
GPT4o-Receipt is a novel and comprehensive dataset for AI-generated document forensics, providing a valuable resource for researchers and practitioners.
Insightful Results
The study's findings reveal a significant paradox in AI document forensics, highlighting the need for a more nuanced evaluation framework.
Demerits
Methodological Limitations
The study relies on a relatively small sample size of 30 annotators, which may limit the generalizability of the results.
Lack of Contextual Understanding
The study focuses primarily on visual and arithmetic aspects of AI-generated documents, neglecting potential contextual and semantic differences.
Expert Commentary
The GPT4o-Receipt study offers a significant contribution to AI document forensics, showing that how a detector succeeds matters as much as how often it does. However, its limitations, a 30-annotator pool and a focus on visual and arithmetic cues at the expense of contextual and semantic ones, should be addressed in future research. The findings carry practical and policy weight for the development and deployment of AI-generated content detection. As the field evolves, more accurate and reliable evaluation frameworks, together with regulatory frameworks that ensure responsible development and deployment, will be essential.
Recommendations
- ✓ Future research should focus on developing more sophisticated evaluation frameworks that account for contextual and semantic differences in AI-generated documents.
- ✓ Regulatory frameworks should be established to ensure the responsible development and deployment of AI-generated document forensics tools.