
DISCO: Document Intelligence Suite for COmparative Evaluation

Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi

arXiv:2603.23511v1 Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, which evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.

Executive Summary

DISCO (Document Intelligence Suite for COmparative Evaluation) presents a novel framework for benchmarking document intelligence systems, specifically optical character recognition (OCR) pipelines and vision-language models (VLMs), across heterogeneous document types such as handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. By decoupling parsing from question answering, DISCO reveals significant performance disparities contingent on document structure and reasoning demands. The study finds that OCR excels on text-heavy, long, or multi-page documents where grounded text extraction is critical, whereas VLMs adapt better to visually rich or multilingual contexts. Task-aware prompting yields inconsistent benefits, suggesting that model selection and prompting strategies must be tailored to document-specific characteristics. This research provides actionable empirical guidance for practitioners and researchers optimizing document processing workflows.
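To make the decoupling concrete, the sketch below scores a hypothetical extractor separately on parsing (character error rate against a reference transcription) and on question answering (exact-match accuracy), in the spirit of DISCO's two-track evaluation. The interface and metric choices here are illustrative assumptions, not the paper's actual API or scoring rules.

```python
# Hypothetical sketch of decoupled parsing vs. QA scoring, mirroring the
# separation DISCO makes between the two tasks. Names (Extractor,
# parse_score, qa_score) are assumptions, not the paper's interface.
from typing import Protocol


class Extractor(Protocol):
    def parse(self, doc: bytes) -> str: ...             # extracted full text
    def answer(self, doc: bytes, question: str) -> str: ...


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def parse_score(system: Extractor, doc: bytes, reference: str) -> float:
    """Character error rate for the parsing track: lower is better."""
    hyp = system.parse(doc)
    return edit_distance(hyp, reference) / max(len(reference), 1)


def qa_score(system: Extractor, doc: bytes,
             qa_pairs: list[tuple[str, str]]) -> float:
    """Exact-match accuracy for the QA track over (question, gold) pairs."""
    hits = sum(system.answer(doc, q).strip().lower() == a.strip().lower()
               for q, a in qa_pairs)
    return hits / max(len(qa_pairs), 1)
```

Scoring the two tracks independently is what lets a benchmark attribute a QA failure to bad extraction versus bad reasoning, which is the core diagnostic value of this design.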

Key Points

  • DISCO introduces a modular evaluation framework that separately assesses OCR and VLM performance across diverse document types, enabling task-specific benchmarking.
  • Empirical results indicate OCR pipelines outperform VLMs in text-heavy, long, or multi-page documents due to their reliance on explicit text grounding for reasoning.
  • VLMs demonstrate superior adaptability in visually complex or multilingual documents, highlighting their strength in layout and semantic understanding.
  • Task-aware prompting produces mixed outcomes, with improvements in some document types but degradation in others, underscoring the need for context-sensitive prompting strategies.

Merits

Comprehensive Benchmarking Framework

DISCO provides a nuanced, task- and document-aware evaluation suite that decouples parsing and reasoning, offering granular insights into the strengths and weaknesses of OCR and VLM systems across diverse document types.

Empirical Rigor and Practical Relevance

The study’s systematic evaluation across a wide range of document types—including underrepresented categories like handwritten text and infographics—delivers actionable guidance for real-world document processing applications.

Paradigm Shift in Model Selection

By empirically demonstrating the contextual superiority of OCR for text-heavy tasks and VLMs for visually rich or multilingual content, DISCO challenges one-size-fits-all approaches and advocates for adaptive, document-aware model selection.
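As a minimal sketch of what such document-aware selection could look like in practice, the router below encodes the reported tendencies: OCR pipelines for handwriting and long or multi-page documents, VLMs for multilingual or visually rich ones. The trait names, thresholds, and precedence order are assumptions for illustration, not a decision rule published by the paper.

```python
# Hypothetical document-aware router based on DISCO's reported tendencies.
# Traits and thresholds are illustrative assumptions, not published rules.
from dataclasses import dataclass


@dataclass
class DocTraits:
    pages: int
    handwritten: bool
    multilingual: bool
    visually_rich: bool  # e.g., infographics, dense layouts


def choose_pipeline(traits: DocTraits) -> str:
    # Explicit text grounding helps on handwriting and long documents.
    if traits.handwritten or traits.pages > 1:
        return "ocr_pipeline"
    # VLMs tend to do better on multilingual text and complex layouts.
    if traits.multilingual or traits.visually_rich:
        return "vlm"
    return "ocr_pipeline"  # conservative default for plain text-heavy docs


print(choose_pipeline(DocTraits(pages=5, handwritten=False,
                                multilingual=True, visually_rich=False)))
# -> "ocr_pipeline": in this toy rule, multi-page outweighs multilingual
```

Even a heuristic this crude illustrates the paper's point: the right choice is a function of document characteristics, and conflicts between traits (multi-page and multilingual, say) are exactly where better empirical guidance is needed.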

Demerits

Limited Generalizability to Extremely Niche Document Types

While DISCO covers a broad range of document types, it may not fully capture the nuances of highly specialized or domain-specific documents (e.g., legal contracts with intricate clause structures or scientific papers with dense mathematical notation), where performance could deviate significantly.

Potential Overreliance on Synthetic or Semi-Synthetic Datasets

The robustness of DISCO’s findings depends on the representativeness of its benchmark datasets. If these datasets lack real-world variability or edge cases (e.g., heavily degraded documents, rare scripts), the generalizability of the results may be compromised.

Ambiguity in Task-Aware Prompting Mechanisms

The study notes mixed effects of task-aware prompting without delving into the underlying mechanisms. Further research is needed to identify why prompting helps or hinders performance in specific contexts, which could inform the development of more stable prompting strategies.
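To pin down what "task-aware prompting" might mean mechanically, one plausible form is a per-document-type instruction template, as in the sketch below. The type labels and template wording are invented for illustration and are not the prompts evaluated in the paper.

```python
# Hypothetical task-aware prompt construction: the instruction is
# conditioned on the document type. Templates are invented examples,
# not the prompts DISCO tested.
GENERIC_PROMPT = "Answer the question using the document.\n\nQ: {question}"

TASK_AWARE_PROMPTS = {
    "table":       "The document contains tables. Locate the relevant cell "
                   "before answering.\n\nQ: {question}",
    "handwritten": "The document is handwritten; the transcription may "
                   "contain errors. Answer cautiously.\n\nQ: {question}",
    "infographic": "Use both the visual layout and the text to answer."
                   "\n\nQ: {question}",
}


def build_prompt(question: str, doc_type: str | None = None) -> str:
    """Fall back to the generic prompt when the type is unknown."""
    template = TASK_AWARE_PROMPTS.get(doc_type, GENERIC_PROMPT)
    return template.format(question=question)
```

The mixed results DISCO reports suggest that extra instructions like these can distract a model as easily as help it; isolating which template wins on which document type, and why, is the natural follow-up experiment.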

Expert Commentary

DISCO represents a significant advance in the evaluation of document intelligence systems by moving beyond monolithic benchmarks to a granular, task- and document-aware framework. The study's empirical rigor is commendable, particularly its inclusion of handwritten text and infographics, categories often overlooked in mainstream evaluations. The findings challenge the prevailing assumption that VLMs are uniformly superior across all document types, instead advocating a hybrid approach in which OCR's text-grounding capabilities are leveraged for dense, multi-page documents while VLMs handle visually complex or multilingual content. This shift is timely, given the growing adoption of document intelligence in high-stakes sectors like legal and healthcare. However, to the extent that the benchmark relies on synthetic or semi-synthetic data, it carries a potential blind spot: real-world documents often exhibit noise, distortions, or idiosyncratic structures that may not be fully captured. Future work should expand DISCO's scope to include edge cases and domain-specific documents to enhance its practical applicability. Additionally, the mixed effects of task-aware prompting warrant deeper investigation, as a more systematic understanding of prompting mechanisms could unlock further performance gains.

Recommendations

  • Expand DISCO’s benchmarking suite to include highly specialized document types (e.g., legal contracts, scientific papers) and real-world edge cases (e.g., degraded documents, rare scripts) to improve robustness and generalizability.
  • Develop a standardized protocol for task-aware prompting that defines when and how to apply prompting strategies based on document characteristics, enabling more consistent and replicable results across applications.
  • Collaborate with standards bodies (e.g., ISO, NIST) to integrate DISCO’s framework into certification processes for document processing systems, ensuring alignment with industry best practices and regulatory requirements.
  • Invest in research to improve OCR and VLM performance for underrepresented scripts and languages, reducing disparities in document intelligence accessibility and supporting global digital inclusion efforts.
  • Explore the integration of explainability tools within DISCO to provide users with transparent insights into model decisions, particularly in high-stakes applications where interpretability is critical.

Sources

Original: arXiv - cs.CL