Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities
arXiv:2603.13651v1

Abstract: Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.
Executive Summary
This article presents a benchmarking study of large language models (LLMs) on reference extraction and parsing in the Social Sciences and Humanities (SSH). The study evaluates three tasks of increasing difficulty across three datasets, comparing a strong supervised pipeline baseline (GROBID) against contemporary LLMs. The results show that extraction performance largely saturates beyond a moderate model-capability threshold, while reference parsing and end-to-end document parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. The study also finds that lightweight LoRA adaptation and segmentation/pipelining yield consistent gains in robustness.
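The "schema-constrained setup" mentioned above means model outputs must conform to a fixed reference schema, so malformed JSON or stray fields count against the model rather than being silently repaired. A minimal sketch of such a check is shown below; the field names (`authors`, `title`, `year`, etc.) are illustrative assumptions, since the paper's exact schema is not reproduced here.

```python
# Hypothetical minimal schema for one parsed reference. The actual field
# set used in the benchmark is not specified in this summary; these names
# are illustrative only.
REQUIRED_FIELDS = {"authors", "title", "year"}
OPTIONAL_FIELDS = {"container", "pages", "publisher"}

def validate_reference(record: dict) -> list[str]:
    """Return a list of schema violations for one model-emitted reference."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    unknown = record.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    if "year" in record and not str(record["year"]).isdigit():
        errors.append("year is not numeric")
    return errors

# A well-formed record passes; a malformed one is flagged rather than
# silently accepted -- the kind of strictness a schema-constrained
# evaluation relies on.
ok = validate_reference({"authors": ["Doe, J."], "title": "A Study", "year": "1999"})
bad = validate_reference({"title": "A Study"})
```

Under noisy layouts an LLM is more likely to emit records that fail checks like these, which is one concrete way "structured-output brittleness" shows up in scores.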
Key Points
- ▸ Benchmarking study of LLMs for reference extraction and parsing under SSH-realistic conditions
- ▸ Three tasks of increasing difficulty (extraction, parsing, end-to-end document parsing) evaluated across three datasets (CEX, EXCITE, LinkedBooks)
- ▸ Supervised pipeline baseline (GROBID) compared against contemporary LLMs under a schema-constrained setup
Merits
Comprehensive Evaluation
The study provides a thorough evaluation of LLMs across multiple tasks and datasets, offering valuable insights into their strengths and weaknesses.
Realistic Dataset Selection
The datasets used in the study reflect the complexities of SSH references, including multilingualism, footnote-only regimes, and heterogeneous historical conventions.
Demerits
Limited Generalizability
The study's focus on SSH may limit the generalizability of its findings to other disciplines or domains.
Structured-Output Brittleness
The study highlights the brittleness of structured-output under noisy layouts, which may be a limitation of current LLM architectures.
Expert Commentary
This study contributes significantly to our understanding of the capabilities and limitations of LLMs in reference extraction and parsing. The findings highlight the importance of considering the complexities of SSH references and the need for more robust and adaptable LLM architectures. The study's emphasis on hybrid deployment via routing, leveraging both supervised pipeline baselines and task-adapted LLMs, offers a promising approach to improving the accuracy and efficiency of citation indexing and knowledge-graph construction.
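The hybrid routing strategy described above can be sketched as a cheap metadata-based dispatcher. The field names (`language`, `has_footnote_refs`, `layout_noise`) and thresholds below are assumptions for illustration; in practice they would come from a fast pre-classification pass over the document, and the routing criteria would be tuned empirically.

```python
def route_document(doc_meta: dict) -> str:
    """Hypothetical router: cheap metadata checks decide which parser runs.

    Returns "grobid" for well-structured, in-distribution PDFs and "llm"
    for the hard SSH cases the study identifies (multilingual text,
    footnote-embedded references, noisy layouts).
    """
    if doc_meta.get("language", "en") != "en":
        return "llm"  # multilingual documents escalate to a task-adapted LLM
    if doc_meta.get("has_footnote_refs", False):
        return "llm"  # footnote-only / mixed citation regimes
    if doc_meta.get("layout_noise", 0.0) > 0.5:
        return "llm"  # illustrative threshold on a layout-noise score
    return "grobid"   # clean end-of-document bibliography: stay on the pipeline

multilingual = route_document({"language": "de"})
clean = route_document({"language": "en", "has_footnote_refs": False})
```

The design point is that the common, easy case stays on the cheaper supervised pipeline, and only documents matching known failure modes pay the cost of LLM inference.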
Recommendations
- ✓ Further research should focus on developing more robust and adaptable LLM architectures that can handle the complexities of SSH references.
- ✓ The development of hybrid deployment strategies that leverage the strengths of both supervised pipeline baselines and task-adapted LLMs should be prioritized.