
Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies


Siddhant Kulkarni, Yukta Kulkarni

arXiv:2603.22651v1 Announce Type: new Abstract: The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with limited empirical guidance. We present a systematic benchmark comparing four multi-agent orchestration architectures: sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker, and reflexive self-correcting loop. These are evaluated across five frontier and open-weight LLMs on a corpus of 10,000 SEC filings (10-K, 10-Q, and 8-K forms). Our evaluation spans 25 extraction field types covering governance structures, executive compensation, and financial metrics, measured along five axes: field-level F1, document-level accuracy, end-to-end latency, cost per document, and token efficiency. We find that reflexive architectures achieve the highest field-level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures occupy the most favorable position on the cost-accuracy Pareto frontier (F1 0.921 at 1.4x cost). We further present ablation studies on semantic caching, model routing, and adaptive retry strategies, demonstrating that hybrid configurations can recover 89% of the reflexive architecture's accuracy gains at only 1.15x baseline cost. Our scaling analysis from 1K to 100K documents per day reveals non-obvious throughput-accuracy degradation curves that inform capacity planning. These findings provide actionable guidance for practitioners deploying multi-agent LLM systems in regulated financial environments.
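The four orchestration patterns named in the abstract can be sketched in a few lines each. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for a real model call, and the stage names are placeholders.

```python
def call_llm(role: str, text: str) -> dict:
    """Hypothetical stand-in for an LLM call extracting fields for `role`."""
    return {f"{role}_field": f"value from {len(text)}-char doc"}

def sequential_pipeline(doc, stages):
    """Stages run one after another; each sees the accumulated fields."""
    fields = {}
    for stage in stages:
        fields.update(call_llm(stage, doc))
    return fields

def parallel_fanout_merge(doc, stages):
    """All stages run independently over the document; results are merged."""
    results = [call_llm(stage, doc) for stage in stages]  # could run concurrently
    merged = {}
    for r in results:
        merged.update(r)
    return merged

def hierarchical_supervisor(doc, stages):
    """A supervisor assigns field groups to workers, then reconciles."""
    call_llm("supervisor", doc)                 # supervisor plans the routing
    fields = {}
    for stage in stages:                        # workers extract their groups
        fields.update(call_llm(stage, doc))
    fields.update(call_llm("reconciler", doc))  # supervisor-side reconciliation
    return fields

def reflexive_loop(doc, stage, max_rounds=3):
    """Extract, then let a critic request re-extraction until it accepts."""
    fields = call_llm(stage, doc)
    for _ in range(max_rounds):
        verdict = call_llm("critic", doc)       # this stub never accepts
        if verdict.get("accept"):
            break
        fields.update(call_llm(stage, doc))     # re-extract after critique
    return fields
```

The extra supervisor, reconciler, and critic calls make the cost premiums reported in the abstract intuitive: hierarchical and reflexive patterns spend additional tokens on coordination and self-correction, not extraction.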

Executive Summary

This article presents a rigorous comparative benchmark of four multi-agent LLM orchestration architectures—sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker, and reflexive self-correcting loop—evaluated across 10,000 SEC filings using five frontier and open-weight LLMs. The study systematically assesses performance across five metrics: field-level F1, document-level accuracy, end-to-end latency, cost per document, and token efficiency. The findings reveal nuanced tradeoffs: reflexive architectures deliver the highest field-level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures offer a more balanced cost-accuracy tradeoff (F1 0.921 at 1.4x cost). Crucially, the authors demonstrate that hybrid configurations can approximate reflexive accuracy gains at significantly lower cost, offering actionable insights for regulated financial deployments. The scaling analysis further informs capacity planning by revealing non-obvious throughput-accuracy degradation curves beyond linear scaling assumptions.
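The cost-accuracy framing above is a standard Pareto analysis: a configuration is on the frontier if no alternative is both cheaper and at least as accurate. A small sketch, using only the two fully specified points from the abstract (the sequential baseline's F1 is not reported there, so it is omitted):

```python
def pareto_frontier(points):
    """Return (relative_cost, f1) points not dominated by any point that
    is cheaper and at least as accurate. Lower cost and higher F1 win."""
    frontier = []
    for cost, f1 in sorted(points):          # ascending cost
        if not frontier or f1 > frontier[-1][1]:
            frontier.append((cost, f1))
    return frontier

# Points from the abstract: hierarchical (1.4x, 0.921), reflexive (2.3x, 0.943).
reported = [(1.4, 0.921), (2.3, 0.943)]
print(pareto_frontier(reported))  # both survive: the extra cost buys extra F1
```

Both reported points survive, which is consistent with the paper's claim that hierarchical occupies the most favorable frontier position while reflexive remains undominated on accuracy alone.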

Key Points

  • Reflexive architectures achieve highest field-level F1 (0.943) at 2.3x cost
  • Hierarchical architectures provide optimal cost-accuracy Pareto position (F1 0.921 at 1.4x cost)
  • Hybrid configurations can recover 89% of reflexive accuracy gains at 1.15x cost
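The hybrid configuration in the third point combines the paper's three ablated techniques: semantic caching, model routing, and adaptive retry. A minimal sketch of how these compose, under stated assumptions: `cheap_model` and `strong_model` are hypothetical stand-ins, and the cache uses exact hashing where a real semantic cache would match on embeddings.

```python
import hashlib

CACHE: dict = {}

def cache_key(doc: str) -> str:
    # A real semantic cache would key on an embedding; exact hashing is
    # used here for simplicity.
    return hashlib.sha256(doc.encode()).hexdigest()

def cheap_model(doc: str):
    """Stand-in cheap extractor returning (fields, confidence)."""
    return {"revenue": "10.2B"}, 0.80

def strong_model(doc: str):
    """Stand-in stronger extractor used only for low-confidence cases."""
    return {"revenue": "10.4B"}, 0.97

def extract(doc: str, confidence_floor: float = 0.9) -> dict:
    key = cache_key(doc)
    if key in CACHE:                     # semantic-cache hit: near-zero cost
        return CACHE[key]
    fields, conf = cheap_model(doc)      # route to the cheap model first
    if conf < confidence_floor:          # adaptive retry on the strong model
        fields, conf = strong_model(doc)
    CACHE[key] = fields
    return fields
```

The design intuition behind the reported 1.15x figure is that most documents resolve at cheap-model or cache cost, and only the low-confidence minority pays the strong-model premium.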

Merits

Comprehensive Benchmark Design

The study employs a multi-metric evaluation across diverse LLM variants and document types, establishing a robust empirical foundation for architectural comparison.

Actionable Tradeoff Insights

By quantifying cost-accuracy-latency intersections, the research provides concrete guidance for production deployment decisions.

Hybrid Solution Validation

The ablation studies validate hybrid architectures as cost-effective alternatives that retain most of the reflexive architecture's accuracy gains at near-baseline cost.

Demerits

Limited Generalizability Risk

Evaluation is constrained to specific SEC filings and LLM versions; results may not extrapolate directly to other regulatory contexts or newer model iterations.

Cost Metrics Ambiguity

Cost-per-document calculations lack granularity on underlying infrastructure or token pricing assumptions, potentially affecting reproducibility.

Expert Commentary

This work represents a significant advancement in empirical evaluation of multi-agent LLM systems for financial information extraction. The authors successfully navigate the tension between algorithmic superiority and operational feasibility by identifying architectural sweet spots that align with real-world constraints. The reflexive architecture’s dominance in accuracy but prohibitive cost underscores a fundamental tradeoff that has been under-acknowledged in prior literature. More importantly, the hybrid architecture ablation studies offer a pragmatic path forward—demonstrating that regulatory compliance need not come at the expense of near-optimal performance. The scaling analysis’s revelation of non-linear throughput degradation curves is particularly valuable; it challenges the assumption that linear scaling equates to linear performance and provides a nuanced framework for enterprise-scale deployment. This research fills a critical gap between academic optimization and production pragmatism, making it indispensable for legal, compliance, and tech teams deploying LLMs in financial domains.

Recommendations

  • Adopt hierarchical supervisor-worker architecture as default baseline for cost-sensitive financial document processing.
  • Integrate hybrid architecture prototypes into pilot deployments to validate cost-accuracy tradeoffs before scaling.
  • Develop standardized metrics for reporting hybrid architecture performance relative to reflexive baselines to enable comparative benchmarking.

Sources

Original: arXiv - cs.AI