MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG
arXiv:2603.23533v1

Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking, treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting a title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5 = 1.000 and MRR = 0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5 = 0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
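The key-based restructuring stage (stage 3) merges chunks that share a semantic key under a size budget. The following first-fit sketch is illustrative only, not the authors' implementation; the function name, the character-based budget (a real pipeline would count tokens), and the separator are assumptions:

```python
from collections import defaultdict

def merge_by_key(chunks: list[tuple[str, str]], budget: int = 512) -> list[str]:
    """Merge chunks sharing a semantic key with first-fit bin packing.

    `chunks` is a list of (semantic_key, text) pairs. Chunks with the same
    key are co-located so retrieval returns a whole topic, not a fragment,
    while `budget` caps the size of each merged chunk.
    """
    by_key: dict[str, list[str]] = defaultdict(list)
    for key, text in chunks:
        by_key[key].append(text)
    merged: list[str] = []
    for key, texts in by_key.items():
        bins: list[list[str]] = []
        for text in texts:
            # First-fit: place the chunk in the first bin with room left.
            for b in bins:
                if sum(len(t) for t in b) + len(text) <= budget:
                    b.append(text)
                    break
            else:
                bins.append([text])
        merged.extend("\n\n".join(b) for b in bins)
    return merged
```

When a key's chunks exceed the budget, first-fit spills them into additional bins, so no merged chunk grows unboundedly even for a dominant topic.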
Executive Summary
The article introduces MDKeyChunker, a three-stage pipeline for enriching Markdown documents with Large Language Models (LLMs). The pipeline performs structure-aware chunking, single-call LLM enrichment, and key-based restructuring to improve retrieval accuracy. The authors evaluate MDKeyChunker empirically on 30 queries over an 18-document Markdown corpus: BM25 over structural chunks (Config D) achieves Recall@5 = 1.000 and MRR = 0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5 = 0.867. The implementation is in Python with four dependencies and supports any OpenAI-compatible endpoint. This work has implications for natural language processing, information retrieval, and knowledge management.
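Stage 1, structure-aware chunking, can be sketched as a Markdown splitter that starts a new chunk at each header while keeping fenced code blocks atomic. This is an illustrative simplification (the paper also treats tables and lists as atomic units); the function name and logic are assumptions, not the released code:

```python
def chunk_markdown(text: str) -> list[str]:
    """Split Markdown at headers, keeping fenced code blocks atomic.

    A header line inside a ``` fence never starts a new chunk, so code
    blocks are not fragmented across chunk boundaries.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # entering or leaving a code fence
            current.append(line)
            continue
        # Start a new chunk at each header, but never inside a fence.
        if not in_fence and line.startswith("#") and current:
            blocks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current).strip())
    return [b for b in blocks if b]
```

Because the splitter tracks fence state line by line, a `#` comment inside a code block stays attached to its section instead of being mistaken for a heading.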
Key Points
- ▸ Structure-aware chunking treats headers, code blocks, tables, and lists as atomic units
- ▸ Single-call LLM enrichment extracts seven metadata fields in one invocation
- ▸ Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching
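The single-call enrichment and rolling-key propagation described above can be sketched as one prompt that requests every field at once, with the dictionary of previously seen keys passed along for document-level context. The field names follow the abstract; the prompt wording and helper functions are illustrative assumptions, not the authors' actual prompt:

```python
import json

# Metadata fields named in the abstract; descriptions are paraphrases.
ENRICHMENT_FIELDS = {
    "title": "a short title for the chunk",
    "summary": "a one-sentence summary",
    "keywords": "a list of salient keywords",
    "entities": "typed entities found in the chunk",
    "questions": "hypothetical questions the chunk answers",
    "semantic_key": "a stable key grouping related chunks",
}

def build_enrichment_prompt(chunk: str, rolling_keys: dict[str, str]) -> str:
    """Assemble one prompt requesting every field in a single LLM call."""
    field_spec = "\n".join(f"- {name}: {desc}"
                           for name, desc in ENRICHMENT_FIELDS.items())
    return (
        "Return one JSON object with these fields:\n"
        f"{field_spec}\n\n"
        "Reuse a known semantic key when the chunk continues an existing "
        "topic; otherwise mint a new one. Known keys so far:\n"
        f"{json.dumps(rolling_keys, indent=2)}\n\n"
        f"Chunk:\n{chunk}"
    )

def propagate_keys(rolling: dict[str, str],
                   semantic_key: str, summary: str) -> dict[str, str]:
    """Fold one chunk's extracted key into the rolling dictionary.

    Keeps the earliest summary per key so later chunks see a stable
    snapshot of document-level context.
    """
    updated = dict(rolling)
    updated.setdefault(semantic_key, summary)
    return updated
```

The response for each chunk would be parsed as JSON and its `semantic_key`/`summary` pair folded into the rolling dictionary before the next chunk's prompt is built, so the model can reuse keys rather than minting near-duplicates.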
Merits
Strength
MDKeyChunker improves on fixed-size chunking by keeping structural units such as code blocks and tables intact, and its single-call design extracts all seven metadata fields in one LLM invocation, avoiding separate per-field extraction passes. Rolling key propagation supplies document-level context through LLM-native semantic matching rather than hand-tuned scoring.
Demerits
Limitation
The pipeline targets Markdown documents only, and the current implementation supports only OpenAI-compatible endpoints. The evaluation is also small (30 queries over 18 documents), and in the reported results plain BM25 over structural chunks (Config D, Recall@5 = 1.000) outperforms dense retrieval over the full enrichment pipeline (Config C, Recall@5 = 0.867), so broader testing is needed to establish the value of the enrichment and restructuring stages.
Expert Commentary
The article presents a timely contribution to natural language processing and information retrieval. The authors clearly understand the weaknesses of fixed-size chunking in RAG pipelines and offer a well-structured remedy: atomic structural units, one enrichment call per chunk, and key-based co-location of related content. The Python implementation, small dependency footprint, and support for any OpenAI-compatible endpoint make the tool accessible to a broad audience. That said, the evaluation is limited, and the strongest reported configuration (BM25 over structural chunks) bypasses the enrichment and restructuring stages entirely, so further testing on larger and more diverse corpora is needed before the full pipeline's value is established.
Recommendations
- ✓ Future research should investigate the generalizability of MDKeyChunker to other LLM platforms and datasets.
- ✓ Since the pipeline already extracts typed entities, the authors should consider adding relation extraction between those entities to further enhance the pipeline's capabilities.
Sources
Original: arXiv - cs.CL