
DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona

Janghyeok Choi, Jaewon Lee, Sungzoon Cho

arXiv:2603.22765v1 Announce Type: new Abstract: Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas--such as attorneys, prosecutors, and judges--to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation achieves improvement in lexical diversity as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall performance compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
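
The abstract measures lexical diversity with Self-BLEU: each generated query is scored as a hypothesis against all other queries in the pool as references, so lower averages mean the queries share fewer n-grams. As a rough illustration only (not the paper's evaluation code), a pure-Python sketch of the metric:

```python
# Illustrative Self-BLEU sketch; a simplified smoothed BLEU, not the
# paper's actual evaluation pipeline.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(references, hypothesis, max_n=4):
    """Geometric mean of clipped n-gram precisions with add-one smoothing."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            return 0.0
        # Clip each n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_prec += math.log((clipped + 1) / (total + 1))
    # Brevity penalty against the closest reference length.
    closest = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    bp = 1.0 if len(hypothesis) >= closest else math.exp(1 - closest / len(hypothesis))
    return bp * math.exp(log_prec / max_n)

def self_bleu(queries):
    """Average BLEU of each query against all the others; lower = more diverse."""
    toks = [q.split() for q in queries]
    return sum(bleu(toks[:i] + toks[i + 1:], h) for i, h in enumerate(toks)) / len(toks)
```

A pool of near-duplicate queries scores high, while queries phrased from different professional angles score low, which is the sense in which the paper reports persona augmentation "improving" Self-BLEU.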

Executive Summary

The article introduces DALDALL, a novel framework that addresses data scarcity in legal domains by leveraging LLM personas to generate synthetic legal queries with enhanced lexical and semantic diversity. Unlike conventional augmentation methods that prioritize volume over quality, DALDALL employs domain-specific personas (e.g., attorneys, prosecutors, judges) to produce more nuanced and contextually relevant synthetic data. Experimental results on CLERC and COLIEE benchmarks confirm that this persona-based approach improves lexical diversity without compromising semantic coherence, and supports improved recall performance in dense retrievers. This marks a significant shift from generic augmentation to domain-aware, persona-driven content generation.

Key Points

  • Use of domain-specific personas for legal query generation
  • Enhanced lexical diversity via persona-based prompting
  • Improved recall performance in dense retrieval models
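
The first key point can be made concrete with a hypothetical sketch of persona-conditioned prompt construction. The persona descriptions and template wording below are illustrative assumptions, not the authors' actual prompts:

```python
# Hypothetical persona-conditioned prompt builder; personas and template
# wording are assumptions for illustration, not DALDALL's real prompts.
PERSONAS = {
    "attorney": "an attorney preparing arguments for a client",
    "prosecutor": "a prosecutor building a case for trial",
    "judge": "a judge researching precedent before ruling",
}

def build_persona_prompt(persona: str, document: str) -> str:
    """Frame a query-generation instruction from a professional viewpoint."""
    role = PERSONAS[persona]
    return (
        f"You are {role}. Based on the legal document below, write the search "
        f"query you would issue to retrieve this document.\n\n"
        f"Document:\n{document}"
    )
```

Sending the same source document through several such persona frames is what yields query sets with higher lexical variety than a single vanilla "generate a query for this document" prompt.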

Merits

Domain-Specific Innovation

The framework’s use of professional personas introduces a targeted, context-aware augmentation strategy that aligns with legal domain semantics, offering a more effective alternative to generic LLM-based augmentation.

Demerits

Limited Extensibility

While effective in legal IR, the persona-based approach may face scalability challenges in non-legal domains or in multi-jurisdictional legal contexts, where a small fixed set of professional personas would need substantial diversification or adaptation.

Expert Commentary

DALDALL represents a sophisticated evolution in data augmentation by integrating domain-specific identity modeling into LLM prompting. The innovation lies not merely in generating more data, but in generating more legally plausible and semantically coherent data—this distinction is critical in legal domains where precision and interpretability carry weight. The empirical validation on recognized benchmarks adds credibility to the claims, and the observed gains in recall suggest tangible operational benefits for downstream applications. This work bridges a gap between theoretical AI augmentation and practical legal application, offering a model that can be adapted to other specialized fields such as healthcare, finance, or intellectual property. Importantly, the authors avoid overstating their findings; instead, they present a pragmatic solution with clear boundaries and measurable outcomes. Their contribution is timely, given the growing reliance on synthetic data in AI-driven legal research and decision support systems.

Recommendations

  1. Legal AI teams should pilot DALDALL or similar persona-based frameworks in their IR training pipelines to evaluate impact on recall and diversity metrics.
  2. Researchers should extend the methodology to incorporate multi-perspective personas (e.g., client vs. regulator) and evaluate cross-domain applicability to other low-resource sectors.

Sources

Original: arXiv - cs.CL