Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
arXiv:2603.23562v1 Abstract: Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods, whether by training on more synthetic tokens or by using stronger generators, yields diminishing returns that fall below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs with synthetic documents. This leverages their complementary training signals and enables log-linear improvements as both synthetic data volume and generator strength increase, allowing the model to outperform RAG by a 2.6% relative gain on QuALITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuALITY, our final recipe trains a Llama 8B model that outperforms RAG by 4.4% relative. Across models and benchmarks (QuALITY, LongHealth, FinanceBench), our training enables models to beat RAG in five of six settings, outperforms it by 2.6%, and achieves a 9.1% gain when combined with RAG.
Executive Summary
This article introduces Synthetic Mixed Training, a novel approach that combines synthetic question-answer pairs (QAs) with synthetic documents to improve knowledge acquisition in language models. By leveraging the complementary training signals of the two data sources, Synthetic Mixed Training yields log-linear performance improvements as synthetic data volume and generator strength increase. The authors also propose Focal Rewriting, a simple technique that conditions document generation on specific questions, enhancing the diversity of the synthetic documents and steepening the scaling curve. Experimental results on QuALITY and other benchmarks demonstrate significant gains over the RAG baseline, showing that Synthetic Mixed Training can break the RAG ceiling.
Key Points
- ▸ Synthetic Mixed Training combines synthetic QAs and synthetic documents for improved language model knowledge acquisition.
- ▸ The Focal Rewriting technique increases the diversity of synthetic documents and steepens the scaling curve.
- ▸ Experimental results demonstrate log-linear improvements and significant gains over the RAG baseline.
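To make the first point concrete, the mixing step can be sketched as interleaving the two synthetic sources into one shuffled training corpus. This is a minimal sketch, not the paper's actual pipeline: the `mix_corpus` helper and the `Q:`/`A:`/`Document:` formats are illustrative assumptions, and the paper's real mixing ratios and data formats are not specified here.

```python
import random

def mix_corpus(qa_pairs, documents, seed=0):
    """Combine synthetic QA pairs and synthetic documents into a single
    shuffled training corpus, so training batches carry both signals.
    (Hypothetical helper illustrating the idea of mixed training.)"""
    examples = [f"Q: {q}\nA: {a}" for q, a in qa_pairs]   # QA training signal
    examples += [f"Document:\n{d}" for d in documents]    # document training signal
    random.Random(seed).shuffle(examples)                 # interleave deterministically
    return examples

corpus = mix_corpus(
    qa_pairs=[("Who wrote the report?", "The audit team.")],
    documents=["The audit team compiled the report in 2023."],
)
```

A fixed seed keeps the interleaving reproducible across runs; in practice one would also tune the QA-to-document ratio rather than concatenating the full pools.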
Merits
Strength in scalability
Synthetic Mixed Training enables log-linear improvements as synthetic data volume and generator strength increase, allowing for scalable knowledge acquisition.
Enhanced diversity in synthetic documents
The Focal Rewriting technique conditions document generation on specific questions, leading to increased diversity in the synthetic documents and a steeper scaling curve.
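A minimal sketch of how such question-conditioning might look, assuming a prompt-based generator: the `focal_rewrite_prompt` helper and its wording are hypothetical illustrations, not the paper's actual template.

```python
def focal_rewrite_prompt(source_doc: str, focal_question: str) -> str:
    """Build a rewriting prompt that conditions document generation on a
    specific question, so the rewrite must preserve the facts that
    question needs while being free to vary style and structure.
    (Illustrative only; not the paper's template.)"""
    return (
        "Rewrite the document below with a new style and structure. "
        "The rewrite must preserve every fact needed to answer:\n"
        f"Question: {focal_question}\n\n"
        f"Document:\n{source_doc}"
    )

prompt = focal_rewrite_prompt(
    source_doc="The plant closed in 1998 after the merger.",
    focal_question="When did the plant close?",
)
```

Conditioning each rewrite on a different focal question pushes the generator toward different surface forms of the same source, which is the diversity mechanism the merit above describes.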
Demerits
Limited domain-specific applicability
Synthetic Mixed Training may only be effective in domains where high-quality synthetic data can be generated, which could limit its applicability elsewhere.
Potential over-reliance on synthetic data
The reliance on synthetic data may lead to overfitting or biased models, highlighting the need for careful evaluation and validation of the generated data.
Expert Commentary
The article presents a significant step forward in language model training, introducing Synthetic Mixed Training as a novel approach to knowledge acquisition. The experimental results demonstrate the method's effectiveness, and the Focal Rewriting technique offers a promising direction for improving diversity and scalability. However, the limits of domain-specific applicability and the risk of over-reliance on synthetic data underline the need for careful evaluation and validation of the generated data. The implications are far-reaching, with potential applications across domains and consequences for how models are trained when in-domain data is scarce.
Recommendations
- ✓ Future research should explore the applicability of Synthetic Mixed Training to different domains and language models.
- ✓ Careful evaluation and validation of synthetic data are essential to ensure the reliability and effectiveness of the generated data in language model training.
Sources
Original: arXiv - cs.LG