Replaying pre-training data improves fine-tuning

Suhas Kotha, Percy Liang

arXiv:2603.04964v1 Announce Type: new Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.

Executive Summary

This study revisits the conventional recipe of pre-training language models on generic web text and then fine-tuning on target-domain data, in which generic data is normally mixed in during fine-tuning only to prevent catastrophic forgetting. Contrary to expectations, the researchers find that replaying pre-training data during fine-tuning improves performance on the target task itself. In controlled experiments (150M-parameter models, 4B total tokens, 4M target tokens), generic replay increases target-data efficiency by up to 1.87 times for fine-tuning and 2.06 times for mid-training. An analysis of data schedules that introduce target data during pre-training shows that replay helps more when less target data is present during pre-training. The findings also hold in practice: fine-tuning 8B-parameter models with replay improves agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%. This research challenges the existing paradigm and offers a novel approach to fine-tuning language models.

Key Points

  • Replaying pre-training data during fine-tuning improves performance on the target task, not merely retention of the generic domain.
  • Generic replay increases target data efficiency by up to 1.87 times for fine-tuning and 2.06 times for mid-training.
  • The benefits of replay are more pronounced when less target data is present during pre-training.
  • The effect carries over to 8B-parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
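The abstract does not specify the paper's exact replay schedule, but the core idea of mixing replayed generic data into fine-tuning batches can be sketched as follows. The `replay_ratio` parameter and the batch-level mixing strategy here are illustrative assumptions, not the authors' implementation:

```python
import random

def mixed_batches(target_examples, generic_examples,
                  replay_ratio=0.5, num_batches=4, batch_size=8, seed=0):
    """Yield fine-tuning batches that interleave target-domain examples
    with replayed generic pre-training examples.

    replay_ratio is the fraction of each batch drawn from the generic
    pool; replay_ratio=0.0 recovers standard target-only fine-tuning.
    """
    rng = random.Random(seed)
    n_generic = int(batch_size * replay_ratio)
    n_target = batch_size - n_generic
    for _ in range(num_batches):
        # Sample with replacement from each pool, then shuffle so the
        # two data sources are interleaved within the batch.
        batch = (rng.choices(target_examples, k=n_target)
                 + rng.choices(generic_examples, k=n_generic))
        rng.shuffle(batch)
        yield batch

# Example: half of each batch is replayed generic web text.
target = [f"math_{i}" for i in range(20)]
generic = [f"web_{i}" for i in range(1000)]
for batch in mixed_batches(target, generic, replay_ratio=0.5, num_batches=2):
    assert sum(x.startswith("web_") for x in batch) == 4
```

In a real training loop each "example" would be a tokenized sequence and the batches would feed a gradient step; the sketch only illustrates the mixing schedule that the paper's replay mechanism operates on.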

Merits

Methodological Strength

The study employs a controlled pre-training environment (4M target tokens, 4B total tokens, 150M-parameter models), which allows for a rigorous, confound-free evaluation of the replay mechanism's effectiveness, and the follow-up experiments on 8B-parameter models show that the effect carries over to practically sized models.

Practical Significance

The findings demonstrate the potential of replay to improve performance in real-world applications, such as web navigation and question-answering.

Demerits

Limited Generalizability

The study focuses on a specific set of tasks and domains, which may limit the generalizability of the findings to other contexts.

Lack of Theoretical Explanation

The researchers do not provide a clear theoretical explanation for why replaying pre-training data improves performance, which may limit the understanding and application of the approach.

Expert Commentary

The study's results challenge the existing paradigm of pre-training and fine-tuning language models, in which generic data during fine-tuning is treated purely as insurance against forgetting. The reported gains in target data efficiency and downstream task performance are substantial. However, the absence of a theoretical explanation and the limited range of tasks examined may hinder widespread adoption of the approach. Nevertheless, the research offers a novel perspective on language model training and has the potential to inform the development of more effective models and applications.

Recommendations

  • Future research should investigate the theoretical underpinnings of the replay mechanism to provide a clearer understanding of its benefits and limitations.
  • The findings should be validated in a broader range of tasks and domains to ensure the generalizability of the results.