Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

arXiv:2603.16127v1 Abstract: We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.

Executive Summary

This article investigates the impact of learning rate scheduling on large language models' performance after supervised fine-tuning. The authors find that keeping the learning rate constant after warmup, with no decay phase, enhances downstream adaptability and outperforms decay-based schedulers after SFT. Experiments with 1B and 8B parameter models demonstrate this advantage, and it persists across mid-training and over-training regimes. The findings have practical implications for training and model release strategies, suggesting that pre-training without learning rate decay can improve a model's adaptability to downstream tasks.

Key Points

  • Pre-training without learning rate decay enhances supervised fine-tuning performance
  • A constant learning rate after warmup outperforms decay-based schedulers after SFT, even when decay yields lower pre-training loss
  • Loss landscape analysis reveals that decay-based schedulers lead to sharper minima, compromising adaptability
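The scheduling contrast is simple to state in code. Below is a minimal sketch of a Warmup-Stable-Only schedule next to a conventional warmup-plus-cosine-decay schedule; the function and parameter names are illustrative, not taken from the paper:

```python
import math

def lr_wso(step, *, warmup_steps, peak_lr):
    """Warmup-Stable-Only: linear warmup, then a constant learning rate
    with no decay phase (the schedule examined in the paper)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # stable phase: no decay

def lr_cosine(step, *, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Conventional warmup followed by cosine decay, for comparison."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The two schedules agree during warmup and diverge afterwards:
# WSO stays at peak_lr while the cosine schedule decays toward min_lr.
print(lr_wso(5000, warmup_steps=1000, peak_lr=3e-4))
print(lr_cosine(5000, warmup_steps=1000, total_steps=10000, peak_lr=3e-4))
```

In a real training loop these would typically be passed to an optimizer via a scheduler hook (e.g. a per-step multiplier), but the schedule shapes themselves are the entire difference under study.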

Merits

Improved Adaptability

The proposed approach enhances the model's ability to adapt to downstream tasks, making it more versatile and useful in real-world applications.

Demerits

Potential Overfitting

The use of a constant learning rate without decay may lead to overfitting, particularly in cases where the model is not properly regularized.

Expert Commentary

The article's findings challenge the conventional wisdom on learning rate scheduling, highlighting the potential drawbacks of decay-based schedulers. The authors' use of loss landscape analysis provides valuable insights into the underlying mechanisms, demonstrating that the proposed approach preserves flatter minima that support adaptability. As the field of natural language processing continues to evolve, this study's implications for training and model release strategies will be crucial in developing more effective and adaptable language models.
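The sharp-versus-flat distinction the commentary refers to can be made concrete with a simple probe: perturb the weights with random noise of fixed norm and measure how much the loss rises; sharper minima show larger increases. The following is a toy NumPy sketch of that idea, with quadratic losses standing in for a real model's loss surface (the function, `radius`, and the example losses are all illustrative, not the paper's actual analysis):

```python
import numpy as np

def sharpness_probe(loss_fn, weights, radius=0.05, n_samples=32, seed=0):
    """Estimate local sharpness as the average loss increase under random
    perturbations of fixed norm `radius` around `weights`."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    increases = []
    for _ in range(n_samples):
        d = rng.standard_normal(weights.shape)
        d *= radius / np.linalg.norm(d)          # rescale to exact radius
        increases.append(loss_fn(weights + d) - base)
    return float(np.mean(increases))

# Two toy minima at w = 0 with different curvature.
sharp = lambda w: 100.0 * np.sum(w ** 2)  # high-curvature (sharp) minimum
flat = lambda w: 1.0 * np.sum(w ** 2)     # low-curvature (flat) minimum
w0 = np.zeros(10)
print(sharpness_probe(sharp, w0) > sharpness_probe(flat, w0))  # True
```

Under this probe, the sharp minimum's loss rises 100x more than the flat one's for the same perturbation, which is the kind of contrast the authors' loss landscape analysis attributes to decay-based versus WSO pre-training.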

Recommendations

  • Researchers and practitioners should consider using constant learning rates without decay when pre-training large language models
  • Further studies should investigate the applicability of this approach to other deep learning domains and tasks

Sources