Two-Stage Optimizer-Aware Online Data Selection for Large Language Models

arXiv:2604.00001v1 Announce Type: cross Abstract: Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.

Executive Summary

The article proposes an optimizer-aware framework for gradient-based online data selection and reweighting in large language model (LLM) fine-tuning. The authors introduce a two-stage Filter-then-Weight algorithm to select and optimize data samples under adaptive optimizers. Experiments demonstrate improved convergence and downstream performance over existing online data selection baselines. The framework's key feature is its ability to account for interactions and redundancy among selected samples, making it well-suited for online fine-tuning. However, the algorithm assumes access to the optimizer's state, which may not be feasible in all settings.

Key Points

  • The authors propose an optimizer-aware framework for online data selection in LLM fine-tuning.
  • The framework uses a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their mixing coefficients.
  • The algorithm accounts for interactions and redundancy among selected samples.
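The paper's exact algorithm is not reproduced in this digest; the following is a minimal sketch of the two-stage idea under assumed simplifications (an Adam-style diagonal preconditioner standing in for "the optimizer state", and hypothetical names throughout):

```python
import numpy as np

def filter_then_weight(cand_grads, target_grad, v, k=4, eps=1e-8):
    """Sketch of one Filter-then-Weight selection step.

    cand_grads:  (n, d) per-sample gradients of the candidate batch
    target_grad: (d,)   gradient on the target objective
    v:           (d,)   Adam second-moment estimate (optimizer state)
    Returns the selected indices and their mixing weights.
    """
    # Optimizer-aware geometry: rescale gradients the way Adam rescales
    # updates, so utility is measured in the effective update space
    # rather than in raw gradient space.
    precond = 1.0 / (np.sqrt(v) + eps)
    G = cand_grads * precond          # (n, d) preconditioned gradients
    t = target_grad * precond         # (d,)   preconditioned target

    # Stage 1 (Filter): keep the k candidates whose preconditioned
    # gradients align best with the target direction.
    scores = G @ t
    idx = np.argsort(scores)[-k:]

    # Stage 2 (Weight): choose coefficients w so the weighted update
    # matches the target update in a least-squares sense; solving
    # jointly over the subset couples the samples and penalizes
    # redundancy among them.
    Gs = G[idx]                                   # (k, d)
    w, *_ = np.linalg.lstsq(Gs.T, t, rcond=None)  # solve Gs^T w ≈ t
    return idx, w
```

Because Stage 2 solves for all coefficients jointly, two near-duplicate samples share weight instead of both being counted at full value, which is the subset-level redundancy effect the abstract emphasizes.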

Merits

Strength in Mathematical Formulation

The authors provide a rigorous mathematical formulation of the optimizer-aware update-matching problem, which is a key contribution of the article.
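The article does not reproduce the formulation itself, but an update-matching objective of the kind the abstract describes can be sketched as follows (notation assumed here, not taken from the paper):

```latex
% Hedged sketch of an optimizer-aware update-matching objective.
% P_t: preconditioner induced by the optimizer state at step t,
%      e.g. P_t = \mathrm{diag}\!\big(1/(\sqrt{v_t}+\epsilon)\big) for Adam;
% g_i: per-sample gradient; g^{\mathrm{tgt}}_t: target-task gradient;
% S: the selected subset; w_i: per-sample mixing weights.
\min_{w \ge 0}\;
\Big\| \sum_{i \in S} w_i \, P_t \, g_i \;-\; P_t \, g^{\mathrm{tgt}}_t \Big\|_2^2
```

The preconditioner is what makes the problem optimizer-aware: the same set of gradients can match the target update well or poorly depending on the current optimizer state.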

Improved Convergence and Downstream Performance

The experiments show consistent improvements in convergence and downstream performance over existing online data selection baselines under the same data budget, making the framework a promising approach for LLM fine-tuning.

Demerits

Assumption of Access to Optimizer's State

The algorithm assumes access to the optimizer's state, which may not be feasible in all settings, limiting its practical applicability.

Computational Complexity

The factorized outer-product gradient representation and the optimized matrix computations are introduced to make the method tractable, but they still add overhead on top of standard fine-tuning, which could be a challenge for large-scale LLM runs, especially with long-context data.
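The paper's exact factorization is not given in this digest, but the standard identity such representations rely on is easy to illustrate: for a linear layer, the per-sample weight gradient is an outer product, so gradient inner products collapse to two small dot products (a minimal sketch, with hypothetical dimensions):

```python
import numpy as np

# For a linear layer y = W x, the per-sample weight gradient is the
# outer product (dL/dy) x^T. Storing the two factors instead of the
# full d_out x d_in matrix cuts per-sample memory from d_out*d_in to
# d_out + d_in, and gradient inner products reduce via the trace
# identity  <a x^T, b z^T>_F = (a . b)(x . z).
rng = np.random.default_rng(0)
d_out, d_in = 64, 128
a, x = rng.normal(size=d_out), rng.normal(size=d_in)  # factors, sample i
b, z = rng.normal(size=d_out), rng.normal(size=d_in)  # factors, sample j

dense = np.sum(np.outer(a, x) * np.outer(b, z))  # O(d_out * d_in)
factored = (a @ b) * (x @ z)                     # O(d_out + d_in)
assert np.isclose(dense, factored)
```

Inner products like these are exactly what both stages of a gradient-matching method consume, which is why a factorized representation can make the bookkeeping affordable at LLM scale.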

Expert Commentary

The article makes a significant contribution to the field of LLM fine-tuning by proposing an optimizer-aware framework for online data selection and reweighting. The two-stage Filter-then-Weight algorithm is the key innovation: it treats selection as shaping the next target-oriented update under the optimizer state rather than as static sample ranking, and the experimental results support its effectiveness. The main open questions are the assumption of access to the optimizer's state and the computational overhead of the gradient machinery; further research is needed to address these limitations and to fully explore the framework's potential.

Recommendations

  • Future research should focus on more efficient and scalable methods for the factorized outer-product gradient representation and the associated matrix computations.
  • The proposed framework should be evaluated on a broader range of LLM architectures and tasks to demonstrate its generalizability and effectiveness.

Sources

Original: arXiv - cs.AI