FLUX: Data Worth Training On

arXiv:2603.13972v1 Abstract: Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). FLUX achieves the same aggregate score as a model trained on DCLM data using only 39B tokens, resulting in a 34.4% reduction in training compute. At the data level, FLUX extracts 50B usable tokens from a single dump (CC-MAIN-2025-51), compared to 40B from DCLM (+25% retention). FLUX-Base yields 192B tokens, exceeding FineWeb's 170B while still maintaining superior quality. Overall, FLUX establishes a new state of the art in web-scale data preprocessing by demonstrating that high retention, strong quality control, and computational efficiency can be achieved simultaneously, redefining the limits of scalable dataset construction for modern language models.
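The abstract's headline figures can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted above; note that the straight token-count reduction (35.0%) is close to, but not identical to, the 34.4% compute reduction the authors report, whose exact accounting the abstract does not specify.

```python
# Sanity-check the retention and compute figures quoted in the abstract.
tokens_flux = 50e9    # usable tokens FLUX extracts from CC-MAIN-2025-51
tokens_dclm = 40e9    # usable tokens DCLM extracts from the same dump
retention_gain = tokens_flux / tokens_dclm - 1
print(f"retention gain over DCLM: {retention_gain:.0%}")  # 25%, matching the reported +25%

tokens_full = 60e9     # tokens for the full FLUX training run
tokens_matched = 39e9  # tokens FLUX needs to match DCLM's aggregate score
token_reduction = 1 - tokens_matched / tokens_full
print(f"token reduction: {token_reduction:.1%}")  # 35.0%, close to the reported 34.4% compute saving
```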

Executive Summary

The article introduces FLUX, a novel preprocessing pipeline that addresses a critical bottleneck in large language model training: the trade-off between data retention and quality control. FLUX breaks this dichotomy by enabling high token retention without compromising quality, thereby redefining the limits of scalable dataset construction. Empirical results demonstrate that FLUX-curated data outperforms prior pipelines: a 3B-parameter model trained on FLUX data achieves higher MMLU accuracy (32.14%) than both DCLM (31.98%) and FineWeb (29.88%), while matching DCLM's aggregate score with 34.4% less training compute. The ability to extract more usable tokens from the same data dump (50B versus DCLM's 40B from CC-MAIN-2025-51), coupled with superior quality metrics, positions FLUX as a transformative advancement in web-scale data preprocessing.
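The abstract does not describe FLUX's internals, so the following is only a generic illustration of the retention-versus-quality trade-off the summary refers to. The `quality_score` heuristic, the thresholds, and the toy documents are hypothetical stand-ins, not the FLUX method: raising the quality threshold discards tokens, lowering it admits noise.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str

def quality_score(doc: Document) -> float:
    """Hypothetical quality heuristic: the fraction of alphabetic or
    whitespace characters, standing in for a real quality classifier."""
    if not doc.text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in doc.text) / len(doc.text)

def filter_corpus(docs, threshold):
    """Keep documents scoring at or above `threshold`. A stricter threshold
    improves average quality but lowers token retention; this is the
    trade-off FLUX claims to break."""
    kept = [d for d in docs if quality_score(d) >= threshold]
    retained_tokens = sum(len(d.text.split()) for d in kept)
    return kept, retained_tokens

docs = [
    Document("clean readable prose about language models"),
    Document("%%% @@@ 1234 &&& garbled boilerplate ###"),
]
strict_kept, strict_tokens = filter_corpus(docs, threshold=0.9)
loose_kept, loose_tokens = filter_corpus(docs, threshold=0.3)
print(len(strict_kept), strict_tokens)  # strict filtering: fewer docs, fewer tokens
print(len(loose_kept), loose_tokens)    # loose filtering: more tokens, more noise
```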

Key Points

  • FLUX breaks the traditional trade-off between data retention and quality control
  • Models trained on FLUX-curated data achieve higher MMLU accuracy than both DCLM and FineWeb
  • FLUX achieves significant compute efficiency gains without sacrificing data quality

Merits

Innovation

FLUX introduces a novel framework that simultaneously achieves high retention, strong quality control, and computational efficiency—a feat previously unattainable.

Empirical Validation

Results show clear, measurable superiority over existing pipelines in both accuracy metrics and efficiency, validating the effectiveness of FLUX.

Demerits

Scalability Constraints

While FLUX performs exceptionally well with current data dumps, scalability to future, exponentially larger datasets remains an open question.

Generalizability

The effectiveness of FLUX may be contingent upon the structure and quality of source data; applicability to non-web-scale or proprietary datasets is untested.

Expert Commentary

FLUX represents a paradigm shift in the field of large-scale data preprocessing. Historically, the tension between data volume and quality control has constrained progress in model training; FLUX dismantles this constraint by demonstrating that both can coexist. The pipeline's ability to extract 50B usable tokens from a single CC-MAIN-2025-51 dump (25% more than DCLM's 40B) while maintaining higher accuracy than the prior state of the art underscores its technical sophistication. Moreover, the 34.4% reduction in compute cost for equivalent performance is economically significant, offering tangible benefits to training budgets. From a broader perspective, FLUX challenges the conventional wisdom that quality must be sacrificed for scale, potentially inspiring new methodologies in domains beyond language modeling, including multimodal AI and synthetic data generation. This work is not merely an incremental improvement; it is a foundational shift that may influence the design of future preprocessing pipelines.

Recommendations

  • Adopt FLUX as a baseline preprocessing pipeline for large-scale language model training projects.
  • Encourage open-source replication studies to validate FLUX’s efficacy across diverse data sources and model architectures.