Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers

arXiv:2603.19544v1 Abstract: Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond those of cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize key sources of heterogeneity impacting training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.

Executive Summary

This article presents a comprehensive cross-facility federated learning framework for heterogeneous High Performance Computing (HPC) environments. Building on the Advanced Privacy-Preserving Federated Learning (APPFL) framework, the authors use Globus Compute and Globus Transfer to orchestrate collaborative training across multiple supercomputers. The framework is evaluated across four U.S. Department of Energy (DOE) leadership-class supercomputers, demonstrating the practical feasibility of federated learning experiments in HPC settings. The study highlights the importance of algorithmic choices under realistic HPC scheduling conditions and identifies scheduler-aware algorithm design as a critical open challenge for future deployments. The authors validate the scientific applicability of their framework by fine-tuning a large language model on a chemistry instruction dataset.
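To make the orchestration pattern concrete, the sketch below shows how a single federated round might be dispatched to Globus Compute endpoints at two facilities and aggregated with a plain weighted FedAvg. This is a minimal illustration under stated assumptions, not the paper's implementation: the endpoint IDs and the `local_train` function are hypothetical placeholders, and APPFL's actual integration (including Globus Transfer for moving model checkpoints) is considerably more involved.

```python
# Illustrative sketch only: dispatching one federated round to Globus Compute
# endpoints at multiple HPC facilities and aggregating with weighted FedAvg.
# Endpoint IDs and local_train are hypothetical; APPFL's real orchestration
# (and its use of Globus Transfer for model movement) differs.
from globus_compute_sdk import Executor

# Hypothetical endpoint IDs, one per facility.
ENDPOINTS = {
    "facility_a": "00000000-0000-0000-0000-000000000001",
    "facility_b": "00000000-0000-0000-0000-000000000002",
}

def local_train(global_weights, epochs=1):
    """Runs at the remote facility: fine-tune on local data and return
    (updated_weights, num_local_samples). Placeholder body for illustration."""
    import numpy as np  # imports go inside: the function is serialized and shipped
    updated = [np.asarray(w) for w in global_weights]  # real code would train here
    n_samples = 10_000  # would be len(local_dataset)
    return updated, n_samples

def federated_round(global_weights):
    """Submit local training to every facility, wait, and average the results."""
    executors = {name: Executor(endpoint_id=ep) for name, ep in ENDPOINTS.items()}
    try:
        futures = [ex.submit(local_train, global_weights) for ex in executors.values()]
        results = [f.result() for f in futures]  # blocks until each remote job returns
    finally:
        for ex in executors.values():
            ex.shutdown()

    # Weighted FedAvg: mix client weights in proportion to local sample counts.
    total = sum(n for _, n in results)
    return [
        sum(weights[i] * (n / total) for weights, n in results)
        for i in range(len(results[0][0]))
    ]
```

Note that the synchronous `f.result()` calls are exactly where HPC batch-queue waits bite: the round cannot complete until the slowest facility's job has been scheduled and finished, which is the scenario the paper's scheduling discussion addresses.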

Key Points

  • Development of a comprehensive cross-facility federated learning framework for HPC environments
  • Use of Globus Compute and Globus Transfer to orchestrate collaborative training
  • Evaluation of the framework across four U.S. DOE leadership-class supercomputers

Merits

Original Contribution

The article presents a novel framework for cross-facility federated learning in HPC environments, addressing a critical challenge in scientific applications.

Methodological Rigor

The authors employ a robust methodology, including the evaluation of their framework across multiple HPC platforms and the identification of key sources of heterogeneity.

Demerits

Limited Generalizability

The study focuses on a specific domain (HPC environments) and may not be directly applicable to other settings.

Lack of Comparative Analysis

The article does not directly compare the framework with other federated learning systems, which makes it difficult to judge its relative performance and broader impact.

Expert Commentary

The article makes a substantial contribution to cross-facility federated learning in HPC environments. The use of Globus Compute and Transfer orchestration, together with an evaluation across multiple DOE leadership-class platforms, reflects strong methodological rigor. However, the specificity of the HPC setting and the absence of a comparative analysis against other FL frameworks temper its broader impact. The findings have clear practical implications for deploying AI models across HPC facilities and underscore the need for careful algorithmic choices and scheduler-aware design when building FL frameworks for these environments.
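Why algorithmic choice matters under realistic scheduling is easiest to see by contrast: synchronous FedAvg stalls on the slowest facility's queue, whereas an asynchronous server can fold each update in as it arrives, discounting stale contributions. The sketch below shows a generic staleness-discounted asynchronous update as an illustration of that trade-off; it is not one of the specific algorithms evaluated in the paper, and the mixing schedule is an assumption.

```python
# Generic staleness-discounted asynchronous update (illustrative only; not the
# specific algorithms compared in the paper). An asynchronous server mixes in
# each client update as it arrives, down-weighting updates computed against an
# old copy of the global model.
import numpy as np

def apply_async_update(global_weights, client_weights, client_round, server_round,
                       base_lr=0.5, staleness_power=0.5):
    """Mix a client's update into the global model, discounted by how many
    server rounds elapsed since the client pulled its copy of the model."""
    staleness = max(server_round - client_round, 0)
    alpha = base_lr / (1.0 + staleness) ** staleness_power
    return [
        (1.0 - alpha) * np.asarray(gw) + alpha * np.asarray(cw)
        for gw, cw in zip(global_weights, client_weights)
    ]

# Example: a client trained against round 3's model, but the server has reached
# round 7 because other facilities' batch jobs were scheduled sooner.
global_w = [np.zeros(4)]
client_w = [np.ones(4)]
global_w = apply_async_update(global_w, client_w, client_round=3, server_round=7)
print(global_w)  # the update is applied with a reduced mixing coefficient
```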

Recommendations

  • Future studies should focus on developing and evaluating FL frameworks for multiple domains, including non-HPC environments.
  • Researchers should prioritize the development of scheduler-aware FL algorithms to address the challenges of realistic HPC scheduling conditions (one possible direction is sketched below).
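
One concrete direction for the second recommendation is a server that consults estimated batch-queue wait times before deciding which facilities participate in a synchronous round. The sketch below is a hypothetical illustration of that idea only: the queue-wait estimator, the time budget, and the fallback policy are all assumptions, and a real deployment would query each facility's scheduler (e.g., Slurm or PBS) or a monitoring service rather than a hard-coded table.

```python
# Hypothetical sketch of scheduler-aware client selection: include a facility
# in the next synchronous round only if its estimated queue wait fits a
# per-round time budget. Everything here (the estimator, the budget, the
# fallback policy) is an assumption for illustration, not the paper's design.
def estimate_queue_wait_s(facility: str) -> float:
    """Stand-in for querying a facility's batch scheduler or monitoring API."""
    return {"facility_a": 120.0, "facility_b": 5400.0, "facility_c": 600.0}[facility]

def select_participants(facilities, round_budget_s=1800.0, min_clients=2):
    """Pick facilities whose jobs are expected to start within the round budget."""
    ranked = sorted(facilities, key=estimate_queue_wait_s)
    chosen = [f for f in ranked if estimate_queue_wait_s(f) <= round_budget_s]
    # Fall back to the fastest-to-start facilities if too few fit the budget.
    return chosen if len(chosen) >= min_clients else ranked[:min_clients]

print(select_participants(["facility_a", "facility_b", "facility_c"]))
# -> ['facility_a', 'facility_c']  (facility_b's queue wait exceeds the budget)
```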

Sources

Original: arXiv - cs.LG