Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers

Yue Zhang, Chuanlong Qiu, Xinfa Liao, Yiqun Zhang

arXiv:2603.12684v1 Abstract: Federated Clustering (FC) is an emerging and promising solution for exploring data distribution patterns from distributed and privacy-protected data in an unsupervised manner. Existing FC methods implicitly rely on the assumption that clients have a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-$k^*$-HC, which can automatically determine an optimal number of clusters $k^*$ based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for $k^*$ determination, we let each client generate micro-subclusters. Their prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows exploring clusters of varying sizes and shapes, and the progressive merging process can self-terminate according to the neighboring relationships among the prototypes to determine $k^*$. Extensive experiments on diverse datasets demonstrate the FC capability of the proposed Fed-$k^*$-HC in accurately exploring a proper number of clusters.

Executive Summary

This article proposes Fed-$k^*$-HC, a novel federated clustering framework that automatically determines the optimal number of clusters $k^*$ from the data distribution uncovered through hierarchical clustering. The framework addresses key challenges of federated clustering, including unknown cluster numbers and imbalanced cluster sizes, by combining a density-based merging design with a progressive, self-terminating merging process. Evaluations on diverse datasets demonstrate its ability to accurately recover a proper number of clusters.
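The client-side step described above, summarizing local data into micro-subcluster prototypes so that only prototypes (never raw records) leave the client, can be sketched as follows. This is a hypothetical illustration: the function name, the choice of plain k-means over-clustering, and all parameters are assumptions for clarity, not the paper's exact design.

```python
import numpy as np

def client_prototypes(X, n_micro=20, n_iter=10, seed=0):
    """Summarize local data X (n_samples, n_features) into micro-subcluster
    prototypes using a simple k-means loop (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # initialize prototypes from randomly chosen local points
    centers = X[rng.choice(len(X), size=min(n_micro, len(X)), replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest prototype
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # move each prototype to the mean of its assigned points
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers  # only these summaries are uploaded to the server

# toy local dataset: two Gaussian blobs on one client
rng = np.random.default_rng(1)
X_local = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
protos = client_prototypes(X_local, n_micro=8)
```

Over-clustering (many more micro-subclusters than true clusters) is what lets the server later recover clusters of varying sizes and shapes from the prototypes alone.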

Key Points

  • Federated clustering with automatic selection of optimal cluster numbers
  • Hierarchical clustering for exploring data distribution
  • Density-based merging design for handling varying cluster sizes and shapes
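The server-side idea behind the points above can be sketched as a single-linkage merge over the uploaded prototypes that self-terminates once the nearest pair of clusters is farther apart than a threshold derived from neighboring-prototype distances. The stopping rule here (a multiple of the median nearest-neighbor distance) is an illustrative assumption; the paper's density-based criterion is more elaborate.

```python
import numpy as np

def merge_prototypes(P, scale=2.0):
    """Merge prototypes P (n, d) agglomeratively; return labels and k*."""
    n = len(P)
    d = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    # neighborhood-based threshold: assumed stopping rule for illustration
    tau = scale * np.median(d.min(axis=1))
    labels = np.arange(n)
    while True:
        # find the closest pair of distinct clusters under single linkage
        best, pair = np.inf, None
        for a in np.unique(labels):
            for b in np.unique(labels):
                if a < b:
                    gap = d[np.ix_(labels == a, labels == b)].min()
                    if gap < best:
                        best, pair = gap, (a, b)
        if pair is None or best > tau:
            break  # self-terminate: remaining clusters are well separated
        labels[labels == pair[1]] = pair[0]
    _, labels = np.unique(labels, return_inverse=True)
    return labels, labels.max() + 1
```

Because merging stops based on the geometry of the prototypes rather than a preset $k$, the number of surviving clusters is the estimate $k^*$, and no client-specified cluster count is needed.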

Merits

Robustness to varying cluster sizes and shapes

The proposed framework can handle clusters of varying sizes and shapes, making it more robust than existing methods.

Demerits

Increased computational complexity

The hierarchical clustering and merging process may increase the computational complexity of the framework, potentially making it less efficient than other methods.

Expert Commentary

The proposed Fed-$k^*$-HC framework represents a significant advancement in federated clustering, addressing the long-standing challenge of determining the optimal number of clusters in a privacy-preserving manner. The use of hierarchical clustering and density-based merging design enables the framework to handle clusters of varying sizes and shapes, making it more robust than existing methods. However, the increased computational complexity of the framework may be a concern in certain applications. Overall, the framework has significant implications for various fields, including customer segmentation, gene expression analysis, and data protection policies.

Recommendations

  • Further evaluation of the framework on larger and more diverse datasets to demonstrate its scalability and robustness
  • Investigation of the framework's applicability to other areas, such as natural language processing and computer vision
