Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

arXiv:2603.10033v1 Announce Type: new Abstract: Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at https://github.com/smufang/GFMBenchmark.

Executive Summary

This article presents a comprehensive benchmark for evaluating graph foundation models (GFMs), pre-trained models designed to acquire transferable knowledge across diverse graphs. The benchmark assesses how well these models adapt to downstream tasks along two axes of domain shift: topic domains (what the graphs describe) and format domains (how the graphs are represented). The authors evaluate eight state-of-the-art GFMs on 33 datasets, surfacing new empirical observations and practical insights for future research. The study contributes to the rapidly evolving field of GFMs by providing a timely evaluation of recent models and a more nuanced view of their strengths and limitations.

Key Points

  • The article introduces a new benchmark for evaluating GFMs that addresses the two-dimensional nature of domain shift in graphs, covering the full pipeline from multi-domain self-supervised pre-training to few-shot downstream adaptation.
  • The benchmark enables controlled assessment across four settings, varying whether downstream datasets, topics, or formats were seen during pre-training, thereby disentangling semantic generalization from robustness to representational shifts.
  • The authors conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains.
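The four evaluation settings above can be thought of as different ways of splitting a graph corpus into pre-training and adaptation sets. The sketch below illustrates this structure; the dataset names, topics, and formats are hypothetical placeholders, not the benchmark's actual datasets or code.

```python
# Hedged sketch of the four-setting protocol as pre-train/adapt splits.
# All dataset metadata here is illustrative, not from the benchmark itself.
from dataclasses import dataclass

@dataclass(frozen=True)
class GraphDataset:
    name: str
    topic: str  # what the graph describes, e.g. "citation", "social"
    fmt: str    # how it is represented, e.g. "homogeneous", "heterogeneous"

CORPUS = [
    GraphDataset("cite-A", "citation", "homogeneous"),
    GraphDataset("cite-B", "citation", "heterogeneous"),
    GraphDataset("social-A", "social", "homogeneous"),
    GraphDataset("social-B", "social", "heterogeneous"),
]

def make_splits(setting: str):
    """Return (pretrain, adapt) dataset lists for one of the four settings."""
    if setting == "unseen":        # (i) diverse pre-training, unseen downstream data
        return CORPUS[:-1], CORPUS[-1:]
    if setting == "seen":          # (ii) same pre-training, adapt to seen data
        pretrain = CORPUS[:-1]
        return pretrain, pretrain[:1]
    if setting == "cross_topic":   # (iii) single topic, adapt to other topics
        pretrain = [d for d in CORPUS if d.topic == "citation"]
        return pretrain, [d for d in CORPUS if d.topic != "citation"]
    if setting == "cross_format":  # (iv) base format, adapt to other formats
        pretrain = [d for d in CORPUS if d.fmt == "homogeneous"]
        return pretrain, [d for d in CORPUS if d.fmt != "homogeneous"]
    raise ValueError(f"unknown setting: {setting}")
```

Settings (i) and (ii) share the same pre-training pool and differ only in whether the adaptation data was seen, while (iii) and (iv) each hold one axis (topic or format) fixed during pre-training and shift it at adaptation time.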

Merits

Comprehensive Benchmark

The study presents a thorough and systematic evaluation framework for GFMs, addressing a significant gap in existing benchmarks, which vary only topic domains and thus obscure how knowledge transfers across representational formats.

Multi-Dimensional Assessment

The authors' two-axis evaluation approach provides a more nuanced understanding of GFMs' strengths and limitations by considering both topic and format domains.

Demerits

Limited Generalizability

Given the study's specific focus on graph foundation models, its findings may not transfer directly to foundation models in other modalities, such as language or vision.

Expert Commentary

The article presents a timely and comprehensive evaluation of graph foundation models, filling a clear gap in the literature: prior benchmarks vary only topic domains, while this one also controls for representational format. The two-axis design yields a more nuanced picture of where GFMs generalize and where they break down. Although the findings are specific to graph models and may not transfer directly to foundation models in other modalities, the benchmark and accompanying insights have meaningful implications for how AI models are evaluated and deployed in practice.

Recommendations

  • Future research should explore whether the proposed benchmark methodology extends to foundation models in other modalities and areas of machine learning.
  • The authors' two-axis evaluation approach could be adapted to other model families to similarly disentangle semantic generalization from robustness to representational shifts.
