
Graph Tokenization for Bridging Graphs and Transformers


Zeyuan Guo, Enmao Diao, Cheng Yang, Chuan Shi

arXiv:2603.11099v1 — Abstract: The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state-of-the-art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph-structured data and the ecosystem of sequence models. Our code is available [here](https://github.com/BUPT-GAMMA/Graph-Tokenization-for-Bridging-Graphs-and-Transformers).

Executive Summary

This article introduces a graph tokenization framework that enables the application of Transformers to graph-structured data. The framework combines reversible graph serialization with Byte Pair Encoding (BPE) to generate sequential representations of graphs. The approach achieves state-of-the-art results on 14 benchmark datasets, outperforming graph neural networks and specialized graph transformers. The proposed tokenizer bridges the gap between graph-structured data and sequence models, allowing for the direct application of Transformers without architectural modifications.
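The key property of the serialization step is reversibility: the token sequence must preserve enough information to reconstruct the original graph exactly. The paper's substructure-guided ordering is not reproduced here, but the idea can be illustrated with a minimal sketch in which a labeled undirected graph is flattened into node and edge tokens and then recovered without loss (the token format and helper names below are illustrative assumptions, not the paper's scheme):

```python
def serialize(graph):
    """Flatten a labeled undirected graph into a token sequence.

    `graph` maps node id -> (label, sorted neighbor list).
    Illustrative only: the paper orders tokens by global
    substructure statistics, which this sketch omits.
    """
    tokens = [f"N{n}:{graph[n][0]}" for n in sorted(graph)]
    for n in sorted(graph):
        for nbr in graph[n][1]:
            if nbr > n:  # emit each undirected edge exactly once
                tokens.append(f"E{n}-{nbr}")
    return tokens


def deserialize(tokens):
    """Invert serialize(), recovering the original graph exactly."""
    graph = {}
    for tok in tokens:
        if tok.startswith("N"):
            node, label = tok[1:].split(":")
            graph[int(node)] = (label, [])
        else:
            u, v = map(int, tok[1:].split("-"))
            graph[u][1].append(v)
            graph[v][1].append(u)
    for node in graph:  # restore canonical neighbor ordering
        graph[node][1].sort()
    return graph
```

Because `deserialize(serialize(g)) == g` holds for any such graph, no structural information is discarded before tokenization.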

Key Points

  • Introduction of a graph tokenization framework for bridging graphs and Transformers
  • Combination of reversible graph serialization and Byte Pair Encoding (BPE) for generating sequential representations of graphs
  • State-of-the-art results on 14 benchmark datasets, outperforming graph neural networks and specialized graph transformers
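Once graphs are serialized, the BPE step is the same greedy procedure used in LLM tokenizers: count adjacent token pairs across the corpus of serialized graphs and repeatedly merge the most frequent pair into a single token. A minimal sketch of one merge step (function names are illustrative, not from the paper's code):

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent token pairs across all serialized graphs
    and return the most frequent one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged
```

Iterating these two steps until a target vocabulary size is reached yields tokens that correspond to frequently recurring graph substructures, which is why the substructure-guided serialization order matters: it places recurring substructures adjacently so BPE can merge them.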

Merits

Effectiveness

The proposed framework achieves state-of-the-art results on multiple benchmark datasets, demonstrating its effectiveness in applying Transformers to graph-structured data.

Efficiency

The approach allows standard Transformers to be applied directly, without architectural modifications, making it a practical, low-overhead route to handling graph-structured data with existing sequence-model infrastructure.

Demerits

Complexity

The proposed framework may introduce additional complexity due to the combination of reversible graph serialization and BPE, which could impact its scalability and interpretability.

Expert Commentary

The proposed graph tokenization framework represents a significant advancement in bridging the gap between graph-structured data and sequence models. By combining reversible graph serialization and BPE, the approach effectively captures structural information in graphs and enables the application of Transformers without architectural modifications. The state-of-the-art results on multiple benchmark datasets demonstrate the potential of this framework for various graph-based applications. However, further research is needed to address the potential complexity and scalability issues associated with this approach.

Recommendations

  • Further research on optimizing the proposed framework for scalability and interpretability
  • Exploration of the application of the proposed framework to various graph-based domains, such as molecular property prediction and social network analysis
