
Relational In-Context Learning via Synthetic Pre-training with Structural Prior


Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

arXiv:2603.03805v1. Abstract: Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments verify RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN

Executive Summary

This article introduces RDB-PFN, a relational foundation model pre-trained purely on synthetic data to work around the scarcity of high-quality relational databases. According to the abstract, RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines that receive the same DFS-linearized inputs. It adapts to a new database through in-context learning, and it does so with a lightweight architecture and fast inference.

Key Points

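  • RDB-PFN is presented by its authors as the first relational foundation model trained purely on synthetic data
  • A Relational Prior Generator creates an unbounded stream of diverse relational databases from scratch, extending the SCM-based synthetic data that PFNs use for single tables
  • After pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks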
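
To make the idea of generating relational data "from scratch" concrete, here is a minimal sketch of a toy two-table generator. It is not the paper's Relational Prior Generator: the function name `sample_toy_rdb`, the column names, and the tiny linear SCM are all hypothetical choices made for illustration. It only shows the general pattern of sampling causal weights, then sampling parent rows, child rows linked by a foreign key, and a label that depends on both tables.

```python
import random


def sample_toy_rdb(n_parents=5, max_children=3, seed=0):
    """Sample a toy two-table database (hypothetical illustration only).

    Parent columns follow a tiny linear SCM (a -> b); each child row's
    column c depends on its parent's b; the label depends on both tables.
    """
    rng = random.Random(seed)
    w_ab = rng.uniform(-1, 1)  # random causal weight a -> b
    w_bc = rng.uniform(-1, 1)  # random causal weight b -> c (across tables)
    parents, children = [], []
    for pid in range(n_parents):
        a = rng.gauss(0, 1)
        b = w_ab * a + rng.gauss(0, 0.1)
        child_vals = []
        for _ in range(rng.randint(1, max_children)):
            c = w_bc * b + rng.gauss(0, 0.1)
            # parent_id is the foreign key linking the child row to its parent
            children.append({"child_id": len(children), "parent_id": pid, "c": c})
            child_vals.append(c)
        # The label mixes a parent feature with an aggregate over child rows
        label = int(a + sum(child_vals) / len(child_vals) > 0)
        parents.append({"parent_id": pid, "a": a, "b": b, "label": label})
    return parents, children


parents, children = sample_toy_rdb(seed=0)
```

Calling the function with different seeds yields different databases, which is the sense in which a generator of this kind can supply an effectively unlimited stream of pre-training tasks.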

Merits

Innovative Approach

Synthetic data from the Relational Prior Generator supplies over 2 million diverse pre-training tasks, which works around the fact that high-quality relational databases are private, scarce and structurally heterogeneous.

Strong Performance

Across 19 real-world relational prediction tasks, RDB-PFN outperforms graph-based and single-table foundation-model baselines that are given the same DFS-linearized inputs, while using a lightweight architecture and fast inference.
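
The abstract notes that the baselines receive the same DFS-linearized inputs. As a rough sketch of what flattening a relational database into one table can look like, the code below aggregates child rows into per-parent features. It assumes that "DFS" denotes a Deep Feature Synthesis-style aggregation, which the abstract does not spell out; the `linearize` helper, the table names, and the chosen aggregates are hypothetical and not taken from the paper's code.

```python
from collections import defaultdict
from statistics import mean


def linearize(parents, children, key, col):
    """Flatten a parent/child pair into one table (hypothetical helper).

    Each parent row gains count/mean/max aggregates of `col` over the
    child rows that share its `key` value.
    """
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[col])
    flat = []
    for parent in parents:
        vals = grouped.get(parent[key], [])
        flat.append({
            **parent,
            f"{col}_count": len(vals),
            # Parents without child rows get 0.0 as a simple placeholder
            f"{col}_mean": mean(vals) if vals else 0.0,
            f"{col}_max": max(vals) if vals else 0.0,
        })
    return flat


customers = [{"customer_id": 0, "age": 31}, {"customer_id": 1, "age": 52}]
orders = [{"customer_id": 0, "amount": 10.0}, {"customer_id": 0, "amount": 30.0}]
flat = linearize(customers, orders, key="customer_id", col="amount")
```

After this step every customer is a single fixed-width row, so a single-table model can consume it; the cost is that the aggregates discard row-level detail from the child table.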

Demerits

Limited Real-World Data

The model is pre-trained solely on synthetic data, so its learned priors may not capture the full complexity of real-world relational databases.

Expert Commentary

The introduction of RDB-PFN marks a significant advance for machine learning on relational databases, because it addresses the long-standing problem of data scarcity. The model's ability to learn from synthetic data and adapt to new databases via in-context learning makes it a promising tool for real-world applications. However, further research is needed to understand the limitations and potential biases of RDB-PFN, particularly with regard to its performance on complex and heterogeneous databases.

Recommendations

  • Further research should evaluate RDB-PFN on a wider range of real-world relational databases than the 19 prediction tasks reported.
  • The development of RDB-PFN should be accompanied by the creation of policies and regulations that promote the responsible use of synthetic data and relational databases.
