WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos

This paper introduces WikiDBGraph, a large-scale benchmark suite derived from 100,000 real-world relational databases. It is designed to evaluate existing collaborative learning frameworks and expose their limitations in handling the complex, unaligned, and interconnected nature of practical data silos.

Zhaomin Wu, Ziyang Wang, Bingsheng He

Published Tue, 10 Ma

Imagine a world where every library in a city holds a unique collection of books, but no one is allowed to take their books out of the building. If you want to write the ultimate encyclopedia, you can't just gather all the books in one room. You have to figure out how to learn from them without ever physically moving them.

This is the problem of Data Silos. In the real world, companies and organizations (like hospitals, banks, or government agencies) have their own databases. They are often fragmented, messy, and don't talk to each other.

Collaborative Learning (CL) is the idea that these organizations can work together to train smart AI models without sharing their raw data. It's like neighbors sharing recipes to cook a better meal without ever swapping their secret ingredient jars.

However, there's a catch: Current tools are too idealistic.

The Problem: The "Perfect World" vs. Reality

Most existing tests for these AI tools assume a "perfect world":

  1. Isolation: They pretend every database is a lonely island with no connections to others.
  2. Alignment: They assume every database has the exact same columns (e.g., everyone has a "Name" column and a "Phone" column) and the exact same rows (e.g., everyone has data on the same 1,000 people).
  3. Joinability: They assume you can easily glue all these tables together into one giant spreadsheet.
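The "alignment" assumptions above have standard names: a horizontal split (same columns, different rows) and a vertical split (same rows, different columns). Here is a minimal sketch of both in pandas, with made-up silo names and data, just to make the distinction concrete:

```python
import pandas as pd

# Two hypothetical silos under the idealized "horizontal" assumption:
# identical columns, disjoint rows. Gluing them is a trivial concat.
silo_a = pd.DataFrame({"name": ["Ada", "Bob"], "phone": ["111", "222"]})
silo_b = pd.DataFrame({"name": ["Cat", "Dan"], "phone": ["333", "444"]})
horizontal = pd.concat([silo_a, silo_b], ignore_index=True)  # 4 rows, 2 cols

# The "vertical" assumption: identical rows (keyed by name), disjoint
# columns. Gluing them is a trivial one-to-one merge.
silo_c = pd.DataFrame({"name": ["Ada", "Bob"], "age": [30, 41]})
vertical = silo_a.merge(silo_c, on="name")  # 2 rows, 3 cols

print(horizontal.shape)  # (4, 2)
print(vertical.shape)    # (2, 3)
```

Real silos rarely fall cleanly into either case, which is exactly the gap the paper highlights.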

The Reality Check:
In the real world, databases are messy.

  • One hospital might call a column "Patient_ID," while another calls it "MedRec_Num."
  • One database might have 10,000 rows, while another has 10 million.
  • Some databases are connected by complex relationships (like a graph), while others are totally unrelated.
  • Trying to glue them all together often crashes the computer because the data is too huge.

Existing benchmarks are like training a pilot in a simulator where the weather is always perfect and the runway is always straight. When they try to fly in real life (with storms and crooked runways), they crash.

The Solution: WikiDBGraph

The authors of this paper built WikiDBGraph, a massive new "training ground" for these AI systems.

Think of it as a giant, interconnected map of 100,000 different libraries (databases) drawn from the real world (specifically, from Wikidata).

  • The Nodes (The Libraries): Each of the 100,000 nodes is a real database.
  • The Edges (The Bridges): They built 17 million "bridges" connecting these libraries. These bridges aren't just random; they are weighted based on how similar the libraries are.
    • Analogy: Imagine a library about "Ancient Rome" is connected to a library about "Roman Architecture" by a strong bridge. A library about "Modern Tokyo" might have a weak bridge to the Rome library, or no bridge at all.
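To make the nodes-and-weighted-bridges idea concrete, here is a toy sketch of such a graph as a plain adjacency dictionary. The database names, weights, and threshold are illustrative, not values from the actual WikiDBGraph release:

```python
# Nodes are database IDs; edge values are similarity weights ("bridge
# strength"). Weak bridges can be filtered out with a threshold.
graph = {
    "ancient_rome":       {"roman_architecture": 0.92, "modern_tokyo": 0.08},
    "roman_architecture": {"ancient_rome": 0.92},
    "modern_tokyo":       {"ancient_rome": 0.08},
}

def neighbors(node, threshold=0.5):
    """Return neighbors connected by a sufficiently strong bridge."""
    return [n for n, w in graph.get(node, {}).items() if w >= threshold]

print(neighbors("ancient_rome"))  # ['roman_architecture']
print(neighbors("modern_tokyo"))  # []
```

At WikiDBGraph's scale (100,000 nodes, 17 million edges) a real implementation would use a graph library or sparse matrices, but the structure is the same.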

They used a clever AI trick called Contrastive Learning to figure out which libraries should be connected. It's like a librarian who reads the titles and sample pages of books in two different libraries and says, "Hey, these two collections are actually about the same topic, even though they are in different buildings!"
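In the paper, those similarity scores come from embeddings produced by a trained contrastive model. As a stand-in for intuition only, the sketch below approximates "schema similarity" with a bag-of-words cosine over column names; the schemas here are invented:

```python
import math
from collections import Counter

def embed(columns):
    # Crude embedding: token counts from column names. A contrastive
    # model would learn a dense vector instead; this is just for intuition.
    return Counter(tok for col in columns for tok in col.lower().split("_"))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

rome  = embed(["emperor_name", "reign_start", "reign_end"])
arch  = embed(["building_name", "reign_start", "architect"])
tokyo = embed(["station_name", "line_color"])

# The "Ancient Rome" and "Roman Architecture" schemas overlap far more
# than "Ancient Rome" and "Modern Tokyo" do.
print(cosine(rome, arch) > cosine(rome, tokyo))  # True
```

The learned embeddings play the role of the librarian: databases whose contents land close together in embedding space get a strong bridge.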

What They Found (The "Aha!" Moments)

When they tested existing AI collaboration methods on this messy, real-world map, they found some surprising things:

  1. The "Garbage In, Garbage Out" Effect: Most current AI tools failed miserably. Why? Because they couldn't handle the messy "preparation" phase. Before the AI can learn, someone has to match the columns (e.g., realizing "Patient_ID" = "MedRec_Num"). The automated tools they used were too simple and often matched the wrong things, leading to bad results.
  2. The Gap is Huge: There is a massive performance gap between what these AI tools can do when working together versus what a single "super-computer" could do if it had access to all the data at once. The current methods aren't smart enough to close this gap in messy environments.
  3. Hybrid Connections are Key: The most interesting connections weren't just "horizontal" (same columns, different people) or "vertical" (same people, different columns). They were hybrid.
    • Analogy: Imagine a "Monuments" database and a "Historic Places" database. They share some columns (like "Location"), but not all. They share some rows (some monuments are historic places), but not all. This "fuzzy" overlap is where the real magic happens, but current tools struggle to navigate it.
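The hybrid case can be sketched in a few lines of pandas. The two tables below are hypothetical, but they show the pattern: a partial column overlap and a partial row overlap, with neither silo a subset of the other:

```python
import pandas as pd

# Hypothetical "Monuments" and "Historic Places" silos.
monuments = pd.DataFrame({
    "name":     ["Colosseum", "Eiffel Tower"],
    "location": ["Rome", "Paris"],
    "height_m": [48, 330],
})
historic = pd.DataFrame({
    "name":        ["Colosseum", "Alhambra"],
    "location":    ["Rome", "Granada"],
    "year_listed": [1980, 1984],
})

# Partial column overlap: some shared columns, some private to each silo.
shared_cols = set(monuments.columns) & set(historic.columns)

# Partial row overlap: only some entities appear in both silos.
shared_rows = pd.merge(monuments, historic, on=["name", "location"])

print(sorted(shared_cols))  # ['location', 'name']
print(len(shared_rows))     # 1  (only the Colosseum appears in both)
```

Neither a pure horizontal nor a pure vertical method fits this shape, which is why the paper flags hybrid connections as the hard (and valuable) case.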

Why This Matters

WikiDBGraph is a reality check for the AI community. It's a giant, open-source playground that forces researchers to stop testing their algorithms in "perfect world" simulators and start testing them in the messy, interconnected, real world.

It shows us that to make collaborative AI work in the real world (like in healthcare or finance), we need to stop focusing just on the learning part and start fixing the data management part. We need better ways to match columns, handle huge data sizes, and navigate complex networks of databases.

In short: The paper built a realistic, messy, and massive map of the world's data to show us that our current AI tools are still too fragile to handle the real world, and it points the way toward building stronger, more adaptable systems.