Automatic End-to-End Data Integration using Large Language Models

This paper introduces an automatic end-to-end data integration pipeline powered by GPT-5.2 that generates all necessary configuration artifacts, demonstrating comparable or superior performance to human-designed pipelines across multiple case studies at a significantly lower cost.

Aaron Steiner, Christian Bizer

Published Thu, 12 Ma

Imagine you are a chef trying to create a massive, perfect recipe book. But instead of having one kitchen, you have three different kitchens (Kitchens A, B, and C) sending you ingredients.

  • Kitchen A writes "Tomato" on the label.
  • Kitchen B writes "Red Fruit" on the label.
  • Kitchen C writes "Solanum lycopersicum" on the label.

Furthermore, Kitchen A lists prices in "Dollars," Kitchen B in "Euros," and Kitchen C in "Yen." Kitchen A lists the chef's name as "John," while Kitchen B lists him as "J. Smith."

The Problem:
Traditionally, to make one unified recipe book, you need a team of human data engineers (let's call them "Super Chefs"). These Super Chefs have to:

  1. Read every label and figure out that "Tomato," "Red Fruit," and "Solanum lycopersicum" are the same thing.
  2. Convert all the prices to Dollars.
  3. Realize "John" and "J. Smith" are the same person.
  4. Decide which price is the "correct" one if they conflict.

This takes weeks of manual work, costs a fortune, and is prone to human error.
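To make the heterogeneity concrete, here is a small hypothetical sketch of the three kitchens' records as they might arrive. All field names and values are invented for illustration; the point is that every source uses its own schema, vocabulary, currency, and author spelling.

```python
# Three hypothetical source records describing the SAME ingredient,
# each with its own schema, labels, currency, and chef spelling.
kitchen_a = {"ingredient": "Tomato",               "price": 1.20, "currency": "USD", "chef": "John"}
kitchen_b = {"item":       "Red Fruit",            "price": 1.10, "currency": "EUR", "chef": "J. Smith"}
kitchen_c = {"species":    "Solanum lycopersicum", "price": 180,  "currency": "JPY", "chef": "John Smith"}

# Before integration, even counting distinct ingredients fails:
# the three labels look like three different things.
labels = {kitchen_a["ingredient"], kitchen_b["item"], kitchen_c["species"]}
print(len(labels))  # 3 labels, but only 1 real-world ingredient
```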

The Solution (The Paper's Idea):
The researchers asked: What if we hire a super-smart AI (a Large Language Model, or LLM) to do all this thinking for us?

They built a system where an AI (specifically a futuristic version called GPT-5.2) acts as the "Head Chef" who designs the entire process from scratch. The AI doesn't just cook; it writes the instructions, creates the training manuals, and sets up the rules for the kitchen.

How the AI Chef Works (The 4 Steps)

1. The Translator (Schema Matching)

  • Human Way: A human reads the labels and manually draws lines connecting "Tomato" to "Red Fruit."
  • AI Way: The AI looks at the list of ingredients and the target recipe. It instantly realizes, "Oh, 'Red Fruit' is just a fancy way of saying 'Tomato'." It creates a perfect translation guide in seconds.
  • Result: The AI was 100% accurate, even when the labels were nonsense (like "Attribute 1"), because it looked at the content to guess the meaning.
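A minimal sketch of what LLM-driven schema matching could look like, assuming a hypothetical chat-based model client. To keep the example runnable without an API key, the mapping the model might return is hard-coded; in the actual pipeline the LLM generates it from the column names and sample values.

```python
import json

# The unified target schema we want all sources mapped onto (illustrative).
TARGET_SCHEMA = ["name", "price", "author"]

def build_matching_prompt(source_columns, sample_row):
    """Build a prompt asking the model to map source columns to target attributes."""
    return (
        f"Map each source column to one of the target attributes {TARGET_SCHEMA}, "
        "using the sample values as evidence.\n"
        f"Source columns: {source_columns}\n"
        f"Sample row: {json.dumps(sample_row)}\n"
        "Answer as JSON: {source_column: target_attribute}"
    )

# A plausible (hand-written) stand-in for the model's answer for one source:
mapping = {"item": "name", "price": "price", "chef": "author"}

def translate(row, mapping):
    """Rename a source row's keys into the target schema."""
    return {mapping[k]: v for k, v in row.items() if k in mapping}

row_b = {"item": "Red Fruit", "price": 1.10, "chef": "J. Smith"}
print(translate(row_b, mapping))  # {'name': 'Red Fruit', 'price': 1.1, 'author': 'J. Smith'}
```

Because the model sees sample values, not just column names, it can still map a meaningless header like "Attribute 1" to the right target attribute.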

2. The Standardizer (Value Normalization)

  • Human Way: A human creates a list: "If you see 'PS4', write 'PlayStation 4'. If you see 'SNES', write 'Super Nintendo'."
  • AI Way: The AI knows the whole world of video games. It automatically converts every weird abbreviation or slang term into the standard name. It doesn't just do the easy ones; it handles thousands of variations automatically.
  • Result: The AI did a much more thorough job than the humans, who only fixed the obvious ones.
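In practice, a normalization step like this boils down to a lookup table from every observed variant to a canonical value; in the paper's setup the LLM would generate that table, while here a small hand-written stand-in is used for illustration.

```python
# Hand-written stand-in for an LLM-generated normalization table.
NORMALIZATION = {
    "PS4": "PlayStation 4",
    "SNES": "Super Nintendo",
    "Playstation4": "PlayStation 4",
}

def normalize(value):
    """Replace a raw value with its canonical form, if one is known."""
    return NORMALIZATION.get(value, value)

print([normalize(v) for v in ["PS4", "SNES", "Xbox One"]])
# ['PlayStation 4', 'Super Nintendo', 'Xbox One']
```

The advantage of having the model generate the table is coverage: a human tends to fix only the obvious abbreviations, while the model can enumerate thousands of variants in one pass.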

3. The Matchmaker (Entity Matching)

  • Human Way: A human has to read thousands of pairs of records and say, "Yes, these are the same company," or "No, these are different." They then teach a computer program based on these examples.
  • AI Way: The AI acts as a tireless labeller. It looks at the data, picks the most informative examples, and labels them for the matching model to learn from, like a master teacher showing a student the best examples rather than just random ones.
  • Result: The computer trained by the AI performed just as well (or better) than the one trained by humans, but it cost pennies to generate the examples.
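A toy sketch of the idea: the LLM supplies labelled record pairs, and a simple matcher is fit to them. Here a word-overlap (Jaccard) similarity with a learned threshold stands in for the real matching model; the company names and labels are invented for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical pairs the LLM labelled (True means "same company").
labelled = [
    ("Apple Inc.", "Apple Incorporated", True),
    ("Apple Inc.", "Apricot Computers", False),
]

# "Training": place the threshold between the labelled matches and non-matches.
match_sims = [jaccard(a, b) for a, b, m in labelled if m]
non_sims = [jaccard(a, b) for a, b, m in labelled if not m]
threshold = (min(match_sims) + max(non_sims)) / 2

def is_match(a, b):
    return jaccard(a, b) >= threshold

print(is_match("Apple Inc.", "Apple Inc"))   # True
print(is_match("Apple Inc.", "Banana Republic"))  # False
```

The cost saving comes from the labelling step: generating a few hundred labels via API calls costs cents, whereas having engineers read thousands of pairs takes days.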

4. The Judge (Data Fusion)

  • Human Way: If two sources say different things (e.g., one says the movie came out in 2020, another says 2021), a human has to research the internet to find the truth and decide which rule to use.
  • AI Way: The AI simulates a research team. It picks famous examples (like "The Beatles" or "Apple Inc.") and asks itself, "What is the truth?" It then uses that answer to set the rules for the whole dataset.
  • Result: For static things (like music genres), the AI was perfect. For time-sensitive things (like a company's current revenue), the AI sometimes got confused because its internal knowledge was slightly outdated, but it was still very close.
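A sketch of what the resulting fusion rules might look like once chosen. The pipeline would have the LLM probe well-known entities to decide which rule fits which attribute; the two rules below (majority vote for static attributes, newest source for time-sensitive ones) and all values are illustrative.

```python
from collections import Counter

def majority_vote(values):
    """Static attributes: take the most common value across sources."""
    return Counter(values).most_common(1)[0][0]

def prefer_newest(values_by_year):
    """Time-sensitive attributes: trust the most recent source."""
    return values_by_year[max(values_by_year)]

genres = ["Rock", "Rock", "Pop"]            # conflicting static values
revenue = {2020: "100M", 2021: "120M"}      # conflicting time-stamped values

print(majority_vote(genres))    # Rock
print(prefer_newest(revenue))   # 120M
```

The limitation the authors observed falls out of this split: a majority-vote rule works regardless of the model's knowledge cutoff, but picking the "current" value requires up-to-date world knowledge the model may lack.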

The Big Reveal: Cost and Speed

The researchers ran this experiment on three real-world scenarios: Video Games, Companies, and Music.

  • The Human Team: Took about 19 hours of work per project, at a substantial cost in engineering salaries.
  • The AI Team: Took about 2 hours of computer time. It cost roughly $9 in API fees.

The Verdict:
The final "recipe books" (the integrated datasets) created by the AI were almost identical in quality to the ones made by humans.

  • They had the same number of recipes.
  • They had the same level of detail.
  • They were just as accurate.

The Catch (Limitations)

The AI isn't magic yet.

  1. It needs flat data: It struggles if the data is split across many complex, connected tables (like a messy filing cabinet vs. a neat spreadsheet).
  2. It might be "memorizing": Since the test data (like famous video games and companies) is likely in the AI's training data, it might have just "remembered" the answers rather than truly "reasoning" them out. We don't know how well it would work on secret, brand-new company data.
  3. Time Travel issues: The AI sometimes gets confused about when something happened because its knowledge cutoff is in the past, while the data might be from the future (or vice versa).

The Bottom Line

This paper proves that we can replace the expensive, slow, manual work of data engineers with an AI that acts as an "Auto-Pilot" for data integration. It's like upgrading from a team of scribes manually copying books to a high-speed photocopier that also edits the text.

For standard data tasks, the AI is now ready to take the wheel, saving companies time and money while producing results that are just as good as the human experts.