A Catalog of Data Errors


This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are the head chef of a massive, bustling restaurant. Your kitchen is your database, and the ingredients you use are your data. You have recipes (algorithms) and customers (business decisions) that rely on your dishes being perfect.

But sometimes, things go wrong in the kitchen. Maybe you're out of salt, maybe you accidentally put sugar in the soup, or maybe you have three jars of the exact same tomato sauce sitting on the shelf when you only needed one.

This paper, "A Catalog of Data Errors," is essentially a massive, organized cookbook of kitchen disasters. The authors (a team of data experts) realized that while everyone knows food can go bad, nobody had a single, complete list of every way it could go wrong, especially with the rise of AI (which is like a robot chef that gets very confused if the ingredients are slightly off).

Here is the breakdown of their "Catalog of Disasters" in simple terms:

1. The Three Big Categories of Mess

The authors sorted all possible data mistakes into three buckets, just like sorting kitchen problems:

🕳️ The "Missing" Bucket (Missing Data):
- The Analogy: You open a recipe card, and the line for "eggs" is blank. Or, you have a list of 100 customers, but 10 of them are just empty ghosts.
- The Real Problem: Sometimes the data is just gone (a missing value). Sometimes it's there but hidden behind a fake name like "Unknown" or "-99" (a Disguised Missing Value). It's like a customer saying "I'm fine" when they are actually starving.
- The Bias Trap: If your restaurant only serves lunch to people in one neighborhood, your data is Biased. You think you serve everyone, but you're actually missing half the city.
🍳 The "Wrong" Bucket (Incorrect Data):
- The Analogy: You put ketchup on a steak when the recipe called for mustard. Or, you wrote "100" for the price, but you meant "10.00".
- The Real Problem: This is the biggest bucket. It includes:
  - Typos & Misspellings: Writing "Jhon" instead of "John."
  - Word Swaps: Writing "Bond, James" instead of "James Bond."
  - Out-of-Date Info: Listing a customer's address from 2010 when they moved in 2023.
  - The "Outlier": One customer who ordered 5,000 burgers when everyone else ordered one. Is it a mistake, or a real event? It's hard to tell.
  - Rule Breakers: An employee who is listed as "Manager" but has no manager above them, breaking the company's hierarchy rules.
📦 The "Too Much" Bucket (Redundant Data):
- The Analogy: You have three jars of the same tomato sauce labeled slightly differently ("Tomato Sauce," "Sauce, Tomato," "Tomato Sauce (Fresh)").
- The Real Problem: This is Duplicate Data. The computer sees them as three different things, but they are actually the same customer or product. It wastes space and confuses the robot chef. It also includes Irrelevant Data, like having a jar of pickles in a bakery database when you only sell bread.
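
To make the tomato sauce example concrete, here is a minimal Python sketch of one way to spot such near-duplicate labels. It is an illustration, not the paper's method: the helper names `normalize_label` and `group_duplicates` are made up for this example, and the rule that parenthetical qualifiers such as "(Fresh)" can be ignored is an assumption that will not hold for every dataset.

```python
import re
from collections import defaultdict

def normalize_label(label: str) -> str:
    """Reduce a label to a canonical key: lowercase, no qualifiers, sorted words."""
    # Assumption: text in parentheses is a qualifier that does not change identity.
    text = re.sub(r"\([^)]*\)", " ", label.lower())
    words = re.findall(r"[a-z0-9]+", text)  # split on punctuation and whitespace
    return " ".join(sorted(words))

def group_duplicates(labels):
    """Group labels that share a canonical key; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for label in labels:
        groups[normalize_label(label)].append(label)
    return {key: members for key, members in groups.items() if len(members) > 1}

jars = ["Tomato Sauce", "Sauce, Tomato", "Tomato Sauce (Fresh)", "Pickles"]
duplicates = group_duplicates(jars)
print(duplicates)
```

All three tomato sauce labels collapse to the same key, while "Pickles" stays on its own. Real duplicate detection usually needs fuzzier matching than this, but the idea of comparing normalized forms rather than raw strings is the same.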

2. Why Do We Need This Catalog?

Before this paper, data experts were like mechanics who knew a car was broken but couldn't agree on what was broken. One guy called it a "leak," another called it a "pressure issue," and a third called it a "fluid loss." They were talking about the same problem but using different names.

This paper says: "Let's stop arguing about names and start fixing the car."

It gives everyone a common language: Now, if a data scientist says, "We have a Disguised Missing Value," everyone knows exactly what that looks like.
It helps the AI: Modern AI is like a brilliant but literal robot. If you feed it "Unknown" for a salary, it might think the salary is actually zero. This catalog teaches us how to spot these tricks so the AI doesn't get confused.
It saves money: The paper mentions that bad data costs the US billions of dollars a year. That's like throwing away millions of dollars of food every year because the kitchen wasn't organized.
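
The "Unknown" salary problem above can be sketched in a few lines of Python. This is a hedged illustration only: the sentinel list `SENTINELS` and the helper `parse_salary` are hypothetical, and every real dataset needs its own list of placeholder tokens.

```python
from statistics import mean

# Hypothetical placeholder tokens that stand in for "no value recorded".
SENTINELS = {"unknown", "n/a", "-99", ""}

def parse_salary(raw):
    """Return a float, or None if the field is an explicit or disguised missing value."""
    if raw is None or raw.strip().lower() in SENTINELS:
        return None
    return float(raw)

raw_salaries = ["52000", "Unknown", "48000", "-99", "50000"]

# Naive reading: "Unknown" becomes 0 and "-99" is taken literally.
naive = [0.0 if r == "Unknown" else float(r) for r in raw_salaries]
naive_mean = mean(naive)

# Sentinel-aware reading: disguised missing values are excluded.
parsed = [parse_salary(r) for r in raw_salaries]
clean_mean = mean(s for s in parsed if s is not None)

print(naive_mean, clean_mean)
```

The naive average is dragged far below the sentinel-aware one, which is the kind of silent distortion a literal-minded model would learn from.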

3. The "Error Indicators" (The Smoke Alarms)

The authors also point out that sometimes you don't see the fire (the error) directly, but you see the smoke.

Example: If you see a salary of $205,000 when everyone else makes $50,000, that's an Outlier. It's a smoke alarm. It doesn't mean the data is definitely wrong (maybe that person is a genius), but it tells you, "Hey, check this out!"
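
As a rough sketch of such a smoke alarm, the Python snippet below flags values that sit far from the median. The technique is a modified z-score based on the median absolute deviation, which is a common statistical heuristic chosen for this illustration and not something the summary attributes to the paper. The function name `flag_outliers`, the sample salaries, and the threshold of 3.5 are assumptions.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score (based on the MAD) exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]

salaries = [48000, 50000, 52000, 49000, 51000, 205000]
suspects = flag_outliers(salaries)
print(suspects)
```

The function only raises the alarm. Whether the flagged salary is a data entry mistake or a genuinely well-paid employee still needs a human look.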

4. The Bottom Line

Think of this paper as the ultimate "User Manual for Dirty Data."

In the past, if you found a mistake in your data, you had to guess how to fix it. Now, thanks to this catalog, you can look up the specific type of mess you have (e.g., "Oh, this is a Cyclic Dependency where Employee A manages Employee B, who manages Employee A"), and you have a much clearer idea of which kind of tool can clean it up.
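
For the Cyclic Dependency case, a minimal Python sketch of a check might look like this. It assumes each employee has at most one manager, stored in a hypothetical `manager_of` dictionary; the function name `find_management_cycle` is also invented for the example.

```python
def find_management_cycle(manager_of):
    """Return one cycle as a list of employees, or None if the hierarchy is acyclic."""
    for start in manager_of:
        path = []      # employees visited on this walk, in order
        seen_at = {}   # employee -> position in path
        current = start
        while current is not None and current not in seen_at:
            seen_at[current] = len(path)
            path.append(current)
            current = manager_of.get(current)  # None once the top is reached
        if current is not None:
            # We walked back onto someone already on this path: a cycle.
            return path[seen_at[current]:]
    return None

broken = {"A": "B", "B": "A", "C": "A"}
healthy = {"A": "B", "B": None, "C": "B"}
print(find_management_cycle(broken))
print(find_management_cycle(healthy))
```

The walk restarts from every employee, so it is quadratic in the worst case. That is fine for a sketch, though a production check would remember which employees were already cleared.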

It turns the chaotic, messy world of real-world data into a structured, manageable list of problems, making it easier for humans and machines to work together to build a cleaner, smarter future.

1. Problem Statement

Data errors are pervasive in real-world relational databases (RDBs) and severely degrade the performance of downstream applications, including machine learning pipelines and business analytics. While data quality (DQ) is widely recognized as critical, existing taxonomies of data errors suffer from several limitations:

Fragmentation and Incompleteness: Existing classifications are often informal, cover only subsets of error types, or lack a unified framework.
Terminological Inconsistency: Different works use conflicting terms for the same errors (e.g., "contradiction" used for both FD violations and duplicates) or the same term for different errors.
Lack of Formal Definitions: Many error types lack rigorous mathematical definitions, making automated detection and correction difficult.
Emerging Gaps: Traditional taxonomies often overlook modern error types relevant to AI, such as statistical biases, disguised missing values, and out-of-vocabulary (OOV) words.
Confusion between Errors and Indicators: Existing literature often conflates actual data errors with "error indicators" (statistical or logical patterns that suggest an error but require human judgment).

The paper addresses the need for a comprehensive, formal, and unified catalog that distinguishes between actual data errors and error indicators, covering the full spectrum of issues found in tabular data.

2. Methodology

The authors developed the catalog through a systematic review and consolidation process:

Source Material: The foundation was built upon five existing major taxonomies [44, 61, 66, 90, 102].
Screening and Consolidation: The authors screened these works, resolved terminological inconsistencies (e.g., renaming "Violation of Company and Government Regulations" to "Legal Rule Violations"), and merged overlapping concepts.
Extension: They extended the existing lists by:
1. Identifying subtypes and variants of known errors.
2. Adding newly emerged error types relevant to modern data ecosystems (e.g., OOV words, biased data).
Classification Framework: The authors established a formal notation for RDBs (relations, tuples, attributes, and real-world mappings $M(e)$) to define errors rigorously. They classified the final list of 35 items into three mutually exclusive categories based on error manifestation:
1. Missing: Data that should be present but is absent.
2. Incorrect: Data that is present but does not accurately represent the real-world entity.
3. Redundant: Data that is duplicated or unnecessary.
Granularity and Context: Each error is mapped to specific granularity levels (Value, Tuple, Attribute, Relation, DB, Multi-DB) and contexts (Syntactic vs. Semantic).
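
One way to picture the classification framework above is as a small data structure with one record per catalog item. The Python sketch below is an assumption about how a practitioner might encode it, not the paper's notation. The class and field names are hypothetical, and the granularity and context assigned to each sample entry are illustrative guesses, except that this summary does place Typos at the Value level, FD violations at the Relation level, and outliers among the error indicators.

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    MISSING = "missing"
    INCORRECT = "incorrect"
    REDUNDANT = "redundant"

class Granularity(Enum):
    VALUE = "value"
    TUPLE = "tuple"
    ATTRIBUTE = "attribute"
    RELATION = "relation"
    DB = "db"
    MULTI_DB = "multi-db"

class Context(Enum):
    SYNTACTIC = "syntactic"
    SEMANTIC = "semantic"

@dataclass(frozen=True)
class CatalogEntry:
    name: str
    category: Category
    granularity: Granularity
    context: Context
    is_indicator: bool = False  # True for patterns that only hint at an error

# Sample entries only; context values and some granularities are guesses.
CATALOG = [
    CatalogEntry("Typo", Category.INCORRECT, Granularity.VALUE, Context.SYNTACTIC),
    CatalogEntry("Functional Dependency Violation", Category.INCORRECT,
                 Granularity.RELATION, Context.SEMANTIC),
    CatalogEntry("Outlier", Category.INCORRECT, Granularity.VALUE,
                 Context.SEMANTIC, is_indicator=True),
    CatalogEntry("Disguised Missing Value", Category.MISSING,
                 Granularity.VALUE, Context.SEMANTIC),
    CatalogEntry("Duplicate Tuples", Category.REDUNDANT,
                 Granularity.RELATION, Context.SEMANTIC),
]

def entries_in(category):
    """Names of all sample entries that belong to the given category."""
    return [entry.name for entry in CATALOG if entry.category is category]

print(entries_in(Category.INCORRECT))
```

Because each entry carries exactly one category, the "mutually exclusive" property is enforced by the structure itself.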

3. Key Contributions

The paper makes three primary contributions to the field of Data Quality:

A Comprehensive Catalog of 35 Error Types:
The authors present a unified list of 35 distinct data error types and error indicators. Unlike previous works, this catalog explicitly distinguishes between Data Errors (mismatches between database and reality) and Error Indicators (patterns like outliers or bias that hint at errors).
- Missing Data: Includes Explicit Missing Values, Disguised Missing Values (DMVs), Partially Empty Tuples/Attributes, Missing Tuples, Empty Attributes, and Biased Data.
- Incorrect Data: Includes Invalid Values/Tuples, Textual Errors (OOV words, Misspellings, Typos, Misscans, Incorrect Encoding), Synonyms, Word Transpositions, Misfielded Values, Noise, Semantically Ambiguous Data, Outliers, Syntax Violations, Heterogeneous Formatting, Incorrect Units, Incorrect References, and various Rule Violations (Constraint, Domain, Uniqueness, Dependency, FD, CFD, Cyclic, Business, DBA, Legal). It also covers Outdated Data.
- Redundant Data: Includes Duplicate Tuples and Irrelevant Data.
Formal Definitions and Examples:
For every error type, the paper provides:
- A formal mathematical definition using set theory and mapping functions (e.g., defining a DMV as a value $v$ where $v \in dom(a)$ but $M(v) = \perp$).
- Practical examples derived from a running "Employment" database scenario, illustrating how the error manifests in real-world data.
Terminological Resolution and Scope Clarification:
The paper resolves conflicts in existing literature (e.g., clarifying that "contradiction" is not a standalone error type but a symptom of FD violations or duplicates). It also clearly delineates the scope, focusing on RDB base data while briefly discussing metadata errors and related data characteristics (like "inaccessibility" or "doubtful credibility") as distinct from actual data errors.

4. Results

Unified Taxonomy: The resulting catalog consolidates 35 error types and error indicators into a non-overlapping hierarchy.
Granularity Mapping: The authors successfully mapped each error type to the level at which it is detected (e.g., Functional Dependency violations are detected at the Relation level, while Typos are detected at the Value level).
Contextual Differentiation: The framework successfully separates syntactic errors (formatting, encoding) from semantic errors (meaning, logic, bias), providing a clearer path for detection strategies.
Gap Identification: The catalog highlights under-researched areas, such as "Disguised Missing Values" and "Biased Data," which are critical for AI/ML but often ignored in traditional database cleaning.
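
The Granularity Mapping result above says that Functional Dependency violations surface at the Relation level. A minimal Python sketch shows why: the check has to compare tuples with each other, so no single value can be judged alone. The function name `fd_violations` and the `zip`/`city` columns are hypothetical and not taken from the paper's Employment example.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """For the dependency lhs -> rhs, return each lhs value that maps to 2+ rhs values."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

employees = [
    {"name": "Ada",   "zip": "10115", "city": "Berlin"},
    {"name": "Ben",   "zip": "10115", "city": "Berlin"},
    {"name": "Chloe", "zip": "80331", "city": "Munich"},
    {"name": "Dev",   "zip": "80331", "city": "Hamburg"},
]

violations = fd_violations(employees, "zip", "city")
print(violations)
```

Each row looks plausible in isolation. The conflict for zip `80331` only appears once the whole relation is scanned, and the check cannot say which of the two cities is the wrong one.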

5. Significance

This paper serves as a foundational reference for both researchers and practitioners in Data Quality and Data Cleaning:

For Practitioners (Data Engineers, Scientists): It provides a checklist for implementing validation rules. By knowing the specific error type (e.g., distinguishing a "Typo" from a "Synonym"), practitioners can select the appropriate cleaning tool or algorithm (e.g., spell-checkers vs. entity resolution).
For Researchers: It identifies gaps in current tooling. For instance, while many tools detect missing values, few effectively handle "Disguised Missing Values" or "Biased Data." The catalog directs future research toward these under-explored areas.
Standardization: By providing formal definitions, the paper enables the development of standardized benchmarks and evaluation metrics for data cleaning tools.
AI Readiness: The inclusion of statistical error indicators (bias, outliers) bridges the gap between traditional database integrity and modern machine learning data requirements, ensuring data is fit for AI training.
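
The practitioner point above, that a "Typo" and a "Synonym" call for different tools, can be illustrated with a short Python sketch. It uses the standard library's `difflib.get_close_matches` as a stand-in for a spell-checker and a hand-written lookup table as a stand-in for a heavier resolution step. The vocabulary, the synonym table, the function name `resolve`, and the 0.8 cutoff are all assumptions made for this example.

```python
from difflib import get_close_matches

# Hypothetical controlled vocabulary and synonym table for a job-title column.
VOCABULARY = ["Lawyer", "Engineer", "Teacher"]
SYNONYMS = {"Attorney": "Lawyer"}

def resolve(value, cutoff=0.8):
    """Map a raw value to a canonical term and report which kind of fix was needed."""
    if value in VOCABULARY:
        return value, "exact"
    if value in SYNONYMS:
        # String similarity cannot find this; it needs outside knowledge.
        return SYNONYMS[value], "synonym"
    close = get_close_matches(value, VOCABULARY, n=1, cutoff=cutoff)
    if close:
        return close[0], "typo"
    return None, "unknown"

for raw in ["Engineer", "Enginer", "Attorney", "Wizard"]:
    print(raw, "->", resolve(raw))
```

"Enginer" is close enough in spelling to be repaired by similarity alone, while "Attorney" shares almost no characters with "Lawyer" and would be missed without the lookup table.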

In conclusion, "A Catalog of Data Errors" moves the field from ad-hoc, informal error handling to a systematic, formalized approach, enabling more robust data cleaning strategies and higher-quality data for decision-making and AI applications.