Comparison of Outlier Detection Algorithms on String Data

This thesis compares two outlier detection algorithms for string data—a modified Local Outlier Factor using a weighted Levenshtein distance and a new hierarchical left regular expression learner—demonstrating that the former excels when outliers differ significantly in edit distance, while the latter is superior for identifying structural deviations in expected data patterns.

Philip Maus

Published 2026-03-13

Imagine you are the manager of a massive library. Your job is to keep the shelves organized. Most of the books follow a strict rule: they all have titles in English, they all have exactly 10 letters, and they all start with a capital "A."

Suddenly, someone starts slipping random items onto the shelves: a banana, a rock, a book titled "The Great Gatsby" (which is in English but has 14 letters), and a book written in ancient Greek.

Your goal is Outlier Detection: finding the items that don't belong. This thesis by Philip Maus tackles a specific version of this problem: How do we find the "weird" items when everything is just a string of text (like words, dates, or codes), rather than numbers?

Most computer programs are great at spotting weird numbers (like a bank account balance of $1,000,000 when everyone else has $50). But they struggle with text. This paper tests two different "detectives" to see which one is better at finding the weird text.

Here is how the two detectives work, explained simply:

Detective #1: The "Crowd Watcher" (Local Outlier Factor)

The Analogy: Imagine a crowded party.

  • How it works: This detective looks at every guest and asks, "Who are your 5 closest friends?" (This is the k-nearest neighbor part).
  • The Logic: If you are standing in a dense crowd of people who all look and act similar, you are "normal." But if you are standing alone in a corner, or if your closest friends are actually very far away from you, you are an "outlier."
  • The Twist (The Weighted Levenshtein): The paper introduces a smart upgrade. Standard distance measures treat every difference as equal. Changing an "A" to a "B" costs the same as changing an "A" to a "1".
    • The Upgrade: This detective now has a hierarchy. It knows that changing a "1" to a "2" (both numbers) is a small, innocent mistake. But changing a "1" to a "Z" (number to letter) is a huge, suspicious change. It weighs the differences based on how "related" the characters are.
  • Best at: Finding items that are structurally similar but slightly "off." For example, spotting a phone number that is the right length but has the wrong mix of numbers and letters.
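The hierarchy idea can be sketched in a few lines of Python. The character classes and the 0.3/1.0 weights below are illustrative choices for the sketch, not the thesis's actual cost table:

```python
def char_cost(a, b):
    """Substitution cost based on a simple character hierarchy:
    swapping within the same class (digit->digit, letter->letter)
    is cheap; crossing classes (digit->letter) is expensive.
    The 0.3 / 1.0 weights are made up for this example."""
    if a == b:
        return 0.0
    same_class = (a.isdigit() and b.isdigit()) or (a.isalpha() and b.isalpha())
    return 0.3 if same_class else 1.0

def weighted_levenshtein(s, t):
    """Standard dynamic-programming edit distance, but with
    class-aware substitution costs. Insertions and deletions cost 1.0."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                # deletion
                d[i][j - 1] + 1.0,                                # insertion
                d[i - 1][j - 1] + char_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return d[m][n]

# A digit-for-digit swap is a small, innocent change...
print(weighted_levenshtein("12345", "12346"))  # 0.3
# ...but a digit-for-letter swap is a big, suspicious one.
print(weighted_levenshtein("12345", "1234Z"))  # 1.0
```

Plugging this distance into the k-nearest-neighbor step is what lets the Crowd Watcher treat "12346" as a close friend of "12345" while keeping "1234Z" at arm's length.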

Detective #2: The "Pattern Hunter" (Hierarchical Left Regular Expression)

The Analogy: Imagine a bouncer at a club with a very specific dress code.

  • How it works: This detective looks at all the "normal" people in the room and tries to write a rule (a Regular Expression) that describes exactly what they are wearing.
    • Example Rule: "Everyone must wear a blue shirt and black pants."
  • The Logic: Once the rule is written, anyone who doesn't fit the rule is kicked out (labeled an outlier).
  • The Twist (The Smart Selection): The problem is that if a few weird people are in the crowd, the bouncer might write a rule that fits everyone (e.g., "Wear clothes"), which is useless. So this detective uses a special algorithm to find the tightest, most specific rule that covers the majority of the crowd but excludes the weirdos. It also has a "strictness knob" (the parameter p_min) to decide what share of the crowd the rule must cover before it's accepted.
  • Best at: Finding items that break a clear, rigid structure. For example, if the normal data is always "Zip Codes" (5 digits), and the outlier is a "County Name" (letters), this detective instantly spots it because the County Name doesn't fit the "5 digits" rule.
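A toy version of the bouncer's logic can be written with a hand-picked ladder of candidate regexes, ordered from most to least specific. The candidate list and the p_min default below are illustrative; the thesis learns its patterns automatically rather than choosing from a fixed menu:

```python
import re

def majority_pattern(values, p_min=0.8, candidates=None):
    """Return the most specific candidate regex that still covers at
    least a p_min fraction of the values. The candidate ladder here is
    hand-written for the sketch, from strict to useless."""
    if candidates is None:
        candidates = [
            r"\d{5}",      # exactly five digits
            r"\d+",        # any run of digits
            r"[A-Za-z]+",  # any run of letters
            r".+",         # "wear clothes": matches everything
        ]
    for pattern in candidates:  # most specific first
        covered = sum(1 for v in values if re.fullmatch(pattern, v))
        if covered / len(values) >= p_min:
            return pattern
    return r".+"

def flag_outliers(values, p_min=0.8):
    """Anything that fails the learned majority rule is an outlier."""
    pattern = majority_pattern(values, p_min)
    return [v for v in values if not re.fullmatch(pattern, v)]

data = ["10115", "20095", "80331", "50667", "Frankfurt", "60311"]
print(majority_pattern(data))   # '\\d{5}' covers 5 of 6 values
print(flag_outliers(data))      # ['Frankfurt']
```

Note the role of p_min: at 0.8 the five-digit rule (covering 5 of 6 values) is accepted, but a stricter p_min of 0.9 would force the bouncer down the ladder to a looser, less useful rule.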

The Showdown: What Happened?

The author tested these two detectives on real-world data from German hospitals (like zip codes, dates, and phone numbers).

1. The "Zip Code vs. County Name" Test:

  • Scenario: The crowd is mostly 5-digit numbers (Zip Codes). The intruders are names like "Frankfurt" or "Berlin."
  • Winner: Detective #2 (The Pattern Hunter).
  • Why: It quickly realized, "Ah! The rule is 'Exactly 5 digits'." Any name with letters immediately fails the rule. It was perfect.
  • Detective #1 struggled a bit because some county names happen to be 5 letters long, making them look like they belong in the crowd.

2. The "Zip Code vs. Phone Number" Test:

  • Scenario: The crowd is 5-digit numbers. The intruders are phone numbers (which are also just numbers, but longer).
  • Winner: Detective #1 (The Crowd Watcher).
  • Why: The Pattern Hunter got confused. It tried to write a rule for "numbers," and since both groups are numbers, the rule became too loose.
  • Detective #1 looked at the crowd and said, "Hey, the Zip Codes are all packed tightly together in a small space. These phone numbers are standing way over there in a different spot." It spotted the distance difference perfectly.
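The distance argument can be made concrete with plain (unweighted) Levenshtein distance and a mean k-nearest-neighbor score, a crude stand-in for the full LOF computation. The sample values below are made up for the sketch:

```python
def lev(s, t):
    """Plain Levenshtein distance with unit costs, enough for this demo."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def knn_score(value, others, k=3):
    """Mean distance to the k nearest neighbors: a crude stand-in for
    LOF. A point far from its own neighborhood gets a high score."""
    dists = sorted(lev(value, o) for o in others)
    return sum(dists[:k]) / k

zips = ["10115", "20095", "80331", "50667", "60311"]
phone = "0301234567"

# Zip codes sit close to each other; the phone number is at least
# 5 edits away from every one of them (it is 5 characters longer).
print(knn_score("10115", zips[1:]))
print(knn_score(phone, zips))
```

Because both groups are all digits, a regex learner sees one blob of "numbers," but the distance view still separates them: every zip code has neighbors a few edits away, while the phone number's nearest neighbors are all far off.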

The Big Takeaway

There is no single "best" detective. It depends on the crime scene:

  • Use the Pattern Hunter (Regex) when the "normal" data has a very strict, rigid structure (like dates, zip codes, or ID numbers) and the outliers break that structure completely.
  • Use the Crowd Watcher (LOF) when the data is messy, or when the outliers are the same type of thing as the normal data but just have different lengths or slight variations (like house numbers vs. zip codes).

In summary: This paper teaches us that to clean up messy text data, we need to know our data's personality. If the data is rigid, use a rule-maker. If the data is fluid and clustered, use a crowd-watcher. By combining these tools, we can automate the cleaning of system logs, databases, and user inputs much more effectively.