Comparison of Outlier Detection Algorithms on String Data

This thesis compares two outlier detection algorithms for string data—a modified Local Outlier Factor using a weighted Levenshtein distance and a new hierarchical left regular expression learner—demonstrating that the former excels when outliers differ significantly in edit distance, while the latter is superior for identifying structural deviations in expected data patterns.

Philip Maus

Published 2026-03-13

Imagine you are the manager of a massive library. Your job is to keep the shelves organized. Most of the books follow a strict rule: they all have titles in English, they all have exactly 10 letters, and they all start with a capital "A."

Suddenly, someone starts slipping random items onto the shelves: a banana, a rock, a book titled "The Great Gatsby" (which is in English but has 14 letters), and a book written in ancient Greek.

Your goal is Outlier Detection: finding the items that don't belong. This thesis by Philip Maus tackles a specific version of this problem: How do we find the "weird" items when everything is just a string of text (like words, dates, or codes), rather than numbers?

Most computer programs are great at spotting weird numbers (like a bank account balance of $1,000,000 when everyone else has $50). But they struggle with text. This paper tests two different "detectives" to see which one is better at finding the weird text.

Here is how the two detectives work, explained simply:

Detective #1: The "Crowd Watcher" (Local Outlier Factor)

The Analogy: Imagine a crowded party.

  • How it works: This detective looks at every guest and asks, "Who are your 5 closest friends?" (This is the k-nearest neighbor part).
  • The Logic: If you are standing in a dense crowd of people who all look and act similar, you are "normal." But if you are standing alone in a corner, or if your closest friends are actually very far away from you, you are an "outlier."
  • The Twist (The Weighted Levenshtein): The paper introduces a smart upgrade. Standard distance measures treat every difference as equal. Changing an "A" to a "B" costs the same as changing an "A" to a "1".
    • The Upgrade: This detective now has a hierarchy. It knows that changing a "1" to a "2" (both numbers) is a small, innocent mistake. But changing a "1" to a "Z" (number to letter) is a huge, suspicious change. It weighs the differences based on how "related" the characters are.
  • Best at: Finding items that are structurally similar but slightly "off." For example, spotting a phone number that is the right length but has the wrong mix of numbers and letters.
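The hierarchy idea can be sketched in a few lines of Python. The character classes and the 0.3/1.0 weights below are illustrative choices for the sketch, not the thesis's actual cost table:

```python
def char_cost(a, b):
    """Substitution cost based on a simple character hierarchy:
    swapping within the same class (digit->digit, letter->letter)
    is cheap; crossing classes (digit->letter) is expensive.
    The 0.3 / 1.0 weights are made up for this example."""
    if a == b:
        return 0.0
    same_class = (a.isdigit() and b.isdigit()) or (a.isalpha() and b.isalpha())
    return 0.3 if same_class else 1.0

def weighted_levenshtein(s, t):
    """Standard dynamic-programming edit distance, but with
    class-aware substitution costs. Insertions and deletions cost 1.0."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                # deletion
                d[i][j - 1] + 1.0,                                # insertion
                d[i - 1][j - 1] + char_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return d[m][n]

# A digit-for-digit swap is a small, innocent change...
print(weighted_levenshtein("12345", "12346"))  # 0.3
# ...but a digit-for-letter swap is a big, suspicious one.
print(weighted_levenshtein("12345", "1234Z"))  # 1.0
```

Plugging this distance into the k-nearest-neighbor step is what lets the Crowd Watcher treat "12346" as a close friend of "12345" while keeping "1234Z" at arm's length.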

Detective #2: The "Pattern Hunter" (Hierarchical Left Regular Expression)

The Analogy: Imagine a bouncer at a club with a very specific dress code.

  • How it works: This detective looks at all the "normal" people in the room and tries to write a rule (a Regular Expression) that describes exactly what they are wearing.
    • Example Rule: "Everyone must wear a blue shirt and black pants."
  • The Logic: Once the rule is written, anyone who doesn't fit the rule is kicked out (labeled an outlier).
  • The Twist (The Smart Selection): The problem is that if a few weird people are in the crowd, the bouncer might write a rule that fits everyone (e.g., "Wear clothes"), which is useless. So this detective uses a special algorithm to find the tightest, most specific rule that covers the majority of the crowd but excludes the weirdos. It also has a "strictness knob" (the parameter p_min) to decide what share of the crowd the rule must cover before it's accepted.
  • Best at: Finding items that break a clear, rigid structure. For example, if the normal data is always "Zip Codes" (5 digits), and the outlier is a "County Name" (letters), this detective instantly spots it because the County Name doesn't fit the "5 digits" rule.
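A toy version of the bouncer's logic can be written with a hand-picked ladder of candidate regexes, ordered from most to least specific. The candidate list and the p_min default below are illustrative; the thesis learns its patterns automatically rather than choosing from a fixed menu:

```python
import re

def majority_pattern(values, p_min=0.8, candidates=None):
    """Return the most specific candidate regex that still covers at
    least a p_min fraction of the values. The candidate ladder here is
    hand-written for the sketch, from strict to useless."""
    if candidates is None:
        candidates = [
            r"\d{5}",      # exactly five digits
            r"\d+",        # any run of digits
            r"[A-Za-z]+",  # any run of letters
            r".+",         # "wear clothes": matches everything
        ]
    for pattern in candidates:  # most specific first
        covered = sum(1 for v in values if re.fullmatch(pattern, v))
        if covered / len(values) >= p_min:
            return pattern
    return r".+"

def flag_outliers(values, p_min=0.8):
    """Anything that fails the learned majority rule is an outlier."""
    pattern = majority_pattern(values, p_min)
    return [v for v in values if not re.fullmatch(pattern, v)]

data = ["10115", "20095", "80331", "50667", "Frankfurt", "60311"]
print(majority_pattern(data))   # '\\d{5}' covers 5 of 6 values
print(flag_outliers(data))      # ['Frankfurt']
```

Note the role of p_min: at 0.8 the five-digit rule (covering 5 of 6 values) is accepted, but a stricter p_min of 0.9 would force the bouncer down the ladder to a looser, less useful rule.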

The Showdown: What Happened?

The author tested these two detectives on real-world data from German hospitals (like zip codes, dates, and phone numbers).

1. The "Zip Code vs. County Name" Test:

  • Scenario: The crowd is mostly 5-digit numbers (Zip Codes). The intruders are names like "Frankfurt" or "Berlin."
  • Winner: Detective #2 (The Pattern Hunter).
  • Why: It quickly realized, "Ah! The rule is 'Exactly 5 digits'." Any name with letters immediately fails the rule. It was perfect.
  • Detective #1 struggled a bit because some county names happen to be 5 letters long, making them look like they belong in the crowd.

2. The "Zip Code vs. Phone Number" Test:

  • Scenario: The crowd is 5-digit numbers. The intruders are phone numbers (which are also just numbers, but longer).
  • Winner: Detective #1 (The Crowd Watcher).
  • Why: The Pattern Hunter got confused. It tried to write a rule for "numbers," and since both groups are numbers, the rule became too loose.
  • Detective #1 looked at the crowd and said, "Hey, the Zip Codes are all packed tightly together in a small space. These phone numbers are standing way over there in a different spot." It spotted the distance difference perfectly.
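The distance argument can be made concrete with plain (unweighted) Levenshtein distance and a mean k-nearest-neighbor score, a crude stand-in for the full LOF computation. The sample values below are made up for the sketch:

```python
def lev(s, t):
    """Plain Levenshtein distance with unit costs, enough for this demo."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def knn_score(value, others, k=3):
    """Mean distance to the k nearest neighbors: a crude stand-in for
    LOF. A point far from its own neighborhood gets a high score."""
    dists = sorted(lev(value, o) for o in others)
    return sum(dists[:k]) / k

zips = ["10115", "20095", "80331", "50667", "60311"]
phone = "0301234567"

# Zip codes sit close to each other; the phone number is at least
# 5 edits away from every one of them (it is 5 characters longer).
print(knn_score("10115", zips[1:]))
print(knn_score(phone, zips))
```

Because both groups are all digits, a regex learner sees one blob of "numbers," but the distance view still separates them: every zip code has neighbors a few edits away, while the phone number's nearest neighbors are all far off.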

The Big Takeaway

There is no single "best" detective. It depends on the crime scene:

  • Use the Pattern Hunter (Regex) when the "normal" data has a very strict, rigid structure (like dates, zip codes, or ID numbers) and the outliers break that structure completely.
  • Use the Crowd Watcher (LOF) when the data is messy, or when the outliers are the same type of thing as the normal data but just have different lengths or slight variations (like house numbers vs. zip codes).

In summary: This paper teaches us that to clean up messy text data, we need to know our data's personality. If the data is rigid, use a rule-maker. If the data is fluid and clustered, use a crowd-watcher. By combining these tools, we can automate the cleaning of system logs, databases, and user inputs much more effectively.