Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution

This paper proposes using homoglyphic substitution, which replaces standard characters with visually similar alternatives from different scripts, as an adversarial technique to degrade stylometric analysis and protect personal information from being inferred from written text.

Original authors: Robert Dilworth

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine your writing style is like your voice. Just as you have a unique way of speaking (your accent, your favorite words, the rhythm of your sentences), every time you type you leave behind a "digital fingerprint." The practice of analyzing that fingerprint is called stylometry.

In the past, if you wanted to hide your identity, you might just stop writing or use a pseudonym. But today, advanced computer systems can look at a random social media post, analyze your writing fingerprint, and guess:

  • How old you are.
  • Where you live.
  • Even who you are, specifically.

This paper, written by Robert Dilworth, is about how to break that fingerprint so these systems can't recognize you. The author calls this "Hijacking Text Heritage."

Here is the breakdown of the paper using simple analogies:

1. The Problem: The "Digital Mirror"

Think of your writing style as a reflection in a mirror. Usually, the mirror shows exactly who you are.

  • The Threat: Governments or corporations might steal your ID (like a passport) to know who you are. But they can also steal your writing from the internet to figure out who you are without ever seeing your ID.
  • The Goal: The author wants to "smudge the mirror" so the reflection looks like a blurry mess, making it impossible for the computer to identify the person behind the text.

2. The Solution: "Homoglyphs" (The Visual Trick)

The paper focuses on a specific trick called Homoglyph Substitution.

  • What is it? Take the letter "a" in the English (Latin) alphabet. Now imagine a letter from a different script (such as the Cyrillic alphabet used for Russian) that looks exactly like "a" but is a completely different character under the hood.
  • The Analogy: Think of it like someone wearing a perfectly realistic mask. To your eyes, the masked person looks exactly like your friend. But under a special X-ray scanner (the computer), a different skeleton shows through.
  • The Paper's Idea: By swapping some of your letters for these "look-alike" characters, you confuse the computer. The computer thinks the text is different, but a human reading it sees no difference.
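
To see the trick at the level the computer sees it, here is a minimal Python sketch. It is only an illustration and is not taken from the paper; the word "banana" and the choice of the Cyrillic "а" (U+0430) as the stand-in for Latin "a" (U+0061) are assumptions made for the example.

```python
import unicodedata

latin_a = "a"          # U+0061, the ordinary English letter
cyrillic_a = "\u0430"  # U+0430, a Cyrillic letter that renders almost identically

# Same shape on screen, different code point and different official name.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

original = "banana"
disguised = original.replace(latin_a, cyrillic_a)

# A human sees the same word twice; the computer sees two different strings.
print(original, disguised)
print("same string?", original == disguised)
print("same length?", len(original) == len(disguised))
```

The two strings have the same length and look alike in most fonts, yet any program that compares or counts characters treats them as unrelated.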

3. The "Poison" Recipe (The Experiments)

The author tested different ways to "poison" the text to break the computer's ability to identify the writer. He compared it to mixing a chemical potion:

  • The Ingredients: You can swap vowels (a, e, i, o, u) or consonants (b, c, d, etc.).
  • The Dosage: How much of the text needs to be changed?
    • Too little (12-25%): The computer still recognizes you. It's like putting a tiny smudge on a mirror; you can still see the face.
    • Just right (37.5%): This is the "sweet spot." The computer gets confused and can no longer verify who wrote the text.
    • Too much (50%+): It works, but it's more effort than you need. You don't have to change every letter to hide; changing a bit more than a third of the text is enough.
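
The "dosage" idea can be sketched in a few lines of Python. This is a rough illustration, not the author's code: the tiny `HOMOGLYPHS` table, the `substitute` function, and the sample sentence are all hypothetical, and measuring the rate over the characters that have a look-alike (rather than over all characters or over words) is an assumption made here.

```python
import random

# Hypothetical, deliberately tiny look-alike table (Latin -> Cyrillic).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}

def substitute(text, rate, seed=0):
    """Swap roughly `rate` of the swappable characters for look-alikes."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    eligible = [i for i, ch in enumerate(text) if ch in HOMOGLYPHS]
    how_many = round(rate * len(eligible))
    chosen = set(rng.sample(eligible, how_many))
    return "".join(
        HOMOGLYPHS[ch] if i in chosen else ch
        for i, ch in enumerate(text)
    )

sample = "a space probe can escape once"
poisoned = substitute(sample, 0.375)  # the "sweet spot" dose from the list above
changed = sum(a != b for a, b in zip(sample, poisoned))
print(poisoned)
print(changed, "characters swapped")
```

Raising or lowering `rate` is the knob the experiments turn: a lower value leaves more of the original characters in place, a higher value replaces more of them.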

4. Why Do This? (The "Shadow AI" Fear)

The paper argues that we need this defense because of a scary future scenario:

  • The Scenario: Imagine a social media app asks you to upload your ID to prove you are over 18. Then, it asks you to write a short essay about your fears.
  • The Trap: The company doesn't just want to check your age. They want to use your writing to build a "Shadow AI," a digital twin of your mind. They want to know your deepest thoughts so they can predict what you will buy, how you will vote, or even what crimes you might commit before you commit them.
  • The Defense: By "poisoning" your text with these look-alike letters, you are essentially burning the blueprint of your digital twin. You are telling the AI, "You can't build a model of me because my data is corrupted."

5. The "TraceTarnish" Tool

The author created a tool (a script) called TraceTarnish.

  • How it works: It takes your text and automatically swaps out letters for their "look-alike" cousins and adds invisible characters (like invisible ink) to the text.
  • The Result: The text looks normal to a human, but to a computer trying to track you, it looks like garbage data. It's like writing a letter in a language that looks like English but is actually a secret code that breaks the computer's dictionary.
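
The "invisible ink" part can be illustrated with a short Python sketch. This is not TraceTarnish itself: the `add_invisible` function, the use of the zero-width space (U+200B), and the rule of inserting one after every third letter are assumptions chosen only to show the effect on a simple word counter.

```python
from collections import Counter

ZERO_WIDTH_SPACE = "\u200b"  # assumption: one of several invisible code points

def add_invisible(text, every=3):
    """Insert a zero-width space after every `every`-th letter."""
    out = []
    letters_seen = 0
    for ch in text:
        out.append(ch)
        if ch.isalpha():
            letters_seen += 1
            if letters_seen % every == 0:
                out.append(ZERO_WIDTH_SPACE)
    return "".join(out)

plain = "the cat saw the other cat"
marked = add_invisible(plain)

# On screen the two lines look the same, but the second one is longer.
print(plain, len(plain))
print(marked, len(marked))

# A naive word counter no longer finds any of the original words.
print(Counter(plain.split()))
print(Counter(marked.split()))
```

Because Python's `split()` does not treat the zero-width space as whitespace, the hidden characters stay glued inside the words, so a tool that counts words or letter sequences sees unfamiliar tokens where a human sees ordinary text.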

Summary

The paper is a call to action for digital privacy. It suggests that in a world where computers can read your mind through your writing style, the best defense is to deliberately confuse them.

By using "Homoglyphs" (visual trickery), we can create a "firewall" made of text. We aren't hiding our words; we are just making sure the computer can't trace those words back to us. As the author says, it's like "fighting fire with fire"—using the computer's own reliance on patterns against it to protect our freedom.
