Interpretable Predictability-Based AI Text Detection: A Replication Study

This paper presents a replication and extension of the AuTexTification 2023 authorship attribution system, demonstrating that integrating newer multilingual models with document-level stylometric features and SHAP analysis improves detection performance across languages while highlighting the critical need for clear documentation to ensure reliable replication.

Adam Skurla, Dominik Macko, Jakub Simko

Published 2026-03-17

Imagine you are a detective trying to solve a mystery: Who wrote this story? Was it a human with a unique voice, or was it a sophisticated robot (AI) mimicking a human?

This paper is like a "detective's field report" where the authors tried to rebuild and improve a specific tool used to solve this mystery. They looked at a system from a 2023 competition, tried to copy it exactly, found some things were missing or broken, and then built a better, more transparent version.

Here is the breakdown of their journey, using simple analogies:

1. The Mission: Copying the Blueprint (Replication)

The authors started by trying to rebuild a machine built by other researchers in 2023. Think of it like trying to bake a cake using a recipe and a photo of the finished product, but without the original chef's kitchen.

  • The Problem: When they tried to bake the cake, it didn't taste exactly the same. Why?
    • Missing Ingredients: Some of the original "flour" (specific AI models for Spanish) had disappeared from the internet. They had to use a substitute.
    • Different Ovens: The original recipe said to bake "until done," but the code they found baked for a fixed time, so the results came out differently.
    • Measurement Errors: The way they counted the "sugar" (linguistic features) was slightly different because the measuring cups (software tools) had changed over time.
  • The Lesson: You can't truly copy a scientific experiment unless the original team shares every single detail, down to the exact version of the software they used.
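That "exact version" lesson is easy to act on in practice. A minimal sketch (the function name is mine, not the paper's; it assumes a Python toolchain, which is typical for this kind of system) that records the interpreter and library versions so a later replication can match them:

```python
import sys
from importlib import metadata

def environment_snapshot(packages):
    """Record interpreter and package versions for a replication log."""
    snapshot = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot

# Example: log the libraries a detection pipeline depends on.
snap = environment_snapshot(["pip"])
print(snap)
```

Shipping a log like this alongside the code would have spared the authors the guesswork about which "measuring cups" the original team used.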

2. The Upgrade: New Engines and Better Magnifying Glasses

Once they had a working machine, they decided to upgrade it. They wanted to see if newer technology could do a better job.

  • The Engine (The AI Models): The original machine used an old engine (GPT-2) to guess how "predictable" a sentence was. The authors swapped this for newer, smarter engines (like Qwen and mGPT).
    • Analogy: Imagine the old engine was a bicycle trying to keep up with a car. The new engines are high-performance sports cars. They can spot the subtle differences between human and AI writing much faster and more accurately.
  • The Magnifying Glass (Stylometric Features): The original tool looked at the text through a standard magnifying glass. The authors added 26 new lenses (features) to look at the text more closely.
    • They started counting things like: How many exclamation points are there? How long are the sentences? How repetitive are the words?
    • Analogy: If the AI is a forger, it might write perfect grammar, but it might use the same sentence structure over and over, or avoid complex emotional words. These new lenses catch those tiny "tells" that the old tool missed.
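The "predictability" those engines measure is usually perplexity: how surprised a language model is by each next word. A toy sketch with a hand-built bigram model (the corpus and model here are invented for illustration; the paper's system uses neural language models like GPT-2 for this):

```python
import math
from collections import Counter, defaultdict

def bigram_model(corpus_tokens):
    """Build next-word probabilities from a tiny corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return {p: {w: c / sum(cs.values()) for w, c in cs.items()}
            for p, cs in counts.items()}

def perplexity(model, tokens, floor=1e-6):
    """Lower perplexity = more predictable text (a common AI 'tell')."""
    logp = sum(math.log(model.get(p, {}).get(n, floor))
               for p, n in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))

corpus = "the cat sat on the mat the cat sat on the rug".split()
model = bigram_model(corpus)
# Text the model has seen is far more predictable than shuffled text.
seen = perplexity(model, "the cat sat on the mat".split())
novel = perplexity(model, "the mat sat on the cat".split())
```

Swapping the bicycle for a sports car simply means replacing this toy bigram model with a large neural one whose probability estimates are far sharper.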
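The "lenses" can also be sketched in a few lines. This is an illustrative set of document-level stylometric signals (the function name and exact feature choices are mine, not the paper's 26 features):

```python
import re

def stylometric_features(text):
    """Compute a few simple document-level style signals."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[\w']+", text.lower())
    return {
        "exclamations": text.count("!"),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Type-token ratio: low values mean repetitive vocabulary.
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

feats = stylometric_features("Great news! We won. We won again. We won once more.")
```

In the example above, the repeated "We won" drags the type-token ratio down, which is exactly the kind of tell these lenses are meant to catch.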

3. The Result: One Tool for All Languages

The original system needed one tool for English and a different one for Spanish. It was like having two different keys for two different locks.

  • The Breakthrough: The authors created a "Master Key" (a multilingual configuration). They found a single setup that worked just as well for English as it did for Spanish.
  • Why it matters: This saves time and money. Instead of building a custom detector for every language in the world, we can use one smart, adaptable detector for many languages.

4. The "Why" (Interpretability)

Most AI detectors are "Black Boxes." You put text in, and it spits out a result, but you have no idea why. It's like a judge saying "Guilty" without explaining the evidence.

  • The SHAP Analysis: The authors used a technique called SHAP (which stands for "SHapley Additive exPlanations"). Think of this as a highlighter pen that lights up exactly which parts of the text made the AI suspicious.
  • The Discovery: They found that the new "lenses" (the 26 extra features) were actually very important. The AI wasn't just guessing; it was paying attention to things like sentence length and word variety. This makes the system trustworthy because we can see the evidence.
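The core idea behind SHAP, Shapley values, can be computed exactly for tiny models. A minimal sketch (the toy detector and feature choices are mine, not the paper's): each feature's value is its average marginal contribution over every order in which features can be "revealed," and the values sum exactly to the gap between the prediction and a baseline. That is the "additive" part of the name.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal
    contribution over all orders of revealing the features."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        for i in order:
            before = predict(current)
            current[i] = x[i]  # reveal feature i's true value
            phi[i] += (predict(current) - before) / len(orders)
    return phi

# Toy "detector score": sentence length weighs more than exclamations.
def score(features):
    avg_len, exclaims = features
    return 0.5 * avg_len + 0.1 * exclaims

x = [20.0, 3.0]          # this document's features
baseline = [12.0, 1.0]   # an assumed "average document" reference
phi = shapley_values(score, x, baseline)
# Additivity: phi sums to score(x) - score(baseline).
```

For real models with many features this exact enumeration blows up factorially, which is why the SHAP library relies on sampling and model-specific approximations; the additivity property, though, is what lets the authors read off which of the 26 lenses drove each verdict.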

Summary: What Did They Learn?

  1. Reproducibility is hard: If scientists don't share their exact code and settings, others can't verify their work. Small details matter.
  2. Newer is better: Using modern AI models to analyze text makes detection much stronger.
  3. Style matters: Even with powerful AI, looking at the "style" of writing (how sentences are built, punctuation, etc.) is still a superpower for catching fakes.
  4. One size fits all: We can build one smart detector that works for multiple languages without needing to tweak it for each one.

In a nutshell: The authors took an old, slightly broken detective kit, fixed the missing parts, added high-tech lenses, and created a single, transparent tool that can catch AI writers in both English and Spanish, while showing us exactly how it caught them.
