Imagine you are a historian trying to read a dusty, damaged letter from 1850. You can't read the handwriting, so you use a robot scanner (OCR, short for optical character recognition) to turn the image into digital text.
But here's the problem: The robot is bad at reading old, smudged handwriting. It might read the name "Madison" as "Madifon" or "Madi son."
To fix this, researchers usually run the text through a "clean-up" process. They use rules, AI, or human editors to fix the mistakes. The catch is that once the text is fixed, they often throw away the "before" picture. They overwrite "Madifon" with "Madison" and pretend the mistake never happened.
This paper argues that throwing away the history of the mistake is dangerous. If you don't know what changed "Madifon" to "Madison," or why, you can't be sure whether it's actually the right name or just a lucky guess.
Here is the paper's solution, explained with simple analogies:
1. The "Edit History" for Text (Provenance)
Think of a Google Doc. When you type, you can see the "Version History." You can see who changed what, when, and why.
- Current DH (Digital Humanities) practice: It's like taking a Google Doc, fixing the typos, and then deleting the version history. You only see the final, clean text.
- This paper's idea: Keep the version history forever. Every time a word is changed, we attach a tiny digital "tag" that says:
  - What was the original word? ("Madifon")
  - What did we change it to? ("Madison")
  - Who did it? (A computer rule, an AI model, or a human?)
  - How sure are we? (Confidence score: 74%)
  - Did a human check it? (Yes/No)
This is called Provenance. It's like a "food label" for text, telling you exactly where the ingredients came from and how they were processed.
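To make the "tag" concrete, here is a minimal sketch of what one provenance record might look like in Python. The class name `CorrectionRecord` and its field names are assumptions chosen for illustration; the paper's actual schema may differ.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CorrectionRecord:
    """One hypothetical provenance tag attached to a single corrected word."""
    original: str         # what the OCR produced
    corrected: str        # what the clean-up step changed it to
    agent: str            # who made the change: "rule", "model", or "human"
    confidence: float     # how sure the agent was, from 0.0 to 1.0
    human_verified: bool  # whether a person checked the change


# The example from the text: "Madifon" corrected to "Madison" at 74% confidence.
record = CorrectionRecord(
    original="Madifon",
    corrected="Madison",
    agent="model",
    confidence=0.74,
    human_verified=False,
)

# asdict() turns the record into a plain dictionary that can be stored
# alongside the text instead of being thrown away.
print(asdict(record))
```

Marking the record as frozen means a tag cannot be quietly edited after the fact, which fits the idea of keeping the history intact.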
2. The "Filter" Analogy
The researchers tested this idea on a small collection of old historical texts. They created three different versions of the same text to see how it affected a computer program trying to find names (like "George Washington" or "Paris").
- Version A (Raw OCR): The messy, robot-read text. (Full of errors, but honest about what the robot saw).
- Version B (Fully Corrected): The text where every possible fix was applied automatically. (Looks perfect, but might have hidden mistakes).
- Version C (Provenance-Filtered): The "Smart" version. Here, the researchers said: "Only apply the fixes that the computer is very sure about, or that a human has checked."
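The three versions can be sketched as three policies applied to the same list of proposed fixes. Everything below is an illustrative assumption: the sample tokens are made up, the 0.9 cutoff stands in for "very sure" (the summary gives no number), and the function names are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for "very sure"

# Made-up sample: each token keeps its raw OCR reading plus an optional fix.
tokens = [
    {"raw": "President", "fix": None},
    {"raw": "Madifon",
     "fix": {"to": "Madison", "confidence": 0.97, "human_verified": False}},
    {"raw": "vifited",
     "fix": {"to": "visited", "confidence": 0.74, "human_verified": True}},
    {"raw": "Parts",
     "fix": {"to": "Paris", "confidence": 0.55, "human_verified": False}},
]


def accept(fix):
    """Version C rule: keep a fix if a human checked it or confidence is high."""
    return fix["human_verified"] or fix["confidence"] >= CONFIDENCE_THRESHOLD


def render(tokens, policy):
    """Build a text version, applying only the fixes the policy allows."""
    words = []
    for token in tokens:
        fix = token["fix"]
        use_fix = fix is not None and policy(fix)
        words.append(fix["to"] if use_fix else token["raw"])
    return " ".join(words)


version_a = render(tokens, lambda fix: False)  # raw OCR: apply nothing
version_b = render(tokens, lambda fix: True)   # fully corrected: apply everything
version_c = render(tokens, accept)             # provenance-filtered

print(version_a)  # President Madifon vifited Parts
print(version_b)  # President Madison visited Paris
print(version_c)  # President Madison visited Parts
```

In this toy sample, Version C leaves "Parts" alone because the proposed change to "Paris" is a low-confidence guess that nobody checked.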
The Result:
- Version B found the most names, but it also found a lot of fake names (ghosts) because the computer made wild guesses.
- Version C found almost as many real names as Version B, but it drastically reduced the fake ones.
It's like a security guard at a concert.
- Version B lets everyone in, even people who look suspicious, because they might have a ticket. (High coverage, high risk).
- Version C only lets in people with a clear ID or a VIP pass. (Slightly fewer people, but much safer).
3. Why This Matters for History
The paper found that small changes in the text can completely change the story a computer tells.
- If the computer reads "Madi son" as two separate names instead of the one person "Madison," the entire analysis of who was important in that era changes.
- By keeping the "Edit History" (Provenance), researchers can say: "Hey, this computer thinks this is a famous general, but it's only 60% sure, and no human checked it. Let's be careful before we write a history book about him."
The Big Takeaway
In the past, Digital Humanities researchers treated the "cleaned" text as the Truth.
This paper says: The "cleaned" text is just one opinion.
By treating the correction process as a first-class citizen (giving it its own spotlight), researchers can:
- Audit their work: See exactly where the computer might be lying.
- Manage uncertainty: Decide how much risk they are willing to take (e.g., "I only want to analyze text where a human has checked the names").
- Reproduce results: If another researcher wants to check their work, they can see the exact "recipe" of edits used to get the final result.
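Here is a rough sketch of how auditing and reproducing could work once the edit log is kept. The log format, the sample entries, the 0.9 review cutoff, and the function names `needs_review` and `replay` are all assumptions made for illustration, not the paper's implementation.

```python
import json

# Made-up edit log: one entry per correction, keyed by word position.
edit_log = [
    {"position": 1, "original": "Madifon", "corrected": "Madison",
     "agent": "model", "confidence": 0.60, "human_verified": False},
    {"position": 2, "original": "vifited", "corrected": "visited",
     "agent": "rule", "confidence": 0.95, "human_verified": False},
]


def needs_review(entry, min_confidence=0.9):
    """Audit: flag fixes that are both unverified and below the cutoff."""
    return not entry["human_verified"] and entry["confidence"] < min_confidence


def replay(raw_tokens, log):
    """Reproduce: rebuild the corrected text from the raw OCR plus the log."""
    tokens = list(raw_tokens)
    for entry in log:
        position = entry["position"]
        if tokens[position] != entry["original"]:
            raise ValueError(f"log does not match raw text at position {position}")
        tokens[position] = entry["corrected"]
    return tokens


raw_tokens = ["General", "Madifon", "vifited"]

# Saving the log as JSON is what lets another researcher rerun the same recipe.
saved = json.dumps(edit_log)
rebuilt = replay(raw_tokens, json.loads(saved))
flagged = [entry["original"] for entry in edit_log if needs_review(entry)]

print(rebuilt)  # ['General', 'Madison', 'visited']
print(flagged)  # ['Madifon']
```

Raising the `min_confidence` argument is one way to express a lower tolerance for risk: more fixes get flagged for a human to check.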
In short: Don't just give us the final, polished essay. Show us the scratch notes, the red pen marks, and the sticky notes so we know exactly how the story was built. That is the power of Provenance.