RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts

This paper introduces RILEC, a large-scale dataset and a generative framework for detecting and creating L1 Russian interference errors in English learner texts, demonstrating that models fine-tuned on this augmented data significantly improve the identification of specific error types like transliteration and tense misuse.

Darya Kharlamova, Irina Proskurina

Published Tue, 10 Ma

Imagine you are trying to learn to play the piano, but your brain is secretly trying to play the violin at the same time. Every time you press a piano key, your muscle memory from the violin makes your finger slip to the wrong note. In the world of language learning, this is called L1 Interference. It's when your native language (like Russian) "hacks" your second language (English), causing you to make specific, predictable mistakes.

For example, a Russian speaker might write "If we will have enough time" instead of "If we have enough time," because Russian conditional clauses use the future tense where English does not. Or they might write "cassa" instead of "cashier" because they are literally spelling the Russian word with English letters.

This paper introduces a new tool called RILEC to help teachers and students catch these specific "violin slips" in English essays. Here is how they built it, explained simply:

1. The Problem: Not Enough "Mistake" Data

To teach a computer to spot these specific Russian-style errors, you need a massive library of examples. But real student essays are hard to find, and even harder to find where someone has carefully labeled why the mistake happened. It's like trying to teach a doctor to diagnose a rare disease when you only have five patient files.

2. The Solution: The "Mistake Factory" (RILEC)

The authors built RILEC (Russian L1 Interference Learner English Corpus). Think of this as a giant, high-tech "Mistake Factory."

They started with a real collection of essays from Russian students (about 6,000 sentences). But they knew that wasn't enough to train a super-smart AI. So, they invented three different ways to manufacture new, realistic mistakes to fill the gaps:

  • The "Robot Coach" (PPO-Optimized Models): Imagine a robot that has read thousands of essays. They trained this robot using a special technique called PPO (think of it as a video game reward system). Every time the robot made a mistake that looked like a real Russian learner's error, it got a "gold star." If it made a normal mistake, it got nothing. Over time, the robot learned to generate thousands of new sentences that sound exactly like a Russian learner struggling with English.
  • The "Rule Book" (Rule-Based): For some very specific errors (like mixing up tenses or spelling Russian words with English letters), they wrote a strict set of instructions. It's like a mad-libs game: "Take this sentence, find the year, and change the verb to the wrong tense." This ensures they get plenty of examples for the tricky, rule-heavy errors.
  • The "Creative Writer" (LLM Prompting): They asked a very smart AI (like a creative writing assistant) to look at a real mistake and say, "Okay, make up a new story that uses this exact same mistake." This helped them generate errors that were more natural and varied.
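The "Rule Book" idea can be sketched in a few lines. This is a minimal, hypothetical illustration of rule-based error injection, not the authors' actual rules: the function names, regex patterns, and the tiny transliteration lexicon are all assumptions made for the example.

```python
import re

def inject_tense_error(sentence: str) -> str:
    """Force future tense inside an if-clause, mimicking Russian
    conditionals ('If we have time' -> 'If we will have time')."""
    return re.sub(r"\b([Ii]f \w+) (have|go|come|see)\b", r"\1 will \2", sentence)

def inject_transliteration(sentence: str, lexicon: dict[str, str]) -> str:
    """Swap an English word for a transliterated Russian equivalent
    (e.g. 'cashier' -> 'cassa')."""
    for english, translit in lexicon.items():
        sentence = re.sub(rf"\b{english}\b", translit, sentence)
    return sentence

if __name__ == "__main__":
    s = "If we have enough time, we will pay the cashier."
    s = inject_tense_error(s)
    s = inject_transliteration(s, {"cashier": "cassa"})
    print(s)  # If we will have enough time, we will pay the cassa.
```

Because each rule is deterministic, the injected error comes pre-labeled with its type, which is exactly what makes this method useful for the rare, rule-heavy error categories.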

By combining these three methods, they expanded their dataset from 6,000 sentences to 18,000+ sentences. It's like turning a small seed into a massive forest of examples.

3. The Result: A Super-Spotter

Once they had this massive library of "Mistake Factory" data, they trained a new AI model to be a Super-Spotter.

  • Before: If you showed a computer a student essay, it might say, "This sentence is wrong," but it wouldn't know why. It's like a teacher saying, "You got this wrong," without explaining the rule.
  • After: The new model, trained on RILEC, can say, "You used the wrong tense because you are thinking in Russian," or "You spelled 'cashier' as 'cassa' because of transliteration."
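The before/after difference boils down to the shape of the output. A minimal sketch of what a RILEC-style labeled prediction might look like, assuming a span-plus-error-type annotation scheme (the label names and data structure here are illustrative, not the paper's actual format):

```python
from dataclasses import dataclass

@dataclass
class InterferenceError:
    span: str          # the flagged text
    error_type: str    # e.g. "TENSE_SEMANTICS", "TRANSLITERATION"
    explanation: str   # why the L1 caused it

def explain(errors: list[InterferenceError]) -> list[str]:
    """Turn structured predictions into teacher-friendly feedback."""
    return [f"'{e.span}': {e.error_type} - {e.explanation}" for e in errors]

found = [
    InterferenceError("will have", "TENSE_SEMANTICS",
                      "Russian conditionals use the future tense"),
    InterferenceError("cassa", "TRANSLITERATION",
                      "Russian word spelled with Latin letters"),
]
print("\n".join(explain(found)))
```

A plain grammar checker stops at the span; the interference-aware model fills in the other two fields, which is what turns "you got this wrong" into an explanation.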

4. Why This Matters

The paper found that this "Mistake Factory" approach worked incredibly well.

  • Accuracy: The model got very good at spotting specific error types like Transliteration (spelling Russian words with English letters) and Tense Semantics (using a tense that doesn't match the intended time), scoring over 90% accuracy on those.
  • The "Human" Touch: Even though the data was made by machines, the mistakes felt real. The model learned the style and logic of a Russian learner, not just random typos.

The Big Picture

Think of this research as building a specialized translator for errors. Instead of just fixing the grammar, it translates the student's brain back to the teacher. It explains, "Ah, I see what happened! You were thinking in Russian, and your brain translated 'cassa' directly."

This helps teachers give better feedback and helps students understand why they are making mistakes, rather than just being told they are wrong. It turns the frustrating process of learning a new language into a clearer, more logical journey.