Conditioning LLMs to Generate Code-Switched Text

This paper proposes a methodology for fine-tuning Large Language Models to generate fluent English-Spanish code-switched text by leveraging back-translated parallel corpora. It shows that while traditional automatic metrics fail to correlate with human preferences, LLM-based evaluation aligns well with human judgment, and that the approach significantly advances CS text generation capabilities.

Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa

Published 2026-03-09

Imagine you are a chef trying to teach a robot how to cook a specific, delicious dish: Code-Switching.

In the human world, code-switching is when a bilingual person naturally mixes two languages in a single sentence, like saying, "I need to comprar milk because I'm hambriento." It's a natural, fluid way of speaking for millions of people. But for Artificial Intelligence (AI), this is like trying to bake a cake with two different types of flour that don't usually mix. Most AI models are trained on "pure" English or "pure" Spanish, so when they try to mix them, they often end up with a lumpy, inedible mess or just serve you a plain English cake.

This paper is about a team of researchers who figured out how to teach an AI chef to bake the perfect "mixed-language" cake. Here is how they did it, broken down into simple steps:

1. The Problem: The AI is a "Monolingual Purist"

The researchers found that even the smartest AI models (Large Language Models or LLMs) struggle to mix languages naturally. If you ask them to "write a sentence mixing English and Spanish," they often:

  • Forget to mix at all (just writing English).
  • Mix the languages awkwardly, swapping in isolated words at points where a real bilingual speaker would never switch, so the result sounds stilted rather than natural.
  • Get the grammar wrong.

It's like asking a strict chef who only knows how to make French cuisine to suddenly make a Mexican dish; they might put a taco shell on a croissant, which looks weird and tastes wrong.
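These failure modes showed up when general-purpose models were prompted directly with a handful of examples. As a rough illustration of what such a "few-shot" request looks like (the paper's actual prompt wording and demonstration pairs are not given here; the examples below are invented), a prompt might be assembled like this:

```python
# Sketch of a few-shot prompt for code-switched generation.
# The demonstration pairs are invented for illustration; the paper's
# actual prompts may differ.
FEW_SHOT_EXAMPLES = [
    ("I have to buy milk because I'm hungry.",
     "I need to comprar milk because I'm hambriento."),
    ("We are going to the party tonight.",
     "Vamos to the party tonight."),
]

def build_prompt(english_sentence: str) -> str:
    """Assemble an instruction, the demonstrations, and the new input."""
    lines = ["Rewrite the English sentence as natural English-Spanish "
             "code-switched text.\n"]
    for en, cs in FEW_SHOT_EXAMPLES:
        lines.append(f"English: {en}\nCode-switched: {cs}\n")
    lines.append(f"English: {english_sentence}\nCode-switched:")
    return "\n".join(lines)

prompt = build_prompt("The cat is running fast.")
print(prompt)
```

The model is expected to continue after the final "Code-switched:" marker; as the paper observes, without fine-tuning it often just echoes the English or switches unnaturally.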

2. The Solution: The "Back-Translation" Trick

Since there aren't enough examples of perfect mixed-language sentences to teach the AI, the researchers invented a clever workaround. They used a "reverse engineering" approach:

  • Step A: The Source Material. They started with a huge pile of real, natural mixed-language sentences found on social media (like Twitter/X).
  • Step B: The Translator. They asked a super-smart AI to translate these mixed sentences back into pure English. Think of this as taking a mixed cocktail and separating the ingredients back into pure water and pure juice.
  • Step C: The Training Pair. Now they had a perfect pair:
    • Input: Pure English sentence.
    • Output: The original natural mixed sentence.
  • Step D: The Lesson. They used these pairs to "fine-tune" (train) a new AI model. They taught the AI: "When you see this English sentence, you must output this specific mixed version."

It's like giving the AI a "fill-in-the-blank" workbook where the answer key is already there, teaching it exactly how to switch languages naturally.
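Steps A through D can be sketched in a few lines. Here `back_translate` is a stub standing in for the strong LLM the authors used to render each code-switched sentence in pure English; the corpus sentence and the record format are illustrative, not the paper's exact data:

```python
import json

# Sketch of the back-translation pipeline (Steps A-D):
# pair each natural code-switched sentence (the training OUTPUT)
# with its back-translated pure-English version (the training INPUT).

def back_translate(cs_sentence: str) -> str:
    """Stub for the LLM that translates code-switched text into pure
    English; here it just returns a canned translation for the demo."""
    canned = {
        "I need to comprar milk because I'm hambriento.":
            "I need to buy milk because I'm hungry.",
    }
    return canned[cs_sentence]

def build_training_records(cs_corpus):
    """Turn a corpus of real code-switched sentences into
    (English input -> code-switched output) fine-tuning pairs."""
    return [{"input": back_translate(cs), "output": cs}
            for cs in cs_corpus]

corpus = ["I need to comprar milk because I'm hambriento."]
records = build_training_records(corpus)
print(json.dumps(records[0], ensure_ascii=False))
```

The key trick is the direction: the natural, human-written mixed sentence is always the target, so the model learns to produce mixing it has actually seen in the wild rather than mixing it invents.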

3. The Results: The AI Learns to Dance

They tested their new AI against other models, including some very famous, massive ones (like GPT-4) that were just guessing based on a few examples (called "few-shot" prompting).

  • The Winner: The AI that was specifically trained (fine-tuned) using their new method was the clear champion. It produced mixed sentences that humans found much more natural and fluent.
  • The Losers: The massive, untrained models often failed. They tended to just write in English or mix the languages in a robotic, unnatural way.

The Analogy: Imagine a jazz band. The untrained AI is like a musician who knows the notes but can't improvise; they play the sheet music perfectly but can't "jam." The fine-tuned AI is like a seasoned jazz musician who knows exactly when to switch instruments and blend the sounds to create something new and smooth.

4. The Evaluation: Can a Robot Judge the Art?

The researchers also asked a tricky question: How do we know the AI did a good job?

Usually, we use computer programs to grade AI writing (like a math teacher checking answers). But the researchers found that these standard "math" metrics are terrible at judging mixed languages.

  • The Problem: A computer metric might give a high score to a sentence that is 100% English because it matches the "English" part of the answer key, even though the task was to mix languages. It's like a judge giving a high score to a painting because the background is blue, ignoring that the artist forgot to paint the sky.
  • The Human Factor: When real humans judged the sentences, they preferred the fine-tuned AI. They cared about the feeling and flow of the mixed language, which the computer metrics missed.
  • The "AI Judge": They tried using another AI (GPT-4) to grade the work. It was better than the math metrics, but still not perfect. It was like asking a robot to judge a human dance; it can see the moves, but it doesn't quite "feel" the rhythm.
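One way to see the grading problem concretely is with a simple "amount of mixing" score. The Code-Mixing Index (Das & Gambäck, 2014) gives any monolingual output a score of zero, no matter how fluent it is, which is exactly the behavior n-gram overlap metrics miss. A minimal sketch (per-token language tags are assumed given; a real system would need a word-level language identifier, and whether the paper itself uses CMI is not stated here):

```python
def cmi(lang_tags):
    """Code-Mixing Index: 0 for monolingual text, approaching 100 as the
    languages are mixed evenly. `lang_tags` holds one language label per
    token; 'other' marks language-independent tokens (names, numbers)."""
    dependent = [t for t in lang_tags if t != "other"]
    if not dependent:
        return 0.0
    # Count of tokens belonging to the dominant language.
    max_lang = max(dependent.count(lang) for lang in set(dependent))
    return 100.0 * (1 - max_lang / len(dependent))

# "I need to comprar milk because I'm hambriento." (8 tokens, 2 Spanish)
mixed = ["en", "en", "en", "es", "en", "en", "en", "es"]
pure_english = ["en"] * 8

print(cmi(mixed))         # → 25.0  (some mixing)
print(cmi(pure_english))  # → 0.0   (no mixing credit at all)
```

A metric like this catches the "forgot to mix" failure, but it says nothing about fluency or whether the switch points sound natural, which is why the paper still needed human and LLM judges.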

The Big Takeaway

This paper proves two main things:

  1. Training Matters: You can't just ask a smart AI to "be creative" with languages. You have to give it specific, high-quality examples (like the "back-translation" method they used) to teach it how to mix languages naturally.
  2. New Rules for Grading: We need new ways to grade AI when it comes to mixing languages. The old math-based tests don't work because they don't understand the nuance of human conversation.

In short, the researchers built a new "kitchen" and a new "recipe" that allows AI to finally speak like a real bilingual human, rather than a robot trying to sound like one.