Conditioning LLMs to Generate Code-Switched Text

This paper proposes a methodology for fine-tuning Large Language Models to generate fluent English-Spanish code-switched text by leveraging back-translated parallel corpora. It shows that while traditional automatic metrics fail to correlate with human preferences, LLM-based evaluation aligns well with human judgment, and that the approach significantly advances CS text generation capabilities.

Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa

Published 2026-03-09

Imagine you are a chef trying to teach a robot how to cook a specific, delicious dish: Code-Switching.

In the human world, code-switching is when a bilingual person naturally mixes two languages in a single sentence, like saying, "I need to comprar milk because I'm hambriento." It's a natural, fluid way of speaking for millions of people. But for Artificial Intelligence (AI), this is like trying to bake a cake with two different types of flour that don't usually mix. Most AI models are trained on "pure" English or "pure" Spanish, so when they try to mix them, they often end up with a lumpy, inedible mess or just serve you a plain English cake.

This paper is about a team of researchers who figured out how to teach an AI chef to bake the perfect "mixed-language" cake. Here is how they did it, broken down into simple steps:

1. The Problem: The AI is a "Monolingual Purist"

The researchers found that even the smartest AI models (Large Language Models or LLMs) struggle to mix languages naturally. If you ask them to "write a sentence mixing English and Spanish," they often:

  • Forget to mix at all (just writing English).
  • Mix the languages awkwardly, swapping in isolated words at points where a real bilingual speaker would never switch, so the result sounds stilted rather than natural.
  • Get the grammar wrong.

It's like asking a strict chef who only knows how to make French cuisine to suddenly make a Mexican dish; they might put a taco shell on a croissant, which looks weird and tastes wrong.
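These failure modes showed up when general-purpose models were prompted directly with a handful of examples. As a rough illustration of what such a "few-shot" request looks like (the paper's actual prompt wording and demonstration pairs are not given here; the examples below are invented), a prompt might be assembled like this:

```python
# Sketch of a few-shot prompt for code-switched generation.
# The demonstration pairs are invented for illustration; the paper's
# actual prompts may differ.
FEW_SHOT_EXAMPLES = [
    ("I have to buy milk because I'm hungry.",
     "I need to comprar milk because I'm hambriento."),
    ("We are going to the party tonight.",
     "Vamos to the party tonight."),
]

def build_prompt(english_sentence: str) -> str:
    """Assemble an instruction, the demonstrations, and the new input."""
    lines = ["Rewrite the English sentence as natural English-Spanish "
             "code-switched text.\n"]
    for en, cs in FEW_SHOT_EXAMPLES:
        lines.append(f"English: {en}\nCode-switched: {cs}\n")
    lines.append(f"English: {english_sentence}\nCode-switched:")
    return "\n".join(lines)

prompt = build_prompt("The cat is running fast.")
print(prompt)
```

The model is expected to continue after the final "Code-switched:" marker; as the paper observes, without fine-tuning it often just echoes the English or switches unnaturally.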

2. The Solution: The "Back-Translation" Trick

Since there aren't enough examples of perfect mixed-language sentences to teach the AI, the researchers invented a clever workaround. They used a "reverse engineering" approach:

  • Step A: The Source Material. They started with a huge pile of real, natural mixed-language sentences found on social media (like Twitter/X).
  • Step B: The Translator. They asked a super-smart AI to translate these mixed sentences back into pure English. Think of this as taking a mixed cocktail and separating the ingredients back into pure water and pure juice.
  • Step C: The Training Pair. Now they had a perfect pair:
    • Input: Pure English sentence.
    • Output: The original natural mixed sentence.
  • Step D: The Lesson. They used these pairs to "fine-tune" (train) a new AI model. They taught the AI: "When you see this English sentence, you must output this specific mixed version."

It's like giving the AI a "fill-in-the-blank" workbook where the answer key is already there, teaching it exactly how to switch languages naturally.
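Steps A through D can be sketched in a few lines. Here `back_translate` is a stub standing in for the strong LLM the authors used to render each code-switched sentence in pure English; the corpus sentence and the record format are illustrative, not the paper's exact data:

```python
import json

# Sketch of the back-translation pipeline (Steps A-D):
# pair each natural code-switched sentence (the training OUTPUT)
# with its back-translated pure-English version (the training INPUT).

def back_translate(cs_sentence: str) -> str:
    """Stub for the LLM that translates code-switched text into pure
    English; here it just returns a canned translation for the demo."""
    canned = {
        "I need to comprar milk because I'm hambriento.":
            "I need to buy milk because I'm hungry.",
    }
    return canned[cs_sentence]

def build_training_records(cs_corpus):
    """Turn a corpus of real code-switched sentences into
    (English input -> code-switched output) fine-tuning pairs."""
    return [{"input": back_translate(cs), "output": cs}
            for cs in cs_corpus]

corpus = ["I need to comprar milk because I'm hambriento."]
records = build_training_records(corpus)
print(json.dumps(records[0], ensure_ascii=False))
```

The key trick is the direction: the natural, human-written mixed sentence is always the target, so the model learns to produce mixing it has actually seen in the wild rather than mixing it invents.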

3. The Results: The AI Learns to Dance

They tested their new AI against other models, including some very famous, massive ones (like GPT-4) that were just guessing based on a few examples (called "few-shot" prompting).

  • The Winner: The AI that was specifically trained (fine-tuned) using their new method was the clear champion. It produced mixed sentences that humans found much more natural and fluent.
  • The Losers: The massive, untrained models often failed. They tended to just write in English or mix the languages in a robotic, unnatural way.

The Analogy: Imagine a jazz band. The untrained AI is like a musician who knows the notes but can't improvise; they play the sheet music perfectly but can't "jam." The fine-tuned AI is like a seasoned jazz musician who knows exactly when to switch instruments and blend the sounds to create something new and smooth.

4. The Evaluation: Can a Robot Judge the Art?

The researchers also asked a tricky question: How do we know the AI did a good job?

Usually, we use computer programs to grade AI writing (like a math teacher checking answers). But the researchers found that these standard "math" metrics are terrible at judging mixed languages.

  • The Problem: A computer metric might give a high score to a sentence that is 100% English because it matches the "English" part of the answer key, even though the task was to mix languages. It's like a judge giving a high score to a painting because the background is blue, ignoring that the artist forgot to paint the sky.
  • The Human Factor: When real humans judged the sentences, they preferred the fine-tuned AI. They cared about the feeling and flow of the mixed language, which the computer metrics missed.
  • The "AI Judge": They tried using another AI (GPT-4) to grade the work. It was better than the math metrics, but still not perfect. It was like asking a robot to judge a human dance; it can see the moves, but it doesn't quite "feel" the rhythm.
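One way to see the grading problem concretely is with a simple "amount of mixing" score. The Code-Mixing Index (Das & Gambäck, 2014) gives any monolingual output a score of zero, no matter how fluent it is, which is exactly the behavior n-gram overlap metrics miss. A minimal sketch (per-token language tags are assumed given; a real system would need a word-level language identifier, and whether the paper itself uses CMI is not stated here):

```python
def cmi(lang_tags):
    """Code-Mixing Index: 0 for monolingual text, approaching 100 as the
    languages are mixed evenly. `lang_tags` holds one language label per
    token; 'other' marks language-independent tokens (names, numbers)."""
    dependent = [t for t in lang_tags if t != "other"]
    if not dependent:
        return 0.0
    # Count of tokens belonging to the dominant language.
    max_lang = max(dependent.count(lang) for lang in set(dependent))
    return 100.0 * (1 - max_lang / len(dependent))

# "I need to comprar milk because I'm hambriento." (8 tokens, 2 Spanish)
mixed = ["en", "en", "en", "es", "en", "en", "en", "es"]
pure_english = ["en"] * 8

print(cmi(mixed))         # → 25.0  (some mixing)
print(cmi(pure_english))  # → 0.0   (no mixing credit at all)
```

A metric like this catches the "forgot to mix" failure, but it says nothing about fluency or whether the switch points sound natural, which is why the paper still needed human and LLM judges.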

The Big Takeaway

This paper proves two main things:

  1. Training Matters: You can't just ask a smart AI to "be creative" with languages. You have to give it specific, high-quality examples (like the "back-translation" method they used) to teach it how to mix languages naturally.
  2. New Rules for Grading: We need new ways to grade AI when it comes to mixing languages. The old math-based tests don't work because they don't understand the nuance of human conversation.

In short, the researchers built a new "kitchen" and a new "recipe" that allows AI to finally speak like a real bilingual human, rather than a robot trying to sound like one.