One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations

This paper demonstrates that Sparse Autoencoder (SAE) features in Gemma models capture abstract semantics rather than surface orthography: identical Serbian sentences written in two completely different scripts (Latin and Cyrillic), which tokenize into entirely different sequences, activate highly overlapping features, and this script invariance grows stronger with model scale.

Sripad Karne

Published Wed, 11 Ma

Imagine you have a super-smart robot that reads books. You want to know: Does this robot understand the story, or is it just memorizing the letters?

To find out, the researchers in this paper set up a clever experiment using the Serbian language. Here is the breakdown of what they did and what they found, explained simply.

1. The Perfect Test: Two Ways to Write the Same Thing

Serbian is unusual: it is written in two different scripts (alphabets) interchangeably: Latin (like English: A, B, C) and Cyrillic (like Russian: А, Б, В).

  • The Magic: You can translate a sentence from Latin to Cyrillic perfectly. The meaning stays 100% the same, but the letters look completely different.
  • The Catch: To a computer (specifically a Large Language Model), these two scripts look like two totally different languages. The computer breaks them down into different "chunks" (tokens) and has no idea that "Hello" in Latin is the same as "Hello" in Cyrillic.

The Analogy: Imagine you have a song.

  • Script A is the sheet music written in standard notes.
  • Script B is the same song written in a secret code of emojis.
  • To a human, it's the same song. To a robot that only reads notes, the emoji version looks like gibberish.
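You can see the "two different languages" problem directly in code. Here is a tiny sketch (not from the paper) comparing the raw UTF-8 bytes of the same Serbian word, "hello", in both scripts:

```python
# The same Serbian word ("hello") in the two scripts.
latin = "zdravo"
cyrillic = "здраво"

# To a computer these are unrelated byte sequences: Latin letters
# take 1 byte each in UTF-8, while Cyrillic letters take 2.
print(len(latin.encode("utf-8")))     # 6 bytes
print(len(cyrillic.encode("utf-8")))  # 12 bytes
print(latin.encode("utf-8") == cyrillic.encode("utf-8"))  # False
```

A tokenizer built on top of those bytes therefore produces completely different token sequences for the two versions, even though a Serbian reader sees the same word.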

2. The Robot's "Brain" (SAEs)

The researchers didn't just ask the robot what it thought; they looked inside its brain using a tool called a Sparse Autoencoder (SAE).

Think of the robot's brain as a massive room with 65,000 light switches. When the robot reads a sentence, certain switches flip on.

  • If the robot is thinking about "cats," a specific set of switches lights up.
  • If it's thinking about "running," a different set lights up.

The question was: If we show the robot the same sentence in Latin and then in Cyrillic, will the same light switches turn on?
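One simple way to make that question numerical (a hypothetical sketch; the paper's exact metric may differ) is to treat each "switch" as one SAE feature and measure the overlap between the sets of features that fire for the two inputs, for example with Jaccard similarity:

```python
def active_features(activations, threshold=0.0):
    """Indices of features whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}

def overlap(acts_a, acts_b):
    """Jaccard similarity between two sets of active features."""
    a, b = active_features(acts_a), active_features(acts_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy activations for the "same sentence, two scripts" case:
latin_acts    = [0.0, 1.2, 0.0, 0.8, 0.0, 0.3]  # switches 1, 3, 5 on
cyrillic_acts = [0.0, 0.9, 0.0, 1.1, 0.2, 0.0]  # switches 1, 3, 4 on
print(overlap(latin_acts, cyrillic_acts))  # shared {1, 3} of {1, 3, 4, 5} → 0.5
```

An overlap of 1.0 would mean the exact same switches fire for both scripts; 0.0 would mean the model treats them as unrelated inputs.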

3. The Experiment

They fed the robot sentences in three ways:

  1. The Same Sentence: "The cat sat on the mat" in Latin vs. "The cat sat on the mat" in Cyrillic.
  2. The Same Meaning, Different Words: "The cat sat on the mat" vs. "The feline rested on the rug" (both in Latin).
  3. Random Nonsense: Totally different sentences.

4. The Big Discovery

The results were surprising and exciting:

  • The Robot Cares More About Meaning Than Spelling: When the robot read the same sentence in Latin and Cyrillic, the same light switches flipped on 58% of the time. This is huge! Even though the computer saw two totally different sets of symbols, it recognized the underlying idea.
  • It's Better Than Paraphrasing: Interestingly, the robot recognized the same sentence in two scripts better than it recognized two different sentences that meant the same thing (paraphrases).
    • Analogy: The robot is more confused by you changing your vocabulary ("feline" vs. "cat") than by you changing the alphabet (Latin vs. Cyrillic). It cares more about what you said than how you spelled it.
  • Bigger Brains = Better Understanding: As they tested bigger and smarter versions of the robot (from small to massive), this ability got even stronger. The biggest robots were almost perfect at ignoring the script and focusing on the meaning.

5. Why This Matters

This proves that these AI models aren't just pattern-matching machines that memorize specific words. They are building abstract concepts.

  • The "Ghost" in the Machine: The researchers found that the AI has built a "ghost" version of meaning that floats above the actual letters. Whether you write "pas" or "пас" (Serbian for "dog" in the two scripts), the AI's internal concept of "dog" is the same.
  • No Cheating: They checked to make sure the robot wasn't just memorizing the training data. Since the specific mix of "Latin Original" and "Cyrillic Paraphrase" likely never appeared together in the robot's training books, the fact that it still recognized the connection proves it's actually understanding, not just remembering.

The Takeaway

This paper shows that modern AI is learning to see the forest, not just the trees. Even when the "trees" (the letters) look completely different, the AI can still see the "forest" (the meaning). This is a massive step forward in understanding how machines learn to think like humans, regardless of the language or alphabet they are using.