Imagine you have a very smart robot chef (a Large Language Model, or LLM) who was trained to cook using a specific set of pre-chopped ingredients. For example, the chef expects "onion" to arrive as a single, whole unit.
Now, imagine you decide to test the chef by handing them a bag of raw, individual onion slices instead of the whole onion. You might expect the chef to get confused, drop the knife, or serve a terrible meal because the ingredients are in the wrong format.
Surprisingly, this paper discovers that the robot chef doesn't just cope; it actually does a great job. It can take those scattered slices, put them back together in its mind to form the "onion," and then cook the dish perfectly.
The researchers wanted to know: How does the chef do this? Does it just guess based on the slices, or does it secretly rebuild the whole onion first?
Here is the breakdown of their findings using simple analogies:
1. The Magic Trick: "Word Recovery"
The researchers found that the robot doesn't actually try to cook with the raw slices. Instead, it performs a magic trick called Word Recovery.
- The Analogy: Think of the input as a sentence written on a strip of paper where every letter is separated by a space: "W h a t _ i s _ n a t u r a l _ g a s ?"
- What happens inside: As the information travels through the robot's "brain" (its layers), it acts like a team of workers passing a message down a line. In the early stages, the workers who handle the letters n, a, t, u, r, a, l start talking to each other. They whisper, "Hey, we belong together!"
- The Result: By the time the message reaches the middle of the brain, the robot has mentally glued those letters back together. It has reconstructed the word "natural" inside its own memory, even though it never saw the word "natural" as a single piece when it started. It then uses this reconstructed word to answer your question.
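In concrete terms, the "onion slices" setup just means feeding the model one character per token instead of its usual word-like chunks. A minimal Python sketch of the difference (the split here is an illustration, not the model's real tokenizer, which uses subword units):

```python
prompt = "What is natural gas?"

# Ordinary tokenization gives word-like chunks (simplified here):
word_like = prompt.split()   # e.g. ['What', 'is', 'natural', 'gas?']

# The character-level input hands the model one letter at a time:
char_level = list(prompt)    # e.g. ['W', 'h', 'a', 't', ' ', ...]
```

The paper's surprise is that the model handles the second input almost as well as the first.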
2. The Proof: "The Surgery Test"
To prove that this "rebuilding" is actually necessary and not just a side effect, the researchers performed a "surgery" on the robot's brain.
- The Analogy: Imagine the robot's brain is a library of information. The researchers found the specific shelf where the reconstructed word "natural" was stored. They then took a pair of scissors and cut out that specific shelf, effectively deleting the word "natural" from the robot's memory while it was trying to think.
- The Result: When they did this, the robot immediately forgot how to answer the question. It stumbled and failed.
- The Conclusion: This proved that the robot needs to rebuild the words to function. It's not just a lucky guess; the "Word Recovery" is the engine that makes the robot work.
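Stripped of the analogy, the "surgery" is an intervention on the model's internal representations. This toy Python stand-in (the two "stages" and the canned answer are invented for illustration, not the real model) shows why deleting the reconstructed word breaks the answer:

```python
def toy_model(chars, ablate_word=None):
    """A toy stand-in for the LLM: stage 1 glues characters into words
    (the 'Word Recovery' step), stage 2 answers using those words."""
    # Stage 1: reconstruct words from the scattered characters.
    words = "".join(chars).split()
    # The 'surgery': delete one reconstructed word mid-computation.
    if ablate_word is not None:
        words = [w for w in words if w != ablate_word]
    # Stage 2: the answer only works if the key word survived.
    return "a fossil fuel" if "natural" in words else "???"

chars = list("What is natural gas?")
toy_model(chars)                         # answers normally
toy_model(chars, ablate_word="natural")  # stumbles and fails
```

The point of the toy: ablating the intermediate word, not the raw characters, is what destroys the output, mirroring the paper's causal test.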
3. The Glue: "In-Group Attention"
The researchers also wanted to know how the letters know to stick together. They looked at the robot's "attention" mechanism (how it decides which parts of the input to focus on).
- The Analogy: Imagine a classroom where students are sitting in groups. The students representing the letters of the word "natural" are sitting in the same cluster.
- The Discovery: The researchers found that in the early stages of processing, these "letter-students" are constantly raising their hands and talking to each other (this is called In-Group Attention). They are ignoring the other students in the room to focus on forming their group identity.
- The Experiment: When the researchers had the teacher silence the students in the "natural" group, so they could no longer talk to each other, the group fell apart. The robot couldn't form the word, and its performance crashed.
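Mechanically, "stopping the students from talking" corresponds to masking attention between the character positions of a single word. A hedged pure-Python sketch (the position indices for "natural" are worked out by hand for the example prompt):

```python
def blocked_mask(n, group_positions):
    """Build an n x n attention mask: True means position i may attend
    to position j. The experiment forbids attention *within* the group
    while leaving every other connection intact."""
    g = set(group_positions)
    return [[not (i in g and j in g and i != j) for j in range(n)]
            for i in range(n)]

prompt = list("What is natural gas?")
natural = range(8, 15)   # positions of n, a, t, u, r, a, l in the prompt
mask = blocked_mask(len(prompt), natural)

# mask[8][9] is now False: 'n' can no longer talk to the first 'a',
# while mask[0][9] stays True: unrelated positions are untouched.
```

Only the in-group connections are cut, which is why the experiment isolates In-Group Attention as the "glue".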
The Big Takeaway
This paper solves a mystery about how AI works. The expectation was that if you broke a word into tiny pieces (individual characters), the AI would be confused, because it was trained on whole words (or, more precisely, on word-sized chunks called tokens).
But the AI is much smarter than that. It has an internal "glue" mechanism. When it sees broken pieces, it quickly gathers them up, sticks them back together into words, and then uses those words to understand the world. It's like a puzzle solver that can take a pile of scattered puzzle pieces, instantly see the picture, and solve the puzzle, even if the picture was never given to it as a whole.
In short: The AI doesn't just read letters; it secretly rebuilds words in its mind to make sense of them.