Language Models are Injective and Hence Invertible

This paper challenges the prevailing view that transformer components are non-injective. The authors mathematically prove, empirically validate, and practically demonstrate, through a new algorithm called SipIt, that language models are injective and therefore lossless, enabling exact reconstruction of the input text from hidden activations.

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà

Published 2026-03-16

Imagine you have a magical, super-smart machine (a Large Language Model, or LLM) that reads your text and turns it into a secret code. For years, scientists believed this machine was a bit like a blender: you throw in a whole fruit salad (your text), and it blends everything into a smoothie (the hidden code). Once it's blended, you can't tell which apple slice came from where, or if you accidentally dropped a grape in. The assumption was that the machine loses information along the way, making it impossible to perfectly reconstruct your original text just by looking at the smoothie.

This paper says: "Wrong. The blender isn't a blender; it's a high-tech 3D printer."

Here is the breakdown of their discovery, using simple analogies:

1. The Big Surprise: No Information is Lost

The authors proved mathematically that these language models are injective. In plain English, this means: Different inputs always produce different outputs.

  • The Old View: If you type "Hello" and "Goodbye," the machine might turn both into the exact same secret code. If that happened, you'd never know which word you typed just by looking at the code.
  • The New View: The machine is like a perfect fingerprint scanner. Even if two people look very similar, their fingerprints are unique. Similarly, even if two sentences are very similar, the machine creates a unique "fingerprint" (a hidden state) for each one. There is no "collapsing" of different inputs into the same output.

The Analogy: Imagine a library where every single book, no matter how similar the cover, gets a unique, unbreakable barcode. If you have the barcode, you can always find the exact book it came from. The paper proves that these AI models act like that library.
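To make "different inputs always produce different outputs" concrete, here is a toy sketch. The model below is not a transformer, just a tiny deterministic nonlinear recurrence standing in for one; the point is only to show what it means to check a sequence-to-state map for collisions.

```python
import itertools
import math
import random

# Toy stand-in for a language model: a smooth nonlinear recurrence over
# fixed random embeddings. Illustrative only -- not the paper's model.
random.seed(0)
VOCAB = list(range(8))          # tiny vocabulary of 8 token ids
DIM = 4                         # hidden-state dimensionality
EMB = {t: [random.gauss(0, 1) for _ in range(DIM)] for t in VOCAB}

def hidden_state(tokens):
    """Map a token sequence to its final hidden state, deterministically."""
    h = [0.0] * DIM
    for t in tokens:
        # one smooth update step, loosely analogous to a transformer layer
        h = [math.tanh(hi + ei) for hi, ei in zip(h, EMB[t])]
    return tuple(round(x, 12) for x in h)

# Enumerate every sequence of length 1..3 and check for collisions:
# injectivity means no two distinct sequences share a hidden state.
seen = {}
for L in (1, 2, 3):
    for seq in itertools.product(VOCAB, repeat=L):
        h = hidden_state(seq)
        assert h not in seen, f"collision: {seq} vs {seen[h]}"
        seen[h] = seq

print(f"checked {len(seen)} sequences, no two shared a hidden state")
```

In this toy setting, as in the paper's argument, collisions would require the randomly chosen weights to land on an extraordinarily special configuration, which is why none appear.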

2. Why Does This Happen? (The "Smooth" Machine)

You might wonder, "But these machines use complex math and weird curves. How can they be perfect?"

The authors looked at the math under the hood and found that the machine is built from smooth, continuous functions (like drawing a curve without lifting your pen). They proved that for the machine to accidentally mix up two different inputs, its settings would have to land exactly on a mathematically negligible "measure-zero" set of parameter values, so rare that it's like trying to win the lottery every single day for a million years.

  • The Analogy: Think of the machine's settings as a giant landscape of hills and valleys. The "bad" settings (where the machine mixes up words) are like a single, invisible speck of dust on a football field. When the machine is built (initialized) and trained, it naturally rolls around the field but never lands on that speck of dust. It's statistically impossible for it to happen by accident.
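In slightly more formal terms, the "speck of dust" idea can be paraphrased as follows (a sketch of the style of argument, not the paper's exact statement, with $f_\theta$ denoting the map from an input sequence to its final hidden state and $\Theta$ the parameter space):

```latex
% For any two distinct inputs s \neq s', the "bad" parameter settings
% that make the model confuse them form the collision set
\[
  \mathcal{C}_{s,s'} \;=\; \{\, \theta \in \Theta \;:\; f_\theta(s) = f_\theta(s') \,\}.
\]
% Because $f_\theta$ is real-analytic in $\theta$ and the difference
% $f_\theta(s) - f_\theta(s')$ is not identically zero, each set
% $\mathcal{C}_{s,s'}$ has Lebesgue measure zero. The union over the
% countably many input pairs $(s, s')$ is still measure zero, so a
% parameter vector drawn from any continuous distribution (random
% initialization, then gradient training) avoids every collision
% almost surely.
```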

3. The Magic Tool: SIPIT (The "Reverse Blender")

Because the machine is so precise, the authors built a tool called SIPIT (Sequential Inverse Prompt via ITerative updates).

  • The Problem: Usually, if you see the secret code (the hidden state), you can't easily turn it back into the original text. It's like seeing a smoothie and trying to guess the exact recipe.
  • The Solution: SIPIT is a reverse-engineering machine. Because every code is unique, SIPIT can look at the hidden state and say, "Ah, this specific pattern of numbers can only come from the word 'Apple', not 'Pear'." It does this step-by-step, reconstructing your text token by token (roughly, word by word), perfectly.

The Analogy: If the original AI is a chef turning ingredients into a dish, SIPIT is a detective who can taste the dish and perfectly list every single ingredient used, in the exact order they were added, with 100% accuracy.
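The "detective" idea can be sketched in code. The snippet below is a hedged illustration of sequential inversion, not the authors' implementation: using the same kind of toy recurrent model as before, it recovers a hidden sequence left-to-right by testing which vocabulary token reproduces each observed hidden state.

```python
import math
import random

# Toy model: a smooth recurrence over fixed random embeddings
# (illustrative stand-in for a transformer, as before).
random.seed(1)
VOCAB = list(range(16))
DIM = 6
EMB = {t: [random.gauss(0, 1) for _ in range(DIM)] for t in VOCAB}

def step(h, t):
    """One smooth update step for token t from state h."""
    return [math.tanh(hi + ei) for hi, ei in zip(h, EMB[t])]

def hidden_states(tokens):
    """The hidden state after each position -- the 'secret codes'."""
    h, out = [0.0] * DIM, []
    for t in tokens:
        h = step(h, t)
        out.append(h)
    return out

def invert(states):
    """Sequential inversion in the spirit of SIPIT: at each position,
    exactly one token's forward step matches the observed state,
    because the map is injective."""
    recovered, h = [], [0.0] * DIM
    for target in states:
        for t in VOCAB:
            cand = step(h, t)
            if max(abs(a - b) for a, b in zip(cand, target)) < 1e-9:
                recovered.append(t)
                h = cand
                break
        else:
            raise ValueError("no token matches: map would not be injective")
    return recovered

secret = [random.choice(VOCAB) for _ in range(10)]
states = hidden_states(secret)
assert invert(states) == secret
print("recovered the exact input:", invert(states))
```

Each position is resolved independently given the prefix already recovered, which is why the procedure runs token by token rather than searching over whole sentences.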

4. Why Should You Care?

This isn't just a math trick; it changes how we think about AI safety and privacy.

  • Privacy: If someone steals the "secret code" (the hidden state) from an AI, they can use SIPIT to read your private messages exactly as you sent them. The code isn't an abstract summary; it is your text, just in a different format.
  • Trust: It proves that AI models don't "forget" or "distort" your input. They remember everything perfectly. This is great for debugging (we can see exactly what the AI "thought") but scary for privacy (nothing is truly hidden inside the machine).

Summary

The paper shatters the idea that AI models are "lossy" (forgetful) machines. Instead, they are perfectly precise recorders.

  • Input: Your text.
  • Process: A smooth, mathematical transformation that never loses a single bit of information.
  • Output: A unique code that can be turned back into your text perfectly.

It's like realizing that the "black box" of AI isn't a box that swallows things; it's a mirror that reflects your input with such perfect clarity that you can always look back and see exactly what you put in.
