How to Steal Reasoning Without Reasoning Traces

This paper introduces "trace inversion" models that reconstruct detailed reasoning traces from only a target model's inputs, final answers, and reasoning summaries. This demonstrates that hiding reasoning chains fails to prevent the theft of reasoning capabilities: student models fine-tuned on these synthetic traces achieve significant performance gains.

Tingwei Zhang, John X. Morris, Vitaly Shmatikov

Published Tue, 10 Ma

This is an explanation of the paper "How to Steal Reasoning Without Reasoning Traces" in simple language, using creative analogies.

The Big Idea: The "Magic Trick" of the AI

Imagine you have a brilliant chef (the Target AI, like GPT-5 or a commercial model) who can cook a perfect, complex meal. However, this chef is very secretive. When you order a dish, they don't let you see the kitchen, the recipe, or the step-by-step cooking process. They only give you:

  1. The Final Dish (the answer).
  2. A tiny sticky note saying, "I chopped the onions, simmered the sauce, and added salt" (the Reasoning Summary).

The chef thinks, "If I hide my secret recipe and only show the summary, no one can learn how to cook like me."

This paper proves the chef is wrong.

The researchers (the "attackers") built a new tool called Trace Inversion. This tool is like a "Reverse-Engineer's Kitchen." Even though the chef only gave them the final dish and a tiny note, the tool can guess the entire secret recipe with shocking accuracy.

Once the tool guesses the recipe, the researchers can teach a smaller, cheaper chef (the Student AI) how to cook using that guessed recipe. The result? The small chef starts cooking almost as well as the famous, expensive one.


How the "Heist" Works (The 3-Step Plan)

The paper describes a three-stage process to steal the reasoning skills:

1. The Training Phase (The "Practice Kitchen")

The attackers need to teach their "Reverse-Engineer" tool how to guess recipes.

  • The Setup: They take a bunch of public math and logic problems.
  • The Teacher: They use an open-source AI (a "Surrogate") that does show its full cooking process.
  • The Trick: They take the Surrogate's full recipe, hide it, and only show the "Final Dish" and a "Tiny Note" (just like the commercial chef does).
  • The Learning: They train their tool to look at the Dish + Note and try to write out the full recipe that must have created it. They do this thousands of times until the tool gets really good at guessing the missing steps.

2. The Heist (The "Real Job")

Now, the tool is ready to target the real, secret commercial chef (e.g., GPT-5 mini).

  • The attackers ask the commercial chef 10,000 questions.
  • The chef replies with just the Answer and a Short Summary.
  • The attackers feed these answers and summaries into their trained tool.
  • The Magic: The tool spits out a long, detailed, step-by-step reasoning trace that looks just like the secret recipe the commercial chef used internally.
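At attack time, the flow above amounts to formatting each black-box response exactly like the training inputs and running it through the trained inversion model. In this sketch the API record and the model call are both placeholders (a real attack would query the commercial API and a fine-tuned open-source LLM):

```python
# Hypothetical record of one black-box query: the commercial API returns
# only the final answer and a short reasoning summary (fields invented).
api_response = {
    "question": "A train travels 120 km in 1.5 hours. What is its speed?",
    "answer": "80 km/h",
    "summary": "Divided distance by time.",
}

def inversion_prompt(record: dict) -> str:
    """Format the (question, answer, summary) triple the same way the
    inversion model's training inputs were formatted, so it can fill in
    the missing step-by-step trace."""
    return (
        f"Question: {record['question']}\n"
        f"Answer: {record['answer']}\n"
        f"Summary: {record['summary']}\n"
        "Reconstruct the full reasoning trace:"
    )

def run_inversion_model(prompt: str) -> str:
    """Placeholder for the trained inversion model; returns a canned
    trace here purely so the sketch is runnable."""
    return ("Speed is distance divided by time. "
            "120 km / 1.5 h = 80, so the speed is 80 km/h.")

reconstructed_trace = run_inversion_model(inversion_prompt(api_response))
print(reconstructed_trace)
```

The key design point is consistency: the attack-time prompt must match the training-time format, or the inversion model will be answering a question it was never trained on.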

3. The Payoff (The "Apprentice")

The attackers take these "guessed" recipes and use them to train their own small AI model.

  • Instead of just teaching the small model the answer, they teach it the process (the guessed reasoning).
  • The Result: The small model learns to think deeply and solve hard problems, effectively "stealing" the reasoning capabilities of the big, expensive model.
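The payoff step can be sketched as assembling a distillation dataset in which the student is trained to emit the reconstructed reasoning before the answer, rather than the answer alone. The field names and the `<think>` delimiter below are illustrative assumptions, not the paper's format:

```python
def build_student_example(question: str, reconstructed_trace: str, answer: str) -> dict:
    """One fine-tuning example for the student model: the completion
    contains the guessed reasoning followed by the final answer."""
    return {
        "prompt": question,
        "completion": f"<think>{reconstructed_trace}</think>\n{answer}",
    }

example = build_student_example(
    question="A train travels 120 km in 1.5 hours. What is its speed?",
    reconstructed_trace="Speed = distance / time = 120 / 1.5 = 80 km/h.",
    answer="80 km/h",
)

# Contrast with the answers-only baseline, which supervises on just "80 km/h":
baseline_completion = "80 km/h"
print(len(example["completion"]), len(baseline_completion))
```

The difference in supervision is the whole point: the baseline teaches the student only *what* to say, while the trace-augmented completion also teaches it *how* to get there.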

Why This Matters: The "Black Box" is Leaky

For a long time, AI companies thought that hiding the "Chain of Thought" (the internal thinking steps) was a good security measure. They believed that if you only saw the answer and a brief summary, you couldn't learn how the model thought.

This paper says: "Not so fast."

  • The Analogy: Imagine a master detective solving a crime. They tell you the verdict ("The butler did it") and a one-sentence summary ("He had a motive and a gun"). The detective thinks, "You can't learn my detective skills from that."
  • The Reality: If you have enough examples of Verdicts + Summaries, and you have a smart enough tool to guess the missing detective work, you can reconstruct the detective's entire thought process. You can then teach a rookie detective to think just like the master.

The Numbers Don't Lie

The researchers tested this on real-world benchmarks (like math competitions and coding tests):

  • Without the trick: If they just taught a small model the answers and summaries, it got about 57% of the math problems right.
  • With the trick (Trace Inversion): After using the "guessed recipes" to train the model, it jumped to 78% accuracy.

They even managed to do this against a "Black Box" commercial model (GPT-5 mini) that they couldn't see inside at all. They successfully transferred the "brain power" of the big model to a smaller, open-source model.

The Takeaway

Hiding the "how" doesn't protect the "what."

Even if AI companies stop showing their internal reasoning steps and only show summaries, clever attackers can use AI tools to reconstruct those steps anyway. This means that simply hiding the "Chain of Thought" is not enough to stop people from stealing a model's intelligence.

The Bottom Line: If you want to protect your AI's secret reasoning, you can't just hide the recipe; you have to make the dish itself un-teachable, or find a way to stop people from guessing the recipe from the final meal.