Imagine you have a very smart robot chef. In the past, if you told it, "Make me a sandwich," it would look at the counter, grab the bread, and start working immediately.
But the newest generation of robots (called vision-language-action, or VLA, models) has a new trick: before it moves its arm, it thinks out loud. It generates a mental script like, "Okay, I see the bread on the left and the cheese on the right. I will pick up the bread first, then the cheese." Only after writing this script does it actually move its arm.
This paper asks a scary question: What happens if someone secretly edits that mental script while the robot is thinking, but leaves the robot's eyes and ears completely untouched?
The Experiment: The "Ghost Editor"
The researchers imagined a hacker who can't touch the robot's code or its cameras. Instead, the hacker sits in the middle of the robot's brain, intercepting the "thought script" just before the robot acts. They then swap out specific words in that script and let the robot act on the corrupted plan.
They tested seven different ways to mess with the script, ranging from simple noise to a super-smart AI rewriting the whole thing.
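In code terms, the attack is a man-in-the-middle on the reasoning text alone. Here is a minimal sketch of that idea (all function and variable names are illustrative, not from the paper), using the "jumbled sentences" perturbation described below as the example:

```python
import random

def corrupt_reasoning(reasoning: str, perturb) -> str:
    # Man-in-the-middle on the thought script only: the camera
    # images and the user's instruction pass through untouched.
    return perturb(reasoning)

def shuffle_sentences(script: str) -> str:
    # One of the simplest perturbations: jumble the sentence order
    # while leaving every individual sentence intact.
    sentences = [s for s in script.split(". ") if s]
    random.shuffle(sentences)
    return ". ".join(sentences)

plan = ("I see the bread on the left. I see the cheese on the right. "
        "I will pick up the bread first")
print(corrupt_reasoning(plan, shuffle_sentences))
```

The key point the sketch captures: the attacker only needs a hook between "thinking" and "acting", not access to the model weights or the sensors.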
The Big Surprise: It's All About the "Names"
The results were shocking and counter-intuitive.
The "Jumbled Sentences" Test: They shuffled the order of the sentences (e.g., saying "I will pick up the cheese" before "I see the bread").
- Result: The robot didn't care. It still made the sandwich perfectly.
- Analogy: It's like reading a recipe where the steps are out of order, but the ingredients are still named correctly. The robot is smart enough to figure out the order on its own.
The "Wrong Direction" Test: They flipped every direction word (changing "left" to "right," "up" to "down").
- Result: The robot still succeeded.
- Analogy: The robot is like a driver whose GPS says "turn left" when the correct turn is right. Because the robot can see the road, it ignores the bad GPS voice and just follows the actual street signs.
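The direction-flip perturbation itself is trivially simple. A sketch (illustrative, not the paper's code), where each direction word is replaced with its opposite in a single pass so that "left" is not immediately flipped back to "right":

```python
import re

# Every direction word maps to its opposite.
FLIPS = {"left": "right", "right": "left", "up": "down", "down": "up"}

def flip_directions(script: str) -> str:
    # A single regex pass avoids double-flipping:
    # each match is substituted exactly once.
    return re.sub(r"\b(left|right|up|down)\b",
                  lambda m: FLIPS[m.group(1)], script)

print(flip_directions("move up and to the left"))
# -> "move down and to the right"
```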
The "Super-Hacker" Test: They used a powerful AI to rewrite the script to be grammatically perfect and logical, but with the wrong plan.
- Result: The robot still succeeded.
- Analogy: Even if a brilliant human wrote a fake plan that sounded perfect, the robot wasn't fooled because the names of the objects were still correct.
The "Name Swap" Test (The Killer): They kept the script logical and the directions correct, but they swapped the names of the objects.
- Original Script: "Pick up the wine bottle."
- Hacked Script: "Pick up the chocolate pudding."
- Result: The robot's success rate crashed. It reached for the pudding instead of the wine, or it got confused and failed completely. On the hardest tasks, the robot failed 45% more often just because the name of the object was wrong.
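Ironically, the only perturbation that worked is also the simplest to implement. A sketch of the name swap (the object pair is the paper's own example; the function name is illustrative):

```python
# Keep the logic and directions intact; swap only the object names.
SWAPS = {"wine bottle": "chocolate pudding"}

def swap_object_names(script: str) -> str:
    for real_name, fake_name in SWAPS.items():
        script = script.replace(real_name, fake_name)
    return script

print(swap_object_names("Pick up the wine bottle."))
# -> "Pick up the chocolate pudding."
```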
The Lesson: The Robot is "Blind" to Logic, but "Sharp" on Names
The paper reveals a strange flaw in these advanced robots: They don't actually trust their own "thinking" process for logic or direction. They rely on their cameras for that.
However, they do trust the "thinking" process to tell them what to grab. The robot's brain uses the text script essentially as a "label" to point its hand at the right object. If the label says "pudding," the robot looks for pudding, even if the camera clearly shows a wine bottle.
Why This Matters: The "Silent" Attack
This is dangerous because of stealth.
- Old Attacks: To break a robot, hackers usually had to put a weird sticker on a stop sign (so the robot sees it as a speed limit) or shout a weird command (so the robot hears a different instruction). These are easy to spot and block.
- This New Attack: The hacker changes the robot's internal thoughts. To an outside observer, the robot's eyes see the wine bottle, and the voice says "Make a sandwich." Everything looks perfect. But inside the robot's brain, the plan says "Grab the pudding."
It's like a spy whispering the wrong address to a delivery driver while the driver is looking at the map. The driver sees the map (the camera), but they follow the whisper (the corrupted thought).
The Takeaway
As robots get smarter and start "thinking" before they act, we have a new safety problem. We can't just check if the robot's eyes are working or if the voice commands are clean. We have to secure the internal conversation between the robot's "thinking brain" and its "moving hands."
The paper suggests a simple fix: Before the robot acts, we should have a "bouncer" check the script to make sure the names of the objects match what the robot is actually seeing. If the script says "pudding" but the camera sees "wine," the robot should stop and ask, "Wait, what's going on?"
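That "bouncer" amounts to a consistency check between the nouns in the plan and the objects the perception system currently detects. A minimal sketch, assuming the robot already has an object-detection list (the vocabulary and function names here are hypothetical):

```python
def plan_matches_scene(plan: str, visible_objects: list[str]) -> bool:
    # "Bouncer" check: every object the plan mentions must appear
    # in the camera's current detections; otherwise halt and ask.
    known_objects = {"wine bottle", "chocolate pudding", "bread", "cheese"}
    mentioned = {obj for obj in known_objects if obj in plan}
    return mentioned <= set(visible_objects)

# Camera sees a wine bottle, but the corrupted plan says pudding:
print(plan_matches_scene("Pick up the chocolate pudding.",
                         ["wine bottle"]))  # -> False
```

Because the attack only ever changes the text, a check grounded in the camera's own output is exactly the signal the corrupted script cannot fake.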