Imagine you are trying to put a key into a very tight, rusty lock.
If you just use your eyes, you can see the keyhole from a distance. You can line up the key perfectly. But the moment the key touches the metal of the lock, your view gets blocked. You can't see the tiny bumps or the exact angle anymore. If you try to force it in just by looking, you'll likely jam it or break the key.
This is exactly the problem robots face when doing delicate assembly work, like putting a peg into a hole. They are great at seeing, but they are "blind" when they actually touch the object.
ReTac-ACT is a new robot brain that solves this by giving the robot a pair of "super-fingers" that can feel, combined with its eyes. Here is how it works, broken down simply:
1. The Problem: The "Blind Spot" at the Finish Line
Current robots (the ones relying on standard vision-based AI) are like drivers who only look through the windshield. They drive great until they hit a traffic jam or a narrow alley where the view is blocked. In robot terms, when a robot arm gets close to a hole, the arm itself blocks the camera. The robot gets confused, misses the hole, or jams the parts together.
2. The Solution: A "State-Gated" Switch
The authors created a system called ReTac-ACT. Think of it as a smart traffic light for the robot's senses.
- Phase 1: The Approach (Eyes On): When the robot is far away from the hole, the "traffic light" is green for Vision. The robot uses its cameras to find the hole and move the peg close. It doesn't need to feel anything yet.
- Phase 2: The Insertion (Fingers On): The moment the peg touches the metal, the robot's internal sensors (proprioception) detect the contact. The "traffic light" instantly flips. It turns Vision down (because the view is blocked) and turns Touch up to maximum.
This "gating" mechanism is the secret sauce. It doesn't just mix the senses randomly; it knows exactly when to switch from "looking" to "feeling."
3. The "Super-Fingers" (Tactile Sensors)
The robot uses special sensors on its fingertips (optical tactile sensors, such as GelSight). These aren't just pressure pads; they are like high-definition cameras built into the skin. They can see the tiny scratches, the angle of the peg, and the exact shape of the hole, even when the robot's main camera can't see them.
4. The "Training Camp" (Reconstruction)
To make these fingers smart, the researchers used a clever training trick. They forced the robot to reconstruct the image of what it was feeling.
- Analogy: Imagine a student learning to draw a face. Instead of just memorizing the name "nose," the teacher forces the student to draw the nose over and over again until it's perfect.
- By forcing the robot to "redraw" the tactile image from its memory, the robot learns to pay attention to the important details (like the shape of the hole) rather than just random textures.
5. The Results: A Master Craftsman
The researchers tested this on a standard industrial challenge: putting a peg into a hole with a clearance of only 0.1 mm, roughly the width of a human hair.
- Old Robots (Vision Only): They failed almost 100% of the time at this tiny gap because they couldn't see the final millimeter.
- ReTac-ACT: It succeeded 80% of the time.
It's like the difference between a person trying to thread a needle in the dark by guessing, versus a person who can feel the thread and the needle with their fingertips.
Why This Matters
This isn't just about putting pegs in holes. This technology is a giant leap forward for:
- Factory Automation: Making robots that can assemble delicate electronics or car parts without human help.
- Surgery: Helping robotic surgeons feel the tissue they are cutting.
- Home Robots: Eventually, robots that can fold laundry or wash dishes without crushing things.
In short, ReTac-ACT teaches robots to stop relying solely on their eyes when things get tricky and to trust their "hands" when it really counts. It's the difference between a clumsy beginner and a master craftsman.