Bi-AQUA: Bilateral Control-Based Imitation Learning for Underwater Robot Arms via Lighting-Aware Action Chunking with Transformers

Bi-AQUA is a novel bilateral control-based imitation learning framework for underwater robot arms that integrates transformer-based action chunking with explicit lighting modeling to achieve robust performance in challenging, variable illumination conditions.

Takeru Tsunoori, Masato Kobayashi, Yuki Uranishi

Published 2026-03-09

Imagine trying to teach a robot arm to pick up a toy and put it in a box while you are both underwater. Now, imagine the water is murky, the light is flickering, and the colors are shifting from red to blue to green every few seconds. For a standard robot, this is a nightmare. It would get confused, think the toy is a different color, or miss the box entirely because its "eyes" can't handle the weird lighting.

This paper introduces Bi-AQUA, a new way to teach underwater robots that solves two big problems at once: bad lighting and lack of touch.

Here is the simple breakdown using some everyday analogies:

1. The Problem: The "Foggy Glasses" and "Blind Hands"

Most underwater robots today are like a person wearing foggy glasses trying to thread a needle.

  • The Lighting Issue: Underwater, light behaves strangely. It gets absorbed (making things look dark), scatters (making things look blurry), and changes color (making a red ball look black). Standard robot brains get confused because they expect the world to look the same as it did when they were learning.
  • The Touch Issue: Many robots only use cameras (vision). But in water, you often need to feel things to know if you've grabbed something or if you're pushing against a wall. Robots that only "see" are like a blindfolded person trying to open a drawer; they might bump into it but never know when it's actually closed.

2. The Solution: The "Master and Apprentice" with Super-Senses

The researchers built a system called Bi-AQUA (Bilateral Control-Based Imitation Learning). Think of it as a Master and Apprentice setup.

  • The Master (Leader): A human operator sits on a boat or in a dry room, holding a robot arm. They can see clearly and feel the water resistance.
  • The Apprentice (Follower): A robot arm is underwater. It tries to copy the Master's movements exactly.
  • The "Bilateral" Magic: This is the key. The Master doesn't just tell the Apprentice where to move; they also share force. If the Apprentice bumps into a rock, the Master feels a "push" in their hand. If the Master pushes hard, the Apprentice knows to push hard too. This gives the robot a sense of "touch" even though it's underwater.

3. The Secret Sauce: The "Lighting Translator"

The real breakthrough in this paper is how Bi-AQUA handles the changing lights.

Imagine you are learning to drive a car. You practice in daylight. But what if you suddenly had to drive at night, then in a tunnel, then in a blizzard? You would crash.

Bi-AQUA solves this by giving the robot a "Lighting Translator" (a special AI brain component).

  • The Translator's Job: Before the robot decides how to move, it looks at the camera image and asks, "What kind of lighting is this? Is it red? Is it flickering?"
  • The "FiLM" Filter: Think of this like putting on different pairs of sunglasses. If the light is red, the robot automatically "tunes" its vision to understand that redness is normal. If the light is blue, it adjusts again. It doesn't just ignore the weird light; it uses the information about the light to make better decisions.
  • The "Token": The robot also carries a little "note" (a token) in its brain that says, "Hey, remember, the light is weird right now." This note travels through the robot's decision-making process, ensuring every step it takes accounts for the current lighting.
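The "sunglasses" idea maps onto a real technique called FiLM (feature-wise linear modulation). The sketch below is illustrative, not the paper's architecture: a lighting descriptor is mapped by a stand-in linear "lighting encoder" to per-channel scale (gamma) and shift (beta) values, which then re-tune the visual features before the policy reads them.

```python
# Minimal FiLM sketch (hypothetical encoder weights and feature sizes):
# lighting info -> (gamma, beta) -> per-channel rescaling of features.

def film_modulate(features, gamma, beta):
    """Apply y_c = gamma_c * x_c + beta_c to each feature channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def lighting_to_film_params(lighting_vec, w_gamma, w_beta):
    """Stand-in 'lighting encoder': a linear map from a lighting
    descriptor (e.g. dominant hue, brightness) to FiLM parameters.
    Gamma is centered at 1 so neutral lighting is the identity map."""
    gamma = [sum(w * l for w, l in zip(row, lighting_vec)) + 1.0
             for row in w_gamma]
    beta = [sum(w * l for w, l in zip(row, lighting_vec))
            for row in w_beta]
    return gamma, beta

# Under "neutral" lighting (zero descriptor), FiLM leaves features alone;
# a red or blue cast would shift gamma/beta and re-tune every channel.
features = [0.5, -1.2, 3.0]
gamma, beta = lighting_to_film_params([0.0, 0.0],
                                      w_gamma=[[0.2, 0.1]] * 3,
                                      w_beta=[[0.3, -0.1]] * 3)
print(film_modulate(features, gamma, beta))  # → [0.5, -1.2, 3.0]
```

The "token" works alongside this: instead of multiplying into the features, the lighting summary rides through the transformer as an extra input vector, so later decision-making steps can also attend to the current lighting condition.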

4. The Results: From "Clumsy" to "Pro"

The researchers tested this in a real water tank with tasks like:

  • Pick and Place: Grabbing a block and moving it.
  • Closing a Drawer: A long, tricky task requiring pushing and pulling.
  • Pulling a Peg: A very tight fit that requires precise force.

The results were striking:

  • Old Robots (without the Lighting Translator): They worked perfectly in white light but failed miserably (0% success) when the light turned blue, green, or started changing colors. They were like a person who can only drive in perfect sunny weather.
  • Bi-AQUA: It succeeded 100% of the time, even when the lights were changing every 2 seconds, or when the object was a weird color, or when bubbles were blocking the view. It was like a driver who could handle rain, snow, fog, and night driving without blinking.

The Big Takeaway

Bi-AQUA is like teaching a robot to be a diver who can see through the murk and feel the current. By combining the "sense of touch" from the human operator with a special brain that understands how underwater light works, the robot can finally do complex jobs underwater without getting confused by the dark, colorful, and shifting environment.

It's a huge step toward robots that can actually help us explore the ocean, fix underwater cables, or clean up pollution, rather than just crashing into things when the sun goes down.