RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentanglement

This paper introduces RED, a robust event-guided motion deblurring network that employs a robustness-oriented perturbation strategy and a modality-specific disentanglement mechanism to effectively reconstruct sharp images from fragmented event data caused by real-world sensor under-reporting.

Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang

Published Mon, 09 Ma

Imagine you are trying to take a clear photo of a fast-moving race car. Because the car is moving so fast, your camera shutter stays open a tiny bit too long, and the resulting photo is a blurry mess. This is the problem of motion blur.

For a long time, computers have tried to fix these blurry photos using only the picture itself. But sometimes, the blur is so bad that the computer just guesses wrong, like trying to solve a puzzle with half the pieces missing.

Recently, scientists started using a special kind of camera called an Event Camera. Think of a normal camera as a video recorder that takes a picture every 1/30th of a second. An Event Camera is different: it's like a swarm of tiny, hyper-alert fireflies. Each firefly only "flashes" (sends an event) when it sees something change quickly, like a wheel spinning or a bird flapping its wings. These flashes happen incredibly fast, giving the computer a perfect map of where things are moving.

The Problem: The "Shy" Fireflies
The paper's authors noticed a big problem with these Event Cameras in the real world. To stop the camera from getting confused by noise (like dust or flickering lights), engineers set a "volume knob" (called a threshold) that tells the fireflies: "Only flash if the change is loud enough."

The trouble is, this makes the fireflies shy.

  • If a car is moving slowly, or if the edge of an object is faint, the change isn't "loud" enough.
  • The fireflies stay silent.
  • The computer gets a map of motion that is fragmented and missing pieces.

When existing computer programs tried to use these "shy" maps to fix the blurry photo, they got confused. They tried to mix the blurry photo and the broken motion map together, which made the final result even worse. It's like trying to bake a cake using a recipe that's missing half the ingredients and then mixing in some dirt because you thought it was chocolate.

The Solution: RED (Robust Event-guided Deblurring)
The authors created a new system called RED. They didn't just build a better cake mixer; they changed how they think about the ingredients. Here is how RED works, using simple analogies:

1. The "Training Camp" (RPS)

Before the system goes to work, they put it through a tough training camp. They simulate all kinds of "shy firefly" scenarios. They pretend the volume knob is turned up high, then low, then erratic.

  • Why? This teaches the computer: "Hey, sometimes the motion map will be broken. Don't panic. Learn to work with what you have."
  • Result: The system becomes tough and adaptable, ready for any real-world condition.
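The training camp amounts to a data augmentation: during training, the event input is perturbed to mimic thresholds that are too high, too low, or erratic. The paper's exact RPS procedure isn't detailed here, so the sketch below is a hypothetical stand-in (random event dropping, with an assumed drop-rate range):

```python
import numpy as np

def perturb_events(event_map, rng, drop_range=(0.0, 0.5)):
    """Randomly silence events to mimic an erratic, too-high threshold,
    so the network learns to cope with fragmented motion maps.
    (Illustrative augmentation, not the authors' exact RPS.)"""
    drop_prob = rng.uniform(*drop_range)          # fresh severity per sample
    keep = rng.random(event_map.shape) >= drop_prob
    return event_map * keep                       # dropped pixels become 0

rng = np.random.default_rng(0)
events = rng.integers(-1, 2, size=(8, 8))  # dense event map of -1, 0, +1
augmented = perturb_events(events, rng)
# Each training step sees a differently fragmented version of the events,
# so "broken map" becomes a normal condition rather than a surprise.
```

Redrawing the drop rate for every sample is what makes the network robust to a *range* of sensor conditions instead of overfitting to one fixed corruption level.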

2. The "Specialized Teams" (Disentanglement)

Old systems tried to mix the blurry photo and the broken motion map into one big soup. RED says, "No! Let's keep the teams separate first."

  • The Image Team: Focuses only on the look of the photo (colors, shapes, textures). They ignore the movement.
  • The Event Team: Focuses only on the movement (where things changed). They ignore the colors.
  • The Cross-Team: A mediator that helps them talk to each other.

By separating them, the system prevents the "broken" motion data from ruining the "good" picture data. It's like having a translator who speaks two languages perfectly, rather than forcing everyone to speak a broken mix of both.
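In network terms, "keeping the teams separate" means giving each modality its own encoder with no shared weights, so event noise cannot leak into the image features before fusion. A minimal PyTorch sketch (the layer sizes and names are assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class DisentangledBackbone(nn.Module):
    """Illustrative two-branch encoder: the blurry image and the event
    map are processed separately, so a fragmented event map cannot
    corrupt the appearance features before they are deliberately fused."""
    def __init__(self, channels=16):
        super().__init__()
        self.image_branch = nn.Sequential(  # appearance: colors, textures
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.event_branch = nn.Sequential(  # motion: where things changed
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU())

    def forward(self, blurry, events):
        return self.image_branch(blurry), self.event_branch(events)

net = DisentangledBackbone()
f_img, f_evt = net(torch.randn(1, 3, 32, 32), torch.randn(1, 1, 32, 32))
# Two separate 16-channel feature maps, not one premixed "soup".
```

Only after this split does any cross-modal module (the "mediator") get to decide what the branches exchange.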

3. The "Handshake" (Selective Fusion)

Once the teams have done their own work, they come together, but very carefully.

  • MSEM (Motion Saliency Enhancer): The Event Team whispers to the Image Team: "Hey, look right here! There was a fast movement here, even though the picture is blurry. Let's sharpen this specific spot."
  • ESEM (Event Semantic Engraver): The Image Team whispers back to the Event Team: "You're missing some context because your map is broken. Here is the shape of the object so you know what you are looking at."

They only share information where it is useful. If the motion data is too broken, they ignore it and rely on the picture. If the picture is too blurry, they lean on the motion data.
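A common way to implement this kind of careful, selective handshake is a learned gate: a small network predicts, per location, how much the event features should be trusted. The sketch below is a generic gated-fusion stand-in for MSEM/ESEM, not the paper's modules:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Selective fusion sketch: a sigmoid gate (values in 0..1) decides,
    per pixel, how much event information to inject into the image
    features. Where events look unreliable, the gate stays near 0 and
    the image features pass through untouched."""
    def __init__(self, channels=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, f_img, f_evt):
        trust = self.gate(torch.cat([f_img, f_evt], dim=1))
        return f_img + trust * f_evt  # add event cues only where trusted

fuse = GatedFusion()
out = fuse(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
```

The gate sees both modalities at once, so it can learn exactly the behavior described above: lean on the picture where the motion data is broken, and lean on the motion data where the picture is too blurry.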

The Result

When they tested RED, it was a game-changer.

  • Old methods: When 30% of the motion data was missing, photo quality crashed.
  • RED: Even with 50% of the motion data missing, RED still produced a sharp, clear photo. In fact, it was often better than systems that used no motion data at all.

In a Nutshell:
Imagine you are trying to fix a torn map.

  • Old way: You glue the torn pieces together randomly, making a mess.
  • RED way: You first study the map's geography (the image) and the torn pieces' shapes (the events) separately. Then, you carefully match the pieces only where they fit perfectly, ignoring the parts that are too damaged.

RED teaches computers to be smart about what they trust, ensuring that even when the sensors are imperfect, the final picture is crystal clear.