Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

Imagine you are the head of a quality control team in a massive factory. Your job is to spot defects on products coming down the assembly line.

The Old Way: The "One Expert Per Product" Problem

In the past, factories used a strategy called "One Expert Per Product."

If you made bagels, you hired a specialist who only knew bagels.
If you made cookies, you hired a different specialist who only knew cookies.
If you made carrots, you hired a third specialist.

The Problem: This is incredibly expensive and slow. If the factory suddenly starts making donuts, you have to hire a whole new team, train them from scratch, and buy new equipment. It's a logistical nightmare.

The New Goal: The "Super Detective"

The researchers wanted to build a Single Super Detective who could spot defects on any product (bagels, cookies, carrots, donuts) using just one brain. This is called Unified Multimodal Anomaly Detection.

To make this detective even smarter, the researchers gave it two pairs of glasses:

RGB Glasses: See the color and texture (like a normal camera).
Depth Glasses: See the 3D shape and height (like a 3D scanner).

By combining these two views, the detective can spot a stain or discoloration that is obvious in color but leaves the 3D shape unchanged, or a dent that is obvious in 3D but hard to see in color.

The Big Crisis: "Catastrophic Forgetting"

Here is where the story gets tricky. Imagine you train your Super Detective on Bagels first. They become a Bagel expert. Then, you teach them about Cookies.

The Disaster: As soon as they start learning about Cookies, their brain starts to glitch. They begin to forget what a Bagel looks like! They might start thinking a bagel is a cookie, or they miss defects on the bagels they used to know perfectly.

In computer science, this is called Catastrophic Forgetting. It's like trying to learn a new language, but every time you learn a new word, you forget half the words you already knew.

The Villains: "Spurious" and "Redundant" Features

The paper identifies two specific villains causing this memory loss:

Spurious Features (The "Distractors"): These are fake clues. Imagine the detective sees a "crumb" on a cookie and thinks, "Ah, crumbs mean it's a cookie!" But then they see a crumb on a bagel and get confused. The crumb isn't a real clue for the category; it's just a random coincidence. The detective gets distracted by these fake links between different objects.
Redundant Features (The "Noise"): This is too much information. Imagine the detective is looking at a cookie and sees the color, the texture, the shape, the shadow, the background, and the lighting. Most of this is just "noise." The brain gets overwhelmed by all the extra data and can't focus on the real defect.

When you combine two types of glasses (RGB + Depth), the noise and distractions get even louder, making the detective forget even faster.

The Solution: The "IB-IUMAD" Framework

The authors built a new system called IB-IUMAD (Information Bottleneck-based Incremental Unified Multimodal Anomaly Detection) to fix this. They used two clever tools:

1. The Mamba Decoder: The "Organizer"

Think of the Mamba Decoder as a super-organized librarian.

When the detective looks at a bagel and a cookie, the two can get mixed up in the detective's brain because they look similar in some ways.
The Librarian steps in and says, "Stop! Look at the label on the box. This is a bagel. That is a cookie. Don't mix their features up."
It untangles the messy connections between different objects, ensuring the detective learns the true features of a cookie without accidentally "stealing" features from the bagel.

2. The Information Bottleneck: The "Filter"

Think of this as a high-tech coffee filter.

The detective receives a huge cup of "feature soup" (all the data from the RGB and Depth glasses).
The Filter squeezes out all the redundant water (the noise, the background, the useless shadows).
It only lets the pure coffee (the essential, useful information) pass through to the detective's brain.
By forcing the brain to focus only on the most important clues, it stops the brain from getting overwhelmed and forgetting old lessons.

The Result

The researchers tested this new system on real factory data (MVTec 3D-AD and Eyecandies).

Before: The old systems forgot 12.5% of what they learned when moving to new products. They were slow and needed a lot of memory.
After (IB-IUMAD): The new system forgot only 6.3% (cutting the forgetting in half!). It was also 44 times more memory-efficient and 41 times faster than the old "One Expert Per Product" method.

The Takeaway

This paper is like a guide on how to teach a robot to be a master of many trades without losing its mind. By acting as a strict organizer (Mamba) and a ruthless filter (Information Bottleneck), the system learns new skills (like detecting donuts) without erasing its old skills (like detecting bagels), all while using far less memory and running far faster than the old "One Expert Per Product" approach.

1. Problem Definition

The paper addresses the challenge of Incremental Unified Multimodal Anomaly Detection (IUMAD).

Context: Industrial quality inspection typically uses RGB and depth images to detect surface defects. Traditional approaches follow an "N-objects-N-models" paradigm (a separate model per object), which is computationally expensive and lacks generalization.
Goal: Develop a single "N-objects-One-model" framework that can detect anomalies across multiple categories and support incremental learning (learning new objects sequentially without retraining on old data).
Core Challenge: Catastrophic Forgetting. As the model learns new objects, it tends to overwrite knowledge of previously learned objects.
Specific Insight: The authors identify that spurious features (irrelevant correlations between objects) and redundant features (non-discriminative information) in multimodal fusion significantly exacerbate catastrophic forgetting. Multimodal frameworks are more susceptible to this than unimodal ones due to the complexity of cross-modal feature coupling.

2. Methodology: IB-IUMAD

The authors propose IB-IUMAD, a denoising framework designed to mitigate forgetting by explicitly filtering spurious and redundant features. It is described below in four parts: three architectural components and the training loss.

A. Multimodal Feature Extraction Networks (MFEN)

Uses EfficientNet to extract features from RGB and Depth images.
Synthesizes "abnormal" features by applying feature jitters and perturbations to normal features, creating a training signal for reconstruction.
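The jitter step can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `jitter_features`, the `scale` and `prob` parameters, and the choice to tie the noise magnitude to each token's feature norm are all assumptions made for the example.

```python
import numpy as np

def jitter_features(normal_feats, scale=0.2, prob=1.0, rng=None):
    """Return a perturbed copy of `normal_feats` (shape: tokens x channels).

    Hypothetical helper: noise strength follows each token's feature norm.
    """
    rng = np.random.default_rng() if rng is None else rng
    channels = normal_feats.shape[1]
    # Per-token norm divided by the channel count sets the noise level.
    token_norm = np.linalg.norm(normal_feats, axis=1, keepdims=True) / channels
    noise = rng.standard_normal(normal_feats.shape) * token_norm * scale
    # Each token is perturbed with probability `prob`.
    mask = rng.random((normal_feats.shape[0], 1)) < prob
    return normal_feats + noise * mask

rng = np.random.default_rng(0)
normal = rng.standard_normal((196, 64))   # e.g. 14x14 patch tokens, 64 channels
pseudo_abnormal = jitter_features(normal, scale=0.2, prob=0.5, rng=rng)
target = normal                           # reconstruction target is the clean feature
```

The reconstruction network would then be trained to map `pseudo_abnormal` back to `target`, so that at test time a large reconstruction error signals an anomaly.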

B. Mamba Decoders (Disentangling Inter-Object Coupling)

Problem: As new objects are learned, the feature space of previous objects becomes coupled with spurious features from the new objects.
Solution: The framework integrates Mamba decoders (based on State Space Models) into the Multimodal Reconstruction Network (MRN).
Mechanism:
- Each decoder uses an Efficient State Space Module (ESSM) and depthwise convolution (DwConv) to extract fine-grained features.
- A label classifier is attached to the decoder output. By minimizing cross-entropy loss, the model is forced to use label information to disentangle inter-object feature coupling.
- This prevents the model from indiscriminately updating the feature space of previously learned objects, reducing spurious interference.
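The label-classifier idea above can be illustrated with a small NumPy sketch. It assumes the decoder output is average-pooled over tokens and passed through a single linear head; the pooling choice, the head weights `W` and `b`, and all shapes are hypothetical, and the Mamba decoder itself is replaced by random features.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true object label."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

rng = np.random.default_rng(0)
num_objects, channels = 4, 32
decoder_out = rng.standard_normal((8, 49, channels))     # batch x tokens x channels (stand-in)
labels = rng.integers(0, num_objects, size=8)            # object id of each image

pooled = decoder_out.mean(axis=1)                        # global average pool over tokens
W = rng.standard_normal((channels, num_objects)) * 0.01  # hypothetical linear head
b = np.zeros(num_objects)
logits = pooled @ W + b
cls_loss = cross_entropy(logits, labels)
```

Minimizing `cls_loss` pushes features of different objects toward separable regions, which is the disentangling effect the text attributes to the classifier.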

C. Information Bottleneck Fusion Module (IBFM) (Filtering Redundancy)

Problem: Fused multimodal features often contain redundant information that does not contribute to anomaly detection but increases the risk of forgetting.
Solution: An Information Bottleneck (IB) regularization module is applied to the fused features.
Mechanism:
- The module fuses multi-scale features from RGB and Depth using Cross-Attention.
- It applies a projection layer (Linear + Dropout + ReLU) to compress the fused feature $F_{fu}$ into a predictive feature $F^g_{fu}$.
- Objective: Maximize the mutual information between the compressed feature and the label, $I(F^g_{fu}; Y)$, while minimizing the mutual information between the original fused feature and the compressed feature given the label, $I(F_{fu}; F^g_{fu} \mid Y)$.
- This is optimized using Kullback-Leibler (KL) Divergence as a loss function, effectively filtering out redundant information while preserving discriminative power.
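The fusion and projection steps above can be sketched in NumPy as follows. This is a single-scale, single-head illustration under stated assumptions: RGB tokens act as queries and depth tokens as keys and values, a residual connection is added, and the projection maps to a smaller width (`bottleneck`). None of these specifics are confirmed by the summary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, Wq, Wk, Wv):
    """Single-head attention: queries from one modality, keys/values from the other."""
    Q, K, V = query_feats @ Wq, context_feats @ Wk, context_feats @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

def project(fused, W, b, drop_p=0.1, rng=None, train=True):
    """Linear -> Dropout -> ReLU, in the order listed in the text."""
    out = fused @ W + b
    if train:
        keep = rng.random(out.shape) >= drop_p
        out = out * keep / (1.0 - drop_p)   # inverted dropout
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
tokens, dim, bottleneck = 49, 64, 16        # hypothetical sizes
f_rgb = rng.standard_normal((tokens, dim))
f_depth = rng.standard_normal((tokens, dim))
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

# RGB tokens attend to depth tokens; the residual keeps the RGB content.
f_fu = f_rgb + cross_attention(f_rgb, f_depth, Wq, Wk, Wv)
W_p = rng.standard_normal((dim, bottleneck)) / np.sqrt(dim)
f_fu_g = project(f_fu, W_p, np.zeros(bottleneck), rng=rng)
```

Here `f_fu` plays the role of $F_{fu}$ and `f_fu_g` the role of $F^g_{fu}$; the information bottleneck regularizer would be applied on top of these two tensors during training.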

D. Loss Function

The total loss combines:

Reconstruction Loss (MSE): Reconstructing normal features from abnormal inputs.
Classification Loss (Cross-Entropy): For the Mamba decoder outputs to enforce feature disentanglement.
Information Bottleneck Loss (KL Divergence): To enforce the filtering of redundant features.
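A weighted sum is one plausible way to combine the three terms. The sketch below assumes weights `lam_cls` and `lam_ib`, which are hypothetical, and uses placeholder categorical distributions `p` and `q` for the KL term, since the summary does not say which distributions the KL divergence compares.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def cross_entropy(probs, labels):
    return float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())

def kl_divergence(p, q):
    """KL(p || q) for rows of categorical probabilities, averaged over the batch."""
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=1).mean())

def total_loss(recon, clean, cls_probs, labels, p, q, lam_cls=1.0, lam_ib=0.1):
    # lam_cls and lam_ib are illustrative weights, not values from the paper.
    return (mse(recon, clean)
            + lam_cls * cross_entropy(cls_probs, labels)
            + lam_ib * kl_divergence(p, q))

clean = np.ones((2, 4))
recon = clean + 0.1                                  # small reconstruction error
cls_probs = np.array([[0.9, 0.1], [0.2, 0.8]])       # classifier outputs
labels = np.array([0, 1])
p = q = np.array([[0.5, 0.5], [0.3, 0.7]])           # identical, so the KL term is zero
loss = total_loss(recon, clean, cls_probs, labels, p, q)
```

With identical `p` and `q` the bottleneck term vanishes, so `loss` reduces to the reconstruction error plus the classification term.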

3. Key Contributions

Analysis of Feature Impact: The paper provides a theoretical and empirical analysis indicating that spurious and redundant features are primary drivers of catastrophic forgetting in incremental multimodal settings, and that they cause larger performance degradation in multimodal frameworks than in unimodal ones.
Novel Framework (IB-IUMAD): The first work to address IUMAD by combining Mamba decoders (for feature disentanglement) and Information Bottleneck fusion (for redundancy filtering).
Efficiency and Performance: The method achieves state-of-the-art performance while drastically reducing memory usage and increasing inference speed compared to existing paradigms.

4. Experimental Results

The method was evaluated on the MVTec 3D-AD and Eyecandies datasets across four incremental settings (e.g., "6-1 with 4 steps", which reads as six base objects followed by four steps that each add one new object).
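To make the setting notation concrete, the sketch below builds a task schedule for "6-1 with 4 steps" on the ten MVTec 3D-AD categories. It assumes the notation means a base task of six objects plus one new object per step, and that categories are taken in alphabetical order; the order used in the paper is not given here, and the helper `incremental_schedule` is hypothetical.

```python
def incremental_schedule(categories, base, per_step):
    """Split an ordered category list into a base task plus fixed-size increments."""
    steps = [categories[:base]]
    for start in range(base, len(categories), per_step):
        steps.append(categories[start:start + per_step])
    return steps

# The ten MVTec 3D-AD object categories; alphabetical order is an assumption.
MVTEC_3D_AD = ["bagel", "cable_gland", "carrot", "cookie", "dowel",
               "foam", "peach", "potato", "rope", "tire"]

schedule = incremental_schedule(MVTEC_3D_AD, base=6, per_step=1)
for i, task in enumerate(schedule):
    print(f"step {i}: {task}")
```

Step 0 is the base task; at each later step the model sees only the newly added object, and is then evaluated on every object seen so far.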

Performance Gains:
- On MVTec 3D-AD (6-1 with 4 steps, RGB+Depth), IB-IUMAD improved I-AUROC by 3.5% and AUPRO by 2.9% over the previous best (IUF).
- It reduced the Forgetting Metric (FM) by 5.8% (I-AUROC) and 1.5% (AUPRO), indicating significantly better retention of old knowledge.
- It outperformed baselines (IUF, CDAD) across all incremental settings and modalities (RGB, Depth, RGB+Depth).
Efficiency:
- Compared to the "N-objects-N-models" approach, IB-IUMAD reduced memory usage by 44x and increased inference speed by 41x while maintaining comparable accuracy.
- Compared to other unified models, it achieved higher frame rates (21.4 FPS vs. <5 FPS for competitors like M3DM) with a lower memory footprint (1.4 GB vs. 65 GB).
Ablation Studies: Confirmed that both the Mamba decoder and the IBFM module are essential; removing either leads to a drop in accuracy and an increase in forgetting.
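For readers unfamiliar with forgetting metrics, the sketch below computes one common continual-learning definition: for each earlier task, the best score it ever reached minus its score after the final step, averaged over tasks. Whether the paper's FM is defined exactly this way is an assumption, and the scores are made up for illustration.

```python
def forgetting_measure(history):
    """history[t][k]: score on task k after training step t (k <= t).

    Returns the mean, over all tasks except the last, of
    (best earlier score - final score). Needs at least two steps.
    """
    final = history[-1]
    drops = []
    for k in range(len(final) - 1):
        best_earlier = max(history[t][k] for t in range(k, len(history) - 1))
        drops.append(best_earlier - final[k])
    return sum(drops) / len(drops)

# Made-up I-AUROC values (%) over three steps, for illustration only.
history = [
    [90.0],
    [84.0, 88.0],
    [80.0, 85.0, 87.0],
]
fm = forgetting_measure(history)
```

In this toy history the first task drops from 90.0 to 80.0 and the second from 88.0 to 85.0, so a lower value of `fm` means old knowledge was retained better.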

5. Significance

Theoretical Advancement: This work connects Information Bottleneck theory with Incremental Learning in the context of Multimodal Anomaly Detection. It attributes the greater forgetting seen in multimodal models to feature coupling and addresses it with an information-theoretic regularizer.
Industrial Applicability: By solving the catastrophic forgetting problem and drastically reducing memory/computational costs, IB-IUMAD makes the deployment of unified, scalable anomaly detection systems feasible for real-world industrial environments where new product lines are constantly introduced.
Paradigm Shift: It moves the field away from the resource-heavy "one-model-per-object" approach toward a sustainable, single-model incremental learning paradigm.