Imagine you are trying to catch a thief in a crowded room. Most people are looking at the big, obvious movements—someone running, shouting, or waving their arms. But the real clue is a tiny, split-second twitch of a lip or a fleeting furrow of a brow that happens so fast the naked eye barely registers it. This is a micro-expression.
This paper is about teaching a computer to spot these tiny, hidden clues better than anyone else. Here is the story of how the authors did it, explained simply.
The Problem: The "Needle in a Haystack"
Micro-expressions are like whispers in a hurricane. They are:
- Too fast: They last only a fraction of a second (typically under half a second).
- Too subtle: The movement is tiny.
- Too noisy: The rest of the face (or the background) is moving, making it hard to see the small change.
Old computer methods tried to watch the whole video and calculate every movement (like trying to read every word in a book to find one typo). This was slow, expensive, and often missed the point.
The Solution: The "Two-Headed Detective"
The authors built a new AI system that acts like a two-headed detective. Instead of just looking at the whole picture or just one spot, it uses two different "brains" working at the same time to solve the case.
1. The "Wide-Angle" Brain (ResNet)
- What it does: This part of the AI looks at the entire face to understand the big picture. It asks, "Is the whole face tense? Is the person generally happy or sad?"
- The Analogy: Think of this as a security guard standing at the back of the room. He sees the whole crowd and notices the general mood. He uses a special trick called "Residual Learning" (ResNet), which is like giving the guard a pair of bionic legs. Even if the guard has to walk a very long path (a deep network), these legs prevent him from getting tired or losing his way (solving the "vanishing gradient" problem).
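The core trick is simpler than it sounds: instead of forcing each layer to learn a full transformation, a residual block learns only a small correction and adds the input back on top. Here is a toy, framework-free sketch of that skip connection (the function names are ours, purely illustrative, not the paper's actual network):

```python
# Toy sketch of a residual ("skip") connection, the idea behind ResNet.
# A residual block computes y = F(x) + x: the layer learns only the
# small "difference" F(x), and the input x rides along untouched.

def layer(x, weight):
    """A stand-in for one network layer: a simple scaled transform."""
    return [weight * v for v in x]

def residual_block(x, weight):
    """Output = transformed input + the input itself (the skip path)."""
    fx = layer(x, weight)
    return [f + v for f, v in zip(fx, x)]

x = [1.0, 2.0, 3.0]
# Even if the layer learns "nothing" (weight = 0), the block still
# passes the signal through unchanged -- which is why signals and
# gradients keep flowing through very deep residual networks.
print(residual_block(x, 0.0))   # identity: [1.0, 2.0, 3.0]
print(residual_block(x, 0.5))   # small refinement: [1.5, 3.0, 4.5]
```

Because the worst case is "do nothing" rather than "garble the signal," stacking many such blocks stays trainable.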
2. The "Zoom-In" Brain (Inception)
- What it does: This part of the AI zooms in on specific tiny areas, like the corners of the mouth or the eyebrows. It asks, "Is just this muscle twitching?"
- The Analogy: Think of this as a forensic investigator with a magnifying glass. While the security guard watches the crowd, the investigator is looking at a single drop of sweat on a suspect's forehead. The "Inception" architecture is like having multiple magnifying glasses of different sizes at once, so the investigator can see details from different angles simultaneously.
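The "multiple magnifying glasses" idea can be sketched in one dimension: run the same signal through filters of several sizes in parallel, then stitch the results together. Real Inception modules use 1x1, 3x3, and 5x5 convolutions on images; this moving-average toy (our own simplification, not the paper's code) just shows the shape of the idea:

```python
# Toy sketch of the Inception idea: analyze the same signal at several
# scales at once and concatenate the results.

def smooth(signal, window):
    """Moving average with a given window size (same-length output)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def inception_branches(signal, windows=(1, 3, 5)):
    """Run every branch in parallel and concatenate their outputs."""
    features = []
    for w in windows:
        features.extend(smooth(signal, w))
    return features

signal = [0.0, 0.0, 1.0, 0.0, 0.0]   # a tiny, brief "twitch"
feats = inception_branches(signal)
print(len(feats))  # 15: three views of the same 5-sample signal
```

The narrow window preserves the sharp twitch exactly, while the wider ones capture its surroundings; the network downstream gets both at once.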
3. The "Smart Mixer" (Attention Fusion)
- What it does: Now the AI has two reports: one from the Wide-Angle guard and one from the Zoom-In investigator. But how do they combine them?
- The Analogy: Imagine the two detectives are shouting their findings at the same time. The "Smart Mixer" (a CBAM, short for Convolutional Block Attention Module) acts like a super-intelligent editor. It listens to both, but it knows when to turn up the volume on the Zoom-In investigator if a tiny twitch is happening, and when to listen to the Wide-Angle guard if the whole face is reacting. It filters out the noise (like a blinking eye or a background movement) and focuses only on the important clues.
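The "turn up the volume" step can be sketched as attention weighting: each feature channel gets a weight between 0 and 1 based on how active it is, and is rescaled by that weight. Real CBAM learns its weights with small networks over pooled features along both channel and spatial axes; this hand-rolled version (all names are ours, not the paper's) only illustrates the gating idea:

```python
import math

# Toy sketch of attention-style fusion in the spirit of CBAM: score
# each feature channel, squash the score to (0, 1), then rescale the
# channel -- loud, informative channels pass through, quiet ones fade.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(channels):
    """channels: list of feature vectors (one per 'detective')."""
    weights = [sigmoid(sum(c) / len(c)) for c in channels]
    return [[w * v for v in c] for w, c in zip(weights, channels)]

wide_view = [0.1, 0.1, 0.1]   # calm global features
zoomed_in = [4.0, 5.0, 4.0]   # a strong local twitch
fused = channel_attention([wide_view, zoomed_in])
# The zoomed-in channel keeps almost all of its signal (weight near 1),
# while the calm wide-angle channel is turned down.
```

The key design point is that the mixing ratio is computed from the features themselves, frame by frame, rather than being a fixed 50/50 blend.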
The Training: "Less is More"
The researchers tried training their AI with different "brain sizes" (different numbers of layers).
- The Surprise: They thought a bigger, deeper brain would be smarter.
- The Reality: Because micro-expression videos are rare (like having only 255 clues to solve a mystery), a giant brain got confused and overfitted. It memorized the training data instead of learning the rules.
- The Fix: They found that a smaller, simpler brain (ResNet12) actually worked best. It's like using a sharp, simple knife instead of a giant, clumsy chainsaw to cut a delicate gem.
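Why a giant model memorizes 255 samples is easy to demonstrate: with enough capacity, "perfect training accuracy" can mean nothing more than a lookup table. This toy (purely illustrative numbers, not the paper's experiment) contrasts a memorizer with a model that learns the underlying rule:

```python
# Toy illustration of overfitting on scarce data: a huge model can
# simply memorize all 255 training clues, scoring perfectly on them
# while learning nothing that transfers to new faces.

train = {i: i % 3 for i in range(255)}      # 255 samples, 3 classes

def huge_model(x):
    """Memorizes: perfect on training data, clueless elsewhere."""
    return train.get(x, 0)                  # falls back to guessing

def small_model(x):
    """Learns the simple underlying rule instead."""
    return x % 3

test = {i: i % 3 for i in range(255, 300)}  # unseen data
train_acc_huge = sum(huge_model(x) == y for x, y in train.items()) / len(train)
test_acc_huge  = sum(huge_model(x) == y for x, y in test.items()) / len(test)
test_acc_small = sum(small_model(x) == y for x, y in test.items()) / len(test)
print(train_acc_huge, test_acc_huge, test_acc_small)  # 1.0 0.333... 1.0
```

With only 255 clues, the smaller ResNet12 backbone plays the role of `small_model`: it lacks the capacity to memorize, so it is forced to generalize.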
The Results: Winning the Game
They tested their "Two-Headed Detective" on a famous dataset called CASME II (a collection of micro-expression videos).
- The Score: The AI got 74.67% accuracy.
- The Comparison: This beat the old methods (like LBP-TOP) by a huge margin (over 11% better!). It was also better than most other high-tech methods, though it was slightly behind one method that artificially "magnified" the tiny movements first.
- The Catch: The AI sometimes got confused between "Surprise" and "Repression" because both involve similar mouth movements. It's like the AI mistook a smile for a grimace because the corners of the mouth moved the same way.
Why Does This Matter?
This technology isn't just for fun. It could help:
- Police: Catch liars during interrogations.
- Doctors: Detect hidden depression or anxiety in patients who are trying to hide it.
- Marketers: Understand if people truly like a product or are just pretending.
In a Nutshell
The authors built a smart AI that uses two different ways of looking (broad view + zoomed-in view) and a smart editor to combine them. By keeping the system simple enough to handle the limited data, they created a tool that is much better at spotting the tiny, fleeting emotions that humans often miss.