MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

The MEGC 2026 challenge introduces two new tasks, Micro-Expression Video Question Answering (ME-VQA) and Micro-Expression Long-Video Question Answering (ME-LVQA), which leverage the multimodal reasoning capabilities of large vision-language models to advance the analysis of facial micro-expressions in both short and long video sequences.

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison

Published Wed, 11 Ma

Imagine you are at a high-stakes poker game. Everyone is trying to look calm, but one player just got a terrible hand. For a split second—faster than a camera shutter—his face twitches. His lip curls slightly, or his eyebrow raises. Then, just as quickly, he masks it with a poker face. That tiny, involuntary flicker is a Micro-Expression (ME). It's the truth trying to peek out from behind a lie.

For years, scientists have been trying to build computers that can spot these fleeting "truth flashes." But now, the game has changed. The MEGC 2026 (Micro-Expression Grand Challenge 2026) is a new competition asking a different, more human question: "Can you not just spot the twitch, but actually talk about it?"

Here is a simple breakdown of what this paper is about, using some everyday analogies.

The Big Idea: From "Spotting" to "Chatting"

In the past, computers were like security guards with a checklist. They would scan a face and say, "Yes, I see a micro-expression," or "No, I don't."

MEGC 2026 is upgrading the computer to be more like a detective with a notebook. Instead of just checking a box, the computer is given a video and asked natural questions like:

  • "What emotion was the person feeling right before they smiled?"
  • "How many times did they try to hide their anger?"
  • "Describe the specific muscle movement around the eyes."

This is powered by Multimodal Large Language Models (MLLMs). Think of these models as super-smart students who have read every book on human emotion and watched millions of videos. They can look at a video and "chat" about what they see, combining visual clues with language skills.
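
To make the "chat about a video" idea concrete, here is a minimal sketch of how a clip and a question might be packaged for a chat-style MLLM. The message structure mirrors the format many vision-language chat APIs accept, but the field names and file name here are illustrative, not the challenge's official interface.

```python
def build_vqa_prompt(video_path: str, question: str) -> list[dict]:
    """Package a video clip and a natural-language question as a
    chat-style message, the kind of input most MLLMs consume.
    (Illustrative structure, not the official MEGC 2026 interface.)"""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},   # the face clip
                {"type": "text", "text": question},       # the question
            ],
        }
    ]

# Hypothetical usage: this message list would then be passed to a model.
messages = build_vqa_prompt(
    "clip_017.mp4",
    "Is this a happy or an angry micro-expression? Explain your answer.",
)
```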

The Two Main Challenges (The Tasks)

The competition has two levels, like a video game with a "Tutorial" and a "Boss Level."

Level 1: The Short-Clip Detective (ME-VQA)

  • The Scenario: You are shown a very short video clip (a few seconds long) of someone's face.
  • The Task: You are asked a specific question about that clip.
    • Example: "Did the person show a 'lip corner depressor' (a sign of sadness)?" or "Is this a happy or angry micro-expression?"
  • The Goal: The computer must answer in full sentences, explaining why it thinks that. It's like a teacher asking a student to explain their answer, not just give a number.

Level 2: The Long-Haul Detective (ME-LVQA)

  • The Scenario: This is the hard mode. You are given a long video (like a whole conversation or a tense meeting) that might last several minutes.
  • The Task: The video is full of normal talking, laughing, and frowning. Hidden inside are tiny, fleeting micro-expressions. You have to find them and answer complex questions.
    • Example: "How many times did the person try to suppress a smile during the meeting?" or "List all the different facial movements that happened."
  • The Challenge: This is like finding a needle in a haystack, except the haystack is moving and the needle is invisible to the naked eye. The computer also has to remember what it saw minutes earlier in order to answer questions about the video as a whole.
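
One standard way systems cope with long videos is to split them into overlapping windows of frames, so each chunk fits in the model's context while no micro-expression is lost at a boundary. The sketch below shows the idea; the window and stride values are illustrative, not what any particular MEGC 2026 baseline uses.

```python
def sliding_windows(num_frames: int, window: int = 150, stride: int = 75):
    """Split a long video into overlapping frame windows.

    With stride < window, consecutive chunks overlap, which reduces the
    chance of cutting a fleeting micro-expression in half at a boundary.
    Example values: 150 frames is 5 seconds at 30 fps (illustrative).
    Returns a list of (start_frame, end_frame) pairs.
    """
    starts = range(0, max(num_frames - window, 0) + 1, stride)
    return [(s, min(s + window, num_frames)) for s in starts]

# A 10-second clip at 30 fps (300 frames) yields three overlapping chunks:
# [(0, 150), (75, 225), (150, 300)]
chunks = sliding_windows(300)
```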

The "Test Drive" (Baseline Results)

The authors of the paper didn't just propose the idea; they tried it out using two powerful AI models from the Qwen family of vision-language models. Think of these models as two different cars the researchers rented to see if they could win the race.

  • The Zero-Shot Test (Driving without a map): They asked the AI to do the job without any special training on micro-expressions.
    • Result: The AI was okay at spotting big, obvious emotions (like "Is he happy?"), but it was terrible at spotting the tiny, subtle ones. It was like a driver who can see a stop sign but misses a small pothole.
  • The Fine-Tuned Test (Driving with a map): They gave the AI a crash course using specific micro-expression videos.
    • Result: The AI got better at writing good sentences and describing what it saw. However, it still struggled with the hardest part: counting exactly how many micro-expressions happened and pinpointing exactly when they occurred in long videos.
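
"Pinpointing exactly when" a micro-expression occurred is usually judged with temporal intersection-over-union between the predicted time interval and the annotated one. Here is a minimal sketch of that metric, assuming intervals are (start, end) pairs in seconds; the challenge's official scoring may differ in detail.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals (start, end).

    1.0 means the prediction matches the ground-truth interval exactly;
    0.0 means they don't overlap at all. A common way to score temporal
    localization (illustrative, not the official MEGC 2026 metric).
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A model that "misses the pothole" entirely scores 0.0, while one that is merely a fraction of a second off still earns partial credit.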

The Takeaway

The paper concludes that while AI is getting smarter at "talking" about emotions, it still has a long way to go to truly "understand" the subtle, split-second lies and truths of the human face.

  • The Good News: AI can now describe micro-expressions in natural language, making the technology more useful for real-world applications (like lie detection or mental health analysis).
  • The Bad News: Long videos are still too confusing for current AI. The models get lost in the noise of normal facial movements and miss the tiny details.

In short: MEGC 2026 is inviting researchers to teach computers how to be better observers of human emotion, moving from simple "spotting" to complex "storytelling" about what people are really feeling.