Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

This paper proposes a decoupled, detector-agnostic framework for zero-shot Human-Object Interaction detection that leverages Multi-modal Large Language Models with a deterministic generation strategy and spatial-aware pooling to achieve superior generalization and training-free performance across diverse datasets.

Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang

Published 2026-02-18

Imagine you are trying to teach a computer to understand a busy scene, like a park where people are doing various things: someone is riding a bike, another is holding a cup, and a third is sitting on a bench.

This task is called Human-Object Interaction (HOI) detection. It's not just about finding the person or the object; it's about understanding the relationship between them.

The problem is that the world is full of infinite combinations. You can train a computer on "riding a bike," but what happens when it sees a person "riding a skateboard" or "riding a horse" for the first time? This is the Zero-Shot challenge: recognizing interactions the computer has never seen before.

Here is how this paper solves that problem, broken down into simple concepts and analogies.

1. The Old Way: The "Tightly Woven Sweater"

Previously, most computer vision methods were like a sweater knitted with two different types of yarn (Object Detection and Interaction Recognition) that were inseparable.

  • The Flaw: If you wanted to upgrade the "Object Detection" part (the yarn that finds the bike), you had to unravel and re-knit the whole sweater. You couldn't just swap in a better yarn without ruining the whole pattern.
  • The Result: These systems were rigid. If the object detector made a mistake (like drawing a box that was slightly too big), the interaction recognizer got confused. They also relied on "dumb" features that couldn't understand complex new situations.

2. The New Idea: The "Detached Translator"

The authors propose a Decoupled Framework. Imagine taking that sweater apart.

  • Step 1: You use a "Spotter" (an Object Detector) to find all the people and objects and draw boxes around them. Crucially, this can be any off-the-shelf detector you like.
  • Step 2: You hand the boxes to a Super-Translator (a Multi-modal Large Language Model, or MLLM). This translator is like a genius who has read millions of books and seen millions of pictures. It knows what "riding a bike" looks like, even if it's never seen that specific bike before.

The Magic Trick: Instead of asking the computer to "calculate" the interaction, they ask the Super-Translator a Question:

"Here is a picture of a person and a bike. Is the person riding it, holding it, or sitting on it?"
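The two-step pipeline above can be sketched in a few lines of Python. Note that `run_detector` and `score_interactions` are hypothetical stand-ins, not the paper's actual API; both are stubbed here so the decoupled control flow is visible:

```python
# Minimal sketch of the decoupled framework: any detector produces boxes,
# then an MLLM is queried about each person-object pair. The function
# names and stubbed outputs below are illustrative assumptions only.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def run_detector(image) -> List[Tuple[str, Box]]:
    """Stand-in for any off-the-shelf detector: returns (label, box) pairs."""
    # Stubbed output for illustration.
    return [("person", (10, 20, 110, 220)), ("bicycle", (60, 120, 200, 230))]

def score_interactions(image, person: Box, obj: Box, verbs: List[str]) -> dict:
    """Stand-in for the MLLM query: how likely is each verb for this pair?"""
    # Stubbed uniform scores; a real system reads these from the model.
    return {v: 1.0 / len(verbs) for v in verbs}

def detect_hoi(image, verbs: List[str]):
    """Step 1: spot people and objects. Step 2: ask the MLLM about each pair."""
    dets = run_detector(image)
    people = [box for label, box in dets if label == "person"]
    objects = [(label, box) for label, box in dets if label != "person"]
    results = []
    for p in people:
        for label, o in objects:
            scores = score_interactions(image, p, o, verbs)
            best = max(scores, key=scores.get)
            results.append((p, label, o, best, scores[best]))
    return results
```

Because Step 1 and Step 2 only communicate through boxes, swapping in a better detector never requires touching the interaction side.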

3. Solving the "Chatty AI" Problem

Large Language Models (like the one used here) are great at writing stories, but unreliable when you need a terse, structured answer. Ask them a multiple-choice question and they might write a whole paragraph explaining their reasoning instead of just saying "A, B, or C." This is bad for a computer system that needs precise, machine-readable data.

The Solution: Deterministic Generation
The authors invented a "Strict Exam Format." They force the AI to act like a machine, not a poet.

  • Instead of letting the AI write a story, they ask it to calculate the probability of each answer.
  • Analogy: Imagine a teacher asking a student to pick the right answer from a list. Instead of the student writing an essay, the teacher forces them to just circle the letter. The computer then scores how confident the AI is in that circle. This allows the system to work without retraining (Zero-Shot) because it just uses the AI's existing knowledge.
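The "circle the letter" idea can be sketched as a softmax restricted to the candidate answers' scores. The logit values below are toy numbers invented for illustration; a real system would read them from the MLLM's output head:

```python
# Sketch of deterministic generation: instead of letting the model decode
# free-form text, compare only the logits of the candidate answers and
# normalise them into a probability per answer. Toy numbers, not model output.
import math

def score_options(logits: dict, options: list) -> dict:
    """Softmax computed over just the candidate answers' logits."""
    vals = [logits[o] for o in options]
    m = max(vals)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in vals]
    z = sum(exps)
    return {o: e / z for o, e in zip(options, exps)}

# Toy logits for "Is the person riding, holding, or sitting on it?"
# Note the model might assign a high logit to filler words like "the";
# restricting to the options ignores that chatter entirely.
toy_logits = {"riding": 4.2, "holding": 1.1, "sitting on": 0.3, "the": 5.0}
probs = score_options(toy_logits, ["riding", "holding", "sitting on"])
best = max(probs, key=probs.get)
```

The confidence scores come out of the model's existing knowledge, so nothing needs to be retrained.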

4. The "Safety Net" and the "Speed Boost"

Even with a genius translator, there are two problems:

  1. Noisy Boxes: Sometimes the "Spotter" draws a box that includes a little bit of the background or misses part of the person. The AI gets confused.
  2. Too Slow: If there are 100 possible interactions (riding, holding, sitting, eating, etc.), asking the AI to check them one by one is like reading a dictionary page by page. It takes forever.

The Fixes:

  • Spatial-Aware Pooling (The Safety Net): The system adds a "context layer." It doesn't just look at the person and the object in isolation; it looks at where they are relative to each other.
    • Analogy: If you see a person and a cup, but the cup is floating 10 feet in the air, your brain knows they aren't interacting. This module does the same thing, ignoring interactions that don't make spatial sense.
  • One-Pass Matching (The Speed Boost): Instead of asking the AI to check 100 options one by one, the system asks it to check all 100 at once in a single glance.
    • Analogy: Instead of asking a librarian to find 100 specific books one by one, you hand them a list and say, "Highlight the ones that exist on this shelf right now." The librarian does it in one sweep.
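Both fixes can be sketched with simple box arithmetic. The helper names here are illustrative, not the paper's actual modules: `spatial_features` encodes where the two boxes sit relative to each other, and `one_pass_scores` turns one batch of per-verb logits into independent yes/no probabilities so all verbs are checked in a single pass:

```python
# Sketch of the two fixes, under the assumption that boxes are
# (x1, y1, x2, y2) tuples. Helper names are hypothetical.
import math

def spatial_features(person, obj):
    """Relative-position cues: normalised offset, scale ratios, and overlap."""
    px1, py1, px2, py2 = person
    ox1, oy1, ox2, oy2 = obj
    pw, ph = px2 - px1, py2 - py1
    ow, oh = ox2 - ox1, oy2 - oy1
    # Offset of the object's centre from the person's centre,
    # normalised by the person's size.
    dx = ((ox1 + ox2) / 2 - (px1 + px2) / 2) / pw
    dy = ((oy1 + oy2) / 2 - (py1 + py2) / 2) / ph
    # Intersection-over-union: a floating cup far from the person scores 0,
    # which is the "doesn't make spatial sense" signal.
    ix = max(0.0, min(px2, ox2) - max(px1, ox1))
    iy = max(0.0, min(py2, oy2) - max(py1, oy1))
    inter = ix * iy
    union = pw * ph + ow * oh - inter
    return (dx, dy, ow / pw, oh / ph, inter / union)

def one_pass_scores(verb_logits: dict, verbs: list) -> dict:
    """One forward pass yields a logit per verb; sigmoid them all at once."""
    return {v: 1.0 / (1.0 + math.exp(-verb_logits[v])) for v in verbs}
```

The one-pass version is the librarian's single sweep: each verb gets an independent probability in one call, instead of 100 separate questions.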

5. Why This Matters

This paper is a game-changer because:

  • It's Plug-and-Play: You can swap out the "Spotter" for a better one later without retraining the whole system.
  • It's Flexible: It can understand new interactions (like "riding a unicycle") without needing new training data.
  • It's Fast and Accurate: By combining the genius of a Large Language Model with a strict, efficient testing method, it beats previous state-of-the-art methods significantly.

In a nutshell: The authors stopped trying to force the computer to "learn" interactions from scratch. Instead, they built a system that asks a super-smart AI to describe what it sees, but with strict rules to ensure the answer is fast, accurate, and usable by any object detector.
