Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

This paper proposes a decoupled, detector-agnostic framework for zero-shot Human-Object Interaction detection that leverages Multi-modal Large Language Models with a deterministic generation strategy and spatial-aware pooling to achieve superior generalization and training-free performance across diverse datasets.

Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang

Published 2026-02-18

Imagine you are trying to teach a computer to understand a busy scene, like a park where people are doing various things: someone is riding a bike, another is holding a cup, and a third is sitting on a bench.

This task is called Human-Object Interaction (HOI) detection. It's not just about finding the person or the object; it's about understanding the relationship between them.

The problem is that the world is full of infinite combinations. You can train a computer on "riding a bike," but what happens when it sees a person "riding a skateboard" or "riding a horse" for the first time? This is the Zero-Shot challenge: recognizing interactions the computer has never seen before.

Here is how this paper solves that problem, broken down into simple concepts and analogies.

1. The Old Way: The "Tightly Woven Sweater"

Previously, most computer vision methods were like a sweater knitted with two different types of yarn (Object Detection and Interaction Recognition) that were inseparable.

  • The Flaw: If you wanted to upgrade the "Object Detection" part (the yarn that finds the bike), you had to unravel and re-knit the whole sweater. You couldn't just swap in a better yarn without ruining the whole pattern.
  • The Result: These systems were rigid. If the object detector made a mistake (like drawing a box that was slightly too big), the interaction recognizer got confused. They also relied on "dumb" features that couldn't understand complex new situations.

2. The New Idea: The "Detached Translator"

The authors propose a Decoupled Framework. Imagine taking that sweater apart.

  • Step 1: You use a "Spotter" (an Object Detector) to find all the people and objects and draw boxes around them. Crucially, this can be any off-the-shelf detector you like.
  • Step 2: You hand the boxes to a Super-Translator (a Multi-modal Large Language Model, or MLLM). This translator is like a genius who has read millions of books and seen millions of pictures. It knows what "riding a bike" looks like, even if it's never seen that specific bike before.

The Magic Trick: Instead of asking the computer to "calculate" the interaction, they ask the Super-Translator a Question:

"Here is a picture of a person and a bike. Is the person riding it, holding it, or sitting on it?"
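The two-step pipeline above can be sketched in a few lines of Python. Note that `run_detector` and `score_interactions` are hypothetical stand-ins, not the paper's actual API; both are stubbed here so the decoupled control flow is visible:

```python
# Minimal sketch of the decoupled framework: any detector produces boxes,
# then an MLLM is queried about each person-object pair. The function
# names and stubbed outputs below are illustrative assumptions only.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def run_detector(image) -> List[Tuple[str, Box]]:
    """Stand-in for any off-the-shelf detector: returns (label, box) pairs."""
    # Stubbed output for illustration.
    return [("person", (10, 20, 110, 220)), ("bicycle", (60, 120, 200, 230))]

def score_interactions(image, person: Box, obj: Box, verbs: List[str]) -> dict:
    """Stand-in for the MLLM query: how likely is each verb for this pair?"""
    # Stubbed uniform scores; a real system reads these from the model.
    return {v: 1.0 / len(verbs) for v in verbs}

def detect_hoi(image, verbs: List[str]):
    """Step 1: spot people and objects. Step 2: ask the MLLM about each pair."""
    dets = run_detector(image)
    people = [box for label, box in dets if label == "person"]
    objects = [(label, box) for label, box in dets if label != "person"]
    results = []
    for p in people:
        for label, o in objects:
            scores = score_interactions(image, p, o, verbs)
            best = max(scores, key=scores.get)
            results.append((p, label, o, best, scores[best]))
    return results
```

Because Step 1 and Step 2 only communicate through boxes, swapping in a better detector never requires touching the interaction side.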

3. Solving the "Chatty AI" Problem

Large Language Models (like the one used here) are great at writing stories, but unreliable when you need a terse, structured answer. Ask them a multiple-choice question and they might write a whole paragraph explaining their reasoning instead of just saying "A, B, or C." This is bad for a computer system that needs precise, machine-readable data.

The Solution: Deterministic Generation
The authors invented a "Strict Exam Format." They force the AI to act like a machine, not a poet.

  • Instead of letting the AI write a story, they ask it to calculate the probability of each answer.
  • Analogy: Imagine a teacher asking a student to pick the right answer from a list. Instead of the student writing an essay, the teacher forces them to just circle the letter. The computer then scores how confident the AI is in that circle. This allows the system to work without retraining (Zero-Shot) because it just uses the AI's existing knowledge.
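The "circle the letter" idea can be sketched as a softmax restricted to the candidate answers' scores. The logit values below are toy numbers invented for illustration; a real system would read them from the MLLM's output head:

```python
# Sketch of deterministic generation: instead of letting the model decode
# free-form text, compare only the logits of the candidate answers and
# normalise them into a probability per answer. Toy numbers, not model output.
import math

def score_options(logits: dict, options: list) -> dict:
    """Softmax computed over just the candidate answers' logits."""
    vals = [logits[o] for o in options]
    m = max(vals)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in vals]
    z = sum(exps)
    return {o: e / z for o, e in zip(options, exps)}

# Toy logits for "Is the person riding, holding, or sitting on it?"
# Note the model might assign a high logit to filler words like "the";
# restricting to the options ignores that chatter entirely.
toy_logits = {"riding": 4.2, "holding": 1.1, "sitting on": 0.3, "the": 5.0}
probs = score_options(toy_logits, ["riding", "holding", "sitting on"])
best = max(probs, key=probs.get)
```

The confidence scores come out of the model's existing knowledge, so nothing needs to be retrained.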

4. The "Safety Net" and the "Speed Boost"

Even with a genius translator, there are two problems:

  1. Noisy Boxes: Sometimes the "Spotter" draws a box that includes a little bit of the background or misses part of the person. The AI gets confused.
  2. Too Slow: If there are 100 possible interactions (riding, holding, sitting, eating, etc.), asking the AI to check them one by one is like reading a dictionary page by page. It takes forever.

The Fixes:

  • Spatial-Aware Pooling (The Safety Net): The system adds a "context layer." It doesn't just look at the person and the object in isolation; it looks at where they are relative to each other.
    • Analogy: If you see a person and a cup, but the cup is floating 10 feet in the air, your brain knows they aren't interacting. This module does the same thing, ignoring interactions that don't make spatial sense.
  • One-Pass Matching (The Speed Boost): Instead of asking the AI to check 100 options one by one, the system asks it to check all 100 at once in a single glance.
    • Analogy: Instead of asking a librarian to find 100 specific books one by one, you hand them a list and say, "Highlight the ones that exist on this shelf right now." The librarian does it in one sweep.
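Both fixes can be sketched with simple box arithmetic. The helper names here are illustrative, not the paper's actual modules: `spatial_features` encodes where the two boxes sit relative to each other, and `one_pass_scores` turns one batch of per-verb logits into independent yes/no probabilities so all verbs are checked in a single pass:

```python
# Sketch of the two fixes, under the assumption that boxes are
# (x1, y1, x2, y2) tuples. Helper names are hypothetical.
import math

def spatial_features(person, obj):
    """Relative-position cues: normalised offset, scale ratios, and overlap."""
    px1, py1, px2, py2 = person
    ox1, oy1, ox2, oy2 = obj
    pw, ph = px2 - px1, py2 - py1
    ow, oh = ox2 - ox1, oy2 - oy1
    # Offset of the object's centre from the person's centre,
    # normalised by the person's size.
    dx = ((ox1 + ox2) / 2 - (px1 + px2) / 2) / pw
    dy = ((oy1 + oy2) / 2 - (py1 + py2) / 2) / ph
    # Intersection-over-union: a floating cup far from the person scores 0,
    # which is the "doesn't make spatial sense" signal.
    ix = max(0.0, min(px2, ox2) - max(px1, ox1))
    iy = max(0.0, min(py2, oy2) - max(py1, oy1))
    inter = ix * iy
    union = pw * ph + ow * oh - inter
    return (dx, dy, ow / pw, oh / ph, inter / union)

def one_pass_scores(verb_logits: dict, verbs: list) -> dict:
    """One forward pass yields a logit per verb; sigmoid them all at once."""
    return {v: 1.0 / (1.0 + math.exp(-verb_logits[v])) for v in verbs}
```

The one-pass version is the librarian's single sweep: each verb gets an independent probability in one call, instead of 100 separate questions.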

5. Why This Matters

This paper is a game-changer because:

  • It's Plug-and-Play: You can swap out the "Spotter" for a better one later without retraining the whole system.
  • It's Flexible: It can understand new interactions (like "riding a unicycle") without needing new training data.
  • It's Fast and Accurate: By combining the genius of a Large Language Model with a strict, efficient testing method, it beats previous state-of-the-art methods significantly.

In a nutshell: The authors stopped trying to force the computer to "learn" interactions from scratch. Instead, they built a system that asks a super-smart AI to describe what it sees, but with strict rules to ensure the answer is fast, accurate, and usable by any object detector.
