HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

HAMMER is a framework that leverages multimodal large language models (MLLMs) for intention-driven 3D affordance grounding: it aggregates interaction intentions into contact-aware embeddings, then combines hierarchical cross-modal integration with multi-granular geometry lifting for accurate 3D localization.

Lei Yao, Yong Chen, Yuejiao Su, Yi Wang, Moyun Liu, Lap-Pui Chau

Published 2026-03-04

Imagine you are teaching a robot how to use a new object it has never seen before, like a strange-looking toaster or a futuristic chair. You don't just want the robot to see the object; you want it to understand how to interact with it. Where should it grab the handle? Where should it press the button? This ability to understand "how an object can be used" is called Affordance.

The paper introduces a new AI system called HAMMER that is really good at this task. Here is how it works, explained through simple analogies.

The Problem: The "Blind" Robot

Previous robots trying to do this had two main problems:

  1. They were too literal: They tried to describe the object in words first (e.g., "This is a red cup with a handle") and then guess where to grab it. This is like reading a manual before trying to open a jar; it's slow and often misses the point.
  2. They were 2D-minded: They looked at a 2D picture of an object, drew a mask on the screen, and then tried to "project" that flat drawing onto a 3D object. This is like trying to wrap a flat piece of paper around a basketball; it gets wrinkled, torn, and doesn't fit right.

The Solution: HAMMER (The "Intuitive" Robot)

The authors built HAMMER to mimic how humans learn. When you see a picture of someone holding a hammer, you don't need a manual to know where to hold it. You just get it. HAMMER does the same thing using a Multimodal Large Language Model (MLLM)—think of this as a super-smart brain that has read the entire internet and seen millions of images.

Here is the three-step "magic" HAMMER uses:

1. The "Contact-Aware" Intuition (The Spark)

Instead of writing a long description, HAMMER looks at the image and asks: "If a human were doing this, where would their hand touch?"

  • The Analogy: Imagine you are looking at a photo of someone opening a door. HAMMER doesn't say, "The person is holding a brass knob." Instead, it instantly creates a mental "glow" or a highlight on the exact spot where the hand touches the knob. It captures the intention of the action in a single, powerful signal.
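That "mental glow" can be pictured as attention pooling: patch-level image features are averaged with weights that favor the patches a hand would touch. Here is a minimal numpy sketch; the function name, shapes, and softmax pooling are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def contact_aware_embedding(patch_feats, contact_logits):
    """Hypothetical sketch: pool image patch features into one
    'contact-aware' embedding using predicted contact scores.

    patch_feats:    (N, D) features for N image patches.
    contact_logits: (N,)   unnormalized contact score per patch.
    """
    # Softmax turns contact scores into attention weights summing to 1.
    w = np.exp(contact_logits - contact_logits.max())
    w = w / w.sum()
    # Weighted sum: patches likely to be touched dominate the embedding.
    return w @ patch_feats  # (D,)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 32))   # 16 patches, 32-dim features
logits = rng.standard_normal(16)        # e.g. "hand touches knob here"
emb = contact_aware_embedding(feats, logits)
print(emb.shape)  # (32,)
```

The key design idea is compression: instead of a long verbal description, the whole interaction intention is distilled into one vector that downstream modules can consume directly.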

2. The "Cross-Modal" Handshake (The Fusion)

HAMMER takes that "glow" (the intention) and shakes hands with the 3D shape of the object.

  • The Analogy: Imagine the 3D object is a sculpture made of clay. HAMMER takes the "intention signal" from the picture and pours it into the clay. It's like a smart paint that knows exactly which parts of the sculpture are "touchable" based on the picture. It doesn't just guess; it uses the super-smart brain's knowledge to refine the shape, ensuring the robot understands that the handle is for holding, but the body is not.
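The "handshake" is essentially cross-attention: 3D point features act as queries and pull in the image-side intention features most relevant to them. The following single-head numpy sketch is a simplified assumption of how such fusion can work; the paper's MLLM-driven hierarchical integration is far richer.

```python
import numpy as np

def cross_attention(point_feats, intent_feats):
    """Hypothetical sketch: fuse intention tokens into point features.

    point_feats:  (P, D) queries from the 3D point cloud.
    intent_feats: (T, D) keys/values from image-side intention tokens.
    """
    d = point_feats.shape[1]
    scores = point_feats @ intent_feats.T / np.sqrt(d)   # (P, T)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Each point absorbs the intention context most relevant to it,
    # added residually so the original geometry is preserved.
    return point_feats + attn @ intent_feats             # (P, D)

pts = np.random.default_rng(1).standard_normal((2048, 64))   # point features
intent = np.random.default_rng(2).standard_normal((8, 64))   # intention tokens
fused = cross_attention(pts, intent)
print(fused.shape)  # (2048, 64)
```

The residual connection is the "smart paint" intuition: the clay keeps its shape, and the intention signal is layered on top of it rather than replacing it.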

3. The "Geometry Lifter" (The 3D Upgrade)

This is the most clever part. The "intention signal" comes from a flat 2D picture, so it lacks depth. HAMMER has a special module that acts like a 3D printer.

  • The Analogy: Imagine you have a shadow of a person on a wall (the 2D intention). You know the shadow is of a person, but you don't know if they are tall or short. HAMMER looks at the actual 3D object (the clay sculpture) and says, "Okay, the shadow says 'hand here,' and the 3D shape says 'this is a handle.' Let's combine them." It lifts the flat idea into 3D space, adding depth and structure so the robot knows exactly how far to reach and at what angle.
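One common way to "lift" a flat map into 3D is to project each 3D point into the image and sample the 2D contact map at that pixel, so every point inherits a touchability score. This numpy sketch assumes a pinhole camera and nearest-pixel sampling purely for illustration; the paper's multi-granular geometry lifting is more sophisticated.

```python
import numpy as np

def lift_to_3d(points, heatmap, fx=100.0, fy=100.0, cx=32.0, cy=32.0):
    """Hypothetical sketch: attach 2D contact evidence to 3D points.

    points:  (P, 3) point cloud in camera coordinates (z > 0).
    heatmap: (H, W) 2D contact map predicted from the image.
    """
    h, w = heatmap.shape
    # Pinhole projection: 3D point -> pixel coordinates.
    u = (fx * points[:, 0] / points[:, 2] + cx).astype(int)
    v = (fy * points[:, 1] / points[:, 2] + cy).astype(int)
    # Points that project outside the image get zero contact score.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    scores = np.zeros(len(points))
    scores[inside] = heatmap[v[inside], u[inside]]
    return scores  # (P,) per-point affordance evidence

rng = np.random.default_rng(3)
pts = rng.uniform([-0.3, -0.3, 1.0], [0.3, 0.3, 2.0], size=(1024, 3))
hm = rng.uniform(0.0, 1.0, size=(64, 64))   # fake 2D contact map
per_point = lift_to_3d(pts, hm)
print(per_point.shape)  # (1024,)
```

This is the "shadow meets sculpture" moment: the 2D map says *where in the picture*, the projection ties that to *which points in space*, and the combination tells the robot how far and at what angle to reach.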

Why is HAMMER Better?

The paper tested HAMMER against other methods using a "corrupted" benchmark. This means they took the 3D objects and added "noise"—like static on a TV, missing pieces, or shaking the camera.

  • The Result: When the data was messy, other robots got confused and pointed to the wrong spots. HAMMER, however, was like a seasoned detective. Even with a blurry photo or a broken 3D model, it could still figure out, "Ah, despite the noise, the intention is clearly to grasp the handle."
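Corruptions like "static" and "missing pieces" can be simulated on a point cloud with Gaussian jitter and random dropout. The functions and severity values below are illustrative assumptions, sketched in numpy, not the benchmark's exact corruption suite.

```python
import numpy as np

def jitter(points, sigma=0.01, rng=None):
    """Hypothetical 'static on a TV': add Gaussian noise to every point."""
    rng = rng if rng is not None else np.random.default_rng()
    return points + rng.normal(0.0, sigma, size=points.shape)

def dropout(points, keep_ratio=0.7, rng=None):
    """Hypothetical 'missing pieces': randomly discard a fraction of points."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = rng.random(len(points)) < keep_ratio
    return points[keep]

rng = np.random.default_rng(4)
cloud = rng.standard_normal((2048, 3))        # clean 3D object
noisy = jitter(cloud, sigma=0.02, rng=rng)    # "static" corruption
sparse = dropout(cloud, keep_ratio=0.5, rng=rng)  # "missing pieces"
print(noisy.shape, sparse.shape)
```

Running a model on clouds like `noisy` and `sparse` is how robustness is stress-tested: a method that only memorized clean shapes falls apart, while one grounded in intention, like HAMMER, can still locate the handle.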

Summary

HAMMER is a robot brain that:

  1. Sees an image and instantly understands the intention (where to touch).
  2. Merges that intention with the 3D shape of the object using a super-smart AI.
  3. Lifts the flat idea into 3D space so the robot knows exactly where to reach.

It's like giving a robot the "common sense" to look at a picture of a human using an object and immediately knowing how to pick it up, even if the object is weird, broken, or the picture is blurry.