HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

HAMMER is a framework that leverages multimodal large language models (MLLMs) for intention-driven 3D affordance grounding: it aggregates interaction intentions into contact-aware embeddings, then combines hierarchical cross-modal integration with multi-granular geometry lifting for accurate 3D localization.

Lei Yao, Yong Chen, Yuejiao Su, Yi Wang, Moyun Liu, Lap-Pui Chau

Published 2026-03-04

Imagine you are teaching a robot how to use a new object it has never seen before, like a strange-looking toaster or a futuristic chair. You don't just want the robot to see the object; you want it to understand how to interact with it. Where should it grab the handle? Where should it press the button? This ability to understand "how an object can be used" is called Affordance.

The paper introduces a new AI system called HAMMER that is really good at this task. Here is how it works, explained through simple analogies.

The Problem: The "Blind" Robot

Previous robots trying to do this had two main problems:

  1. They were too literal: They tried to describe the object in words first (e.g., "This is a red cup with a handle") and then guess where to grab it. This is like reading a manual before trying to open a jar; it's slow and often misses the point.
  2. They were 2D-minded: They looked at a 2D picture of an object, drew a mask on the screen, and then tried to "project" that flat drawing onto a 3D object. This is like trying to wrap a flat piece of paper around a basketball; it gets wrinkled, torn, and doesn't fit right.

The Solution: HAMMER (The "Intuitive" Robot)

The authors built HAMMER to mimic how humans learn. When you see a picture of someone holding a hammer, you don't need a manual to know where to hold it. You just get it. HAMMER does the same thing using a Multimodal Large Language Model (MLLM)—think of this as a super-smart brain that has read the entire internet and seen millions of images.

Here is the three-step "magic" HAMMER uses:

1. The "Contact-Aware" Intuition (The Spark)

Instead of writing a long description, HAMMER looks at the image and asks: "If a human were doing this, where would their hand touch?"

  • The Analogy: Imagine you are looking at a photo of someone opening a door. HAMMER doesn't say, "The person is holding a brass knob." Instead, it instantly creates a mental "glow" or a highlight on the exact spot where the hand touches the knob. It captures the intention of the action in a single, powerful signal.
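That "mental glow" can be pictured as attention pooling: patch-level image features are averaged with weights that favor the patches a hand would touch. Here is a minimal numpy sketch; the function name, shapes, and softmax pooling are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def contact_aware_embedding(patch_feats, contact_logits):
    """Hypothetical sketch: pool image patch features into one
    'contact-aware' embedding using predicted contact scores.

    patch_feats:    (N, D) features for N image patches.
    contact_logits: (N,)   unnormalized contact score per patch.
    """
    # Softmax turns contact scores into attention weights summing to 1.
    w = np.exp(contact_logits - contact_logits.max())
    w = w / w.sum()
    # Weighted sum: patches likely to be touched dominate the embedding.
    return w @ patch_feats  # (D,)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 32))   # 16 patches, 32-dim features
logits = rng.standard_normal(16)        # e.g. "hand touches knob here"
emb = contact_aware_embedding(feats, logits)
print(emb.shape)  # (32,)
```

The key design idea is compression: instead of a long verbal description, the whole interaction intention is distilled into one vector that downstream modules can consume directly.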

2. The "Cross-Modal" Handshake (The Fusion)

HAMMER takes that "glow" (the intention) and shakes hands with the 3D shape of the object.

  • The Analogy: Imagine the 3D object is a sculpture made of clay. HAMMER takes the "intention signal" from the picture and pours it into the clay. It's like a smart paint that knows exactly which parts of the sculpture are "touchable" based on the picture. It doesn't just guess; it uses the super-smart brain's knowledge to refine the shape, ensuring the robot understands that the handle is for holding, but the body is not.
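The "handshake" is essentially cross-attention: 3D point features act as queries and pull in the image-side intention features most relevant to them. The following single-head numpy sketch is a simplified assumption of how such fusion can work; the paper's MLLM-driven hierarchical integration is far richer.

```python
import numpy as np

def cross_attention(point_feats, intent_feats):
    """Hypothetical sketch: fuse intention tokens into point features.

    point_feats:  (P, D) queries from the 3D point cloud.
    intent_feats: (T, D) keys/values from image-side intention tokens.
    """
    d = point_feats.shape[1]
    scores = point_feats @ intent_feats.T / np.sqrt(d)   # (P, T)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Each point absorbs the intention context most relevant to it,
    # added residually so the original geometry is preserved.
    return point_feats + attn @ intent_feats             # (P, D)

pts = np.random.default_rng(1).standard_normal((2048, 64))   # point features
intent = np.random.default_rng(2).standard_normal((8, 64))   # intention tokens
fused = cross_attention(pts, intent)
print(fused.shape)  # (2048, 64)
```

The residual connection is the "smart paint" intuition: the clay keeps its shape, and the intention signal is layered on top of it rather than replacing it.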

3. The "Geometry Lifter" (The 3D Upgrade)

This is the most clever part. The "intention signal" comes from a flat 2D picture, so it lacks depth. HAMMER has a special module that acts like a 3D printer.

  • The Analogy: Imagine you have a shadow of a person on a wall (the 2D intention). You know the shadow is of a person, but you don't know if they are tall or short. HAMMER looks at the actual 3D object (the clay sculpture) and says, "Okay, the shadow says 'hand here,' and the 3D shape says 'this is a handle.' Let's combine them." It lifts the flat idea into 3D space, adding depth and structure so the robot knows exactly how far to reach and at what angle.
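One common way to "lift" a flat map into 3D is to project each 3D point into the image and sample the 2D contact map at that pixel, so every point inherits a touchability score. This numpy sketch assumes a pinhole camera and nearest-pixel sampling purely for illustration; the paper's multi-granular geometry lifting is more sophisticated.

```python
import numpy as np

def lift_to_3d(points, heatmap, fx=100.0, fy=100.0, cx=32.0, cy=32.0):
    """Hypothetical sketch: attach 2D contact evidence to 3D points.

    points:  (P, 3) point cloud in camera coordinates (z > 0).
    heatmap: (H, W) 2D contact map predicted from the image.
    """
    h, w = heatmap.shape
    # Pinhole projection: 3D point -> pixel coordinates.
    u = (fx * points[:, 0] / points[:, 2] + cx).astype(int)
    v = (fy * points[:, 1] / points[:, 2] + cy).astype(int)
    # Points that project outside the image get zero contact score.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    scores = np.zeros(len(points))
    scores[inside] = heatmap[v[inside], u[inside]]
    return scores  # (P,) per-point affordance evidence

rng = np.random.default_rng(3)
pts = rng.uniform([-0.3, -0.3, 1.0], [0.3, 0.3, 2.0], size=(1024, 3))
hm = rng.uniform(0.0, 1.0, size=(64, 64))   # fake 2D contact map
per_point = lift_to_3d(pts, hm)
print(per_point.shape)  # (1024,)
```

This is the "shadow meets sculpture" moment: the 2D map says *where in the picture*, the projection ties that to *which points in space*, and the combination tells the robot how far and at what angle to reach.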

Why is HAMMER Better?

The paper tested HAMMER against other methods using a "corrupted" benchmark. This means they took the 3D objects and added "noise"—like static on a TV, missing pieces, or shaking the camera.

  • The Result: When the data was messy, other robots got confused and pointed to the wrong spots. HAMMER, however, was like a seasoned detective. Even with a blurry photo or a broken 3D model, it could still figure out, "Ah, despite the noise, the intention is clearly to grasp the handle."
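Corruptions like "static" and "missing pieces" can be simulated on a point cloud with Gaussian jitter and random dropout. The functions and severity values below are illustrative assumptions, sketched in numpy, not the benchmark's exact corruption suite.

```python
import numpy as np

def jitter(points, sigma=0.01, rng=None):
    """Hypothetical 'static on a TV': add Gaussian noise to every point."""
    rng = rng if rng is not None else np.random.default_rng()
    return points + rng.normal(0.0, sigma, size=points.shape)

def dropout(points, keep_ratio=0.7, rng=None):
    """Hypothetical 'missing pieces': randomly discard a fraction of points."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = rng.random(len(points)) < keep_ratio
    return points[keep]

rng = np.random.default_rng(4)
cloud = rng.standard_normal((2048, 3))        # clean 3D object
noisy = jitter(cloud, sigma=0.02, rng=rng)    # "static" corruption
sparse = dropout(cloud, keep_ratio=0.5, rng=rng)  # "missing pieces"
print(noisy.shape, sparse.shape)
```

Running a model on clouds like `noisy` and `sparse` is how robustness is stress-tested: a method that only memorized clean shapes falls apart, while one grounded in intention, like HAMMER, can still locate the handle.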

Summary

HAMMER is a robot brain that:

  1. Sees an image and instantly understands the intention (where to touch).
  2. Merges that intention with the 3D shape of the object using a super-smart AI.
  3. Lifts the flat idea into 3D space so the robot knows exactly where to reach.

It's like giving a robot the "common sense" to look at a picture of a human using an object and immediately knowing how to pick it up, even if the object is weird, broken, or the picture is blurry.