LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

LocateAnything3D introduces a vision-language-model-native framework that reframes 3D detection as next-token prediction, using a Chain-of-Sight reasoning strategy to achieve state-of-the-art performance and zero-shot generalization on the Omni3D benchmark.

Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu

Published 2026-02-24

Imagine you are trying to teach a robot how to navigate a messy living room. You could give it a 2D photo and say, "There's a chair there." But if the robot tries to walk over, it might trip because it doesn't know how far away the chair is, how big it is, or which way it's facing.

For a long time, Vision-Language Models (VLMs)—the AI brains that can see pictures and talk about them—were great at describing 2D images but terrible at understanding the 3D world. They were like a tour guide who could describe a painting perfectly but had no idea how deep the room was.

LocateAnything3D is a new method that finally teaches these AI models to "see" in 3D, and it does so by mimicking how humans naturally think. Here is the simple breakdown:

1. The Core Idea: "Chain-of-Sight" (CoS)

Think of the old way of doing 3D detection as trying to guess the entire shape of a mystery object in the dark all at once. It's hard, and you often get it wrong.

The authors propose a new way called Chain-of-Sight. Imagine you are looking at a photo of a coffee cup on a table. Instead of guessing the 3D shape immediately, your brain does this:

  1. Step 1 (The 2D Anchor): "Okay, I see a round shape in the middle of the photo. That's the cup." (You locate it in the picture).
  2. Step 2 (The 3D Leap): "Since it's in the middle and looks small, it must be a few feet away. It's probably 4 inches tall."

The AI does the exact same thing. It first predicts the 2D box (a rectangle on the screen) and then uses that box as a stepping stone to predict the 3D box (the actual size and distance in the real world).

The Analogy: It's like building a house. You don't just magically conjure the roof; you first lay the foundation (the 2D box), then build the walls (the size), and finally put on the roof (the rotation). By forcing the AI to lay the foundation first, the whole structure becomes much more stable.
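The "3D leap" in Step 2 can be made concrete with a toy pinhole-camera calculation: once the model has a 2D box and a depth estimate, metric size follows from simple geometry. The function name and the focal-length numbers below are illustrative assumptions, not values from the paper.

```python
# Toy version of the 2D-anchor -> 3D-leap step (illustrative, not the
# paper's exact method): back-project a 2D box height to metric height
# using the pinhole camera model.

def metric_height(box_height_px, depth_m, focal_px):
    """Real-world height of an object whose 2D box spans box_height_px
    pixels, seen at depth_m meters with focal length focal_px pixels."""
    return box_height_px * depth_m / focal_px

# A cup spanning 60 px, seen 1.5 m away with a 900 px focal length:
print(metric_height(60, 1.5, 900))  # → 0.1 (about 10 cm tall)
```

This is why the 2D anchor helps so much: the same pixel height is ambiguous on its own, but combined with depth it pins down real size.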

2. The Order Matters: "Near to Far"

When a human looks at a room, they usually notice the coffee cup on the table before the painting on the far wall. Earlier detection models simply scanned images in raster order (left to right, top to bottom). But in 3D that ordering is confusing, because a small, distant object can sit right next to a large, nearby one in the image.

LocateAnything3D teaches the AI to scan Near to Far.

  • Why? Objects close to the camera give the AI the strongest clues. Once the AI knows exactly where the "near" objects are, it can use them as a ruler to guess where the "far" objects are.
  • The Metaphor: Imagine trying to guess the distance of a mountain. If you don't know where the trees in the foreground are, the mountain looks ambiguous. But if you know the trees are 10 feet away, you can use them to estimate the mountain is 1,000 feet away. The AI uses nearby objects as a "ruler" for the rest of the scene.
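The ordering idea itself is simple to sketch: sort detections by camera depth before serializing them into the output sequence. The field names below are assumptions for illustration.

```python
# Sketch of near-to-far ordering: nearer objects are emitted first, so
# they can anchor the scale of the rest of the scene. Field names are
# illustrative, not the paper's schema.

def near_to_far(objects):
    """Sort detections by distance from the camera, nearest first."""
    return sorted(objects, key=lambda obj: obj["depth"])

scene = [
    {"name": "painting", "depth": 6.0},
    {"name": "cup",      "depth": 1.5},
    {"name": "chair",    "depth": 3.2},
]
print([obj["name"] for obj in near_to_far(scene)])
# → ['cup', 'chair', 'painting']
```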

3. The "Recipe" for Success

The paper introduces a specific "recipe" for the AI to follow, which they call a Curriculum (like a school syllabus):

  • First: Find the object in the 2D picture (The "Where").
  • Second: Figure out how big it is (The "How Big").
  • Third: Figure out which way it's turned (The "Which Way").

This order is crucial. It's much easier to guess "where" something is before guessing "how big" it is. If you get the location wrong, your guess about the size will be wildly off. This step-by-step approach stops the AI from getting confused and "hallucinating" (making things up).
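Putting the recipe together, each detection becomes a short token sequence emitted in curriculum order: 2D location first, then the 3D attributes. The token tags and field names below are a minimal sketch of this idea, not the paper's exact output schema.

```python
# Sketch of serializing one detection in Chain-of-Sight order:
# 2D anchor -> depth -> size -> rotation. Tags and fields are
# illustrative assumptions, not the paper's exact format.

def chain_of_sight_tokens(obj):
    """Serialize a detection: the 'where' before the 'how big' and 'which way'."""
    x1, y1, x2, y2 = obj["box2d"]    # pixel-space rectangle (the 2D anchor)
    w, h, l = obj["size3d"]          # metric width / height / length
    return " ".join([
        f"<2d> {x1} {y1} {x2} {y2}",   # first: where in the image
        f"<depth> {obj['depth']}",     # then: how far away
        f"<size> {w} {h} {l}",         # then: how big
        f"<rot> {obj['yaw']}",         # finally: which way it faces
    ])

cup = {"box2d": (120, 200, 180, 260),
       "depth": 1.5,
       "size3d": (0.08, 0.1, 0.08),
       "yaw": 0.0}
print(chain_of_sight_tokens(cup))
# → <2d> 120 200 180 260 <depth> 1.5 <size> 0.08 0.1 0.08 <rot> 0.0
```

Because the model decodes tokens left to right, this fixed ordering is what lets each later attribute condition on the earlier, easier ones.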

4. Why This is a Big Deal

  • It's "Open-World": You can ask the AI to find "a red chair," "a weirdly shaped vase," or "a cat," and it will work. It doesn't need to be pre-trained on a specific list of objects.
  • It's Flexible: You can talk to it ("Find the chair") or point at it on the screen ("Click here, what's the 3D box?"), and it understands both.
  • It sets the state of the art: On the challenging Omni3D benchmark, this method outperformed all previous models by a wide margin. It even beat models that were given "cheat codes" (ground-truth 2D boxes to start from).

Summary

LocateAnything3D is like teaching a robot to drive by first teaching it to recognize traffic signs on a flat map, and then teaching it how far away those signs are in the real world. By breaking the complex problem of 3D vision into a simple, step-by-step conversation (2D first, then 3D; near first, then far), the AI finally learns to perceive the world in a way that is safe, accurate, and ready for real-world tasks like self-driving cars or home robots.
