Imagine you are wearing a pair of 360-degree VR goggles. You can look up, down, left, right, and even spin around to see the entire world in a single glance. Now, imagine asking a super-smart AI assistant, "Where is the fire hydrant?" or "How many red cars are there?"
While these AI assistants (called Multimodal Large Language Models or MLLMs) are amazing at looking at normal, flat photos, they often get completely lost when you hand them a 360-degree image. They struggle to understand the shape of the world, the distance between objects, and how things wrap around the edges.
This paper is like a report card and a new set of training wheels for these AIs to help them navigate the 360-degree world.
Here is the breakdown in simple terms:
1. The Problem: The "Unrolled Carpet" Confusion
Think of a 360-degree image as a giant, inflatable balloon. To show it on a flat computer screen, we have to "pop" the balloon and lay it out like a rug (this flattened view is usually called an equirectangular projection).
- The Distortion: When you flatten the balloon this way, the top and bottom (the poles) get stretched out like taffy, while the middle (the equator) stays normal.
- The AI's Struggle: Current AIs are trained on flat, normal photos. When they see this "unrolled rug," they get confused. They might think a stretched-out building is actually three different buildings, or they can't tell if two objects are next to each other or on opposite sides of the room.
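To get a feel for how bad the "taffy" stretching gets, here is a quick sanity check. It assumes the standard equirectangular layout, where every row of pixels spans the full 360 degrees of longitude; `horizontal_stretch` is an illustrative helper, not code from the paper.

```python
import math

def horizontal_stretch(latitude_deg):
    """In an equirectangular projection, every pixel row covers the full
    360 degrees of longitude, but the circle of latitude it represents
    shrinks toward the poles. Objects therefore appear stretched
    horizontally by a factor of 1 / cos(latitude)."""
    return 1.0 / math.cos(math.radians(latitude_deg))

for lat in (0, 45, 60, 80):
    print(f"latitude {lat:2d} deg: objects appear ~{horizontal_stretch(lat):.1f}x wider")
```

At the equator nothing stretches, but at 60 degrees up an object already appears twice as wide, and near the poles the distortion explodes, which is exactly why a building can look like three buildings.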
2. The Test: "360Bench" (The Driving Test)
The authors created a new test called 360Bench.
- The Setup: They gathered super-sharp, high-quality 360-degree photos at roughly 7,000-pixel resolution from cities, indoor spaces, and even drone shots.
- The Questions: They wrote 1,500+ questions that humans can answer easily but AIs find tricky. Examples include:
- "How many remote controls are on the table?" (Counting is hard when objects are stretched).
- "Is the toy store across from the grocery store?" (Spatial reasoning).
- "What does the sign on the trash can say?" (Reading text on distorted surfaces).
- The Result: They tested 13 different AIs. Even the smartest ones only got about 46% correct. Humans, by comparison, got 86% correct. The AIs were basically guessing more often than they were understanding.
3. The Solution: "Free360" (The Smart Tour Guide)
Since retraining these massive AIs is expensive and slow (like rebuilding a car engine just to fix the radio), the authors invented Free360. It's a "training-free" method, meaning it works with the AI you already have, just by giving it better instructions.
Think of Free360 as a smart tour guide that helps the AI solve the puzzle in four steps:
- Spot the Objects (The Detective): Instead of looking at the whole distorted "rug," the guide cuts out small, clean pieces of the image where the objects are. It's like zooming in on a specific part of a map so the AI isn't confused by the stretching.
- Describe the Details (The Reporter): The guide asks the AI to describe only that small piece. "This is a red sign that says 'Toys'."
- Spin the World (The Navigator): This is the magic trick. To figure out where two objects are relative to each other, Free360 rotates the 360-degree image so that both objects are right in the center, facing the AI. It's like spinning the globe until the two cities you are interested in are right in front of you, making it easy to see if they are neighbors or far apart.
- Draw the Map (The Architect): The guide puts all these clues into a Scene Graph. Imagine a flowchart or a family tree, but for the room.
- Node 1: Toy Store (Behind the viewer).
- Node 2: Grocery Store (To the right).
- Connection: "Toy Store is across from Grocery Store."
Finally, the AI reads this neat, organized "map" and gives the correct answer.
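The "Spin the World" and "Draw the Map" steps can be sketched in a few lines of Python. This is a minimal illustration assuming an equirectangular panorama; the function names and data structures are made up for the example, not the paper's actual code.

```python
import numpy as np

# Step 3, "Spin the World": an equirectangular panorama wraps around
# horizontally, so rotating the camera is just a circular shift of columns.
def recenter_on(pano, target_col):
    """Spin the panorama (H x W x C) so column `target_col` lands at the center."""
    w = pano.shape[1]
    return np.roll(pano, w // 2 - target_col, axis=1)

# Toy 8-column panorama where each column stores its own index,
# so we can see where the columns end up after the spin.
pano = np.tile(np.arange(8).reshape(1, 8, 1), (2, 1, 3))
spun = recenter_on(pano, target_col=6)
print(spun[0, :, 0])  # column 6 now sits at the center (index 4)

# Step 4, "Draw the Map": a scene graph is just nodes (objects) plus
# labeled edges (relations), mirroring the toy-store example above.
scene_graph = {
    "nodes": {
        "toy_store": "behind the viewer",
        "grocery_store": "to the right",
    },
    "edges": [("toy_store", "across_from", "grocery_store")],
}

def relations_of(graph, obj):
    """Read relations straight off the graph instead of re-reasoning
    over the distorted panorama."""
    return [f"{a} is {rel.replace('_', ' ')} {b}"
            for a, rel, b in graph["edges"] if obj in (a, b)]

print(relations_of(scene_graph, "toy_store"))
```

Once the graph exists, a question like "Is the toy store across from the grocery store?" becomes a simple lookup rather than a hard perception problem.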
4. The Results: A Big Jump
When they used this "Tour Guide" method:
- The AI's score jumped from 38% to 45% (a 7-point gain, which is a huge improvement in the AI world).
- It solved specific hard problems (like "Where is the object relative to me?") with up to 23% more accuracy.
- It did all this without needing to retrain the AI model, saving time and money.
The Takeaway
This paper shows that while AI is getting smarter, it still needs help understanding the "curved" world of 360-degree images. By breaking the problem down into small, manageable steps and using a "map" to organize the information, we can make these AIs much better at seeing the world as a whole, not just as a flat, distorted picture.
In short: They built a harder test to show where AIs fail, and then built a clever, step-by-step helper system that lets the AI "think" its way through the 360-degree world without needing a total makeover.