PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

This paper introduces PanoAffordanceNet, a novel framework, together with 360-AGD, the first high-quality dataset for holistic affordance grounding in 360-degree indoor environments. The approach tackles geometric distortion and semantic dispersion through distortion-aware calibration and multi-level constraints.

Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li, Kailun Yang

Published Wed, 11 Ma

Imagine you are a robot waiter trying to navigate a busy, circular living room. You need to know exactly where a person can sit, where they can put a cup down, or where they can lean back.

Most current robot "brains" are like people wearing blinders. They only look at one small square of the room at a time (like a standard photo). If they see a chair, they know "sit" applies there. But if the room is a 360-degree panorama, these robots get confused. They miss the chair behind them, or they get dizzy because the wide-angle camera stretches the image at the top and bottom (like a map of the world that stretches the poles).

This paper introduces PanoAffordanceNet, a new system designed to give robots a "god's eye view" of a room, helping them understand what can be done (affordances) anywhere in a full 360-degree circle, not just in front of them.

Here is how it works, broken down into simple analogies:

1. The Problem: The "Stretched Map" Issue

Standard panoramic cameras store the full sphere using equirectangular projection (ERP). Imagine taking a photo of a globe and flattening it out like a piece of paper.

  • The Equator (middle): Looks normal.
  • The Poles (top and bottom): Get stretched out like taffy.
  • The Result: A robot looking at a lamp near the "ceiling" of the image sees it as a giant, distorted blob. Existing AI models get confused by this stretching and can't tell where the "sit" zone on a sofa actually is.
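The "taffy stretching" can be quantified. The sketch below is my own illustration (not code from the paper): it computes how much a pixel row of an equirectangular image is stretched horizontally relative to the equator.

```python
import math

def erp_stretch_factor(row: int, height: int) -> float:
    """Horizontal stretch of an equirectangular image at a given pixel row.

    Latitude runs from +90 degrees (top row) to -90 degrees (bottom row).
    A circle of latitude has circumference proportional to cos(latitude),
    yet every row of the ERP image is rendered at the same width, so each
    pixel is stretched by 1 / cos(latitude): 1.0 at the equator, growing
    without bound toward the poles.
    """
    # Map the row index to latitude in radians, sampling at pixel centers.
    lat = math.pi * (0.5 - (row + 0.5) / height)
    # Clamp cos(lat) away from zero to avoid division blow-up at the poles.
    return 1.0 / max(math.cos(lat), 1e-6)
```

For a 512-row panorama the middle row has a factor of roughly 1.0, while the rows nearest the poles exceed 300, which is exactly why a ceiling lamp smears into a giant blob.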

2. The Solution: PanoAffordanceNet

The authors built a three-part "brain" to fix this:

Part A: The "Distortion Glasses" (DASM)

Think of this as a pair of smart glasses that the robot wears.

  • How it works: The system knows that the top and bottom of the image are stretched. It uses a special filter (a "spectral modulator") to look at the image in two ways at once:
    • High Frequency: Looking for sharp edges (like the edge of a table).
    • Low Frequency: Looking at the big picture (the shape of the room).
  • The Magic: It "un-stretches" the top and bottom parts of the image mathematically, so the robot sees a lamp near the ceiling as a normal lamp, not a giant smudge.
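To make the two-frequency idea concrete, here is a toy stand-in for the spectral modulator. The real DASM learns its modulation and accounts for latitude; this sketch (function name and hard radial cutoff are my own assumptions) just splits a single-channel panorama into coarse-layout and sharp-edge components in the Fourier domain.

```python
import numpy as np

def spectral_split(image: np.ndarray, cutoff: float = 0.1):
    """Split a single-channel panorama into low- and high-frequency parts.

    Toy illustration, not the paper's DASM: a hard radial cutoff in the
    2-D Fourier spectrum separates the coarse room layout (low
    frequencies) from sharp edges like table rims (high frequencies).
    """
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized distance of each frequency bin from the spectrum center.
    r = np.hypot((yy - h / 2) / h, (xx - w / 2) / w)
    low_mask = (r <= cutoff).astype(float)
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = image - low  # the residual carries the sharp edges
    return low, high
```

A learned version would reweight these bands differently near the poles, where the stretching pushes true scene detail into lower image frequencies.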

Part B: The "Connect-the-Dots" Head (OSDH)

In a 360-degree room, clues about where to sit or stand are often scattered. Maybe you see one leg of a chair, but not the seat.

  • The Problem: The robot's initial guess is "spotty." It sees a dot here, a dot there, but no whole chair.
  • The Fix: The Omni-Spherical Densification Head acts like a super-smart connect-the-dots artist. It looks at the scattered dots and asks, "If this is a chair leg, and I know how chairs work, where must the rest of the chair be?"
  • The Result: It fills in the gaps, turning scattered dots into a complete, solid shape that wraps perfectly around the curved room.
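The key geometric wrinkle in this densification is that a 360-degree image has no left or right edge: longitude wraps around, so a chair split across the image seam is still one chair. The toy sketch below (my own stand-in; the real OSDH is a learned module) spreads sparse evidence with circular padding so regions can grow across the seam.

```python
import numpy as np

def densify_wraparound(heat: np.ndarray, iters: int = 2) -> np.ndarray:
    """Spread sparse affordance evidence across a panorama.

    Toy illustration of the wrap-around idea behind the densification
    head: each pass lets a pixel inherit half the strength of its four
    neighbors, with left/right neighbors taken circularly (longitude
    wraps) and top/bottom clamped (the poles do not wrap vertically).
    """
    out = heat.astype(float)
    for _ in range(iters):
        left = np.roll(out, 1, axis=1)    # circular shift: col j-1 -> j
        right = np.roll(out, -1, axis=1)  # circular shift: col j+1 -> j
        up = np.vstack([out[:1], out[:-1]])    # clamped vertical shift
        down = np.vstack([out[1:], out[-1:]])
        out = np.maximum.reduce([out, 0.5 * left, 0.5 * right,
                                 0.5 * up, 0.5 * down])
    return out
```

A dot of evidence at the left edge leaks onto the right edge after one pass, which a naive zero-padded convolution would never do.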

Part C: The "Teacher" (Multi-Level Training)

To teach the robot, the authors didn't just show it pictures; they gave it a three-part lesson plan:

  1. Pixel Level: "Is this specific pixel part of a 'sit' zone?"
  2. Shape Level: "Does the whole shape look like a valid sitting area, or is it just random noise?"
  3. Language Level: "Does this area match the word 'sit'?"
Together, these three checks keep the robot from getting confused. For example, they stop it from labeling a "sit" zone as a "stand" zone just because the two look similar.

3. The New Playground: 360-AGD

You can't test a new car without a new track. The researchers built 360-AGD, the first-ever "driving school" specifically for 360-degree affordance grounding.

  • It contains hundreds of real panoramic photos of rooms.
  • Humans carefully marked exactly where you can sit, lie down, or place objects.
  • They split it into "Easy" (clean rooms) and "Hard" (messy, complex rooms) to really stress-test the robot.

4. The Results: Why It Matters

When they tested PanoAffordanceNet against other methods:

  • Old Methods: Got lost in the distortion. They would point to the ceiling when asked where to sit, or break a sofa into tiny, confusing pieces.
  • PanoAffordanceNet: Got it right. It handled the stretching, filled in the missing parts, and understood the language.
  • Real World: They even tested it on a robot wearing a camera on its head in a real office. It successfully found places to sit and put things down, even in messy, real-life lighting.

The Big Picture

This paper is a huge step forward for Embodied AI (robots that live in our world).

  • Before: Robots were like people with tunnel vision, only understanding the world in flat, 2D snapshots.
  • Now: PanoAffordanceNet gives them holistic vision. They can understand the entire room at once, respecting the curve of the world, and knowing exactly how humans can interact with every object in that space.

It's the difference between a robot that sees a "chair" and a robot that understands, "Ah, that's a whole room, and I know exactly where I can sit, where I can put my coffee, and where I can walk without bumping into anything."