REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation

REACT++ is a new state-of-the-art model for real-time Scene Graph Generation. By combining efficient feature extraction with subject-to-object cross-attention, it simultaneously achieves the highest inference speed, improved relation prediction accuracy, and unchanged object detection performance, running 20% faster than its predecessor with a 10% gain in accuracy.

Maëlic Neau, Zoe Falomir

Published 2026-03-09

Imagine you are looking at a busy street scene. Your brain doesn't just see "a blob of pixels"; it instantly understands: "There is a person (subject) riding (predicate) a bicycle (object)."

In the world of Artificial Intelligence, this task is called Scene Graph Generation (SGG). It's like turning a photo into a structured story or a map of relationships. This is crucial for robots, self-driving cars, and smart cameras so they can "understand" what's happening around them, not just see it.

However, there's a big problem: Current AI models are like slow, heavy librarians. They can tell you exactly who is riding what, but they take so long to check their books that by the time they finish, the robot has already crashed into a wall. They are accurate but too slow for real-time use.

Enter REACT++, the new "speedster" of the AI world. Here is how it works, explained simply:

1. The Old Way: The "Two-Step Dance" (and why it was slow)

Previously, most AI models used a "Two-Stage" approach.

  • Stage 1: A detective (Object Detector) finds all the people and bikes.
  • Stage 2: A second detective (Relation Predictor) looks at the list from Stage 1 and tries to guess who is doing what to whom.

The problem? The second detective was using a very slow, old-fashioned map-reading tool called ROI Align. Imagine trying to measure a pizza slice by drawing a perfect grid over it and calculating every single crumb. It's precise, but it takes forever. Also, the second detective often forgot to look at the whole room: it focused only on the two objects in front of it and missed the context (e.g., they are in a kitchen, so "eating" is more likely than "swimming").
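To make the "two-step dance" concrete, here is a minimal, pure-Python sketch of the classic pipeline: grid-based pooling per box (mimicking the cost pattern of ROI Align), then candidate pairs for the relation stage. All names, shapes, and values are illustrative assumptions, not the actual ROI Align or REACT++ code.

```python
# Toy two-stage SGG pipeline with grid-based pooling (illustrative only).

def roi_pool_grid(feature_map, box, out_size=2):
    """Sample a fixed out_size x out_size grid inside box (x0, y0, x1, y1).

    Like ROI Align, every grid cell is sampled, no matter how much we
    already know about the object -- that's where the time goes.
    """
    x0, y0, x1, y1 = box
    pooled = []
    for gy in range(out_size):
        for gx in range(out_size):
            # nearest-neighbour sample at the centre of each grid cell
            fx = int(x0 + (gx + 0.5) * (x1 - x0) / out_size)
            fy = int(y0 + (gy + 0.5) * (y1 - y0) / out_size)
            pooled.append(feature_map[fy][fx])
    return pooled

# Stage 1: the "detector" hands over boxes (hard-coded for the demo).
feature_map = [[float(x + 10 * y) for x in range(8)] for y in range(8)]
boxes = [(0, 0, 4, 4), (4, 4, 8, 8)]  # e.g., person, bike

# Stage 2: the "relation predictor" re-pools features for every box...
features = [roi_pool_grid(feature_map, b) for b in boxes]

# ...and considers every ordered (subject, object) pair.
pairs = [(s, o) for s in range(len(boxes))
         for o in range(len(boxes)) if s != o]
print(len(features[0]), len(pairs))  # 4 pooled values per box, 2 pairs
```

With real feature maps and dozens of boxes, the grid sampling and the pairwise blow-up are exactly the bottlenecks the next sections attack.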

2. The REACT++ Solution: The "Speedy Detective"

The authors of this paper built REACT++ to fix these bottlenecks. They made three major upgrades:

A. The New Tool: DAMP (The "Snappy Snap")

Instead of the slow, grid-based measuring tool (ROI Align), they invented DAMP (Detection-Anchored Multi-scale Pooling).

  • The Analogy: Imagine the old tool was like a surveyor walking around a house measuring every inch with a tape measure. The new tool, DAMP, is like a smart drone. It knows exactly where the object is because the first stage (the detector) already told it the coordinates. It just "snaps" a photo of that exact spot instantly.
  • Result: It's much faster and doesn't waste time calculating things it already knows.
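The "smart drone" idea can be sketched in a few lines: reuse the coordinates the detector already produced and take a single read per feature scale, instead of sampling a dense grid per box. The function name and the one-sample-at-the-centre rule are my simplifications for illustration, not the paper's exact DAMP formulation.

```python
# Illustrative detection-anchored pooling in the spirit of DAMP.

def damp_like_pool(multi_scale_maps, box):
    """Pick the feature at the box centre on each scale: one read per scale."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    samples = []
    for stride, fmap in multi_scale_maps:
        # the detector already told us where the object is, so just look there
        fx, fy = int(cx / stride), int(cy / stride)
        samples.append(fmap[fy][fx])
    return samples

# Two toy feature maps at stride 1 and stride 2 (a mini feature pyramid).
p1 = [[float(x + 10 * y) for x in range(8)] for y in range(8)]
p2 = [[float(x + 10 * y) for x in range(4)] for y in range(4)]
pyramid = [(1, p1), (2, p2)]

feat = damp_like_pool(pyramid, (0, 0, 4, 4))
print(feat)  # one value per scale, instead of a full grid per box
```

Compare with the grid pooling sketch earlier: the work per box drops from `out_size * out_size` samples to one sample per scale.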

B. The New Brain: CARPE (The "Contextual Connector")

The old models treated relationships as symmetrical. They thought "Person on Bike" was the same as "Bike on Person." But in reality, relationships have a direction!

  • The Analogy: Think of the old model as a two-way street where traffic flows both ways equally. REACT++ introduces CARPE (Cross-Attention Rotary Prototype Embedding), which is like a one-way highway with traffic lights. It understands that the "Person" is the driver and the "Bike" is the vehicle. It also adds a "spatial GPS" (Rotary Position Embedding) so the AI knows that if the person is above the bike, they are likely "riding" it, but if they are below, they might be "fixing" it.
  • Bonus: It also looks at the "Global Context" (using a module called AIFI). It's like the detective stepping back to look at the whole room. If the room is a beach, the AI guesses "swimming" is more likely than "driving."
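The two key ideas, one-way (subject-to-object) attention and a rotary positional twist, can be sketched in plain Python. The dimensions, angles, and single-head setup below are simplifications and assumptions, not CARPE's actual architecture; the sketch only shows why rotating queries and keys by a position-dependent angle makes attention depend on *relative* position.

```python
import math

# Toy directed cross-attention with a rotary-style position embedding.

def rotate_pairs(vec, pos, base=10.0):
    """Rotate each (even, odd) pair of vec by an angle derived from pos.

    This is the Rotary Position Embedding trick: position becomes a
    rotation, so dot products between rotated vectors depend only on
    the *difference* of positions (the "spatial GPS").
    """
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / (base ** (i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def cross_attend(subject_q, object_ks, object_vs, subj_pos, obj_positions):
    """One-way attention: the subject queries the objects, never vice versa."""
    q = rotate_pairs(subject_q, subj_pos)
    scores = []
    for k, p in zip(object_ks, obj_positions):
        rk = rotate_pairs(k, p)
        scores.append(sum(a * b for a, b in zip(q, rk)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted mix of the object values
    return [sum(w * v[i] for w, v in zip(weights, object_vs))
            for i in range(len(object_vs[0]))]

# "Person" (subject) attends over two candidates; the one at a nearby
# position wins most of the attention weight.
ctx = cross_attend(subject_q=[1.0, 0.0],
                   object_ks=[[1.0, 0.0], [0.0, 1.0]],
                   object_vs=[[1.0], [0.0]],
                   subj_pos=0.2, obj_positions=[0.3, 2.0])
print(ctx)
```

Because the roles of query (subject) and key (object) are fixed, "Person rides Bike" and "Bike rides Person" are no longer forced to score the same.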

C. The Smart Filter: DCS (The "Bouncer")

In the old days, the AI would try to check every single possible pair of objects in the image (e.g., "Is the lamp riding the cat?"). This is a waste of time.

  • The Analogy: REACT++ uses Dynamic Candidate Selection (DCS), which acts like a smart bouncer at a club. Instead of letting everyone in to check for relationships, the bouncer quickly checks the ID (confidence score) and only lets the most likely candidates (the top 47 people, for example) into the VIP room for the relationship check.
  • Result: It cuts out the noise and focuses only on the important stuff, saving massive amounts of time.
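The bouncer logic is simple enough to sketch directly: rank detections by confidence, keep the top k, and only pair up the survivors. The value of k and the toy labels below are made up for illustration, not taken from the paper.

```python
# Toy dynamic candidate selection (the "bouncer").

def select_candidates(detections, k=3):
    """detections: list of (label, confidence). Keep the k most confident."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    return ranked[:k]

def candidate_pairs(kept):
    """Ordered (subject, object) pairs among the survivors only."""
    return [(s[0], o[0]) for s in kept for o in kept if s is not o]

dets = [("person", 0.97), ("bike", 0.93), ("road", 0.88),
        ("lamp", 0.12), ("shadow", 0.05)]
kept = select_candidates(dets)
pairs = candidate_pairs(kept)
# 5 detections would give 20 ordered pairs; the bouncer leaves only 6.
print(len(dets) * (len(dets) - 1), "->", len(pairs))
```

Note the quadratic payoff: dropping low-confidence detections before pairing shrinks the relation-checking workload much faster than it shrinks the detection list.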

The Grand Result

By combining these three upgrades, REACT++ achieves a "Holy Grail" in AI:

  1. It's Fast: It runs in about 26 milliseconds. That's faster than a human blink. It's the first model to be truly "real-time."
  2. It's Smart: It didn't just get faster; it got smarter. It predicts relationships 10% more accurately than the previous version.
  3. It's Efficient: It uses fewer computer resources (parameters) than its competitors.

Why Should You Care?

Imagine a robot waiter in a restaurant.

  • Old AI: Takes 2 seconds to realize a customer is holding a glass. By then, the waiter has already bumped into the table.
  • REACT++: Instantly sees the customer, the glass, and the action "holding," and tells the robot to gently move the tray.

This paper proves that we don't have to choose between "smart" and "fast." With the right architecture, AI can be both, paving the way for robots that can actually interact with the real world in real-time.