Imagine you are looking at a painting. A standard AI (like a typical Large Vision-Language Model) looks at the whole picture at once and says, "I see a dog." It's like looking at a forest from a helicopter and just saying, "Trees." It gets the general idea, but it misses the details of where you are looking and how your eyes moved to find that dog.
TraceVision is like giving that AI a pair of glasses that can see not just the image, but also your finger tracing a path across the screen. It understands that when you point at a specific spot, move to another, and then circle a third, you are telling a story about what you see.
Here is a breakdown of how TraceVision works, using some everyday analogies:
1. The Problem: The "Helicopter View" vs. The "Finger Trace"
Current AI models are great at describing a whole scene, but they struggle with spatial attention: knowing exactly where in the image a person is focusing, and in what order.
- The Old Way: If you ask an AI, "What is on the table?" it might guess based on the whole image. It doesn't know which table you mean if there are three, or it might get distracted by a chair in the background.
- The Human Way: When humans look at a complex scene, our eyes don't just jump randomly. We follow a path. We might look at a red hat, then trace our eyes down to the blue shoes, then sweep over to the dog. This path is called a trajectory.
TraceVision is the first AI that treats these eye-movement paths as a crucial part of the conversation, not just an afterthought.
2. The Magic Ingredient: "Geometric Simplification" (The Art of Editing)
Raw eye-tracking data is messy. It's like a shaky video recording of a hand waving; it has thousands of tiny, jittery points that don't really mean anything.
- The Analogy: Imagine you have a 410-page handwritten diary, but most of the pages are just scribbles. You want to keep the story but lose the noise.
- The Solution: TraceVision uses a smart "editor" (called Geometric Simplification). It looks at the path you drew and asks, "Is this part of the path important?"
- If you slowly traced a circle around a dog, the AI keeps those points because the dog is important.
- If you quickly swiped your finger across the empty sky, the AI deletes those points because they are just "noise."
- Result: It turns a messy 410-point scribble into a clean, 37-point path that perfectly captures the intent of your gaze.
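The write-up doesn't spell out the exact algorithm TraceVision uses, but "geometric simplification" of a point path is classically done with a Ramer-Douglas-Peucker-style pass: keep the points where the path actually bends, drop the points that sit on a nearly straight stretch. The sketch below is illustrative, not the paper's implementation, and the `epsilon` tolerance is an assumed knob:

```python
def simplify(points, epsilon):
    """Ramer-Douglas-Peucker-style geometric simplification (illustrative sketch).

    Keeps points that deviate from the start-end chord by more than
    `epsilon`; collapses nearly straight runs to their two endpoints.
    """
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]

    def dist(p):
        # Perpendicular distance from p to the chord (x1, y1)-(x2, y2).
        x0, y0 = p
        num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        den = ((y2 - y1) ** 2 + (x2 - x1) ** 2) ** 0.5
        return num / den if den else ((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5

    idx, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                    key=lambda t: t[1])
    if dmax <= epsilon:
        # The whole span is nearly straight: keep only the endpoints.
        return [points[0], points[-1]]
    # Otherwise keep the most-deviating point and recurse on both halves.
    left = simplify(points[:idx + 1], epsilon)
    right = simplify(points[idx:], epsilon)
    return left[:-1] + right
```

For example, a jittery horizontal swipe followed by a sharp upward turn, `[(0, 0), (1, 0.01), (2, -0.01), (3, 0), (3, 1), (3, 2)]`, simplifies to just `[(0, 0), (3, 0), (3, 2)]`: the jitter is discarded, the corner is kept.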
3. The Brain: The "Trajectory-Aware Visual Perception" (TVP) Module
This is the engine under the hood. Think of the AI's brain as having two friends talking to each other:
- The Visual Friend: "I see a picture of a room with a chair and a lamp."
- The Trajectory Friend: "But the user's finger just traced a loop around the chair!"
In older models, these two friends barely talked. In TraceVision, they have a two-way conversation (Bidirectional Fusion).
- The Trajectory Friend tells the Visual Friend: "Focus on the chair, ignore the lamp."
- The Visual Friend tells the Trajectory Friend: "Ah, that loop you drew? That's definitely a chair, not a table."
They keep refining each other's understanding until they agree on exactly what the user is looking at.
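The paper's TVP internals aren't reproduced here, but the "two-way conversation" is a description of bidirectional cross-attention: trajectory features query the image patches, and image patches query the trajectory. A minimal NumPy sketch of one fusion round, with all shapes and names chosen for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, others, scale):
    """Each query token gathers information from the other modality's tokens."""
    attn = softmax(queries @ others.T / scale)  # (n_queries, n_others)
    return queries + attn @ others              # residual update

rng = np.random.default_rng(0)
d = 16
visual = rng.standard_normal((49, d))  # e.g. a 7x7 grid of image-patch features
traj = rng.standard_normal((5, d))     # 5 simplified trajectory points
scale = np.sqrt(d)

# One round of bidirectional fusion: each side refines the other.
traj_refined = cross_attend(traj, visual, scale)    # "which patches does my path cover?"
visual_refined = cross_attend(visual, traj, scale)  # "which patches were traced over?"
```

In a real model the queries, keys, and values would pass through learned projections and this round would repeat across layers; the sketch only shows the two-directional information flow the analogy describes.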
4. The Training: "The 320,000-Student Classroom"
To teach the AI this skill, the researchers couldn't just use old textbooks. They built a new, massive classroom called RILN (Reasoning-based Interactive Localized Narratives).
- The Analogy: Imagine teaching a student to be a tour guide.
- Old Data: Just showing them a photo and a list of facts.
- RILN Data: Showing them a photo, a video of a tour guide's finger pointing at things, and a transcript of the guide explaining why they pointed there.
- They used super-smart AI (like GPT-4o) to generate 320,000 of these "pointing and explaining" examples. This taught TraceVision not just to see, but to reason about why someone is looking at something.
5. What Can It Do Now?
Because it understands the "finger trace," TraceVision can do things other AIs can't:
- The "Follow the Finger" Game: You draw a path on a picture, and it tells you exactly what objects you were looking at.
- The "Describe to Draw" Game: You name an object ("the red car"), and the AI draws the path your eyes would take to find it.
- The "Video Detective": It can watch a video and track how attention moves from frame to frame, understanding how a story unfolds over time.
- The "Precision Surgeon": It can cut out (segment) specific objects from a photo with extreme accuracy, guided by the path you drew.
Summary
TraceVision is like upgrading an AI from a tourist who takes a blurry photo of a whole city, to a local guide who walks beside you, points at specific buildings, and explains the story of the city based on exactly where you are looking. It bridges the gap between "what the computer sees" and "what the human is thinking."