RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

Imagine you are trying to find a specific friend in a crowded, foggy park at night. You have a description: "The person wearing a red jacket walking a golden retriever."

If you only have regular eyes (RGB cameras), the fog and darkness make it nearly impossible. You might see a blurry shape, but you can't tell if it's your friend or a stranger. You might lose them the moment they step behind a tree.

But what if you also had heat vision (thermal cameras)? Even in the pitch dark or thick smoke, you could still pick out the warm, glowing outline of your friend and their dog.

This paper introduces a new way to solve this problem using a combination of both "eyes" and "heat vision," guided by a plain-language description. Here is the breakdown in simple terms:

1. The New Game: "Find It in the Dark" (RT-RMOT)

The authors created a new challenge called RT-RMOT.

The Old Way: Previous computer programs could only use regular cameras. If it was night, rainy, or smoky, they got confused and stopped tracking.
The New Way: This system uses two cameras at once: a regular color camera and a heat-sensing (thermal) camera.
- The Color Camera sees the details (like the red jacket).
- The Thermal Camera sees the heat (the person's body), which stays visible even in the dark.
The Goal: You give the computer a sentence (e.g., "Find the person walking the dog"), and it must find and follow that specific target, day or night, in any weather.

2. The New Map: "RefRT" (The Dataset)

To teach computers how to do this, you need practice data. The authors built the RefRT dataset, which is like a massive training gym for AI.

It contains 166,000+ scenes of people and vehicles.
It includes 388 specific descriptions (like "The student running with a backpack").
Crucially, every single frame has both a color photo and a heat photo perfectly lined up, so the AI can learn how to combine them.
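To make the "perfectly lined up" idea concrete, here is a minimal sketch of what one training sample could look like. The class name `RefRTSample`, its fields, and the `validate` helper are assumptions made for illustration; they are not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class RefRTSample:
    """One hypothetical sample: a description plus an aligned frame pair."""
    description: str         # e.g. "The student running with a backpack"
    rgb_path: str            # color frame
    thermal_path: str        # heat frame, aligned with the color frame
    frame_index: int
    target_boxes: List[Box]  # boxes of the objects the description refers to
    target_ids: List[int]    # one track ID per box


def validate(sample: RefRTSample) -> bool:
    """Check the basic invariants this illustrative layout assumes."""
    if len(sample.target_boxes) != len(sample.target_ids):
        return False
    return all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in sample.target_boxes)
```

Because the two frames are aligned, a single box in `target_boxes` marks the same object in both the color image and the heat image.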

3. The Brain: "RTrack" (The Framework)

They built a smart system called RTrack to solve the problem. Think of it as a super-intelligent detective with three special tools:

Tool 1: The Multimodal Detective (MLLM): This is a "Multimodal Large Language Model" (like a super-smart chatbot) that can look at pictures and read your instructions. It understands that "red jacket" means a specific color and "walking" means a specific motion.
Tool 2: The Crystal Ball (Trajectory Prediction): If the target hides behind a tree, this tool guesses where they will pop out next based on how they were moving before. It's like a baseball player predicting where a fly ball will land.
Tool 3: The Matchmaker (Identity Association): When the target reappears, this tool makes sure the computer knows, "Yes, that is the same person I was tracking, not a new person."
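Tools 2 and 3 can be illustrated with a classic, much simpler recipe than the one in the paper: guess the next box by assuming constant velocity, then match guesses to new detections by box overlap (IoU). This is a generic textbook sketch, not RTrack's actual modules; the function names `predict_next` and `associate` and the `min_iou` threshold are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def predict_next(prev: Box, curr: Box) -> Box:
    """The 'crystal ball': shift the current box by the last observed displacement."""
    return (
        curr[0] + (curr[0] - prev[0]),
        curr[1] + (curr[1] - prev[1]),
        curr[2] + (curr[2] - prev[2]),
        curr[3] + (curr[3] - prev[3]),
    )


def associate(predicted: Dict[int, Box], detections: List[Box],
              min_iou: float = 0.3) -> Dict[int, int]:
    """The 'matchmaker': greedily pair each track ID with its best-overlapping detection."""
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in predicted.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    return matches
```

A track that moved five pixels to the right in the last frame is expected five pixels further right in the next one, and a detection near that spot keeps the same ID.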

4. The Training: "The Coach" (Reinforcement Learning)

Just giving the detective the tools isn't enough; they need to practice. The authors used a special training method called Reinforcement Learning (like training a dog with treats).

The Problem: Sometimes the AI gets too excited and makes wild guesses, or it gets scared and stops learning.
The Solution (GSPO & CAS): They created a "coach" that gives feedback.
- The "Clipped Advantage" (CAS): Imagine the AI gets a huge "TREAT" for a good guess. If the treat is too big, the dog goes crazy. This strategy "clips" the treat size so the AI stays calm and learns steadily without exploding into chaos.
- The "Structured Reward": The coach tells the AI, "Don't just guess randomly. Give me the answer in a neat box format." This forces the AI to be precise and organized.
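The "treat clipping" idea can be sketched in a few lines. The version below normalizes the rewards within a group of sampled answers and then caps the result; the exact formula used by CAS is not given here, so the function name `clipped_group_advantages` and the `clip` value are assumptions for illustration only.

```python
import statistics
from typing import List


def clipped_group_advantages(rewards: List[float], clip: float = 2.0,
                             eps: float = 1e-6) -> List[float]:
    """Turn raw rewards for one group of answers into bounded learning signals."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # How much better or worse each answer was than the group average.
    normalized = [(r - mean) / (std + eps) for r in rewards]
    # Cap the signal so one lucky answer cannot dominate the update.
    return [max(-clip, min(clip, a)) for a in normalized]
```

With nine answers scoring 0 and one scoring 10, the standout answer would get a signal of about 3 before clipping; the cap holds it at 2.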

5. The Result: "Super Vision"

When they tested this system, it was a huge success.

In normal light, it was good.
In total darkness, smoke, or rain, it was much better than any previous system.
It successfully tracked people and cars that other systems completely lost.

Summary Analogy

Think of the old systems as a blindfolded person trying to find a friend in a dark room. They can hear you say "Find the guy in the hat," but they can't see anything.

This new system is like giving that person night-vision goggles (Thermal) and flashlights (RGB), while also giving them a smart guide (the AI) who understands your description. Even if the room is filled with smoke, the guide knows where to look, and the goggles help them keep the target in sight.

In short: They built a new dataset and a smart AI framework that lets computers track objects from a plain-language description, even when it's pitch black or smoky, by combining the best of human sight (color) and thermal vision (heat).

* … tags and coordinates within … tags in a specific format `[x1, y1, x2, y2]`.
* Length Reward: Uses a sine-window function to regularize response length, preventing overly verbose or truncated outputs.
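A sine window over response length might look like the sketch below. The exact window used for the Length Reward is not stated here, so the half-sine shape, the function name `length_reward`, and the `min_len` and `max_len` bounds are all assumptions.

```python
import math


def length_reward(num_tokens: int, min_len: int = 32, max_len: int = 512) -> float:
    """Half-sine window: zero outside (min_len, max_len), peaking at the midpoint."""
    if num_tokens <= min_len or num_tokens >= max_len:
        return 0.0  # truncated or overly verbose responses earn nothing
    phase = (num_tokens - min_len) / (max_len - min_len)
    return math.sin(math.pi * phase)
```

A smooth window like this gives the model a gradual incentive to move toward a reasonable length rather than a hard pass/fail cutoff.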

Comprehensive Detection Reward ($R_{ctr}$):
- Output Encouragement Reward: Encourages the model to detect as many valid targets as possible (increasing recall).
- Precision Detection Reward: Rewards high IoU matches while penalizing excessive redundant predictions near high-IoU regions, balancing precision and recall.
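The two detection terms can be sketched as a recall-style term plus an IoU-weighted precision term with a duplicate penalty. This is an illustrative stand-in, not the paper's definition of $R_{ctr}$: the function name `detection_reward`, the `iou_thr` threshold, the `dup_penalty` weight, and the way the terms are summed are all assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_reward(preds: List[Box], gts: List[Box],
                     iou_thr: float = 0.5, dup_penalty: float = 0.1) -> float:
    """Hypothetical reward: fraction of targets found + mean IoU, minus duplicates."""
    if not preds or not gts:
        return 0.0
    matched_iou: Dict[int, float] = {}  # ground-truth index -> best IoU so far
    redundant = 0
    for p in preds:
        scores = [iou(p, g) for g in gts]
        best_gt = max(range(len(gts)), key=scores.__getitem__)
        if scores[best_gt] < iou_thr:
            continue  # a miss earns nothing
        if best_gt in matched_iou:
            redundant += 1  # extra box on an already-covered target
            matched_iou[best_gt] = max(matched_iou[best_gt], scores[best_gt])
        else:
            matched_iou[best_gt] = scores[best_gt]
    recall_term = len(matched_iou) / len(gts)
    precision_term = sum(matched_iou.values()) / len(preds) - dup_penalty * redundant
    return recall_term + max(0.0, precision_term)
```

Under these assumptions, two exact predictions for two targets score 2.0, while two copies of the same box on one target score lower because one target is missed and the duplicate is penalized.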

4. Experimental Results

The framework was evaluated on the RefRT test set against state-of-the-art (SOTA) methods, including traditional RMOT models (TransRMOT, TempRMOT) and RGB-T trackers combined with tracking modules (DeformCAT, Unismot, etc.).

Performance Metrics: The evaluation uses HOTA (Higher Order Tracking Accuracy), DetA (Detection Accuracy), AssA (Association Accuracy), and their respective Recall/Precision variants.
Key Findings:
- RTrack achieves SOTA performance: It significantly outperforms all baselines.
- Improvements: Compared to the second-best method, RTrack improves HOTA by 6.84%, DetA by 9.8%, and DetRe by 17.1%.
- Modality Impact: Experiments show that RGB-T input outperforms RGB-only input (HOTA increase of ~3% in trained models), which supports the value of thermal data for low-visibility tracking.
- Ablation Studies:
  - Removing the CAS strategy leads to unstable training and lower performance.
  - Removing Structured Output or Detection Rewards significantly degrades detection accuracy and tracking consistency.
  - Qwen2.5-VL (3B) was found to be a more effective base MLLM than larger models such as LLaVA-NeXT (8B), likely due to better modality fusion capabilities.
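For readers unfamiliar with the metrics reported above, the standard HOTA definitions relate them as follows: at a given localization threshold, DetA = TP / (TP + FN + FP), DetRe = TP / (TP + FN), DetPr = TP / (TP + FP), and HOTA is the geometric mean of DetA and AssA (the final score averages over thresholds). The helper names below are illustrative, and the example counts are made up, not results from the paper.

```python
import math
from typing import Dict


def det_metrics(tp: int, fn: int, fp: int) -> Dict[str, float]:
    """Detection accuracy, recall, and precision from match counts."""
    def safe_div(num: float, den: float) -> float:
        return num / den if den > 0 else 0.0
    return {
        "DetA": safe_div(tp, tp + fn + fp),
        "DetRe": safe_div(tp, tp + fn),
        "DetPr": safe_div(tp, tp + fp),
    }


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)
```

Because DetRe ignores false positives, a method can raise DetRe substantially by finding more of the referred targets even if DetA moves less.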

5. Significance

This work advances Referring Multi-Object Tracking by:

Expanding the Operational Domain: Moving RMOT from daylight-only applications to all-day, all-weather scenarios by integrating thermal sensing.
Bridging the Data Gap: Providing the RefRT dataset, which is essential for training and benchmarking multimodal tracking systems in low-visibility environments.
Advancing MLLM Application: Demonstrating how Reinforcement Learning with specialized reward shaping (CAS, structured rewards) can effectively fine-tune large multimodal models for precise, structured computer vision tasks like object tracking, overcoming the instability often associated with RL fine-tuning.

The proposed RTrack framework sets a new baseline for robust, language-guided tracking, offering a solution for critical applications such as autonomous driving, surveillance, and search-and-rescue operations in challenging environments.
