AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

The paper proposes AR2-4FV, a framework for long-term language-guided referring in fixed-view videos. It leverages an Anchor Bank derived from the static background together with a ReID-Gating mechanism to maintain identity continuity and speed up re-capture during occlusions or scene exits, significantly outperforming existing baselines in both re-capture rate and latency.

Teng Yan, Yihan Liu, Jiongxu Chen, Teng Wang, Jiaqi Li, Bingzhuo Zhong

Published 2026-03-10

Imagine you are a security guard watching a live feed from a single, fixed camera pointed at a busy lobby. Your job is to find a specific person described by a text note: "The man in the gray jacket standing near the pillar."

In a normal video, this is easy. But in the real world, things get tricky:

  1. The man walks behind a large potted plant and disappears for 30 seconds.
  2. He walks out of the camera's view entirely to go to the bathroom.
  3. He comes back 2 minutes later, but now he's wearing a different hat, and the lighting has changed because the sun moved.

Most computer systems would get confused here. They would say, "I lost him!" or "Is that a new person?" because they rely too much on what the person looks like at that exact second.

This paper introduces AR2-4FV, a new system designed specifically for these "fixed camera" scenarios. Think of it as a super-smart guard who doesn't just look at the person, but memorizes the room itself.

Here is how it works, broken down into simple concepts:

1. The "Anchor Bank": Memorizing the Room, Not the Person

Instead of trying to remember the man's face (which changes when he puts on a hat or the light shifts), the system first studies the room.

  • The Analogy: Imagine the room has invisible "sticky notes" on the walls, floor, and pillars. These notes say, "This is the pillar," "This is the door," "This is the fountain."
  • How it helps: When the system hears "The man near the pillar," it doesn't just look for a man; it locks onto the "pillar" sticky note. Even if the man disappears, the system knows exactly where to look because the pillar never moves. This is called the Anchor Bank.
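The Anchor Bank idea can be sketched in a few lines. This is a hypothetical toy version, not the paper's implementation: the real system presumably uses learned visual features, whereas here anchors are just named regions with fixed pixel coordinates, and `lookup` does naive substring matching on the referring expression.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    """A static landmark in the fixed camera view."""
    name: str      # e.g. "pillar"
    center: tuple  # (x, y) pixel coordinates; fixed because the camera never moves
    radius: float  # rough spatial extent of the landmark

class AnchorBank:
    """Registers landmarks once, at startup, from the static background."""
    def __init__(self):
        self._anchors = {}

    def register(self, anchor: Anchor):
        self._anchors[anchor.name] = anchor

    def lookup(self, query_text: str):
        """Return anchors whose name appears in the referring expression."""
        return [a for name, a in self._anchors.items() if name in query_text.lower()]

bank = AnchorBank()
bank.register(Anchor("pillar", center=(420, 310), radius=60.0))
bank.register(Anchor("door", center=(120, 280), radius=50.0))

matches = bank.lookup("The man in the gray jacket standing near the pillar")
print([a.name for a in matches])  # -> ['pillar']
```

The key property this illustrates: because the camera is fixed, an anchor's location is stored once and never needs re-detection, so the language query can resolve to a stable region of the frame.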

2. The "Anchor Map": The Persistent Memory

When the man walks away, the system doesn't panic. It creates a mental map called the Anchor Map.

  • The Analogy: It's like a GPS that keeps the "You are here" dot blinking on the pillar even when the car (the person) is gone.
  • How it helps: While the man is gone, the system keeps the "search zone" active around the pillar. It knows, "He isn't here right now, but he will be back near this specific spot." This prevents the system from drifting and looking at the wrong part of the room.
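The "search zone stays put" behavior can be sketched as follows. Again a hypothetical illustration with made-up coordinates: the point is that losing the target changes the system's state but not its search region, which stays pinned to the static anchor instead of widening to the whole frame.

```python
class AnchorMap:
    """Keeps the search zone pinned to the last-seen landmark."""
    def __init__(self, anchor_center, anchor_radius):
        self.anchor_center = anchor_center  # static landmark location (never drifts)
        self.anchor_radius = anchor_radius
        self.target_visible = True

    def mark_lost(self):
        """Target left the view: remember that, but keep the zone active."""
        self.target_visible = False

    def search_zone(self):
        """Bounding box to scan next frame, whether or not the target is visible."""
        cx, cy = self.anchor_center
        r = self.anchor_radius
        return (cx - r, cy - r, cx + r, cy + r)

amap = AnchorMap(anchor_center=(420, 310), anchor_radius=80)
amap.mark_lost()
print(amap.search_zone())  # -> (340, 230, 500, 390): still centered on the pillar
```

This is what prevents drift: a tracker that re-centers its search on its own (possibly wrong) last prediction can wander, while a zone tied to an immovable landmark cannot.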

3. The "Re-Entry Prior": The "He's Coming Back" Instinct

When the man finally walks back into the frame, he might look different. A normal system might think, "That's a stranger."

  • The Analogy: Imagine you are waiting for a friend at a bus stop. Even if they are wearing a coat you've never seen before, you know they are coming from the direction of the bus. You don't scan the whole city; you focus on the bus stop.
  • How it helps: The system uses the Re-Entry Prior. It knows the man was last seen near the pillar, so when someone walks into the frame near the pillar, the system gives them a "second chance" to be identified, rather than immediately rejecting them as a stranger.
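One simple way to express a "second chance" spatial prior is to boost a candidate's match score the closer it re-enters to the last-seen location. The Gaussian form below is an assumption made for illustration; the paper's actual prior may be learned or parameterized differently.

```python
import math

def reentry_score(candidate_xy, last_seen_xy, base_similarity, sigma=100.0):
    """Hypothetical re-entry prior: weight appearance similarity by how close
    the candidate appears to the target's last-seen anchor location."""
    dx = candidate_xy[0] - last_seen_xy[0]
    dy = candidate_xy[1] - last_seen_xy[1]
    dist = math.hypot(dx, dy)
    spatial_prior = math.exp(-(dist ** 2) / (2 * sigma ** 2))  # ~1.0 right at the anchor
    return base_similarity * (1.0 + spatial_prior)             # the "second chance" boost

near = reentry_score((430, 320), (420, 310), base_similarity=0.5)
far  = reentry_score((50, 50),   (420, 310), base_similarity=0.5)
print(near > far)  # -> True: same appearance score, but entering near the pillar wins
```

With identical appearance scores, the candidate who reappears near the pillar outranks one appearing across the room, which is exactly the bus-stop intuition above.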

4. The "ReID-Gating": The Final ID Check

Once the system spots a candidate near the pillar, it does a quick check to make sure it's really the right person.

  • The Analogy: It's like a bouncer at a club. The bouncer sees a guy near the door (the anchor). He checks the guy's ID (appearance) and asks, "Did you just walk in from the street, or did you come from the VIP lounge?"
  • How it helps: The system checks three things:
    1. Does he look like the man? (Appearance)
    2. Is he near the pillar? (Anchor evidence)
    3. Is the distance he moved plausible? (Displacement check)
  If the answer is "Yes" to all three, the system says, "Got him!" and updates its memory.
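The three checks above combine naturally into a conjunctive gate. This sketch uses invented threshold values purely for illustration; the paper does not publish these numbers here, and the real gate may weight or learn the three signals rather than AND-ing hard thresholds.

```python
def reid_gate(appearance_sim, dist_to_anchor, displacement,
              sim_thresh=0.6, anchor_radius=80.0, max_displacement=150.0):
    """Hypothetical ReID gate: accept a candidate only if all three
    checks from the text pass. Thresholds are illustrative, not from the paper."""
    looks_right = appearance_sim >= sim_thresh         # 1. appearance
    near_anchor = dist_to_anchor <= anchor_radius      # 2. anchor evidence
    plausible_move = displacement <= max_displacement  # 3. displacement check
    return looks_right and near_anchor and plausible_move

# Candidate near the pillar, similar appearance, plausible motion: accepted.
print(reid_gate(0.72, dist_to_anchor=40.0, displacement=90.0))   # -> True
# A lookalike across the room is rejected despite higher appearance similarity.
print(reid_gate(0.80, dist_to_anchor=300.0, displacement=400.0)) # -> False
```

The design point is that appearance alone never makes the decision: a strong lookalike in the wrong place fails the gate, which is what protects the system against hat changes and lighting shifts biasing it toward the wrong person.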

Why is this a big deal?

Previous systems were like a person trying to find a friend in a crowd by only looking at their face. If the friend turns around or puts on sunglasses, you lose them.

AR2-4FV is like a person who knows the layout of the building. They know their friend usually hangs out near the coffee machine. Even if the friend disappears for an hour, the system knows exactly where to wait. When the friend returns, the system is ready to spot them immediately, even if they look slightly different.

The Results:
The paper tested this on a new dataset (AR2-4FV-Bench) filled with videos where people disappear and reappear. The system was 10% better at finding people after they returned and 24% faster at spotting them again compared to the best existing methods.

In short: It's a system that stops trying to memorize the person and starts memorizing the place, making it incredibly good at tracking people in fixed cameras over long periods.