Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals

In collaboration with six police agencies, the authors built a multi-robot video sensemaking testbed and a prototype tool (MRVS) that integrates LLM-based explanations to reduce operator workload and increase operator confidence; the paper also distills key design requirements and concerns about false alarms and privacy.

Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal, Sameep Shrestha, Chia-wei Tang, Michael F. Lighthiser, Michael R. Hieb, Xuesu Xiao, Chris Thomas, Sungsoo Ray Hong

Published 2026-02-17

Imagine you are the captain of a ship, but instead of one lookout, you have a fleet of 10 robotic scouts patrolling the ocean around you. Each robot is streaming live video back to your bridge. Your job is to spot a shark, a storm, or a pirate ship.

The problem? If you try to watch 10 video feeds at once, your brain will explode. You'll miss the shark because you were staring at a seagull. This is exactly the problem public safety professionals (like police officers) are facing today. They are understaffed, overworked, and drowning in video feeds from drones, body cameras, and now, ground robots.

This paper is about building a "Super-Pilot's Dashboard" to help them manage this flood of information without losing their minds.

Here is the story of how they built it, broken down into simple steps:

1. The Problem: The "Juggling Act"

Police officers currently have to manually scan hours of video to find a few seconds of something important. It's like trying to find a specific needle in a haystack by looking at every single piece of hay one by one.

  • The Risk: If they miss something, people could get hurt.
  • The Burden: If they watch too much, they get exhausted and make mistakes.
  • The Gap: Existing robots (like the one in New York City) were retired because they were too hard to use and didn't fit into how police actually work.

2. Study 1: Asking the Experts (The "Recipe" Phase)

Before building anything, the researchers went to the "kitchen" and asked 5 experienced police officers what they actually needed. They didn't just guess; they asked: "What should the robot look for?" and "How do you solve a case?"

They discovered three big things:

  • Context is King: A person sitting on a bench isn't suspicious at a park, but it is suspicious in a bank lobby at 3 AM. The system needs to understand the "recipe" of the situation, not just the visual.
  • The "Needle" Problem: Officers don't need a list of every single thing that happened; they need a summary that says, "Hey, check this out. A guy in a red hoodie dropped a bag here."
  • Teamwork: Police work in shifts. If Officer A finds a clue, Officer B needs to know about it instantly without reading a 50-page email chain.

The Result: They created a list of 38 "Events of Interest" (like "suspicious loitering," "assault," or "abandoned bag") and 6 Rules for how the software should behave.
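A taxonomy like this is typically turned into structured records that a detection pipeline can consume. Here is a minimal sketch of what that encoding could look like; only the three events named above are included, and the field names and priority levels are hypothetical, not from the paper.

```python
# Hedged sketch: encoding "Events of Interest" as structured records.
# Fields and priority levels are illustrative assumptions, not the
# paper's actual schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class EventOfInterest:
    name: str
    description: str
    priority: str  # hypothetical urgency level: "low" | "medium" | "high"


TAXONOMY = [
    EventOfInterest(
        "suspicious_loitering",
        "person lingers in a context where lingering is unusual",
        "medium",
    ),
    EventOfInterest(
        "assault",
        "physical altercation between two or more people",
        "high",
    ),
    EventOfInterest(
        "abandoned_bag",
        "bag left behind after its carrier walks away",
        "high",
    ),
]

# A dashboard could surface high-priority events first:
high_priority = [e.name for e in TAXONOMY if e.priority == "high"]
```

Keeping the taxonomy as data rather than hard-coded logic is what lets officers add or reprioritize events without touching the detection code.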

3. The Testbed: The "Training Gym"

To test their ideas, they needed data. But you can't wait for a real crime to happen to test your software.

  • The Solution: They turned a university campus into a giant movie set.
  • The Cast: 22 student actors.
  • The Script: They acted out 38 different scenarios (like a fight, a car crash, or someone stealing a bike) while a ground robot patrolled around them.
  • The Dataset: They recorded 20 hours of video (day and night) with these actors. This became the "gym" where they trained their AI.

4. The Solution: MRVS (The "Magic Dashboard")

They built a system called MRVS (Multi-Robot Video Sensemaking System). Think of it as a smart, interactive map that does the heavy lifting for the officer.

How it works (The Magic Features):

  • The "Highlight Reel" (Video Debrief): Instead of watching 30 minutes of a robot walking down a street, the AI watches it for you. It cuts the video into "chapters" and gives you a summary card: "At 2:00 PM, a person dropped a bag. Confidence: High." You can click the card and jump straight to that second.
  • The "Big Picture" (Situational Overview): It shows all 10 robots on one map. If Robot A sees a fight and Robot B sees a car crash, they both pop up on the map with color-coded urgency. You can see the whole story at a glance.
  • The "Detective's Search" (Descriptor Search): This is the coolest part. You don't need a face or a license plate. You can type: "Find me a person wearing a blue jacket and red shoes." The system scans all the videos and shows you the matches. It's like using Google Images, but for video evidence.
  • The "Shared Notebook" (Collaboration): If an officer finds a clue, they can tag it and share it with the whole team. The next shift can pick up right where the last one left off, like a shared Google Doc for police work.
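The Descriptor Search feature boils down to ranking video segments by how well they match a free-text query. Production systems do this with learned joint text-video embeddings (CLIP-style encoders); the sketch below substitutes a toy bag-of-words embedding so it runs standalone. The segment captions are hypothetical examples, not data from the paper.

```python
# Hedged sketch of descriptor search: rank video segments against a text
# query by embedding similarity. A toy word-count "embedding" stands in
# for the learned text-video encoder a real system would use.
from collections import Counter
import math


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def descriptor_search(query: str, segments: dict[str, str], top_k: int = 2):
    """Return the top_k (segment_id, score) pairs most similar to the query."""
    q = embed(query)
    scored = [(sid, cosine(q, embed(cap))) for sid, cap in segments.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]


# Captions a vision-language model might attach to robot-video segments
# (hypothetical data for illustration).
segments = {
    "robot1_14:02": "person in blue jacket and red shoes walking north",
    "robot3_14:05": "white sedan parked near the loading dock",
    "robot2_14:07": "person in blue jacket leaving a bag on a bench",
}

hits = descriptor_search("person wearing a blue jacket and red shoes", segments)
# The best match should be the segment mentioning both the jacket and shoes.
```

The design choice that matters here is scoring every segment against the query rather than requiring exact keywords, which is what lets "blue jacket and red shoes" surface footage no one tagged in advance.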

5. The Results: Does it Work?

They tested this with 9 real police professionals.

  • The Good News: The officers loved it. They said it saved them hours of work. They felt more confident because the AI explained why it flagged something (e.g., "I flagged this because the person ran away from the camera").
  • The Bad News: The AI isn't perfect. Sometimes it gets it wrong (false alarms), and sometimes it misses small details (like something happening in the corner of the screen).
  • The Verdict: The officers said, "Don't let the AI make the final call. Let it be our assistant that points us in the right direction, but we will make the final decision."

6. The Big Picture: Why This Matters

This paper isn't just about cool robots; it's about trust and safety.

  • For the Police: It stops them from burning out. It lets one officer supervise 10 robots effectively, which is crucial because police departments are short-staffed.
  • For the Public: It means faster responses to real emergencies and less time wasted on false alarms.
  • The Catch: The researchers warn that we have to be careful. If we use this to watch everyone all the time, it invades privacy. The system needs to be transparent, so people know why a robot is looking at them, and the data must be protected.

Summary Analogy

Imagine you are a teacher with 30 students.

  • Old Way: You have to watch a live video feed of every single student for 8 hours a day to see who is misbehaving. You will get a headache and miss the kid who is actually in trouble.
  • New Way (MRVS): You have a smart assistant. The assistant watches the feeds, highlights the kid who is throwing a paper airplane, and puts a sticky note on your desk saying, "Check this out, it looks like a fight in the back row." You still make the decision to intervene, but you don't have to stare at the screen all day.

This paper shows us how to build that smart assistant for police, making the world safer for everyone without overwhelming the people trying to keep us safe.
