FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

FLoC is a training-free, model-agnostic framework that leverages the facility location function and a lazy greedy algorithm to efficiently select a compact, diverse subset of visual tokens for long video understanding, significantly reducing computational costs while maintaining near-optimal performance across diverse benchmarks.

Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

Published 2026-03-06

Imagine you are trying to explain a three-hour movie to a friend, but you only have one minute to do it.

If you try to describe every single frame, you'll run out of time before you even get to the plot. If you just pick random moments, you might miss the villain's face or the crucial clue. If you just summarize the "boring" parts, you lose the excitement.

This is exactly the problem computers face when trying to understand long videos (like security footage, lectures, or home movies).

The Problem: Too Much Data, Too Little Brainpower

Modern AI models (called Large Multimodal Models) are like super-smart detectives. They can look at a video and answer questions like, "What is the person wearing?" or "What happened in the middle?"

However, to "see" a video, the AI breaks it down into millions of tiny pieces called visual tokens.

  • The Issue: A long video generates so many tokens that the AI's "brain" (memory) gets overwhelmed. It's like trying to read a 10,000-page book in one sitting; the AI gets tired, forgets things, or simply crashes.
  • The Current Fix: Most methods try to solve this by either:
    1. Skipping pages: Randomly deleting frames (like skipping every 10th page). Risk: You might miss the plot twist.
    2. Grouping similar pages: Clustering similar scenes together. Risk: If a rare, important event happens (like a key falling on the floor), it might get lumped in with "boring background" and deleted.

The Solution: FLoC (The "Smart Librarian")

The authors propose a new method called FLoC (Facility Location-based Efficient Visual Token Compression).

Think of the video as a massive library with thousands of books (tokens). The AI only has room to read 10 books (the budget). How do you pick the 10 books that tell the entire story?

1. The "Facility Location" Concept

Imagine you are opening a chain of coffee shops in a new city. You have a budget to open only 5 shops.

  • Goal: You want to place them so that every resident in the city is close to a shop, but you don't want two shops right next to each other (wasting money).
  • The Strategy: You don't just pick random spots. You pick spots that cover the most ground while ensuring diversity. You pick one in the north, one in the south, one in the busy downtown, one in the quiet suburbs, etc.

FLoC does this with video tokens:

  • It looks at all the visual "moments" in the video.
  • It selects a small group of moments that represent the whole video (like the coffee shops covering the city).
  • Crucially, it ensures it doesn't just pick 5 shots of the same boring wall. It picks the wall, the person walking by, the car driving past, and the rare moment the dog barks. It balances representativeness (covering the main story) with diversity (catching the rare details).
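The coffee-shop idea above corresponds to maximizing a facility location function: each video moment should be close to at least one selected token, so the score of a selection S is the sum, over all moments, of each moment's similarity to its nearest pick. Here is a minimal sketch of that objective with plain greedy selection, assuming cosine similarity between token embeddings; the function and variable names are illustrative, not the paper's actual code.

```python
import numpy as np

def facility_location_greedy(sim, budget):
    """Greedily pick `budget` tokens maximizing the facility location
    objective: f(S) = sum_i max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    selected = []
    # best_cover[i] = similarity of moment i to its closest selected token
    best_cover = np.zeros(n)
    for _ in range(budget):
        # marginal gain of adding each candidate j: how much extra
        # coverage it brings beyond what's already covered
        gains = np.maximum(sim - best_cover[:, None], 0).sum(axis=0)
        gains[selected] = -1.0  # never pick the same token twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Toy example: 6 "moments" embedded in 2-D, cosine similarity
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = X @ X.T
print(facility_location_greedy(sim, 2))
```

Note how the objective itself enforces the representativeness-plus-diversity balance: a second token near an already-picked one adds almost no marginal coverage, so the greedy step naturally reaches for moments in uncovered "neighborhoods" of the video.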

2. The "Lazy Greedy" Trick (The Speed Boost)

Usually, finding the perfect 5 coffee shop locations is a math nightmare that takes forever to calculate.

  • The Old Way: Check every possible combination of 5 shops. (Takes hours).
  • FLoC's Way (Lazy Greedy): It uses a clever shortcut. It picks the best spot first. For later picks, it leans on a handy math fact: as you open more shops, each additional shop can only help less than it would have earlier (this property is called diminishing returns, or submodularity). That means every old score is a guaranteed upper bound on the true one, so FLoC only re-checks the candidate currently at the top of the list; if its refreshed score still beats everyone else's stale score, it gets picked without re-evaluating the rest.
  • The Result: It finds a near-perfect selection in a fraction of a second. It's like a librarian who can instantly scan a shelf and grab the 10 most important books without reading the whole library first.
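The bookkeeping behind that "lazy" shortcut is usually a max-heap of cached gains, each stamped with the round it was computed in. A stale entry is only recomputed if it bubbles to the top; a fresh entry at the top must be the true winner, because every other cached gain is an upper bound. A minimal sketch under the same cosine-similarity assumption as before (illustrative names, not the paper's code):

```python
import heapq
import numpy as np

def lazy_greedy(sim, budget):
    """Lazy greedy for facility location: cached gains are upper
    bounds (diminishing returns), so only the top item is re-scored."""
    n = sim.shape[0]
    best_cover = np.zeros(n)  # coverage from already-picked tokens
    # max-heap entries: (-cached_gain, candidate, round_when_scored)
    heap = [(-sim[:, j].sum(), j, 0) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    for t in range(1, budget + 1):
        while True:
            neg_gain, j, scored_at = heapq.heappop(heap)
            if scored_at == t:
                # gain is fresh this round -> it's the true maximum
                selected.append(j)
                best_cover = np.maximum(best_cover, sim[:, j])
                break
            # stale: recompute this one candidate's gain and push it back
            gain = np.maximum(sim[:, j] - best_cover, 0).sum()
            heapq.heappush(heap, (-gain, j, t))
    return selected

# Toy example: 8 "moments" in 3-D, cosine similarity
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = X @ X.T
print(lazy_greedy(S, 3))
```

In practice most rounds re-score only a handful of candidates instead of all of them, which is why the selection feels near-instant even for long videos, while the picks are identical to plain greedy's.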

Why This Matters in Real Life

Because FLoC is training-free (it doesn't need to be taught how to do this) and plug-and-play (it works with any existing AI), it's a game-changer for:

  • Security Cameras: Instead of storing terabytes of footage, the AI can instantly compress days of footage into a "highlight reel" of important events, saving massive storage space.
  • Smart Glasses: If you wear glasses that record your day, FLoC allows the AI on your phone to understand what you saw without draining your battery or needing a supercomputer in the cloud.
  • Robotics: A robot navigating a warehouse can process hours of video in real-time to find a specific item, rather than getting stuck trying to process every single pixel.

The Bottom Line

FLoC is like a master editor. Instead of randomly cutting a movie or just keeping the "average" scenes, it intelligently selects the specific frames that tell the whole story, ensuring no important detail is lost, all while doing it incredibly fast. It allows AI to finally "watch" long videos without getting a headache.