FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

FLoC is a training-free, model-agnostic framework that leverages the facility location function and a lazy greedy algorithm to efficiently select a compact, diverse subset of visual tokens for long video understanding, significantly reducing computational costs while maintaining near-optimal performance across diverse benchmarks.

Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

Published 2026-03-06

Imagine you are trying to explain a three-hour movie to a friend, but you only have one minute to do it.

If you try to describe every single frame, you'll run out of time before you even get to the plot. If you just pick random moments, you might miss the villain's face or the crucial clue. If you just summarize the "boring" parts, you lose the excitement.

This is exactly the problem computers face when trying to understand long videos (like security footage, lectures, or home movies).

The Problem: Too Much Data, Too Little Brainpower

Modern AI models (called Large Multimodal Models) are like super-smart detectives. They can look at a video and answer questions like, "What is the person wearing?" or "What happened in the middle?"

However, to "see" a video, the AI breaks it down into millions of tiny pieces called visual tokens.

  • The Issue: A long video generates so many tokens that the AI's "brain" (memory) gets overwhelmed. It's like trying to read a 10,000-page book in one sitting; the AI gets tired, forgets things, or simply crashes.
  • The Current Fix: Most methods try to solve this by either:
    1. Skipping pages: Randomly deleting frames (like skipping every 10th page). Risk: You might miss the plot twist.
    2. Grouping similar pages: Clustering similar scenes together. Risk: If a rare, important event happens (like a key falling on the floor), it might get lumped in with "boring background" and deleted.

The Solution: FLoC (The "Smart Librarian")

The authors propose a new method called FLoC (Facility Location-based Efficient Visual Token Compression).

Think of the video as a massive library with thousands of books (tokens). The AI only has room to read 10 books (the budget). How do you pick the 10 books that tell the entire story?

1. The "Facility Location" Concept

Imagine you are opening a chain of coffee shops in a new city. You have a budget to open only 5 shops.

  • Goal: You want to place them so that every resident in the city is close to a shop, but you don't want two shops right next to each other (wasting money).
  • The Strategy: You don't just pick random spots. You pick spots that cover the most ground while ensuring diversity. You pick one in the north, one in the south, one in the busy downtown, one in the quiet suburbs, etc.

FLoC does this with video tokens:

  • It looks at all the visual "moments" in the video.
  • It selects a small group of moments that represent the whole video (like the coffee shops covering the city).
  • Crucially, it ensures it doesn't just pick 5 shots of the same boring wall. It picks the wall, the person walking by, the car driving past, and the rare moment the dog barks. It balances representativeness (covering the main story) with diversity (catching the rare details).
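The coffee-shop idea above corresponds to maximizing a facility location function: each video moment should be close to at least one selected token, so the score of a selection S is the sum, over all moments, of each moment's similarity to its nearest pick. Here is a minimal sketch of that objective with plain greedy selection, assuming cosine similarity between token embeddings; the function and variable names are illustrative, not the paper's actual code.

```python
import numpy as np

def facility_location_greedy(sim, budget):
    """Greedily pick `budget` tokens maximizing the facility location
    objective: f(S) = sum_i max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    selected = []
    # best_cover[i] = similarity of moment i to its closest selected token
    best_cover = np.zeros(n)
    for _ in range(budget):
        # marginal gain of adding each candidate j: how much extra
        # coverage it brings beyond what's already covered
        gains = np.maximum(sim - best_cover[:, None], 0).sum(axis=0)
        gains[selected] = -1.0  # never pick the same token twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Toy example: 6 "moments" embedded in 2-D, cosine similarity
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = X @ X.T
print(facility_location_greedy(sim, 2))
```

Note how the objective itself enforces the representativeness-plus-diversity balance: a second token near an already-picked one adds almost no marginal coverage, so the greedy step naturally reaches for moments in uncovered "neighborhoods" of the video.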

2. The "Lazy Greedy" Trick (The Speed Boost)

Usually, finding the perfect 5 coffee shop locations is a math nightmare that takes forever to calculate.

  • The Old Way: Check every possible combination of 5 shops. (Takes hours).
  • FLoC's Way (Lazy Greedy): It uses a clever shortcut. It picks the best spot first. For later picks, it leans on a handy math fact: as you open more shops, each additional shop can only help less than it would have earlier (this property is called diminishing returns, or submodularity). That means every old score is a guaranteed upper bound on the true one, so FLoC only re-checks the candidate currently at the top of the list; if its refreshed score still beats everyone else's stale score, it gets picked without re-evaluating the rest.
  • The Result: It finds a near-perfect selection in a fraction of a second. It's like a librarian who can instantly scan a shelf and grab the 10 most important books without reading the whole library first.
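The bookkeeping behind that "lazy" shortcut is usually a max-heap of cached gains, each stamped with the round it was computed in. A stale entry is only recomputed if it bubbles to the top; a fresh entry at the top must be the true winner, because every other cached gain is an upper bound. A minimal sketch under the same cosine-similarity assumption as before (illustrative names, not the paper's code):

```python
import heapq
import numpy as np

def lazy_greedy(sim, budget):
    """Lazy greedy for facility location: cached gains are upper
    bounds (diminishing returns), so only the top item is re-scored."""
    n = sim.shape[0]
    best_cover = np.zeros(n)  # coverage from already-picked tokens
    # max-heap entries: (-cached_gain, candidate, round_when_scored)
    heap = [(-sim[:, j].sum(), j, 0) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    for t in range(1, budget + 1):
        while True:
            neg_gain, j, scored_at = heapq.heappop(heap)
            if scored_at == t:
                # gain is fresh this round -> it's the true maximum
                selected.append(j)
                best_cover = np.maximum(best_cover, sim[:, j])
                break
            # stale: recompute this one candidate's gain and push it back
            gain = np.maximum(sim[:, j] - best_cover, 0).sum()
            heapq.heappush(heap, (-gain, j, t))
    return selected

# Toy example: 8 "moments" in 3-D, cosine similarity
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = X @ X.T
print(lazy_greedy(S, 3))
```

In practice most rounds re-score only a handful of candidates instead of all of them, which is why the selection feels near-instant even for long videos, while the picks are identical to plain greedy's.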

Why This Matters in Real Life

Because FLoC is training-free (it doesn't need to be taught how to do this) and plug-and-play (it works with any existing AI), it's a game-changer for:

  • Security Cameras: Instead of storing terabytes of footage, the AI can instantly compress days of footage into a "highlight reel" of important events, saving massive storage space.
  • Smart Glasses: If you wear glasses that record your day, FLoC allows the AI on your phone to understand what you saw without draining your battery or needing a supercomputer in the cloud.
  • Robotics: A robot navigating a warehouse can process hours of video in real-time to find a specific item, rather than getting stuck trying to process every single pixel.

The Bottom Line

FLoC is like a master editor. Instead of randomly cutting a movie or just keeping the "average" scenes, it intelligently selects the specific frames that tell the whole story, ensuring no important detail is lost, all while doing it incredibly fast. It allows AI to finally "watch" long videos without getting a headache.