EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use

Imagine you are trying to watch a movie that never ends. It's a continuous stream of video, like a live feed from a security camera or a self-driving car's view, that goes on forever. Now, imagine your brain (or in this case, a computer model) has a very small "working memory"—like a sticky note that can only hold a few sentences at a time.

This is the problem EventMemAgent tries to solve.

The Problem: The "Leaky Bucket" vs. The "Infinite River"

Most current AI models are like a leaky bucket. As the video stream (the river) flows in, the bucket fills up. Once it's full, the oldest water (the old video frames) spills out to make room for new water.

The Issue: If the bucket is small, you lose the context of what happened 10 minutes ago. If you try to keep the whole river in the bucket, the bucket breaks (the computer runs out of memory).
The Old Way: Previous methods just tried to squeeze more water into the bucket by ignoring some drops (pruning) or shuffling the water around. They were passive; they just watched the water flow and hoped they didn't forget the important stuff.

The Solution: The "Smart Librarian" (EventMemAgent)

The authors created EventMemAgent, which acts like a super-smart, active librarian managing a library of infinite books. Instead of just letting books pile up on a table, the librarian actively organizes them.

Here is how it works, broken down into three simple parts:

1. The Two-Drawer System (Hierarchical Memory)

The agent uses a two-layer memory system, like a desk and a filing cabinet.

The Desk (Short-Term Memory): This is your immediate workspace. It holds the current "scene" or Event.
- How it works: Instead of just grabbing random frames, the agent watches for boundaries. If a person is painting a rooster, that's one event. When they stop and start cooking an egg, that's a new event.
- The Trick: If an event is boring and repetitive (like a hand holding a brush for 30 seconds), the agent doesn't save every single second. It uses a technique called "Reservoir Sampling" (think of it as keeping a small, fixed-size handful of frames, chosen so that every moment of the 30 seconds has an equal chance of being kept) so it doesn't waste space.
The Filing Cabinet (Long-Term Memory): Once an event is finished (e.g., the painting is done), the agent doesn't throw it away. It files it away neatly.
- It creates a summary card (a caption), saves a key photo (visual anchor), and writes a change log (what happened next).
- This allows the agent to remember the story of the video forever, even if it can't keep all the raw video in its "desk" at once.

2. The Detective Toolkit (Multi-Granular Perception)

Sometimes, the summary card isn't enough. Maybe the question is, "Did the person break the egg?" or "What does the sign on the wall say?"

The Old Way: The AI would just guess based on the blurry summary.
EventMemAgent's Way: It acts like a detective. It has a toolkit with special magnifying glasses:
- Search Memory: "Hey, did we see anything about 'breaking' earlier?"
- OCR (Reading Glasses): "Let me zoom in and read that sign."
- Object Detection: "Let me scan that specific frame to find the broken egg."
- The agent actively decides which tool to use to get the exact evidence it needs, rather than just hoping the answer is in the general summary.

3. The "Learning by Doing" Coach (Agentic RL)

How does the agent know when to use the detective tools?

The Old Way: Humans had to write strict rules (prompts) telling the AI what to do. "If you see a question, look for X." This is rigid and often fails.
EventMemAgent's Way: The agent was trained using Reinforcement Learning (like training a dog with treats).
- It tried many different strategies: "Should I search the cabinet? Should I zoom in? Should I just guess?"
- When it got the answer right, it got a "treat" (reward). When it failed, it learned not to do that again.
- Over time, it internalized the skill. It learned to think, "Oh, this is a question about the past, so I should check the filing cabinet first," without anyone telling it to.

The Result

In tests, EventMemAgent was able to watch these "infinite" video streams and answer complex questions with high accuracy, while looking at only a small number of frames at a time (32 or fewer).

It didn't get overwhelmed by the endless video.
It didn't forget the beginning of the movie.
It knew exactly when to zoom in for details and when to look at the big picture.

In a nutshell: EventMemAgent is an AI that stops being a passive observer and starts being an active investigator. It organizes the chaos of an endless video into neat "stories," keeps a permanent archive of those stories, and knows exactly which tools to grab to solve a mystery.

1. Problem Statement

Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams (e.g., autonomous driving, surveillance). The core challenge is the conflict between the unbounded nature of streaming inputs and the finite context window of Multimodal Large Language Models (MLLMs).

Current approaches suffer from two main limitations:

Passive Processing: Most methods rely on passive strategies like token pruning, sliding windows, or fixed-length segmentation. These lead to information decay, semantic fragmentation (cutting continuous actions arbitrarily), and an inability to capture fine-grained details when processing vast numbers of tokens in a single pass.
Ineffective Memory & Tool Use: Existing agents often use simplistic memory modules or rely heavily on manual prompt engineering. They lack the ability to actively retrieve relevant historical information or iteratively perceive specific details, failing to internalize reasoning and tool-use strategies through end-to-end training.

2. Methodology: EventMemAgent

The authors propose EventMemAgent, an active agent framework that shifts from passive processing to proactive perception. The architecture consists of three core components:

A. Hierarchical Memory Module

To manage infinite streams within a fixed context, the framework employs a dual-layer memory strategy centered on events rather than fixed time intervals:

Short-Term Memory (STM):
- Event-Centric Segmentation: Instead of fixed time windows, the system dynamically detects event boundaries. It compares the grayscale histogram of a new frame against the current event's average histogram. If the correlation $\rho$ between the two drops below a threshold $\delta$ (that is, $\rho < \delta$), a new event is triggered.
- Event-Granular Reservoir Sampling: Within an active event, if the frame count exceeds the buffer capacity $K$, the system uses reservoir sampling to keep exactly $K$ frames. This gives the STM a uniform, unbiased sample of the event's frames, preventing a single long event from exhausting the context budget while maintaining semantic continuity.
Long-Term Memory (LTM):
- When an event is evicted from the STM, it is archived as a structured tuple: {Visual Anchor (first frame), Caption, Semantic Embedding, Change Log}.
- The Change Log records state transitions between consecutive events, preserving narrative continuity and preventing semantic fragmentation.
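The short-term memory logic described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: frames are assumed to be flat lists of 0-255 gray values, Pearson correlation stands in for the histogram correlation $\rho$, classic Algorithm R is assumed for the reservoir, and all names (`gray_histogram`, `EventBuffer`, `segment_stream`) and the defaults for `delta`, `capacity`, and `bins` are hypothetical.

```python
import random
from dataclasses import dataclass, field


def gray_histogram(frame, bins=16):
    """Normalized histogram of a frame given as a flat list of 0-255 gray values."""
    counts = [0] * bins
    for v in frame:
        counts[min(v * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def correlation(h1, h2):
    """Pearson correlation of two histograms (0.0 if either one is constant)."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    if var1 == 0 or var2 == 0:
        return 0.0
    return cov / (var1 * var2) ** 0.5


@dataclass
class EventBuffer:
    capacity: int                                # buffer capacity K
    frames: list = field(default_factory=list)   # sampled (frame_index, frame) pairs
    seen: int = 0                                # frames observed in this event
    mean_hist: list = None                       # running average histogram

    def add(self, index, frame, hist, rng):
        self.seen += 1
        if self.mean_hist is None:
            self.mean_hist = list(hist)
        else:
            # Incremental update of the event's average histogram.
            self.mean_hist = [m + (h - m) / self.seen
                              for m, h in zip(self.mean_hist, hist)]
        if len(self.frames) < self.capacity:
            self.frames.append((index, frame))
        else:
            # Algorithm R: the new frame replaces a random slot with
            # probability capacity / seen, so every frame is equally likely
            # to end up in the buffer.
            j = rng.randrange(self.seen)
            if j < self.capacity:
                self.frames[j] = (index, frame)


def segment_stream(frames, delta=0.5, capacity=4, seed=0):
    """Split a frame stream into events; each event keeps at most `capacity` frames."""
    rng = random.Random(seed)
    events, current = [], None
    for i, frame in enumerate(frames):
        hist = gray_histogram(frame)
        # rho < delta: the frame no longer resembles the current event.
        if current is not None and correlation(hist, current.mean_hist) < delta:
            events.append(current)
            current = None
        if current is None:
            current = EventBuffer(capacity)
        current.add(i, frame, hist, rng)
    if current is not None:
        events.append(current)
    return events
```

Because slots are overwritten in place, the sampled frames are not stored in temporal order; sorting `frames` by index before passing them to a model would restore it.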

B. Multi-Granular Perception Toolkit

The agent is equipped with a toolkit to actively capture evidence across different granularities:

Memory Search: Retrieves historical context via Temporal Retrieval (time-range filtering) or Semantic Retrieval (cosine similarity of embeddings).
Specialized Perception: Utilizes OCR for text extraction and Object Detection to localize specific entities.
Active Iteration: The agent can autonomously choose to apply these tools to visual anchors in the LTM or specific frames in the STM to gather precise evidence before answering.
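As a rough illustration of the two Memory Search modes, the sketch below stores archived events as records and queries them by time range or by cosine similarity. It is an assumption-laden example rather than the paper's code: the `EventRecord` fields mirror the tuple described for the LTM, but the `start` and `end` timestamps, the function names, and the `top_k` parameter are hypothetical, and embeddings are plain lists of floats.

```python
import math
from dataclasses import dataclass


@dataclass
class EventRecord:
    start: float        # event start time in seconds (assumed field)
    end: float          # event end time in seconds (assumed field)
    anchor_frame: int   # index of the visual anchor (the event's first frame)
    caption: str
    embedding: list     # semantic embedding of the event
    change_log: str     # state transition relative to the neighboring event


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def temporal_search(ltm, t_start, t_end):
    """Temporal retrieval: events whose time span overlaps [t_start, t_end]."""
    return [e for e in ltm if e.end >= t_start and e.start <= t_end]


def semantic_search(ltm, query_embedding, top_k=1):
    """Semantic retrieval: the top_k events most similar to the query embedding."""
    ranked = sorted(ltm,
                    key=lambda e: cosine(e.embedding, query_embedding),
                    reverse=True)
    return ranked[:top_k]
```

A retrieved record's `anchor_frame` is what a follow-up OCR or object-detection call would then inspect.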

C. Agentic Reinforcement Learning (Agentic RL)

To move beyond manual prompt engineering, the framework uses Group Relative Policy Optimization (GRPO) to train the agent end to end.

Objective: The agent learns to decompose tasks, select appropriate tools, and reason iteratively.
Reward Signal: Training is guided solely by the correctness of the final answer (binary reward), forcing the model to internalize effective reasoning paths and tool-invocation strategies without relying on a separate critic model.
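The group-relative idea behind GRPO can be sketched in a few lines: several rollouts are sampled for the same question, each receives a binary reward, and each rollout's advantage is its reward normalized against the group rather than against a critic's estimate. The snippet is a simplified illustration under stated assumptions: exact string matching for correctness, population standard deviation for normalization, and the names `binary_reward` and `group_relative_advantages` are all choices made here, not details confirmed by the source.

```python
from statistics import mean, pstdev


def binary_reward(predicted, gold):
    """1.0 if the final answer matches the reference, else 0.0 (assumed matching rule)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward,
    # in which case all advantages are zero and the group gives no signal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With four rollouts of which two are correct, the correct ones get advantages near +1 and the incorrect ones near -1, so the tool-use trajectories that led to right answers are reinforced relative to their siblings.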

3. Key Contributions

Active Agentic Framework: Proposes a paradigm shift from passive information processing to an active agent that decomposes tasks and iteratively retrieves/perceives relevant information.
Hierarchical Event-Centric Memory: Introduces a novel memory architecture combining online event segmentation and reservoir sampling for the STM, and structured event-tuple archiving for the LTM. This addresses the trade-off between long-range context and fine-grained detail.
Agentic RL Integration: Successfully internalizes reasoning and tool-use strategies into the agent's intrinsic capabilities via end-to-end RL, eliminating the need for rigid manual prompt engineering.
Efficiency: Demonstrates that high-performance online video understanding can be achieved with limited input frames (e.g., $\le$ 32 frames) and fixed hardware constraints.

4. Experimental Results

The model was evaluated on two major benchmarks: OVO-Bench and StreamingBench.

Performance on OVO-Bench:
- EventMemAgent (8B parameters, $\le$ 32 frames) achieved an overall accuracy of 60.75%.
- It outperformed all open-source models and even the proprietary GPT-4o (59.54%).
- It showed significant gains in Real-Time Visual Perception (+4.27% over best open-source) and Backward Tracing.
Performance on StreamingBench:
- Achieved 77.00% average accuracy across 12 diverse real-time tasks. This is competitive with specialized online MLLMs such as StreamForest (77.26%) while using fewer frames, and significantly higher than others such as Dispider.
- Notably, it excelled in tasks requiring precise perception of current content without needing complex long-term history integration.
Ablation Studies:
- Removing the hierarchical memory (replacing with fixed-length segmentation) caused a performance drop, confirming the value of event-centric management.
- Removing specific tools (OCR or Object Detection) significantly reduced performance on tasks requiring fine-grained detail, validating the necessity of the multi-granular toolkit.
- Case Studies showed that the RL-trained agent successfully internalized tool-use strategies (e.g., searching memory for past events or using OCR for text), whereas untrained agents failed to utilize tools flexibly.

5. Significance

EventMemAgent represents a significant advancement in online video understanding by addressing the fundamental bottleneck of context window limitations. Its significance lies in:

Scalability: It enables MLLMs to handle "infinite" video streams within a fixed context budget, limiting both memory growth and semantic fragmentation.
Autonomy: By using Agentic RL, it creates agents that can self-correct and adapt their perception strategies dynamically, moving closer to true autonomous agents for dynamic visual environments.
Resource Efficiency: It achieves state-of-the-art results with low hardware costs (limited frame inputs), making it viable for real-time applications like robotics and surveillance where computational resources are constrained.

The code is open-sourced, providing a strong foundation for future research in continuous perception and agentic video reasoning.