Imagine you are having a conversation with a friend, and the conversation never ends. It's like a radio show that has been playing for years, with thousands of episodes, inside jokes, and shifting plotlines.
Now, imagine you are the host of this show. Every time a new listener calls in with a question about something that happened three years ago, you have to answer immediately.
The Problem: The "Too Much Noise" Dilemma
Current AI systems (like the ones in your phone or computer) try to remember everything by keeping the entire conversation history open in their mind.
- The Analogy: It's like trying to find a specific needle in a haystack that keeps growing bigger every second. If the conversation is short, it's easy. But if the conversation is infinite, the "haystack" becomes so huge that the AI gets overwhelmed. It either takes forever to find the answer (too slow) or it starts hallucinating and making things up because it can't focus on the right part of the story.
The New Solution: ProStream
The authors of this paper, "ProStream," propose a smarter way to handle this. Instead of keeping the whole haystack, they build a smart, organized library that updates itself in real-time.
Here is how ProStream works, broken down into simple steps:
1. The "Short-Term Buffer" (The Coffee Table)
When you are talking, you keep the last few minutes of conversation on your "coffee table" (Short-Term Sensing Buffer). This is fresh, immediate stuff.
- Why? Because sometimes the answer is just what your friend said two sentences ago. You don't need to look in the library for that.
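The coffee-table idea can be sketched as a fixed-size buffer that keeps only the newest turns. This is an illustrative sketch, not the paper's implementation; the class name, capacity, and turn format are all assumptions.

```python
from collections import deque

class ShortTermBuffer:
    """Hypothetical sketch of a short-term sensing buffer: only the
    most recent turns stay on the 'coffee table'."""

    def __init__(self, capacity=8):
        # deque with maxlen drops the oldest turn automatically
        self.turns = deque(maxlen=capacity)

    def add(self, turn):
        self.turns.append(turn)

    def recent(self):
        return list(self.turns)

buf = ShortTermBuffer(capacity=3)
for turn in ["Hi!", "How's Alice?", "She got a cat.", "A cat? Nice."]:
    buf.add(turn)
# "Hi!" has already fallen off the table; only the last 3 turns remain
```

The design choice here is that nothing is summarized yet: fresh turns stay verbatim, because the answer might be "what your friend said two sentences ago."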
2. The "Distillation" (The Summarizer)
As the conversation moves past the coffee table, ProStream doesn't just throw the old words away. Instead, it acts like a super-efficient editor.
- It reads a chunk of the conversation and asks: "What is the main point here?"
- It turns a 10-minute rant into a single sentence summary (an "Event").
- It groups these summaries into bigger categories like "Work," "Family," or "Vacation" (Scenes).
- It pulls out specific facts, like "Alice has a cat" or "Bob hates broccoli" (Atomic Memories).
- The Magic: It turns a messy, infinite stream of words into a neat, organized tree structure.
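The tree structure above can be sketched with a few data classes: atomic facts live inside events, and events are grouped into scenes. This is a toy sketch under stated assumptions: the real distiller would be an LLM summarizer, so a trivial first-sentence stub stands in for it, and the names `Scene`, `Event`, and `distill` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    summary: str                                # one-sentence summary of a chunk
    facts: list = field(default_factory=list)   # "atomic memories"

@dataclass
class Scene:
    topic: str                                  # e.g. "Work", "Family"
    events: list = field(default_factory=list)

def distill(chunk_text, topic, library):
    """Stub distiller: compress a raw chunk into an Event under a Scene.
    (A real system would call a model here; we just take the first sentence.)"""
    summary = chunk_text.split(".")[0] + "."
    event = Event(summary=summary, facts=[summary])
    scene = library.setdefault(topic, Scene(topic=topic))
    scene.events.append(event)
    return event

library = {}  # topic -> Scene, i.e. the organized "library"
distill("Alice has a cat. She adopted it last spring.", "Family", library)
```

The point of the shape, rather than the stub summarizer, is what matters: a messy stream of words becomes a small, navigable tree.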
3. The "Adaptive Optimization" (The Janitor)
This is the most clever part. The library has a limited amount of shelf space. You can't keep everything forever.
- The Rule: ProStream uses a "Utility Score." It asks: "How likely is it that we will need this fact again?"
- If a fact is used often (like "Alice has a cat"), it stays on the shelf.
- If a fact is old and nobody talks about it anymore, the system gently removes it to make room for new, important information.
- The Result: The AI's memory stays a manageable size, no matter how long the conversation lasts. It never gets slow.
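The janitor's rule can be sketched as an eviction policy over a bounded shelf. The exact scoring formula below (a toy blend of how often a fact is used and how recently) is an assumption for illustration; the paper's actual utility score may differ.

```python
class MemoryShelf:
    """Hypothetical bounded memory with utility-based eviction:
    when the shelf overflows, the least useful fact is removed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.facts = {}   # fact -> (use_count, last_used_step)
        self.step = 0

    def touch(self, fact):
        """Record that a fact was stored or used, then tidy the shelf."""
        self.step += 1
        count, _ = self.facts.get(fact, (0, 0))
        self.facts[fact] = (count + 1, self.step)
        self._evict()

    def _utility(self, fact):
        # toy score: frequency plus a small recency bonus (assumed weights)
        count, last_used = self.facts[fact]
        return count + 0.1 * last_used

    def _evict(self):
        while len(self.facts) > self.capacity:
            worst = min(self.facts, key=self._utility)
            del self.facts[worst]

shelf = MemoryShelf(capacity=2)
shelf.touch("It rained on Tuesday")   # mentioned once, long ago
shelf.touch("Alice has a cat")
shelf.touch("Alice has a cat")        # used often -> high utility
shelf.touch("Bob hates broccoli")     # shelf overflows -> old rain fact evicted
```

Whatever the real formula, the invariant is the same: memory size stays bounded no matter how long the conversation runs.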
4. The "On-Demand Recall" (The Librarian)
When a question comes in (e.g., "What did Alice say about her cat last year?"), the AI doesn't scan the whole library.
- It goes straight to the "Cat" section of the tree.
- It grabs the specific summary and the key fact.
- It combines this with the current conversation on the coffee table to give a perfect answer instantly.
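The librarian's trick can be sketched as a targeted lookup: jump to one branch of the memory tree and splice it with the coffee-table buffer, instead of scanning everything. The dict layout and the `recall` function are hypothetical; real retrieval would likely use semantic matching rather than an exact topic key.

```python
# Toy memory tree: topic -> one branch of summaries and atomic facts
library = {
    "Family": {"events": ["Alice adopted a cat last spring."],
               "facts":  ["Alice has a cat"]},
    "Work":   {"events": ["Bob switched teams."],
               "facts":  ["Bob hates broccoli"]},
}

def recall(topic, library, short_term):
    """Hypothetical on-demand recall: read only the matching branch,
    then combine it with the short-term buffer as the model's context."""
    branch = library.get(topic, {"events": [], "facts": []})
    # other scenes ("Work") are never touched, which is why this stays fast
    return branch["facts"] + branch["events"] + short_term

context = recall("Family", library, ["So, about that cat..."])
```

The key property is that retrieval cost depends on the size of one branch, not on the length of the whole conversation history.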
Why is this a big deal?
The paper introduces a new benchmark called STEM-Bench (like a final exam for AI memory) to put this to the test. They found that:
- Old methods were either too slow (reading the whole history) or too forgetful (only remembering the last few words).
- ProStream is fast and accurate. It solves the "Fidelity-Efficiency Dilemma" (the struggle between being accurate and being fast).
In a Nutshell:
Think of ProStream not as a giant hard drive that stores every word ever spoken, but as a smart, self-cleaning brain that constantly summarizes the past, throws out the trash, and keeps only the most useful, organized facts ready to be pulled out the moment they are needed. This allows AI to have conversations that feel infinite without ever getting tired or confused.