Hybrid Self-evolving Structured Memory for GUI Agents

Imagine you are trying to teach a robot butler how to use a computer to do complex tasks, like booking a flight, buying a gift, or finding a specific recipe.

Right now, most AI robots are like amnesiacs. They have a short-term memory that lasts only a few seconds. If a task takes 20 steps, by step 15, they often forget what they were doing in step 2, or they get confused because the screen changed. They try to solve every new problem from scratch, which leads to mistakes.

Other researchers tried to fix this by giving the robot a notebook in which summaries of past tasks were written down. But this notebook was messy: just a long, flat list of sentences. If the robot needed a specific tip about "booking flights," it had to read through thousands of unrelated notes about "buying shoes" or "checking the weather." It was like searching for one page in a huge pile of loose paper.

The Solution: HYMEM (The "Smart Brain" for Robots)

The authors of this paper created HYMEM (Hybrid Self-evolving Structured Memory). Think of this not as a notebook, but as a living, breathing brain for the robot.

Here is how it works, using simple analogies:

1. The Two-Part Brain (Hybrid Memory)

Human brains are amazing because they have two ways of remembering things:

The "Big Picture" Brain (Symbolic/Discrete): You remember the strategy. "To buy a flight, I first check prices, then filter by date, then click 'book'." This is like a high-level map.
The "Sensory" Brain (Continuous/Embeddings): You remember the feeling and details. You remember exactly what the "Book" button looked like, the color of the screen, and the tiny text you had to read.

HYMEM does both.

It creates Nodes (dots on a map) that hold the "Big Picture" strategies (like a recipe card).
It attaches Photos/Videos (continuous data) to those dots so the robot remembers exactly what the screen looked like.
Why it matters: The robot doesn't just know what to do; it knows how it looked when it worked before.

2. The Living Library (Self-Evolving)

Most computer memories are static. You add a file, and it sits there forever.
HYMEM is a living library.

The Librarian (The Judge): Every time the robot finishes a task, a special "Librarian" AI checks the new experience against the library.
The Decision:
- Is this totally new? → ADD a new book to the shelf.
- Is this the same as an old book but with a better tip? → MERGE them. Update the old book with the new info.
- Is this a better way to do the old task? → REPLACE the old book with the new, better one.
The Result: The library gets smarter and cleaner over time. It doesn't just pile up junk; it organizes itself, deleting bad advice and keeping the best strategies.

3. The Active Guide (On-the-Fly Refresh)

Imagine you are driving to a party. You have a map (your memory).

Old Way: You look at the map at the start, memorize the route, and drive. If you hit a roadblock, you panic because your map is outdated.
HYMEM Way: The robot has a GPS that updates in real-time.
- As the robot clicks through a website, it constantly checks: "Wait, I just moved from 'Searching' to 'Checkout'. My old instructions about 'searching' are useless now. I need to refresh my memory to focus on 'payment'."
- It instantly swaps out the old context for the new, relevant context. This keeps the robot focused and prevents it from getting lost in long tasks.

The Magic Result

The paper tested this on open-source AI models (which are like "student" robots).

Without HYMEM: The student robots failed often, getting stuck or confused.
With HYMEM: These same student robots became so smart they could beat the "super-robots" (expensive, closed-source models like GPT-4o or Gemini).

The Analogy:
It's like taking a smart high school student and giving them a perfect, self-updating encyclopedia that knows exactly which page to open based on the current situation. Suddenly, that high school student can solve problems better than a genius who has to rely only on what they remember in their head.

In a Nutshell

HYMEM gives AI agents a brain that:

Organizes knowledge like a graph (connecting ideas), not a list.
Learns from every mistake and success, updating its own library automatically.
Adapts instantly when the task changes, keeping the right information front and center.

This allows smaller, cheaper AI models to handle complex, multi-step computer tasks far more reliably than they can on their own.

Here is a detailed technical summary of the paper "Hybrid Self-evolving Structured Memory for GUI Agents".

1. Problem Statement

While Vision-Language Models (VLMs) have enabled Graphical User Interface (GUI) agents to interact with computers, they struggle with long-horizon workflows, diverse interfaces, and intermediate errors in real-world tasks.

Limitations of Current Approaches: Existing methods equip agents with external memory but rely on flat retrieval mechanisms. These typically use either:
- Discrete summaries: Textual tokens that lack fine-grained visual fidelity.
- Continuous embeddings: Dense vectors that preserve sensory details but create information bottlenecks for explicit reasoning.
The Gap: Current systems lack the structured organization and self-evolving capabilities of human memory. They cannot efficiently associate multimodal episodes with high-level strategies, nor can they dynamically update knowledge as new experiences arrive without uncontrolled growth.

2. Methodology: HYMEM

The authors propose Hybrid Self-evolving Structured Memory (HYMEM), a brain-inspired, graph-based external memory system. It mimics the human brain's dual processing: a Hippocampal-like Continuous Pathway for raw experience and a Neocortical-like Discrete Pathway for abstract strategies.

A. Hybrid Structured Memory (Graph Schema)

HYMEM organizes memory as an evolving graph $G = (V, E)$ where nodes represent successful interaction sequences. Each node $v_i$ is a tuple containing three components:

High-level Strategy ($c_i$): Discrete symbolic tokens summarizing the core heuristic (e.g., "filter low-to-high").
Middle-level Attributes ($A_i$): Semantic tags (e.g., #search, #price) providing cues about UI elements and domains.
Low-level Trajectory Embeddings ($m_i$): Continuous embeddings preserving fine-grained multimodal details (visuals and actions).

Connectivity: Edges connect nodes sharing identical attributes, enabling multi-hop retrieval and associative search.
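
The node tuple and the attribute-based edges can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and field names (MemoryNode, MemoryGraph, add_node, neighbors) are hypothetical, the embedding is a toy list of floats, and "sharing attributes" is assumed here to mean sharing at least one tag.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One successful interaction sequence (hypothetical field names)."""
    node_id: int
    strategy: str               # c_i: high-level strategy text
    attributes: frozenset[str]  # A_i: semantic tags such as "#search"
    embedding: list[float]      # m_i: trajectory embedding (toy vector here)


@dataclass
class MemoryGraph:
    nodes: dict[int, MemoryNode] = field(default_factory=dict)
    edges: dict[int, set[int]] = field(default_factory=dict)

    def add_node(self, node: MemoryNode) -> None:
        """Insert a node and link it to every stored node sharing a tag.

        Assumption: one shared attribute is enough to create an edge.
        """
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())
        for other in self.nodes.values():
            if other.node_id == node.node_id:
                continue
            if node.attributes & other.attributes:
                self.edges[node.node_id].add(other.node_id)
                self.edges[other.node_id].add(node.node_id)

    def neighbors(self, node_id: int) -> set[int]:
        """Return a copy of the 1-hop neighbourhood of a node."""
        return set(self.edges.get(node_id, set()))


graph = MemoryGraph()
graph.add_node(MemoryNode(0, "filter low-to-high", frozenset({"#search", "#price"}), [1.0, 0.0]))
graph.add_node(MemoryNode(1, "open cart then pay", frozenset({"#checkout"}), [0.0, 1.0]))
graph.add_node(MemoryNode(2, "sort by price", frozenset({"#price"}), [0.9, 0.1]))
print(graph.neighbors(0))  # node 2 shares "#price" with node 0
```

Storing edges as an adjacency map keeps the 1-hop lookup used at retrieval time cheap, at the cost of scanning all nodes on insertion in this simplified version.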

B. Self-Evolving Memory Construction

The memory is not static; it evolves via a three-stage pipeline when new trajectories arrive:

Retrieval: Uses CLIP-based multimodal embeddings (text + image) and FAISS to find the top-$K$ similar nodes.
Redundancy Checking: A VLM judge evaluates the new trajectory against retrieved neighbors to decide the update action:
- ADD: New strategy/attributes found.
- MERGE: Same strategy but complementary insights (e.g., new UI variant).
- REPLACE: The new trajectory is strictly superior (fewer steps, higher success) and replaces the old one.
Update: The graph is modified (nodes added/merged/replaced), and edges are strengthened based on new co-occurrences, ensuring the memory remains coherent and non-redundant.
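
The three-stage loop above can be sketched as follows. Everything here is an illustrative assumption rather than the paper's code: brute-force cosine similarity stands in for CLIP embeddings with FAISS, a small rule-based function stands in for the VLM judge, the memory is a plain list, and edge strengthening is omitted. The names Node, retrieve, judge, and update are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Node:
    strategy: str
    attributes: set[str]
    embedding: list[float]
    steps: int  # trajectory length, used only by the stand-in judge


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory: list[Node], query: Node, k: int = 3) -> list[int]:
    """Stage 1: brute-force top-K by cosine similarity (stand-in for FAISS)."""
    ranked = sorted(
        range(len(memory)),
        key=lambda i: cosine(memory[i].embedding, query.embedding),
        reverse=True,
    )
    return ranked[:k]


def judge(new: Node, old: Node) -> str:
    """Stage 2: rule-based stand-in for the VLM judge (rules are assumptions)."""
    if new.strategy != old.strategy:
        return "ADD"
    if new.steps < old.steps:
        return "REPLACE"  # same strategy, strictly shorter trajectory
    return "MERGE"        # same strategy, complementary details


def update(memory: list[Node], new: Node, k: int = 3) -> str:
    """Stage 3: apply one decision and return the action taken."""
    for i in retrieve(memory, new, k):
        action = judge(new, memory[i])
        if action == "REPLACE":
            memory[i] = new
            return action
        if action == "MERGE":
            memory[i].attributes |= new.attributes
            return action
    memory.append(new)  # no retrieved neighbour covers this strategy
    return "ADD"


memory: list[Node] = []
print(update(memory, Node("filter low-to-high", {"#price"}, [1.0, 0.0], steps=6)))   # ADD
print(update(memory, Node("filter low-to-high", {"#search"}, [0.9, 0.1], steps=6)))  # MERGE
print(update(memory, Node("filter low-to-high", {"#price"}, [1.0, 0.0], steps=4)))   # REPLACE
```

Because MERGE and REPLACE modify an existing entry instead of appending, the memory in this toy run still holds a single node after three trajectories, which is the non-redundancy property the pipeline is designed for.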

C. Memory Utilization (Inference)

During agent execution, HYMEM employs a dynamic workflow:

Structured Retrieval: Starts with semantic matching, then expands via 1-hop graph neighbors to gather diverse, conceptually relevant experiences that may not be visually similar.
Working Memory Initialization:
- Discrete View: VLM synthesizes retrieved strategies into concise Guidance Instructions for the system prompt.
- Continuous View: Raw trajectory embeddings are concatenated to the VLM input for implicit visual grounding.
On-the-fly Refresh: After each action, a VLM detects phase shifts (e.g., moving from "search" to "checkout"). If a shift occurs, it triggers a re-retrieval to refresh the working memory, discarding stale context while preserving long-term goals.
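
A rough sketch of structured retrieval and the refresh trigger, under stated assumptions: the per-step phase labels are given as input (in the paper a VLM detects phase shifts), embeddings are toy 2-D vectors, and the function names structured_retrieve and run_episode are hypothetical. Synthesizing guidance text and feeding embeddings to the VLM are not modeled.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def structured_retrieve(query, embeddings, edges, n_seeds=2, n_expand=2):
    """Pick seed nodes by similarity, then add their unseen 1-hop neighbours."""
    ranked = sorted(embeddings, key=lambda n: cosine(embeddings[n], query), reverse=True)
    seeds = ranked[:n_seeds]
    expanded = []
    for seed in seeds:
        for neighbour in sorted(edges.get(seed, ())):
            if neighbour not in seeds and neighbour not in expanded:
                expanded.append(neighbour)
    return seeds, expanded[:n_expand]


def run_episode(step_phases, phase_queries, embeddings, edges):
    """Re-retrieve only when the phase label changes between steps."""
    current = step_phases[0]
    working = structured_retrieve(phase_queries[current], embeddings, edges)
    history, refreshes = [working], 0
    for phase in step_phases[1:]:
        if phase != current:  # phase shift detected: drop stale context
            current = phase
            working = structured_retrieve(phase_queries[current], embeddings, edges)
            refreshes += 1
        history.append(working)
    return history, refreshes


# Toy memory: nodes 0 and 1 look like "search", 2 and 3 like "checkout",
# and node 4 is linked to both groups through shared attributes.
embeddings = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9], 4: [0.7, 0.7]}
edges = {0: {4}, 2: {4}, 4: {0, 2}}
queries = {"search": [1.0, 0.0], "checkout": [0.0, 1.0]}

history, refreshes = run_episode(
    ["search", "search", "checkout", "checkout"], queries, embeddings, edges
)
print(history[0], history[-1], refreshes)
```

In this toy run node 4 never ranks among the seeds, yet it enters the working memory through graph expansion in both phases, which illustrates how 1-hop neighbours can surface related experiences that are not the closest matches by similarity.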

3. Key Contributions

Hybrid Architecture: First GUI agent memory to tightly couple discrete symbolic guidance (for reasoning) with continuous multimodal embeddings (for grounding) within a single graph structure.
Self-Evolving Mechanism: Introduces a dynamic update loop (Add/Merge/Replace) that allows the memory to accumulate knowledge, prune redundancy, and refine strategies without manual intervention.
Dynamic Context Management: Implements an "on-the-fly" working memory refresh that adapts to changing task phases during long-horizon execution.
Cost-Effectiveness: Demonstrates that lightweight open-source models (7B/8B parameters) enhanced with HYMEM can match or surpass massive closed-source models.

4. Experimental Results

The authors evaluated HYMEM on three benchmarks: WebVoyager, Multimodal-Mind2Web, and MMInA.

Performance Gains:
- Qwen2.5-VL-7B: Improved from a baseline of 12.5% to 35.0% (an absolute gain of 22.5 percentage points).
- Comparison: The enhanced 7B model outperformed Gemini2.5-Pro-Vision (by 5.4%) and GPT-4o (by 15.3%) on average.
- Consistency: Similar significant improvements were observed on Qwen3-VL-8B and UI-TARS-1.5-7B.
Ablation Studies:
- Self-Evolution: Global evolution (learning from history) provided ~25% gains on Amazon tasks; Local evolution (working memory refresh) provided ~15% gains by adapting to UI state changes.
- Memory Size: Performance scaled positively with memory size, with diminishing returns only at very large scales, which suggests the graph stores experience compactly.
- Retrieval Strategy: A balanced approach (5 seed nodes + 5 graph-expanded nodes) outperformed strategies relying solely on similarity or pure diversity.
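
To keep absolute and relative improvements apart when reading the Qwen2.5-VL-7B numbers above, the short calculation below works through both. The helper name gains is hypothetical, and only the 12.5% and 35.0% figures come from the summary.

```python
def gains(baseline: float, enhanced: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = enhanced - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


abs_points, rel_percent = gains(12.5, 35.0)
print(f"+{abs_points:.1f} points, +{rel_percent:.0f}% relative")
# prints "+22.5 points, +180% relative"
```

The same 22.5-point gain therefore corresponds to a success rate 2.8 times the baseline.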

5. Significance

This work matters for GUI agent design in three ways:

Bridging the Capability Gap: The results suggest that memory architecture can matter as much as model scale. With a well-structured external memory, a 7B model outperformed proprietary models such as GPT-4o and Gemini2.5-Pro-Vision on the reported benchmarks.
Biological Inspiration: By mimicking the brain's separation of episodic (continuous) and semantic (discrete) memory, HYMEM achieves a balance between reasoning efficiency and perceptual precision.
Scalability: The self-evolving nature ensures the system can learn continuously from new data without requiring retraining of the base model, offering a practical path toward truly autonomous, long-term computer-use agents.