LifeBench: A Benchmark for Long-Horizon Multi-Source Memory

This paper introduces LifeBench, a benchmark that evaluates AI agents' long-horizon, multi-source memory by simulating complex, real-world scenarios that require integrating both declarative and non-declarative memory. Current state-of-the-art systems reach only 55.2% accuracy on these tasks.

Zihao Cheng, Weixin Wang, Yu Zhao, Ziyang Ren, Jiaxuan Chen, Ruiyang Xu, Shuai Huang, Yang Chen, Guowei Li, Mengshi Wang, Yi Xie, Ren Zhu, Zeren Jiang, Keda Lu, Yihong Li, Xiaoliang Wang, Liwei Liu, Cam-Tu Nguyen

Published 2026-03-05

Imagine you are trying to teach a robot how to be the perfect personal assistant for a human being.

Currently, most AI assistants are like amnesiacs with a photographic memory for facts. If you tell them, "I like pizza," they remember that forever. But if you ask, "What did I have for dinner last Tuesday?" or "Why do I always run when I'm stressed?", they often get lost. They can recall explicit facts, but they can't figure out your habits, your routines, or the subtle patterns of your life.

The paper "LifeBench" introduces a new way to test if AI is ready to be a true, long-term companion. Here is the breakdown in simple terms:

1. The Problem: The "Goldfish" vs. The "Biographer"

Most current AI memory tests are like a quiz show. They ask simple questions like, "What color was the car I mentioned yesterday?"

  • The Issue: Real life isn't a quiz show. Real life is a messy, 24/7 movie.
  • The Missing Piece: Humans have two types of memory:
    1. Declarative: "I went to Paris in 2020." (Facts)
    2. Non-Declarative: "I always feel anxious before a big meeting, so I drink extra coffee." (Habits, skills, feelings)
  • The Gap: Current AI is great at the first type but terrible at the second. It doesn't know why you do things, only what you did.

2. The Solution: Building a "Digital Twin"

To fix this, the researchers built LifeBench. Think of it as a massive, hyper-realistic video game: they created 10 fictional people (avatars) and simulated a full year of each one's life inside a computer.

  • The Simulation: They didn't just write a diary. They simulated every text message, every calendar appointment, every photo taken, every health record (steps, sleep, heart rate), and every chat with an AI assistant.
  • The "Noise": Just like your real phone, these fake people have thousands of notifications, ads, and random texts that have nothing to do with the main story. The AI has to find the signal in the noise.
  • The Scale: For just one person, they generated over 5,000 events and 8,000 pieces of digital data (texts, calls, photos, etc.). That's a lot of data to remember!
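To make the "signal in the noise" idea concrete, here is a minimal sketch of what one person's digital trail might look like as data. The `Record` type, its field names, and the `is_noise` flag are all hypothetical and are not taken from the paper; a real memory system would not be handed a noise label and would have to work it out on its own.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One piece of digital data (hypothetical schema, for illustration only)."""
    day: date
    source: str      # e.g. "sms", "calendar", "health", "photo"
    text: str
    is_noise: bool   # assumed label; a real system would have to infer this


def signal_only(records):
    """Keep only the records that are not flagged as noise."""
    return [r for r in records if not r.is_noise]


# Made-up sample data: one ad and two meaningful records.
records = [
    Record(date(2025, 3, 1), "sms", "Flash sale! 50% off shoes", True),
    Record(date(2025, 3, 1), "health", "run 5.2 km", False),
    Record(date(2025, 3, 2), "calendar", "Quarterly review 10:00", False),
]

kept = signal_only(records)
print(len(kept))  # 2
```

Filtering is the easy part here because the label is given. The hard part the benchmark points at is deciding which of thousands of records matter for a particular question.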

3. The Challenge: The "Needle in a Haystack" Test

Once they built these digital lives, they asked the AI some tricky questions to see if it could understand the person.

  • Simple Question: "What time did I wake up on Christmas?" (Easy, just look at the calendar).
  • Hard Question: "How many times did I go running this year?" (Hard, because the data is scattered across health apps, photos, and text messages).
  • Super Hard Question: "I seem to run more when I'm stressed. Can you prove it?" (This requires connecting the dots between a work email, a high heart rate, and a running log).
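The hard and super hard questions above can be sketched in a few lines, under strong simplifying assumptions. All data below is made up, the function names are hypothetical, and the sketch assumes at most one run per day and that "stressed" days are already known. None of this is the paper's method; it only shows why evidence from several sources has to be merged before it is counted.

```python
from datetime import date

# Hypothetical evidence of runs, each tagged with the source it came from.
evidence = [
    (date(2025, 5, 3), "health"),
    (date(2025, 5, 3), "photo"),    # same run, seen a second time
    (date(2025, 5, 10), "sms"),
    (date(2025, 5, 17), "health"),
]


def count_runs(evidence):
    """Count distinct days with any evidence of a run (assumes one run per day)."""
    return len({day for day, _source in evidence})


def run_rates(days, stressed_days, run_days):
    """Return (share of stressed days with a run, share of calm days with a run)."""
    stressed = [d for d in days if d in stressed_days]
    calm = [d for d in days if d not in stressed_days]

    def rate(group):
        return sum(d in run_days for d in group) / len(group) if group else 0.0

    return rate(stressed), rate(calm)


days = [date(2025, 5, d) for d in range(1, 21)]            # May 1 to May 20
stressed_days = {date(2025, 5, d) for d in (3, 10, 11, 12)}  # assumed known
run_days = {day for day, _source in evidence}

print(count_runs(evidence))                      # 3
print(run_rates(days, stressed_days, run_days))  # (0.5, 0.0625)
```

Counting rows naively would report four runs instead of three, which is the kind of mistake scattered data invites. The rate comparison is only suggestive: a real answer would also need to establish which days were stressful from emails, heart rate, and other records.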

4. The Results: The AI Got Stuck

They tested the smartest AI memory systems available today on this new benchmark.

  • The Score: The best system answered only about 55% of the questions correctly (55.2%, to be exact).
  • The Verdict: Even the smartest AIs are struggling. They are like a librarian who can find a specific book if you give them the exact title, but if you ask, "What books did I read when I was sad last summer?", they get confused.

5. Why This Matters

This isn't just about passing a test. If we want AI to be a true partner in our lives, it needs to understand our habits and patterns, not just the facts we tell it.

  • Health: It could notice, "You've been sleeping poorly and eating more sugar since your promotion; let's adjust your schedule."
  • Personalization: It could say, "You usually hate Mondays, but you loved the hiking trip last month. Let's plan a hike for next Monday."

The Big Metaphor

Imagine you are trying to hire a butler.

  • Old AI: A butler who only remembers the exact words you said to them. If you say, "I'm hungry," they bring food. If you don't say it, they don't know you're hungry.
  • LifeBench AI: A butler who has watched you for a year. They know you get hungry at 4 PM, that you prefer tea over coffee when you're stressed, and that you usually skip breakfast on rainy days. They anticipate your needs before you speak.

LifeBench is the new "interview" to see if our AI butlers are ready to stop being just robots and start being real helpers. Right now, they are still in training, but this paper gives us the map to teach them how to truly remember us.
