Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning

This paper introduces MIKASA, a comprehensive benchmark suite featuring a classification framework and two distinct datasets (MIKASA-Base and MIKASA-Robo) to systematically evaluate and advance memory-enhanced reinforcement learning agents across diverse scenarios, with a specific focus on tabletop robotic manipulation.

Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, Aleksandr I. Panov

Published 2026-03-05

Imagine you are trying to teach a robot to do chores around the house. You might tell it, "Go to the kitchen, find the red cup under the towel, and bring it to me."

If the robot has a "short-term memory" like a goldfish, it will see the red cup, but the moment the towel covers it, the robot forgets it exists. It will wander around confused, asking, "Where did the cup go?"

This paper introduces a new way to test how good robots (and AI agents) are at remembering things, especially when they can't see them all the time. The authors call their new testing suite MIKASA.

Here is a simple breakdown of what they did and why it matters:

1. The Problem: The "Goldfish" Robot

Right now, many AI benchmarks (tests for AI) are like testing a human's memory by asking them to solve a math problem they can see on a piece of paper. But real life is different. Real life is more like the shell game, where a ball is hidden under one of three cups and shuffled around.

  • The Issue: If a robot can't remember which cup the ball is under once it's covered, it fails.
  • The Gap: Scientists didn't have a standard "report card" to test if a robot is good at remembering different kinds of things (like locations, colors, or sequences of events). Every researcher built their own tiny test, making it impossible to compare robots fairly.

2. The Solution: The MIKASA "Memory Gym"

The authors built a massive, standardized gym called MIKASA (Memory-Intensive Skills Assessment Suite for Agents). Think of it as a Universal Driver's License Test for robot memory.

They organized the tests into four main categories, using simple analogies:

  • Object Memory (The "Where's Waldo" Test):
    • The Task: A robot sees a red ball, then a cup covers it. The robot must remember the ball is still there and touch the right cup.
    • Real Life: Finding your keys that you put under a magazine.
  • Spatial Memory (The "Mental Map" Test):
    • The Task: A robot sees a peg in a specific spot, then has to rotate it to a new angle without losing track of where it started.
    • Real Life: Walking through your house in the dark because you remember where the furniture is.
  • Sequential Memory (The "Recipe" Test):
    • The Task: The robot sees a red cube, then a blue one, then a green one. Later, it has to pick them up in that exact order.
    • Real Life: Remembering the steps to bake a cake: flour, then eggs, then sugar.
  • Memory Capacity (The "Grocery List" Test):
    • The Task: The robot has to remember the colors of seven different cubes shown all at once, then pick them all out later.
    • Real Life: Trying to remember a 7-digit phone number you just heard.
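All four categories share one ingredient: information shown early must survive until it is needed. A minimal sketch of such a task is below (hypothetical illustrative code, not the actual MIKASA-Robo API; `RememberColorEnv` and `hide_steps` are invented names):

```python
import random

class RememberColorEnv:
    """Toy object-memory task in the spirit of MIKASA-style benchmarks.

    Reset: the agent briefly observes a cue color.
    Next `hide_steps` steps: the cue is masked (observation is None).
    Final step: reward is given only if the action matches the hidden cue.
    """

    COLORS = ["red", "green", "blue"]

    def __init__(self, hide_steps=3):
        self.hide_steps = hide_steps

    def reset(self):
        self.cue = random.choice(self.COLORS)
        self.t = 0
        return {"color": self.cue}  # the cue is visible only at reset

    def step(self, action):
        self.t += 1
        if self.t <= self.hide_steps:
            # Cue hidden: a memoryless agent gets no useful signal here.
            return {"color": None}, 0.0, False
        # Recall phase: reward only if the agent remembered the cue.
        reward = 1.0 if action == self.cue else 0.0
        return {"color": None}, reward, True
```

An agent that stores the first observation solves this trivially; a purely reactive agent, which sees only `None` at decision time, can do no better than guessing.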

3. The "Reality Check": Robots Are Bad at This

The authors tested many of the strongest robot-control agents and AI models currently available, including vision-language-action (VLA) models that combine seeing, language understanding, and acting, on these 32 new tasks.

The results were humbling:

  • In a perfect world (Full Memory): When the robot could see everything at all times (like a video game with no fog of war), it performed almost flawlessly, solving essentially all of the tasks.
  • In the real world (Partial Memory): As soon as it had to remember something because it was hidden or had happened in the past, the robots crashed and burned.
    • Even the "smartest" models, which can write poetry and chat, failed to remember a simple color if it was hidden for just a few seconds.
    • It's like a super-genius who can solve complex physics equations but forgets your name five seconds after you introduce yourself.
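The gap between the two modes can be shown with a toy sketch (hypothetical code, not the paper's evaluation harness; `run_episode`, `full`, and `partial` are invented names). The same reactive policy that is perfect when it sees everything scores zero the moment the crucial bit of state is masked:

```python
import random

def run_episode(observe, horizon=5):
    """One episode of a toy 'recall the hidden cue' task.

    `observe(state, t)` controls what the agent sees each step; the
    reactive policy below acts only on the current observation (no memory).
    """
    cue = random.choice(["red", "green", "blue"])
    action = None
    for t in range(horizon):
        obs = observe({"cue": cue}, t)
        action = obs["cue"]  # reactive policy: repeat whatever cue it sees
    return 1.0 if action == cue else 0.0

# "Full Memory" mode: the cue is always visible.
full = lambda state, t: {"cue": state["cue"]}

# "Partial Memory" mode: the cue is visible only at t == 0,
# so success requires carrying it forward internally.
partial = lambda state, t: {"cue": state["cue"] if t == 0 else None}
```

Under `full`, this memoryless policy scores 1.0 every episode; under `partial`, it scores 0.0, mirroring the benchmark's contrast between the two evaluation modes.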

4. Why This Matters

The paper argues that for robots to truly help us in the real world (cleaning, cooking, caring for the elderly), they need better memory, not just better eyes or stronger arms.

Currently, robots are like amnesiacs. They can react to what they see right now, but they struggle to connect the dots between what happened a moment ago and what they need to do next.

The Takeaway:
MIKASA is a new tool that forces AI researchers to stop building robots that only live in the "now." It pushes them to build robots that can actually remember the past, so they can handle the messy, hidden, and complex tasks of real life.

In short: We are building robots with great eyes but terrible memories. MIKASA is the test that proves it, and the roadmap to fix it.