MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI is a multi-agent framework that leverages large language and vision models to enable robust, closed-loop robotic manipulation by coordinating specialized agents for planning, perception, and targeted error recovery, thereby improving zero-shot generalization in dynamic environments.

Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

Published 2026-02-26

Imagine you hire a brilliant but slightly scattered architect to build a complex Lego castle for you. You give them a simple instruction: "Build a red tower on the blue base."

In the past, if you asked a robot to do this, you'd have to write a very rigid, specific script: "Move arm 5 cm left, rotate 90 degrees, close gripper." If the Lego piece was even slightly out of place, the robot would crash or fail, with no way to fix itself. It was like driving with your eyes closed.

MALLVi is a new way of giving robots "eyes" and a "team of specialists" so they can figure things out on the fly. Instead of one giant, confused brain trying to do everything at once, MALLVi uses a team of six specialized AI agents working together like a well-oiled construction crew.

Here is how the MALLVi team works, using a simple analogy:

The MALLVi Team: A Construction Crew

Think of the robot's task as a complex construction project. MALLVi doesn't use one worker; it uses a specialized crew:

  1. The Decomposer (The Project Manager)

    • Role: You tell the Project Manager, "Build a red tower." The Manager doesn't try to build it themselves. Instead, they break the big idea down into tiny, manageable steps: Find the red block, pick it up, move it over the blue base, drop it.
    • Analogy: Like a chef reading a recipe and writing out the shopping list and the step-by-step cooking instructions before turning on the stove.
  2. The Descriptor (The Scene Investigator)

    • Role: Before any work starts, this agent looks at the room and creates a mental map. It says, "Okay, I see a red block here, a blue base there, and a wooden block acting as a distraction." It builds a "spatial graph" (a map of where everything is relative to each other).
    • Analogy: Like a real estate agent walking into a house and writing down exactly where the furniture is so the movers know where to put things.
  3. The Localizer (The Sharp-Eyed Spotter)

    • Role: This agent is the eyes of the team. It looks at the camera feed and says, "That's the red block! And here is the perfect spot on the block to grab it so it doesn't slip." It uses advanced vision tools to find the exact 3D coordinates for the robot's hand.
    • Analogy: Like a spotter in gymnastics who points out exactly where the athlete should land to avoid falling.
  4. The Thinker (The Planner)

    • Role: Once the Localizer finds the spot, the Thinker figures out the math. It calculates the exact angles the robot's arm needs to bend and the speed it needs to move to pick up the block without knocking it over.
    • Analogy: Like a GPS calculating the exact route and turns needed to get from point A to point B.
  5. The Actor (The Doer)

    • Role: This is the robot arm itself. It receives the precise coordinates from the Thinker and physically moves to grab the object.
    • Analogy: The construction worker actually lifting the brick.
  6. The Reflector (The Quality Control Inspector)

    • Role: This is the most important new feature. After the Actor tries to move the block, the Reflector looks at the result.
      • Did it work? Great! Move to the next step.
      • Did it fail? (e.g., the block slipped). The Reflector says, "Whoops, that didn't work." It doesn't make the whole team start over. It just tells the Localizer or Thinker to try that specific step again with a new plan.
    • Analogy: Like a quality inspector on an assembly line. If a car door doesn't fit, they don't scrap the whole car; they just tell the door-fitting team to adjust the hinge and try again.
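Putting the whole crew together, the closed-loop cycle above can be sketched in a few lines of Python. Everything here, the agent names, the `run_task` function, and every method signature, is a hypothetical illustration of the flow described in this post, not the paper's actual code:

```python
def run_task(instruction, agents, max_retries=3):
    """Run one instruction through the six-agent closed loop.

    `agents` maps role names to hypothetical agent objects; the
    interfaces below are illustrative stand-ins, not a real API.
    """
    subtasks = agents["decomposer"].decompose(instruction)   # Project Manager
    scene = agents["descriptor"].build_spatial_graph()       # Scene Investigator

    for subtask in subtasks:
        for attempt in range(max_retries):
            target = agents["localizer"].locate(subtask, scene)   # 3D grasp point
            plan = agents["thinker"].plan_motion(target)          # angles and speed
            result = agents["actor"].execute(plan)                # move the arm

            verdict = agents["reflector"].verify(subtask, result) # check the outcome
            if verdict.success:
                break   # this step worked; move on to the next subtask
            # Failure: re-map the scene and retry this one step only,
            # rather than restarting the whole task.
            scene = agents["descriptor"].build_spatial_graph()
        else:
            raise RuntimeError(f"Gave up on subtask: {subtask}")
```

Note the key design choice the Reflector enables: a failure triggers a targeted retry of a single subtask with a fresh scene map, not a restart of the entire plan.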

Why is this a big deal?

The Old Way (Open-Loop):
Imagine a blindfolded robot trying to stack cups. It guesses where the cup is, moves its hand, and hopes for the best. If the cup has been moved even slightly, the robot misses, knocks it over, and keeps going as if it had succeeded. This approach is fragile and prone to "hallucinations" (the robot believing it did something it didn't).

The MALLVi Way (Closed-Loop):
MALLVi is like a robot with eyes and a brain that checks its work constantly.

  • It adapts: If the red block is moved by a cat, the Descriptor notices, and the team recalculates.
  • It recovers: If the robot drops the block, the Reflector catches the error and tells the team to pick it up again, rather than giving up.
  • It specializes: By splitting the work, the "Project Manager" doesn't get bogged down in geometry, and the "Planner" doesn't get confused trying to parse language.
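The open-loop versus closed-loop contrast boils down to one question: does the system check its work before moving on? A toy sketch, using hypothetical helper functions rather than anything from the paper:

```python
def open_loop(steps, execute):
    """Fire and forget: failures go unnoticed."""
    for step in steps:
        execute(step)
    return "assumed success"

def closed_loop(steps, execute, verify, max_retries=3):
    """Execute, check, and retry just the failed step."""
    for step in steps:
        for _ in range(max_retries):
            execute(step)
            if verify(step):     # the Reflector's check
                break            # this step is confirmed done
        else:
            return f"failed at: {step}"   # a named failure, not a silent one
    return "verified success"
```

The `else` clause on the inner loop runs only when every retry is exhausted, so a persistent failure is reported for that one step instead of silently propagating through the rest of the task.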

The Result

The paper tested this team on real robots and in simulations. They asked the robots to do things like:

  • Stack blocks in a specific order.
  • Sort shapes into a sorter.
  • Even solve simple math problems using physical blocks (e.g., "Pick the block that equals 9 plus 4").

The result? MALLVi succeeded much more often than previous methods. It proved that by giving robots a team of specialized AI agents that talk to each other and check their own work, we can make robots that are much more reliable, adaptable, and ready for the messy, unpredictable real world.

In short: MALLVi turns a clumsy, blind robot into a smart, self-correcting construction crew that never gives up until the job is done right.
