ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments

This paper introduces ACE-Brain-0, a generalist foundation brain that leverages spatial intelligence as a universal scaffold and employs a Scaffold-Specialize-Reconcile (SSR) paradigm to unify diverse embodied tasks like autonomous driving and robotics within a single multimodal large language model, achieving state-of-the-art performance across 24 benchmarks.

Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, Xiaogang Wang

Published 2026-03-04
Imagine you are trying to teach a single student to be an expert in four very different jobs: driving a car, flying a drone, controlling a robot arm, and solving complex 3D puzzles.

If you try to teach them all at once by mixing the textbooks together, the student gets confused. The lessons for driving (which reward fast reactions) clash with the lessons for flying a drone (which reward precise control at altitude). This is called gradient interference: the updates one job asks for pull the student's brain in the opposite direction from the updates another job asks for, so no job is learned well.
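
To make gradient interference concrete, here is a minimal sketch in plain Python. The two "tasks", their inputs, and the tiny two-weight model are all made up for illustration and are not taken from the paper. It shows two gradients that point in nearly opposite directions, so that averaging them leaves almost no useful update.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to the weights w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]

# Shared weights and one hypothetical training example per task.
w = [0.0, 0.0]
g_drive = grad_squared_error(w, x=[1.0, 0.0], y=1.0)   # wants w[0] to go up
g_drone = grad_squared_error(w, x=[1.0, 0.2], y=-1.0)  # wants w[0] to go down

# Negative cosine similarity means the two tasks disagree about the update.
interference = cosine(g_drive, g_drone)

# Training on the mixture averages the gradients, which mostly cancel out.
mixed = [(a + b) / 2 for a, b in zip(g_drive, g_drone)]

print("cosine similarity:", round(interference, 3))
print("gradient norms:", norm(g_drive), norm(g_drone), "mixed:", norm(mixed))
```

In this toy setup the mixed gradient is far shorter than either task's own gradient, which is the "learning nothing well" effect in miniature.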

If you teach them one job at a time (first driving, then flying), they get good at driving, but then lose much of that skill once they start learning to fly. This is called catastrophic forgetting.
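
Catastrophic forgetting can also be reproduced in a few lines. The sketch below assumes a hypothetical one-weight "student" and two invented jobs whose ideal weights differ (2.0 for job A, -1.0 for job B); none of these numbers come from the paper. Training on A and then on B drives the error on A back up.

```python
def loss(w, target):
    """Squared distance between the current weight and a job's ideal weight."""
    return 0.5 * (w - target) ** 2

def train(w, target, steps=50, lr=0.2):
    """Plain gradient descent on the loss above."""
    for _ in range(steps):
        w -= lr * (w - target)
    return w

TARGET_A, TARGET_B = 2.0, -1.0  # hypothetical ideal weights for two jobs

w = 0.0
w = train(w, TARGET_A)            # learn job A first
loss_a_before = loss(w, TARGET_A)

w = train(w, TARGET_B)            # then learn job B with no reminder of A
loss_a_after = loss(w, TARGET_A)

print("loss on A after learning A:", loss_a_before)
print("loss on A after learning B:", loss_a_after)
```

The second training run overwrites the only weight the first job relied on, so performance on job A collapses even though nothing went wrong during its own training.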

ACE-Brain-0 is a new kind of "super-student" (an AI model) that solves this problem using a clever three-step strategy called Scaffold-Specialize-Reconcile.

Here is how it works, using a simple analogy:

1. The Core Idea: "Spatial Intelligence" is the Universal Language

The researchers realized that whether you are driving a car, flying a drone, or moving a robot arm, you need to understand 3D space. You need to know:

  • Where am I?
  • How far is that object?
  • If I move forward, will I hit something?

They call this Spatial Intelligence. It's like a universal "skeleton" or "scaffold" that all these different machines need to stand on.

2. The Three-Step Training Recipe (SSR)

Instead of mixing everything up, ACE-Brain-0 follows a specific recipe:

Step 1: Build the Scaffold (The "Architect")

First, they train the AI only on spatial puzzles. They teach it to understand 3D shapes, distances, and how objects move in space.

  • Analogy: Think of this as training a master architect. This architect doesn't know how to drive or fly yet, but they are an expert at understanding blueprints, gravity, and how buildings fit together. This creates a strong, shared foundation.

Step 2: Specialize (The "Apprentices")

Next, they take that master architect and train separate "apprentices" for specific jobs, using the architect's knowledge as a base.

  • The Driver Apprentice: Takes the spatial knowledge and learns traffic rules.
  • The Drone Apprentice: Takes the spatial knowledge and learns wind patterns and aerial navigation.
  • The Robot Apprentice: Takes the spatial knowledge and learns how to grab cups and open doors.
  • Why this works: Because they all started with the same "Architect" knowledge, they don't have to relearn how 3D space works. They just learn the specific rules of their job. This prevents them from forgetting the basics.
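
The specialization step can be sketched as follows. This is a toy illustration, not the paper's training code: the parameter names, the targets, and the simplification that each job only adjusts its own `task_head` weight are all assumptions. The point is only that every apprentice starts from a private copy of the same scaffold, and the scaffold itself is left untouched.

```python
import copy

def specialize(scaffold, target, steps=100, lr=0.1):
    """Fine-tune a private copy of the scaffold toward task-specific targets.

    Only the weights named in `target` are updated (a toy simplification;
    real fine-tuning can move every weight).
    """
    expert = copy.deepcopy(scaffold)
    for _ in range(steps):
        for name in target:
            expert[name] -= lr * (expert[name] - target[name])
    return expert

# Hypothetical scaffold weights learned in Step 1.
scaffold = {"depth": 1.0, "distance": 0.5, "task_head": 0.0}

# Each apprentice branches from the same starting point.
driver = specialize(scaffold, {"task_head": 2.0})
drone = specialize(scaffold, {"task_head": -1.5})

print("scaffold:", scaffold)
print("driver:  ", driver)
print("drone:   ", drone)
```

Because both experts share the scaffold's spatial weights, they differ only in what each job added on top, which is what makes the later merge tractable in this sketch.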

Step 3: Reconcile (The "Merging")

Now the researchers have several separate experts, but they want one brain that can do every job. Instead of mixing the experts' training data (which causes the confusion described earlier), they use a mathematical technique to merge the experts' brains directly, without any additional training data.

  • Analogy: Imagine you have four different chefs. One is great at Italian food, one at Chinese, one at French, and one at Mexican. Instead of forcing them to cook all four cuisines at once (which would result in a messy stew), you take their recipes and mathematically blend their "flavor profiles" into one "Master Chef" who knows how to cook all four perfectly.
  • This step combines the skills without the "forgetting" or "confusion" that usually happens.
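
The text above does not say which merging technique is used, so the sketch below swaps in one well-known data-free approach, task arithmetic, purely as an illustration: subtract the shared base from each expert to get a "task vector", then add the task vectors back onto the base. All weights and names here are hypothetical.

```python
def task_vector(expert, base):
    """What one expert learned on top of the shared base."""
    return {k: expert[k] - base[k] for k in base}

def merge(base, experts, scale=1.0):
    """Add every expert's task vector onto the base; no training data needed."""
    merged = dict(base)
    for expert in experts:
        tv = task_vector(expert, base)
        for k in merged:
            merged[k] += scale * tv[k]
    return merged

# Hypothetical weights: one shared spatial weight plus one weight per job.
base = {"spatial": 1.0, "drive_head": 0.0, "drone_head": 0.0, "arm_head": 0.0}
driver = {**base, "drive_head": 2.0}
drone = {**base, "drone_head": -1.5}
arm = {**base, "arm_head": 0.7}

merged = merge(base, [driver, drone, arm])
print(merged)
```

In this idealized case each expert changed a different weight, so the merged model keeps every skill exactly. When experts modify overlapping weights, their task vectors can conflict, and practical merging methods add steps to resolve those conflicts.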

3. The Final Polish: Reinforcement Learning

Finally, they let this "Master Chef" practice on real-world scenarios and give it feedback (like a coach saying "Good job!" or "Try that again"). This sharpens its decision-making skills.
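
As a rough picture of learning from feedback, here is a minimal policy-gradient sketch on a two-choice problem. It is not the paper's reinforcement learning setup: the rewards, the learning rate, and the use of the exact expected gradient for a softmax policy are assumptions chosen to keep the example small and deterministic.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coach feedback: action 0 earns "Good job!", action 1 earns nothing.
REWARDS = [1.0, 0.0]

prefs = [0.0, 0.0]  # the policy starts with no preference
lr = 0.5

for _ in range(200):
    probs = softmax(prefs)
    baseline = sum(p * r for p, r in zip(probs, REWARDS))  # average reward
    # Expected policy-gradient update: raise actions that beat the average.
    for a in range(len(prefs)):
        prefs[a] += lr * probs[a] * (REWARDS[a] - baseline)

probs = softmax(prefs)
print("probability of the rewarded action:", round(probs[0], 3))
```

The policy gradually shifts probability toward the action that earns praise, which is the "practice with a coach" idea reduced to its simplest form.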

Why is this a Big Deal?

Before this, AI models were usually "jacks of all trades, masters of none," or they were great at one thing but terrible at others.

ACE-Brain-0 shows that if you build a strong spatial foundation first, you can teach a single AI to:

  • Drive a car safely.
  • Fly a drone through a city.
  • Manipulate objects with a robot arm.
  • Solve complex 3D puzzles.

The paper reports state-of-the-art performance across 24 benchmarks, and the model doesn't forget how to do one job when it learns another. It's like having a single brain that can be a pilot, a driver, and a handyman all at the same time, because all three roles share the same understanding of how the physical world works.