ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments

Imagine you are trying to teach a single student to be an expert in four very different jobs: driving a car, flying a drone, controlling a robot arm, and solving complex 3D puzzles.

If you try to teach them all at once by mixing the textbooks together, the student gets confused. The lessons for driving (which demand fast reactions on the road) clash with the lessons for flying a drone (which demand precise control high in the air). This is called gradient interference: the student's brain is being pulled toward two opposite things at the same time, so they end up learning neither well.

If you teach them one job at a time (first driving, then flying), they get good at driving, but once they start learning to fly they forget much of what they learned about driving. This is called catastrophic forgetting.

ACE-Brain-0 is a new kind of "super-student" (an AI model) that solves this problem using a clever three-step strategy called Scaffold-Specialize-Reconcile.

Here is how it works, using a simple analogy:

1. The Core Idea: "Spatial Intelligence" is the Universal Language

The researchers realized that whether you are driving a car, flying a drone, or moving a robot arm, you need to understand 3D space. You need to know:

Where am I?
How far is that object?
If I move forward, will I hit something?

They call this Spatial Intelligence. It's like a universal "skeleton" or "scaffold" that all these different machines need to stand on.

2. The Three-Step Training Recipe (SSR)

Instead of mixing everything up, ACE-Brain-0 follows a specific recipe:

Step 1: Build the Scaffold (The "Architect")

First, they train the AI only on spatial puzzles. They teach it to understand 3D shapes, distances, and how objects move in space.

Analogy: Think of this as training a master architect. This architect doesn't know how to drive or fly yet, but they are an expert at understanding blueprints, gravity, and how buildings fit together. This creates a strong, shared foundation.

Step 2: Specialize (The "Apprentices")

Next, they take that master architect and train separate "apprentices" for specific jobs, using the architect's knowledge as a base.

The Driver Apprentice: Takes the spatial knowledge and learns traffic rules.
The Drone Apprentice: Takes the spatial knowledge and learns wind patterns and aerial navigation.
The Robot Apprentice: Takes the spatial knowledge and learns how to grab cups and open doors.
Why this works: Because they all started with the same "Architect" knowledge, they don't have to relearn how 3D space works. They just learn the specific rules of their job. This prevents them from forgetting the basics.

Step 3: Reconcile (The "Merging")

Now, the researchers have a separate expert for each job, but they want one brain that can do all four jobs. Instead of mixing their training data (which causes confusion), they use a mathematical trick to merge their brains without needing any of the training data.

Analogy: Imagine you have four different chefs. One is great at Italian food, one at Chinese, one at French, and one at Mexican. Instead of forcing them to cook all four cuisines at once (which would result in a messy stew), you take their recipes and mathematically blend their "flavor profiles" into one "Master Chef" who knows how to cook all four perfectly.
This step combines the skills without the "forgetting" or "confusion" that usually happens.

3. The Final Polish: Reinforcement Learning

Finally, they let this "Master Chef" practice on real-world scenarios and give it feedback (like a coach saying "Good job!" or "Try that again"). This sharpens its decision-making skills.

Why is this a Big Deal?

Before this, AI models were usually "jacks of all trades, masters of none," or they were great at one thing but terrible at others.

ACE-Brain-0 proved that if you build a strong Spatial Foundation first, you can teach a single AI to:

Drive a car safely.
Fly a drone through a city.
Manipulate objects with a robot arm.
Solve complex 3D puzzles.

It matches or beats previous models on these jobs, and it doesn't forget how to do one job when it learns another. It's like having a single brain that can be a pilot, a driver, and a handyman all at the same time, because they all share the same understanding of how the physical world works.

1. Problem Statement

The development of Universal Embodied Intelligence—a single AI brain capable of operating across heterogeneous physical embodiments (e.g., autonomous vehicles, drones/UAVs, and humanoid robots)—faces three critical challenges when training a unified model:

Gradient Interference: Jointly training on mixed data from different domains often leads to conflicting gradients, where optimization for one task degrades performance in another.
Catastrophic Forgetting: Sequentially fine-tuning a model on specific domains (e.g., training on driving then drones) causes the model to overwrite previously learned capabilities.
Long-Tail Data & Domain Dilution: Heterogeneous embodiments have vastly different morphologies and action spaces. Simply mixing data often results in "average" performance that lacks the specialization required for safety-critical tasks in any single domain.
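
The first two challenges can be made concrete with a toy example. The sketch below is illustrative only: it assumes a two-parameter "model" and two quadratic losses whose optima (`target_driving`, `target_aerial`) are made-up values, not anything from the paper. A negative cosine similarity between the two task gradients is one common way to detect gradient interference, and the rise in driving loss after sequential training on the aerial task illustrates catastrophic forgetting.

```python
import math

def grad(theta, target):
    """Gradient of the toy loss 0.5 * ||theta - target||^2."""
    return [t - g for t, g in zip(theta, target)]

def loss(theta, target):
    return 0.5 * sum((t - g) ** 2 for t, g in zip(theta, target))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def train(theta, target, steps=200, lr=0.1):
    """Plain gradient descent on a single task."""
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta, target))]
    return theta

# Hypothetical optima for two domains that share the same two parameters.
target_driving = [1.0, 0.2]
target_aerial = [-1.0, 0.3]
theta0 = [0.0, 0.0]

# Gradient interference: the two task gradients point in opposing directions.
conflict = cosine(grad(theta0, target_driving), grad(theta0, target_aerial))

# Catastrophic forgetting: learn driving, then train only on the aerial task.
theta_drive = train(theta0, target_driving)
loss_before = loss(theta_drive, target_driving)
theta_seq = train(theta_drive, target_aerial)
loss_after = loss(theta_seq, target_driving)

print(f"gradient cosine similarity: {conflict:.2f}")
print(f"driving loss before / after aerial training: {loss_before:.3g} / {loss_after:.3g}")
```

In a real MLLM the same diagnostic would be computed over per-domain mini-batch gradients rather than closed-form ones.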

The core question addressed is: How can we unify spatial reasoning, autonomous driving, low-altitude sensing, and embodied manipulation within a single foundation model without sacrificing domain-specific proficiency?

2. Methodology: The Scaffold-Specialize-Reconcile (SSR) Paradigm

The authors propose ACE-Brain-0, a generalist foundation brain built upon a Multimodal Large Language Model (MLLM) architecture. The core innovation is the Scaffold-Specialize-Reconcile (SSR) training paradigm, which decouples shared structural learning from domain specialization.

A. Architectural Foundation

Model: A unified MLLM (based on Qwen3-VL) that accepts diverse inputs (single-view images, multi-view images, videos) and natural language instructions.
Output: Autoregressive generation of text, reasoning traces, spatial descriptions, or action sequences depending on the task.
Input Processing: Visual features are extracted via a Vision Encoder and projected into the LLM embedding space, organized conceptually by domain (General, Spatial, Driving, Aerial, Embodied).
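
The input path described above can be sketched as a projection followed by concatenation. This is a minimal illustration, not the Qwen3-VL connector: the summary does not specify the projector design, so a single linear layer is assumed here, and all names (`project`, `weight`, `bias`) and the tiny dimensions are hypothetical.

```python
def project(features, weight, bias):
    """Map each visual feature (length d_vis) to the LLM width d_llm.

    weight is a d_vis x d_llm matrix stored as a list of rows.
    """
    columns = list(zip(*weight))  # d_llm columns, each of length d_vis
    return [
        [sum(f * w for f, w in zip(feat, col)) + b for col, b in zip(columns, bias)]
        for feat in features
    ]

# Two visual patch features from a (hypothetical) vision encoder, d_vis = 3.
patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
# Assumed linear projector into an LLM embedding space of width d_llm = 2.
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.1]

visual_tokens = project(patch_features, weight, bias)
# One text-token embedding for the instruction, already in LLM space.
text_embeddings = [[0.3, 0.3]]
# The LLM then consumes visual and text tokens as a single sequence.
sequence = visual_tokens + text_embeddings
print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the design is that, after projection, images, multi-view inputs, and video frames all become ordinary tokens, so one autoregressive decoder can serve every domain.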

B. The SSR Training Pipeline

The training process consists of five distinct stages:

Stage 1: Spatial Scaffold Training (The Foundation)
- Goal: Establish a shared, domain-agnostic "spatial scaffold."
- Method: Train a base model on general data, then fine-tune it exclusively on large-scale Spatial Intelligence datasets (e.g., VSI, SAT, MindCube).
- Insight: Spatial intelligence (3D mental modeling, object relations, geometry) serves as a universal prior. All embodiments (cars, drones, robots) rely on understanding 3D space, making this a transferable "scaffold."
Stage 2: Supervised Specialized Expert Fine-Tuning (Isolation)
- Goal: Cultivate domain-specific experts without interference.
- Method: Initialize separate expert models from the Spatial Scaffold ($\theta_{spatial}$) and fine-tune them independently on their respective domains:
  - $\theta_{AD}$ : Autonomous Driving (perception, planning).
  - $\theta_{UAV}$ : Low-Altitude Sensing (navigation, aerial reasoning).
  - $\theta_{Embodied}$ : Robotic manipulation and interaction.
- Benefit: This isolation prevents gradient interference during the specialization phase.
Stage 3: Across-Embodiment Reconcile (Data-Free Merging)
- Goal: Synthesize the experts into a single unified model.
- Method: Use data-free model merging (specifically an optimization-based approach using Task Vectors). The method approximates the linear subspace of fine-tuning data for each expert and merges them by minimizing task interference.
- Key Technique: Instead of simple averaging, the method optimizes the merged parameters to minimize the distance between the merged model's behavior and the individual experts' behaviors on their respective data distributions (without requiring the data itself).
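
To make the idea of data-free merging with task vectors concrete, here is a small sketch for a single linear layer. The summary does not state the exact objective, so the one used here is an assumption: each task vector $\tau_k = \theta_k - \theta_{spatial}$ stands in for the input subspace of its expert's data, and the merged update $\tau_m$ is found by gradient descent on $\sum_k \lVert(\tau_m - \tau_k)\tau_k^\top\rVert_F^2 / \lVert\tau_k\rVert_F^2$. All function names and the 2x2 weights are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def axpy(a, b, alpha):
    """Element-wise a + alpha * b for two equally shaped matrices."""
    return [[x + alpha * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frob2(a):
    return sum(x * x for row in a for x in row)

def interference(tau_m, taus):
    """Assumed objective: how much tau_m deviates from each expert on its own subspace."""
    total = 0.0
    for tau in taus:
        diff = axpy(tau_m, tau, -1.0)
        total += frob2(matmul(diff, transpose(tau))) / frob2(tau)
    return total

def merge(taus, steps=200, lr=0.1):
    """Gradient descent on the interference objective, starting from the average."""
    n = len(taus)
    tau_m = [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*taus)]
    for _ in range(steps):
        updated = tau_m
        for tau in taus:
            diff = axpy(tau_m, tau, -1.0)
            g = matmul(matmul(diff, transpose(tau)), tau)  # (tau_m - tau) tau^T tau
            updated = axpy(updated, g, -lr * 2.0 / frob2(tau))
        tau_m = updated
    return tau_m

# Toy weights: a shared scaffold and two experts fine-tuned from it.
theta_spatial = [[0.5, 0.5], [0.5, 0.5]]
theta_ad = [[1.5, 0.5], [0.5, 0.5]]
theta_uav = [[0.5, 0.5], [0.5, 1.5]]

taus = [axpy(theta, theta_spatial, -1.0) for theta in (theta_ad, theta_uav)]
tau_avg = [[sum(vals) / len(taus) for vals in zip(*rows)] for rows in zip(*taus)]
tau_merged = merge(taus)
theta_merged = axpy(theta_spatial, tau_merged, 1.0)

print("interference, simple average:", interference(tau_avg, taus))
print("interference, optimized merge:", interference(tau_merged, taus))
```

In this toy case the two experts changed non-overlapping directions, so the optimized merge keeps both updates at full strength, whereas simple averaging halves each of them. No training data is touched at any point; only the expert weights are used.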
Stage 4: Embodied Supervised Fine-Tuning (SFT)
- Goal: Refine the merged model for fine-grained embodied interaction.
- Method: Apply SFT on large-scale embodied and ego-centric multimodal data to strengthen task planning and action prediction capabilities.
Stage 5: Reinforcement Learning with GRPO
- Goal: Align the model for decision quality and complex reasoning.
- Method: Apply Group Relative Policy Optimization (GRPO). The model samples multiple responses to a query, and the policy is optimized based on relative rewards within the group, enhancing the model's ability to handle multi-step planning and uncertainty.
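
The "relative rewards within the group" step can be sketched as follows. In the standard GRPO formulation, each sampled response's advantage is its reward normalized by the mean and standard deviation of its group. The reward values below are made up, and the reward function, group size, and clipping settings used for ACE-Brain-0 are not given in this summary.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one query
# (for example, 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group average receive positive advantages and are reinforced; the rest are pushed down. When every response in a group earns the same reward, all advantages are zero and that query contributes no learning signal, which is why no separate value network is needed to provide a baseline.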

3. Key Contributions

Spatial Intelligence as a Universal Scaffold: The paper empirically demonstrates that spatial cognition is not just a standalone task but a structural prior that significantly boosts learning across diverse physical domains (AD, UAV, Robotics).
The SSR Training Paradigm: A novel framework that solves the stability-plasticity dilemma by separating the learning of shared spatial structures from domain-specific skills, and reconciling them via data-free merging. This avoids both gradient interference (joint training) and catastrophic forgetting (sequential training).
ACE-Brain-0 Model: The release of a generalist foundation model that achieves state-of-the-art (SOTA) or competitive performance across 24 benchmarks spanning four distinct physical domains.

4. Experimental Results

ACE-Brain-0 was evaluated on 24 benchmarks across four categories, achieving state-of-the-art or competitive results against both closed-source models (GPT-4o, Gemini 2.5-Pro) and specialized open-source embodied brains.

Spatial Intelligence:
- SAT: 92.0% (vs. 79.3% for Gemini 2.5-Pro).
- MindCube: 82.1% (vs. 57.6% for Gemini 2.5-Pro).
- BLINK: 83.9% (vs. 81.8% for Gemini 2.5-Pro).
Autonomous Driving:
- NuPlanQA: 91.7% (SOTA).
- MME-RealWorld: 71.2%.
- DriveAction: 81.3%.
Low-Altitude (UAV) Sensing:
- AirCopBench: 70.3% (SOTA).
- UrbanVideo-Bench: 56.9%.
Embodied Interaction:
- EmbSpatial-Bench: 77.3%.
- EgoPlan-Bench2: 55.3% (SOTA).

Ablation Studies confirmed that:

Initializing experts from the Spatial Scaffold yields massive gains (+19.3% in AD, +16.5% in UAV) compared to initializing from a generic base model.
Data-free merging (Reconcile) outperforms simple weight averaging and sequential training, effectively synthesizing complementary skills without catastrophic forgetting.

5. Significance and Future Outlook

Theoretical Impact: The paper provides a principled blueprint for Cross-Embodiment Learning, suggesting that physical intelligence can be organized hierarchically: a shared geometric core (scaffold) supports diverse morphological specializations.
Practical Impact: ACE-Brain-0 demonstrates that a single model can replace multiple specialized agents, reducing deployment complexity for multi-robot fleets or hybrid systems (e.g., a system controlling both drones and ground robots).
Future Directions: The authors plan to extend ACE-Brain-0 to closed-loop visuomotor policies (vision-language-action, VLA), incorporate physics-aware continuous prediction, and develop continual learning mechanisms for lifelong capability accumulation across new embodiments (e.g., legged robots, underwater vehicles).

In summary, ACE-Brain-0 provides evidence that by treating spatial intelligence as a universal scaffold and using a specialize-then-reconcile training pipeline, it is possible to build a generalist embodied AI that performs strongly across diverse physical domains without compromising on domain-specific expertise.