Demystifying Action Space Design for Robotic Manipulation Policies

This paper presents a large-scale empirical study demonstrating that the design of a robotic manipulation policy's action space significantly impacts learning performance, revealing that predicting delta actions improves results while joint-space and task-space representations offer complementary benefits for stability and generalization, respectively.

Yuchun Feng, Jinliang Zheng, Zhihao Wang, Dongxiu Liu, Jianxiong Li, Jiangmiao Pang, Tai Wang, Xianyuan Zhan

Published 2026-03-02

Imagine you are teaching a robot to do chores, like picking up a cup or stacking blocks. You want the robot to learn by watching you do it (this is called "Imitation Learning").

For years, researchers have been trying to make these robots smarter by giving them more data and bigger brains (better AI models). But one crucial question kept getting overlooked: how exactly do we tell the robot how to move?

This paper is like a massive, scientific "taste test" to figure out the best way to give instructions to a robot arm. The authors ran over 13,000 real-world experiments with actual robots to find the "secret sauce" for robot control.

Here is the breakdown of their findings using simple analogies:

1. The Two Big Questions

The researchers realized that giving a robot an instruction is like giving directions to a friend. You have to decide two things:

  • The "Map" (Spatial Abstraction): Do you tell the robot, "Move your shoulder and elbow to these specific angles" (Joint Space), or do you say, "Move your hand to these specific X, Y, Z coordinates in the room" (Task Space)?
  • The "Step" (Temporal Abstraction): Do you say, "Go to the cup" (Absolute), or do you say, "Move your hand 2 inches to the right" (Delta/Relative)?

2. The "Map" Dilemma: Angles vs. Coordinates

  • Joint Space (The Angles): Imagine telling a dancer, "Bend your knee 45 degrees, rotate your hip 30 degrees."
    • Pros: It's very stable. The robot knows exactly how its own body works.
    • Cons: It's hard to learn. The robot has to figure out how those angles translate to moving a cup in the real world. It's like trying to navigate a city by only knowing how many steps to take with each foot, without looking at the street signs.
  • Task Space (The Coordinates): Imagine telling the dancer, "Walk to the red chair."
    • Pros: It's intuitive. The robot sees the cup and knows where to go.
    • Cons: It can be shaky. If the math used to convert "red chair" into "knee angles" is slightly off, the robot might miss the target or get stuck.
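
To make the "angles vs. coordinates" conversion concrete, here is a minimal sketch for a flat, two-joint arm. It is not from the paper: the link lengths and the function names (forward_kinematics, inverse_kinematics) are assumptions chosen for illustration, and a real arm has more joints and a harder conversion.

```python
import math

# Hypothetical link lengths in meters for a planar two-joint arm (not from the paper).
UPPER_ARM = 0.30
FOREARM = 0.25

def forward_kinematics(shoulder, elbow):
    """Joint space -> task space: joint angles (radians) to hand position (x, y)."""
    x = UPPER_ARM * math.cos(shoulder) + FOREARM * math.cos(shoulder + elbow)
    y = UPPER_ARM * math.sin(shoulder) + FOREARM * math.sin(shoulder + elbow)
    return x, y

def inverse_kinematics(x, y):
    """Task space -> joint space: returns one of the two possible elbow poses."""
    cos_elbow = (x * x + y * y - UPPER_ARM**2 - FOREARM**2) / (2 * UPPER_ARM * FOREARM)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # clamp to guard against rounding
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        FOREARM * math.sin(elbow), UPPER_ARM + FOREARM * math.cos(elbow)
    )
    return shoulder, elbow

joint_action = (math.pi / 2, -math.pi / 2)         # "bend these joints to these angles"
task_action = forward_kinematics(*joint_action)    # "put the hand at this point"
recovered_joints = inverse_kinematics(*task_action)
round_trip = forward_kinematics(*recovered_joints)
```

In this sketch the recovered joint angles differ from the original ones even though the hand ends up at the same point. That is one reason the coordinate-to-angle step can be a source of trouble: the same hand position can correspond to more than one arm pose.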

The Verdict:

  • If you are training a robot on one specific arm with lots of data and time, Joint Space (Angles) usually wins. It's like a professional dancer who knows their body perfectly.
  • If you want the robot to learn quickly or switch between different robots (like moving from a small arm to a big arm), Task Space (Coordinates) is better. It's like a GPS that works on any car, regardless of the brand.

3. The "Step" Dilemma: Destination vs. Direction

This was the paper's biggest discovery.

  • Absolute (Destination): "Go to the cup."
  • Delta (Direction): "Move your hand 2 inches toward the cup."

The Analogy:
Imagine you are walking through a foggy forest.

  • Absolute (Destination): You try to guess the exact location of the campfire from where you are standing. A guess that is only slightly wrong can still send you well off course, and over a long path those errors pile up until you are lost.
  • Delta (Direction): You just take one small step toward the fire. Then you look again and take another small step. Even if you stumble a little, you can correct it on the next step.

The Verdict:
Delta (Direction) is almost always better.
The paper found that telling the robot "move a little bit" (Delta) is much more stable and easier to learn than telling it "go to this exact spot" (Absolute). It prevents the robot from getting confused by small errors.
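
As a rough illustration of the two labeling styles, the sketch below turns one recorded demonstration into "Absolute" and "Delta" training targets. The function names and the toy trajectory (hand positions in inches) are assumptions made for this example, not the paper's code or data.

```python
def absolute_actions(hand_positions):
    """Each action is the next position the hand should reach."""
    return [list(nxt) for nxt in hand_positions[1:]]

def delta_actions(hand_positions):
    """Each action is the offset from the current position to the next one."""
    return [
        [n - c for c, n in zip(cur, nxt)]
        for cur, nxt in zip(hand_positions, hand_positions[1:])
    ]

def apply_delta(current_position, delta):
    """Execute a delta action from wherever the hand currently is."""
    return [c + d for c, d in zip(current_position, delta)]

# Toy demonstration: (x, y) hand positions in inches, one entry per time step.
demo = [[0, 0], [2, 0], [4, 1], [6, 1]]

abs_labels = absolute_actions(demo)
delta_labels = delta_actions(demo)

# Replaying the delta labels from the starting position retraces the demo.
position = demo[0]
for delta in delta_labels:
    position = apply_delta(position, delta)
```

Note that the delta labels are all small moves of similar size, while the absolute labels range over the whole workspace. Both describe the same motion; only the way it is expressed to the policy differs.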

4. The "Chunking" Secret

Modern robots don't just predict one move; they predict a whole sequence of moves at once (like a video clip of movement). The researchers found that how you link these moves matters.

  • Bad Way: Linking moves like a chain reaction (Step-wise). Each move is calculated relative to the move before it, so if the first link is slightly wrong, that error is carried into every later move and the robot drifts off course.
  • Good Way: Linking moves as a single block (Chunk-wise). Every move in the sequence is calculated relative to the start of the sequence. If one part is slightly off, it doesn't ruin the whole plan.
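
The difference between the two linking styles can be sketched in a few lines. This is one reading of the description above, using a single coordinate and made-up numbers; the function names are hypothetical and the paper's exact formulation may differ.

```python
def encode_stepwise(start, targets):
    """Step-wise: each delta is measured from the previous target."""
    deltas, previous = [], start
    for target in targets:
        deltas.append(target - previous)
        previous = target
    return deltas

def decode_stepwise(start, deltas):
    """Rebuild positions by adding deltas one after another (a running sum)."""
    positions, current = [], start
    for delta in deltas:
        current += delta
        positions.append(current)
    return positions

def encode_chunkwise(start, targets):
    """Chunk-wise: every delta is measured from the start of the chunk."""
    return [target - start for target in targets]

def decode_chunkwise(start, deltas):
    """Rebuild positions by adding each delta to the same starting point."""
    return [start + delta for delta in deltas]

start, targets = 10, [12, 14, 16, 18]   # toy 1-D positions
step = encode_stepwise(start, targets)
chunk = encode_chunkwise(start, targets)

# Simulate a prediction that is off by 1 on the first move only.
noisy_step = [step[0] + 1] + step[1:]
noisy_chunk = [chunk[0] + 1] + chunk[1:]

step_errors = [abs(p - t) for p, t in zip(decode_stepwise(start, noisy_step), targets)]
chunk_errors = [abs(p - t) for p, t in zip(decode_chunkwise(start, noisy_chunk), targets)]
```

With the step-wise encoding, the single mistake shows up in every later position, because each one is built on top of the last. With the chunk-wise encoding, only the first position is affected.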

The Final "Cheat Sheet" for Robot Designers

Based on 13,000+ tries, here is the recipe for the best robot policy:

  1. Default to "Delta" instructions: Tell the robot to "move a little bit" rather than "go to a specific spot." It's almost always more stable.
  2. Use "Chunk-wise" grouping: Predict a whole block of moves at once, but calculate them all relative to the start of that block.
  3. Choose your "Map" based on your goal:
    • For maximum performance on one specific robot: Use Joint Space (angles) + Delta (direction). This is the "Power User" combo.
    • For robots that need to work on different machines or learn fast: Use Task Space (coordinates) + Delta (direction). This is the "Generalist" combo.

Why This Matters

Before this paper, people were guessing which method worked best. Some used old methods from 10 years ago; others tried random new tricks. This study provides a clear, scientific rulebook. It tells us that the way we "speak" to robots is just as important as the robot's "brain" itself. By speaking the right language (Delta + Chunking), we can make robots that are more reliable, learn faster, and actually get the job done.
