Push Anything: Single- and Multi-Object Pushing From First Sight with Contact-Implicit MPC

Imagine a robot arm that doesn't just pick things up and move them (like a human grabbing a cup), but instead pushes them around like a game of air hockey or a game of pool. This is called "non-prehensile manipulation." It's incredibly hard for robots because pushing is messy: things slide, they get stuck, they bump into each other, and they might tip over.

This paper introduces a new system called "Push Anything" that teaches a robot how to push almost any object, even when there are many of them cluttered on a table, without needing to know exactly how heavy or slippery they are beforehand.

Here is the breakdown of how they did it, using some everyday analogies:

1. The Problem: The "Local Minima" Trap

Imagine you are trying to push a heavy box across a room to a specific spot. You are standing right next to it. If you just push it straight, it might get stuck against a wall.

Old Robots: They were like people who only looked at the immediate inch in front of their nose. They would push the box, hit a wall, get stuck, and give up. They couldn't see that if they walked around to the other side of the box and pushed it from there, the whole puzzle would solve itself.
The Challenge: In a room full of furniture (clutter), figuring out the right angle to push a specific item so it slides past three other items is a math nightmare. The number of possibilities explodes.

2. The Solution: The "Smart Scout" Strategy

The authors combined two ideas to solve this:

A. The Scout (Sampling)
Instead of just pushing from where the robot arm currently is, the system acts like a scout. It quickly imagines, "What if I walked over to this spot on the table and pushed from there? Or that spot?"

It picks a few random "good spots" to stand in.
It checks: "If I stand here, can I push the object to the goal?"
It picks the best spot, walks there (without touching anything), and then starts pushing. This helps the robot escape the "traps" where it would get stuck.

B. The Brain (C3+ Algorithm)
Once the robot is in the right spot, it needs to figure out exactly how to push. This is where the new algorithm, C3+, comes in.

The Old Way (C3): Imagine trying to solve a giant, tangled knot of string. The old method tried to untie the whole knot at once. It was slow and often got stuck.
The New Way (C3+): The new method is like having a pair of scissors. It cuts the knot into tiny, separate pieces. It solves each tiny piece instantly (using a simple math trick) and then stitches them back together.
The Result: This makes the slowest step of the robot's thinking thousands of times faster (roughly 6,000 times in a four-object test). It can now handle complex scenarios with 4 or more objects moving around each other in real-time, which was previously impossible.

3. The "Eyes" (Perception)

Before pushing, the robot needs to know what it's looking at.

The Process: The robot takes a video of the objects. It uses AI to trace their outlines (like a digital artist tracing a photo) and builds a 3D model of them, even if they are weird shapes like a letter "R" or a bottle of lotion.
The Tracking: As the robot pushes, the objects move and might hide behind each other. The system is smart enough to keep track of them, like a referee in a game of tag who never loses sight of the players, even when they run behind a tree.

4. The Results: Real-World Success

The team tested this on a real robot arm (a Franka Panda) with 33 different objects, from 3D-printed letters to household items.

Success Rate: It worked 98% of the time.
Speed:
- Moving 1 object: ~30 seconds.
- Moving 2 objects: ~1.5 minutes.
- Moving 3 objects: ~3 minutes.
- Moving 4 objects: ~5 minutes.
The "Push Anything" feat: They successfully rearranged 4 different objects on a table into a neat line, something that would have stumped previous robots.

The Big Picture

Think of this paper as teaching a robot the art of billiards.

Old robots could only hit the cue ball straight at the target. If the target was blocked, they failed.
This new robot looks at the whole table, calculates the angles, realizes it needs to hit the cue ball into the cushion first to bounce it around the obstacles, and then sink the target. It does this fast enough that it can play the game in real-time, even with a crowded table.

In short: They built a robot that can look at a messy table, figure out the best way to push things around to tidy it up, and actually do it without getting stuck.

Here is a detailed technical summary of the paper "Push Anything: Single- and Multi-Object Pushing From First Sight with Contact-Implicit MPC."

1. Problem Statement

The paper addresses the challenge of non-prehensile manipulation (pushing) for robots in cluttered, multi-object environments. Key difficulties include:

Unknown Physical Properties: Objects often have unknown geometries, masses, and inertias.
Contact-Rich Dynamics: Manipulation involves complex interactions (sticking, slipping, separating) between objects, the environment, and the robot.
Scalability: Traditional methods struggle with the combinatorial explosion of contact modes as the number of objects increases.
Limitations of Prior Work: Previous Contact-Implicit Model Predictive Control (CI-MPC) approaches were limited to single-object tasks with known CAD models or failed in multi-object scenarios due to computational intractability and local minima traps.

2. Methodology: The "Push Anything" Pipeline

The authors propose a fully integrated pipeline that operates in real-time, spanning perception, planning, and control.

A. Perception and Reconstruction

Mesh Reconstruction: Uses a RealSense D455 camera and BundleSDF to reconstruct watertight 3D meshes of arbitrary objects from a single video scan.
Robust Tracking: Employs FoundationPose for multi-object tracking, enhanced by XMem for periodic mask re-registration to correct drift caused by occlusions. It also includes logic to resolve pose ambiguity (e.g., for symmetric objects).

B. Control Framework: Sampling-Based CI-MPC

The system follows a two-stage approach to overcome the locality limitations of standard CI-MPC:

Global Exploration (Sampling): The system samples candidate end-effector positions on the object surfaces. For each candidate, it solves a local CI-MPC problem. The candidate yielding the lowest cost is selected, and the robot first moves to that position via a collision-free path before executing the pushing trajectory.
Local Planning (CI-MPC): Once at the sampled position, the controller optimizes a trajectory that explicitly reasons about contact forces and modes.
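
The two-stage idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `ring_candidates`, `toy_cost`, and `select_best` are hypothetical names, candidates are placed evenly on a circle around a single object rather than sampled on its surface, and `toy_cost` is only a stand-in for the cost that a local CI-MPC solve would return.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def ring_candidates(center: Point, radius: float, n: int) -> List[Point]:
    """Evenly spaced candidate positions on a circle around an object.

    Hypothetical stand-in for sampling on the object's surface.
    """
    cx, cy = center
    return [
        (cx + radius * math.cos(2.0 * math.pi * i / n),
         cy + radius * math.sin(2.0 * math.pi * i / n))
        for i in range(n)
    ]


def toy_cost(candidate: Point, obj: Point, goal: Point, radius: float) -> float:
    """Placeholder for the cost of a local CI-MPC solve (assumption).

    Uses the squared distance to the point directly behind the object,
    as seen from the goal. A real system would solve an MPC problem here.
    """
    dx, dy = goal[0] - obj[0], goal[1] - obj[1]
    norm = math.hypot(dx, dy)
    ideal = (obj[0] - radius * dx / norm, obj[1] - radius * dy / norm)
    return (candidate[0] - ideal[0]) ** 2 + (candidate[1] - ideal[1]) ** 2


def select_best(candidates: List[Point],
                cost_fn: Callable[[Point], float]) -> Tuple[Point, float]:
    """Evaluate every candidate and return the lowest-cost one with its cost."""
    costs = [cost_fn(c) for c in candidates]
    best_index = min(range(len(candidates)), key=costs.__getitem__)
    return candidates[best_index], costs[best_index]


# Toy scene: object at the origin, goal one metre along +x.
obj, goal, radius = (0.0, 0.0), (1.0, 0.0), 0.1
candidates = ring_candidates(obj, radius, n=8)
best, best_cost = select_best(candidates, lambda c: toy_cost(c, obj, goal, radius))
print(best, best_cost)
```

In this toy scene the selected candidate sits directly behind the object relative to the goal. A real controller would then plan a collision-free move to that position before starting the push.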

C. Core Innovation: Consensus Complementarity Control Plus (C3+)

The primary algorithmic contribution is C3+, an enhanced version of the Consensus Complementarity Control (C3) algorithm.

Linearization: The nonlinear contact dynamics are approximated using a Linear Complementarity System (LCS).
Slack Variable Reformulation: C3+ introduces a slack variable ($\eta$) to decouple the complementarity constraints. This transforms the problem into a consensus form solvable via the Alternating Direction Method of Multipliers (ADMM).
Computational Efficiency:
- Quadratic Step: Solves a convex Quadratic Program (QP) for the global dynamics.
- Projection Step: The critical improvement. By decoupling constraints, the projection step (which was a costly Mixed-Integer QP in C3) becomes a set of independent 1D problems with closed-form analytical solutions.
- Result: This reduces the projection step time by several orders of magnitude (roughly 6,000x in the 4-object timing comparison), enabling real-time performance even with many contact pairs.
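
To see why independent 1D problems admit a closed form, consider the unweighted Euclidean projection of a pair $(\lambda_i, \eta_i)$ onto the complementarity set $\{\lambda_i \ge 0,\ \eta_i \ge 0,\ \lambda_i \eta_i = 0\}$. The sketch below is an illustration under that assumption, not the C3+ projection itself, which may use weighted norms or a different parameterization. The function name `project_complementarity` is hypothetical.

```python
import numpy as np


def project_complementarity(lam, eta):
    """Project each pair (lam[i], eta[i]) onto {a >= 0, b >= 0, a * b = 0}.

    Unweighted Euclidean projection (assumption), solved independently per pair.
    """
    lam = np.asarray(lam, dtype=float)
    eta = np.asarray(eta, dtype=float)

    # Branch 1: keep lam (clipped at zero) and set eta to zero.
    lam_keep = np.maximum(lam, 0.0)
    cost_keep_lam = (lam - lam_keep) ** 2 + eta ** 2

    # Branch 2: keep eta (clipped at zero) and set lam to zero.
    eta_keep = np.maximum(eta, 0.0)
    cost_keep_eta = lam ** 2 + (eta - eta_keep) ** 2

    # Pick the closer branch for every pair; ties go to branch 1.
    choose_lam = cost_keep_lam <= cost_keep_eta
    lam_proj = np.where(choose_lam, lam_keep, 0.0)
    eta_proj = np.where(choose_lam, 0.0, eta_keep)
    return lam_proj, eta_proj


lam_proj, eta_proj = project_complementarity([2.0, -1.0, 0.5], [1.0, 3.0, -2.0])
print(lam_proj, eta_proj)
```

Each pair is resolved with a clip and a comparison, with no integer search over contact modes. That structure is what makes a step of this kind cheap.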

3. Key Contributions

Push Anything Pipeline: A complete system capable of scanning, tracking, and pushing diverse, unknown objects in real-time, including multi-object clutter.
C3+ Algorithm: A significantly faster CI-MPC solver that enables reasoning over complex contact networks (up to 19 contact pairs) in real-time, making multi-object rearrangement tractable.
Hardware Validation: Extensive experiments demonstrating high-precision manipulation on a Franka Emika Panda robot across a wide variety of object geometries.

4. Experimental Results

The system was tested on a Franka Panda arm with a spherical end-effector across 33 unique objects (letters, household items) in 1-, 2-, 3-, and 4-object scenarios.

Success Rates:
- Single-Object: 99.9% success rate (700/701 trials).
- Multi-Object: 92.5% overall success rate (210/227 trials) across 2, 3, and 4-object tasks.
- Overall: 98% success rate (910/928 trials) across all 33 objects.
Performance Metrics:
- Time-to-Goal: Average times were approximately 0.5 min (1 obj), 1.6 min (2 objs), 3.2 min (3 objs), and 5.3 min (4 objs).
- Precision: Achieved tight pose tolerances (translational error $\le$ 2 cm, rotational error $\le$ 0.1 rad).
Speed Comparison (C3 vs. C3+):
- While the quadratic step in C3+ is slightly slower, the projection step is drastically faster.
- For a 4-object task, the projection step time dropped from an average of 44 ms (C3) to 0.007 ms (C3+), a speedup of roughly 6,000x.
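
The figures quoted in this section can be cross-checked with simple arithmetic. The snippet below only recomputes ratios from the trial counts and timings listed above; the variable names are illustrative.

```python
import math

# Trial counts as (successes, trials), taken from the summary above.
single_object = (700, 701)
multi_object = (210, 227)

single_rate = single_object[0] / single_object[1]
multi_rate = multi_object[0] / multi_object[1]
overall_rate = (single_object[0] + multi_object[0]) / (
    single_object[1] + multi_object[1]
)

# Projection-step timings for the 4-object task, in milliseconds.
c3_ms, c3_plus_ms = 44.0, 0.007
speedup = c3_ms / c3_plus_ms
orders_of_magnitude = math.log10(speedup)

print(f"single: {single_rate:.1%}, multi: {multi_rate:.1%}, overall: {overall_rate:.1%}")
print(f"projection speedup: {speedup:.0f}x "
      f"(about {orders_of_magnitude:.1f} orders of magnitude)")
```

Pooling the single- and multi-object trials reproduces the overall rate, and the timing ratio matches the quoted speedup of roughly 6,000x.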

5. Significance and Impact

Solving the "Multi-Object" Bottleneck: Prior CI-MPC methods were effectively limited to single objects due to computational complexity. C3+ breaks this barrier, allowing robots to de-clutter and rearrange complex scenes dynamically.
Real-World Applicability: By integrating real-world scanning and robust tracking, the system moves beyond idealized CAD simulations to handle "first sight" manipulation of unknown objects.
Algorithmic Efficiency: The closed-form solution for the projection step in C3+ represents a major advancement in contact-implicit control, making high-dimensional, contact-rich trajectory optimization feasible for real-time hardware execution.

6. Limitations and Future Work

Perception: Performance degrades when objects heavily occlude one another, limiting tracking accuracy. Future work aims to improve multi-view tracking.
Physics Modeling: The system assumes identical mass and inertia for all objects. Scaling to highly diverse physical properties requires online model learning.
High-Level Planning: The current system lacks high-level task planning (e.g., deciding the order of moving objects A vs. B). Future iterations aim to integrate higher-level reasoning.
3D Extension: Current work is limited to planar (2D) pushing; extending to 3D non-prehensile manipulation is a future goal.