Imagine you are trying to teach a robot how to cook, clean, or build things. In the past, to see if a robot was actually good at these tasks, you had to build a real kitchen, a real living room, or a real factory floor, hire a team of people to set up the scene, watch the robot try, and then manually reset everything for the next test.
This is like trying to test a new video game by building a physical cardboard version of the game world, hiring actors to play the characters, and then having to rebuild the whole set every time the player makes a mistake. It's expensive, slow, and impossible to do thousands of times.
RobotArena ∞ is the solution to this problem. Think of it as a massive, magical video game engine that can instantly turn real-life robot videos into a digital simulation, allowing researchers to test robots at a speed and scale that were previously impossible.
Here is how it works, broken down into simple concepts:
1. The "Magic Camera" (Real-to-Sim Translation)
Usually, when researchers want to test a robot in a computer, they have to manually build the 3D world, place every cup and spoon, and program the physics. It takes weeks.
RobotArena ∞ uses a team of AI "magic cameras."
- The Input: You feed it a simple video of a robot doing a task in the real world (like "put the tomato in the pot").
- The Magic: The system automatically analyzes the video. It figures out where the camera was, what the objects look like in 3D, how heavy they are, and even how the robot's arm moves.
- The Output: In seconds, it builds a digital twin of that real-world scene inside a computer. It's like taking a photo of a room and instantly turning it into a playable video game level where the physics behave like the real thing. (A rough code sketch of this pipeline follows below.)
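If you like code, here is what the shape of such a pipeline looks like in Python. Every function and field name below is invented purely for illustration, and the "AI" stages are placeholders that return fixed values; in the real system, each stage is a learned model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative only -- NOT the real RobotArena code. It just shows the
# shape of a real-to-sim pipeline: a video goes in, a simulatable
# "digital twin" of the scene comes out.

Pose = Tuple[float, float, float]  # simplified: just an (x, y, z) position

@dataclass
class ObjectAsset:
    name: str        # e.g. "tomato"
    mass_kg: float   # estimated physical property
    pose: Pose       # where the object sits in the reconstructed scene

@dataclass
class DigitalTwin:
    camera_pose: Pose
    objects: List[ObjectAsset] = field(default_factory=list)
    robot_trajectory: List[Pose] = field(default_factory=list)

# --- placeholder stages; in a real system each would be a learned model ---

def estimate_camera_pose(frames) -> Pose:
    return (0.0, 0.0, 1.5)  # placeholder: where the camera was

def detect_objects(frames) -> List[str]:
    return ["tomato", "pot"]  # placeholder: what is in the scene

def estimate_mass(name: str) -> float:
    return {"tomato": 0.15, "pot": 1.2}.get(name, 0.5)  # rough guesses in kg

def estimate_pose(frames, name: str) -> Pose:
    return (0.3, 0.0, 0.0) if name == "tomato" else (0.6, 0.1, 0.0)

def track_robot_arm(frames) -> List[Pose]:
    return [(0.0, 0.0, 0.4), (0.3, 0.0, 0.1)]  # placeholder arm path

def build_digital_twin(frames) -> DigitalTwin:
    """Chain the stages: video -> camera, objects, physics, robot motion."""
    twin = DigitalTwin(camera_pose=estimate_camera_pose(frames))
    for name in detect_objects(frames):
        twin.objects.append(ObjectAsset(
            name=name,
            mass_kg=estimate_mass(name),
            pose=estimate_pose(frames, name),
        ))
    twin.robot_trajectory = track_robot_arm(frames)
    return twin

if __name__ == "__main__":
    print(build_digital_twin(frames=[]))
```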
2. The "Stress Test" (Perturbations)
Once the digital world is built, the researchers don't just let the robot play normally. They want to see if the robot is truly smart or just memorized the specific scene.
Imagine you teach a student to solve a math problem on a specific piece of paper. If you change the font, the color of the paper, or move the numbers slightly, can they still solve it?
- RobotArena ∞ does this automatically. It changes the background wallpaper, shifts the colors of the objects, or moves the cups to different spots.
- It forces the robot to face thousands of "what-if" scenarios instantly. If the robot fails when the background changes, that is a strong sign it was just "cheating" by memorizing the background rather than actually understanding the task. (A small sketch of how such variants can be generated follows this list.)
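To make the idea concrete, here is a tiny sketch of taking one reconstructed scene and automatically spinning out hundreds of randomized variants. The scene fields, background names, and perturbation ranges are all assumptions chosen for illustration, not the system's actual settings.

```python
import random
from copy import deepcopy

# Illustrative only: one base scene, many automatic "what-if" variants.
BASE_SCENE = {
    "background": "kitchen_wall_01",
    "objects": {
        "tomato": {"color": (0.9, 0.1, 0.1), "position": (0.30, 0.00)},
        "pot":    {"color": (0.4, 0.4, 0.4), "position": (0.60, 0.10)},
    },
}

BACKGROUNDS = ["kitchen_wall_01", "wood_panel", "blue_tile", "office_grey"]

def perturb(scene: dict, rng: random.Random) -> dict:
    """Return a copy of the scene with background, colors, and positions changed."""
    new = deepcopy(scene)
    new["background"] = rng.choice(BACKGROUNDS)  # swap the "wallpaper"
    for obj in new["objects"].values():
        obj["color"] = tuple(
            min(1.0, max(0.0, c + rng.uniform(-0.2, 0.2)))  # shift the color
            for c in obj["color"]
        )
        x, y = obj["position"]
        obj["position"] = (x + rng.uniform(-0.1, 0.1),      # move it a little
                           y + rng.uniform(-0.1, 0.1))
    return new

rng = random.Random(0)
variants = [perturb(BASE_SCENE, rng) for _ in range(1000)]  # 1,000 stress tests
```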
3. The "Crowd-Sourced Judges" (Human Feedback)
How do you know if the robot did a good job?
- The Robot Judge: The system uses a super-smart AI (a Vision-Language Model) to watch the video and give it a score, like a referee.
- The Human Judges: This is the secret sauce. The system takes two videos of different robots trying the same task and shows them to regular people online (like a "Tinder" for robots).
- The humans just have to click: "Which one looked better?" or "Did they tie?"
- By collecting thousands of these simple "A vs. B" votes from regular people, the system builds a global leaderboard (like an Elo rating in chess) that ranks which robot is best, without needing a single robotics expert to watch every second. (A minimal sketch of how such a leaderboard is computed follows below.)
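For the curious, here is a minimal sketch of how an Elo-style leaderboard can be built from simple "A vs. B" votes. The K-factor and starting rating are common chess-style defaults, not necessarily the exact values used by RobotArena ∞.

```python
from collections import defaultdict

K = 32          # how much a single vote can move a rating
START = 1000.0  # every robot policy starts here

ratings = defaultdict(lambda: START)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(a: str, b: str, outcome: str) -> None:
    """outcome: 'a' if A looked better, 'b' if B did, 'tie' otherwise."""
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Example: three crowd votes comparing two robot policies on the same task.
record_vote("robot_A", "robot_B", "a")
record_vote("robot_A", "robot_B", "tie")
record_vote("robot_A", "robot_B", "b")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # the leaderboard
```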
What Did They Discover?
When they ran this massive test on the world's best robot brains (called VLAs), they found some surprising things:
- They aren't "Generalists" yet: Most robots are like students who only studied for one specific test. If you change the test slightly (like moving the objects), they fail. They haven't learned the concept of the task; they just memorized the specific video they were trained on.
- The "Spatial Paradox": Some robots that were trained with cameras on their wrists (seeing from the hand's perspective) were much better at understanding 3D space than robots that were explicitly taught 3D geometry. It turns out, seeing the world from different angles naturally teaches the robot better spatial skills than trying to force math rules on it.
- The "Overfitting" Problem: Many robots failed when the background changed. They were relying on the background to know what to do, rather than the object itself.
Why This Matters
Before RobotArena ∞, testing robots was like trying to measure the speed of a car by driving it on a different, bumpy road every single day. You couldn't compare them fairly.
Now, we have a standardized, infinite racetrack that can be changed instantly. We can test thousands of robots, in thousands of different conditions, using the wisdom of the crowd to decide who wins. This allows us to move faster toward the day when robots can truly be "generalists"—helpers that can walk into any house, understand any task, and do it safely, no matter how messy the room is.
In short: RobotArena ∞ turns the slow, expensive, and dangerous process of testing robots into a fast, cheap, and scalable video game, helping us build smarter machines for the real world.