From Translation to Superset: Benchmark-Driven… — Plain-Language Explanation

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have an incredibly powerful, high-performance race car built with a very strict, complex language called Rust. It's safe, fast, and built to last, but it's also heavy, expensive to maintain, and only a few specialized mechanics know how to fix it.

Now, imagine you want to build a new version of this car using Python, a language that is more flexible, easier for more people to understand, and has a massive community of mechanics. The goal isn't just to copy the old car; it's to translate it so well that the new Python car can do everything the Rust car can do, but eventually, become even better.

This paper is the story of how the team at JPMorgan Chase did exactly that with their CODEX CLI, an AI coding agent (a robot that writes and fixes code for you).

Here is the breakdown of their journey, using simple analogies:

1. The Challenge: The "One-Time Move" Problem

Usually, when companies move software from one language to another, it's like moving houses. You pack everything up, move it, and then you're done. If the old house gets a new room added next week, you have to manually build that room in the new house. It's slow, boring, and error-prone.

The Innovation: Instead of a one-time move, they built a living bridge. They used an AI (a Large Language Model) to act as a translator that works continuously. Every time the original Rust code gets an update, the AI translates just the changed parts into Python. It's like having a magical translator that instantly rewrites your diary into a new language every time you add a sentence.

2. The Secret Sauce: "The Test Drive" (Benchmarks as Objective Functions)

How do you know the new Python car is as good as the Rust one? You can't just look under the hood (unit tests); you have to drive it.

Old Way: Check if the engine parts are the right shape.
New Way: Take the car to the Test Tracks (called Terminal-Bench and SWE-bench). Each track poses 80 difficult challenges, like "solve a maze," "crack a code," or "fix a broken website."

The team used these test tracks as their scorecard. If the Python car failed a challenge, they didn't just guess why; they looked at the crash, fixed the specific part that broke, and drove it again. This "Benchmark-Driven Debugging" was so effective that the Python version actually started solving more problems than the original Rust version in some areas.

3. The Results: Smaller, Faster, and Smarter

The results were surprising:

The Size: The original Rust code was a massive 648,000 lines of code (like a 600-page manual). The new Python version is only 41,000 lines (a 40-page booklet). That's a 16x reduction in size!
The Speed: You might think Python is slower, but for an AI agent, the slowest part is waiting for the AI to think (which takes seconds). The Python code's own overhead is so small that it accounts for less than 0.1% of the total wait time. It's like adding one second to a trip that already takes more than fifteen minutes; you would never notice it.
The Performance: On the test tracks, the Python version solved 73.8% of the software engineering tasks (beating Rust's 70.0%) and came very close on the terminal tasks (42.5% vs 47.5%).

4. The "Superset" Surprise: From Copy to Upgrade

The most exciting part is that the Python version didn't just stop at being a copy. Once they proved it could do everything the Rust car could do, they started adding new features that the Rust car never had.

They added a "Superset" module with 30 new superpowers, such as:

Multi-agent orchestration: One agent can hire other agents to help with a task.
Semantic memory: The agent remembers the meaning of past conversations, not just the text.
Voice mode: You can talk to the agent.
Cost tracking: It keeps a running tab of how much money the AI is spending.

Think of it like buying a basic model car, proving it drives just as well as the luxury model, and then realizing you can now easily add a sunroof, a sound system, and a GPS because the new language makes it so much easier to tinker with.

5. What They Learned (The "Aha!" Moments)

Tests aren't enough: You can have more than 2,600 passing unit tests (checking if the engine turns on), but if the car can't navigate a real city (the benchmark), it's useless. Real-world tests found bugs that the simple tests missed.
The "Silent" Bugs: They found bugs where the car would just stop working without saying "I'm broken" (like a silent WebSocket failure). Only by running the full test tracks did they see these silent failures.
Language Choice: For AI agents, the "bottleneck" is the AI's thinking time, not the code's execution speed. So, using a more expressive, easier language (Python) is actually the smarter choice because it allows for faster innovation and easier maintenance.

The Bottom Line

This paper argues that you don't have to choose between "safe and rigid" (Rust) and "flexible and quick to change" (Python). By using AI to translate code continuously and using real-world challenges to guide the process, you can migrate a massive system, keep it in sync with the original, and then evolve it into something even more powerful.

It's not just a translation; it's an evolution.

1. Problem Statement

Cross-language migration of large, rapidly evolving software systems is a persistent engineering challenge. Traditional migrations are labor-intensive, error-prone, and typically "one-shot" efforts, leading to permanent divergence between the source and target codebases.

Specific Context: The authors address the migration of CODEX CLI, a production-grade AI coding agent originally written in Rust (648K lines of code, 65 crates).
Motivation: The team sought to migrate to Python to improve iteration velocity, leverage the dominant AI/ML ecosystem, and lower the barrier for contributors.
Challenge: How to translate a complex, safety-critical system while maintaining functional parity and enabling continuous synchronization with upstream Rust updates without manual rework.

2. Methodology: Benchmark-as-Objective-Function

The paper proposes a novel methodology where public agent benchmarks serve as the primary objective function for translation, rather than relying solely on unit tests or static analysis.

LLM-Assisted Continuous Translation:
- Instead of a one-time transpilation, the process is a continuous loop: Track (Rust upstream via git submodule) → Diff (extract changes) → Translate (LLM translates only changed modules) → Validate (Benchmark regression testing).
- The LLM is instructed to produce idiomatic Python (e.g., mapping Rust's Result<T,E> to exceptions, Tokio to asyncio, serde to Pydantic) rather than mechanical transpilation.
The Objective Function:
- Unit Tests: 2,621 unit tests were used for basic validation but proved insufficient for detecting integration-level bugs.
- Agent Benchmarks: The team used Terminal-Bench (80 complex terminal tasks) and SWE-bench Verified as the ground truth. The benchmark score acted as a "loss function." If a translation caused a regression in the benchmark score, the LLM was prompted to refine the code based on the specific failure mode.
Architecture for Parity and Superset:
- The Python port maintains a strict 1:1 parity mode (all enhancements disabled) for fair comparison.
- It utilizes a codex.enhancements module with 30 feature-flagged extensions (e.g., multi-agent orchestration, semantic memory) that are additive and do not interfere with the parity baseline.
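
To make the idiom mapping above concrete, here is a minimal sketch of what "idiomatic rather than mechanical" could look like for one small function. The Rust signature in the comment and every name (`ConfigError`, `Settings`, `parse_settings`, `load_settings`) are hypothetical and not taken from CODEX CLI. A standard-library dataclass stands in for Pydantic so the snippet has no third-party dependencies.

```python
import asyncio
import json
from dataclasses import dataclass


class ConfigError(Exception):
    """Plays the role of the `E` in a Rust `Result<T, E>` (hypothetical name)."""


@dataclass(frozen=True)
class Settings:
    # Stand-in for a serde-derived struct; the summary says these map to
    # Pydantic models, a dataclass is used here only to stay stdlib-only.
    model: str
    max_turns: int = 10


def parse_settings(raw: str) -> Settings:
    # Hypothetical Rust shape:
    #   fn parse_settings(raw: &str) -> Result<Settings, ConfigError>
    # Python shape: return the value on success, raise on failure.
    try:
        data = json.loads(raw)
        return Settings(model=data["model"], max_turns=int(data.get("max_turns", 10)))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError) as exc:
        raise ConfigError(f"invalid settings: {exc}") from exc


async def load_settings(raw: str) -> Settings:
    # A Tokio `async fn` becomes an asyncio coroutine. The sleep(0) is a
    # placeholder for real non-blocking I/O such as reading a file.
    await asyncio.sleep(0)
    return parse_settings(raw)
```

A caller would run `asyncio.run(load_settings('{"model": "m"}'))` and handle `ConfigError` with an ordinary `try`/`except`, where the Rust caller would match on `Ok` and `Err`.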

3. Key Contributions

Benchmark-Driven Debugging: Demonstrated that public benchmarks validate cross-language translations more thoroughly than unit tests alone. Benchmarks revealed critical issues invisible to unit tests, including:
- API protocol mismatches (e.g., sending invalid content-item types causing HTTP 400s).
- Environment pollution (pip installation conflicts in Docker).
- Silent failure modes (WebSocket returning empty responses).
- Tool availability gaps.
From Parity to Superset: Showed that the translation methodology allows the target system to evolve beyond the source. The Python port is now a capability superset, adding features like voice mode, persistent plans, and IDE bridges that do not exist in the Rust original.
Continuous Upstream Synchronization: Described an architecture that allows the Python port to absorb daily Rust commits automatically via an LLM-assisted diff-translate-test loop.
Comprehensive Empirical Evaluation: Provided a multi-dimensional analysis across code complexity, test coverage, runtime performance, and end-to-end agent benchmarks.
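
The diff-translate-test loop described under Continuous Upstream Synchronization can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `<crate>/src/...` path layout, the `translate` and `run_benchmark` callables, and the retry limit are all hypothetical stand-ins for the LLM call and the benchmark harness.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Iterable, Optional


def changed_crates(paths: Iterable[str]) -> list[str]:
    """Diff step: reduce changed file paths to the Rust crates they touch.

    Assumes a hypothetical `<crate>/src/...` layout.
    """
    crates = {PurePosixPath(p).parts[0] for p in paths if p.endswith(".rs")}
    return sorted(crates)


@dataclass
class SyncResult:
    crate: str
    accepted: bool
    attempts: int
    score: float


def sync_crate(
    crate: str,
    translate: Callable[[str, Optional[str]], str],
    run_benchmark: Callable[[str], tuple[float, str]],
    baseline: float,
    max_attempts: int = 3,
) -> SyncResult:
    """Translate one crate and accept it only if the benchmark does not regress.

    `translate(crate, feedback)` returns candidate Python source.
    `run_benchmark(source)` returns (score, failure_summary).
    """
    feedback: Optional[str] = None
    score = 0.0
    for attempt in range(1, max_attempts + 1):
        candidate = translate(crate, feedback)
        score, failures = run_benchmark(candidate)
        if score >= baseline:
            return SyncResult(crate, True, attempt, score)
        # Regression: feed the observed failure mode back into the next prompt.
        feedback = failures
    return SyncResult(crate, False, max_attempts, score)
```

Injecting the two callables keeps the control flow testable without a model or a Docker harness. A rejected result would leave the previous Python module in place until a later attempt passes.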

4. Results

A. Performance and Parity

Code Reduction: The migration achieved a 15.9× reduction in code size (from 648K LOC in Rust to ~41K LOC in Python).
Complexity: Cyclomatic complexity improved, with 89% of Python functions achieving the minimal complexity rank (A), compared to the more complex Rust implementation.
Runtime Overhead:
- Startup time is ~3–5× slower (53.9ms vs. compiled Rust), but this is negligible (<0.1%) compared to LLM API latency (1–10s).
- Local tool execution overhead is negligible (~30µs for orchestration, ~3–7ms for shell execution).
Benchmark Accuracy:
- Terminal-Bench: Python achieved 42.5% accuracy vs. Rust's 47.5%. The gap was attributed to specific API crash fixes and non-deterministic LLM behavior.
- SWE-bench Verified: Python achieved 73.8% (59/80 tasks) vs. Rust's 70.0% (56/80 tasks), slightly outperforming the original.
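
The headline figures in this subsection can be re-derived with a few lines of arithmetic. The sketch below uses only numbers quoted above. The Terminal-Bench task counts are inferred from the percentages and the 80-task size rather than stated in the summary, and the session-length calculation is an illustration, since the summary does not say over what interval the <0.1% figure is measured.

```python
RUST_LOC = 648_000
PYTHON_LOC = 41_000  # the summary gives "~41K", so the ratio is approximate

# Code reduction: about 15.8 with the rounded 41K figure (reported as 15.9x).
loc_ratio = RUST_LOC / PYTHON_LOC

# SWE-bench Verified pass rates from the task counts.
swe_python_pct = 100 * 59 / 80   # 73.75, reported as 73.8%
swe_rust_pct = 100 * 56 / 80     # 70.0
swe_task_gap = 59 - 56           # a difference of three tasks

# Terminal-Bench: percentages converted back to task counts (inferred).
tb_python_tasks = round(0.425 * 80)
tb_rust_tasks = round(0.475 * 80)

# Per-call orchestration overhead relative to the fastest quoted API latency.
ORCHESTRATION_S = 30e-6
API_LATENCY_MIN_S = 1.0
API_LATENCY_MAX_S = 10.0
orchestration_pct = 100 * ORCHESTRATION_S / API_LATENCY_MIN_S

# Startup cost relative to a single API call, and the total API time a
# session would need for startup to fall below 0.1%.
STARTUP_S = 0.0539
startup_pct_of_fast_call = 100 * STARTUP_S / API_LATENCY_MIN_S
startup_pct_of_slow_call = 100 * STARTUP_S / API_LATENCY_MAX_S
api_seconds_for_0_1_pct = STARTUP_S / 0.001

print(f"LOC ratio: {loc_ratio:.1f}x")
print(f"SWE-bench: {swe_python_pct:.2f}% vs {swe_rust_pct:.2f}% ({swe_task_gap} tasks)")
print(f"Terminal-Bench tasks: {tb_python_tasks} vs {tb_rust_tasks}")
print(f"Orchestration share of a 1 s call: {orchestration_pct:.4f}%")
print(f"Startup share of one call: {startup_pct_of_slow_call:.2f}% to {startup_pct_of_fast_call:.2f}%")
print(f"API time needed for startup < 0.1%: {api_seconds_for_0_1_pct:.1f} s")
```

Read this way, the per-call orchestration cost is far below 0.1% of even the fastest call, while the one-time 53.9 ms startup drops below 0.1% only once a session has accumulated roughly 54 seconds of API time. The SWE-bench lead amounts to three tasks out of 80.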

B. Bugs Discovered via Benchmarks

The benchmark-driven approach uncovered four critical bugs that unit tests missed:

WebSocket Transport: Silent empty responses when API quotas were exhausted, causing the agent to falsely mark tasks as complete.
Memory Extraction: A misconfigured model identifier causing 404 errors.
Variable Initialization: A NameError in the cost tracker crashing specific tasks.
API 400 Recovery: The Rust agent crashed on invalid content-item types; the Python port implemented a systematic 400 recovery layer (handling invalid parameters, context overflow, etc.), making it more robust than the original.
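
Two of the fixes above, the 400 recovery layer and the silent empty WebSocket response, lend themselves to a short sketch. Everything here is an assumption made for illustration: the error codes, the allow-list of item types, the `Request` shape, and the `send` callable are hypothetical and do not reflect the real API or the port's actual code.

```python
from dataclasses import dataclass
from typing import Callable


class ApiError(Exception):
    def __init__(self, status: int, code: str):
        super().__init__(f"HTTP {status}: {code}")
        self.status = status
        self.code = code


class EmptyResponseError(Exception):
    """Raised instead of treating an empty reply as a finished task."""


@dataclass
class Request:
    items: list[dict]  # conversation items, each carrying a "type" key


# Hypothetical allow-list of content-item types.
ALLOWED_TYPES = {"message", "function_call", "function_call_output"}


def recover(request: Request, error: ApiError) -> Request:
    """Return a repaired request for a known HTTP 400 cause, else re-raise."""
    if error.status != 400:
        raise error
    if error.code == "invalid_content_type":
        # Drop items the API would reject.
        return Request([i for i in request.items if i.get("type") in ALLOWED_TYPES])
    if error.code == "context_overflow":
        # Keep only the most recent half of the conversation.
        half = max(1, len(request.items) // 2)
        return Request(request.items[-half:])
    raise error


def send_with_recovery(
    request: Request, send: Callable[[Request], str], max_retries: int = 2
) -> str:
    for attempt in range(max_retries + 1):
        try:
            reply = send(request)
        except ApiError as err:
            if attempt == max_retries:
                raise
            request = recover(request, err)
            continue
        if not reply.strip():
            # Surface the silent failure rather than marking the task complete.
            raise EmptyResponseError("transport returned an empty response")
        return reply
    raise RuntimeError("no attempts were made (max_retries < 0)")
```

The key design point is that an empty reply becomes an explicit error, so the agent loop cannot mistake an exhausted quota for a completed task.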

C. Feature Expansion

The Python port added 30 feature-flagged capabilities absent in Rust, including:

Multi-agent orchestration and hierarchical task delegation.
Semantic memory and persistent planning.
Cost tracking and guardian safety assessments.
Voice mode and IDE bridges.
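
A feature-flag layer with a strict parity default might look like the sketch below. The flag names cover only four of the 30 capabilities and are hypothetical, as are the `Enhancements` class and the `run_turn` function; the summary does not describe the actual interface of the enhancements module.

```python
from dataclasses import dataclass

# Hypothetical flag names for four of the listed capabilities.
KNOWN_FLAGS = frozenset({"multi_agent", "semantic_memory", "cost_tracking", "voice_mode"})


@dataclass(frozen=True)
class Enhancements:
    enabled: frozenset = frozenset()  # empty set means strict parity mode

    @classmethod
    def parity(cls) -> "Enhancements":
        return cls()

    @classmethod
    def with_flags(cls, *names: str) -> "Enhancements":
        unknown = set(names) - KNOWN_FLAGS
        if unknown:
            raise ValueError(f"unknown flags: {sorted(unknown)}")
        return cls(frozenset(names))

    def is_on(self, name: str) -> bool:
        return name in self.enabled


def run_turn(prompt: str, flags: Enhancements) -> dict:
    # Baseline path: identical regardless of which flags are enabled.
    result = {"reply": f"echo: {prompt}"}
    # Enhancements only add fields; they never change the baseline output.
    if flags.is_on("cost_tracking"):
        result["cost_usd"] = 0.0  # placeholder value
    return result
```

Because every enhancement is additive and off by default, running with `Enhancements.parity()` exercises exactly the baseline path, which is what a fair comparison against the Rust original needs.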

5. Significance and Implications

Language Choice for AI Agents: The paper argues that for LLM-based agents where the bottleneck is API latency (seconds), the performance benefits of Rust are outweighed by Python's expressiveness and ecosystem. Python offers a 15.9× code reduction and faster iteration with negligible performance cost in this specific domain.
Methodological Shift: It establishes benchmarks as first-class objective functions for software engineering tasks like translation, moving beyond "test-driven development" to "benchmark-driven evolution."
Sustainable Migration: The "living bridge" architecture shows that cross-language ports can be maintained continuously against fast-moving upstreams, avoiding the "fork and abandon" problem common in traditional migrations.
Superset Evolution: The work demonstrates that a translated system need not be a static replica; it can immediately become a more capable platform, leveraging the target language's ecosystem to add value beyond the original implementation.

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python