Imagine you are a judge in a cooking competition.
In most AI competitions today, the judges only care about one thing: Did the cake taste good? If the cake is delicious, the AI gets a gold star. If it tastes bad, it gets nothing. They don't care how the AI made the cake. Did it use a secret family recipe? Did it invent a new way to mix ingredients? Or did it just follow a recipe from a 1950s cookbook perfectly?
The paper "InnoGym" argues that this is a flawed way to measure true intelligence. Just because an AI can copy a perfect recipe doesn't mean it's innovative. True genius isn't just about getting the right answer; it's about finding a new, better, or more creative way to get there.
Here is a simple breakdown of what the researchers built to fix this.
1. The New Scorecard: Taste vs. Creativity
The researchers created a new system called InnoGym (Innovation Gym). Instead of just one score, they give AI agents two scores for every task:
- The "Taste" Score (Performance Gain): Did the AI actually solve the problem better than anyone else? Did it make the cake sweeter, lighter, or faster to bake?
- The "Creativity" Score (Novelty): Did the AI use a completely different method? Did it invent a new whisking technique instead of just copying the old one?
The Analogy: Imagine two runners.
- Runner A runs a marathon in 2 hours using the exact same training plan as the world record holder. They get a high "Taste" score but a low "Creativity" score.
- Runner B runs the marathon in 2 hours and 10 minutes, but they invented a brand-new running style that no one has ever seen. They get a high "Creativity" score but a lower "Taste" score.
- InnoGym wants to find the runner who does both: runs fast and invents a new style.
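The blog-level description above boils down to two numbers per task. The paper's exact formulas aren't reproduced here, so the sketch below uses assumed definitions: performance gain as relative improvement over the best known result (where higher scores are better), and novelty as one minus the agent's highest similarity to any known method. Both function names and the similarity inputs are illustrative, not InnoGym's actual API.

```python
def performance_gain(agent_score: float, human_best: float) -> float:
    """Relative improvement over the best known result (assumed definition).

    Positive means the agent beat the human best; zero means it only matched it.
    """
    return (agent_score - human_best) / abs(human_best)


def novelty(similarities_to_known_methods: list[float]) -> float:
    """1.0 = nothing like any known method; 0.0 = an exact copy (assumed definition)."""
    return 1.0 - max(similarities_to_known_methods)


# Runner A: matches the best known score by reusing the record holder's plan.
gain_a = performance_gain(agent_score=0.80, human_best=0.80)  # 0.0: no gain
nov_a = novelty([0.95, 0.40])                                  # low novelty

# Runner B: slightly worse result, but unlike any known method.
gain_b = performance_gain(agent_score=0.75, human_best=0.80)  # negative gain
nov_b = novelty([0.10, 0.05])                                  # high novelty
```

Under this toy scoring, Runner A gets zero gain and little novelty, Runner B gets high novelty but negative gain; the agent InnoGym is looking for would score well on both.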
2. The Playground: 18 Real-World Challenges
To test these AI agents, the researchers didn't use simple math problems with a single known answer (the "solve for x" kind). They built a gym with 18 complex, real-world challenges.
Think of these as engineering puzzles that humans have been struggling with for years. Examples include:
- Packing Circles: How do you fit the maximum number of circles into a square without them overlapping? (Like trying to pack as many pizzas as possible into a small delivery box).
- Drug Discovery: How do you predict which chemical combinations might cure a disease?
- Traffic Optimization: How do you manage traffic lights in a massive city to stop jams?
These are "Improvable Tasks": we know the current best answers, but we also know they aren't optimal, so there is still measurable room for an agent to do better.
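What makes a task like circle packing "improvable" is that any candidate answer can be checked mechanically: a packing is either feasible or it isn't, and a feasible one is scored by how many circles it fits. Below is a hypothetical feasibility check for equal circles of radius `r` in the unit square; the function name and conventions are illustrative, not the benchmark's actual code.

```python
import math


def is_valid_packing(centers: list[tuple[float, float]], r: float) -> bool:
    """Check that every circle fits in the unit square and no two overlap."""
    # Every circle must sit fully inside the unit square...
    for x, y in centers:
        if not (r <= x <= 1 - r and r <= y <= 1 - r):
            return False
    # ...and no two circle centers may be closer than 2r (tangency allowed,
    # with a small tolerance for floating-point error).
    for i, (x1, y1) in enumerate(centers):
        for x2, y2 in centers[i + 1:]:
            if math.hypot(x1 - x2, y1 - y2) < 2 * r - 1e-12:
                return False
    return True


# Four circles of radius 0.25 tile the unit square exactly.
grid = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
print(is_valid_packing(grid, 0.25))  # True
```

Because checking is cheap and the best known packings are published, an agent's attempt can be scored instantly, and any packing that fits even one more circle is an objective improvement.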
3. The Experiment: What Happened?
The researchers put several top-tier AI agents into this gym to see if they could be innovative. Here is what they found:
- The "Copycat" Problem: Most AIs were great at following instructions but terrible at being creative. They could often get close to the human best score, but they did it by tweaking existing methods, not inventing new ones.
- The "Wild Idea" Trap: Some AIs tried very creative, wild new methods. They got high "Creativity" scores! But, because their methods were so experimental, they often failed to produce a working solution. They got high creativity but zero "Taste."
- The Big Lesson: Creativity without reliability is useless. In the real world, you don't just want a new idea; you want a new idea that actually works. The biggest gap in current AI isn't a lack of imagination; it's a lack of robustness (the ability to stick the landing).
4. The Toolkit: iGym
To make sure the tests were fair, the researchers built a special environment called iGym.
- Think of iGym as a standardized laboratory. Before, if you tested an AI on a computer in New York, it might work differently than on a computer in Tokyo because of different software or hardware.
- iGym puts every AI in the exact same digital room with the exact same tools. This ensures that if an AI fails, it's because the AI is bad, not because the test was rigged.
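The "same digital room" idea amounts to pinning down everything that could vary between runs. None of the names below come from the paper; this is just a sketch, under assumptions, of the kind of spec a standardized harness has to fix: the software image, the compute budget, and the random seed, so two runs of the same agent on the same task produce the same result.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentSpec:
    image: str       # identical software stack for every run
    cpu_cores: int   # identical compute budget
    seed: int        # identical randomness


def evaluate(agent, task, spec: EnvironmentSpec) -> float:
    """Run one agent on one task instance drawn under a fixed seed."""
    rng = random.Random(spec.seed)  # same seed -> same task instance
    instance = task.sample(rng)
    return agent.solve(instance)


SPEC = EnvironmentSpec(image="igym-base:1.0", cpu_cores=8, seed=42)
```

With the spec frozen like this, a score difference between two agents can only come from the agents themselves, which is the whole point of the standardized lab.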
Why Does This Matter?
For a long time, we've been asking AI: "Can you solve this?"
InnoGym asks: "Can you solve this better and differently than we ever have before?"
The paper concludes that while AI is getting very good at solving problems, it still struggles to be a true innovator. It's like a student who can memorize the textbook perfectly but hasn't yet learned how to write a new chapter. The future of AI isn't just about being correct; it's about being creative and reliable at the same time.
In short: InnoGym is a new gym where we don't just check if the AI can lift the weight; we check if it can invent a new way to lift it that makes the weight feel lighter for everyone else.