TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks

This paper introduces TML-Bench, a benchmark for evaluating the end-to-end correctness and reliability of autonomous coding agents on Kaggle-style tabular machine learning tasks, demonstrating that the MiniMax-M2.1 model achieves the best aggregate performance across four competitions under varying time budgets.

Mykola Pinchuk

Published 2026-03-09

Imagine you are hiring a team of robot chefs to cook a meal using a specific set of ingredients (a spreadsheet of data). Your goal isn't just to see who can make the best dish if they get lucky once; you want to know who can consistently cook a delicious meal, every single time, within a strict time limit, without peeking at the recipe book or asking a friend for help.

This paper, TML-Bench, is a giant cooking competition designed to test exactly that. Here is the breakdown in simple terms:

1. The Setting: The "Cook-Off"

In the world of data science, there are many "Kaggle competitions" where people try to predict things (like "Will this customer quit?" or "How many shoes will we sell?"). Usually, people just look at the final score.

But the author, Mykola Pinchuk, realized that's like judging a chef only on one lucky plate. What if the chef got lucky? What if they crashed the computer halfway through? What if they cheated by looking up the answer online?

TML-Bench is a new, stricter test. It sets up a "clean kitchen" where:

  • The Robots (AI Agents): 10 different AI models are the chefs.
  • The Ingredients: Four real-world data problems (like predicting customer churn or foot traffic).
  • The Rules: The robots have to load the data, clean it, train a model, and submit an answer.
  • The Catch: They have no internet access (no Googling the answer) and their "knowledge" stops before the competition started (so they can't have memorized the test).
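The load → clean → train → submit loop the rules describe can be sketched in a few lines. This is a toy illustration, not a task from the benchmark: the table, column names ("tenure", "spend", "churn"), and values are all invented, and a tiny in-memory table stands in for the real CSV files so the sketch runs on its own.

```python
# Sketch of the load -> clean -> train -> submit loop an agent must perform.
# All data here is invented for illustration; real tasks read CSV files.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# "Load": a toy churn-style training table and an unlabeled test table.
train = pd.DataFrame({
    "tenure": [1, 24, 3, 36, 2, 48],
    "spend": [10.0, None, 5.0, 80.0, 7.0, 90.0],
    "churn": [1, 0, 1, 0, 1, 0],
})
test = pd.DataFrame({"tenure": [5, 30], "spend": [None, 70.0]})

# "Clean": impute missing values with the training-set median.
spend_median = train["spend"].median()
X = train[["tenure", "spend"]].fillna(spend_median)
X_test = test[["tenure", "spend"]].fillna(spend_median)

# "Train" and "Submit": fit a model, write predictions to submission.csv.
model = LogisticRegression().fit(X, train["churn"])
submission = pd.DataFrame({"id": test.index, "churn": model.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```

The no-internet rule means everything above has to come from the agent's own knowledge and the locally available libraries.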

2. The Three Time Limits

The competition tests the robots under three different time pressures, like a cooking show with different rounds:

  • The Sprint (240 seconds / 4 mins): Can they make a decent meal quickly?
  • The Standard (600 seconds / 10 mins): Can they improve the dish with a bit more time?
  • The Marathon (1200 seconds / 20 mins): Can they really refine the recipe if given plenty of time?

Note: The longest round also gave the robots a slightly different "instruction manual" (prompt), pointing them toward a specific cooking technique (XGBoost), to keep the comparison fair.
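A wall-clock budget like the three rounds above is typically enforced from outside the agent. The sketch below is a generic illustration using Python's `subprocess` timeout, not the benchmark's actual harness; the script names are hypothetical.

```python
# Generic sketch of enforcing a per-run wall-clock budget on an agent script.
# This mirrors the Sprint/Standard/Marathon rounds above but is an
# illustration, not the benchmark's actual harness code.
import subprocess
import sys

def run_with_budget(script_path: str, budget_s: float) -> bool:
    """Run a script in a subprocess; True iff it exits cleanly in time."""
    try:
        subprocess.run([sys.executable, script_path],
                       timeout=budget_s, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Either the budget expired or the script crashed: both count as
        # a failed run from the judge's point of view.
        return False
```

In the benchmark's terms, a Sprint run would be `run_with_budget(agent_script, 240)` and a Marathon run `run_with_budget(agent_script, 1200)`.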

3. The "Five-Taste" Rule (Reliability)

This is the most important part. In most tests, you run the AI once and see the score. In TML-Bench, each robot cooks the same dish five times.

  • Why? Because sometimes a robot might get a lucky break or a random glitch.
  • The Score: The paper doesn't look at the "best" score. It looks at the median (the middle score of the five attempts).
  • The Goal: If a robot gets a 10/10 once but fails four times, it's a bad chef. If it gets a 7/10 five times in a row, it's a reliable chef. The paper cares about consistency.
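The median-of-five rule above is simple arithmetic; here is a sketch with invented scores showing why one lucky run doesn't rescue four failures:

```python
# Sketch of the "median of five runs" scoring rule. The scores are made up:
# a flaky robot with one great run still gets a bad median.
from statistics import median

lucky_but_flaky = [10, 0, 0, 0, 0]  # one 10/10 plate, four crashes
steady = [7, 7, 7, 7, 7]            # the same decent dish every time

print(median(lucky_but_flaky))  # 0 -> unreliable chef
print(median(steady))           # 7 -> reliable chef
```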

4. The Secret Sauce: The "Hidden Label"

How do they know if the robot actually learned the pattern or just guessed?

  • The robots submit their answers to a "blind judge."
  • The judge has a secret list of correct answers (the "private holdout") that the robots never saw.
  • The robots are scored only on how well they predicted these hidden answers. This prevents them from "cheating" by overfitting to the data they were given.
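The "blind judge" amounts to comparing the agent's submission against labels it never saw. A minimal sketch, with invented labels and predictions:

```python
# Sketch of private-holdout scoring: the judge keeps the answers, the agent
# submits blind, and only the judge computes the score. Data is invented.
secret_labels = [1, 0, 1, 1, 0]      # the private holdout, never shown
agent_predictions = [1, 0, 0, 1, 0]  # the agent's blind submission

# Fraction of hidden answers the agent got right.
accuracy = sum(p == y for p, y in zip(agent_predictions, secret_labels)) / len(secret_labels)
print(accuracy)  # 0.8
```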

5. The Results: Who Won?

After all the cooking, the paper found:

  • The Star Chef: A model called MiniMax-M2.1-TEE was the most consistent and highest-scoring chef across all four competitions.
  • Time Matters: Generally, giving the robots more time helped them cook better, but not always. Some robots got stuck in a loop or got confused, while others just got better.
  • Reliability is Key: Some models had high scores but were very "jittery" (sometimes great, sometimes terrible). Others were steady. The paper argues that for real-world use, steady is better than lucky.

6. Why This Matters

Think of it like buying a car.

  • Old Benchmarks: "This car hit 100 mph once on a perfect track!" (Great, but what if it breaks down every other day?)
  • TML-Bench: "This car drove 50 miles every day for a week, in the rain and sun, and never stalled."

The paper concludes that for AI to be useful in real businesses (like banks or retail), we need to stop celebrating "lucky runs" and start measuring reliability, consistency, and the ability to work under time pressure.

In a nutshell: TML-Bench is a stress test for AI data scientists to prove they aren't just lucky guessers, but reliable workers who can get the job done, every time, without cheating.