Imagine you are a chef who just invented a new, revolutionary recipe. Before you can serve it to the world, you need to know: Is it actually good? Does it taste like a pizza or a pancake? Did it burn the kitchen down?
In the world of Artificial Intelligence (AI), this "taste test" is called Evaluation. But right now, testing a new AI model is like trying to run a restaurant kitchen where every ingredient comes from a different country, every recipe is written in a different language, and you have to manually build your own stove, oven, and cutting board before you can even start cooking. It's slow, confusing, and prone to mistakes.
Enter One-Eval. Think of One-Eval as an AI Sous-Chef (a highly skilled assistant) that you can talk to in plain English. Instead of giving it complex code or spreadsheets, you just say, "Hey, I need to check if my new AI is good at math and can tell the truth."
Here is how One-Eval works, broken down into three simple steps:
1. The Translator (NL2Bench)
The Problem: You say, "Check my AI's math skills." But the system doesn't know which math test to use. Is it elementary school math? College calculus? Tricky riddles?
The One-Eval Solution: This is the Translator. It listens to your casual request and turns it into a professional shopping list.
- Analogy: Imagine you tell a travel agent, "I want a relaxing beach vacation." The agent doesn't just book a random flight; they figure out you probably want Hawaii, not a desert, and they book the specific resorts that fit your budget.
- What it does: It takes your sentence, figures out exactly what you mean, and picks the perfect "tests" (benchmarks) from a giant library of thousands of options. It even asks you, "Did you mean this specific test?" so you can tweak it before it starts.
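To make the Translator idea concrete, here is a toy sketch in Python. Everything in it (the `CATALOG` entries, the `suggest_benchmarks` helper, the keyword matching) is a hypothetical illustration; the real NL2Bench component matches requests against thousands of benchmarks with far richer understanding than word overlap.

```python
# Toy sketch: turn a casual request into a shortlist of benchmarks.
# CATALOG and suggest_benchmarks are illustrative names, not One-Eval's API.

CATALOG = {
    "gsm8k":      {"skills": {"math", "arithmetic"}},
    "math-comp":  {"skills": {"math", "calculus"}},
    "truthfulqa": {"skills": {"truth", "honesty"}},
}

def suggest_benchmarks(request: str) -> list[str]:
    """Rank catalog entries by how many skill keywords the request mentions."""
    words = set(request.lower().split())
    scored = []
    for name, info in CATALOG.items():
        overlap = len(words & info["skills"])
        if overlap:
            scored.append((overlap, name))
    # Highest keyword overlap first; ties keep catalog order (stable sort).
    return [name for _, name in sorted(scored, key=lambda t: -t[0])]

shortlist = suggest_benchmarks("is my AI good at math and can it tell the truth")
print(shortlist)  # a shortlist the user can confirm or tweak before running
```

The final confirmation step ("Did you mean this specific test?") would then show `shortlist` to the user for approval, which is the same checkpoint pattern described later in this piece.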
2. The Logistics Manager (BenchResolve)
The Problem: Even if you know which test to use, getting the test materials is a nightmare. One test is on a website, another is in a zip file, and a third uses a weird format that your computer doesn't understand. You'd have to download them, rename files, and fix broken links manually.
The One-Eval Solution: This is the Logistics Manager. It does all the heavy lifting behind the scenes.
- Analogy: Imagine you order a custom-built house. Instead of you having to go to the lumber yard, the hardware store, and the plumbing supply shop to buy bricks and pipes, the builder goes out, buys everything, and delivers it all to your driveway, already sorted into neat piles.
- What it does: It automatically finds the right data files, downloads them, and "translates" them into a standard format so the AI can actually take the test. It makes sure the "stove" is built and the "ingredients" are prepped.
3. The Critic & Reporter (Metrics & Reporting)
The Problem: Usually, when you test an AI, you get a single number back, like "85%." That's like a teacher giving you a grade of "B" without telling you why you got it, or which questions you missed. It doesn't help you improve.
The One-Eval Solution: This is the Critic. It doesn't just give you a score; it gives you a full report card with advice.
- Analogy: Instead of just saying "You got a B," a great coach says, "You ran fast, but you tripped on the last hurdle. Also, your form was great on the left side but weak on the right. Here are three specific drills to fix it."
- What it does: It looks at how the AI failed. Did it lie? Did it get confused by a long sentence? Did it hallucinate (make things up)? It generates a colorful, easy-to-read report that tells you exactly where the AI is strong and where it needs to go back to school.
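The difference between a bare score and a report card can be shown in a few lines of Python. The failure categories and result format below are hypothetical; the real Critic produces a much richer report.

```python
# Toy sketch: summarize pass rate overall AND per failure category,
# like a report card instead of a single grade. Categories are invented
# for illustration.

from collections import Counter

def report(results: list[dict]) -> dict:
    """Return the overall score plus a breakdown of where the misses cluster."""
    overall = sum(r["correct"] for r in results) / len(results)
    misses = Counter(r["category"] for r in results if not r["correct"])
    return {"overall": overall, "weak_spots": dict(misses)}

results = [
    {"category": "arithmetic",   "correct": True},
    {"category": "arithmetic",   "correct": True},
    {"category": "long_context", "correct": False},
    {"category": "factuality",   "correct": False},  # e.g. a hallucination
]
print(report(results))
# {'overall': 0.5, 'weak_spots': {'long_context': 1, 'factuality': 1}}
```

A bare "50%" would hide the most useful fact here: both misses cluster in specific skills, which tells you exactly what to fix.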
The "Human-in-the-Loop" Safety Net
One-Eval is smart, but it knows it's not perfect. It has a Safety Checkpoint.
- Analogy: Think of it like a pilot flying a plane. The autopilot (One-Eval) can handle 99% of the flight, but before it lands, it asks the pilot (you), "Hey, I'm about to land on Runway 4. Is that okay?" If you say, "No, use Runway 2," the system instantly changes course.
- What it does: At key moments, it pauses and shows you its plan. You can approve it, edit it, or tell it to start over. This ensures you are always in control, even though the system does the boring work.
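The checkpoint pattern itself is simple enough to sketch. The plan structure and the approval callback below are assumptions made for this illustration, not One-Eval's real interface.

```python
# Toy sketch of a human-in-the-loop checkpoint: the system proposes a plan,
# and an explicit approval step gates execution. The human can approve,
# replace the plan, or reject it entirely.

def run_with_checkpoint(plan: list[str], approve) -> str:
    """Show the plan to a human callback; only proceed once it is approved."""
    decision = approve(plan)           # "yes", "no", or an edited plan
    if decision == "yes":
        return f"executing: {', '.join(plan)}"
    if isinstance(decision, list):     # the human edited the plan
        return f"executing: {', '.join(decision)}"
    return "plan discarded; starting over"

# The autopilot proposes Runway 4; the pilot redirects to Runway 2.
proposed = ["land on runway 4"]
print(run_with_checkpoint(proposed, lambda plan: ["land on runway 2"]))
# executing: land on runway 2
```

The important design choice is that execution never starts until `approve` returns: the boring work is automated, but the decision stays with the human.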
Why Does This Matter?
Before One-Eval, testing an AI was like trying to assemble IKEA furniture without the instructions, using tools you had to build yourself. It took days and required a PhD in carpentry.
With One-Eval, you just say, "Build me a table," and the system brings you the wood, the screws, the instructions, and the finished product, while pointing out any wobbly legs. It turns a chaotic, manual nightmare into a smooth, automated conversation, making it much easier for companies to build safer, smarter, and more reliable AI.