MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

This paper introduces MiniAppBench, a comprehensive benchmark derived from real-world data to evaluate LLMs' ability to generate principle-driven interactive HTML applications, alongside MiniAppEval, an agentic framework that uses browser automation to assess these applications across intention, static, and dynamic dimensions.

Zuhao Zhang, Chengyue Yu, Yuante Li, Chenyi Zhuang, Linjian Mo, Shuai Li

Published Wed, 11 Ma

Imagine you've spent years asking a brilliant but slightly rigid librarian (the AI) for information. You ask, "What are Newton's laws?" and it hands you a printed page of text. It's accurate, but it's static. You can't touch it, you can't play with it, and you can't really feel how gravity works just by reading about it.

Now, imagine that same librarian has evolved. Instead of handing you a page, they instantly build you a miniature, interactive playground right on your screen. You drop a ball, and it bounces exactly how physics says it should. You tweak a slider, and the simulation changes in real-time. This is the shift the paper is talking about: moving from Text to MiniApps.

Here is a breakdown of the paper's key ideas using simple analogies:

1. The Problem: The "Text-Only" Trap

Currently, most AI benchmarks are like multiple-choice tests. They ask the AI to write a piece of code that solves a math problem or fixes a bug. If the code runs without errors, the AI gets a gold star.

But this misses where things are headed. The future isn't just about writing code that runs; it's about building applications that make sense in the real world.

  • The Flaw: An AI could write a simulation of a falling apple that technically runs, but if the apple floats upward because the AI forgot gravity, the code "works" but the application is useless.
  • The Gap: Existing tests don't check if the AI understands real-world rules (like "a week has 7 days" or "water flows downhill"). They only check if the syntax is correct.
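As a toy illustration of this gap (my own sketch, not code from the paper): a syntax checker happily accepts both calls below, because both run without errors. Only a behavioral check catches that one of them violates gravity. All names and numbers here are illustrative.

```python
def simulate_fall(y0, g, dt, steps):
    """Euler-step the height of a dropped object; g should pull it downward."""
    y, v = y0, 0.0
    for _ in range(steps):
        v -= g * dt   # gravity accelerates the object downward
        y += v * dt
    return y

# Correct physics: the apple ends up lower than it started.
y_good = simulate_fall(y0=10.0, g=9.8, dt=0.1, steps=10)

# "Runs fine" but wrong: the sign is flipped, so the apple floats upward.
# A purely syntactic test passes this; a real-world-principle test does not.
y_bad = simulate_fall(y0=10.0, g=-9.8, dt=0.1, steps=10)
```

A benchmark in the spirit of MiniAppBench has to distinguish these two cases, which is exactly what execution-only grading cannot do.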

2. The Solution: MINIAPPBENCH (The New Exam)

The authors created a new testing ground called MINIAPPBENCH. Think of this not as a multiple-choice test, but as a cooking competition.

  • The Challenge: Instead of asking the AI to "write a recipe," they say, "Build a kitchen where the user can actually cook a meal."
  • The Rules: The AI must build a small, interactive web app (a "MiniApp") that follows real-world logic.
    • Example: If you ask for a "Diet Tracker," the app shouldn't just list food. It should actually calculate calories, let you drag and drop items, and warn you if you eat too much sugar, adhering to the logic of nutrition.
  • The Data: They didn't make up these questions. They took 500 real requests from millions of actual users who wanted to build these kinds of things. This ensures the test is based on what people actually need, not what researchers think is cool.
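To make the "Diet Tracker" example concrete, here is a minimal sketch of the kind of real-world nutrition logic such a MiniApp would need to encode. The function names and the sugar threshold are my own illustrative assumptions, not details from the paper.

```python
# Illustrative daily limit (roughly in the spirit of WHO guidance); an
# actual MiniApp would choose and justify its own thresholds.
DAILY_SUGAR_LIMIT_G = 50

def log_meal(tracker, name, calories, sugar_g):
    """Record a meal and return any warnings triggered by nutrition rules."""
    tracker["meals"].append({"name": name, "calories": calories, "sugar_g": sugar_g})
    warnings = []
    total_sugar = sum(m["sugar_g"] for m in tracker["meals"])
    if total_sugar > DAILY_SUGAR_LIMIT_G:
        warnings.append(
            f"Sugar intake {total_sugar} g exceeds the {DAILY_SUGAR_LIMIT_G} g daily limit"
        )
    return warnings

def total_calories(tracker):
    """Actually compute calories rather than just listing foods."""
    return sum(m["calories"] for m in tracker["meals"])

tracker = {"meals": []}
log_meal(tracker, "oatmeal", calories=300, sugar_g=10)
alerts = log_meal(tracker, "soda", calories=150, sugar_g=45)  # pushes sugar past the limit
```

The point of the benchmark is that an app which merely renders a food list would fail this kind of logic check, even if its HTML is flawless.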

3. The Judge: MINIAPPEVAL (The Robot Inspector)

This is the trickiest part. How do you grade an open-ended creative project? If you ask an AI to build a "Game," there is no single "correct" answer. One game might be a puzzle; another might be a runner.

Traditional tests fail here because they look for a specific "Ground Truth" (one right answer). But for MiniApps, there is no single right answer, only good and bad implementations.

To solve this, they built MINIAPPEVAL, which acts like a robot inspector with a human brain.

  • How it works: Instead of just reading the code, the robot actually uses the app. It clicks buttons, drags sliders, and tries to break the app (like an angry user trying to crash a website).
  • The Three Dimensions of Grading:
    1. Intention: Does the app match what the user actually asked for? (e.g., "Did it actually build a diet tracker?")
    2. Static: Does the app look and feel right? (e.g., "Are the buttons labeled correctly? Is the layout messy?")
    3. Dynamic: Does the app behave correctly? (e.g., "If I click 'Add Water,' does the water level actually go up? Does the physics engine respect gravity?")
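The three dimensions above can be sketched as a simple rubric aggregator. The dimension names come from the paper, but the score scale, equal weighting, and function name below are assumptions made purely for illustration; the real MiniAppEval framework derives its scores from browser automation rather than hand-entered numbers.

```python
def aggregate_scores(intention, static, dynamic, weights=(1/3, 1/3, 1/3)):
    """Combine three dimension scores (each assumed to lie in [0, 1]) into one grade.

    Equal weights are an illustrative assumption; the paper's actual
    scoring scheme may weight or combine the dimensions differently.
    """
    for s in (intention, static, dynamic):
        if not 0.0 <= s <= 1.0:
            raise ValueError("dimension scores must lie in [0, 1]")
    w_i, w_s, w_d = weights
    return w_i * intention + w_s * static + w_d * dynamic

# A hypothetical diet-tracker app that nails the intent and looks fine,
# but whose buttons misbehave when the robot inspector clicks them:
score = aggregate_scores(intention=0.9, static=0.8, dynamic=0.4)
```

The interesting design choice is that a high static score cannot rescue a broken dynamic score: an app that looks right but behaves wrong still grades poorly, which matches the paper's emphasis on real-world behavior over surface correctness.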

4. The Results: The AI is Still a Rookie

When they ran this new test on the world's best AI models (like GPT-5, Claude, etc.), the results were humbling.

  • The Score: Even the smartest models only passed about 45% of the time.
  • The Reality Check: Most AIs are great at writing code that looks right, but they often fail at the "common sense" part. They might build a calendar where a week has 8 days, or a physics sim where objects float. They are still learning to be "Architects" rather than just "Typists."

5. Why This Matters

This paper is a wake-up call. We are moving from an era where AI is a search engine (giving you text) to an era where AI is a builder (giving you tools).

  • Old Way: You ask, "How do I track my diet?" -> AI gives you a list of tips.
  • New Way: You ask, "How do I track my diet?" -> AI builds you a custom app where you can log meals, see charts, and get alerts, all following the rules of nutrition.

The Takeaway:
The authors have built the first "driving test" for AI builders. It's not enough to know the rules of the road (syntax); the AI must actually drive the car safely (adhere to real-world principles) without crashing. Right now, most AIs are still in driver's ed, but this new benchmark will help them get their licenses faster.