MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Imagine you've spent years asking a brilliant but slightly rigid librarian (the AI) for information. You ask, "What are Newton's laws?" and it hands you a printed page of text. It's accurate, but it's static. You can't touch it, you can't play with it, and you can't really feel how gravity works just by reading about it.

Now, imagine that same librarian has evolved. Instead of handing you a page, they instantly build you a miniature, interactive playground right on your screen. You drop a ball, and it bounces exactly how physics says it should. You tweak a slider, and the simulation changes in real-time. This is the shift the paper is talking about: moving from Text to MiniApps.

Here is a breakdown of the paper's key ideas using simple analogies:

1. The Problem: The "Text-Only" Trap

Currently, most AI benchmarks are like multiple-choice tests. They ask the AI to write a piece of code that solves a math problem or fixes a bug. If the code runs without errors, the AI gets a gold star.

But this misses the point of the future. The future isn't just about writing code that works; it's about building applications that make sense in the real world.

The Flaw: An AI could write a simulation of a falling apple that technically runs, but if the apple floats upward because the AI forgot gravity, the code "works" but the application is useless.
The Gap: Existing tests don't check if the AI understands real-world rules (like "a week has 7 days" or "water flows downhill"). They only check if the syntax is correct.

2. The Solution: MINIAPPBENCH (The New Exam)

The authors created a new testing ground called MINIAPPBENCH. Think of this not as a multiple-choice test, but as a cooking competition.

The Challenge: Instead of asking the AI to "write a recipe," they say, "Build a kitchen where the user can actually cook a meal."
The Rules: The AI must build a small, interactive web app (a "MiniApp") that follows real-world logic.
- Example: If you ask for a "Diet Tracker," the app shouldn't just list food. It should actually calculate calories, let you drag and drop items, and warn you if you eat too much sugar, adhering to the logic of nutrition.
The Data: They didn't make up these questions. They took 500 real requests from millions of actual users who wanted to build these kinds of things. This ensures the test is based on what people actually need, not what researchers think is cool.

3. The Judge: MINIAPPEVAL (The Robot Inspector)

This is the trickiest part. How do you grade an open-ended creative project? If you ask an AI to build a "Game," there is no single "correct" answer. One game might be a puzzle; another might be a runner.

Traditional tests fail here because they look for a specific "Ground Truth" (one right answer). But for MiniApps, there is no single right answer, only good and bad implementations.

To solve this, they built MINIAPPEVAL, which acts like a robot inspector with a human brain.

How it works: Instead of just reading the code, the robot actually uses the app. It clicks buttons, drags sliders, and tries to break the app (like an angry user trying to crash a website).
The Three Dimensions of Grading:
1. Intention: Did the robot understand what the user wanted? (e.g., "Did it actually build a diet tracker?")
2. Static: Does the app look and feel right? (e.g., "Are the buttons labeled correctly? Is the layout messy?")
3. Dynamic: Does the app behave correctly? (e.g., "If I click 'Add Water,' does the water level actually go up? Does the physics engine respect gravity?")

4. The Results: The AI is Still a Rookie

When they ran this new test on the world's best AI models (like GPT-5, Claude, etc.), the results were humbling.

The Score: Even the smartest models only passed about 45% of the time.
The Reality Check: Most AIs are great at writing code that looks right, but they often fail at the "common sense" part. They might build a calendar where a week has 8 days, or a physics sim where objects float. They are still learning to be "Architects" rather than just "Typists."

5. Why This Matters

This paper is a wake-up call. We are moving from an era where AI is a search engine (giving you text) to an era where AI is a builder (giving you tools).

Old Way: You ask, "How do I track my diet?" -> AI gives you a list of tips.
New Way: You ask, "How do I track my diet?" -> AI builds you a custom app where you can log meals, see charts, and get alerts, all following the rules of nutrition.

The Takeaway:
The authors have built the first "driving test" for AI builders. It's not enough to know the rules of the road (syntax); the AI must actually drive the car safely (adhere to real-world principles) without crashing. Right now, most AIs are still in driver's ed, but this new benchmark will help them get their licenses faster.

Here is a detailed technical summary of the paper "MINIAPPBENCH: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants."

1. Problem Statement

The paper identifies a critical paradigm shift in Human-AI interaction: Large Language Models (LLMs) are evolving from generating static text responses to creating MINIAPPS—dynamic, interactive HTML-based applications. While current LLMs excel at code generation, existing benchmarks fail to evaluate this new capability effectively due to two main limitations:

Static Focus: Traditional code benchmarks (e.g., HumanEval, MBPP) focus on algorithmic logic and syntax, ignoring execution context and user-facing behavior. Web benchmarks (e.g., FullFront, WebBench) prioritize visual fidelity or static layout reconstruction but neglect the dynamic interaction logic and adherence to real-world principles (e.g., physics laws, temporal constraints).
Evaluation Gap: Generating MINIAPPS is an open-ended task where multiple valid implementations exist. There is no single "ground truth" code. Consequently, rigid assertion-based or template-matching evaluation methods are insufficient. Furthermore, models often generate syntactically correct code that fails to adhere to implicit real-world constraints (e.g., a simulation violating gravity or a scheduler ignoring a 7-day week).

2. Methodology

The authors propose a comprehensive framework consisting of a new benchmark (MINIAPPBENCH) and an agentic evaluation system (MINIAPPEVAL).

A. MINIAPPBENCH (The Benchmark)

Data Source: Distilled from over 10 million real-world user queries from a production platform.
Construction Pipeline:
1. Filtering: Identified 1,123 high-quality "seed" queries requiring customized interaction and real-world principles.
2. Augmentation: Expanded to 1,974 queries via LLM-driven evolution while preserving core intent.
3. Reference Generation: Created verifiable evaluation references (Eval-Ref) for each task, guiding the assessment of principles.
4. Final Selection: Selected 500 tasks balanced across difficulty (Easy, Mid, Hard) and six domains: Science, Games, Tools, Humanities, Lifestyle, and Visualization.
Task Definition: Each task requires the model to generate a self-contained HTML/React application that not only renders a UI but also encodes causal dependencies and interaction logic adhering to real-world rules (e.g., "simulate one week of choices" must strictly follow a 7-day cycle).

B. MINIAPPEVAL (The Evaluation Framework)

To address the lack of ground truth, the authors introduce an Agentic Evaluation Framework that simulates human-like exploratory testing using browser automation (Playwright).

Mechanism: An LLM-powered agent interacts with the generated MiniApp in a sandboxed browser environment. It performs actions (clicking, dragging, typing) and observes runtime behavior (DOM changes, console logs).
Three Evaluation Dimensions:
1. Intention: Does the application fulfill the high-level user goal? (e.g., Is the physics simulation actually demonstrating the requested law?)
2. Static: Is the code structurally correct, accessible, and does it contain required elements? (Evaluated without execution).
3. Dynamic: Does the runtime behavior adhere to real-world principles and handle edge cases? (Evaluated via multi-step interaction trajectories).
Scoring: The system outputs a structured score (0–1) for each dimension. A task is considered "passed" only if the minimum score across all three dimensions exceeds a threshold (0.8).
Bias Mitigation: For visual tasks, a Double-Blind protocol is used where the evaluator first describes the page without seeing the query, then compares the description to the query to reduce confirmation bias.

3. Key Contributions

New Interaction Paradigm: The paper formally defines MINIAPPS as a new form of human-LLM interaction where code serves as an executable world model, externalizing knowledge through dynamic interfaces.
First Principle-Driven Benchmark: MINIAPPBENCH is the first benchmark specifically designed to evaluate the ability of LLMs to generate interactive applications that adhere to implicit real-world principles (scientific laws, commonsense constraints).
Agentic Evaluation Framework: MINIAPPEVAL introduces a novel, automated evaluation method that combines static code inspection with dynamic, browser-based exploration, overcoming the limitations of static metrics and rigid assertions.
Comprehensive Dataset: A high-quality dataset of 500 tasks covering diverse domains and difficulty levels, derived from real user demands.

4. Experimental Results

The authors evaluated 16 state-of-the-art LLMs (including GPT-5, Claude 3.5/4, Gemini 3, and various Qwen/GLM models) on MINIAPPBENCH.

Overall Performance: Current LLMs struggle significantly with this task. The highest-performing model, GPT-5.2, achieved a pass rate of only 45.46%. The average pass rate across all models was 17.05%.
Model Comparison:
- Closed-source vs. Open-source: Closed-source models consistently outperformed open-source counterparts.
- Scaling Laws: Larger models generally performed better (e.g., GLM-4.7 at 18.31% vs. GLM-4.5-Air at 7.09%), but even top-tier models fail on complex, principle-heavy tasks.
Domain Analysis: Models performed better on Visualization and Lifestyle tasks (often relying on commonsense) compared to Science and Tools, which require strict adherence to complex physical or logical rules.
Evaluation Validity:
- Human Alignment: MINIAPPEVAL showed high agreement with human expert judgments (Cohen's $\kappa$ ranging from 0.81 to 0.89).
- Ablation Studies: Removing the "Eval-Ref" or the "Agent" (dynamic interaction) caused significant drops in accuracy and recall, proving that both principle guidance and dynamic testing are essential.
- Double-Blind: The double-blind method improved accuracy on visual tasks by reducing confirmation bias (e.g., increasing accuracy from 80% to 83.6% for GPT-5.2).

5. Significance

Redefining LLM Capabilities: The paper argues that the future of LLMs lies not just in text generation but in autonomous architecture—building functional, principle-compliant software artifacts.
Benchmarking Standard: MINIAPPBENCH fills a critical gap in the evaluation landscape, providing a rigorous standard for measuring "world reasoning" and "interactive logic" rather than just syntax or static layout.
Future Research Direction: The results highlight that while LLMs are improving, they still lack the robustness to reliably construct complex, principle-driven interactive systems. The proposed MINIAPPEVAL framework offers a scalable, automated method to drive future research in this direction, moving beyond static code generation to dynamic, user-centric application creation.

In conclusion, the paper establishes that generating high-quality MINIAPPS is a significantly harder challenge than previously thought, requiring models to deeply understand and operationalize real-world principles, and provides the necessary tools (benchmark and evaluator) to track progress in this emerging field.

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

1. The Problem: The "Text-Only" Trap

2. The Solution: MINIAPPBENCH (The New Exam)

3. The Judge: MINIAPPEVAL (The Robot Inspector)

4. The Results: The AI is Still a Rookie

5. Why This Matters

1. Problem Statement

2. Methodology

A. MINIAPPBENCH (The Benchmark)

B. MINIAPPEVAL (The Evaluation Framework)

3. Key Contributions

4. Experimental Results

5. Significance

More like this

EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue

LLM-Grounded Explainability for Port Congestion Prediction via Temporal Graph Attention Networks

On the Strengths and Weaknesses of Data for Open-set Embodied Assistance

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning