Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

This paper introduces Vibe Code Bench, a benchmark of 100 web-application specifications whose finished apps are graded by autonomous browser agents. It reveals that even the best frontier models complete only 58.0% of end-to-end development tasks, and it highlights self-testing and evaluator alignment as critical factors for success.

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu

Published 2026-03-06

Imagine you've hired a brilliant, super-fast apprentice to build you a house. You don't give them blueprints; you just say, "I want a cozy cottage with a big kitchen, a garden, and a front door that opens automatically."

In the past, we tested these AI apprentices by asking them to do tiny, isolated tasks: "Can you hammer this nail?" or "Can you mix this specific type of concrete?" They got really good at those small jobs.

But the real question is: Can they actually build the whole house, from the foundation to the roof, and make sure the lights turn on when you flip the switch?

This paper, "Vibe Code Bench," introduces a new way to test AI models on exactly that. Instead of asking them to fix a single broken window, the researchers gave 16 of the world's smartest AI models a challenge: Build a complete, working website from a simple text description.

Here is the breakdown of how they did it and what they found, using some everyday analogies.

1. The Test: "The Vibe Code Bench"

Think of this as a driving test for AI, but instead of driving a car, they are driving a construction crew.

  • The Challenge: The AI was given 100 different "job descriptions" (like "build an app to track your habits" or "create a tool for a small business to manage payments").
  • The Environment: The AI was dropped into a digital sandbox with a computer terminal, a web browser, and access to tools like databases and payment processors (like Stripe).
  • The Goal: The AI had to write the code, set up the servers, and launch the website all by itself.
  • The Grader: Once the AI finished, a robotic "inspector" (an autonomous browser agent) actually visited the website. It clicked buttons, tried to log in, made test purchases, and checked if the app actually worked. It didn't just read the code; it used the app.
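The inspector's job boils down to clicking through the app and scoring how many required behaviors actually work. Here is a minimal sketch of that idea, not the paper's actual harness: the `FakeApp` stub and the feature names are invented stand-ins for a real browser agent driving a deployed site.

```python
# Minimal sketch of an automated "inspector" that grades a finished web app.
# FakeApp and the feature names are hypothetical stand-ins for a real
# browser agent clicking through a live site.

class FakeApp:
    """Stands in for a deployed app the agent would exercise in a browser."""
    def __init__(self, working_features):
        self.working = set(working_features)

    def supports(self, feature):
        return feature in self.working

def run_inspection(app, required_features):
    """One 'click' per required feature; returns per-check results and a score."""
    results = {f: app.supports(f) for f in required_features}
    score = sum(results.values()) / len(required_features)
    return results, score

app = FakeApp({"signup", "login", "checkout"})
results, score = run_inspection(app, ["signup", "login", "checkout", "refund"])
print(f"passed {sum(results.values())}/{len(results)}, score={score:.2f}")
# → passed 3/4, score=0.75
```

The real benchmark replaces `FakeApp` with an agent that drives an actual browser, but the grading logic is the same shape: a checklist of behaviors, each verified by using the app rather than reading its code.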

2. The Results: "The Apprentice is Getting Better, But Still Needs a Supervisor"

The researchers tested 16 top-tier AI models. Here is the verdict:

  • The Best Performer: The top model (GPT-5.3-Codex) succeeded on about 58% of the tasks.
  • The Reality Check: This means that for more than 4 out of 10 jobs, the AI either couldn't build the house, or the house collapsed the moment someone tried to walk through the front door.
  • The Gap: While AI is amazing at writing a single paragraph of code, building a whole system that connects a database, a login screen, and a payment processor is still a massive, unsolved puzzle.

3. The Secret Sauce: "Self-Testing"

The paper discovered a fascinating habit that separates the good builders from the great ones.

  • The "Edit-Only" Builders: Some models just wrote code, hit "submit," and hoped for the best. These models failed often.
  • The "Self-Testing" Builders: The top-performing models acted like careful craftsmen. They would write a bit of code, open the browser, click around, check if it worked, find a bug, fix it, and repeat.
  • The Analogy: It's the difference between a chef who just throws ingredients in a pot and serves it, versus a chef who tastes the soup, adds salt, tastes it again, and adjusts the heat. The paper found that the more an AI "tasted its own soup" (tested its own code), the better the final dish was.
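The tasting loop above can be sketched in a few lines. This is a toy illustration of the iterate-until-clean pattern, not the agents' actual tooling; `build`, `test`, and `patch` are hypothetical stubs a real agent would replace with file edits and browser checks.

```python
# Sketch of the "self-testing" loop: write code, test it, patch, repeat.
# build/test/patch are hypothetical stubs; a real agent would edit files
# and click through a browser instead.

def self_testing_loop(build, test, patch, max_iters=5):
    artifact = build()
    for i in range(max_iters):
        bugs = test(artifact)             # "taste the soup"
        if not bugs:
            return artifact, i            # nothing left to fix
        artifact = patch(artifact, bugs)  # fix what the test surfaced
    return artifact, max_iters

# Toy demo: the app starts with two bugs and each patch round fixes one.
artifact, iters = self_testing_loop(
    build=lambda: {"bugs": 2},
    test=lambda a: ["bug"] * a["bugs"],
    patch=lambda a, bugs: {"bugs": a["bugs"] - 1},
)
print(artifact, iters)  # → {'bugs': 0} 2
```

The "edit-only" builders effectively set `max_iters=0`: they ship whatever `build()` produces, which is why they failed more often.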

4. The Human Factor: "Who is the Judge?"

One of the most interesting parts of the study was about how the apps were graded.

  • The Problem: If you ask one AI to grade another AI's work, they might disagree. One might say, "The login button is blue, so it's perfect!" while another says, "The button is too small, it's a fail!"
  • The Finding: The researchers found that who you choose to be the judge matters a lot. Some AI judges agreed with human experts 93% of the time, while others only agreed 32% of the time.
  • The Lesson: You can't just use any AI to grade the work. You have to pick the right "inspector" to get a fair score.
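The agreement numbers above are just the fraction of apps where the AI judge's pass/fail verdict matches the human expert's. A minimal sketch, with made-up verdict lists chosen to roughly mirror the strong-judge vs. weak-judge gap:

```python
# Judge-vs-human agreement: fraction of apps where an AI judge's pass/fail
# verdict matches the human expert's. The verdict lists below are invented.

def agreement_rate(judge, human):
    assert len(judge) == len(human)
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(human)

human   = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # expert verdicts (1=pass, 0=fail)
judge_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]   # strong judge: disagrees once
judge_b = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]   # weak judge: disagrees often

print(agreement_rate(judge_a, human))  # → 0.9
print(agreement_rate(judge_b, human))  # → 0.3
```

A judge at 32% agreement is worse than useless for ranking models, since its verdicts barely track what humans consider a working app, which is why the choice of inspector matters so much.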

5. The Big Picture: "From 'Can It Write?' to 'Can It Build?'"

For a long time, the question was, "Can AI write code?" The answer is a loud YES.

Now, the question has changed to: "Can AI build software?"

This paper shows that we are getting close, but we aren't there yet. The AI is like a very talented junior developer who needs a senior engineer to look over their shoulder, catch the big mistakes, and make sure the final product actually works for real people.

In short: AI is no longer just a spell-checker for code; it's becoming a builder. But until it can reliably build a whole house without the roof falling off, we still need human architects to keep the blueprints safe.