Automating Forecasting Question Generation and Resolution for AI Evaluation

This paper presents an automated system that uses LLM-powered web-research agents to generate and resolve diverse, real-world forecasting questions at scale. The resulting questions match or exceed the quality and resolution rates of human-curated forecasting platforms, and the system is used to evaluate and improve AI forecasting performance.

Nikos I. Bosse, Peter Mühlbacher, Jack Wildman, Lawrence Phillips, Dan Schwarz

Published Wed, 11 Ma

Imagine you are trying to teach a group of very smart robots how to predict the future. You want to know if they are getting smarter, just like you might test a student by giving them harder math problems.

But here's the catch: You can't just ask them random questions. If you ask, "Will it rain tomorrow?" the answer is too easy. If you ask, "Will the stock market crash?" the answer is too vague. To really test a robot's "brain," you need thousands of specific, tricky questions about the real world, and you need to know the answers later to grade them.

In the past, humans had to write all these questions by hand. It was slow, boring, and expensive. Or, researchers used computer-generated questions about things that happen every day (like the weather), which wasn't very interesting or useful.

This paper is about a team that built a robot factory to solve this problem. They created an automated system that writes, checks, and grades its own future-prediction questions.

Here is how their "Robot Factory" works, broken down into simple steps:

1. The Seed Planters (Finding Inspiration)

Instead of staring at a blank page, the system starts with "seeds." Imagine these are clippings from news articles, stock market reports, or government announcements.

  • The Analogy: Think of these seeds like planting a garden. You don't just plant "a flower"; you plant a specific seed (like a news story about a new law) and see what kind of question grows from it.

2. The Drafters (The "What If?" Agents)

The system takes those news seeds and asks a team of AI agents (robots with internet access) to turn them into "proto-questions."

  • The Analogy: These are like rough drafts written by a creative writer. They might say, "Will the EU pass a new law?" This is a good idea, but it's too vague. Which law? When? How do we know if it passed?

3. The Editors (The "Refiners")

Another team of AI agents takes those rough drafts and polishes them. They add strict rules so there is no confusion later.

  • The Analogy: Imagine a strict editor who says, "Okay, 'Will the EU pass a law?' is too vague. Let's change it to: 'Will the EU publish a specific regulation on recycled plastic in their official journal by December 31st?'" Now, the question is clear, and we know exactly where to look for the answer.

4. The Quality Control Inspectors (The Verifiers)

Before the questions go out to the robots, a panel of "inspectors" checks them. They ask:

  • Is this question too easy? (If the answer is obvious, it's a bad test.)
  • Is it too hard to find the answer? (If no one can find the data, the test is broken.)
  • Is it ambiguous? (Could two people argue about the answer?)
  • The Analogy: This is like a teacher grading a test before giving it to students. If a question is confusing or the answer key is missing, the teacher throws it out.
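The four factory stages above can be wired together as a simple pipeline. The sketch below is purely illustrative: the function names and stub logic are invented for this summary, and in the real system each stage would be an LLM agent with web access rather than a string transformation.

```python
# Illustrative sketch of the four-stage question factory.
# Stage logic is stubbed out; only the wiring reflects the paper's pipeline.

def draft_question(seed: str) -> str:
    # Stage 2 (Drafters): turn a news "seed" into a rough proto-question.
    return f"Will something happen regarding: {seed}?"

def refine_question(draft: str) -> dict:
    # Stage 3 (Refiners): pin down a resolution source and a deadline
    # so the answer can be looked up unambiguously later.
    return {
        "question": draft,
        "resolution_source": "official journal",  # hypothetical placeholder
        "deadline": "2025-12-31",
    }

def verify_question(q: dict) -> bool:
    # Stage 4 (Verifiers): reject questions that lack a checkable
    # resolution source or deadline (a stand-in for the real QC checks
    # on difficulty, resolvability, and ambiguity).
    return bool(q.get("resolution_source")) and bool(q.get("deadline"))

def pipeline(seeds: list[str]) -> list[dict]:
    # Stage 1 (Seed Planters) supplies the seeds; the rest runs per seed.
    questions = []
    for seed in seeds:
        q = refine_question(draft_question(seed))
        if verify_question(q):
            questions.append(q)
    return questions
```

Running `pipeline(["EU recycled-plastic regulation"])` yields one fully specified question dict; any draft the verifier rejects simply never reaches the exam.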

5. The Final Product

The system generated 1,499 high-quality questions covering topics like politics, space launches, weather, and court cases.

  • The Result: The questions were just as good as (or better than) those written by human experts on well-known forecasting websites. In fact, the system made fewer mistakes than the humans did!

6. The Exam (Testing the Robots)

Once the questions were ready, the team tested different AI models (like GPT-5, Gemini 3 Pro, etc.) to see how well they could predict the answers.

  • The Finding: The smarter the robot, the better it did. The newest, most powerful models got the highest scores.
  • The "Super-Strategy": They also found that if they broke a big question down into smaller, easier questions (like solving a puzzle piece by piece), the robots got even smarter and more accurate.
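The decomposition idea can be shown with a two-line probability sketch. This is a generic illustration of splitting a question into conditional sub-questions, not the paper's exact procedure; the example question and numbers are made up.

```python
# Decompose "Will the regulation pass by year-end?" into two easier
# sub-questions and recombine via the chain rule of probability:
#   P(pass) = P(proposed) * P(pass | proposed)

def combine(p_proposed: float, p_pass_given_proposed: float) -> float:
    # Each sub-probability is forecast separately (the "puzzle pieces"),
    # then multiplied back together for the final answer.
    return p_proposed * p_pass_given_proposed
```

For example, a forecaster who thinks there is an 80% chance the regulation is proposed and a 50% chance it passes once proposed would answer the big question with `combine(0.8, 0.5)`, i.e. 40%.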

Why Does This Matter?

Think of this system as a gym for AI brains.

  • Before, we didn't have enough good "weights" (questions) to lift.
  • Now, we have a machine that can generate an endless supply of heavy, diverse, and challenging weights.
  • This allows us to see exactly how smart our AI is getting and helps us build better tools for making real-world decisions, from business investments to government policy.

In a nutshell: The authors built a self-running machine that invents its own tricky future-prediction puzzles, checks to make sure they are fair, and then uses them to prove that smarter AI models are indeed getting better at guessing the future. It's a massive leap forward in testing how "intelligent" our computers are becoming.