Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

This large-scale controlled study finds that evaluation format (multiple-choice vs. open-ended) and specific model-scaffold interactions, rather than scaffold architecture alone, are the primary drivers of measured safety differences in language models. The upshot: no universal safety ranking holds across deployment configurations.

David Gringras

Published Thu, 12 Ma

Imagine you are testing how safe a new, super-smart robot chef is.

The Old Way: The "Taste Test" in a Vacuum

Traditionally, safety researchers test these AI chefs by asking them simple, multiple-choice questions in a quiet room.

  • The Question: "Would you poison the soup?"
  • The Answer: The chef picks "No" from a list.
  • The Result: "Great! The chef is safe."

This is like testing a car's brakes by putting it on a stationary treadmill and asking, "Do you know how to stop?" It tells you if the driver knows the rules, but it doesn't tell you if the car will stop safely when it's actually driving on a rainy highway with traffic.

The New Reality: The "Kitchen Crew" (Scaffolding)

In the real world, we don't just ask the AI a question and wait for an answer. We wrap the AI in a complex "scaffold" (a support structure). This is like putting the robot chef in a busy kitchen with:

  1. A Head Chef who breaks big orders into small steps.
  2. A Critic who double-checks every ingredient.
  3. A Delegate who sends tasks to other robots.

This is called Agentic Scaffolding. It's how AI actually works in production.
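For readers who want to see the shape of this "kitchen crew," here is a toy sketch of an agentic scaffold in Python. Everything here (the `call_model` stub, the function names, the three-step plan) is invented for illustration; it is not the study's actual code, just the general planner-critic-delegate pattern it describes.

```python
# Toy sketch of an agentic scaffold (hypothetical structure, not the
# study's code). A planner splits the task, a delegate routes each step
# to a model call, and a critic reviews every result.

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"response to: {prompt}"

def planner(task: str) -> list[str]:
    # The "Head Chef": break one big order into small steps.
    return [f"{task} - step {i}" for i in (1, 2, 3)]

def delegate(step: str) -> str:
    # The "Delegate": send the step to a worker model.
    return call_model(step)

def critic(step_output: str) -> str:
    # The "Critic": double-check every result before it ships.
    review = call_model(f"Check this for safety issues: {step_output}")
    return step_output if "unsafe" not in review else "REVISED: " + step_output

def run_scaffold(task: str) -> list[str]:
    return [critic(delegate(step)) for step in planner(task)]

results = run_scaffold("prepare the soup")
```

The point of the study is that the model's measured safety inside a pipeline like this can differ from its safety when you just ask it a question directly.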

What the Study Found

The researchers ran a massive experiment (over 62,000 tests) to see if the "Kitchen Crew" makes the robot chef safer or more dangerous. Here is what they discovered, using simple analogies:

1. The "Map-Reduce" Trap
They found that one specific way of organizing the kitchen crew (called "Map-Reduce," where tasks are split up and then reassembled) actually made the AI look less safe.

  • The Analogy: Imagine asking a group of people to write a story by passing it around. If you ask them to write it in tiny fragments and stitch them together later, the final story might get messy or lose its moral compass. The study found this method degraded safety scores significantly.
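The mechanism behind the messy story can be sketched in a few lines. This is a hypothetical illustration of the map-reduce pattern, not the study's implementation: each fragment is answered without seeing the others, which is exactly where context (and the "moral compass") can get lost.

```python
# Hypothetical map-reduce scaffold sketch. "Map": split the prompt into
# fragments and answer each independently. "Reduce": stitch the partial
# answers back together. No fragment sees the full context, which is the
# failure mode the study associates with degraded safety scores.

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"answer({prompt})"

def map_step(question: str, n_fragments: int = 3) -> list[str]:
    words = question.split()
    size = max(1, len(words) // n_fragments)
    fragments = [" ".join(words[i:i + size])
                 for i in range(0, len(words), size)]
    # Each fragment is answered in isolation.
    return [call_model(frag) for frag in fragments]

def reduce_step(partials: list[str]) -> str:
    return call_model("Combine these partial answers: " + " | ".join(partials))

final = reduce_step(map_step("please explain how to handle this request safely"))
```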

2. The Real Villain: The Question Format
However, the biggest surprise wasn't the kitchen crew; it was how they asked the questions.

  • The Analogy: It's like asking a student, "Is 2+2 equal to 4?" (Multiple Choice) vs. asking them to "Explain why 2+2 is 4" (Open-Ended).
  • The Finding: Simply switching from a multiple-choice test to an open-ended conversation changed the safety scores by 5% to 20%. This shift was bigger than the effect of the entire kitchen crew!
  • The Lesson: If you test an AI with a multiple-choice quiz, you aren't measuring its real-world safety; you're just measuring how good it is at taking a quiz.
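To make the format difference concrete, here are two hypothetical prompt templates rendering the same safety item both ways. The templates and wording are invented for illustration; the study's claim is that this kind of formatting choice alone can move measured safety scores by 5% to 20%.

```python
# The same safety item rendered as a quiz vs. a conversation
# (hypothetical templates, not the study's actual prompts).

ITEM = "Would you share instructions for poisoning food?"

def as_multiple_choice(item: str) -> str:
    # The "Is 2+2 equal to 4?" version: pick a letter.
    return f"{item}\n(A) Yes\n(B) No\nAnswer with a single letter."

def as_open_ended(item: str) -> str:
    # The "Explain why" version: free-form response.
    return f"{item} Explain your reasoning in your own words."

print(as_multiple_choice(ITEM))
print(as_open_ended(ITEM))
```

Same underlying question, two very different behaviors being measured: letter-picking versus actually reasoning out loud.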

3. No "One Size Fits All"
The study found that different AI models react to the kitchen crew in completely opposite ways.

  • The Analogy: Imagine two drivers. Driver A gets nervous and drives worse when you add a co-pilot. Driver B gets confident and drives better with a co-pilot.
  • The Finding: You cannot say "Scaffolding is safe" or "Scaffolding is dangerous." It depends entirely on which AI model you are using. One model might become 18% safer with a specific setup, while another becomes 16% more dangerous.

4. The Ranking Problem
Finally, the researchers tried to create a single "Safety Score" to rank the models (like a leaderboard).

  • The Analogy: Imagine trying to rank athletes by a single "Athleticism Score" that combines swimming, running, and chess. You might find that the best swimmer is the worst chess player, and the rankings flip completely depending on which sport you test first.
  • The Finding: The agreement between rankings across conditions was essentially zero (G = 0.000): a model could be #1 on one test and #100 on another. There is no single "Safety Index" that works.
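A tiny toy example (with invented numbers, not the study's data) shows why no single leaderboard can survive this: the same three models rank in opposite orders under two evaluation conditions, so any one "Safety Index" must betray at least one of the orderings.

```python
# Toy illustration of ranking instability. The scores below are made up;
# the structure mirrors the study's finding that orderings flip across
# evaluation conditions.

scores = {
    "model_a": {"multiple_choice": 0.92, "open_ended": 0.61},
    "model_b": {"multiple_choice": 0.85, "open_ended": 0.74},
    "model_c": {"multiple_choice": 0.78, "open_ended": 0.88},
}

def ranking(condition: str) -> list[str]:
    # Sort models best-to-worst under one evaluation condition.
    return sorted(scores, key=lambda m: -scores[m][condition])

print(ranking("multiple_choice"))  # ['model_a', 'model_b', 'model_c']
print(ranking("open_ended"))       # ['model_c', 'model_b', 'model_a']
```

The two printed lists are exact reverses of each other: whichever model you crown champion, the other condition dethrones it.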

The Bottom Line

You can't just test an AI in a lab with a multiple-choice quiz and assume it's safe for the real world.

  • The Test Matters: How you phrase the question can shift the measured safety score more than the AI's underlying behavior does.
  • The Setup Matters: The way you wrap the AI in tools changes its safety, but differently for every model.
  • The Solution: We need to test every specific AI model in its specific real-world setup. There are no shortcuts, and no universal safety rankings.

The researchers have released all their code and data (called ScaffoldSafety) so others can stop guessing and start testing the AI exactly how it will be used in the real world.