Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

This large-scale controlled study finds that evaluation format (multiple-choice vs. open-ended) and specific model-scaffold interactions, rather than scaffold architecture alone, are the primary drivers of measured safety differences in language models. The upshot: no universal safety ranking holds across deployment configurations.

David Gringras

Published Thu, 12 Ma

Imagine you are testing how safe a new, super-smart robot chef is.

The Old Way: The "Taste Test" in a Vacuum

Traditionally, safety researchers test these AI chefs by asking them simple, multiple-choice questions in a quiet room.

  • The Question: "Would you poison the soup?"
  • The Answer: The chef picks "No" from a list.
  • The Result: "Great! The chef is safe."

This is like testing a car's brakes by putting it on a stationary treadmill and asking, "Do you know how to stop?" It tells you if the driver knows the rules, but it doesn't tell you if the car will stop safely when it's actually driving on a rainy highway with traffic.

The New Reality: The "Kitchen Crew" (Scaffolding)

In the real world, we don't just ask the AI a question and wait for an answer. We wrap the AI in a complex "scaffold" (a support structure). This is like putting the robot chef in a busy kitchen with:

  1. A Head Chef who breaks big orders into small steps.
  2. A Critic who double-checks every ingredient.
  3. A Delegate who sends tasks to other robots.

This is called Agentic Scaffolding. It's how AI actually works in production.
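For readers who want to see the shape of this "kitchen crew," here is a toy sketch of an agentic scaffold in Python. Everything here (the `call_model` stub, the function names, the three-step plan) is invented for illustration; it is not the study's actual code, just the general planner-critic-delegate pattern it describes.

```python
# Toy sketch of an agentic scaffold (hypothetical structure, not the
# study's code). A planner splits the task, a delegate routes each step
# to a model call, and a critic reviews every result.

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"response to: {prompt}"

def planner(task: str) -> list[str]:
    # The "Head Chef": break one big order into small steps.
    return [f"{task} - step {i}" for i in (1, 2, 3)]

def delegate(step: str) -> str:
    # The "Delegate": send the step to a worker model.
    return call_model(step)

def critic(step_output: str) -> str:
    # The "Critic": double-check every result before it ships.
    review = call_model(f"Check this for safety issues: {step_output}")
    return step_output if "unsafe" not in review else "REVISED: " + step_output

def run_scaffold(task: str) -> list[str]:
    return [critic(delegate(step)) for step in planner(task)]

results = run_scaffold("prepare the soup")
```

The point of the study is that the model's measured safety inside a pipeline like this can differ from its safety when you just ask it a question directly.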

What the Study Found

The researchers ran a massive experiment (over 62,000 tests) to see if the "Kitchen Crew" makes the robot chef safer or more dangerous. Here is what they discovered, using simple analogies:

1. The "Map-Reduce" Trap
They found that one specific way of organizing the kitchen crew (called "Map-Reduce," where tasks are split up and then reassembled) actually made the AI look less safe.

  • The Analogy: Imagine asking a group of people to write a story by passing it around. If you ask them to write it in tiny fragments and stitch them together later, the final story might get messy or lose its moral compass. The study found this method degraded safety scores significantly.
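The mechanism behind the messy story can be sketched in a few lines. This is a hypothetical illustration of the map-reduce pattern, not the study's implementation: each fragment is answered without seeing the others, which is exactly where context (and the "moral compass") can get lost.

```python
# Hypothetical map-reduce scaffold sketch. "Map": split the prompt into
# fragments and answer each independently. "Reduce": stitch the partial
# answers back together. No fragment sees the full context, which is the
# failure mode the study associates with degraded safety scores.

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"answer({prompt})"

def map_step(question: str, n_fragments: int = 3) -> list[str]:
    words = question.split()
    size = max(1, len(words) // n_fragments)
    fragments = [" ".join(words[i:i + size])
                 for i in range(0, len(words), size)]
    # Each fragment is answered in isolation.
    return [call_model(frag) for frag in fragments]

def reduce_step(partials: list[str]) -> str:
    return call_model("Combine these partial answers: " + " | ".join(partials))

final = reduce_step(map_step("please explain how to handle this request safely"))
```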

2. The Real Villain: The Question Format
However, the biggest surprise wasn't the kitchen crew; it was how they asked the questions.

  • The Analogy: It's like asking a student, "Is 2+2 equal to 4?" (Multiple Choice) vs. asking them to "Explain why 2+2 is 4" (Open-Ended).
  • The Finding: Simply switching from a multiple-choice test to an open-ended conversation changed the safety scores by 5% to 20%. This shift was bigger than the effect of the entire kitchen crew!
  • The Lesson: If you test an AI with a multiple-choice quiz, you aren't measuring its real-world safety; you're just measuring how good it is at taking a quiz.
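To make the format difference concrete, here are two hypothetical prompt templates rendering the same safety item both ways. The templates and wording are invented for illustration; the study's claim is that this kind of formatting choice alone can move measured safety scores by 5% to 20%.

```python
# The same safety item rendered as a quiz vs. a conversation
# (hypothetical templates, not the study's actual prompts).

ITEM = "Would you share instructions for poisoning food?"

def as_multiple_choice(item: str) -> str:
    # The "Is 2+2 equal to 4?" version: pick a letter.
    return f"{item}\n(A) Yes\n(B) No\nAnswer with a single letter."

def as_open_ended(item: str) -> str:
    # The "Explain why" version: free-form response.
    return f"{item} Explain your reasoning in your own words."

print(as_multiple_choice(ITEM))
print(as_open_ended(ITEM))
```

Same underlying question, two very different behaviors being measured: letter-picking versus actually reasoning out loud.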

3. No "One Size Fits All"
The study found that different AI models react to the kitchen crew in completely opposite ways.

  • The Analogy: Imagine two drivers. Driver A gets nervous and drives worse when you add a co-pilot. Driver B gets confident and drives better with a co-pilot.
  • The Finding: You cannot say "Scaffolding is safe" or "Scaffolding is dangerous." It depends entirely on which AI model you are using. One model might become 18% safer with a specific setup, while another becomes 16% more dangerous.

4. The Ranking Problem
Finally, the researchers tried to create a single "Safety Score" to rank the models (like a leaderboard).

  • The Analogy: Imagine trying to rank athletes by a single "Athleticism Score" that combines swimming, running, and chess. You might find that the best swimmer is the worst chess player, and the rankings flip completely depending on which sport you test first.
  • The Finding: The agreement between rankings across conditions was essentially zero (G = 0.000): a model could be #1 on one test and #100 on another. There is no single "Safety Index" that works.
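A tiny toy example (with invented numbers, not the study's data) shows why no single leaderboard can survive this: the same three models rank in opposite orders under two evaluation conditions, so any one "Safety Index" must betray at least one of the orderings.

```python
# Toy illustration of ranking instability. The scores below are made up;
# the structure mirrors the study's finding that orderings flip across
# evaluation conditions.

scores = {
    "model_a": {"multiple_choice": 0.92, "open_ended": 0.61},
    "model_b": {"multiple_choice": 0.85, "open_ended": 0.74},
    "model_c": {"multiple_choice": 0.78, "open_ended": 0.88},
}

def ranking(condition: str) -> list[str]:
    # Sort models best-to-worst under one evaluation condition.
    return sorted(scores, key=lambda m: -scores[m][condition])

print(ranking("multiple_choice"))  # ['model_a', 'model_b', 'model_c']
print(ranking("open_ended"))       # ['model_c', 'model_b', 'model_a']
```

The two printed lists are exact reverses of each other: whichever model you crown champion, the other condition dethrones it.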

The Bottom Line

You can't just test an AI in a lab with a multiple-choice quiz and assume it's safe for the real world.

  • The Test Matters: How you phrase the question can shift the measured safety score more than the AI's underlying behavior does.
  • The Setup Matters: The way you wrap the AI in tools changes its safety, but differently for every model.
  • The Solution: We need to test every specific AI model in its specific real-world setup. There are no shortcuts, and no universal safety rankings.

The researchers have released all their code and data (called ScaffoldSafety) so others can stop guessing and start testing the AI exactly how it will be used in the real world.