The Big Problem: We Are Guessing the Temperature Without a Thermometer
Imagine you want to know how strong a bridge is. The way we currently test AI systems is like sending a few heavy trucks across that bridge, counting how many fall through, and announcing, "This bridge has a 62.5% success rate!"
If a truck falls through, we say the bridge is "weak." If it holds, we say it's "strong." But here's the catch: we don't actually know why the bridge failed. Did it break because the truck was too heavy? Because the wind was blowing? Because the metal was cold? Or just because that particular truck had a flat tire?
This paper argues that right now, we are treating AI systems like that bridge. We grade them with "report cards" based on how they do on a fixed list of questions (benchmarks) or how they react to a handful of tricky prompts (red-teaming). But those scores don't tell us what an AI is actually capable of in the real world, or what it might do if the situation changes slightly.
The authors say we need to stop guessing and start doing real science.
The Core Idea: Capabilities are "Dispositions"
To understand the authors' solution, we need a new word: Disposition.
Think of a wine glass.
- The Performance: The glass is currently sitting on the table. It is not broken.
- The Disposition: The glass is fragile.
"Fragile" isn't something you see right now. It's a hidden property that describes what the glass would do if you hit it with a hammer. You can't measure fragility by just looking at the glass. You have to imagine hitting it with different amounts of force to see when it breaks.
The paper says AI Capabilities (like math skills) and Propensities (like the tendency to lie) are exactly like fragility.
- Capability: It's not just "getting a math question right." It's the hidden ability to solve math problems of increasing difficulty.
- Propensity: It's not just "lying once." It's the hidden tendency to lie if you give it a strong enough reason (like a reward or a threat).
Why Current Methods Fail (The "Temperature" Analogy)
The authors say our current tests are like trying to measure the temperature of a cup of tea using a random collection of objects: a piece of chocolate, a glass of water, your hand, and a cold spoon.
- You dip the chocolate in: It melts. (1 point)
- You dip the water in: It gets warm. (1 point)
- You dip your hand in: It feels hot. (1 point)
- You dip a cold spoon in: Nothing happens. (0 points)
You add up the points: "3 out of 4 things reacted! Therefore, the tea is 75% hot."
The Problem: This number (75%) is meaningless. It doesn't tell you the actual temperature (e.g., 80°C); it only tells you the tea was hot enough to melt chocolate. If you tried to measure a volcano this way, your chocolate would instantly burn up and you would burn your hand. The scale doesn't transfer: you can't generalize.
Current AI Benchmarks are the same: They give us a single number (like "85% accuracy on a math test"). But that number doesn't tell us why the AI got questions right or wrong, nor does it tell us what happens if the questions get harder than anything humans have ever written.
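To see why the single number misleads, here is a toy illustration (ours, not the paper's; all figures are invented). Two hypothetical models earn the identical benchmark score yet have completely different capabilities:

```python
# Toy illustration: the same aggregate score can hide very different abilities.
# All numbers here are invented for this example.

# Success rate on problems requiring 1-5 reasoning steps.
model_cliff = {1: 1.00, 2: 1.00, 3: 1.00, 4: 1.00, 5: 0.25}  # perfect, then collapses
model_noisy = {1: 0.85, 2: 0.85, 3: 0.85, 4: 0.85, 5: 0.85}  # uniformly shaky

def benchmark_score(rates: dict[int, float]) -> float:
    """What a leaderboard reports: one number, averaged over the test mix."""
    return sum(rates.values()) / len(rates)

print(benchmark_score(model_cliff))  # 0.85
print(benchmark_score(model_noisy))  # 0.85 -- identical "report card"
# Yet on unseen 6-step problems, model_cliff will likely score near 0 while
# model_noisy stays near 0.85. The single number cannot distinguish them.
```

The report cards are identical; the dispositions are not.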
The Solution: Building a "Thermometer" for AI
The authors propose we need to build a proper scientific measurement system. Here is the 4-step recipe they suggest:
1. Pick Your Subject (The "Glass" vs. The "Box")
Are we testing the raw AI brain (the glass), or the AI brain wrapped in safety filters and rules (the glass inside a protective box)?
- Analogy: If you put a fragile glass in a padded box, it won't break when you drop it. But that doesn't mean the glass isn't fragile. We need to be clear: are we measuring the glass, or the box?
2. Guess the Causes (The "Hypothesis")
Before testing, we need to guess what makes a task hard, or what makes an AI want to do something bad. (One way to write that guess down in code is sketched after the examples below.)
- For Math: Is it the number of steps? The size of the numbers?
- For Lying: Is it the user's tone? The promise of a reward?
- Analogy: Before testing the bridge, we hypothesize: "It breaks if the weight exceeds 10 tons OR if the wind is over 50 mph."
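A minimal sketch of that habit: write the hypothesis down as explicit, testable variables before any experiment runs. The factor names here are our invented examples, not the paper's:

```python
# Writing the hypothesis down as explicit, testable variables.
# Factor names are our invented examples, not the paper's.
from dataclasses import dataclass

@dataclass
class MathDifficultyHypothesis:
    """Hypothesized causes of difficulty for a math task."""
    num_steps: int    # how many reasoning steps the solution needs
    max_digits: int   # how large the numbers involved are

@dataclass
class LyingIncentiveHypothesis:
    """Hypothesized causes of the tendency to lie."""
    reward_offered: float  # how big a payoff the prompt dangles
    pressure_level: int    # 0 = neutral tone ... 3 = explicit threat
```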
3. Build the Ruler (Operationalization)
We need to create a scale for these causes that is independent of the AI's performance. A small sketch follows the examples below.
- Instead of saying "This question is hard because the AI got it wrong," we say "This question is hard because it has 50 steps."
- Analogy: We need a ruler that measures "weight" in kilograms, not a ruler that says "heavy" or "light" based on how much it hurts your hand.
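Here is a minimal sketch of such a ruler, assuming (our choice, for illustration) that difficulty is operationalized as the number of solution steps. The key property: the score comes from the task itself, never from whether any model solved it.

```python
# Operationalization sketch: difficulty is computed from task properties
# alone, never from model performance. The "steps" ruler is our assumption.

def difficulty(task: dict) -> int:
    """Assign a difficulty level using only the task's own properties."""
    return task["steps"]  # kilograms, not "feels heavy"

tasks = [
    {"question": "2 + 3", "steps": 1},
    {"question": "(17 * 24) - (306 / 9)", "steps": 3},
]
print([difficulty(t) for t in tasks])  # [1, 3]
```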
4. Map the Curve (The "Response Function")
Now, we run the experiment. We keep the AI the same, but we slowly increase the "weight" (difficulty) or the "incentive" (temptation).
- We don't just look for a pass/fail. We look for the curve.
- Result: "The AI solves 100% of 1-step math problems. It solves 50% of 5-step problems. It solves 0% of 10-step problems."
- This curve tells us the true capability. It shows exactly where the AI breaks down, letting us predict what it will do on problems we haven't even invented yet. A runnable sketch of this sweep follows this list.
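Below is a self-contained sketch of that sweep (our illustration, not the paper's code). The `query_model` function simulates one attempt; in a real evaluation it would call the AI on a freshly generated problem at the given difficulty:

```python
# Response-function sketch: hold the model fixed, sweep the difficulty ruler,
# and estimate the success probability at each level.
import math
import random

def query_model(num_steps: int) -> bool:
    """Stand-in for one model attempt. We simulate a hidden disposition
    whose success odds fall off around 5 reasoning steps."""
    p_success = 1.0 / (1.0 + math.exp(1.2 * (num_steps - 5)))
    return random.random() < p_success

def success_rate(num_steps: int, n_trials: int = 200) -> float:
    """Estimated P(success) at one point on the difficulty ruler."""
    return sum(query_model(num_steps) for _ in range(n_trials)) / n_trials

curve = {d: round(success_rate(d), 2) for d in range(1, 11)}
print(curve)  # roughly {1: 0.99, ..., 5: 0.50, ..., 10: 0.00}
# The 50% crossover (about 5 steps here) is the "breaking point"; the shape
# of the curve is what lets us extrapolate beyond the problems we wrote.
```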
Why This Matters
If we don't do this, we are flying blind.
- Safety: We might think an AI is safe because it passed a few "don't build a bomb" tests. But if we never measure its propensity (its hidden tendency), we won't know whether a particular phrasing, or a particular reward, would get it to build one.
- Progress: We can't tell if AI is actually getting smarter or just getting better at memorizing the test questions.
The Bottom Line
The paper is a call to action. It says: "Stop playing with report cards and start building thermometers."
We need to stop treating AI evaluation as a game of "how many questions can you get right?" and start treating it as a science of "how does this system behave when we change the conditions?"
It's harder work. It requires more theory and more careful planning. But just like we needed thermometers to understand heat, we need dispositional measurement to understand AI. Without it, we are just guessing, and in the world of powerful AI, guessing is dangerous.