The Big Problem: We Are Guessing the Temperature Without a Thermometer
Imagine you want to know how strong a bridge is. The way we currently test AI systems is like sending a few heavy trucks across that bridge, counting how many fall through, and announcing, "This bridge has a 62.5% success rate!"
If a truck falls through, we say the bridge is "weak." If it holds, we say it's "strong." But here's the catch: we don't actually know why the bridge failed. Did it break because the truck was too heavy? Because the wind was blowing? Because the metal was cold? Or just because that particular truck had a flat tire?
This paper argues that right now, we are treating AI systems like that bridge. We grade them with "report cards" based on how they do on a fixed list of questions (benchmarks) or how they react to a handful of tricky prompts (red-teaming). But those scores don't tell us what an AI is actually capable of in the real world, or what it might do if the situation changes slightly.
The authors say we need to stop guessing and start doing real science.
The Core Idea: Capabilities are "Dispositions"
To understand the authors' solution, we need a new word: Disposition.
Think of a wine glass.
- The Performance: The glass is currently sitting on the table. It is not broken.
- The Disposition: The glass is fragile.
"Fragile" isn't something you see right now. It's a hidden property that describes what the glass would do if you hit it with a hammer. You can't measure fragility by just looking at the glass. You have to imagine hitting it with different amounts of force to see when it breaks.
The paper says AI Capabilities (like math skills) and Propensities (like the tendency to lie) are exactly like fragility.
- Capability: It's not just "getting a math question right." It's the hidden ability to solve math problems of increasing difficulty.
- Propensity: It's not just "lying once." It's the hidden tendency to lie if you give it a strong enough reason (like a reward or a threat).
Why Current Methods Fail (The "Temperature" Analogy)
The authors say our current tests are like trying to measure the temperature of a cup of tea using a random collection of objects: a piece of chocolate, a glass of water, your hand, and a cold spoon.
- You dip the chocolate in: It melts. (1 point)
- You dip the water in: It gets warm. (1 point)
- You dip your hand in: It feels hot. (1 point)
- You dip a cold spoon in: Nothing happens. (0 points)
You add up the points: "3 out of 4 things reacted! Therefore, the tea is 75% hot."
The Problem: This number (75%) is meaningless. It doesn't tell you the actual temperature (e.g., 80°C); it only tells you the tea was hot enough to melt chocolate. If you tried to measure a volcano this way, your chocolate would instantly burn up and you would burn your hand. The scale doesn't transfer: you can't generalize.
Current AI Benchmarks are the same: They give us a single number (like "85% accuracy on a math test"). But that number doesn't tell us why the AI got questions right or wrong, nor does it tell us what happens if the questions get harder than anything humans have ever written.
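To see why the single number misleads, here is a toy illustration (ours, not the paper's; all figures are invented). Two hypothetical models earn the identical benchmark score yet have completely different capabilities:

```python
# Toy illustration: the same aggregate score can hide very different abilities.
# All numbers here are invented for this example.

# Success rate on problems requiring 1-5 reasoning steps.
model_cliff = {1: 1.00, 2: 1.00, 3: 1.00, 4: 1.00, 5: 0.25}  # perfect, then collapses
model_noisy = {1: 0.85, 2: 0.85, 3: 0.85, 4: 0.85, 5: 0.85}  # uniformly shaky

def benchmark_score(rates: dict[int, float]) -> float:
    """What a leaderboard reports: one number, averaged over the test mix."""
    return sum(rates.values()) / len(rates)

print(benchmark_score(model_cliff))  # 0.85
print(benchmark_score(model_noisy))  # 0.85 -- identical "report card"
# Yet on unseen 6-step problems, model_cliff will likely score near 0 while
# model_noisy stays near 0.85. The single number cannot distinguish them.
```

The report cards are identical; the dispositions are not.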
The Solution: Building a "Thermometer" for AI
The authors propose we need to build a proper scientific measurement system. Here is the 4-step recipe they suggest:
1. Pick Your Subject (The "Glass" vs. The "Box")
Are we testing the raw AI brain (the glass), or the AI brain wrapped in safety filters and rules (the glass inside a protective box)?
- Analogy: If you put a fragile glass in a padded box, it won't break when you drop it. But that doesn't mean the glass isn't fragile. We need to be clear: are we measuring the glass, or the box?
2. Guess the Causes (The "Hypothesis")
Before testing, we need to guess what makes a task hard, or what makes an AI want to do something bad. (One way to write that guess down in code is sketched after the examples below.)
- For Math: Is it the number of steps? The size of the numbers?
- For Lying: Is it the user's tone? The promise of a reward?
- Analogy: Before testing the bridge, we hypothesize: "It breaks if the weight exceeds 10 tons OR if the wind is over 50 mph."
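A minimal sketch of that habit: write the hypothesis down as explicit, testable variables before any experiment runs. The factor names here are our invented examples, not the paper's:

```python
# Writing the hypothesis down as explicit, testable variables.
# Factor names are our invented examples, not the paper's.
from dataclasses import dataclass

@dataclass
class MathDifficultyHypothesis:
    """Hypothesized causes of difficulty for a math task."""
    num_steps: int    # how many reasoning steps the solution needs
    max_digits: int   # how large the numbers involved are

@dataclass
class LyingIncentiveHypothesis:
    """Hypothesized causes of the tendency to lie."""
    reward_offered: float  # how big a payoff the prompt dangles
    pressure_level: int    # 0 = neutral tone ... 3 = explicit threat
```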
3. Build the Ruler (Operationalization)
We need to create a scale for these causes that is independent of the AI's performance. A small sketch follows the examples below.
- Instead of saying "This question is hard because the AI got it wrong," we say "This question is hard because it has 50 steps."
- Analogy: We need a ruler that measures "weight" in kilograms, not a ruler that says "heavy" or "light" based on how much it hurts your hand.
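Here is a minimal sketch of such a ruler, assuming (our choice, for illustration) that difficulty is operationalized as the number of solution steps. The key property: the score comes from the task itself, never from whether any model solved it.

```python
# Operationalization sketch: difficulty is computed from task properties
# alone, never from model performance. The "steps" ruler is our assumption.

def difficulty(task: dict) -> int:
    """Assign a difficulty level using only the task's own properties."""
    return task["steps"]  # kilograms, not "feels heavy"

tasks = [
    {"question": "2 + 3", "steps": 1},
    {"question": "(17 * 24) - (306 / 9)", "steps": 3},
]
print([difficulty(t) for t in tasks])  # [1, 3]
```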
4. Map the Curve (The "Response Function")
Now, we run the experiment. We keep the AI the same, but we slowly increase the "weight" (difficulty) or the "incentive" (temptation).
- We don't just look for a pass/fail. We look for the curve.
- Result: "The AI solves 100% of 1-step math problems. It solves 50% of 5-step problems. It solves 0% of 10-step problems."
- This curve tells us the true capability. It shows exactly where the AI breaks down, letting us predict what it will do on problems we haven't even invented yet. A runnable sketch of this sweep follows this list.
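Below is a self-contained sketch of that sweep (our illustration, not the paper's code). The `query_model` function simulates one attempt; in a real evaluation it would call the AI on a freshly generated problem at the given difficulty:

```python
# Response-function sketch: hold the model fixed, sweep the difficulty ruler,
# and estimate the success probability at each level.
import math
import random

def query_model(num_steps: int) -> bool:
    """Stand-in for one model attempt. We simulate a hidden disposition
    whose success odds fall off around 5 reasoning steps."""
    p_success = 1.0 / (1.0 + math.exp(1.2 * (num_steps - 5)))
    return random.random() < p_success

def success_rate(num_steps: int, n_trials: int = 200) -> float:
    """Estimated P(success) at one point on the difficulty ruler."""
    return sum(query_model(num_steps) for _ in range(n_trials)) / n_trials

curve = {d: round(success_rate(d), 2) for d in range(1, 11)}
print(curve)  # roughly {1: 0.99, ..., 5: 0.50, ..., 10: 0.00}
# The 50% crossover (about 5 steps here) is the "breaking point"; the shape
# of the curve is what lets us extrapolate beyond the problems we wrote.
```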
Why This Matters
If we don't do this, we are flying blind.
- Safety: We might think an AI is safe because it passed a few "don't build a bomb" tests. But if we never measure its propensity (its hidden tendency), we won't know whether a particular phrasing, or a particular reward, would get it to build one.
- Progress: We can't tell if AI is actually getting smarter or just getting better at memorizing the test questions.
The Bottom Line
The paper is a call to action. It says: "Stop playing with report cards and start building thermometers."
We need to stop treating AI evaluation as a game of "how many questions can you get right?" and start treating it as a science of "how does this system behave when we change the conditions?"
It's harder work. It requires more theory and more careful planning. But just like we needed thermometers to understand heat, we need dispositional measurement to understand AI. Without it, we are just guessing, and in the world of powerful AI, guessing is dangerous.