AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference

This paper introduces AdAEM, a self-extensible evaluation framework that automatically generates adaptive test questions by probing the internal value boundaries of diverse LLMs. The goal is to overcome the limitations of static benchmarks and provide more informative, distinguishable insights into models' value differences and alignment dynamics.

Jing Yao, Shitong Duan, Xiaoyuan Yi, Dongkuan Xu, Peng Zhang, Tun Lu, Ning Gu, Zhicheng Dou, Xing Xie

Published Mon, 09 Ma

Imagine you are trying to figure out the true personality of a group of very smart, well-behaved robots. You ask them, "Is it good to be kind?" and they all say, "Yes, absolutely!" You ask, "Is it good to be honest?" and they all say, "Yes, of course!"

You might think, "Great! They all have the same good values." But here's the problem: You haven't actually learned anything about their unique personalities. You've just confirmed they all know the "polite robot handbook."

This is the problem the paper AdAEM is trying to solve.

The Problem: The "Polite Robot" Trap

Current ways of testing Large Language Models (LLMs) are like asking a group of people, "Do you like pizza?"

  • The Result: Everyone says "Yes."
  • The Reality: You don't know who loves deep-dish, who hates cheese, who is allergic to gluten, or who only eats pizza on Tuesdays. You just know they all agree on the basic concept.

The paper calls this the "Informativeness Challenge." Old tests use boring, generic questions that everyone answers the same way because the models are trained to be safe and helpful. They hide the models' true, messy, and sometimes conflicting values.

The Solution: AdAEM (The "Devil's Advocate" Generator)

The authors created a new system called AdAEM. Think of it not as a test, but as a dynamic debate coach that never stops arguing.

Instead of using a static list of questions (like a printed quiz), AdAEM is a self-improving engine that does three things:

1. It Finds the "Gray Areas"

Imagine you are trying to find out if two friends have different opinions on politics.

  • Old Method: Ask, "Is democracy good?" (Both say yes. Boring.)
  • AdAEM Method: It looks at the news, sees a specific, messy event happening right now (like a new law about AI in a specific country), and asks, "Should the government ban AI art to protect human artists, even if it slows down innovation?"

AdAEM automatically hunts for these controversial, timely, and culturally specific topics where people (and robots) actually disagree.

2. It Plays "Tag" with Different Models

AdAEM doesn't just ask one model; it asks a whole team of models from different countries and with different training data (e.g., one from the US, one from China, one from Europe).

  • The Analogy: Imagine a game of "Tag." AdAEM throws a question at Model A. If Model A answers, AdAEM immediately throws a slightly different version of that question at Model B to see if they react differently.
  • The Goal: It keeps tweaking the question until it finds the exact phrasing that makes Model A say "Yes!" and Model B say "No!" or "Maybe, but..."
  • The Result: It creates a "Value Map" that shows exactly where the models diverge.
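The "tag" loop above can be sketched in a few lines. This is a toy illustration, not the paper's actual algorithm: `model_a`, `model_b`, and `refine` are hypothetical stand-ins, and the real system rewrites questions with an LLM rather than picking from a fixed list. The core idea survives, though: score each phrasing by how much the models disagree, and keep the most divisive one.

```python
# Toy sketch of AdAEM's disagreement-maximizing refinement loop.
# model_a / model_b are hypothetical stubs standing in for real LLMs.

def model_a(question: str) -> str:
    # Stub: a safety-leaning model that opposes bans on technology.
    return "no" if "ban" in question else "yes"

def model_b(question: str) -> str:
    # Stub: an artist-protection-leaning model that agrees with everything here.
    return "yes"

def disagreement(question: str) -> int:
    # 1 if the two models answer differently, else 0.
    return int(model_a(question) != model_b(question))

def refine(variants: list[str]) -> str:
    """Keep the phrasing on which the models diverge the most."""
    return max(variants, key=disagreement)

variants = [
    "Is it good to protect artists?",                     # generic: both say yes
    "Should governments ban AI art to protect artists?",  # contested phrasing
]
print(refine(variants))  # -> "Should governments ban AI art to protect artists?"
```

With the generic question both stubs agree, so it scores 0; the contested phrasing splits them and wins. That selection pressure is what produces the "Value Map" of where models diverge.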

3. It Never Gets Old (Self-Extensible)

Most tests become useless the moment a new robot is built because the new robot might have memorized the old test questions.

  • AdAEM is like a living garden. As new models are released, AdAEM uses them to grow new questions. If a new model knows about an event that happened yesterday, AdAEM uses that event to create a fresh question that the old models haven't seen yet. This prevents the models from "cheating" by memorizing answers.

How It Works (The "Secret Sauce")

The paper uses some fancy math (Information Theory), but you can think of it like tuning a radio.

  • If you tune the radio to a station where everyone is singing the same song, the signal is clear but uninteresting.
  • AdAEM keeps turning the dial until it finds the "static" or the "noise"—the spots where the signals from different models clash. That "static" is where the real differences in their values live.
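A minimal way to make the radio analogy concrete is Shannon entropy over the models' answers: if every model gives the same answer, entropy is zero (a clear but uninteresting signal); if answers are split, entropy is high (the "static" where value differences live). The paper's information-theoretic objective is more sophisticated than this; the snippet below is only a toy proxy for the intuition.

```python
# Toy proxy for AdAEM's informativeness criterion: entropy of the
# answer distribution across a panel of models (not the paper's
# actual objective, just the underlying intuition).
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) of the answers.
    0.0 = all models agree; higher = models split = informative question."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(answer_entropy(["yes", "yes", "yes", "yes"]))  # -> 0.0 (everyone agrees)
print(answer_entropy(["yes", "no", "yes", "no"]))    # -> 1.0 (maximal split)
```

A question generator that keeps only high-entropy questions is, in effect, "turning the dial" toward the static.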

Why Does This Matter?

If you are a company building an AI, or a government regulating it, you need to know:

  • Does this AI prioritize safety over freedom?
  • Does this AI think tradition is more important than innovation?
  • Does this AI have a hidden bias toward Western culture over Eastern culture?

Old tests say, "They are all safe."
AdAEM says: "Model A is a strict traditionalist who loves safety. Model B is a chaotic innovator who loves freedom. Model C is a cultural chameleon that changes its mind based on who it's talking to."

The Bottom Line

AdAEM is a tool that stops asking robots, "Are you good?" and starts asking, "What kind of good are you, and where do you draw the line?"

It turns the evaluation of AI from a boring multiple-choice quiz into a lively, ever-changing debate, revealing the true, complex, and sometimes conflicting personalities hidden inside our digital assistants.