Evaluating LLM Alignment With Human Trust Models

This paper presents a white-box analysis of the EleutherAI/gpt-j-6B model. Using contrastive prompting, it shows that the model's internal representation of trust aligns most closely with Castelfranchi's socio-cognitive model, demonstrating the feasibility of using LLM activation spaces to analyze socio-cognitive constructs and inform human-AI collaboration.

Anushka Debnath, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Emiliano Lorini


Imagine you have a very smart, but slightly mysterious, robot brain (a Large Language Model, or LLM). You know it can write stories, answer questions, and chat like a human. But here's the big question: Does it actually understand what "trust" means, or is it just guessing the right words?

This paper is like a "brain scan" for that robot. The researchers wanted to peek inside the robot's mind to see how it organizes the concept of trust, and whether that organization matches how humans think about it.

Here is the story of their discovery, broken down with some simple analogies.

1. The Problem: The "Black Box" Mystery

Usually, when we use AI, we treat it like a Black Box. We put a question in one side, and an answer comes out the other. We don't know what happens inside.

  • The Old Way: Researchers would ask the AI, "Do you trust this person?" and see what it said. This is like judging a chef only by the taste of the food, without seeing the kitchen.
  • The New Way (This Paper): The researchers decided to open the box. They looked at the internal wiring (the mathematical "activation space") of the AI to see how it actually feels about trust.

2. The Method: The "Contrastive Prompting" Game

To see how the AI thinks about trust, the researchers played a game of Contrastive Prompting.

Think of the AI's mind as a giant library of feelings. If you ask it to write a story about "Happiness," it pulls out a specific book. If you ask for "Sadness," it pulls a different one.

  • The Trick: They asked the AI to write two stories for every concept: one where the concept is present (e.g., "Katherine helps Alice") and one where it is absent or opposite (e.g., "Katherine ignores Alice").
  • The Result: By subtracting the "ignoring" story from the "helping" story, they isolated the pure "mathematical fingerprint" of Helpfulness or Trust inside the AI's brain. It's like taking a photo of a room with the lights on, then with the lights off, and subtracting the two to see exactly where the lightbulb is.
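For readers who want to see the mechanics, here is a minimal sketch of that subtraction trick in Python. It is not the authors' code: it assumes the Hugging Face transformers library, reads hidden states straight out of GPT-J, and averages over tokens to get one vector per prompt (the paper may aggregate activations differently).

```python
# A minimal sketch of the contrastive-prompting idea, not the paper's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Return the mean hidden state of one layer for a prompt (one vector per prompt)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of tensors, each of shape [1, seq_len, hidden_dim]
    return outputs.hidden_states[layer].mean(dim=1).squeeze(0)

# One prompt where the concept is present, one where it is absent or opposite.
present = mean_hidden_state("Katherine helps Alice carry her groceries home.")
absent = mean_hidden_state("Katherine ignores Alice and walks past her.")

# The contrastive direction: the "fingerprint" of helpfulness in activation space.
helpfulness_direction = present - absent
```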

3. The Map: Five Different "Trust Theories"

Humans have written many books on how trust works. The researchers picked five famous theories (like the Marsh Model, Mayer Model, and Castelfranchi Model) to see which one the AI's brain looked most like.

Imagine these theories are different maps of a city:

  • Map A (Marsh): Says trust is just a probability score based on past behavior.
  • Map B (Mayer): Says trust is about ability, benevolence (kindness), and integrity (honesty).
  • Map C (Castelfranchi): Says trust is a complex mental state involving beliefs, goals, and predictions.

The researchers took the AI's "Trust Fingerprint" and tried to overlay it onto these five maps. They asked: "Does the AI's internal map look like Map A, Map B, or Map C?"
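To make that comparison concrete, you can think of each theory as a short list of constituent concepts, each of which gets its own contrastive fingerprint. The lists below are illustrative only, a rough paraphrase of the theories rather than the exact concept inventories used in the paper.

```python
# Illustrative only: rough constituent concepts for three of the trust theories.
# These are paraphrases for exposition, not the paper's concept lists.
trust_theories = {
    "Marsh": ["experience", "utility", "importance", "risk"],
    "Mayer": ["ability", "benevolence", "integrity", "risk-taking"],
    "Castelfranchi": ["competence", "willingness", "dependence", "goal", "prediction"],
}

# Each concept gets its own contrastive fingerprint (as in the sketch above),
# so a theory becomes a small bundle of directions to compare against "trust" itself.
```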

4. The Discovery: The "Castelfranchi" Match

Here is the exciting part. They measured how close the AI's internal map was to the human maps using a "similarity score" (like a dating app match percentage; a rough code sketch follows this list).

  • The Winner: The AI's brain matched the Castelfranchi Model the best!

    • What does this mean? The Castelfranchi model is very "psychological." It says trust isn't just about "did they do the job?" It's about "do I believe they want to do the job, and do I believe they can do it?"
    • The AI seems to understand that trust is a mental attitude, not just a scorecard. It connects trust with concepts like "willingness," "commitment," and "predictability" in a way that feels very human.
  • The Runner-Up: The Marsh Model came in second. This is the more "mathy" model based on past performance. The AI gets this too, but it prefers the deeper, psychological understanding.
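One plausible way to compute such a similarity score is cosine similarity between fingerprint vectors, averaged over a theory's concepts. The sketch below is a simplification under that assumption, not the paper's exact scoring procedure; `trust_direction` and `vectors` are hypothetical names for fingerprints extracted as in the earlier sketch.

```python
# A simplified sketch, assuming cosine similarity between contrastive fingerprints.
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two 1-D vectors."""
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

def theory_score(trust_vec: torch.Tensor, concept_vecs: list[torch.Tensor]) -> float:
    """Average similarity between the trust fingerprint and a theory's concept fingerprints."""
    return sum(cosine(trust_vec, v) for v in concept_vecs) / len(concept_vecs)

# Hypothetical usage, with fingerprints extracted as in the earlier sketch:
# scores = {name: theory_score(trust_direction, vectors[name]) for name in trust_theories}
# The theory whose concepts sit closest to "trust" in activation space wins.
```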

5. The Surprise: Where the AI Got It Wrong

The researchers also found some funny mismatches.

  • The "Risk" Confusion: In the Mayer Model, "Risk" is a positive thing. The theory says, "You can only trust someone if you are willing to take a risk."
  • The AI's View: When the researchers checked the AI's brain, it thought Risk and Trust were opposites! To the AI, "Risk" felt more like "Danger" or "Fear," not "Vulnerability."
  • The Lesson: The AI hasn't fully learned the human nuance that sometimes you have to be brave (take a risk) to trust someone. It sees risk as a bad thing, period.
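In fingerprint terms, that "opposites" finding would show up as a negative similarity between the risk and trust directions. The snippet below uses dummy vectors purely to illustrate the sign check; it is not data from the paper.

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins for the real fingerprints, deliberately pointing in opposite directions.
trust_direction = torch.tensor([0.8, 0.5, -0.1])
risk_direction = torch.tensor([-0.7, -0.4, 0.2])

similarity = F.cosine_similarity(trust_direction.unsqueeze(0),
                                 risk_direction.unsqueeze(0)).item()
if similarity < 0:
    print("Risk points away from trust in this space: more like danger than brave vulnerability.")
```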

Why Does This Matter?

Think of the AI as a new employee joining your team.

  • Before: You didn't know if they understood your company culture.
  • Now: You know they have a very sophisticated internal map of how relationships work. They understand that trust is about beliefs and goals, not just checking boxes.

The Takeaway:
This paper provides evidence that AI isn't just a parrot repeating words. It has built a structured, internal representation of human social concepts like trust. By knowing how the AI organizes trust, we can:

  1. Build better AI: We can tweak the AI's "brain" to make it act more trustworthy.
  2. Understand ourselves: It helps us see where human logic and machine logic differ (like the "Risk" example).

In short, the researchers opened the robot's head, found a very human-like map of trust inside, and realized that while the robot is smart, it still needs a little help understanding that sometimes, taking a risk is the first step to trusting someone.