Evaluating LLM Alignment With Human Trust Models

This paper presents a white-box analysis of the EleutherAI/gpt-j-6B model. Using contrastive prompting, it shows that the model's internal representation of trust aligns most closely with Castelfranchi's socio-cognitive model, demonstrating the feasibility of using LLM activation spaces to analyze socio-cognitive constructs and inform human-AI collaboration.

Anushka Debnath, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Emiliano Lorini


Imagine you have a very smart, but slightly mysterious, robot brain (a Large Language Model, or LLM). You know it can write stories, answer questions, and chat like a human. But here's the big question: Does it actually understand what "trust" means, or is it just guessing the right words?

This paper is like a "brain scan" for that robot. The researchers wanted to peek inside the robot's mind to see how it organizes the concept of trust, and whether that organization matches how humans think about it.

Here is the story of their discovery, broken down with some simple analogies.

1. The Problem: The "Black Box" Mystery

Usually, when we use AI, we treat it like a Black Box. We put a question in one side, and an answer comes out the other. We don't know what happens inside.

  • The Old Way: Researchers would ask the AI, "Do you trust this person?" and see what it said. This is like judging a chef only by the taste of the food, without seeing the kitchen.
  • The New Way (This Paper): The researchers decided to open the box. They looked at the internal wiring (the mathematical "activation space") of the AI to see how it actually feels about trust.

2. The Method: The "Contrastive Prompting" Game

To see how the AI thinks about trust, the researchers played a game of Contrastive Prompting.

Think of the AI's mind as a giant library of feelings. If you ask it to write a story about "Happiness," it pulls out a specific book. If you ask for "Sadness," it pulls a different one.

  • The Trick: They asked the AI to write two stories for every concept: one where the concept is present (e.g., "Katherine helps Alice") and one where it is absent or opposite (e.g., "Katherine ignores Alice").
  • The Result: By subtracting the "ignoring" story from the "helping" story, they isolated the pure "mathematical fingerprint" of Helpfulness or Trust inside the AI's brain. It's like taking a photo of a room with the lights on, then with the lights off, and subtracting the two to see exactly where the lightbulb is.
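For readers who want to see the mechanics, here is a minimal sketch of that subtraction trick in Python. It is not the authors' code: it assumes the Hugging Face transformers library, reads hidden states straight out of GPT-J, and averages over tokens to get one vector per prompt (the paper may aggregate activations differently).

```python
# A minimal sketch of the contrastive-prompting idea, not the paper's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Return the mean hidden state of one layer for a prompt (one vector per prompt)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of tensors, each of shape [1, seq_len, hidden_dim]
    return outputs.hidden_states[layer].mean(dim=1).squeeze(0)

# One prompt where the concept is present, one where it is absent or opposite.
present = mean_hidden_state("Katherine helps Alice carry her groceries home.")
absent = mean_hidden_state("Katherine ignores Alice and walks past her.")

# The contrastive direction: the "fingerprint" of helpfulness in activation space.
helpfulness_direction = present - absent
```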

3. The Map: Five Different "Trust Theories"

Humans have written many books on how trust works. The researchers picked five famous theories (like the Marsh Model, Mayer Model, and Castelfranchi Model) to see which one the AI's brain looked most like.

Imagine these theories are different maps of a city:

  • Map A (Marsh): Says trust is just a probability score based on past behavior.
  • Map B (Mayer): Says trust is about ability, benevolence (kindness), and integrity (honesty).
  • Map C (Castelfranchi): Says trust is a complex mental state involving beliefs, goals, and predictions.

The researchers took the AI's "Trust Fingerprint" and tried to overlay it onto these five maps. They asked: "Does the AI's internal map look like Map A, Map B, or Map C?"
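To make that comparison concrete, you can think of each theory as a short list of constituent concepts, each of which gets its own contrastive fingerprint. The lists below are illustrative only, a rough paraphrase of the theories rather than the exact concept inventories used in the paper.

```python
# Illustrative only: rough constituent concepts for three of the trust theories.
# These are paraphrases for exposition, not the paper's concept lists.
trust_theories = {
    "Marsh": ["experience", "utility", "importance", "risk"],
    "Mayer": ["ability", "benevolence", "integrity", "risk-taking"],
    "Castelfranchi": ["competence", "willingness", "dependence", "goal", "prediction"],
}

# Each concept gets its own contrastive fingerprint (as in the sketch above),
# so a theory becomes a small bundle of directions to compare against "trust" itself.
```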

4. The Discovery: The "Castelfranchi" Match

Here is the exciting part. They measured how close the AI's internal map was to the human maps using a "similarity score" (like a dating app match percentage; a rough code sketch follows this list).

  • The Winner: The AI's brain matched the Castelfranchi Model the best!

    • What does this mean? The Castelfranchi model is very "psychological." It says trust isn't just about "did they do the job?" It's about "do I believe they want to do the job, and do I believe they can do it?"
    • The AI seems to understand that trust is a mental attitude, not just a scorecard. It connects trust with concepts like "willingness," "commitment," and "predictability" in a way that feels very human.
  • The Runner-Up: The Marsh Model came in second. This is the more "mathy" model based on past performance. The AI gets this too, but it prefers the deeper, psychological understanding.
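One plausible way to compute such a similarity score is cosine similarity between fingerprint vectors, averaged over a theory's concepts. The sketch below is a simplification under that assumption, not the paper's exact scoring procedure; `trust_direction` and `vectors` are hypothetical names for fingerprints extracted as in the earlier sketch.

```python
# A simplified sketch, assuming cosine similarity between contrastive fingerprints.
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two 1-D vectors."""
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

def theory_score(trust_vec: torch.Tensor, concept_vecs: list[torch.Tensor]) -> float:
    """Average similarity between the trust fingerprint and a theory's concept fingerprints."""
    return sum(cosine(trust_vec, v) for v in concept_vecs) / len(concept_vecs)

# Hypothetical usage, with fingerprints extracted as in the earlier sketch:
# scores = {name: theory_score(trust_direction, vectors[name]) for name in trust_theories}
# The theory whose concepts sit closest to "trust" in activation space wins.
```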

5. The Surprise: Where the AI Got It Wrong

The researchers also found some funny mismatches.

  • The "Risk" Confusion: In the Mayer Model, "Risk" is a positive thing. The theory says, "You can only trust someone if you are willing to take a risk."
  • The AI's View: When the researchers checked the AI's brain, it thought Risk and Trust were opposites! To the AI, "Risk" felt more like "Danger" or "Fear," not "Vulnerability."
  • The Lesson: The AI hasn't fully learned the human nuance that sometimes you have to be brave (take a risk) to trust someone. It sees risk as a bad thing, period.
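In fingerprint terms, that "opposites" finding would show up as a negative similarity between the risk and trust directions. The snippet below uses dummy vectors purely to illustrate the sign check; it is not data from the paper.

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins for the real fingerprints, deliberately pointing in opposite directions.
trust_direction = torch.tensor([0.8, 0.5, -0.1])
risk_direction = torch.tensor([-0.7, -0.4, 0.2])

similarity = F.cosine_similarity(trust_direction.unsqueeze(0),
                                 risk_direction.unsqueeze(0)).item()
if similarity < 0:
    print("Risk points away from trust in this space: more like danger than brave vulnerability.")
```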

Why Does This Matter?

Think of the AI as a new employee joining your team.

  • Before: You didn't know if they understood your company culture.
  • Now: You know they have a very sophisticated internal map of how relationships work. They understand that trust is about beliefs and goals, not just checking boxes.

The Takeaway:
This paper provides evidence that AI isn't just a parrot repeating words. It has built a structured, internal representation of human social concepts like trust. By knowing how the AI organizes trust, we can:

  1. Build better AI: We can tweak the AI's "brain" to make it act more trustworthy.
  2. Understand ourselves: It helps us see where human logic and machine logic differ (like the "Risk" example).

In short, the researchers opened the robot's head, found a very human-like map of trust inside, and realized that while the robot is smart, it still needs a little help understanding that sometimes, taking a risk is the first step to trusting someone.