Believing vs. Achieving: The Disconnect between Efficacy Beliefs and Collaborative Outcomes

This study reveals that while persistent efficacy beliefs drive systematic "AI optimism" and influence delegation decisions, they have a weaker impact on actual human-AI team performance, suggesting that transparency-focused approaches may be insufficient for optimizing collaborative outcomes.

Philipp Spitzer, Joshua Holstein

Published Thu, 12 Ma

Here is an explanation of the paper "Believing vs. Achieving" using simple language and creative analogies.

The Big Picture: The "Trust Gap"

Imagine you are a captain steering a ship (the Human), and you have a high-tech autopilot system (the AI). Your job is to decide: Do I steer this ship myself, or do I let the autopilot take over for this specific stretch of water?

This paper investigates a strange disconnect in how captains make that decision. The researchers found that what you believe about your skills and the AI's skills before you start often clashes with how you actually judge the situation in the moment.

They call this the gap between "Believing" (your general confidence) and "Achieving" (the actual result of your teamwork).


The Experiment: The "Income Guessing Game"

The researchers set up a game where 240 people had to guess if a person earns more than $50,000 a year based on their age, job, and education.

  • The Human: The participant.
  • The AI: A computer program that was pretty good at guessing (about 77% accurate).
  • The Choice: For each person, the participant could either guess themselves or say, "You know what, AI, you handle this one."

Before the game started, they asked the players: "How good are you at this?" and "How good is the AI?" (These are the General Beliefs).
During the game, for every single guess, they asked again: "How good are you at THIS specific guess?" and "How good is the AI at THIS specific guess?" (These are the Instance Judgments).

They also gave some players "cheat sheets" (Contextual Information):

  1. Data Sheet: Showing how the data is distributed (e.g., "Most people with a PhD earn over $50k").
  2. AI Report: Showing where the AI makes mistakes (e.g., "The AI is bad at guessing for people under 25").
  3. Both: A combination of the two.
  4. Nothing: Just the game.
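
To make the setup concrete, here is a minimal Python sketch of the game loop. This is not the authors' code: the 77% AI accuracy comes from the paper, but the human accuracy and the simple "delegate if the AI seems better" rule are illustrative assumptions.

```python
import random

AI_ACCURACY = 0.77     # reported in the paper
HUMAN_ACCURACY = 0.65  # assumed for illustration

def play_round(self_judgment, ai_judgment):
    """One instance: delegate if the AI is judged more capable,
    then draw a correct/incorrect outcome from the actor's accuracy."""
    delegate = ai_judgment > self_judgment
    accuracy = AI_ACCURACY if delegate else HUMAN_ACCURACY
    correct = random.random() < accuracy
    return delegate, correct

random.seed(42)
rounds = [play_round(self_judgment=random.uniform(0.4, 0.9),
                     ai_judgment=random.uniform(0.4, 0.9))
          for _ in range(100)]
print(f"delegated {sum(d for d, _ in rounds)}/100 rounds, "
      f"team score {sum(c for _, c in rounds)}/100")
```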

The Three Big Surprises

1. The "Self-Confidence Anchor" (You are stubborn about yourself)

The Metaphor: Imagine you are a chef who believes, "I am a great cook." Even if you burn a specific steak, you still think, "I'm a great cook; that steak was just weird."
The Finding: People's belief in their own abilities (Self-Efficacy) was like a heavy anchor. No matter what "cheat sheets" they were given, they stuck to their original belief about how good they were. If they thought they were good generally, they thought they were good for every specific guess.

  • Result: The "cheat sheets" didn't change how people saw themselves.

2. The "AI Optimism" Bias (You think the robot is smarter in the moment)

The Metaphor: Imagine you think your GPS is "okay" generally. But when you are stuck in traffic and the GPS reroutes you perfectly, you suddenly think, "Wow, this GPS is a genius!" You forget your general skepticism and give it a temporary boost of confidence.
The Finding: People had a systematic bias called "AI Optimism." Even if they thought the AI was just "okay" generally, when they looked at a specific task, they suddenly thought, "Oh, the AI will definitely crush this one!"

  • The Twist: The only thing that curbed this optimism was the AI Report, which spelled out exactly where the AI fails. The Data Sheet didn't help; people needed to see the AI's specific weaknesses.
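
To see the bias as a number, here is a hedged sketch (my own encoding, not the paper's analysis): "AI optimism" is simply how much a participant's instance-by-instance judgments of the AI exceed their general belief about it. All values below are hypothetical.

```python
import statistics

def ai_optimism(general_belief, instance_judgments):
    """Mean instance-level AI judgment minus the general AI belief.
    Positive values mean the AI gets a boost 'in the moment'."""
    return statistics.mean(instance_judgments) - general_belief

# Hypothetical participant: rates the AI 0.6 overall...
general = 0.6
# ...but keeps judging it near 0.76 on individual guesses.
instances = [0.78, 0.72, 0.76, 0.74, 0.80]
print(f"optimism bias: {ai_optimism(general, instances):+.2f}")  # +0.16
```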

3. The "Amplifier" Effect (More info makes you more emotional, not smarter)

The Metaphor: Imagine you are driving and someone hands you a map. You think, "Okay, I know the road better now." But instead of driving more calmly, you start making more dramatic decisions: if you feel confident, you drive faster; if you trust the car's autopilot more, you hand over the wheel sooner.
The Finding: Giving people more information (Data or AI reports) didn't necessarily make their teamwork better. Instead, it made their decisions more sensitive to their feelings.

  • If they felt slightly less confident than usual, they were far more likely to hand the task to the AI.
  • If they felt the AI was slightly better than usual, they were far more likely to hand the task over.
  • The Problem: This made their delegation behavior (who does the work) swing wildly, but it did not improve the final score. They were making more "emotional" choices, not "smarter" ones.
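
Here is a rough sketch of that amplifier, under an assumed logistic choice rule (the paper does not prescribe this exact form). The same small deviations from your usual beliefs swing the delegation decision much harder when a sensitivity knob, standing in for the extra information, is turned up.

```python
import math

def p_delegate(self_dev, ai_dev, sensitivity):
    """Probability of delegating: feeling worse than usual (negative
    self_dev) or rating the AI higher than usual (positive ai_dev)
    pushes toward the AI; `sensitivity` amplifies both signals."""
    return 1 / (1 + math.exp(-sensitivity * (ai_dev - self_dev)))

deviations = (-0.1, +0.1)  # feeling a bit worse, AI looking a bit better
print("no cheat sheet: ", round(p_delegate(*deviations, sensitivity=2), 2))   # 0.6
print("with cheat sheet:", round(p_delegate(*deviations, sensitivity=6), 2))  # 0.77
# Same feelings either way; the information only makes the choice swing harder.
```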

The Core Problem: "Believing" vs. "Achieving"

The most important takeaway is this: People's gut feelings about who should do the work do not match what actually works best.

  • The Disconnect: The study found that the factors driving people to delegate (like "I feel the AI is great right now") had a huge impact on who did the work, but almost zero impact on whether the team actually got the right answer.
  • The Analogy: It's like a basketball coach who keeps swapping players based on who "feels hot" in the moment. The coach makes a lot of swaps (high delegation activity), but the team's score doesn't go up because the swaps weren't actually based on who was the best player for that specific play.
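
The coach analogy can be checked with a toy simulation. Everything below is assumed except the 77% AI accuracy: a "swingy" policy that reacts strongly to a momentary feeling uncorrelated with actual skill reshuffles who does the work, yet scores about the same as a steady policy.

```python
import random

AI_ACC, HUMAN_ACC, TRIALS = 0.77, 0.65, 50_000  # 0.77 from the paper

def team_accuracy(delegate_prob):
    """Average score when each round delegates with a probability
    that may depend on a (skill-irrelevant) momentary feeling."""
    hits = 0
    for _ in range(TRIALS):
        feels_great = random.random() < 0.5          # gut feeling about the AI
        p = delegate_prob(feels_great)
        acc = AI_ACC if random.random() < p else HUMAN_ACC
        hits += random.random() < acc
    return hits / TRIALS

random.seed(1)
swingy = team_accuracy(lambda great: 0.9 if great else 0.3)  # lots of swaps
steady = team_accuracy(lambda great: 0.6)                    # same average rate
print(f"swingy policy: {swingy:.3f}  steady policy: {steady:.3f}")  # ~equal
```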

What Should Designers Do? (The Advice)

The paper suggests that just showing people more data or explanations (Transparency) isn't enough. Here is the new advice:

  1. Don't just show the AI's stats; show the Human's bias. Help people realize, "Hey, you are being stubborn about your own skills," or "Hey, you are getting too excited about the AI right now."
  2. Target the "Root Beliefs," not just the moment. Instead of just helping people make one good decision, help them understand their general relationship with AI before they start the task.
  3. Separate "Learning" from "Doing." Give people complex data to help them understand the system (Calibration), but give them simple, clear tools to help them make the actual decision (Decision Support). Don't mix them up, or people get overwhelmed and make emotional mistakes.

Summary

We often think that if we give humans more information about an AI, they will work with it perfectly. This paper says: No.
Humans have a stubborn belief in themselves and a temporary, inflated belief in the AI. Giving them more info doesn't fix this; it just makes their decisions more volatile. To build better teams, we need to design systems that help humans see their own biases, not just systems that show them more charts.