Imagine you are trying to understand the heart and soul of a community. You sit down with 12 different people for long, deep conversations about their hopes, fears, and what matters most to them. This is ethnographic research—a bit like being a detective of human culture.
Traditionally, a team of expert human detectives (anthropologists and economists) would listen to these recordings, take notes, and try to agree on the top three "values" driving each person's life (like Security, Freedom, or Tradition). But this is hard work. It takes forever, and even the experts often disagree with each other because human feelings are messy and complicated.
The Big Question:
Can a super-smart AI (a Large Language Model, or LLM) listen to the same recordings and figure out the values just as well as the human experts? And more importantly, can the AI recognize when the answer is fuzzy, the way a human expert can?
Here is the story of what the researchers found, explained with some everyday analogies.
1. The AI as a "Fast, but Sometimes Overconfident" Intern
The researchers treated the AI models like a team of very fast, very smart interns. They asked the AI to listen to the interviews and pick the top three values for each person.
- The Good News: The AI was surprisingly good at the "big picture." If you asked, "Did the AI pick the right group of values?" (like picking the right three fruits from a basket), it got it right almost as often as the human experts. It's like an intern who can quickly sort a pile of mail into the right bins.
- The Bad News: The AI struggled with the "ranking." If you asked, "Which of these three is the most important?" the AI often got the order wrong. It's like an intern who knows you need milk, eggs, and bread, but puts the bread at the top of the list when you actually needed the milk first. (The sketch after this list shows the difference between the two checks.)
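To make the two checks concrete, here is a minimal Python sketch. The value names and lists are invented for illustration, not taken from the study: the set check asks whether the AI picked the same group of values, while the stricter rank check asks whether it also got the order right.

```python
# Minimal sketch: "right group" vs. "right order" for top-3 value labels.
# All names and data below are illustrative, not from the actual study.

def set_agreement(ai: list[str], human: list[str]) -> float:
    """Jaccard overlap: did the AI pick the right *group* of values?"""
    a, h = set(ai), set(human)
    return len(a & h) / len(a | h)

def exact_rank_match(ai: list[str], human: list[str]) -> bool:
    """Strict check: same values in the same *order*?"""
    return ai == human

ai_top3 = ["Security", "Freedom", "Tradition"]
human_top3 = ["Freedom", "Security", "Tradition"]

print(set_agreement(ai_top3, human_top3))     # 1.0   (same group of values)
print(exact_rank_match(ai_top3, human_top3))  # False (order disagrees)
```

On this toy example the AI scores perfectly on the "group" check but fails the "order" check, which is exactly the pattern the researchers observed.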
2. The "Uncertainty" Test: Does the AI Know What It Doesn't Know?
This is the most interesting part. Human experts know that some interviews are confusing. Sometimes a person talks in circles, and even the experts scratch their heads and say, "I'm not 100% sure what value this is."
The researchers wanted to see if the AI would also say, "I'm not sure," or if it would just confidently guess wrong.
- The Result: The AI was often overconfident. Even when the human experts were confused and disagreed with each other, the AI tended to give a very definite answer. It was like a student taking a test who guesses the answer with 100% certainty even when they have no idea what the question means.
- The Exception: One model, called Qwen, was the "star student." It was the only one that started to mimic the human experts' confusion: when the humans were unsure, Qwen was also a bit more unsure. It was the most "human-like" in its hesitation. (One simple way to measure this is sketched below.)
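If you wanted to check this kind of calibration yourself, one simple approach (a sketch with made-up numbers, not the paper's actual method) is to score each interview by how much the human experts disagree, using entropy, and then see whether the model's confidence drops on the messy ones.

```python
# Calibration sketch: does model confidence fall when experts disagree?
# The interview data and confidence scores below are invented.
from collections import Counter
from math import log2

def expert_entropy(labels: list[str]) -> float:
    """Shannon entropy of the experts' labels: 0 = unanimous, higher = messier."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

# Hypothetical per-interview data: three expert labels plus model confidence.
interviews = [
    {"experts": ["Security", "Security", "Security"], "model_conf": 0.95},
    {"experts": ["Freedom", "Tradition", "Security"], "model_conf": 0.94},
]

for item in interviews:
    h = expert_entropy(item["experts"])
    # A well-calibrated model should report lower confidence as entropy rises;
    # the second row mimics the overconfidence the researchers observed.
    print(f"expert entropy = {h:.2f}   model confidence = {item['model_conf']:.2f}")
```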
3. The "Group Chat" Strategy (Ensembles)
Since no single AI was perfect, the researchers tried a classic committee trick: the Group Chat.
They asked four different AI models to analyze the same interview, and then they took a vote on the final answer.
- The Analogy: Imagine asking four different friends to recall the plot of a movie you all watched. If you take a vote on what actually happened, you usually get a better answer than if you relied on any one friend alone.
- The Result: This "Group Chat" method worked wonders. By pooling the models' answers, the researchers got significantly better results, almost closing the gap between the AI ensemble and the human experts. (The sketch below shows the voting mechanic.)
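Mechanically, the simplest version of this ensemble is a majority vote over each model's top-3 list. Here is a small sketch; the model outputs are invented, and the paper's exact aggregation rule may differ.

```python
# "Group Chat" sketch: pool top-3 lists from several models, keep the three
# values with the most votes. Model outputs below are invented examples.
from collections import Counter

def ensemble_top3(model_outputs: list[list[str]]) -> list[str]:
    """Majority vote: count how often each value appears across the models."""
    votes = Counter()
    for top3 in model_outputs:
        votes.update(top3)
    return [value for value, _ in votes.most_common(3)]

outputs = [
    ["Security", "Freedom", "Tradition"],     # model A
    ["Freedom", "Tradition", "Benevolence"],  # model B
    ["Freedom", "Security", "Tradition"],     # model C
    ["Tradition", "Freedom", "Power"],        # model D
]
print(ensemble_top3(outputs))  # ['Freedom', 'Tradition', 'Security']
```

Note that one model's quirky pick ("Power" or "Benevolence") simply gets outvoted. This basic version ignores each model's ordering; a fancier variant could weight first-place picks more heavily.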
4. The "Security" Bias
There was one funny quirk. The AI models seemed to think everyone was obsessed with Security.
- The Analogy: Imagine a group of friends analyzing a party. The humans say, "Oh, everyone was there to have fun and meet new people." But the AI says, "No, everyone was clearly there just to make sure the exits were safe and the food was fresh."
- The AI kept picking "Security" as a top value far more often than the humans did. This suggests the models have a built-in bias, perhaps because safety-related language is so common in their training data. (A simple way to spot this kind of skew is sketched below.)
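Spotting this kind of skew doesn't require anything fancy. A sketch (with invented labels, not the study's data) is just to count how often each value shows up in the AI's picks versus the experts' picks:

```python
# Bias check sketch: compare how often each value label appears in the
# AI's picks vs. the human experts' picks. All labels below are invented.
from collections import Counter

ai_picks = ["Security", "Security", "Security", "Freedom", "Tradition"]
human_picks = ["Freedom", "Security", "Tradition", "Freedom", "Benevolence"]

ai_freq, human_freq = Counter(ai_picks), Counter(human_picks)
for value in sorted(set(ai_picks) | set(human_picks)):
    print(f"{value:<12} AI: {ai_freq[value]}   Human: {human_freq[value]}")
```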
The Bottom Line
Can AI replace human researchers? Not yet.
- The Promise: AI is a fantastic tool for doing the heavy lifting. It can read hours of interviews in seconds and get the general "vibe" right. It's a great assistant that can speed up the work.
- The Limitation: AI still lacks the "gut feeling" to know when a situation is ambiguous. It tends to be too sure of itself when it should be cautious.
The Takeaway: Think of AI not as a replacement for the human expert, but as a super-fast intern who needs a human manager to double-check the work, especially when the answers aren't clear-cut. If you use the "Group Chat" method (combining multiple AIs) and keep a human in the loop to spot the biases, you can get some incredibly powerful insights into what makes people tick.