When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews

This paper shows that automatic depression detection models trained on semi-structured clinical interviews often achieve inflated performance by exploiting systematic patterns in the interviewer's prompts rather than genuine linguistic cues from the participants. The authors argue that analysis should be restricted to patient utterances if the models' predictions are to be validly interpreted.

Hasindri Watawana, Sergio Burdisso, Diego A. Moreno-Galván, Fernando Sánchez-Vega, A. Pastor López-Monroy, Petr Motlicek, Esaú Villatoro-Tello

Published 2026-03-27

Imagine you are trying to teach a computer to recognize when someone is feeling depressed. To do this, you give the computer thousands of recordings of therapy sessions. In these sessions, a therapist (or a robot acting like one) asks a series of questions, and the patient answers.

The researchers in this paper discovered a sneaky trick that these computers were pulling. Instead of actually listening to the patient's sad or happy words, the computers were learning to "cheat" by looking at the therapist's script.

Here is the story of their discovery, broken down with some everyday analogies.

1. The Setup: The "Scripted" Interview

Think of a clinical interview like a standardized driving test.

  • The Examiner (Interviewer): Has a strict script. "Turn left," "Stop at the red light," "Check your mirrors." They ask these questions in the exact same order for every single driver.
  • The Driver (Patient): Answers freely. Some drivers are nervous and stutter; others are confident and smooth.

The goal is to see if the computer can tell which drivers are "bad" (depressed) just by listening to the conversation.

2. The Problem: The Computer is Cheating

The researchers trained two types of AI models:

  • The "Patient-Only" Model: This AI was only allowed to listen to the driver's answers.
  • The "Interviewer-Only" Model: This AI was only allowed to listen to the examiner's questions.
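In practice, "only allowed to listen to one speaker" just means filtering the transcript by speaker tag before training. Here is a toy sketch of that split; the speaker labels and dialogue lines are invented for illustration, not taken from the paper's dataset:

```python
# Toy transcript: a list of (speaker, utterance) turns. The labels and
# lines below are made up for this sketch.
transcript = [
    ("Interviewer", "How have you been feeling lately?"),
    ("Patient", "Honestly, pretty tired most days."),
    ("Interviewer", "Tell me about your family."),
    ("Patient", "We don't talk much anymore."),
]

def split_by_speaker(turns, patient_label="Patient"):
    """Return (patient_text, interviewer_text) as two joined strings."""
    patient = " ".join(t for s, t in turns if s == patient_label)
    interviewer = " ".join(t for s, t in turns if s != patient_label)
    return patient, interviewer

patient_text, interviewer_text = split_by_speaker(transcript)
print(patient_text)      # feeds the "Patient-Only" model
print(interviewer_text)  # feeds the "Interviewer-Only" model
```

The surprise in the paper is that a model trained on the second string (the questions) can beat one trained on the first (the answers).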

The Shocking Result: The "Interviewer-Only" model often did better than the "Patient-Only" model.

The Analogy: Imagine a teacher grading a test.

  • The Honest Way: The teacher reads the student's essay to see if they understand the material.
  • The Cheating Way: The teacher looks at the order of the questions. "Ah, the student is answering Question 4, which is always about 'family trauma.' If they are answering Question 4, they must be in the 'sad' group."

The AI wasn't learning about depression; it was learning the pattern of the script. In this particular dataset, certain questions (like "How is your family?") showed up at different points, or at different rates, for the two groups, so the diagnosis was partly predictable from the question itself, regardless of what the patient answered.

3. The "Shortcut" (Bias)

The paper calls this "Prompt-Induced Bias."

Think of the interview script like a recipe.

  • If you are baking a cake (diagnosing depression), you want to taste the batter (the patient's words) to see if it's sweet enough.
  • But the AI found a shortcut: It realized that if the recipe says "Add eggs now," the cake is always going to be a "depressed cake" in this specific dataset. So, the AI stopped tasting the batter and just looked at the recipe steps.

Because the script is so consistent (the "examiner" asks the same things in the same order), the AI could guess the diagnosis just by knowing which question was being asked, completely ignoring what the patient actually said.
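The shortcut is easy to demonstrate on synthetic data. In the toy example below (entirely invented, just to illustrate the mechanism), a follow-up question happens to appear only in sessions from one group, so a "classifier" that never reads a single patient answer still scores perfectly:

```python
# Synthetic sessions: the set of interviewer prompts heard, plus the
# ground-truth label. All data here is invented for illustration.
sessions = [
    ({"How are you?", "Tell me about your sleep."}, "depressed"),
    ({"How are you?", "Tell me about your sleep."}, "depressed"),
    ({"How are you?"}, "control"),
    ({"How are you?"}, "control"),
]

# A "model" that ignores the patient entirely: it just checks whether
# the revealing follow-up prompt occurred in the session.
def prompt_only_classifier(prompts):
    return "depressed" if "Tell me about your sleep." in prompts else "control"

accuracy = sum(prompt_only_classifier(p) == y for p, y in sessions) / len(sessions)
print(accuracy)  # 1.0 on this toy data -- without reading a single answer
```

Of course, nothing about this rule generalizes: change the script, and the shortcut evaporates, which is exactly the failure mode the paper warns about.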

4. The Evidence: Heatmaps

The researchers used "heatmaps" (like a weather map showing hot and cold spots) to see where the AI was looking.

  • The Cheating AI (Interviewer Model): The heatmap showed bright, narrow lines. It was laser-focused on specific moments where the therapist asked a specific question. It ignored 90% of the conversation.
  • The Honest AI (Patient Model): The heatmap was spread out. It was looking at the patient's words throughout the whole conversation, which is how it should work.
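One simple way to quantify the "narrow vs. spread out" contrast in those heatmaps is the entropy of the attention weights: low entropy means the model piles its attention on a few turns, high entropy means it spreads attention across the whole conversation. The weights below are invented for illustration and are not the paper's actual attention values:

```python
import math

# Toy attention weights over 10 conversation turns (numbers invented).
# The "interviewer" model concentrates on two turns; the "patient"
# model attends roughly evenly across the conversation.
interviewer_attn = [0.45, 0.45, 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.01, 0.01]
patient_attn = [0.10] * 10

def entropy(weights):
    """Shannon entropy of a distribution: higher = more spread out."""
    return -sum(w * math.log(w) for w in weights if w > 0)

print(entropy(interviewer_attn))  # low: narrow, laser-focused heatmap
print(entropy(patient_attn))      # high: attention spread over all turns
```

A uniform distribution over 10 turns hits the maximum possible entropy (log 10), so the patient-model heatmap in this sketch is as spread out as it can be.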

5. Why This Matters

This is a big deal because many researchers thought that including the therapist's questions helped the AI understand the context better. This paper says: "No, it's actually making the AI lazy."

If we build a real-world AI doctor that relies on these shortcuts, it might fail miserably in a real situation where the therapist asks questions in a different order or uses different words. The AI would be confused because it learned the script, not the human.

The Takeaway

The authors are saying: "Don't let the AI read the script; make it listen to the person."

To build truly helpful AI for mental health, we need to strip away the interviewer's questions and force the computer to learn from the patient's actual words, emotions, and stories. Otherwise, we aren't detecting depression; we're just detecting how well the AI memorized the therapist's checklist.