Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

This study demonstrates that ChatGPT-based coding of communication data performs consistently across gender and racial/ethnic subgroups, matching human rater reliability and validating its potential for large-scale collaborative assessments.

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

Published 2026-03-09

Imagine you are trying to grade a massive pile of group project conversations. In the past, you would have to hire a team of human teachers to read every single chat message, decide what "skill" each message showed (like "sharing an idea" or "being rude"), and write it down. This is slow, expensive, and exhausting.

Recently, super-smart AI chatbots like ChatGPT have gotten good enough to read these chats and do the grading for us almost instantly. But, just as with any new employee, we have to ask: "Is this AI fair? Does it grade everyone the same way, regardless of whether they are a man or a woman, or from a specific racial background?"

This paper is the report card for that AI employee. Here is the breakdown in simple terms:

1. The Setup: The "Grading Machine"

The researchers took three different types of group challenges:

  • The Negotiation: A group trying to plan a fundraiser where everyone wants different things.
  • The Decision: A group trying to pick the best apartment, but everyone only knows half the facts.
  • The Puzzle: A group trying to solve a secret code where letters equal numbers.

They had real people chat about these tasks. Then, two things happened to every single chat message (a code sketch of step 2 follows the list):

  1. A Human Expert read it and tagged it with a category (e.g., "This is 'Sharing Information'").
  2. ChatGPT read the exact same message and tried to tag it with the same category.
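To make step 2 concrete, here is a minimal sketch of what the AI tagging might look like in code, assuming the OpenAI Python client. The category names, prompt wording, and model name are illustrative stand-ins, not the authors' actual codebook or settings.

```python
# A minimal sketch of the AI tagging step, assuming the OpenAI Python
# client. Categories, prompt, and model are illustrative stand-ins.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["Sharing Information", "Negotiating",
              "Maintaining Communication", "Off-Task"]  # hypothetical labels

def tag_message(message: str) -> str:
    """Ask the model to assign exactly one category to a chat message."""
    prompt = (
        "Classify the following group-chat message into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}.\n"
        f"Message: {message}\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",   # placeholder; the paper's exact model may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # keep the coding as deterministic as possible
    )
    return response.choices[0].message.content.strip()
```

Running every message through both the human expert and something like `tag_message` produces two parallel columns of labels, which is exactly what the fairness tests below compare.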

2. The Big Question: Is the AI Biased?

The researchers were worried that the AI might be like a strict teacher who accidentally likes one type of student over another. Maybe the AI was trained on internet data that makes it understand how "White men" talk better than how "Black women" talk. If the AI is biased, it might give unfair scores to certain groups, which would ruin the test results.

To check this, they ran three specific "fairness tests" (a sketch of the agreement check follows the list):

  • Test 1: The Agreement Check. Did the AI and the Human agree on the tags for everyone equally? Or did they disagree more often for Black students than White students?
  • Test 2: The Reliability Check. Is the AI's grading consistent? If the AI grades a message today, will it grade a similar message tomorrow the same way? Does this consistency hold up for all groups?
  • Test 3: The "Second Opinion" Check. If a second human teacher read the messages, would they agree with the AI just as much as they agree with the first human teacher? This checks if the AI is acting like a human would.
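Here is a sketch of how the agreement check might be computed, assuming Cohen's kappa as the agreement statistic (a standard chance-corrected choice; the paper's exact metric may differ) and a tiny made-up dataset.

```python
# A sketch of the "agreement check" on toy data, assuming Cohen's kappa.
# Column names, labels, and subgroups are made up for illustration.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.DataFrame({
    "human_tag": ["share", "negotiate", "share", "off_task", "share", "negotiate"],
    "ai_tag":    ["share", "negotiate", "off_task", "off_task", "share", "share"],
    "subgroup":  ["A", "A", "A", "B", "B", "B"],  # e.g., a demographic group
})

# Compute human-AI agreement separately within each subgroup.
for name, sub in df.groupby("subgroup"):
    kappa = cohen_kappa_score(sub["human_tag"], sub["ai_tag"])
    print(f"Subgroup {name}: human-AI kappa = {kappa:.2f}")
```

Kappa is preferred over raw percent agreement because it corrects for agreement that would happen by chance; the fairness question is then whether the per-subgroup kappas are statistically indistinguishable from each other.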

3. The Results: The AI Passed (With One Tiny Quirk)

The Good News:
For the most part, the AI is fair. It graded men and women, and people of different races, with the same level of accuracy and consistency. It didn't seem to have a "favorite" group. The AI is ready to help scale up these tests so we can assess thousands of students at once without hiring an army of teachers.

The One Weird Glitch:
There was one small hiccup in the "Negotiation" task. The data showed that the AI agreed with the Human Expert less often for Black participants than for White participants.

But here is the twist: The researchers dug deeper and realized this wasn't because the AI was being mean or biased against Black participants.

  • The Analogy: Imagine the AI is a basketball referee and the human expert is a coach. For the White team, the referee and coach agreed on 99% of the calls (maybe because the White team played very predictably). For the Black team, they agreed on only 80% of the calls. At first glance, that gap looks alarming.
  • The Reality: When the researchers checked how often two human coaches agree with each other, the answer was about 80% as well. So the Black team's 80% was perfectly normal, right at the level where humans agree with humans. The "gap" only looked big because the White team's agreement was unusually high, not because the Black team's was unusually low. The AI wasn't failing the Black team; the White team just happened to be a "perfect match" for the AI's training data in that specific game. (The sketch after this list walks through the arithmetic.)
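Here is the referee arithmetic spelled out, with illustrative numbers echoing the analogy above, not the paper's reported values.

```python
# The referee arithmetic, spelled out. All numbers are illustrative,
# echoing the analogy above, not the paper's reported values.
ai_white, ai_black = 0.99, 0.80   # AI vs. the first human expert
hh_white, hh_black = 0.82, 0.81   # second human vs. first human (typical level)

print(f"AI-human gap: {ai_white - ai_black:.2f}")   # 0.19 -- looks alarming
print(f"But AI-human agreement for Black participants ({ai_black:.2f}) "
      f"matches the human-human baseline ({hh_black:.2f}).")
# Conclusion: the gap is driven by the unusually HIGH White-group
# agreement, not by unusually LOW Black-group agreement.
```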

4. The Conclusion: A Helpful Assistant, Not a Replacement

The paper concludes that ChatGPT is a fantastic tool for grading communication skills. It is fast, cheap, and, most importantly, fair across different groups of people.

However, the authors add a few important warnings:

  • It's not magic: We still need humans to check the work, especially for complex tasks.
  • It's a work in progress: As AI gets smarter, we need to keep testing it to make sure it stays fair.
  • The "Score" matters: Even if the AI grades every chat message fairly, we still need to make sure the final score given to a student is fair.

In short: We found a robot that can grade group chats as well as a human teacher, and it treats everyone fairly. It's a huge step forward for testing skills in the real world, but we still need to keep an eye on it to make sure it keeps doing a good job.