Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

This study demonstrates that ChatGPT-based coding of communication data performs consistently across gender and racial/ethnic subgroups, matching human rater reliability and validating its potential for large-scale collaborative assessments.

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

Published 2026-03-09

Imagine you are trying to grade a massive pile of group project conversations. In the past, you would have to hire a team of human teachers to read every single chat message, decide what "skill" each message showed (like "sharing an idea" or "being rude"), and write it down. This is slow, expensive, and exhausting.

Recently, super-smart AI chatbots like ChatGPT have gotten good enough to read these chats and do the grading for us almost instantly. But, just as with any new employee, we have to ask: "Is this AI fair? Does it grade everyone the same way, regardless of whether they are a man or a woman, or from a specific racial background?"

This paper is the report card for that AI employee. Here is the breakdown in simple terms:

1. The Setup: The "Grading Machine"

The researchers took three different types of group challenges:

  • The Negotiation: A group trying to plan a fundraiser where everyone wants different things.
  • The Decision: A group trying to pick the best apartment, but everyone only knows half the facts.
  • The Puzzle: A group trying to solve a secret code where letters equal numbers.

They had real people chat about these tasks. Then, two things happened to every single chat message (a code sketch of step 2 follows the list):

  1. A Human Expert read it and tagged it with a category (e.g., "This is 'Sharing Information'").
  2. ChatGPT read the exact same message and tried to tag it with the same category.
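To make step 2 concrete, here is a minimal sketch of what the AI tagging might look like in code, assuming the OpenAI Python client. The category names, prompt wording, and model name are illustrative stand-ins, not the authors' actual codebook or settings.

```python
# A minimal sketch of the AI tagging step, assuming the OpenAI Python
# client. Categories, prompt, and model are illustrative stand-ins.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["Sharing Information", "Negotiating",
              "Maintaining Communication", "Off-Task"]  # hypothetical labels

def tag_message(message: str) -> str:
    """Ask the model to assign exactly one category to a chat message."""
    prompt = (
        "Classify the following group-chat message into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}.\n"
        f"Message: {message}\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",   # placeholder; the paper's exact model may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # keep the coding as deterministic as possible
    )
    return response.choices[0].message.content.strip()
```

Running every message through both the human expert and something like `tag_message` produces two parallel columns of labels, which is exactly what the fairness tests below compare.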

2. The Big Question: Is the AI Biased?

The researchers were worried that the AI might be like a strict teacher who accidentally likes one type of student over another. Maybe the AI was trained on internet data that makes it understand how "White men" talk better than how "Black women" talk. If the AI is biased, it might give unfair scores to certain groups, which would ruin the test results.

To check this, they ran three specific "fairness tests" (a sketch of the agreement check follows the list):

  • Test 1: The Agreement Check. Did the AI and the Human agree on the tags for everyone equally? Or did they disagree more often for Black students than White students?
  • Test 2: The Reliability Check. Is the AI's grading consistent? If the AI grades a message today, will it grade a similar message tomorrow the same way? Does this consistency hold up for all groups?
  • Test 3: The "Second Opinion" Check. If a second human teacher read the messages, would they agree with the AI just as much as they agree with the first human teacher? This checks if the AI is acting like a human would.
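Here is a sketch of how the agreement check might be computed, assuming Cohen's kappa as the agreement statistic (a standard chance-corrected choice; the paper's exact metric may differ) and a tiny made-up dataset.

```python
# A sketch of the "agreement check" on toy data, assuming Cohen's kappa.
# Column names, labels, and subgroups are made up for illustration.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.DataFrame({
    "human_tag": ["share", "negotiate", "share", "off_task", "share", "negotiate"],
    "ai_tag":    ["share", "negotiate", "off_task", "off_task", "share", "share"],
    "subgroup":  ["A", "A", "A", "B", "B", "B"],  # e.g., a demographic group
})

# Compute human-AI agreement separately within each subgroup.
for name, sub in df.groupby("subgroup"):
    kappa = cohen_kappa_score(sub["human_tag"], sub["ai_tag"])
    print(f"Subgroup {name}: human-AI kappa = {kappa:.2f}")
```

Kappa is preferred over raw percent agreement because it corrects for agreement that would happen by chance; the fairness question is then whether the per-subgroup kappas are statistically indistinguishable from each other.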

3. The Results: The AI Passed (With One Tiny Quirk)

The Good News:
For the most part, the AI is fair. It graded men and women, and people of different races, with the same level of accuracy and consistency. It didn't seem to have a "favorite" group. The AI is ready to help scale up these tests so we can assess thousands of students at once without hiring an army of teachers.

The One Weird Glitch:
There was one small hiccup in the "Negotiation" task. The data showed that the AI agreed with the Human Expert less often for Black participants than for White participants.

But here is the twist: The researchers dug deeper and realized this wasn't because the AI was being mean or biased against Black participants.

  • The Analogy: Imagine the AI is a basketball referee and the human expert is a coach. For the White team, the referee and coach agreed on 99% of the calls (maybe because the White team played very predictably). For the Black team, they agreed on only 80% of the calls. At first glance, that gap looks alarming.
  • The Reality: When the researchers checked how often two human coaches agree with each other, the answer was about 80% as well. So the Black team's 80% was perfectly normal, right at the level where humans agree with humans. The "gap" only looked big because the White team's agreement was unusually high, not because the Black team's was unusually low. The AI wasn't failing the Black team; the White team just happened to be a "perfect match" for the AI's training data in that specific game. (The sketch after this list walks through the arithmetic.)
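Here is the referee arithmetic spelled out, with illustrative numbers echoing the analogy above, not the paper's reported values.

```python
# The referee arithmetic, spelled out. All numbers are illustrative,
# echoing the analogy above, not the paper's reported values.
ai_white, ai_black = 0.99, 0.80   # AI vs. the first human expert
hh_white, hh_black = 0.82, 0.81   # second human vs. first human (typical level)

print(f"AI-human gap: {ai_white - ai_black:.2f}")   # 0.19 -- looks alarming
print(f"But AI-human agreement for Black participants ({ai_black:.2f}) "
      f"matches the human-human baseline ({hh_black:.2f}).")
# Conclusion: the gap is driven by the unusually HIGH White-group
# agreement, not by unusually LOW Black-group agreement.
```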

4. The Conclusion: A Helpful Assistant, Not a Replacement

The paper concludes that ChatGPT is a fantastic tool for grading communication skills. It is fast, cheap, and, most importantly, fair across different groups of people.

However, the authors add a few important warnings:

  • It's not magic: We still need humans to check the work, especially for complex tasks.
  • It's a work in progress: As AI gets smarter, we need to keep testing it to make sure it stays fair.
  • The "Score" matters: Even if the AI grades every chat message fairly, we still need to make sure the final score given to a student is fair.

In short: We found a robot that can grade group chats as well as a human teacher, and it treats everyone fairly. It's a huge step forward for testing skills in the real world, but we still need to keep an eye on it to make sure it keeps doing a good job.