Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

This paper introduces a maximum likelihood model to estimate LLM usage in AI conference peer reviews, revealing that between 6.5% and 16.9% of the text in recent reviews was substantially AI-generated, with higher usage correlated with lower reviewer confidence, reviews submitted closer to deadlines, and less engagement with author rebuttals.

Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou

Published 2026-03-04

Imagine you are a judge at a massive, high-stakes talent show. Every year, thousands of experts submit written reviews to decide who gets a prize. These reviews are supposed to be honest, detailed, and unique opinions from human minds.

But recently, a new "ghost writer" has entered the scene: ChatGPT (and other AI tools). It can write reviews that sound very human. The problem? We can't tell just by reading a single review if it was written by a tired human or a super-fast robot. It's like trying to spot a fake diamond by looking at it with the naked eye; sometimes, they look identical.

This paper is about a team of researchers who decided to stop trying to catch the "fake diamonds" one by one. Instead, they built a metal detector for the whole pile of sand.

Here is the story of how they did it, explained simply:

1. The Problem: The "Needle in a Haystack" is Impossible

If you have 10,000 reviews, and you try to check each one individually to see if an AI wrote it, you will fail. Current AI detectors are like bad lie detectors; they get confused easily and often accuse innocent humans of being robots.

The researchers realized: We don't need to know which specific review is fake. We just need to know how much "robot dust" is in the whole pile.

2. The Solution: The "Taste Test" (Distributional Quantification)

Instead of looking at the reviews one by one, the researchers looked at the flavor of the whole batch.

  • The Analogy: Imagine you have a jar of pure orange juice (Human reviews) and a jar of artificial orange soda (AI reviews). You know exactly what both taste like.
  • Now, someone mixes a huge bucket of "Mystery Juice" from the conference reviews.
  • You don't need to taste every single drop to know how much soda is in the bucket. You just take a sip, analyze the flavor profile, and compare it to your pure orange juice and pure soda.
  • If the Mystery Juice tastes 10% like soda, you know 10% of the bucket is artificial.

The researchers did this with words. They found that AI has a specific "accent." For example, AI loves using words like "commendable," "meticulous," "intricate," and "innovative" way more often than humans do. Humans are messier and more varied; AI is a bit too perfect and repetitive with its vocabulary.
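To make the "taste test" concrete, here is a minimal sketch of the idea in Python. The model treats each observed token as a draw from the mixture (1 − α)·P_human + α·P_AI and searches for the α that best explains the mixed corpus. Everything below is simplified and hypothetical: the toy corpora, the six-word vocabulary, and the add-one smoothing are for illustration only; the paper's real pipeline builds its vocabulary from adjectives and validates the estimator much more carefully.

```python
# A minimal sketch of the distributional "taste test," assuming we already
# have a human-written reference corpus and an AI-generated reference corpus.
from collections import Counter

import numpy as np
from scipy.optimize import minimize_scalar

def token_probs(docs, vocab):
    """Estimate P(token) over a fixed vocabulary with add-one smoothing."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return np.array([(counts[w] + 1) / total for w in vocab])

def estimate_alpha(mixed_docs, p_human, p_ai, vocab):
    """Maximum likelihood estimate of the AI fraction alpha.

    Model: each observed token x is drawn from the mixture
        (1 - alpha) * P_human(x) + alpha * P_AI(x).
    """
    idx = {w: i for i, w in enumerate(vocab)}
    toks = np.array([idx[t] for doc in mixed_docs
                     for t in doc.lower().split() if t in idx], dtype=int)

    def neg_log_lik(alpha):
        mix = (1 - alpha) * p_human[toks] + alpha * p_ai[toks]
        return -np.log(mix).sum()

    # Search for the alpha in [0, 1] that best explains the mixed corpus.
    return minimize_scalar(neg_log_lik, bounds=(0.0, 1.0), method="bounded").x

# Toy usage: "AI-flavored" adjectives show up more in the AI reference corpus.
vocab = ["commendable", "meticulous", "intricate", "innovative", "good", "clear"]
p_human = token_probs(["a good clear paper", "clear writing and good results"], vocab)
p_ai = token_probs(["a commendable and meticulous study",
                    "innovative and intricate work"], vocab)
print(estimate_alpha(["a commendable clear paper"], p_human, p_ai, vocab))
```

On this toy data the estimator lands near 0.25: the "Mystery Juice" tastes about one-quarter like soda. The real analysis runs the same idea over millions of tokens, which is why the population-level estimate can be stable even though any single review stays ambiguous.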

3. The Experiment: The "AI vs. Human" Showdown

They tested this "flavor test" on reviews from top computer science conferences (like ICLR and NeurIPS) and compared them to reviews from Nature portfolio journals (a different corner of science).

The results were shocking:

  • The AI Conferences: After ChatGPT was released in late 2022, the "robot flavor" in the reviews spiked. They estimated that between 6.5% and 16.9% of the text in these reviews was substantially written or heavily edited by AI.
    • Translation: At the high end of that range, roughly 1 out of every 6 sentences in a review may have come from a robot.
  • The Nature Journals: In contrast, the reviews for Nature journals showed no spike. They remained almost 100% human. This suggests that AI adoption varies wildly depending on the field.

4. The "Tell-Tale Signs" (Who is using the AI?)

The researchers also looked at when and how people were using these tools. They found some funny and concerning patterns:

  • The Deadline Panic: The closer reviewers got to the deadline, the more "robot flavor" appeared. It's like students waiting until the last minute to use a cheat sheet. (A sketch of this subgroup analysis follows this list.)
  • The "Et Al." Effect: Reviews that cited other scientists (using "et al.") had less AI. Reviews that didn't cite anyone had more. It seems AI is good at writing fluff but bad at remembering specific names and papers.
  • The "Low Confidence" Link: Reviewers who admitted they weren't very confident in their assessment were more likely to use AI. It's like someone saying, "I don't really know this topic, so I'll ask the robot to write it for me."
  • The "Ghost" Reviewers: Reviewers who used AI were less likely to reply to authors' questions later. They did the bare minimum, let the robot do the work, and then disappeared.

5. The Big Worry: "Homogenization"

The most interesting finding wasn't just about how much AI was used, but what it did to the conversation.

When humans write reviews, the results are diverse. Some are angry, some are confused, some are poetic, some are blunt. It's a chaotic, beautiful human mess.
When AI writes reviews, they all start to sound the same. They become homogenized.

  • The Analogy: Imagine a choir where everyone sings a different note. It's a rich, complex chord. Now imagine everyone starts singing the exact same note at the exact same volume. It's technically "in tune," but it's boring and lacks soul.
  • The researchers found that the more AI was used, the more the reviews sounded like each other. This is dangerous for science because we need different perspectives to catch errors and spark new ideas. If everyone's review sounds like the same robot, we lose that diversity of thought. (One simple way to measure this "sameness" is sketched below.)
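One plausible way to put a number on that "sameness" is the average pairwise similarity of a batch of reviews, sketched below with TF-IDF vectors. This is an illustrative proxy, not necessarily the metric the paper itself uses; higher scores mean the reviews sound more alike.

```python
# Score a batch of reviews by how much they resemble each other, using
# average pairwise TF-IDF cosine similarity as a simple homogeneity proxy.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(reviews):
    """Average cosine similarity over all pairs of reviews (higher = more alike)."""
    tfidf = TfidfVectorizer().fit_transform(reviews)
    sims = cosine_similarity(tfidf)
    upper = np.triu_indices(len(reviews), k=1)  # exclude self-similarity
    return sims[upper].mean()

# Under the paper's finding, batches of reviews with a higher estimated AI
# fraction should score higher here than mostly-human batches.
```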

The Bottom Line

This paper isn't saying "AI is evil" or "Reviewers are lazy." It's saying: "We have a new tool, and it's changing the landscape faster than we can see."

The researchers built a new, super-fast, super-accurate way to measure this change without needing to catch every single cheater. They found that in the world of AI research, a significant chunk of the "human" conversation is now being generated by machines, and this is making our scientific discussions sound a bit more robotic and less diverse.

The takeaway? We need to have a serious conversation about how we use these tools in science, so we don't accidentally turn our brilliant, diverse community of experts into a room full of identical robots.