C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

This paper introduces C2-Faith, a benchmark derived from PRM800K that evaluates how well LLM judges assess causal and coverage faithfulness in chain-of-thought reasoning. It finds that judge performance varies significantly by task, and that judges struggle both to localize errors and to score incomplete reasoning accurately.

Avni Mittal, Rauno Arike

Published 2026-03-06

Imagine you are a teacher grading a student's math homework. The student has written down a long, step-by-step solution to a problem.

In the past, teachers (and AI models) mostly cared about one thing: Did the student get the right answer? If the final number was correct, the student got an A.

But recently, we've realized that's not enough. A student might get the right answer by guessing, by copying the answer key, or by writing a bunch of nonsense that sounds smart but doesn't actually make sense. This is called being unfaithful. They have the right destination, but the map they drew to get there is broken.

This paper introduces a new tool called C2-Faith to test how good AI "teachers" (Judges) are at spotting these broken maps.

The Two Ways a Map Can Be Broken

The authors realized there are two main ways a student's reasoning can be "unfaithful," and they named them Causality and Coverage.

  1. Causality (The "Domino Effect" Test):
    Imagine a line of dominoes. If you knock over the first one, the second must fall because of the first.

    • The Test: Does Step B logically happen because of Step A?
    • The Failure: If the student writes, "I multiplied by 5," and the next step says, "Therefore, I added 2," that's a broken domino. The second step didn't follow from the first. It's a Causal error.
  2. Coverage (The "Missing Puzzle Pieces" Test):
    Imagine a puzzle where the student jumps from the picture on the box to the finished picture, skipping 50 pieces in the middle.

    • The Test: Did the student show all the necessary steps to get there?
    • The Failure: If the student says, "The answer is 42," but skipped the part where they did the actual math, they have low Coverage. The reasoning is incomplete, even if the final number is right.

The Experiment: Tricking the AI Teachers

To test if AI judges are good at spotting these errors, the researchers created a "fake exam."

  • The Setup: They took perfect, correct math solutions from a huge database (PRM800K).
  • The Trick: They secretly swapped out one step with a fake, nonsensical one (breaking Causality) OR they deleted a bunch of steps in the middle (breaking Coverage).
  • The Challenge: They asked three top-tier AI models (GPT-4.1, DeepSeek-V3.1, and o4-mini) to act as judges and find the errors.
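The two corruptions above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's actual code: the function names, the distractor string, and the 50% deletion fraction are all illustrative assumptions, and a solution is assumed to be a simple list of step strings.

```python
import random

def break_causality(steps, distractor="Therefore, I added 2."):
    """Swap one random middle step for a non-sequitur (a causal break).

    Returns the corrupted chain and the index of the broken step,
    so a judge's guess can be checked against the ground truth.
    """
    corrupted = list(steps)
    i = random.randrange(1, len(corrupted) - 1)  # keep first and last steps intact
    corrupted[i] = distractor
    return corrupted, i

def break_coverage(steps, drop_fraction=0.5):
    """Delete a contiguous block of middle steps (a coverage break).

    The first and last steps survive, so the solution still *looks*
    complete at a glance -- exactly the surface cue that fooled the judges.
    """
    n_drop = max(1, int(len(steps) * drop_fraction))
    start = random.randrange(1, len(steps) - n_drop)
    return steps[:start] + steps[start + n_drop:], (start, start + n_drop)
```

Because both functions return the location of the damage, the same corrupted chains can later be used to score how precisely each judge localizes the error.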

The Surprising Results

The results were like a game of "Rock, Paper, Scissors." No single AI was the best at everything.

  • The "Spotter" (DeepSeek-V3.1):
    This AI was amazing at looking at two specific steps and saying, "Hey, these don't match!" It was the best at spotting the Causal errors.

    • Analogy: It's like a mechanic who is great at checking if two specific gears fit together, but maybe not so good at looking at the whole engine.
  • The "Detective" (o4-mini):
    This AI was the best at looking at the entire long chain of reasoning and finding exactly where the mistake happened. It was also the most balanced overall.

    • Analogy: It's like a detective who can read a whole novel and point to the exact page where the plot hole is.
  • The "Optimist" (All of them):
    When it came to Coverage (missing steps), all the AI judges were too nice. Even when the student deleted 70% of the math steps, the AI judges still gave them high scores.

    • Analogy: It's like a teacher who sees a student's essay with half the words missing, but because the first and last sentences sound good, the teacher gives it an A. The AI judges were fooled by the "surface look" of the answer.

The "Early Bird" Bias

The researchers also found something funny: When the AI judges tried to find the broken step, they almost always guessed it happened earlier than it actually did.

  • Analogy: If the mistake was in the middle of the story, the AI would say, "I think the problem started in the first chapter!" They are always suspicious of the beginning.
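One simple way to quantify this tendency (an illustrative metric, not necessarily the exact statistic the paper reports) is the mean signed difference between where the judge says the error is and where it actually is:

```python
def localization_bias(true_indices, predicted_indices):
    """Mean signed error of a judge's guessed error positions.

    A negative value means the judge tends to place the broken step
    EARLIER in the chain than it actually is (the "early bird" bias);
    zero would mean no systematic bias in either direction.
    """
    errors = [p - t for t, p in zip(true_indices, predicted_indices)]
    return sum(errors) / len(errors)

# Hypothetical example: errors truly at steps 10, 12, 8,
# but the judge guessed steps 4, 9, 8.
bias = localization_bias([10, 12, 8], [4, 9, 8])  # -3.0: three steps too early
```

On this toy data the judge is, on average, three steps too suspicious of the beginning of the chain.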

Why This Matters

This paper tells us that we can't blindly trust an AI to grade another AI's thinking.

  • If you need to check if a specific step makes sense, use DeepSeek.
  • If you need to check the whole chain of logic or find missing pieces, use o4-mini.
  • Warning: Don't trust AI judges to tell you if a reasoning process is "complete" if a lot of steps are missing; they will likely be too generous.

In short: We now have a better way to measure if an AI is actually thinking through a problem, or just pretending to. And we learned that different AI "teachers" have different strengths and weaknesses, just like human teachers do.