ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

This paper introduces ESG-Bench, a human-annotated benchmark dataset designed to evaluate and mitigate hallucinations in large language models when analyzing complex ESG reports, demonstrating that Chain-of-Thought prompting and fine-tuning on this dataset significantly improve factual accuracy and generalizability.

Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song

Published 2026-03-16

Imagine you are a detective trying to solve a mystery, but instead of a crime scene, your "crime scene" is a massive, 300-page corporate report about how a company treats the environment, its workers, and its bosses. This is an ESG Report (Environmental, Social, and Governance).

In the past, companies wrote these reports voluntarily. Now, many governments are saying, "You must write these, and they must be true." But these reports are huge, confusing, and sometimes companies try to trick people by exaggerating their good deeds (a practice called "greenwashing").

Enter AI (Large Language Models). You might think, "Great! Let's just ask the AI to read these 300 pages and tell us the truth."

Here is the problem: AI is a confident liar.

If you ask an AI a question about a specific page in a 300-page document, it might not actually look at the page. Instead, it might guess based on what it "remembers" from its general training. It might say, "Oh, they definitely planted 1,000 trees!" when the report actually said they planted 10. This is called a hallucination. In the world of finance and law, a confident lie is dangerous.

The Solution: ESG-Bench

The researchers in this paper built a gym for AI called ESG-Bench.

Think of ESG-Bench as a specialized training ground where they created a massive library of real ESG reports and a set of tricky questions about them.

  • The Test: They asked an AI to answer questions based only on the text provided.
  • The Grading: Human experts (PhD students in sustainability) acted as the referees. They checked every AI answer.
    • Correct: The AI found the fact in the text.
    • Hallucination: The AI made something up or missed a fact that was right there.
    • The Twist: They also taught the AI when to say, "I don't know," if the answer wasn't in the text.

The Secret Weapon: Chain-of-Thought (CoT)

The paper found that simply feeding the AI more data didn't fix the lying. The AI needed to learn how to think before it spoke.

The researchers used a technique called Chain-of-Thought (CoT). Imagine asking a student to solve a math problem.

  • Bad AI: Just blurts out the answer. "The answer is 42!" (Even if it's wrong).
  • CoT AI: Is forced to write down its steps first.
    1. "Okay, the question asks about water usage."
    2. "I need to scan the report for 'water'."
    3. "I found a table on page 45."
    4. "The table says 500 liters."
    5. "Therefore, the answer is 500 liters."

The researchers created a special "4-Step CoT" method for ESG reports. They taught the AI to:

  1. Identify the topic.
  2. Search the document for clues.
  3. Decide if the answer is actually there.
  4. Only then give the answer (or admit defeat).
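The four steps above map naturally onto a prompt template. The wording below is a hypothetical sketch of how such a prompt might look, not the paper's actual prompt text:

```python
# Hypothetical 4-step CoT prompt template; the paper's exact wording differs.
FOUR_STEP_COT_PROMPT = """You are analyzing an ESG report. Answer using ONLY the text below.

Report excerpt:
{report_text}

Question: {question}

Think step by step:
1. Topic: state what the question is asking about.
2. Evidence: quote the passage(s) from the report that mention this topic.
3. Answerability: decide whether the quoted evidence actually answers the question.
4. Answer: if yes, give the answer grounded in the evidence;
   if no, reply exactly "Not found in the provided text."
"""

prompt = FOUR_STEP_COT_PROMPT.format(
    report_text="In 2023 we reduced water usage to 500 litres per unit (p. 45).",
    question="What was the company's water usage per unit in 2023?",
)
print(prompt)
```

The crucial design choice is step 3: by forcing an explicit answerability check before the answer, the model has a sanctioned way to abstain instead of guessing.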

The Results

When they trained the AI using this "think-first" method, the results were amazing:

  • Less Lying: The AI made up far fewer facts.
  • Better Honesty: When the answer wasn't in the report, the AI learned to say, "I can't find that," instead of guessing.
  • General Skills: This training didn't just help with ESG reports; it made the AI better at answering questions in other fields too (like biology or general trivia).

The Big Picture

Think of this paper as a new driver's education course for AI.
Before, AI drivers would speed through a city, guessing where the stop signs were, often causing accidents (hallucinations).
Now, thanks to ESG-Bench, the AI drivers are learning to:

  1. Look at the road signs (the document).
  2. Check their mirrors (search the text).
  3. Stop if they aren't sure (abstain from answering).

This is crucial because as AI starts handling more important tasks like checking corporate laws or medical records, we can't afford for it to be a confident liar. We need it to be a careful, honest detective.
