ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

This paper introduces ESG-Bench, a human-annotated benchmark dataset designed to evaluate and mitigate hallucinations in large language models when analyzing complex ESG reports, demonstrating that Chain-of-Thought prompting and fine-tuning on this dataset significantly improve factual accuracy and generalizability.

Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song

Published 2026-03-16

Imagine you are a detective trying to solve a mystery, but instead of a crime scene, your "crime scene" is a massive, 300-page corporate report about how a company treats the environment, its workers, and its bosses. This is an ESG Report (Environmental, Social, and Governance).

In the past, companies wrote these reports voluntarily. Now, many governments are saying, "You must write these, and they must be true." But these reports are huge, confusing, and sometimes companies try to trick people by exaggerating their good deeds (a practice called "greenwashing").

Enter AI (Large Language Models). You might think, "Great! Let's just ask the AI to read these 300 pages and tell us the truth."

Here is the problem: AI is a confident liar.

If you ask an AI a question about a specific page in a 300-page document, it might not actually look at the page. Instead, it might guess based on what it "remembers" from its general training. It might say, "Oh, they definitely planted 1,000 trees!" when the report actually said they planted 10. This is called a hallucination. In the world of finance and law, a confident lie is dangerous.

The Solution: ESG-Bench

The researchers in this paper built a gym for AI called ESG-Bench.

Think of ESG-Bench as a specialized training ground where they created a massive library of real ESG reports and a set of tricky questions about them.

  • The Test: They asked an AI to answer questions based only on the text provided.
  • The Grading: Human experts (PhD students in sustainability) acted as the referees. They checked every AI answer.
    • Correct: The AI found the fact in the text.
    • Hallucination: The AI made something up or missed a fact that was right there.
    • The Twist: They also taught the AI when to say, "I don't know," if the answer wasn't in the text.

The Secret Weapon: Chain-of-Thought (CoT)

The paper found that simply feeding the AI more data didn't fix the lying. The AI needed to learn how to think before it spoke.

The researchers used a technique called Chain-of-Thought (CoT). Imagine asking a student to solve a math problem.

  • Bad AI: Just blurts out the answer. "The answer is 42!" (Even if it's wrong).
  • CoT AI: Is forced to write down its steps first.
    1. "Okay, the question asks about water usage."
    2. "I need to scan the report for 'water'."
    3. "I found a table on page 45."
    4. "The table says 500 liters."
    5. "Therefore, the answer is 500 liters."

The researchers created a special "4-Step CoT" method for ESG reports. They taught the AI to:

  1. Identify the topic.
  2. Search the document for clues.
  3. Decide if the answer is actually there.
  4. Only then give the answer (or admit defeat).
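The four steps above map naturally onto a prompt template. The wording below is a hypothetical sketch of how such a prompt might look, not the paper's actual prompt text:

```python
# Hypothetical 4-step CoT prompt template; the paper's exact wording differs.
FOUR_STEP_COT_PROMPT = """You are analyzing an ESG report. Answer using ONLY the text below.

Report excerpt:
{report_text}

Question: {question}

Think step by step:
1. Topic: state what the question is asking about.
2. Evidence: quote the passage(s) from the report that mention this topic.
3. Answerability: decide whether the quoted evidence actually answers the question.
4. Answer: if yes, give the answer grounded in the evidence;
   if no, reply exactly "Not found in the provided text."
"""

prompt = FOUR_STEP_COT_PROMPT.format(
    report_text="In 2023 we reduced water usage to 500 litres per unit (p. 45).",
    question="What was the company's water usage per unit in 2023?",
)
print(prompt)
```

The crucial design choice is step 3: by forcing an explicit answerability check before the answer, the model has a sanctioned way to abstain instead of guessing.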

The Results

When they trained the AI using this "think-first" method, the results were amazing:

  • Less Lying: The AI made up far fewer facts.
  • Better Honesty: When the answer wasn't in the report, the AI learned to say, "I can't find that," instead of guessing.
  • General Skills: This training didn't just help with ESG reports; it made the AI better at answering questions in other fields too (like biology or general trivia).

The Big Picture

Think of this paper as a new driver's education course for AI.
Before, AI drivers would speed through a city, guessing where the stop signs were, often causing accidents (hallucinations).
Now, thanks to ESG-Bench, the AI drivers are learning to:

  1. Look at the road signs (the document).
  2. Check their mirrors (search the text).
  3. Stop if they aren't sure (abstain from answering).

This is crucial because as AI starts handling more important tasks like checking corporate laws or medical records, we can't afford for it to be a confident liar. We need it to be a careful, honest detective.
