When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

This paper presents the first empirical study on training LLMs to abstain from answering in temporal question answering. By combining Chain-of-Thought supervision with Reinforcement Learning, the approach significantly outperforms existing models in accuracy and reliability, while also revealing the limitations of implicit reasoning cues and supervised fine-tuning.

Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian

Published 2026-03-05

Imagine you have a very smart, very chatty robot friend who loves to answer your questions. This robot has read almost everything on the internet, so it feels confident about everything. But here's the problem: it never admits when it doesn't know something.

If you ask it a tricky question about history or a specific date, it will often just make up a plausible-sounding answer rather than saying, "I'm not sure." In the world of AI, this is called hallucinating. It's like a student who, instead of saying "I don't know the answer," just guesses loudly and confidently, hoping they get lucky.

This paper is about teaching this robot friend a new, very important skill: The Art of Silence.

The Problem: The Robot Who Won't Shut Up

The researchers focused on Time-Based Questions (like "Who was the President in 1995?"). This is tricky because facts change over time.

  • The Scenario: Imagine asking, "Who was Anna Karina's husband from 1966 to 1967?"
  • The Mistake: The robot might say, "Pierre Fabre!" because it knows that name is associated with her. But it forgets they divorced in 1965.
  • The Reality: The question is actually unanswerable based on the facts provided because the timeline is wrong.
  • The Ideal Robot: A good robot should look at the timeline, realize the dates don't match, and say, "I cannot answer this because the information is contradictory."
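The timeline check the ideal robot performs can be sketched in a few lines of code. This is a minimal illustration, not the paper's method: we treat the known fact as a year interval, and abstain when the years the question asks about fall outside it. The function names and the interval representation are assumptions made for this sketch; the divorce year (1965) and question years (1966-1967) come from the example above.

```python
def overlaps(span_a, span_b):
    """Return True if two (start_year, end_year) intervals intersect."""
    return span_a[0] <= span_b[1] and span_b[0] <= span_a[1]

def answer_or_abstain(fact_span, question_span, candidate_answer):
    """Hypothetical helper: answer only if the fact's timeline
    covers the years the question asks about; otherwise abstain."""
    if overlaps(fact_span, question_span):
        return candidate_answer
    return "I cannot answer: the timeline contradicts the facts."

# The marriage in the example ended in 1965 (start year assumed for the sketch),
# but the question asks about 1966-1967.
marriage = (1961, 1965)
print(answer_or_abstain(marriage, (1966, 1967), "some plausible name"))
# -> abstains, because (1966, 1967) does not overlap (1961, 1965)
```

The point of the sketch: unanswerability here is not about missing knowledge, but about a date range that provably fails to intersect the facts.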

Currently, even the smartest robots (like GPT-4o) are terrible at this. They prefer to guess rather than stay silent.

The Solution: Teaching the Robot to "Know When to Stop"

The researchers tried two main ways to fix this:

1. The "Cram Session" (Supervised Fine-Tuning / SFT)

This is like giving the robot a textbook of correct answers and saying, "Memorize this."

  • The Result: The robot got better at answering questions, but it became overconfident. It started guessing even more confidently on questions it couldn't answer. It learned to talk, but not to listen to its own doubts.

2. The "Video Game Coach" (Reinforcement Learning / RL)

This is where the magic happened. Instead of just giving the robot answers, the researchers set up a game with rewards.

  • The Rules:
    • If the robot gives the right answer: +10 points.
    • If the robot gives a wrong answer: -100 points.
    • If the robot says "I don't know" when the question is unanswerable: +100 points.
    • If the robot says "I don't know" when it could have answered: -50 points.
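The scoring rules above translate directly into a small reward function. A hedged sketch: the point values come from the bullet list, but the function itself and the abstention string are illustrative, not the paper's implementation. The key design choice is the asymmetry: a wrong answer (-100) costs far more than an unnecessary "I don't know" (-50), so the model learns that guessing is riskier than silence.

```python
IDK = "I don't know"  # the abstention response (illustrative string)

def reward(prediction, gold_answer, answerable):
    """Reward shaping from the rules above (values from the bullet list)."""
    if prediction == IDK:
        # +100 for correctly abstaining, -50 for needless silence
        return 100 if not answerable else -50
    if answerable and prediction == gold_answer:
        return 10   # correct answer
    return -100     # confident wrong answer (a hallucination)

print(reward("Paris", "Paris", answerable=True))   # -> 10
print(reward("London", "Paris", answerable=True))  # -> -100
print(reward(IDK, None, answerable=False))         # -> 100
print(reward(IDK, "Paris", answerable=True))       # -> -50
```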

They also taught the robot to think before it speaks (using something called "Chain of Thought"). It's like asking the robot to whisper its thought process to itself before shouting the final answer.
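The "whisper, then shout" setup usually means the model emits its reasoning and its final answer in a structured format that can be parsed apart, so only the final answer gets scored. A minimal sketch of such parsing, assuming hypothetical `<think>`/`<answer>` tags (the paper's actual output format may differ):

```python
import re

def parse_cot(output):
    """Split a chain-of-thought response into (reasoning, final answer).
    The <think>/<answer> tag format is an illustrative assumption."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else output.strip()
    return reasoning, final

sample = (
    "<think>The marriage ended in 1965, but the question asks about "
    "1966-1967, so the dates do not match.</think>"
    "<answer>I don't know</answer>"
)
reasoning, final = parse_cot(sample)
print(final)  # -> I don't know
```

The reasoning span is where the model can notice a timeline mismatch before committing to an answer, which is exactly the behavior the reward then reinforces.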

The Big Surprise:
They used a small robot (only 1.5 billion parameters, its "brain cells") and trained it with this video game coach.

  • The Result: This small robot became better at knowing when to stay silent and when to speak than the massive, super-expensive GPT-4o robot! It learned that silence is golden.

The "Secret Sauce" Analogy

Think of the robot's brain like a kitchen:

  • The Context (The Ingredients): The researchers tried giving the robot different "ingredients" to help it cook the answer.
    • Whole Context: Giving the robot the entire cookbook. (Too much noise!)
    • Knowledge Graphs: Giving the robot a list of facts. (Helpful, but not a game-changer.)
    • Chain of Thought: Asking the robot to write down its recipe step-by-step before cooking. (This was the secret sauce!)

They found that simply giving the robot more information (like more facts or longer texts) didn't help much. But teaching it how to think step-by-step and then rewarding it for being honest about its uncertainty worked wonders.

The Takeaway

This paper proves that silence is a skill, not a bug.

  1. Small can be better: A small, well-trained robot can outperform a giant, untrained one if it knows when to shut up.
  2. Rewards matter: You can't just tell a robot to "be honest." You have to reward it for being honest and punish it for making things up.
  3. Thinking is key: Making the robot "think out loud" (step-by-step reasoning) is the best way to help it figure out if it actually knows the answer or if it's just guessing.

In short, the researchers taught AI that sometimes, the most intelligent thing you can do is say, "I don't know." And in a world full of confident liars, that's a superpower.