Attention-guided Evidence Grounding for Spoken Question Answering

Imagine you are a detective trying to solve a mystery. You have a spoken question (like a witness describing a crime) and a massive library of documents (the evidence). Your goal is to find the exact page in the library that holds the answer and read it out loud.

This is the challenge of Spoken Question Answering.

The Problem: The "Distracted Detective"

Currently, most AI systems work like a tired, overworked detective.

The Old Way (Cascaded Systems): First, the detective listens to the witness, writes down what they said (transcription), and then starts reading the library.
- The Flaw: If the detective mishears a word while writing it down, the rest of the investigation is ruined. It's also slow because they have to do two separate jobs.
The "Hallucination" Problem: Even if the detective reads the right books, they sometimes make things up. They might say, "The suspect was wearing a red hat," even though the text never mentioned a hat. This is called hallucination. They aren't actually looking at the evidence; they are just guessing based on what they think might be true.

The Solution: AEG (The "Spotlight" Detective)

The paper introduces a new system called AEG (Attention-guided Evidence Grounding). Think of this as giving your detective a magical spotlight.

Instead of reading the whole library blindly, the AI uses its internal "attention mechanism" (a way of deciding what is important) to shine a bright light on the specific sentences that answer the question.

How it Works (The Two-Step Process)

1. The Natural Instinct (Grounding with Attention)
When a human reads a long text, they naturally scan for keywords. The AI tries to do the same. It looks at the spoken question and the text, and it asks, "Which parts of this text are glowing the brightest?"

The Issue: In the beginning, the AI's "spotlight" is weak and blurry. It shines on everything equally, making it hard to find the real answer. It's like trying to find a needle in a haystack with a dim flashlight.

2. The Training (Learning to Focus on Evidence - LFE)
This is the paper's secret sauce. The researchers realized the AI needed to be taught how to use its spotlight effectively.

The Analogy: Imagine a student who is bad at studying. They read the whole textbook but remember nothing. The teacher (the LFE training) says, "No, don't just read everything. Highlight only the sentences that answer the specific question."
The Result: Through this special training, the AI learns to ignore the noise (irrelevant text) and focus its "spotlight" intensely on the Key Evidence. It learns to say, "I am 100% sure the answer is in this paragraph, and I will ignore the rest."

Why This is a Big Deal

1. No More "Fake News" (Reduced Hallucinations)
Because the AI marks the exact sentences it relied on before it answers, it is much less likely to make things up, and an answer that isn't backed by the marked text is easy to spot. It's like a lawyer who must cite the specific page of the law book before making an argument.

2. Speed and Efficiency
Old systems are like a relay race: Person A runs, passes the baton to Person B, who runs, then passes to Person C. If Person A trips, everyone fails.

AEG is like a single sprinter who does the whole job in one go. It listens to the question and finds the answer simultaneously.
The Result: In the paper's comparison, its latency is up to about 62% lower than that of the cascaded systems, and it doesn't lose information by transcribing the audio first.

3. Trustworthy
The system can show you exactly where it found the answer. It highlights the text (the evidence) so you can verify it yourself. It's transparent, not a "black box."

Summary

In simple terms, this paper teaches AI to stop guessing and start looking. By training the AI to focus its "mental spotlight" on the most important parts of a text when answering a spoken question, the system becomes:

Faster (no need to transcribe audio first).
More Accurate (less lying/hallucinating).
More Trustworthy (it shows its work).

It turns a confused, distracted AI into a sharp, focused detective who knows exactly where to look for the truth.

1. Problem Statement

Spoken Question Answering (Spoken QA) is a cross-modal task where a model must answer a spoken query based on a textual context. Current systems face four primary challenges:

Hallucinations and Factual Inaccuracy: Even with correct context, models often generate answers inconsistent with the source material, limiting their use in high-stakes fields like medicine and law.
Lack of Interpretability: Existing approaches rarely provide explicit evidence grounding, making it difficult for users to verify why an answer was generated.
Limitations of Cascaded Systems: Traditional pipelines (ASR $\to$ LLM) suffer from error propagation (ASR errors degrade downstream performance), high latency, and loss of paralinguistic information.
Diffuse Attention in Cross-Modal Models: While pre-trained Speech Large Language Models (SpeechLLMs) have internal attention mechanisms, their raw attention distributions in cross-modal (speech-text) scenarios are often diffuse and uncalibrated, failing to clearly distinguish relevant evidence from irrelevant context.
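
To make "diffuse" concrete, one option is to measure how close an attention distribution over context segments is to uniform. The sketch below uses normalized entropy for this. That metric is an illustrative assumption and is not taken from the paper, and the two weight vectors are made-up numbers.

```python
import math

def normalized_entropy(weights):
    """Entropy of a non-negative weight vector, scaled to [0, 1].

    1.0 means the weights are spread evenly (fully diffuse);
    values near 0 mean almost all weight sits on one item.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    if len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(weights))

# Hypothetical attention mass over four context segments.
diffuse = [0.26, 0.25, 0.25, 0.24]   # spread almost evenly
focused = [0.02, 0.94, 0.02, 0.02]   # concentrated on segment 1

diffuse_score = normalized_entropy(diffuse)
focused_score = normalized_entropy(focused)
```

Under this measure, the "diffuse" vector scores close to 1 and the "focused" one scores much lower, which is the kind of gap the LFE training described below is meant to open up.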

2. Methodology: Attention-guided Evidence Grounding (AEG)

The authors propose AEG, an end-to-end framework that leverages the internal cross-modal attention of SpeechLLMs to explicitly locate and ground key evidence. The framework consists of two main components:

A. Grounding with Attention (Inference Stage)

This module utilizes the pre-trained model's attention weights to identify relevant context segments without additional training.

Process: During the "prefill" stage of inference, the model processes the audio query and textual context.
Attention Extraction: The system aggregates self-attention weights across specific layers (identified as layers 10–28 in experiments) and heads.
Scoring: Token-level weights are averaged to compute a segment-level importance score $A(c_i)$ for each context segment $c_i$.
Selection & Marking: Segments exceeding a threshold $\tau$ are selected as "key evidence." These segments are then annotated with special markers (e.g., <EVIDENCE> and </EVIDENCE>) before being passed to the decoder for answer generation.
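
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tensor layout, the choice to read attention from the query positions, and the names `segment_scores`, `mark_evidence`, `query_idx`, and `segment_spans` are all hypothetical. The toy tensor has only 2 layers, whereas the paper reports using layers 10–28.

```python
import numpy as np

def segment_scores(attn, query_idx, segment_spans, layers):
    """Compute a segment-level score A(c_i) from attention weights.

    attn: array of shape [num_layers, num_heads, seq_len, seq_len],
          where row t holds the attention paid BY position t (assumed layout).
    query_idx: positions of the spoken-query tokens (assumption: we read
               the attention these positions pay to the context).
    segment_spans: list of (start, end) token ranges, end exclusive.
    layers: indices of the layers to aggregate.
    """
    # Average over the selected layers and all heads -> [seq_len, seq_len]
    pooled = attn[layers].mean(axis=(0, 1))
    # Average over query positions -> one weight per token, [seq_len]
    token_weight = pooled[query_idx].mean(axis=0)
    # Average token weights inside each segment
    return [float(token_weight[start:end].mean()) for start, end in segment_spans]

def mark_evidence(segments, scores, tau):
    """Wrap segments whose score exceeds tau in evidence markers."""
    return [
        f"<EVIDENCE>{text}</EVIDENCE>" if score > tau else text
        for text, score in zip(segments, scores)
    ]

# Toy input: 6 context tokens (3 segments of 2 tokens), then 2 query tokens.
attn = np.zeros((2, 2, 8, 8))
attn[:, :, 6:8, :] = [0.05, 0.05, 0.30, 0.30, 0.05, 0.05, 0.10, 0.10]

segments = ["Paris is in France.", "The Eiffel Tower opened in 1889.", "Bread is tasty."]
spans = [(0, 2), (2, 4), (4, 6)]

scores = segment_scores(attn, query_idx=[6, 7], segment_spans=spans, layers=[0, 1])
marked = mark_evidence(segments, scores, tau=0.1)
```

In this toy case only the middle segment clears the threshold, so it alone is wrapped in the markers before the context would be handed on for answer generation.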

B. Learning to Focus on Evidence (LFE) (Training Stage)

To address the issue of diffuse attention in pre-trained models, the authors introduce LFE, a supervised fine-tuning (SFT) paradigm.

Objective: To "teach" the SpeechLLM to perform a "scan-then-focus" process, sharpening its attention on critical information and filtering out noise.
Training Task: Formulated as a selection generation task. Given a spoken query and a set of candidate contexts, the model is trained to generate only the ground-truth evidence segments.
Mechanism: By forcing the model to reconstruct only the relevant evidence, the auto-regressive loss function naturally penalizes attention on non-evidence tokens, effectively calibrating the attention distribution to distinguish query-relevant segments from irrelevant context.
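
A rough sketch of how one LFE training example could be assembled follows. It assumes the common SFT convention of masking the prompt so that the loss is computed only on the target tokens; the summary does not say whether the authors do exactly this. The token IDs, `build_lfe_example`, and `masked_nll` are hypothetical, and next-token shifting is omitted for brevity.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def build_lfe_example(prompt_ids, evidence_ids):
    """Build (input_ids, labels) for one training example.

    prompt_ids: tokens for the spoken query plus ALL candidate contexts.
    evidence_ids: tokens of the ground-truth evidence segments only.
    Prompt positions are masked, so only the evidence is supervised.
    """
    input_ids = list(prompt_ids) + list(evidence_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(evidence_ids)
    return input_ids, labels

def masked_nll(token_log_probs, labels):
    """Average negative log-likelihood over supervised positions only.

    token_log_probs[i] is the log-probability the model assigns to the
    correct token at position i (values here are made up).
    """
    kept = [-lp for lp, label in zip(token_log_probs, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Hypothetical IDs: 3 prompt tokens, 2 evidence tokens.
input_ids, labels = build_lfe_example(prompt_ids=[1, 2, 3], evidence_ids=[7, 8])
loss = masked_nll([-5.0, -5.0, -5.0, -0.5, -1.5], labels)
```

Because the target contains only the relevant segments, the model can lower this loss only by locating those segments among the candidates, which is the pressure that the "Mechanism" point describes.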

3. Key Contributions

AEG Framework: A novel end-to-end framework that transforms implicit attention patterns in SpeechLLMs into explicit evidence markers, enhancing both factual accuracy and interpretability.
LFE Paradigm: A specialized fine-tuning method that significantly improves the model's ability to discriminate key evidence in cross-modal speech-text scenarios, overcoming the limitations of raw pre-trained attention.
Efficiency and Robustness: The approach avoids the error propagation of cascaded ASR systems by operating directly on audio embeddings in the latent space, while simultaneously reducing inference latency.

4. Experimental Results

The method was evaluated on SQuAD, HotpotQA, and MuSiQue datasets using various SpeechLLMs (including GPT-4o Audio, Qwen3-Omni, and LongCat-Flash-Omni).

Answer Accuracy (EM): AEG with LFE consistently outperformed baselines across all models. For example, on the 30B-parameter Qwen3-Omni-30B-A3B, it achieved improvements of 1.93%, 2.73%, and 0.87% on HotpotQA, MuSiQue, and SQuAD respectively compared to the baseline.
Evidence Grounding Precision: The LFE module drastically improved evidence selection. On SQuAD, the F1 score for evidence grounding jumped from 43.49% (AEG w/o LFE) to 80.02% (AEG with LFE).
Comparison with Cascaded Systems:
- Latency: AEG achieved an average latency of 238 ms, significantly faster than cascaded systems (ASR + Reranker), which ranged from 405 ms to 625 ms (a reduction of roughly 41% against the fastest cascaded system and 62% against the slowest).
- Performance: AEG (8.2B params) outperformed the strongest cascaded baseline (Whisper-Large-v3 + Qwen3-Reranker-8B, 9.6B params) in F1 score (80.02% vs. 79.14%) and Hit Rate (91.16% vs. 90.11%), while avoiding ASR transcription errors.
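
For readers who want to check the arithmetic or reproduce the metrics, here is a small sketch. The set-based F1 is the standard definition. The Hit Rate definition (at least one gold segment selected) is an assumption, since the summary does not define it. The function names are hypothetical.

```python
def grounding_f1(predicted, gold):
    """F1 between the selected segment IDs and the gold evidence IDs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def is_hit(predicted, gold):
    """Assumed Hit Rate criterion: at least one gold segment was selected."""
    return bool(set(predicted) & set(gold))

def latency_reduction(new_ms, old_ms):
    """Relative latency reduction of new_ms compared with old_ms."""
    return 1 - new_ms / old_ms

# Latency figures reported above: 238 ms vs. 405 ms and 625 ms.
vs_fastest_cascaded = latency_reduction(238, 405)  # about 0.41
vs_slowest_cascaded = latency_reduction(238, 625)  # about 0.62
```

The second figure shows that the reported ~62% reduction corresponds to the comparison with the slowest cascaded configuration.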

5. Significance

This work represents a significant step forward in reliable and interpretable Spoken QA. By explicitly grounding answers in evidence through a trainable attention mechanism, AEG mitigates the "hallucination" problem common in generative AI. Furthermore, its end-to-end efficiency makes it highly suitable for real-time applications where low latency and high factual accuracy are critical, offering a superior alternative to traditional cascaded architectures that suffer from error accumulation and high computational costs.
