Imagine you are trying to teach a very smart, but slightly stubborn, robot assistant how to watch a live surgery and answer questions about what's happening.
The robot is already good at reading text, but when it watches a video, it tends to ignore the moving pictures and just guess the answer based on the words you used to ask the question. If you ask, "Is the doctor moving the tool forward?" it might guess "forward" just because that's a common answer, even if the video shows the tool moving backward. This is called linguistic bias.
The researchers in this paper wanted to fix this so the robot actually watches the video, paying attention to how things change from one second to the next. Here is how they did it, explained simply:
1. The Problem: The "Snapshot" Robot
Standard AI models often look at a video like a stack of still photos. They might look at one frame, then the next, but they don't really "talk" to each other to understand the story.
- The Analogy: Imagine trying to understand a movie by looking at a single, frozen photo every 5 seconds. You might see a car, but you won't know if it's speeding up, braking, or crashing. You might guess the car is "fast" just because you've seen fast cars in other movies, not because you saw this car move.
2. The Solution: "TemporalDoRA"
The team created a new training method called TemporalDoRA. Think of it as giving the robot a pair of "temporal glasses" that force it to connect the dots between frames.
They did two clever things to build these glasses:
The "Group Chat" in the Brain:
Usually, when the robot learns, it processes each frame independently. The researchers inserted a special "Group Chat" (called Multi-Head Attention) right inside the robot's learning module.
- The Analogy: Imagine a classroom where students usually work alone on their homework. The teacher (TemporalDoRA) says, "Before you turn in your answer, you must talk to your neighbors for 10 seconds to compare notes." This way, if one student missed a detail in the video, their neighbor might have caught it. The robot learns to mix information from different moments in time to get the full picture.
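The "group chat" between frames is just attention applied across time. Here is a minimal single-head NumPy sketch of the idea; the paper's actual module uses multi-head attention inside the fine-tuning branch, and all names, shapes, and initializations below are made up for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(frames, Wq, Wk, Wv):
    """Let each frame's feature vector 'compare notes' with every other frame.

    frames: (T, d) array, one feature vector per video frame.
    Wq/Wk/Wv: (d, d) projection matrices (hypothetical single-head version).
    """
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    scores = Q @ K.T / np.sqrt(frames.shape[1])  # (T, T): frame-to-frame affinity
    weights = softmax(scores, axis=-1)           # each frame's "who do I listen to"
    return weights @ V                           # each output is a mix of all frames

rng = np.random.default_rng(0)
T, d = 8, 16                                     # 8 frames, 16-dim features
frames = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
mixed = temporal_attention(frames, Wq, Wk, Wv)
print(mixed.shape)                               # same (T, d) shape, but every
                                                 # frame now carries temporal context
```

The key point the sketch shows: the output for frame t is a weighted average over *all* frames, so a detail visible only at frame 3 can influence the representation of frame 7.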
The "Fine-Tuning" Trick:
Retraining a giant AI from scratch is like rebuilding a whole house just to fix a leaky faucet: too expensive and too risky. Instead, they used PEFT (Parameter-Efficient Fine-Tuning), which is like just replacing the faucet.
- The Analogy: Most methods try to adjust the entire faucet (the whole weight matrix of the AI). TemporalDoRA is smarter: it only adjusts the handle (the low-rank branch) while keeping the main pipe frozen. This ensures the robot doesn't forget everything it already knew (the "frozen backbone") while still learning the new skill of watching time pass.
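The "handle vs. pipe" idea can be made concrete with numbers. Below is a minimal NumPy sketch of a DoRA-style weight update: a frozen pretrained matrix plus a tiny trainable low-rank branch, rescaled by a trainable per-column magnitude. The variable names and initializations are illustrative, not the paper's code:

```python
import numpy as np

def dora_update(W0, A, B, m):
    """DoRA-style adapted weight: frozen backbone plus a small low-rank 'handle'.

    W0: (d_out, d_in) frozen pretrained weight (the 'main pipe').
    B:  (d_out, r) and A: (r, d_in) -- the trainable low-rank branch, r << d.
    m:  (d_in,) trainable per-column magnitude (DoRA's extra knob over LoRA).
    """
    W = W0 + B @ A                                    # directional update
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)
    return m * (W / col_norms)                        # rescale each column by m

rng = np.random.default_rng(1)
d_out, d_in, r = 32, 32, 4        # rank 4: only 2*32*4 + 32 trainable numbers,
W0 = rng.normal(size=(d_out, d_in))                   # vs. 32*32 for full tuning
A = np.zeros((r, d_in))           # zero init so the update starts as a no-op
B = rng.normal(size=(d_out, r)) * 0.01
m = np.linalg.norm(W0, axis=0)    # initialize magnitudes from the backbone
W_adapted = dora_update(W0, A, B, m)
print(np.allclose(W_adapted, W0))  # True: before training, nothing is forgotten
```

Because `A` starts at zero, the adapted weight equals the frozen backbone exactly; training then nudges only `A`, `B`, and `m`, which is why the method is cheap and doesn't erase what the model already knows.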
3. The New Test: "REAL-Colon-VQA"
To prove their method works, they didn't just use standard tests. They built a new dataset called REAL-Colon-VQA.
- The Analogy: Imagine a driving test where the instructor asks, "Is the car turning left?" If the robot just memorized that "turning left" is the answer to that specific sentence, it would pass.
- The Twist: The researchers asked the exact same question in 20 different ways (e.g., "Is the vehicle steering port?" vs. "Is the car going left?"). This is the Out-of-Template test. If the robot is just guessing based on words, it will fail. If it actually watched the video, it will get it right every time, no matter how you ask.
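The out-of-template idea boils down to a consistency check: ask the same fact many ways and score each phrasing. Here is a toy Python sketch with a deliberately "biased" model that keys on surface words; the `model_answer` interface is hypothetical, not the benchmark's actual API:

```python
def out_of_template_accuracy(model_answer, paraphrases, gold):
    """Score a model on many rewordings of the same question.

    model_answer: function mapping a question string to an answer string.
    paraphrases: question strings that all mean the same thing.
    gold: the single correct answer for all of them.
    A model that truly watched the video answers every phrasing correctly;
    a model guessing from wording fails on unfamiliar phrasings.
    """
    hits = sum(model_answer(q) == gold for q in paraphrases)
    return hits / len(paraphrases)

# A toy 'biased' model that just looks for the literal word 'left':
biased = lambda q: "yes" if "left" in q else "no"
questions = ["Is the car going left?",
             "Is the vehicle steering port?",
             "Does the car turn to the left side?"]
# The biased model misses the 'steering port' rewording and scores 2/3:
print(out_of_template_accuracy(biased, questions, gold="yes"))
```

A model that actually grounded its answer in the video would score 1.0 here regardless of wording, which is exactly the behavior the out-of-template split is designed to measure.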
4. The Results
When they tested their new "TemporalDoRA" robot:
- It got much better at answering questions when they were phrased differently.
- It stopped guessing based on word patterns and started looking at the actual video evidence.
- It did all this without needing a supercomputer to retrain the whole model; it was lightweight and efficient.
The Bottom Line
TemporalDoRA is like teaching a student to stop memorizing the answers to specific questions and start actually understanding the story. By forcing the AI to "chat" between video frames and only tweaking the parts of its brain that need changing, it becomes a much more reliable assistant for doctors, capable of spotting critical, short-lived moments in surgery that a human might miss.