Imagine you are teaching a student to recognize different steps in a complex recipe, like making a sandwich. You give them a video tutorial and a checklist of what happens when (e.g., "First, spread the peanut butter," then "Add the jelly").
Now, imagine someone made a mistake in the checklist. Maybe they wrote "Add the jelly" before "Spread the peanut butter," or they labeled a picture of a banana as "peanut butter."
If you let the student study with this messy checklist, they will get confused. They will try to memorize the steps, but every time they look at the "banana" picture labeled "peanut butter," their brain will scream, "Wait, that doesn't make sense!" They will struggle to learn that specific part, no matter how many times they review it.
This paper introduces a clever way to find those mistakes in the checklist by watching how the student struggles.
Here is the breakdown of their idea, "Loss Knows Best," using simple analogies:
1. The Core Idea: The "Struggle Score"
In machine learning, when a computer model tries to learn from data, it makes mistakes. We measure these mistakes with something called "Loss." Think of Loss as a "Struggle Score."
- Low Struggle Score: The model understands the concept easily. It's like the student getting a question right on the first try.
- High Struggle Score: The model is confused. It's like the student staring at a question, scratching their head, and getting it wrong over and over.
Usually, as a student studies more (training over many "epochs"), their Struggle Score goes down for everything because they are learning.
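In concrete terms, the Struggle Score is just the standard classification loss (cross-entropy): minus the log of the probability the model assigns to the labeled class. A minimal sketch, with made-up class names and probabilities:

```python
import math

def struggle_score(predicted_probs, true_label):
    """Cross-entropy loss for one frame: -log of the probability the
    model assigned to the labeled class. Low when the model is
    confident and correct, high when it is confident and wrong."""
    return -math.log(predicted_probs[true_label])

# A frame the model has learned well: 95% confidence in the correct label.
easy = struggle_score({"peanut_butter": 0.95, "banana": 0.05}, "peanut_butter")

# A mislabeled frame: the model "sees" a banana, but the label insists
# on peanut butter, so the loss stays high.
hard = struggle_score({"peanut_butter": 0.05, "banana": 0.95}, "peanut_butter")
```

Here `easy` comes out near zero (about 0.05) while `hard` is large (about 3.0), which is exactly the gap the method exploits.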
2. The Secret Weapon: The "Cumulative Sample Loss" (CSL)
The authors realized that good labels (correct instructions) become easy to learn quickly. The Struggle Score drops fast and stays low.
However, bad labels (mistakes in the video) are different.
- If a frame is labeled "peanut butter" but shows a "banana," the model can never truly learn it. No matter how much it studies, the Struggle Score stays high because the instruction is wrong.
- If the steps are in the wrong order (e.g., "Eat the sandwich" before "Make the sandwich"), the model gets confused about the timeline. The Struggle Score becomes erratic and jagged.
The authors created a metric called Cumulative Sample Loss (CSL). Imagine taking a photo of the student's Struggle Score at every single moment of their study session and averaging them all together.
- The "Easy" Frames: The average score is low.
- The "Bad" Frames: The average score is high.
By looking at this average "Struggle Score," the system can point a red flag at the specific frames in the video that are likely mislabeled or out of order.
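The averaging itself is simple: for each frame, keep its loss from every epoch and take the mean. A sketch with hypothetical per-epoch loss values (the numbers below are illustrative, not from the paper):

```python
def cumulative_sample_loss(loss_history):
    """CSL for one frame: the average of that frame's loss across
    every training epoch (one 'snapshot' of the Struggle Score per epoch)."""
    return sum(loss_history) / len(loss_history)

# Hypothetical loss curves over five epochs.
correct_frame = [2.1, 0.9, 0.3, 0.1, 0.05]    # drops fast, stays low
mislabeled_frame = [2.3, 2.1, 2.4, 2.2, 2.5]  # never improves

csl_good = cumulative_sample_loss(correct_frame)  # ~0.69
csl_bad = cumulative_sample_loss(mislabeled_frame)  # ~2.30
```

Because the average remembers the whole training history, a frame that was confusing all along ends up with a much higher CSL than one the model mastered early.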
3. How It Works (The Process)
- The Study Session: They train a video model normally, but they save a "snapshot" of the model's brain after every single day of study (every training epoch).
- The Review: They take each video from the dataset and run it through every saved snapshot of the model's brain.
- The Diagnosis: They calculate the average Struggle Score for every single frame.
- If a frame is consistently hard to learn (high CSL), the system says, "Hey, this part of the video is probably labeled wrong!"
- If the Struggle Score spikes suddenly at a specific moment, it might mean the order of events is wrong (temporal disordering).
4. Why This Is Special
- No Magic Crystal Ball: You don't need to know where the mistakes are beforehand. The system finds them on its own just by watching how the model learns.
- It Catches Two Types of Errors:
- Wrong Label: Like calling a dog a cat.
- Wrong Order: Like putting the "finish" step before the "start" step.
- It's Efficient: Once the model is trained, finding the errors is fast and doesn't require retraining the whole system.
The Real-World Impact
The authors tested this on two very different types of videos:
- Surgery Videos (Cholec80): Watching surgeons remove gallbladders. If a step is labeled wrong, it could be dangerous. This system helps ensure the training data for future AI surgeons is perfect.
- DIY Videos (EgoPER): Videos of people making coffee or sandwiches. These are often messy and have human errors in the descriptions. This system helps clean up those datasets so AI can learn to do chores better.
Summary
Think of this paper as a detective that uses the model's confusion as a clue. Instead of trying to guess which video frames are wrong, it asks the model: "Which parts of this video made you struggle the most?" The parts where the model struggled the most are the ones where the human annotators likely made a mistake.
It turns the model's "pain" (high loss) into a powerful tool for cleaning up data, ensuring that the AI we build in the future learns from the truth, not from typos and mix-ups.