Enhancing Continual Learning for Software Vulnerability Prediction: Addressing Catastrophic Forgetting via Hybrid-Confidence-Aware Selective Replay for Temporal LLM Fine-Tuning

Imagine you are the head of a security team for a massive, ever-changing city. Your job is to spot dangerous buildings (software vulnerabilities) before they collapse.

In the past, you might have hired a detective who studied a giant photo album of old buildings. But here's the problem: buildings change. New materials are used, new construction methods appear, and old blueprints become useless. If your detective only studies the photo album from 2018, they will fail to spot a new type of trapdoor invented in 2024.

This paper is about teaching a super-smart AI detective (a Large Language Model) how to keep learning as the city evolves, without forgetting everything it learned yesterday.

Here is the story of how they did it, explained simply:

1. The Problem: The "Forgetting" Detective

The researchers tried to train an AI on software code. But they noticed a big issue called Catastrophic Forgetting.

The Analogy: Imagine a student studying for a history exam. They study the 1990s perfectly. Then, they study the 2000s. When they take the test, they ace the 2000s questions but have completely forgotten the 1990s.
In the paper: If the AI learns only from the newest code, it forgets how to spot older types of bugs. If it relearns everything from scratch every time, training takes too long and old, outdated patterns can get in the way.

2. The Solution: The "Smart Flashcard" System

The team tested eight different ways to help the AI remember. They found that the best method was something they called Hybrid-CASR.

Think of this like a Smart Flashcard System for the AI:

The Old Way (Window-Only): The AI throws away all old flashcards and only studies the newest ones. It learns fast but forgets the past.
The Expensive Way (Cumulative Training): The AI tries to read every single flashcard it has ever seen, from day one to today. This is accurate but takes forever (like reading the entire library every night).
The Hybrid-CASR Way (The Winner): The AI keeps a small, special box of flashcards. But it's not just a random box.
1. It picks the "Hard Ones": It keeps cards for the bugs it is unsure about (the ones it keeps getting wrong).
2. It balances the deck: In the real world, "Fixed" code is common, and "Vulnerable" code is rare. If the AI just picks random hard cards, it might only see "Fixed" code. Hybrid-CASR forces the box to have an equal mix of "Vulnerable" and "Fixed" cards so the AI doesn't get biased.

3. The Experiment: A Time-Travel Test

Most existing evaluations test the AI by shuffling the data randomly (like mixing up a deck of cards). But the researchers said, "No, that's cheating!"

Their Rule: You can only use knowledge from the previous period to predict bugs in the next one. You cannot peek at the future.
The Setup: They ran this test across 42 two-month periods (from 2018 to 2024).

4. The Surprising Findings

Time doesn't matter as much as we thought: Whether they taught the AI in 1-month chunks or 12-month chunks, the results were almost the same. The AI is surprisingly flexible.
More data isn't always better: Training on all history (the Cumulative method) scored only slightly higher than training on the newest data alone, and it took about 16 times longer to run. It wasn't worth the wait.
The Winner: The Hybrid-CASR method was the "Goldilocks" solution. It was:
- Accurate: It had the best overall score (a Macro-F1 of about 0.67, a measure that weighs both kinds of code equally).
- Fast: It was much quicker than re-reading the whole history.
- Stable: It didn't forget the old bugs as easily as the others.

5. The Real-World Takeaway

The paper concludes that while AI is getting better at spotting software bugs, it's not a magic wand yet.

The AI is a great assistant, not a replacement. It can flag potential issues, but a human still needs to double-check them.
Efficiency is key. You don't need a supercomputer to keep your security AI up to date. In these experiments, a smart, selective memory system (like Hybrid-CASR) scored at least as well as retraining on everything and was much cheaper to run.

In a nutshell: The researchers taught an AI detective to keep a "highlighted notebook" of its hardest cases, balanced between vulnerable and fixed examples. This allowed it to stay sharp in a changing world without burning out or forgetting its past.

1. Problem Statement

The paper addresses the critical challenge of deploying Large Language Models (LLMs) for software vulnerability detection in real-world, evolving environments.

Temporal Distribution Shift: Vulnerability patterns change over time (concept drift). Most existing evaluations use random train-test splits, which ignore temporal order, leading to data leakage and overestimated performance.
Catastrophic Forgetting: When models are updated incrementally with new data (Continual Learning), they tend to "forget" previously learned vulnerability patterns.
Class Imbalance: Vulnerable functions are often a minority compared to fixed/non-vulnerable code, and this ratio fluctuates over time.
Computational Constraints: Retraining on all historical data (cumulative training) is computationally prohibitive for frequent updates, while simple "window-only" training leads to poor retention of past knowledge.

2. Methodology

Experimental Setup

Model: The authors use microsoft/phi-2 (a 2.7B parameter decoder-only LLM) adapted with LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
Dataset: A CVE-linked dataset spanning 2018–2024, derived from the CVEfixes database. It includes function-level instances (vulnerable vs. fixed) primarily in C/C++.
Temporal Protocol: The timeline is segmented into bi-monthly windows (42 windows total). The evaluation follows a strict forward-chaining protocol:
- Models are trained on window $W_t$ and tested on $W_{t+1}$.
- Backward retention is tested on $W_{t-k}$ (lags 1, 3, 5, 6) to measure forgetting.
- This prevents temporal leakage, ensuring models only use knowledge available at the time of prediction.
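The forward-chaining schedule above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the windows start in January 2018 and run through December 2024, and the names `window_index` and `forward_chaining_plan` are hypothetical.

```python
from datetime import date

def window_index(d: date, start_year: int = 2018, months_per_window: int = 2) -> int:
    """Map a date to its 0-based window index, counting from January of start_year."""
    months_since_start = (d.year - start_year) * 12 + (d.month - 1)
    return months_since_start // months_per_window

def forward_chaining_plan(n_windows: int, lags=(1, 3, 5, 6)):
    """Yield (train_window, forward_test_window, backward_test_windows) per step."""
    for t in range(n_windows - 1):  # the last window has no W_{t+1} to test on
        # Backward tests only exist for lags that stay inside the timeline.
        backward = [t - k for k in lags if t - k >= 0]
        yield t, t + 1, backward

# Assumed timeline: January 2018 through December 2024 in two-month windows.
n_windows = window_index(date(2024, 12, 31)) + 1
plan = list(forward_chaining_plan(n_windows))
print(n_windows)   # 42 windows under the assumed start and end dates
print(plan[6])     # (6, 7, [5, 3, 1, 0])
```

Because every test window index is strictly greater than its training window index, no step can see data from its own future. Changing `months_per_window` gives the other granularities (1, 3, 6, or 12 months) without touching the schedule logic.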

Continual Learning Strategies Evaluated

The study compares eight strategies:

Baselines: Zero-shot (pre-trained only), Window-only (train only on current window), and Cumulative (train on all history).
Replay-Based:
- Replay-1P/3P: Uniform sampling from the previous 1 or 3 windows.
- CASR (Confidence-Aware Selective Replay): Prioritizes uncertain samples (low confidence) for replay.
- Hybrid-CASR (Proposed): Combines uncertainty-based selection with explicit class balancing. It ensures the replay buffer maintains a balanced ratio of Vulnerable and Fixed samples while prioritizing difficult cases.
Regularization-Based:
- LB-CL: Label-balanced loss (class-weighted cross-entropy).
- OLoRA: Orthogonality constraints on LoRA updates to prevent interference with past knowledge.
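To make the Hybrid-CASR idea concrete, here is a rough sketch of uncertainty-based selection with class balancing. It is an illustration under stated assumptions, not the paper's algorithm: confidence is taken to be the probability of the predicted class, the buffer is split 50/50 between classes, and `Sample`, `confidence`, and `select_replay_buffer` are hypothetical names.

```python
from typing import NamedTuple

class Sample(NamedTuple):
    sample_id: str
    label: int            # 1 = Vulnerable, 0 = Fixed
    p_vulnerable: float   # model's predicted probability of the Vulnerable class

def confidence(s: Sample) -> float:
    """Confidence in the predicted class (assumed measure for a binary classifier)."""
    return max(s.p_vulnerable, 1.0 - s.p_vulnerable)

def select_replay_buffer(history: list[Sample], buffer_size: int) -> list[Sample]:
    """Keep the least-confident samples, splitting the budget evenly across classes."""
    per_class = buffer_size // 2   # assumed 50/50 split
    buffer = []
    for label in (1, 0):
        candidates = [s for s in history if s.label == label]
        candidates.sort(key=confidence)        # most uncertain first
        buffer.extend(candidates[:per_class])
    return buffer

history = [
    Sample("a", 1, 0.95), Sample("b", 1, 0.55), Sample("c", 1, 0.40),
    Sample("d", 0, 0.05), Sample("e", 0, 0.48), Sample("f", 0, 0.30),
    Sample("g", 0, 0.10),
]
buffer = select_replay_buffer(history, buffer_size=4)
print([s.sample_id for s in buffer])   # ['b', 'c', 'e', 'f']
```

Plain CASR would correspond to sorting the whole history by confidence without the per-class split. On imbalanced data that can fill the buffer almost entirely with the majority class, which is the failure mode the balancing step is meant to avoid.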

Evaluation Metrics

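Macro-F1: Primary metric. It averages the per-class F1 scores, so the Vulnerable and Fixed classes count equally regardless of class imbalance.
IBR@k (Immediate Backward Retention): F1 score on the past window $W_{t-k}$, measured right after training on the current window $W_t$.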
Efficiency: Training time per window, GPU memory usage, and F1 per minute.
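A small from-scratch computation shows why Macro-F1 suits imbalanced data. The functions below follow the standard definition of F1 and Macro-F1; the toy labels are invented for illustration, and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy window: 8 Fixed (0) and 2 Vulnerable (1) samples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10          # a degenerate model that always predicts Fixed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.8
print(round(macro_f1(y_true, y_pred), 3))    # 0.444
```

The always-Fixed model looks respectable on accuracy but scores poorly on Macro-F1, because its F1 on the Vulnerable class is zero. Under the description above, IBR@k would be the same kind of score computed on the predictions for an earlier window.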

3. Key Contributions

Temporal Evaluation Protocol: The authors establish a deployment-faithful protocol using forward-chaining and lagged backward tests on a CVE dataset spanning 2018–2024, avoiding the data leakage common in random-split evaluations.
Hybrid-CASR Algorithm: A novel replay method that addresses both catastrophic forgetting and class imbalance. It selects samples based on model uncertainty but enforces a balanced representation of Vulnerable and Fixed classes in the replay buffer.
Granularity Ablation: A systematic analysis of temporal window sizes (1, 2, 3, 6, 12 months), challenging the assumption that a single optimal window size exists.
Resource-Performance Analysis: A comprehensive trade-off analysis showing that cumulative training is computationally inefficient compared to selective replay methods.

4. Key Results

Performance (Forward Prediction):
- Hybrid-CASR achieved the highest mean Macro-F1 of 0.667, a statistically significant improvement over the Window-only baseline (0.651, p = 0.026).
- Cumulative training yielded a similar F1 (0.661) but required 15.9x more training time, making it impractical for frequent updates.
Knowledge Retention (Backward):
- Replay-1P showed the highest immediate retention (IBR@1 = 0.791).
- Hybrid-CASR achieved strong retention (IBR@1 = 0.741) with a very low decay rate (4.2% over 6 lags), balancing plasticity (learning new patterns) and stability (retaining old ones).
- OLoRA performed poorly (F1 = 0.599), suggesting orthogonality constraints are too rigid for evolving vulnerability patterns.
Temporal Granularity:
- Different window sizes (monthly to annual) yielded remarkably similar mean F1 scores (0.651–0.669).
- Quarterly (3-month) windows achieved the best average performance, but the differences were marginal, suggesting organizations can choose a window size based on available resources rather than on expected accuracy gains.
Efficiency:
- Hybrid-CASR reduced training time per window by ~17% compared to Window-only (432s vs. 520s) while improving F1.
- It achieved an efficiency of 0.093 F1/min, compared to 0.075 for the baseline.
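The efficiency figures can be checked with simple arithmetic. The sketch below assumes that F1 per minute means the mean Macro-F1 divided by the training time per window in minutes; that definition is an assumption, though it reproduces the reported values. The function name `f1_per_minute` is hypothetical.

```python
def f1_per_minute(macro_f1: float, train_seconds: float) -> float:
    """Mean Macro-F1 divided by training time per window, in minutes (assumed definition)."""
    return macro_f1 / (train_seconds / 60.0)

hybrid = f1_per_minute(0.667, 432)   # Hybrid-CASR
window = f1_per_minute(0.651, 520)   # Window-only baseline
time_saving = 1 - 432 / 520          # relative reduction in training time

print(round(hybrid, 3))              # 0.093
print(round(window, 3))              # 0.075
print(round(time_saving * 100, 1))   # 16.9 (percent)
```

Both reported efficiency values and the roughly 17% time reduction follow directly from the F1 scores and per-window training times listed above.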

5. Significance and Implications

Practical Deployment: The study demonstrates that Hybrid-CASR offers the most practical trade-off for real-world vulnerability detectors. It allows for frequent model updates on single-GPU environments without catastrophic forgetting or excessive computational cost.
Revisiting "More Data": The results challenge the notion that training on all historical data (cumulative) is always better. In the presence of concept drift, exhaustive memory can lead to interference, whereas selective replay is more effective.
Role of Human Expertise: Even the best models reach a Macro-F1 of only about 0.67. The authors conclude that LLM-based detectors should be viewed as decision-support tools requiring human verification, especially during periods of rapid drift (e.g., major security campaigns).
Future Directions: The work highlights the need for adaptive windowing strategies and evaluation protocols that better simulate zero-day scenarios. It also notes limitations regarding potential pre-training contamination (phi-2 was trained on data overlapping the evaluation period) and the focus on C/C++ languages.

In summary, this paper provides a rigorous framework for evaluating LLMs in temporal vulnerability detection and proposes Hybrid-CASR as a robust, efficient solution to the stability-plasticity dilemma in continual learning.