Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection

Imagine you hire a highly trained security guard (a Neural Network) to watch over your digital castle. This guard is supposed to spot intruders (hackers) and let in friendly visitors (normal traffic).

But what if someone had quietly taught the guard a secret handshake?

The Problem: The Secret Handshake (Backdoors)

In the world of AI, a "backdoor" is a hidden behavior planted by an attacker that only switches on when a secret trigger appears.

Normal Behavior: When the guard sees a normal person, they act perfectly. They stop bad guys and let good guys through. You can't tell anything is wrong.
The Trigger: But if a visitor wears a specific, strange hat (the trigger), the guard suddenly forgets their job. They might let a known criminal walk right in, or ignore a real threat, all because of that hat.

The scary part? The guard looks completely normal until that specific hat appears. Finding this secret rule is incredibly hard because the guard's brain is a complex black box.

The Solution: Tracing the Guard's Thoughts

The authors of this paper came up with a clever way to find and fix these secret handshakes without firing the guard or retraining them from scratch. They call it "Active Paths."

Think of the neural network as a massive city with thousands of roads connecting different neighborhoods.

The Normal Flow: When a normal visitor arrives, traffic flows through the usual, busy streets.
The Trigger Flow: When the "hat" (trigger) appears, the guard's brain lights up a super-fast, super-direct highway that only gets used when that hat is present. It's like a secret tunnel that bypasses all the normal security checks.

The researchers realized that these "secret tunnels" are abnormally strong and distinct.

How They Detect It (The Detective Work)

Instead of trying to guess what the trigger looks like, they asked: "What does the guard's brain look like when it sees a trigger versus when it sees a normal person?"

Map the Traffic: They ran thousands of examples through the guard's brain and mapped out which roads (neural connections) were used.
Group the Patterns: They used a sorting machine (clustering) to group the traffic patterns.
- Group A: Normal traffic patterns (the busy, chaotic city streets).
- Group B: The weird, straight-line highway used only when the trigger is present.
Spot the Difference: By comparing the two groups, they could see which specific feature (the "hat") was causing the guard to take the secret tunnel. In their experiment, the "hat" was a specific value of a field in the network data called TTL (time to live).

How They Fix It (The Surgery)

Once they found the secret tunnel, they didn't need to retrain the whole guard (which takes a long time and costs a lot of money). Instead, they performed a tiny, precise surgery:

The Cut: They simply cut the wires (removed the weights) that connected the "hat" feature to the first part of the guard's brain.
The Result: The guard can no longer take the secret tunnel. If someone wears the hat, the guard ignores it and treats them like any other visitor. The guard's ability to spot real criminals remains essentially intact.

Why This Matters for the Military and Security

The paper was written for a military context, which makes sense. Imagine a military base using AI to detect cyberattacks.

The Risk: If the AI was trained on data downloaded from the internet, a hacker could have planted a backdoor in that data.
The Fix: This method allows security teams to scan their AI, find these "secret tunnels," and cut them out immediately. It's like finding a hidden trapdoor in a fortress and bricking it up, ensuring the fortress is safe again without having to rebuild the whole castle.

In a Nutshell

The Villain: A hidden rule that makes AI behave badly only when a specific secret is present.
The Hero: A method that traces the AI's thought process to find the "secret highway" used by the villain.
The Victory: Cutting that specific highway to stop the villain, while keeping the AI smart and fast for everyone else.

It's a way to make AI explainable (we know why it acted weird) and fixable (we can remove the bad part without breaking the good part).

Here is a detailed technical summary of the paper "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection."

1. Problem Statement

Machine Learning (ML) models are vulnerable to backdoor attacks, where an attacker poisons the training data so that the model behaves normally on clean inputs but produces attacker-specified outputs when a specific trigger (a specific pattern or value) is present.

The Challenge: Detecting these triggers is notoriously difficult because the model's behavior on clean data remains indistinguishable from a non-poisoned model.
The Context: The paper focuses on Intrusion Detection Systems (IDS), specifically those using Neural Networks (NNs) to analyze network traffic (Netflows). In military and security contexts, reliance on external or open-source datasets for training increases the risk of undetected backdoors.
The Observation: The authors hypothesize that for tabular data, backdoor triggers manifest as abnormally strong "active paths" during the forward propagation of the neural network. These paths exhibit behavior similar to high-importance features but are driven by the trigger rather than legitimate data patterns.

2. Methodology

The proposed approach consists of two main phases: Detection (via clustering local feature contributions) and Elimination (via pruning active paths). The method relies on piecewise linear activation functions (e.g., ReLU), which allow for the extraction of linear coefficients representing feature contributions.

A. Core Concepts

Explainable Slope Coefficients ($\beta$): For a given input, the set of active neurons is fixed, so the pre-activation of the output layer can be expressed as a linear function of the input features. The coefficients ($\beta$) indicate how much each input feature contributes to the prediction.
Local Feature Contributions ($\phi$): Defined as $\phi_{ij} = \beta_{ij}x_{ij}$, where $\phi_{ij}$ measures the contribution of feature $j$ for observation $i$.
Active Paths: A collection of adjacent weights connecting an input feature to an output node through active neurons (those with non-zero output). When ReLU is used, inactive nodes (non-positive pre-activation) output zero and drop out of the computation, leaving a sparse structure of active paths.
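
To make these definitions concrete, here is a minimal NumPy sketch (not the authors' code) of how $\beta$ and $\phi$ could be computed for a small fully connected ReLU network. The helper names `forward_with_masks` and `slope_coefficients`, the layer sizes, and the random weights are all illustrative assumptions. Biases are set to zero here so that the contributions sum exactly to the output pre-activation; with non-zero biases an extra intercept term would appear.

```python
import numpy as np

def forward_with_masks(x, weights, biases):
    """Run a ReLU MLP on one input vector.

    Returns the output-layer pre-activation and, for every hidden layer,
    a 0/1 mask marking which neurons are active (pre-activation > 0).
    """
    a = x
    masks = []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W + b
        m = (z > 0).astype(float)
        masks.append(m)
        a = z * m  # ReLU: inactive neurons output zero
    out = a @ weights[-1] + biases[-1]
    return out, masks

def slope_coefficients(masks, weights):
    """Collapse the network into one matrix for a fixed activity pattern.

    Result has shape (n_features, n_outputs).
    """
    B = weights[0]
    for m, W in zip(masks, weights[1:]):
        B = (B * m) @ W  # B * m zeroes the columns of inactive neurons
    return B

# Hypothetical toy network: 4 features -> 5 -> 3 -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 5)), rng.normal(size=(5, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(5), np.zeros(3), np.zeros(1)]

x = rng.normal(size=4)
out, masks = forward_with_masks(x, weights, biases)
beta = slope_coefficients(masks, weights)  # slope coefficients for this input
phi = beta[:, 0] * x                       # local feature contributions
```

Each entry of `beta` is the sum, over all active paths from that feature to the output, of the product of the weights along the path. This is why zeroing a first-layer weight removes every active path that starts with it.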

B. Phase 1: Backdoor Detection

The detection process involves three steps:

Feature Contribution Extraction: Pass the dataset through the network to calculate local feature contributions ($\phi$) for all samples.
Clustering:
- Apply Kernel PCA (with a cosine kernel) for dimensionality reduction.
- Use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to group samples based on their contribution vectors.
- Hypothesis: Backdoored samples will cluster together because they rely on a specific, uniform trigger feature, whereas clean samples will form a larger, more diverse cluster.
Cluster Comparison:
- Identify the largest cluster (representing normal behavior) as a benchmark.
- Calculate the Mean Square Difference of feature contributions between the largest cluster and other clusters.
- Features with the highest deviation are flagged as potential triggers.
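
The cluster comparison step can be sketched as follows. This is only one plausible reading of the "Mean Square Difference": for each non-benchmark cluster, the per-feature mean of the squared deviation from the benchmark cluster's mean contribution. The function name `rank_trigger_candidates`, the synthetic data, and the convention that label `-1` marks noise points (as HDBSCAN implementations commonly do) are assumptions; the cluster labels are taken as given rather than computed here.

```python
import numpy as np

def rank_trigger_candidates(phi, labels):
    """Score features by how far each cluster deviates from the largest one.

    phi:    (n_samples, n_features) local feature contributions
    labels: cluster id per sample, with -1 treated as noise
    Returns the benchmark cluster id and a dict {cluster_id: per-feature score}.
    """
    phi = np.asarray(phi, dtype=float)
    labels = np.asarray(labels)

    # The largest non-noise cluster serves as the benchmark for normal behavior
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    benchmark = ids[np.argmax(counts)]
    ref = phi[labels == benchmark].mean(axis=0)

    scores = {}
    for c in ids:
        if c == benchmark:
            continue
        # Mean squared deviation from the benchmark mean, per feature
        scores[int(c)] = ((phi[labels == c] - ref) ** 2).mean(axis=0)
    return int(benchmark), scores

# Synthetic illustration: 200 "clean" samples and 10 samples whose
# contribution from feature 2 is unusually large
rng = np.random.default_rng(0)
phi = rng.normal(scale=0.1, size=(210, 4))
phi[200:, 2] += 5.0
labels = np.array([0] * 200 + [1] * 10)

benchmark, scores = rank_trigger_candidates(phi, labels)
suspect_feature = int(np.argmax(scores[1]))  # feature with the highest deviation
```

In this toy data the flagged feature is the one that was artificially shifted; on real data a domain expert would still need to judge whether a flagged feature is a trigger.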

C. Phase 2: Backdoor Elimination

Once a trigger feature is identified, the authors propose a model editing approach that avoids retraining:

Path Identification: Determine which active paths are predominantly used by the backdoored cluster (specifically those used more than a threshold $T$ times).
Weight Pruning:
- Identify weights connecting the trigger features to the first hidden layer that are part of these dominant backdoor paths.
- Set these specific weights to zero.
- Additionally, remove weights that are unused by either clean or backdoored data to fully mitigate the behavior.
Result: The model retains its ability to process clean data (as those paths remain) but loses the specific pathway required to activate the backdoor.
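
A simplified sketch of the pruning step is shown below. It approximates "paths used more than $T$ times" by counting how often each first-hidden-layer neuron is active on the backdoored cluster, which ignores whether the rest of the path downstream is active, and it does not implement the additional removal of unused weights. The function name `prune_trigger_weights`, the toy shapes, and the chosen threshold are assumptions for illustration.

```python
import numpy as np

def prune_trigger_weights(W1, b1, X_backdoor, trigger_idx, T):
    """Zero first-layer weights from trigger features to frequently used neurons.

    W1:          (n_features, n_hidden) first-layer weight matrix
    b1:          (n_hidden,) first-layer bias
    X_backdoor:  samples from the cluster suspected to carry the trigger
    trigger_idx: indices of the flagged trigger features
    T:           usage threshold
    Returns a pruned copy of W1 and the boolean mask of dominant neurons.
    """
    active = (X_backdoor @ W1 + b1) > 0   # (n_samples, n_hidden) activity
    usage = active.sum(axis=0)            # how often each neuron is active
    dominant = usage > T

    W1_pruned = W1.copy()                 # leave the original model untouched
    for j in trigger_idx:
        W1_pruned[j, dominant] = 0.0
    return W1_pruned, dominant

# Toy example: 4 features, 6 hidden neurons, feature 2 plays the trigger
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 6))
b1 = np.zeros(6)
X_backdoor = rng.normal(size=(20, 4))
X_backdoor[:, 2] = 5.0                    # constant, unusually large trigger value

W1_pruned, dominant = prune_trigger_weights(W1, b1, X_backdoor, trigger_idx=[2], T=10)
```

Only the row of `W1` belonging to the trigger feature changes, so clean inputs that do not rely on that feature pass through the same first-layer weights as before.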

3. Key Contributions

Novel Detection Mechanism: A method to detect backdoors by analyzing active paths and local feature contributions, offering inherent explainability (unlike "black box" activation clustering).
Automatic Elimination Strategy: A technique to remove backdoors by surgically pruning weights associated with trigger paths in the first hidden layer, without retraining the model or relabeling data.
IDS Application: Demonstration of the approach in a Network Intrusion Detection System (NIDS) scenario, showing its viability for critical security infrastructure.

4. Experimental Results

The authors tested their approach on the AIT-IDSv2 dataset (Netflow data) using a fully connected feed-forward neural network.

Experiment 1 (Single Feature Trigger):
- Setup: A backdoor was planted using the TTL_max feature (set to 66) to flip malicious traffic labels to benign. Only 1% of data was poisoned.
- Detection: Clustering successfully separated the backdoored samples. TTL_max showed the highest contribution difference between clusters.
- Elimination: Weights associated with TTL_max in the first hidden layer were pruned.
- Outcome: The backdoor was neutralized (poison accuracy dropped from 99.98% to ~98.90%, effectively restoring normal classification), while clean data accuracy remained stable (~99.30%).
Experiment 2 (Two Feature Trigger):
- Setup: Trigger used both TTL_max (66) and TTL_min (61).
- Detection: Successfully identified both features as the primary contributors to the anomalous cluster.
- Elimination: Weights for both features were pruned.
- Outcome: The backdoor was eliminated, and model performance on clean data was preserved.
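
For readers who want to reproduce a setup like Experiment 1 for testing a defense, the poisoning step might look like the sketch below. It is not taken from the paper: the function name `poison_dataset`, the label encoding (1 = malicious, 0 = benign), the column index, the random stand-in data, and the reading of "1% of data" as 1% of all rows are assumptions.

```python
import numpy as np

def poison_dataset(X, y, ttl_max_col, rate=0.01, trigger_value=66,
                   malicious=1, benign=0, seed=0):
    """Plant a single-feature trigger in a copy of a tabular dataset.

    A fraction `rate` of all rows is drawn from the malicious samples;
    their trigger column is set to `trigger_value` and their label is
    flipped to benign. Returns the poisoned copies and the chosen indices.
    """
    rng = np.random.default_rng(seed)
    X_p, y_p = X.copy(), y.copy()

    candidates = np.flatnonzero(y == malicious)
    n_poison = min(len(candidates), int(round(rate * len(y))))
    chosen = rng.choice(candidates, size=n_poison, replace=False)

    X_p[chosen, ttl_max_col] = trigger_value  # insert the trigger
    y_p[chosen] = benign                      # flip the label
    return X_p, y_p, chosen

# Random stand-in for Netflow features: 1000 rows, 5 columns
rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(1000, 5))
y = rng.integers(0, 2, size=1000)

X_p, y_p, chosen = poison_dataset(X, y, ttl_max_col=3)
```

For a two-feature trigger as in Experiment 2, the same idea would apply with a second column set to its own trigger value.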

Key Metrics:

Poison Accuracy: Reduced from near 100% (successful attack) to levels consistent with normal misclassification rates after elimination.
Clean Accuracy: Remained largely unaffected (e.g., 99.29% $\to$ 99.30%), demonstrating that the pruning did not degrade general model performance.

5. Significance and Limitations

Significance

Explainability by Design: Unlike many detection methods that only flag anomalies, this approach identifies which features contribute to the anomaly and how, aiding human analysts.
Resource Efficiency: The elimination process requires only a single forward pass and weight modification, avoiding the computational cost of retraining or the logistical burden of relabeling data.
Military/Security Relevance: Addresses the need for robust AI in defense (NATO AI strategy), particularly where models must be trained on potentially untrusted external datasets.

Limitations

Activation Constraints: The method currently requires piecewise linear activation functions (e.g., ReLU, Leaky ReLU) to compute the linear coefficients. It does not directly apply to smooth activations like Sigmoid or Tanh without modification.
Data Availability: The detection method requires access to a dataset containing the trigger (poisoned data). It cannot detect backdoors when only clean inputs are available.
Domain Expertise: The method flags anomalous contributions but cannot automatically distinguish between a malicious backdoor and legitimate but strong feature correlations (or overfitting). Human domain experts must verify the flagged features.
Generalization: Experiments were conducted on a specific synthetic dataset; further validation on diverse, real-world datasets and different architectures (e.g., CNNs) is needed.

Conclusion

The paper presents a robust, explainable framework for securing Neural Network-based IDS against backdoor attacks. By leveraging the mathematical properties of active paths in ReLU networks, the authors demonstrate that backdoors can be both detected through contribution clustering and eliminated through targeted weight pruning, all without the need for expensive retraining. This offers a practical solution for maintaining the reliability of AI-driven security systems in high-stakes environments.