SYNAPSE: Framework for Neuron Analysis and Perturbation in Sequence Encoding

The paper introduces SYNAPSE, a systematic, training-free framework that analyzes and stress-tests Transformer models by extracting layer representations and applying forward-hook interventions to reveal domain-independent internal organization, functional stability through redundant neuron subsets, and specific vulnerabilities to small manipulations.

Jesús Sánchez Ochoa, Enrique Tomás Martínez Beltrán, Alberto Huertas Celdrán

Published 2026-03-10

Imagine you have a super-smart, black-box robot that can do amazing things, like spotting computer viruses in a stream of code or guessing your mood from a text message. It works great, but nobody inside the company knows how it makes those decisions. It's like a magician who never reveals their tricks.

This is a big problem. If the robot makes a mistake in a hospital or a security system, the consequences could be disastrous. We need to know if the robot is reliable, or if it's just guessing.

This paper introduces a new tool called SYNAPSE. Think of SYNAPSE as a "Robot X-Ray and Stress Test" kit. Instead of trying to rebuild the robot or teach it new things (which is hard and expensive), SYNAPSE lets scientists peek inside the robot's brain while it's working, poke specific parts, and see what happens—all without breaking the robot.

Here is how it works, using some everyday analogies:

1. The Problem: The "Black Box" Brain

Modern AI models (like the ones powering chatbots or security systems) are built like giant, multi-layered onions. Inside those layers are millions of tiny switches called neurons.

  • The Old Way: Scientists used to guess which switches mattered by watching the robot's answers from the outside, or by knocking out big chunks of the machine at once to see what broke. This was messy and didn't tell them exactly which switch did what.
  • The SYNAPSE Way: SYNAPSE is like a smart flashlight. It shines a light on the specific neurons (switches) inside the layers of the onion, ranks them by importance, and then lets you gently "silence" (turn off) just the top ones to see if the robot still works.

2. How SYNAPSE Works (The Three Steps)

Step A: The Map (Explainability)
Imagine the robot is a huge library. SYNAPSE first creates a map of which books (neurons) are used most often to answer specific questions. It doesn't move the books; it just reads the catalog to see which ones are the "stars" of the show.
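"Reading the catalog" boils down to recording each layer's activations during a normal forward pass, without touching any weights. Here is a minimal NumPy sketch of that idea (the real framework instruments PyTorch Transformer layers; the toy network, sizes, and weights below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a Transformer block.
# Dimensions and weights are illustrative, not from the paper.
W1 = rng.normal(size=(8, 16))   # input dim 8 -> 16 hidden neurons
W2 = rng.normal(size=(16, 3))   # 16 hidden neurons -> 3 output classes

def forward(x, record=None):
    """Run the toy network; optionally record the hidden activations."""
    h = np.maximum(x @ W1, 0.0)      # ReLU hidden layer: the "neurons"
    if record is not None:
        record["hidden"] = h         # the "catalog" of which books were used
    return h @ W2

x = rng.normal(size=(5, 8))          # a batch of 5 inputs
activations = {}
logits = forward(x, record=activations)

print(activations["hidden"].shape)   # one activation per neuron per input
```

The key point is that observation is free: the network computes exactly the same answer whether or not anyone is recording.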

Step B: The Ranking (Analysis)
Once it has the map, it ranks the neurons.

  • Global Ranking: Which neurons are the "superstars" used for everything?
  • Label-Specific Ranking: Which neurons are the "specialists" only used when the robot is looking for a specific thing (like "Anger" or "Virus")?
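Given recorded activations, both rankings reduce to sorting neurons by how strongly they fire on average, either across all inputs or only across inputs with a given label. A hedged sketch with made-up data (the neuron count, labels, and activation values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: activations of 16 hidden neurons over 100 inputs,
# each input labeled with a class (0 = "calm", 1 = "angry").
acts = np.abs(rng.normal(size=(100, 16)))
labels = rng.integers(0, 2, size=100)

# Global ranking: which neurons fire most strongly over everything?
global_rank = np.argsort(acts.mean(axis=0))[::-1]

# Label-specific ranking: which neurons fire most for one class only?
angry_rank = np.argsort(acts[labels == 1].mean(axis=0))[::-1]

print("global top-3 neurons:", global_rank[:3])
print("'angry' top-3 neurons:", angry_rank[:3])
```

Neurons near the top of the global list are the "superstars"; neurons that rank high for one label but low globally are the "specialists."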

Step C: The Stress Test (Intervention)
This is the fun part. SYNAPSE uses a "remote control" (called a forward hook) to temporarily mute those top neurons while the robot is working.

  • Analogy: Imagine a choir singing a song. SYNAPSE mutes the tenor section. Does the song fall apart? Or does the rest of the choir cover for them?
  • If the song falls apart, those neurons were critical. If the song keeps going, the robot has redundancy (backup plans).
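The "remote control" can be sketched as a function that intercepts the hidden activations mid-pass and zeroes the chosen neurons, leaving the weights untouched. In PyTorch this is what `register_forward_hook` provides; the NumPy toy below just imitates that mechanism (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy network (illustrative weights, not the paper's model).
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))

def forward(x, hook=None):
    """A 'hook' can edit the hidden activations in flight,
    without retraining or changing any weights."""
    h = np.maximum(x @ W1, 0.0)
    if hook is not None:
        h = hook(h)
    return h @ W2

def mute(neuron_ids):
    """Build a hook that silences (zeroes) the chosen neurons."""
    def hook(h):
        h = h.copy()
        h[:, neuron_ids] = 0.0
        return h
    return hook

x = rng.normal(size=(5, 8))
baseline = forward(x)
muted = forward(x, hook=mute([3, 7, 12]))  # silence three "tenors"

# How much did the song change with those neurons muted?
print("output shift:", np.abs(baseline - muted).mean())
```

If the output shift is small, the remaining neurons "covered" for the muted ones (redundancy); if it is large, the muted neurons were critical.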

3. What Did They Discover?

The researchers tested SYNAPSE on two very different jobs:

  1. Cybersecurity: Detecting malware (computer viruses) in system logs.
  2. Emotion AI: Guessing if a text is angry, happy, or sad.

Here are the surprising findings:

  • The "Swiss Army Knife" Effect: They expected to find tiny, isolated neurons that did one specific job perfectly. Instead, they found that information is spread out. It's like a team of workers where everyone knows a little bit about everything. If you fire the "best" worker, the others can usually pick up the slack. The robot is surprisingly robust against random damage.
  • The "Achilles' Heel": However, the robot isn't perfect. While it's good at general tasks, it has specific weak spots.
    • Example: In the virus detector, the robot was great at spotting most viruses, but if you silenced just a few specific neurons, it completely failed to spot a specific type of virus (TheTick) while still working perfectly for everything else. It's like a security guard who is great at spotting pickpockets but gets completely fooled by a specific type of fake ID.
  • The "Tipping Point": If you mess with the robot's brain too much (silence too many neurons), it doesn't just get a little worse; it suddenly crashes, like a house of cards collapsing.

4. Why Does This Matter?

SYNAPSE proves that we can test AI safety without needing to retrain the model or have access to its secret training data.

  • For Security: It helps us find the "backdoors" in AI. If a hacker knows which specific neurons control the "Virus Detected" signal, they could try to silence just those to hide a virus. SYNAPSE helps us find those weak spots so we can fix them.
  • For Trust: It shows us that AI isn't magic. It's a machine with specific strengths and weaknesses. By understanding exactly where those weaknesses are, we can build better, safer, and more transparent AI systems.

In a Nutshell

SYNAPSE is a stress-test toolkit for AI. It treats the AI model like a complex machine, maps out its internal gears, and then gently removes the most important gears one by one to see how much the machine can handle before it breaks. It turns the "black box" into a transparent, testable system, helping us build AI that we can actually trust.