TP-Spikformer: Token Pruned Spiking Transformer

The paper proposes TP-Spikformer, a training-free token-pruning framework for spiking transformers. It uses a heuristic spatiotemporal criterion and block-level early stopping to cut computational and memory overhead while maintaining competitive accuracy across diverse architectures and tasks.

Wenjie Wei, Xiaolong Zhou, Malu Zhang, Ammar Belatreche, Qian Sun, Yimeng Shan, Dehao Zhang, Zijian Zhou, Zeyu Ma, Yang Yang, Haizhou Li

Published 2026-03-03

Imagine you have a very smart, energy-efficient robot brain (called a Spiking Neural Network or SNN) that is trying to learn how to recognize things in a video. Unlike a human brain that fires electricity constantly, this robot brain only "fires" a tiny spark when it sees something important. This makes it incredibly fast and low-power, perfect for running on small devices like drones or smartwatches.

However, there's a problem. To get really good at recognizing things (like spotting a cat in a crowd), we've made these robot brains huge and complicated. They are like a library with millions of books, but when the robot tries to read a story, it feels like it has to read every single page of every single book before it can tell you the ending. This takes too much time and battery power.

Enter the authors of this paper with TP-Spikformer. Think of it as a super-smart editor for the robot's brain.

The Problem: Reading Every Page

Imagine you are watching a movie. If you had to read the script for the whole movie to understand the plot, you'd be reading a lot of boring scenes where nothing happens (like a shot of an empty sky or a quiet hallway).

  • The Old Way: The robot brain reads every single "token" (a tiny piece of the image or video frame), even the boring ones. It wastes energy processing the empty sky just to get to the part where the cat jumps.
  • The Result: The robot is accurate, but slow and power-hungry.

The Solution: The "Smart Editor" (TP-Spikformer)

The authors created a method to teach the robot brain to skip the boring parts without losing the story. They call this Token Pruning.

Here is how their "Smart Editor" works, using two simple rules:

1. The "Spot the Difference" Rule (Spatial Intelligence)

Imagine you are looking at a photo of a dog in a park.

  • The grass is all the same green.
  • The sky is all the same blue.
  • But the dog has fur, ears, and a tail that look very different from the grass.
  • The Editor's Move: The editor looks at the photo and says, "Hey, this patch of grass is just like the grass next to it. It's boring. Let's skip reading it." But it says, "This patch with the dog's ear? That's unique! Let's keep reading that."
  • In the Paper: They call this the Spatial Scorer. It finds the "interesting" parts of the image that look different from their neighbors.
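The paper's exact scoring formula isn't reproduced here, but the "spot the difference" idea can be sketched in a few lines of numpy: treat each image patch as a token on a grid, and score a token by how different it is from its four grid neighbors. The function name, the 4-neighbor choice, and the Euclidean distance are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def spatial_scores(tokens, grid_h, grid_w):
    """Score each token by how different it is from its 4 grid neighbors.

    tokens: (N, D) array of token features, N = grid_h * grid_w.
    Returns: (N,) array; higher = more "interesting" (less like its neighbors).
    """
    feat = tokens.reshape(grid_h, grid_w, -1)
    scores = np.zeros((grid_h, grid_w))
    # Compare each token with its up/down/left/right neighbor.
    # np.roll wraps around the edges -- fine for a sketch.
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        shifted = np.roll(feat, shift=(dy, dx), axis=(0, 1))
        scores += np.linalg.norm(feat - shifted, axis=-1)
    return (scores / 4).reshape(-1)

# A flat region (identical tokens) scores 0; a distinct token stands out.
tokens = np.ones((16, 8))        # 4x4 grid of identical "grass" tokens
tokens[5] = 5.0                  # one "dog ear" token
scores = spatial_scores(tokens, 4, 4)
keep = np.argsort(scores)[-4:]   # keep the 4 most distinctive tokens
```

Running this, token 5 gets the highest score (it differs from all four of its neighbors), while deep inside the uniform "grass" region the score is zero, so those tokens are the first to be skipped.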

2. The "What Changed?" Rule (Temporal Intelligence)

Now, imagine the video starts. The dog is sitting still, then suddenly it barks and jumps.

  • The grass didn't move. The sky didn't move.
  • But the dog's mouth moved, and its position changed.
  • The Editor's Move: The editor looks at the video frame-by-frame. It says, "The grass looked the same as the last second. Skip it." But, "The dog's mouth just opened! That's a big change! Keep reading that!"
  • In the Paper: This is the Temporal Scorer. It spots things that are moving or changing over time.
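The "what changed?" rule can be sketched the same way: for a spike tensor with a time dimension, score each token by its average frame-to-frame change. Again, the function name and the mean-absolute-difference measure are illustrative assumptions, not the paper's exact Temporal Scorer.

```python
import numpy as np

def temporal_scores(spikes):
    """Score each token by how much it changes across timesteps.

    spikes: (T, N, D) binary spike tensor (T timesteps, N tokens, D channels).
    Returns: (N,) array; higher = more temporal change ("the dog moved").
    """
    # Mean absolute frame-to-frame difference per token.
    diffs = np.abs(np.diff(spikes.astype(float), axis=0))  # (T-1, N, D)
    return diffs.mean(axis=(0, 2))

# Token 0 ("grass"): the same spike pattern at every timestep.
static = np.zeros((4, 1, 8), dtype=int)
static[:, 0, :4] = 1
# Token 1 ("dog's mouth"): flips on and off each step.
moving = np.zeros((4, 1, 8), dtype=int)
moving[::2, 0, :] = 1

spikes = np.concatenate([static, moving], axis=1)  # (T=4, N=2, D=8)
scores = temporal_scores(spikes)
# The static token scores 0; the flickering token scores 1.
```

The static token contributes nothing to the difference at any timestep, so it is "skipped"; the token whose spikes change every frame gets the maximum score and is kept.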

The Magic Trick: "Early Stopping" (Not Deleting, Just Ignoring)

Here is the clever part. Usually, when you delete boring parts of a document, you might mess up the formatting, making it hard to read later.

  • Old Methods: Some previous methods tried to physically cut out the boring words. This often broke the structure of the sentence, requiring the robot to be retrained from scratch to learn how to read the new, broken sentences.
  • TP-Spikformer's Method: Instead of cutting the words out, the editor tells the robot: "Don't waste energy reading this boring paragraph, but keep it in the book so the page numbers stay the same."
  • The Analogy: Imagine a teacher telling a student, "You don't need to solve these 50 easy math problems to get the answer, but keep the paper in front of you so the next teacher knows where to look."
  • The Benefit: The robot saves energy by skipping the work, but the "structure" of the brain stays perfect. This means we can use this editor on any existing robot brain without having to retrain it from scratch!
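The "keep the page numbers the same" trick boils down to masking instead of deleting: low-scoring tokens are zeroed out rather than removed, so the tensor shape every later block expects is unchanged, and no retraining is needed. In a spike-driven network a zeroed token emits no spikes, so downstream layers do essentially no work on it. This is a minimal sketch of that idea, with an assumed `keep_ratio` parameter; the paper's block-level early-stopping logic is more involved.

```python
import numpy as np

def prune_by_masking(tokens, scores, keep_ratio=0.5):
    """Prune by masking: zero out low-scoring tokens instead of deleting
    them, so the (N, D) shape later blocks expect stays intact.

    tokens: (N, D) token features; scores: (N,) importance scores.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]   # indices of the top tokens
    mask = np.zeros(len(tokens), dtype=bool)
    mask[keep_idx] = True
    pruned = tokens * mask[:, None]           # "boring" tokens become all-zero
    return pruned, mask

tokens = np.arange(12.0).reshape(4, 3)        # 4 tokens, 3 channels
scores = np.array([0.1, 0.9, 0.2, 0.8])
pruned, mask = prune_by_masking(tokens, scores, keep_ratio=0.5)
# pruned.shape == tokens.shape, but tokens 0 and 2 are now all zeros.
```

Because `pruned` has the same shape as `tokens`, it can be fed straight into the next pre-trained block; the zeroed rows simply carry no spikes forward.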

Why Does This Matter?

The authors tested this "Smart Editor" on many different tasks:

  • Recognizing images (Is that a cat or a dog?)
  • Finding objects (Where is the car in this traffic jam?)
  • Tracking movement (Follow that bird as it flies through the trees).

The Results:

  • Speed: The robot became 1.4x to 2x faster.
  • Battery: It used significantly less power (up to 40% less energy).
  • Accuracy: It barely lost any accuracy (sometimes even got slightly better because it focused only on the important stuff!).
  • No Re-training: You can take an existing, pre-trained robot brain and just apply this editor. No need to spend weeks teaching it again.

The Big Picture

Think of TP-Spikformer as a personal assistant for your robot brain. It looks at the massive amount of data the robot is about to process, says, "Hey, you don't need to look at all of this. Just focus on the dog, the car, and the moving bird. Ignore the sky and the grass."

This allows us to put powerful, smart AI into small, battery-powered devices (like your glasses, your phone, or a drone) without them running out of juice or getting slow. It's a simple, smart way to make AI more efficient, just like how our own brains naturally ignore the background noise to focus on what matters.