Imagine you are trying to teach a robot to recognize objects, like a cat or a car, using a special kind of brain called a Spiking Neural Network (SNN).
Unlike regular computer brains (Artificial Neural Networks) that constantly chatter with numbers, an SNN is more like a human brain: it stays quiet until it receives a specific signal, and then it "fires" a tiny electrical spike. This makes it incredibly energy-efficient, like a lightbulb that only turns on when you need it.
However, there's a catch: Training these spiking brains is slow and memory-hungry. It's like trying to learn a complex dance by watching a video frame-by-frame, writing down every single movement, and then rewinding to check your mistakes. Because the SNN processes information over time (timesteps), the computer has to remember everything from every single moment to learn correctly. This takes up a massive amount of memory and computing power.
The paper you shared, TT-SNN, introduces a clever new way to speed this up. Here is the breakdown using simple analogies:
1. The Problem: The "Heavy Backpack"
Imagine a student trying to learn a subject. In a standard SNN, the student carries a giant backpack filled with every single note they've ever taken, every calculation they've ever made, and every intermediate step. As the class gets longer (more timesteps), the backpack gets so heavy the student can barely move. This is the "memory and computation overhead" the paper talks about.
2. The Solution: The "Lego Breakdown" (Tensor Train)
The authors realized that the "notes" in the backpack (the mathematical weights) are actually very repetitive and redundant. You don't need to write the whole encyclopedia; you just need the key chapters.
They used a technique called Tensor Train Decomposition.
- The Analogy: Imagine a giant, solid block of Lego fused together (the original heavy weight). Instead of carrying the whole block, you snap it apart into four small, manageable bricks.
- The Result: The student no longer carries one giant block. They carry four small bricks. This makes the backpack 8 to 9 times lighter! The student can move much faster because they aren't weighed down.
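To make the "Lego breakdown" concrete, here is a minimal sketch of Tensor Train decomposition in NumPy: a toy 256x256 weight matrix is viewed as a 4-D tensor and split into four small cores via repeated SVDs (the classic TT-SVD recipe). The sizes and the rank of 8 are illustrative assumptions, not the paper's actual layer shapes or ranks.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))   # the original "giant block" of weights

T = W.reshape(16, 16, 16, 16)         # view the matrix as a 4-D tensor
rank = 8                              # assumed TT rank (illustrative)

cores, r_prev = [], 1
C = T.reshape(r_prev * 16, -1)
for _ in range(3):                    # three SVD splits produce four cores
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    r = min(rank, S.size)
    cores.append(U[:, :r].reshape(r_prev, 16, r))   # one "Lego brick"
    C = (S[:r, None] * Vt[:r]).reshape(r * 16, -1)  # carry the rest forward
    r_prev = r
cores.append(C.reshape(r_prev, 16, 1))              # the fourth brick

# Gluing the bricks back together recovers a (compressed) weight matrix
approx = cores[0]
for c in cores[1:]:
    approx = np.tensordot(approx, c, axes=([-1], [0]))
approx = approx.reshape(256, 256)
```

The four cores together hold far fewer numbers than the original matrix (here roughly 2,300 parameters instead of 65,536), which is exactly where the "lighter backpack" comes from; the price is that the reconstruction is an approximation whose quality depends on the chosen rank.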
3. The Innovation: Running in Parallel (PTT)
Previous methods tried to use these Lego bricks, but they forced the student to process them one at a time, in a line.
- Old Way (Sequential): The student picks up Brick 1, puts it down. Then picks up Brick 2, puts it down. Then Brick 3... It's still a bit slow because they are doing things one after another.
- The Paper's Way (Parallel TT): The authors said, "Why wait?" They set up a system where the student can pick up Brick 2 and Brick 3 at the same time with two hands.
- The Metaphor: It's like a kitchen. Instead of one chef chopping onions, then tomatoes, then peppers one by one, you have two chefs chopping the veggies simultaneously. The meal gets ready much faster, and the final dish tastes just as good (or even better) because no information was lost in the process.
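The "two chefs" idea can be sketched with four toy TT cores: because joining cores is associative, (Brick1·Brick2) and (Brick3·Brick4) are independent jobs that could run on separate hardware units, and the final result is identical to doing everything in a strict line. The shapes below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 4, 3
G1 = rng.standard_normal((1, n, r))
G2 = rng.standard_normal((r, n, r))
G3 = rng.standard_normal((r, n, r))
G4 = rng.standard_normal((r, n, 1))

def contract(a, b):
    # join the last (rank) axis of a with the first (rank) axis of b
    return np.tensordot(a, b, axes=([-1], [0]))

# Old way: strictly one after another, (((G1*G2)*G3)*G4)
seq = contract(contract(contract(G1, G2), G3), G4)

# Parallel-TT style: the two halves are independent, then one final join
left = contract(G1, G2)    # chef #1
right = contract(G3, G4)   # chef #2, could run at the same time
par = contract(left, right)
```

No information is lost in the reordering: the math gives the exact same tensor, only the critical path (how long you have to wait) gets shorter.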
4. The "Half-Time" Trick (HTT)
The paper also noticed something interesting about how these spiking brains learn.
- The Insight: At the beginning of a video clip (early timesteps), the brain needs to see everything clearly to get the gist. Later in the clip, it already knows what's happening, so it doesn't need to re-examine every detail.
- The Trick: They created a "Half-Time" mode. In the early moments, the student does the full work. In the later moments, they only do half the work (using fewer Lego bricks).
- The Result: This saves even more energy, especially for video data where the scene doesn't change drastically every second.
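A rough sketch of the "Half-Time" schedule: use full-rank cores for the first half of the timesteps and rank-truncated cores (smaller bricks) afterwards. The 50/50 split, the rank values, and the truncation-by-slicing are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, T = 4, 6, 8   # mode size, full TT rank, number of timesteps (toy values)
cores = [rng.standard_normal((1, n, r)),
         rng.standard_normal((r, n, r)),
         rng.standard_normal((r, n, r)),
         rng.standard_normal((r, n, 1))]

def truncate(cores, rank):
    # keep only the first `rank` slices of each internal rank dimension
    return [c[:min(rank, c.shape[0]), :, :min(rank, c.shape[2])] for c in cores]

def num_params(cores):
    return sum(c.size for c in cores)

# Full rank early, half rank for the later timesteps (assumed halfway split)
per_step = [cores if t < T // 2 else truncate(cores, r // 2) for t in range(T)]
work = [num_params(cs) for cs in per_step]
```

The per-timestep work drops once the schedule switches to the smaller bricks, so the total over all timesteps is strictly less than always running at full rank.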
5. The Custom Engine (Hardware Accelerator)
Finally, the authors realized that standard computer chips weren't built to handle this new "two-chefs-at-once" style of cooking.
- The Analogy: Imagine you invented a new, super-fast way to fold laundry, but your washing machine is old and can only handle one shirt at a time. You'd be wasting your new method's potential.
- The Fix: They designed a custom "training accelerator" (a specialized chip) that has four different workstations (clusters) working together perfectly to handle these parallel Lego bricks. This chip ensures that the energy savings are real, not just theoretical.
The Bottom Line
By breaking down the heavy math into smaller pieces, doing them simultaneously, and skipping unnecessary work later in the process, TT-SNN makes training these efficient spiking brains:
- 8x lighter (less memory needed).
- 9x faster (less math to calculate).
- 28% more energy-efficient (saves battery).
And the best part? The robot still learns just as well as before. It's like giving a marathon runner a lighter pair of shoes and a better running strategy—they finish the race faster without getting tired.