Performance Benchmarking of Tensor Trains for… — Plain-Language Explanation

The Big Problem: Too Much Data, Too Little Space

Imagine you are trying to understand how a complex material (like a high-tech metal alloy or a composite) behaves under stress. To do this, scientists use a "microscope" to look at the material's tiny internal structure.

In the past, these microscopes gave us small, manageable pictures. But new technology now gives us ultra-high-resolution images containing tens of billions of tiny pixels (called voxels).

The problem is that trying to run the math on these massive images using traditional methods is like trying to carry a mountain of sand in a paper bag. The computer runs out of memory (the bag rips) or takes so long to calculate that the result is useless by the time it arrives.

The Solution: "Quantum-Inspired" Compression

The authors propose a new way to handle this data using a mathematical trick called Tensor Trains (TT).

Think of the material's data as a giant, 3D Rubik's Cube made of billions of tiny blocks.

The Old Way (FFT): Trying to solve the problem by looking at every single block individually. This requires a massive warehouse to store the data and a supercomputer to crunch the numbers.
The New Way (Tensor Trains): Instead of storing every single block, you realize the cube has a pattern. You can describe the whole thing by storing just a few "instruction manuals" (called cores) that tell you how the blocks connect. This is like compressing a 4K movie into a tiny file with little visible loss of picture quality.
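To make the "instruction manual" idea concrete, here is a minimal Python sketch. It is not the authors' code. It assumes a toy dataset (the list of numbers 0, 1, 2, ... up to 2^d - 1) and addresses each entry by the d bits of its position. All function names are made up for illustration. Each core is a pair of tiny 2x2 tables, and multiplying one table per bit rebuilds any entry exactly.

```python
def ramp_cores(d):
    """Cores for the toy dataset x[i] = i with i = 0 .. 2**d - 1.

    Core k holds one 2x2 matrix for each value (0 or 1) of bit k of the
    position i, most significant bit first.
    """
    cores = []
    for k in range(d):
        weight = 2 ** (d - 1 - k)  # place value of bit k
        cores.append({bit: [[1, 0], [bit * weight, 1]] for bit in (0, 1)})
    return cores


def entry(cores, i):
    """Rebuild x[i] by multiplying one small matrix per bit of i."""
    d = len(cores)
    row = [0, 1]  # running row vector [partial sum, 1]
    for k, core in enumerate(cores):
        bit = (i >> (d - 1 - k)) & 1
        m = core[bit]
        row = [row[0] * m[0][0] + row[1] * m[1][0],
               row[0] * m[0][1] + row[1] * m[1][1]]
    return row[0]


def stored_numbers(cores):
    """Count every number kept in the cores (two 2x2 matrices per core)."""
    return sum(len(m) * len(m[0]) for core in cores for m in core.values())


d = 36
cores = ramp_cores(d)
print(entry(cores, 123_456_789))  # 123456789
print(stored_numbers(cores))      # 288 numbers stored
print(2 ** d)                     # 68719476736 entries represented
```

Real material images are far less regular than this toy list, so their cores are larger and the compression is only approximate.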

This method is called "Quantum-Inspired" because it borrows a technique from quantum computing (the Quantum Fourier Transform) to solve the math, even though the authors are running it on regular supercomputers, not actual quantum computers.

The Experiment: Who is the Fastest Runner?

The authors wanted to see if this new "compressed" method could run fast on modern computer chips. They tested three different types of hardware:

CPU: The standard brain of a computer (like a reliable, all-purpose workhorse).
GPU: A chip designed for graphics and parallel processing (like a team of 10,000 ants working together).
TPU: A specialized chip made by Google specifically for AI (like a Formula 1 race car built for one specific type of track).

They built a new engine (using a software tool called JAX) to run their "compressed" math on these chips and timed how fast they went.

The Results: It Depends on the Race

The paper found that there is no single "winner." It depends on the size of the problem and the type of math being done:

For huge, parallel tasks (The GPU Wins): When the math involves doing millions of simple calculations at once (like adding up huge lists), the GPU was the fastest. It scales up beautifully, handling massive datasets that would crash the other chips.
For smaller or more complex tasks (The TPU Wins): For certain types of math that are harder to split up, the TPU was surprisingly efficient, often beating the CPU and sometimes the GPU.
The CPU: It was the slowest, but it was dependable. It handled larger problems than the TPU before running out of memory, although the GPU ultimately reached the largest problem sizes.

A Glitch in the Matrix:
The authors found a specific problem with the TPU. When running a particular matrix calculation (called SVD) on very large problems with complex-valued numbers, the TPU's calculation failed to settle on an answer. To work around this, they used a slightly slower but more stable "backup plan" (Polar Decomposition) just for the TPU.

The Final Verdict: Breaking the Limits

The most exciting part of the paper is what they achieved with this new setup:

They successfully ran homogenization simulations on datasets with 70 billion grid points.

The Catch: The best traditional methods (using standard FFT) simply cannot do this. They run out of memory long before reaching that size.
The Breakthrough: By using the "compressed" Tensor Train method on these accelerators, they were able to solve problems that were previously impossible.

Summary

Think of this paper as a test drive for a new, fuel-efficient engine (Tensor Trains) in three different cars (CPU, GPU, TPU).

They proved that this engine can drive much further (handle much larger data) than the old engines.
They found that the GPU is the best car for long, straight highway drives (massive parallel data).
They found that the TPU is great for specific, technical tracks, though it has a few quirks with high-precision math.
Most importantly, they showed that with this new engine, we can finally drive through "traffic jams" (massive datasets) that used to be completely blocked off.

Technical Summary: Performance Benchmarking of Tensor Trains for Quantum-Inspired Homogenization on TPU, GPU, and CPU Architectures

Problem Statement
Recent advances in high-resolution CT-imaging have generated ultra-high-resolution microstructural datasets (reaching tens of billions of voxels) that challenge traditional homogenization approaches. While state-of-the-art Fast Fourier Transform (FFT)-based homogenization techniques are effective for moderate datasets, their memory footprint and computational cost scale as $O(dN^d \log N)$ for $N$ grid points in each of $d$ dimensions, rendering them inefficient for industrial-scale problems. Although hardware accelerators (GPUs and TPUs) offer computational power, the extreme memory requirements of high-resolution data often exceed their capacity. Quantum Fourier Transforms (QFT) offer theoretical exponential speed-ups, but they remain impractical due to the lack of fault-tolerant quantum hardware. Consequently, there is a need for "quantum-inspired" classical algorithms that leverage low-rank tensor representations to overcome these memory and computational bottlenecks.
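As a rough, back-of-the-envelope check on the memory argument, the sketch below estimates the storage for one dense field. It assumes one complex64 value (8 bytes) per grid point and takes "tens of billions of voxels" to mean about 2^36 points. The helper name is hypothetical, and a real FFT-based solver would hold several such fields plus workspace, so these figures are lower bounds.

```python
def dense_field_gib(num_points, bytes_per_value=8):
    """Memory in GiB for one dense array with num_points entries.

    bytes_per_value=8 corresponds to complex64 (two 32-bit floats).
    """
    return num_points * bytes_per_value / 2**30


for exponent in (24, 28, 32, 36):
    points = 2**exponent
    print(f"2^{exponent} points: {dense_field_gib(points):10.3f} GiB")
```

Under these assumptions, a single field at 2^36 points already occupies 512 GiB, which illustrates why dense storage exhausts accelerator memory well before that size.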

Methodology
The paper investigates the performance of the Superfast Fourier Transform (SFFT)-based homogenization algorithm, which utilizes Tensor Train (TT) and Tensor Train Operator (TTO) formats to represent high-order tensors. The study proceeds in two phases:

Fundamental Operation Benchmarking: The authors implemented fundamental TT algebra operations (addition, multiplication, contraction, orthogonalization, and compression) using the JAX framework across three hardware architectures: Dual Intel Xeon Gold 6240R CPUs, NVIDIA A100 GPUs, and Google TPU v4-8. Two implementation modes were compared: a "list-format" (cores stored as a list of arrays) and a "batched-format" (cores stored within a single batched array). The study utilized complex64 precision to ensure accuracy, operating TPUs outside their typical BF16-optimized regime. Performance was analyzed via execution times and Roofline models to determine memory-bound versus compute-bound regimes.
Accelerated Homogenization Application: The SFFT-based homogenization workflow was adapted for these accelerators. To address the high overhead of Just-In-Time (JIT) compilation in JAX when tensor ranks change dynamically, a "coarse-graining" strategy was introduced. This restricts tensor ranks to multiples of a base rank ($r_0 = 16$) to minimize recompilation events. For TPU implementations, standard SVD-based compression was replaced with Polar decomposition-based compression to ensure numerical stability under complex64 arithmetic, where SVD was observed to fail to converge at high discretizations.
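The coarse-graining idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, and it assumes ranks are rounded up to the next multiple of $r_0$ (the summary states only that ranks are restricted to multiples of $r_0 = 16$). Because a JIT-compiled JAX function is recompiled for each new input shape, fewer distinct ranks mean fewer compilations.

```python
def coarse_grain_rank(rank, base_rank=16):
    """Round a TT rank up to the nearest multiple of base_rank."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return -(-rank // base_rank) * base_rank  # ceiling division, then scale


# Hypothetical ranks produced by successive compression steps.
raw_ranks = [3, 9, 14, 17, 21, 30, 33, 47, 48, 50]
bucketed = [coarse_grain_rank(r) for r in raw_ranks]

print(bucketed)             # [16, 16, 16, 32, 32, 32, 48, 48, 48, 64]
print(len(set(raw_ranks)))  # 10 distinct ranks without coarse-graining
print(len(set(bucketed)))   # 4 distinct ranks with coarse-graining
```

The trade-off in this sketch is that padded cores carry some unused rank, exchanging a little extra memory and arithmetic for far fewer recompilations.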

Key Contributions

First Systematic TPU Benchmarking: The paper provides the first rigorous benchmarking of fundamental TT operations on TPU hardware, including a direct performance comparison against GPUs and CPUs.
Hardware-Accelerated TT Algebra: It presents efficient implementations of TT algebra on modern accelerators, evaluating the feasibility of list-format versus batched-format storage and identifying specific performance characteristics (e.g., memory-bound vs. compute-bound behavior) for different operations.
Practical Implementation of SFFT Homogenization: The authors successfully adapted the SFFT-based homogenization algorithm for GPU and TPU execution, enabling the simulation of datasets ranging from 300 million to 70 billion grid points—sizes infeasible for standard GPU-based FFT reference implementations.
Stability Analysis: The work identifies numerical instabilities in TPU-based SVD operations under complex64 precision and proposes Polar decomposition as a stable alternative for high-discretization regimes.

Results

Operation Performance:
- Parallel Operations: For highly parallelizable operations (addition, multiplication, TT-TTO contraction), GPUs demonstrated superior scalability at high discretization levels, eventually surpassing TPUs. TPUs showed low overhead at lower discretizations but were strictly memory-bound across the tested range.
- Serial Operations: For serial operations (orthogonalization, compression), TPUs generally outperformed GPUs across the full regime. However, SVD-based compression on TPUs failed to converge at discretizations around $2^7$ under complex64 precision, necessitating the switch to Polar decomposition.
- Roofline Analysis: GPUs were predominantly compute-bound for complex operations, while TPUs remained memory-bound for parallel tasks but transitioned toward compute-bound behavior for serial tasks at larger discretizations.
Homogenization Scaling:
- The GPU-based quantum-inspired solver successfully scaled up to approximately 70 billion grid points ($2^{18}$ points per dimension), significantly exceeding the memory limits of the cuFFT-based reference implementation (limited to $2^{12}$ points per dimension).
- CPU and TPU versions reached $2^{14}$ and $2^{10}$ points per dimension, respectively, limited by memory capacity.
- While the SFFT implementation was not yet fully optimized and its absolute execution times lagged those of highly tuned cuFFT libraries, the scaling behavior indicated that the SFFT approach would eventually outperform FFT-based methods as problem sizes increase, particularly for geometries with separable structures where TT ranks remain moderate.
Accuracy: The method maintained a relative error below 5% for effective material properties, controlled by the compression cutoff parameter.
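For readers unfamiliar with Roofline models, the memory-bound versus compute-bound classification used above follows from one formula: attainable throughput is the smaller of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity. The sketch below uses made-up hardware numbers, not measured values for any of the benchmarked devices, and the function names are hypothetical.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    intensity  -- FLOPs performed per byte moved to or from memory
    peak_flops -- peak compute rate in FLOP/s
    bandwidth  -- memory bandwidth in bytes/s
    """
    return min(peak_flops, bandwidth * intensity)


def regime(intensity, peak_flops, bandwidth):
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge = peak_flops / bandwidth  # intensity where the two limits meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Illustrative device: 10 TFLOP/s peak, 1 TB/s bandwidth (ridge = 10 FLOP/byte).
PEAK, BW = 10e12, 1e12

print(regime(0.25, PEAK, BW))  # memory-bound (low intensity, e.g. elementwise work)
print(regime(40.0, PEAK, BW))  # compute-bound (high intensity, e.g. dense products)
print(attainable_flops(0.25, PEAK, BW) / PEAK)  # fraction of peak reachable
```

In this picture, an operation that moves many bytes per arithmetic step stays under the sloped part of the roof no matter how fast the compute units are.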

Significance and Claims
The paper claims to establish a foundation for high-performance, large-scale tensor-based homogenization on modern accelerators. It demonstrates that Tensor Train techniques can overcome both memory and computational bottlenecks in industrial-scale simulations, enabling the homogenization of massive datasets previously infeasible on conventional accelerators.

The authors emphasize that this work does not modify the fundamental SFFT algorithm but focuses on its efficient implementation and acceleration. They position the method as a complementary tool for data-driven multiscale modeling, capable of generating accurate reference solutions for training neural operators. The study concludes that while the approach is currently limited to approximately low-rank geometries (e.g., pixelized microstructures from layered composites or lattice materials), it represents a viable pathway toward scalable, physics-based quantum-inspired solvers for multiscale material modeling. The authors remain modest regarding immediate industrial applicability for arbitrary microstructures, noting that future work is required to address numerical stability on TPUs and to extend these methods to higher-order tensor networks.
