A Study on Inference Latency for Vision Transformers on Mobile Devices

This paper quantitatively analyzes the inference latency of real-world Vision Transformers on mobile devices by comparing them with CNNs, leading to the creation of a comprehensive dataset that enables accurate latency prediction for new ViT architectures.

Zhuojin Li, Marco Paolieri, Leana Golubchik

Published 2026-02-20

Imagine you have a brand-new, super-smart robot assistant (a Vision Transformer or ViT) that you want to carry in your pocket. This robot is amazing at recognizing faces, reading signs, and understanding the world around you. However, there's a catch: your pocket is small (your mobile phone), and your robot is used to working in a massive, unlimited warehouse (the cloud).

The paper by Zhuojin Li and colleagues asks a simple but critical question: "Can this giant robot actually run fast enough on a tiny phone without getting stuck, overheating, or draining the battery?"

Here is the story of their investigation, broken down into everyday concepts.

1. The Old Way vs. The New Way: The Factory Analogy

For years, phones used CNNs (Convolutional Neural Networks) to see. Think of a CNN like a traditional assembly line. A worker looks at a small part of a car, then moves to the next part, then the next. They only see what's right in front of them at any given moment. It's efficient, but it takes a long time to understand the whole picture.

ViTs are the new kids on the block. They work like a team of detectives in a war room. Instead of looking at one piece at a time, every detective looks at every other piece simultaneously to figure out how they all connect.

  • The Problem: This "war room" approach is incredibly smart, but it requires a lot of talking and checking. On a phone, which has limited memory and power, this constant "talking" (called self-attention) slows everything down.

2. The Big Discovery: It's Not Just About Brains, It's About Memory

The researchers tested 190 different versions of these robot brains on 6 different types of phones (from iPhones to various Androids). They compared them to 102 older models (CNNs).

They found three surprising things:

  • The "FLOPs" Lie: In the tech world, people often guess how fast a model is by counting its math problems (called FLOPs). It's like guessing how long a road trip will take just by counting the miles.
    • The Finding: For the new robots (ViTs), counting miles is useless! A ViT might have the same "mile count" as an old robot but take twice as long to finish the trip. Why? Because the new robots get stuck in traffic (memory bottlenecks) much more often.
  • The Memory Bottleneck: The old robots were like trucks that carried heavy loads but didn't need to stop often. The new robots are like delivery drones that need to constantly refuel and reload data. If the phone's "fuel line" (memory bandwidth) isn't wide enough, the drone just sits there waiting.
  • The "Shape" Matters: The researchers discovered that how the data is arranged in the phone's memory is crucial.
    • Analogy: Imagine you have a stack of books. If they are stacked in the order you need (channels-first, NCHW), reading is fast. But if you have to take them apart, rearrange them, and restack them just to read one page (channels-last, NHWC), you waste time. ViT operators often force the runtime to convert between these two layouts, and the reshuffling itself costs real time.
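To see why the "mile count" misleads, engineers often compute arithmetic intensity: how much math an operator does per byte of memory it moves. Here is a back-of-the-envelope sketch; the layer shapes, the float32 assumption, and the traffic model (read inputs and weights once, write outputs once) are illustrative simplifications, not numbers from the paper:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic.
    Low values mean the operator is memory-bound: the chip
    spends its time waiting on data, not doing math."""
    return flops / bytes_moved

# Illustrative shapes: a 196-token, 768-dim ViT layer vs. a CNN layer,
# with 4-byte (float32) values.
N, d, B = 196, 768, 4

# Self-attention scores Q @ K^T: 2*N*N*d multiply-adds,
# reading Q and K, writing an N x N attention matrix.
attn_flops = 2 * N * N * d
attn_bytes = B * (2 * N * d + N * N)

# A 3x3 conv, 256 in/out channels, on a 14x14 feature map:
# the same weights are reused at all 196 spatial positions.
C, H, W, K = 256, 14, 14, 3
conv_flops = 2 * C * C * K * K * H * W
conv_bytes = B * (2 * C * H * W + C * C * K * K)

print(arithmetic_intensity(attn_flops, attn_bytes))  # lower: more memory-bound
print(arithmetic_intensity(conv_flops, conv_bytes))  # higher: more compute-bound
```

With these toy numbers the conv does roughly twice as much math per byte moved, which is why two models with identical FLOP counts can have very different real latencies on a bandwidth-limited phone.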

3. The "Magic Trick" of Activation Functions

The paper also looked at a specific math trick the robots use called GELU (an activation function).

  • The Analogy: Imagine a bouncer at a club. Sometimes, the bouncer checks your ID quickly. Other times, if your ID looks a certain way, the bouncer decides to do a full background check, which takes forever.
  • The Finding: The speed of the robot depends on the actual numbers inside the image it's looking at. If the image has certain patterns, the "bouncer" (GELU) gets slow. This is impossible to predict just by looking at the robot's blueprint; you have to actually run it to see how fast it is.
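For the curious, here is what the "bouncer" actually computes. The exact GELU requires the error function, which is expensive on mobile hardware, so many runtimes substitute a tanh approximation; which path a runtime takes, and how its math library behaves on particular input values, is part of why GELU latency is hard to predict from the blueprint alone. (This is a minimal reference sketch, not the kernels any specific phone runs.)

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF,
    # computed here via the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh-based approximation, cheaper on hardware
    # without a fast erf implementation.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

Contrast this with ReLU, which is a single comparison per value; that gap in per-element work is one reason activation functions show up at all in mobile latency profiles.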

4. The Solution: Building a "Crystal Ball" Dataset

Since it's so hard to predict how fast these robots will be, the researchers decided to build a giant library of test runs.

  • They created 1,000 fake robots (synthetic ViTs) spanning many different combinations of parts.
  • They ran all of them on the 6 phones using two different "engines" (the PyTorch and TensorFlow frameworks).
  • They measured exactly how long each one took.

Why do this?
Now, they have a Crystal Ball. If a developer wants to build a new robot tomorrow, they don't need to build it, put it on a phone, and wait to see if it's slow. They can just look at the Crystal Ball (the dataset) and say, "Oh, if you use this part and that part, it will take 50 milliseconds."
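In spirit, using such a dataset can be as simple as summing previously measured per-operator latencies for a candidate design. The operator names, shapes, and millisecond values below are hypothetical placeholders to show the idea, not the paper's actual data format or measurements:

```python
# Toy "crystal ball": latencies (ms) measured once per operator
# configuration on a target phone. All values are made up.
measured_ms = {
    ("layernorm", 196, 768): 0.1,
    ("attention", 196, 768): 1.8,
    ("mlp",       196, 768): 1.2,
}

def predict_latency(layers):
    """Estimate a model's latency as the sum of its layers'
    previously measured operator latencies."""
    return sum(measured_ms[layer] for layer in layers)

# A 2-block ViT sketch: each block = norm + attention + norm + MLP.
block = [("layernorm", 196, 768), ("attention", 196, 768),
         ("layernorm", 196, 768), ("mlp", 196, 768)]
print(predict_latency(block * 2))  # ≈ 6.4 ms
```

Real predictors trained on such a dataset can be fancier (regression models, learned cost models), but the payoff is the same: an answer in microseconds instead of a deploy-and-measure cycle per candidate design.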

5. Why This Matters for You

This research is a game-changer for two main reasons:

  1. Designing Better Apps (NAS): Imagine you are designing a self-driving car app. You want the car to be smart but not slow. With this "Crystal Ball," the computer can automatically design the perfect robot brain that fits your phone perfectly, saving you hours of trial and error.
  2. Splitting the Work (Collaborative Inference): Sometimes a phone is too weak to do the whole job. You might want to do the easy part on the phone and send the hard part to the cloud. This study helps figure out exactly where to cut the work so the phone doesn't get stuck waiting for the cloud.
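The split-point decision in item 2 is exactly the kind of question accurate per-layer latencies answer. Here is a minimal sketch of choosing where to cut, assuming hypothetical per-layer phone/cloud latencies and upload times (none of these numbers come from the paper):

```python
def best_split(phone_ms, cloud_ms, upload_ms):
    """Choose where to cut a layer pipeline between phone and cloud.
    Splitting at k runs layers 0..k-1 on the phone, uploads layer k's
    input, then finishes layers k..n-1 in the cloud.

    phone_ms[i], cloud_ms[i]: latency of layer i on each side.
    upload_ms[i]: time to send layer i's input (upload_ms[0] is
    the raw image; deeper activations are often smaller).
    """
    n = len(phone_ms)
    best_k, best_total = 0, float("inf")
    for k in range(n + 1):          # k == n means fully on-phone
        total = sum(phone_ms[:k]) + sum(cloud_ms[k:])
        if k < n:
            total += upload_ms[k]   # no upload if nothing goes to the cloud
        if total < best_total:
            best_k, best_total = k, total
    return best_k, best_total

# Toy example: cheap early layers, expensive late ones, and
# activations that shrink with depth.
phone = [2.0, 2.0, 8.0, 8.0]
cloud = [0.5, 0.5, 1.0, 1.0]
up    = [9.0, 4.0, 1.0, 1.0]
print(best_split(phone, cloud, up))  # → (2, 7.0)
```

Here the best plan runs the two cheap layers on the phone, then ships the now-small activation to the cloud for the heavy layers; bad latency estimates would push the cut to the wrong spot.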

The Bottom Line

The paper tells us that Vision Transformers are powerful but tricky on phones. They are often slower and thirstier for memory than we thought. But by understanding why they are slow (memory traffic, data shapes, and specific math tricks), the researchers built a massive dataset that acts as a GPS for developers. This GPS helps them navigate the complex world of mobile AI to build faster, smarter, and more efficient apps for our pocket-sized computers.
