A Study on Inference Latency for Vision Transformers on Mobile Devices

This paper quantitatively analyzes the inference latency of real-world Vision Transformers on mobile devices by comparing them with CNNs, leading to the creation of a comprehensive dataset that enables accurate latency prediction for new ViT architectures.

Zhuojin Li, Marco Paolieri, Leana Golubchik

Published 2026-02-20

Imagine you have a brand-new, super-smart robot assistant (a Vision Transformer or ViT) that you want to carry in your pocket. This robot is amazing at recognizing faces, reading signs, and understanding the world around you. However, there's a catch: your pocket is small (your mobile phone), and your robot is used to working in a massive, unlimited warehouse (the cloud).

The paper by Zhuojin Li and colleagues asks a simple but critical question: "Can this giant robot actually run fast enough on a tiny phone without getting stuck, overheating, or draining the battery?"

Here is the story of their investigation, broken down into everyday concepts.

1. The Old Way vs. The New Way: The Factory Analogy

For years, phones used CNNs (Convolutional Neural Networks) to see. Think of a CNN like a traditional assembly line. A worker looks at a small part of a car, then moves to the next part, then the next. They only see what's right in front of them at any given moment. It's efficient, but it takes a long time to understand the whole picture.

ViTs are the new kids on the block. They work like a team of detectives in a war room. Instead of looking at one piece at a time, every detective looks at every other piece simultaneously to figure out how they all connect.

  • The Problem: This "war room" approach is incredibly smart, but it requires a lot of talking and checking. On a phone, which has limited memory and power, this constant "talking" (called self-attention) slows everything down.

2. The Big Discovery: It's Not Just About Brains, It's About Memory

The researchers tested 190 different versions of these robot brains on 6 different types of phones (from iPhones to various Androids). They compared them to 102 older models (CNNs).

They found three surprising things:

  • The "FLOPs" Lie: In the tech world, people often guess how fast a model is by counting its math problems (called FLOPs). It's like guessing how long a road trip will take just by counting the miles.
    • The Finding: For the new robots (ViTs), counting miles is useless! A ViT might have the same "mile count" as an old robot but take twice as long to finish the trip. Why? Because the new robots get stuck in traffic (memory bottlenecks) much more often.
  • The Memory Bottleneck: The old robots were like trucks that carried heavy loads but didn't need to stop often. The new robots are like delivery drones that need to constantly refuel and reload data. If the phone's "fuel line" (memory bandwidth) isn't wide enough, the drone just sits there waiting.
  • The "Shape" Matters: The researchers discovered that how the data is arranged in the phone's memory is crucial.
    • Analogy: Imagine you have a stack of books. If they are stacked in the order you need (channels-first, NCHW), reading is fast. But if you have to take them apart, rearrange them, and restack them just to read one page (channels-last, NHWC), you waste time. ViT operators often force the runtime to convert between these two layouts, and the reshuffling itself costs real time.
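To see why the "mile count" misleads, engineers often compute arithmetic intensity: how much math an operator does per byte of memory it moves. Here is a back-of-the-envelope sketch; the layer shapes, the float32 assumption, and the traffic model (read inputs and weights once, write outputs once) are illustrative simplifications, not numbers from the paper:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic.
    Low values mean the operator is memory-bound: the chip
    spends its time waiting on data, not doing math."""
    return flops / bytes_moved

# Illustrative shapes: a 196-token, 768-dim ViT layer vs. a CNN layer,
# with 4-byte (float32) values.
N, d, B = 196, 768, 4

# Self-attention scores Q @ K^T: 2*N*N*d multiply-adds,
# reading Q and K, writing an N x N attention matrix.
attn_flops = 2 * N * N * d
attn_bytes = B * (2 * N * d + N * N)

# A 3x3 conv, 256 in/out channels, on a 14x14 feature map:
# the same weights are reused at all 196 spatial positions.
C, H, W, K = 256, 14, 14, 3
conv_flops = 2 * C * C * K * K * H * W
conv_bytes = B * (2 * C * H * W + C * C * K * K)

print(arithmetic_intensity(attn_flops, attn_bytes))  # lower: more memory-bound
print(arithmetic_intensity(conv_flops, conv_bytes))  # higher: more compute-bound
```

With these toy numbers the conv does roughly twice as much math per byte moved, which is why two models with identical FLOP counts can have very different real latencies on a bandwidth-limited phone.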

3. The "Magic Trick" of Activation Functions

The paper also looked at a specific math trick the robots use called GELU (an activation function).

  • The Analogy: Imagine a bouncer at a club. Sometimes, the bouncer checks your ID quickly. Other times, if your ID looks a certain way, the bouncer decides to do a full background check, which takes forever.
  • The Finding: The speed of the robot depends on the actual numbers inside the image it's looking at. If the image has certain patterns, the "bouncer" (GELU) gets slow. This is impossible to predict just by looking at the robot's blueprint; you have to actually run it to see how fast it is.
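For the curious, here is what the "bouncer" actually computes. The exact GELU requires the error function, which is expensive on mobile hardware, so many runtimes substitute a tanh approximation; which path a runtime takes, and how its math library behaves on particular input values, is part of why GELU latency is hard to predict from the blueprint alone. (This is a minimal reference sketch, not the kernels any specific phone runs.)

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF,
    # computed here via the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh-based approximation, cheaper on hardware
    # without a fast erf implementation.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

Contrast this with ReLU, which is a single comparison per value; that gap in per-element work is one reason activation functions show up at all in mobile latency profiles.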

4. The Solution: Building a "Crystal Ball" Dataset

Since it's so hard to predict how fast these robots will be, the researchers decided to build a giant library of test runs.

  • They created 1,000 fake robots (synthetic ViTs) spanning many different combinations of parts.
  • They ran all of them on the 6 phones using two different "engines" (the PyTorch and TensorFlow frameworks).
  • They measured exactly how long each one took.

Why do this?
Now, they have a Crystal Ball. If a developer wants to build a new robot tomorrow, they don't need to build it, put it on a phone, and wait to see if it's slow. They can just look at the Crystal Ball (the dataset) and say, "Oh, if you use this part and that part, it will take 50 milliseconds."
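In spirit, using such a dataset can be as simple as summing previously measured per-operator latencies for a candidate design. The operator names, shapes, and millisecond values below are hypothetical placeholders to show the idea, not the paper's actual data format or measurements:

```python
# Toy "crystal ball": latencies (ms) measured once per operator
# configuration on a target phone. All values are made up.
measured_ms = {
    ("layernorm", 196, 768): 0.1,
    ("attention", 196, 768): 1.8,
    ("mlp",       196, 768): 1.2,
}

def predict_latency(layers):
    """Estimate a model's latency as the sum of its layers'
    previously measured operator latencies."""
    return sum(measured_ms[layer] for layer in layers)

# A 2-block ViT sketch: each block = norm + attention + norm + MLP.
block = [("layernorm", 196, 768), ("attention", 196, 768),
         ("layernorm", 196, 768), ("mlp", 196, 768)]
print(predict_latency(block * 2))  # ≈ 6.4 ms
```

Real predictors trained on such a dataset can be fancier (regression models, learned cost models), but the payoff is the same: an answer in microseconds instead of a deploy-and-measure cycle per candidate design.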

5. Why This Matters for You

This research is a game-changer for two main reasons:

  1. Designing Better Apps (NAS): Imagine you are designing a self-driving car app. You want the car to be smart but not slow. With this "Crystal Ball," the computer can automatically design the perfect robot brain that fits your phone perfectly, saving you hours of trial and error.
  2. Splitting the Work (Collaborative Inference): Sometimes a phone is too weak to do the whole job. You might want to do the easy part on the phone and send the hard part to the cloud. This study helps figure out exactly where to cut the work so the phone doesn't get stuck waiting for the cloud.
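The split-point decision in item 2 is exactly the kind of question accurate per-layer latencies answer. Here is a minimal sketch of choosing where to cut, assuming hypothetical per-layer phone/cloud latencies and upload times (none of these numbers come from the paper):

```python
def best_split(phone_ms, cloud_ms, upload_ms):
    """Choose where to cut a layer pipeline between phone and cloud.
    Splitting at k runs layers 0..k-1 on the phone, uploads layer k's
    input, then finishes layers k..n-1 in the cloud.

    phone_ms[i], cloud_ms[i]: latency of layer i on each side.
    upload_ms[i]: time to send layer i's input (upload_ms[0] is
    the raw image; deeper activations are often smaller).
    """
    n = len(phone_ms)
    best_k, best_total = 0, float("inf")
    for k in range(n + 1):          # k == n means fully on-phone
        total = sum(phone_ms[:k]) + sum(cloud_ms[k:])
        if k < n:
            total += upload_ms[k]   # no upload if nothing goes to the cloud
        if total < best_total:
            best_k, best_total = k, total
    return best_k, best_total

# Toy example: cheap early layers, expensive late ones, and
# activations that shrink with depth.
phone = [2.0, 2.0, 8.0, 8.0]
cloud = [0.5, 0.5, 1.0, 1.0]
up    = [9.0, 4.0, 1.0, 1.0]
print(best_split(phone, cloud, up))  # → (2, 7.0)
```

Here the best plan runs the two cheap layers on the phone, then ships the now-small activation to the cloud for the heavy layers; bad latency estimates would push the cut to the wrong spot.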

The Bottom Line

The paper tells us that Vision Transformers are powerful but tricky on phones. They are often slower and thirstier for memory than we thought. But by understanding why they are slow (memory traffic, data shapes, and specific math tricks), the researchers built a massive dataset that acts as a GPS for developers. This GPS helps them navigate the complex world of mobile AI to build faster, smarter, and more efficient apps for our pocket-sized computers.
