A Hybrid Vision Transformer Approach for Mathematical Expression Recognition

This paper proposes a Hybrid Vision Transformer approach with 2D positional encoding and a coverage attention decoder to address the complexities of mathematical expression recognition, achieving a state-of-the-art BLEU score of 89.94 on the IM2LATEX-100K dataset.

Anh Duy Le, Van Linh Pham, Vinh Loi Ly, Nam Quan Nguyen, Huu Thang Nguyen, Tuan Anh Tran

Published 2026-03-10

Imagine you are trying to teach a robot to read a handwritten math equation from a piece of paper and turn it into computer code (LaTeX) that a machine can render back into the same formula. This is the challenge of Mathematical Expression Recognition.

The problem is tricky because math isn't just a line of text like a sentence in a book. It's a 2D puzzle. You have numbers sitting on top of others (superscripts), fractions with lines in the middle, and symbols scattered all over the page. A standard text reader just looks left-to-right, but a math reader needs to understand the whole picture at once.

Here is how the authors of this paper solved that puzzle, explained simply:

1. The Old Way vs. The New Way

The Old Way (The Assembly Line):
Previous methods tried to solve this in two steps. First, they cut the image into tiny pieces to find individual numbers and symbols (like cutting a pizza into slices). Then, they tried to guess how those slices fit together. This often failed because if you cut a fraction wrong, the whole equation makes no sense.

The New Way (The "Hybrid Vision Transformer"):
The authors built a new system called HVT. Think of this as a super-smart detective that doesn't just look at individual clues but understands the entire crime scene at once.

2. The Three-Stage Detective Team

The system works in three main stages, like a relay race:

Stage A: The "Eagle Eye" (The Encoder)

First, the system needs to look at the math image.

  • The CNN Backbone: Imagine a pair of high-powered binoculars. This part (a ResNet) scans the image to get a general idea of what's there. It's good at spotting shapes and edges but might miss the big picture.
  • The Vision Transformer (ViT): This is the "brain" of the operation. Unlike the binoculars, the ViT looks at the whole image at once. It uses a mechanism called Self-Attention.
    • Analogy: Imagine you are in a crowded room. A normal person only hears the person standing next to them. The ViT, however, can instantly hear and understand the conversation of everyone in the room, even if they are on opposite sides. This helps the robot understand that a "plus" sign at the top of a fraction is related to a "minus" sign at the bottom, even though they are far apart.
  • The 2D Map: Since math has height and width (like a grid), the authors gave the robot a special 2D GPS: a 2D positional encoding. This ensures the robot knows exactly where a symbol sits in the grid (its row and its column), not just its order in a line.
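The "2D GPS" idea can be sketched in a few lines. Below is a minimal NumPy illustration of a common way to build a 2D positional encoding: one half of each feature vector encodes the row with standard sinusoids, the other half encodes the column. The function names and dimensions here are illustrative, not the paper's exact implementation.

```python
import numpy as np

def sinusoidal_1d(positions, dim):
    """Standard 1D sinusoidal encoding -> shape (len(positions), dim)."""
    pe = np.zeros((len(positions), dim))
    div = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def positional_encoding_2d(height, width, dim):
    """2D encoding: first half of the channels encodes the row,
    second half encodes the column."""
    assert dim % 4 == 0
    half = dim // 2
    rows = sinusoidal_1d(np.arange(height), half)  # (H, dim/2)
    cols = sinusoidal_1d(np.arange(width), half)   # (W, dim/2)
    pe = np.zeros((height, width, dim))
    pe[:, :, :half] = rows[:, None, :]  # same row code across a row
    pe[:, :, half:] = cols[None, :, :]  # same column code down a column
    return pe  # (H, W, dim), added to the feature grid before attention

pe = positional_encoding_2d(8, 16, 64)
print(pe.shape)  # (8, 16, 64)
```

Every cell of the feature grid now carries its own coordinates, so the attention mechanism can tell a superscript (up and to the right) from a subscript (down and to the right).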

Stage B: The "Memory Keeper" (The [CLS] Token)

The ViT produces a huge amount of data. To pass this to the next stage, the system uses a special token called [CLS].

  • Analogy: Think of the [CLS] token as the Team Captain. After the whole team (the image features) has discussed the problem, the Captain summarizes the entire discussion into one single, perfect sentence. The next stage doesn't need to read the whole meeting minutes; it just listens to the Captain's summary to start writing the answer.
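The "Team Captain" mechanism is simple in code: a learnable vector is prepended to the patch sequence, it mixes with every patch through self-attention, and its output slot becomes the summary. The toy attention layer below uses no learned weights; it is a hedged sketch of the idea, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
patches = rng.normal(size=(128, dim))  # features for 128 image patches
cls_token = np.zeros((1, dim))         # a learnable vector in a real model

# Prepend [CLS]; inside the Transformer it attends to every patch.
seq = np.concatenate([cls_token, patches], axis=0)  # (129, dim)

def toy_attention_layer(x):
    """One round of scaled dot-product self-attention (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

encoded = toy_attention_layer(seq)
summary = encoded[0]  # the Captain's one-vector summary, handed to the decoder
print(summary.shape)  # (64,)
```

After the (real, multi-layer) encoder runs, reading position 0 is all the decoder needs to start writing: the whole "meeting" has been condensed into one vector.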

Stage C: The "Translator" (The Decoder)

Now the robot needs to write the answer in LaTeX code.

  • Coverage Attention: This is the robot's "checklist." As the robot writes the code, it keeps a running list of which parts of the image it has already looked at.
    • The Problem it Solves: Sometimes, a robot might get confused and write the same symbol twice (over-parsing) or skip a symbol entirely (under-parsing). The checklist reminds the robot: "Hey, you already looked at that fraction bar! Don't look at it again, move on to the next part!"
  • The Result: The robot writes the code one symbol at a time, checking its list to ensure it hasn't missed anything or repeated itself.
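The "checklist" can be sketched as attention with a coverage vector: a running sum of all past attention weights that is subtracted from the scores, so regions the robot has already stared at are pushed down. This is a minimal illustration of the coverage idea with made-up shapes and a hypothetical `penalty` weight, not the paper's exact formulation.

```python
import numpy as np

def coverage_attention_step(query, keys, coverage, penalty=1.0):
    """One decoding step: score each image region, subtract a penalty for
    regions already covered, then update the running coverage vector."""
    scores = keys @ query / np.sqrt(len(query))  # plain attention scores
    scores = scores - penalty * coverage         # "you already looked there!"
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax
    coverage = coverage + weights                # add this step to the checklist
    context = weights @ keys                     # weighted sum of region features
    return context, weights, coverage

rng = np.random.default_rng(0)
dim, regions = 32, 10
keys = rng.normal(size=(regions, dim))
coverage = np.zeros(regions)

# Query the same region twice: coverage pushes attention elsewhere.
q = keys[3]
_, w1, coverage = coverage_attention_step(q, keys, coverage)
_, w2, coverage = coverage_attention_step(q, keys, coverage)
print(w1[3], w2[3])  # attention to region 3 drops on the second look
```

Without the coverage term, both steps would attend to region 3 identically, which is exactly how a decoder ends up writing the same fraction bar twice.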

3. Why Was This Successful?

The authors tested their robot on a massive dataset of 100,000 math equations (IM2LATEX-100K).

  • The Score: They achieved a BLEU score of 89.94, a state-of-the-art result on this benchmark at the time of publication.
  • The Secret Sauce: By combining the "local vision" of the binoculars (CNN) with the "global understanding" of the Transformer, and adding a "checklist" (Coverage Attention) to prevent mistakes, they created a system that understands math like a human does—seeing the whole structure, not just the parts.

Summary

In short, the authors built a robot that:

  1. Sees the whole math image at once (not just piece by piece).
  2. Understands how far-apart symbols relate to each other.
  3. Keeps a checklist to make sure it doesn't skip or repeat symbols while writing the answer.

This approach is a major leap forward, making it much easier for computers to read complex scientific documents, textbooks, and research papers.