Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS

This paper demonstrates that selecting an optimal subset of body landmarks combined with spline-based imputation enables isolated Brazilian Sign Language (LIBRAS) recognition that is both 5 times faster and as accurate as state-of-the-art methods, overcoming the speed-accuracy trade-off of previous OpenPose-based approaches.

Daniele L. V. dos Santos, Thiago B. Pereira, Carlos Eduardo G. R. Alves, Richard J. M. G. Tello, Francisco de A. Boldt, Thiago M. Paixão

Published Wed, 11 Ma

Imagine you are trying to teach a computer to understand Brazilian Sign Language (LIBRAS). The computer needs to "see" a person signing and figure out what word they are signing.

For a long time, the best way to do this was like hiring a super-precise, but incredibly slow, robot architect. This robot (called OpenPose) would scan a video of a signer and draw a detailed skeleton of every joint, finger, and facial feature in every frame. The problem? It took so long to draw these skeletons that the system couldn't keep up with real-time conversation. It was like trying to win a race while carrying a heavy backpack full of bricks.

Then, someone suggested using a lightweight, fast drone instead (called MediaPipe). This drone could draw the skeleton in a flash—5 times faster! But there was a catch: when the researchers just swapped the slow robot for the fast drone, the computer got confused. The accuracy dropped because the drone was too "noisy" and included too much irrelevant detail (like every tiny wrinkle on a face) that distracted the computer.

This paper is the story of how the researchers fixed the drone to make it both fast AND smart.

Here is the breakdown of their solution using simple analogies:

1. The "Needle in a Haystack" Problem

The fast drone (MediaPipe) gives you 543 points of data. Imagine you are looking for a specific word in a book, but the book is filled with 500 pages of random noise and only 50 pages of actual story. The computer gets overwhelmed trying to read the noise.

The researchers realized they didn't need all the data. They needed to pick the right pages.

2. The "Curated Playlist" Strategy

The team tested five different ways to select which body parts to focus on. Think of it like making a playlist for a party:

  • The "Everything" Playlist: Includes every song ever made (too long, boring).
  • The "Face-Only" Playlist: Only songs about emotions (misses the dance moves).
  • The "ASL-2nd" Playlist (The Winner): They found a specific mix of body parts that worked best. It focused on the hands (where the signs happen), the mouth (for facial expressions), and the shoulders/arms (for posture). They threw away the rest of the "noise."

The Result: By using this "curated playlist" of body landmarks, the computer understood the signs just as well as the slow, heavy robot, but it did it with much less data.
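The "curated playlist" idea boils down to slicing a small set of landmark indices out of the full array MediaPipe Holistic produces (543 landmarks per frame: 33 pose + 468 face + 21 per hand). A minimal sketch in Python, where the specific index ranges are illustrative placeholders and not the paper's actual ASL-2nd indices:

```python
import numpy as np

# MediaPipe Holistic layout: 33 pose, then 468 face, then 21 + 21 hand landmarks.
N_LANDMARKS = 543
POSE_OFFSET = 0
FACE_OFFSET = 33
LHAND_OFFSET = 33 + 468
RHAND_OFFSET = 33 + 468 + 21

# Hypothetical curated subset: shoulders/arms, a few mouth points, both hands.
SUBSET = np.concatenate([
    np.arange(11, 17) + POSE_OFFSET,           # shoulders, elbows, wrists (pose)
    np.array([0, 17, 61, 291]) + FACE_OFFSET,  # a handful of mouth landmarks (illustrative)
    np.arange(21) + LHAND_OFFSET,              # entire left hand
    np.arange(21) + RHAND_OFFSET,              # entire right hand
])

def select_landmarks(frames: np.ndarray) -> np.ndarray:
    """frames: (T, 543, 3) array of (x, y, z) per landmark.
    Returns only the curated subset, discarding the 'noise'."""
    return frames[:, SUBSET, :]

clip = np.random.rand(30, N_LANDMARKS, 3)   # a 30-frame clip
reduced = select_landmarks(clip)
print(reduced.shape)                        # (30, 52, 3): far less data per frame
```

The classifier then trains on the reduced array, which here carries roughly a tenth of the original landmarks per frame.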

3. The "Auto-Correct" Feature

Sometimes, the fast drone gets distracted by bad lighting or a hand moving too fast, and it "loses" a landmark (a point disappears from the data). It's like a text message where a few letters get dropped.

The researchers added a Spline Imputation step. Think of this as a smart "Auto-Correct" for the computer. If a point disappears for a split second, the computer looks at where the point was before and where it will be after, then draws a smooth, logical line to fill in the gap. This made the system much more reliable, boosting accuracy significantly.
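The gap-filling step above can be sketched as a spline fit over the frames where a landmark coordinate was observed, evaluated at the frames where it was lost. This is a minimal sketch assuming missing detections are marked as NaN, using SciPy's `CubicSpline` as one possible spline implementation:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def impute_track(track: np.ndarray) -> np.ndarray:
    """track: (T,) series of one landmark coordinate, NaN where detection failed.
    Fits a cubic spline through the observed frames and fills the gaps."""
    t = np.arange(len(track))
    observed = ~np.isnan(track)
    if observed.sum() < 4:           # too few points to fit a cubic spline
        return track
    spline = CubicSpline(t[observed], track[observed])
    filled = track.copy()
    filled[~observed] = spline(t[~observed])
    return filled

# A wrist x-coordinate that drops out for a few frames mid-motion:
x = np.sin(np.linspace(0, np.pi, 20))
x[[7, 8, 12]] = np.nan
x_filled = impute_track(x)
print(np.isnan(x_filled).any())      # False: the gaps are filled smoothly
```

In practice this runs per coordinate of each retained landmark, so a briefly lost hand point is reconstructed from its trajectory before and after the dropout instead of being fed to the model as a hole in the data.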

4. The Final Scorecard

By combining the curated body parts (ASL-2nd strategy) with the smart auto-correct, they achieved a massive win:

  • Speed: The system is now 5 times faster than the previous best method. It can process signs almost in real-time.
  • Accuracy: It is just as accurate (or even better) than the slow, heavy methods.
  • Efficiency: It uses a lightweight tool (MediaPipe) that can run on regular computers, not just supercomputers.

The Big Picture

Imagine you want to recognize a friend waving at you from across a busy street.

  • The Old Way: You stop, pull out a microscope, measure the exact angle of every hair on their head, and calculate the wind speed. It's accurate, but you miss the wave because you took too long.
  • The New Way: You quickly spot their hand, their face, and their shoulders. You ignore the background noise. You fill in the gaps if they blink. You recognize the wave instantly.

This paper proves that for teaching computers sign language, less is often more. You don't need to see everything; you just need to see the right things, and you need to be fast enough to keep up with the conversation.