Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints

This paper proposes a geometry-aware metric learning framework using rotation- and scale-invariant inter-joint angle descriptors derived from static hand keypoints to achieve robust cross-lingual few-shot sign language recognition, significantly outperforming conventional coordinate-based methods across diverse sign languages.

Chayanin Chamachot, Kanokphan Lertniponphan

Published Wed, 11 Ma

Imagine you are trying to teach a robot to understand sign language. The problem is that there are over 300 different sign languages in the world, but for most of them, we don't have enough video examples to train a smart AI. It's like trying to learn a new language by only reading a few pages of a dictionary.

This paper proposes a clever solution: Teach the robot the "shape" of the hand, not the "location" of the hand.

Here is the breakdown of their idea using simple analogies:

1. The Problem: The "Camera Angle" Confusion

Imagine you are taking a photo of a friend making a "peace sign" (V-shape) with their fingers.

  • Scenario A: You take the photo from far away. Their hand looks tiny.
  • Scenario B: You take the photo from the side. Their hand looks squashed.
  • Scenario C: You take the photo from above. Their hand looks different again.

If you show these three photos to a standard computer program, it gets confused. It thinks, "Wait, is this a different sign? The hand is in a different spot, it's a different size, and it's facing a different way!" This is called Domain Shift. The computer is too focused on where the hand is in the room, rather than what the hand is actually doing.

This is a huge problem when you only have a few examples (called "Few-Shot Learning"). If you only show the computer 5 examples of a sign, and they all look different because of the camera angle, the computer will never learn the true shape of the sign.
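To make the domain-shift problem concrete, here is a tiny hypothetical illustration (not from the paper): the same hand pose seen by two cameras at different distances produces very different raw coordinates, even though the pose itself is identical.

```python
import numpy as np

# Hypothetical 2-D keypoints for one "peace sign" pose: wrist, index tip,
# middle tip. Camera B is twice as far away, so every coordinate is halved.
pose_camera_a = np.array([[0.0, 0.0], [1.0, 4.0], [2.0, 4.0]])
pose_camera_b = pose_camera_a * 0.5  # same pose, farther camera

# To a coordinate-based model, these "identical" poses look far apart.
raw_distance = np.linalg.norm(pose_camera_a - pose_camera_b)
print(f"distance between raw coordinate vectors: {raw_distance:.2f}")
```

A model that compares raw coordinate vectors sees these two recordings of the same sign as very different inputs, which is exactly the confusion described above.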

2. The Solution: The "Stick Figure" Geometry

The authors realized that instead of giving the computer the raw coordinates (X, Y, Z) of the hand, they should give it the angles between the finger joints.

Think of a hand like a puppet made of sticks and hinges.

  • If you move the puppet closer to the camera, the sticks get bigger, but the angle at the hinge doesn't change.
  • If you rotate the puppet, the sticks point in different directions, but the angle at the hinge stays exactly the same.

The researchers created a special "language" for the computer that only speaks in angles. They measured the angles formed at the joints of the hand (like the bend in your knuckle).

  • Raw Coordinates: "My finger is at position (10, 20, 5)." (Changes if you move the camera).
  • Angle Descriptor: "My finger is bent at 45 degrees." (Stays the same no matter where you are).

They call this a Geometry-Aware approach. It's like describing a song by its melody (the relationship between notes) rather than the volume or the speed at which it's played.
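The core geometric idea can be sketched in a few lines. This is a minimal illustration with made-up keypoints, not the paper's exact descriptor: it computes the angle at one joint and checks that scaling the whole hand and rotating it leaves the angle unchanged.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by segments b->a and b->c."""
    ba, bc = a - b, c - b
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Three hypothetical 3-D keypoints along one finger: base, knuckle, tip.
base    = np.array([0.0, 0.0, 0.0])
knuckle = np.array([0.0, 2.0, 0.0])
tip     = np.array([1.0, 3.0, 0.0])

angle = joint_angle(base, knuckle, tip)

# Move the camera (scale by 3) and rotate the hand 90 degrees about the
# z-axis: the raw coordinates all change, but the joint angle does not.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
transformed = [3.0 * (Rz @ p) for p in (base, knuckle, tip)]
angle_after = joint_angle(*transformed)
```

Collecting such angles over all of the hand's joints gives a descriptor that, by construction, ignores where the hand is, how big it appears, and which way it faces.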

3. The Magic Trick: Cross-Lingual Transfer

Now, here is the really cool part. The researchers trained their AI on American Sign Language (ASL), which has thousands of examples. Then, they tried to use that same AI to recognize signs in Thai, Brazilian, and Arabic sign languages, which have very few examples.

Usually, this fails because the cameras and recording conditions are different. But because their AI was trained on angles (the pure shape of the hand), it didn't care about the camera or the distance.

  • The Analogy: Imagine you learn to recognize a "Triangle" by looking at it in a book. Later, someone shows you a triangle drawn in the sand, or a triangle made of sticks. Even though the materials and sizes are different, you recognize it instantly because you learned the geometry, not the material.
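One common way to do few-shot metric learning, which fits the setup described above, is nearest-prototype classification: average the few labelled examples of each sign into a "prototype" and assign a new sign to the closest one. The sketch below uses hypothetical angle descriptors; the paper's exact architecture may differ.

```python
import numpy as np

def classify_by_prototype(support, labels, query):
    """Nearest-prototype few-shot classification.

    support: (n, d) angle descriptors of the few labelled examples
    labels:  (n,)   class name for each support example
    query:   (d,)   descriptor of the sign to recognise
    """
    prototypes = {c: support[labels == c].mean(axis=0) for c in np.unique(labels)}
    return min(prototypes, key=lambda c: np.linalg.norm(query - prototypes[c]))

# Hypothetical 3-angle descriptors: two examples each of two signs.
support = np.array([[140.0,  35.0, 90.0],
                    [138.0,  33.0, 88.0],
                    [ 60.0, 170.0, 45.0],
                    [ 62.0, 168.0, 47.0]])
labels = np.array(["peace", "peace", "fist", "fist"])

query = np.array([139.0, 34.0, 89.0])
predicted = classify_by_prototype(support, labels, query)
```

Because the descriptors are camera-invariant angles, prototypes learned from one language's recordings can be compared directly against queries recorded under entirely different conditions.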

4. The Results: A Giant Leap

The results were impressive:

  • Within the same language: Using angles made the AI much smarter, especially when there were very few examples to learn from.
  • Across different languages: The AI trained on ASL could recognize Thai signs almost as well as if it had been trained specifically on Thai data. In some cases, it was even better than training from scratch!

Summary

The paper is essentially saying: "Stop teaching robots to memorize where hands are in a room. Teach them to understand how hands bend."

By focusing on the invariant geometry (the angles that never change), they built a lightweight, efficient system that can learn new sign languages with very little data, acting as a universal translator for the "shape" of human hands. This is a huge step toward making sign language technology accessible to the hundreds of languages that currently lack digital support.