SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Imagine you are trying to teach a robot to understand Sign Language.

The problem is that sign language isn't just "speaking with your hands." It's a complex dance involving hand shapes, movement speed, where your hands are in space, facial expressions, and even which hand you use.

Currently, teaching computers this is like trying to learn a new language by only looking at a dictionary and guessing. It's slow and expensive because humans have to label every single movement by hand, which can take more than an hour for just one minute of video, and the automated shortcuts are often wrong.

Enter SignAgent. Think of SignAgent not as a single robot, but as a super-smart project manager who hires a team of specialized experts to do the heavy lifting.

Here is how it works, using some everyday analogies:

1. The Team Structure

SignAgent is built like a small office with three key roles:

The Orchestrator (The Project Manager): This is the "brain" (a Large Language Model). It doesn't do the grunt work itself. Instead, it looks at a video, thinks, and says, "Okay, I need to know the hand shape first, then the movement, then check the dictionary." It coordinates the whole process.
SignGraph (The Librarian): This is a giant, digital library of sign language rules. It knows that a "thumbs up" hand shape means something different if your palm is facing left versus right. The Manager asks the Librarian for facts to back up its decisions.
The Toolset (The Specialists): These are the workers who actually look at the video.
- The Handshape Specialist looks at the fingers.
- The Movement Specialist watches how the hands travel.
- The Location Specialist checks where the hands are relative to the body.

2. The Two Big Jobs

The paper tests this team on two specific tasks, which are like two different types of puzzles.

Task A: The "Subtitle Puzzle" (Pseudo-gloss Annotation)

Imagine you have a video of someone signing and a sentence in English: "I want to buy a red car."
The computer needs to figure out which part of the video matches "I," which matches "want," and which matches "car."

The Old Way: A computer might just guess based on how the video looks, often mixing up the order or picking the wrong word.
The SignAgent Way:
1. The Manager looks at the English sentence and gets a list of possible sign words.
2. It asks the Specialists to break the video into chunks and describe the hand shapes and movements.
3. It asks the Librarian to check the rules: "Does this hand shape match the word 'car'?"
4. The Manager weighs all this evidence. If the video shows a hand moving like driving, but the hand shape looks like "apple," the Manager uses logic to realize, "Ah, the person is signing 'car' but with a slight variation."
5. Result: It produces a timed subtitle list that follows the order of the video more closely than earlier methods, even when the signing is messy or fast.

Task B: The "Grouping Game" (ID Glossing)

In sign language, the same word can be signed in slightly different ways. For example, the word "Basketball" might be signed with one hand or two hands. To a computer, these look like two totally different words. To a human linguist, they are the same word with different "flavors."

The Old Way: Computers group these by how similar they look. If the hands look different, the computer thinks they are different words. This creates hundreds of tiny, confusing groups.
The SignAgent Way:
1. The Specialists group videos that look similar.
2. The Manager then asks the Librarian: "Wait, the 'one-hand' group and the 'two-hand' group both use the same hand shape and movement rules. Are they actually the same word?"
3. The Manager says, "Yes, merge them!"
4. Result: Instead of roughly five separate groups for a single word like "Basketball," SignAgent merges the variants that belong together, leaving fewer, more logical categories. It reduces confusion and makes the data much cleaner.

3. Why This Matters

Before SignAgent, creating a database of sign language was like trying to build a library by hand-writing every book. It was too slow to scale.

SignAgent is like giving the librarian a robot assistant that can read the books, check the facts, and organize the shelves instantly.

It's faster: It can process huge amounts of video.
It's smarter: It understands the rules of the language, not just the pictures.
It's trustworthy: Every decision it makes is backed up by evidence it can show you (like a receipt for a purchase).

The Bottom Line

SignAgent is a new tool that helps humans teach computers sign language. It doesn't replace human experts; instead, it acts as a tireless, super-organized assistant that handles the boring, repetitive work of labeling and organizing, so human linguists can focus on the big picture. It makes a messy, slow process cleaner, faster, and more accurate.

1. Problem Statement

Sign Languages (SLs) are complex visual-gestural languages relying on coordinated manual (handshape, movement, location) and non-manual (facial cues) phonological components. Current computational research faces two major bottlenecks:

Linguistic Gap: Most systems operate at the "gloss" level (word-level labels) without understanding the underlying phonological structure, missing crucial linguistic nuances.
Annotation Bottleneck: Manual linguistic annotation is prohibitively expensive and slow (taking >1 hour to annotate 1 minute of video). This prevents the creation of large-scale, phonologically-aware datasets required for training deep learning models.

Existing automated methods often lack the ability to perform linguistic reasoning over multimodal signals, leading to inconsistent or superficial annotations.

2. Methodology: The SignAgent Framework

The authors propose SignAgent, a novel agentic framework that leverages Large Language Models (LLMs) to automate SL annotation and dataset curation. The system is designed to reason over multimodal evidence using a hierarchical toolset and knowledge grounding.

Core Architecture

The framework consists of three primary components:

SignAgent Orchestrator: A reasoning LLM (decoder-only) acting as the central controller. It manages a multi-step decision-making loop (ReAct-style), decomposing tasks, invoking tools, and refining its internal state based on feedback. It does not generate glosses from scratch but selects and orders candidates based on evidence.
SignGraph: A knowledge-grounded retrieval agent. It accesses two directed knowledge graphs:
- LexicalKnowledgeGraph: Contains dictionary entries and phonological components (handshape, movement, location) with their relations.
- LinguisticKnowledgeGraph: Contains linguistic concepts and features extracted from reference materials.
Toolset: A hierarchical suite of tools divided into:
- Base Tools (Red): Low-level modules for foundational analysis (Handshape, Movement, Location classifiers, Sign Segmentor, Glosser, SignLemma for lemmatization, and Handedness Detector).
- Enhanced Tools (Yellow): Modules that synthesize Base Tool outputs into structured, task-ready evidence. They fuse visual and phonological cues, calculate uncertainty, and provide interpretable statistics (e.g., overlap scores) for the Orchestrator.
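
The Orchestrator's tool-calling loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `AgentState`, `run_agent`, the `policy` callable (a scripted stand-in for the LLM), and the tool names and labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: takes a request payload, returns an observation.
Tool = Callable[[dict], dict]


@dataclass
class AgentState:
    task: str
    observations: List[Tuple[str, dict]] = field(default_factory=list)
    answer: Optional[dict] = None


def run_agent(
    state: AgentState,
    tools: Dict[str, Tool],
    policy: Callable[[AgentState], Tuple[str, dict]],
    max_steps: int = 8,
) -> AgentState:
    """ReAct-style loop: the policy picks an action, a tool returns evidence."""
    for _ in range(max_steps):
        action, payload = policy(state)
        if action == "finish":
            state.answer = payload
            return state
        if action not in tools:
            # Feed the error back so the policy can recover on the next step.
            state.observations.append((action, {"error": "unknown tool"}))
            continue
        state.observations.append((action, tools[action](payload)))
    return state  # step budget exhausted; answer stays None


def scripted_policy(state: AgentState) -> Tuple[str, dict]:
    """Stand-in for the LLM: call each tool once, then finish."""
    called = [name for name, _ in state.observations]
    for name in ("handshape", "movement"):
        if name not in called:
            return name, {"segment": 0}
    return "finish", {"evidence": dict(state.observations)}


# Placeholder tools returning made-up labels.
tools: Dict[str, Tool] = {
    "handshape": lambda payload: {"label": "placeholder-handshape"},
    "movement": lambda payload: {"label": "placeholder-movement"},
}

state = run_agent(AgentState(task="demo"), tools, scripted_policy)
print([name for name, _ in state.observations])
```

Keeping the policy as a plain callable makes the loop testable without an LLM; in a real system that callable would wrap the model call and parse its chosen action.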

Key Workflows

The framework is evaluated on two downstream tasks:

Task A: Pseudo-gloss Annotation

Goal: Given a signed video and a translated text sentence, assign and order the correct gloss labels to video segments.
Process:
1. The Orchestrator uses SignLemma to generate a candidate set of gloss tokens from the text.
2. It calls GlossEvidenceCollector to retrieve visual embeddings, temporal boundaries, and phonological predictions for video segments.
3. The Orchestrator reasons over five evidence types: Visual similarity, Phonological overlap, Hand activity, Temporal coherence, and Semantic context.
4. It performs a constrained assignment, reordering the initial token set to match the video sequence without hallucinating new tokens (ensuring token conservation).
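
To make the token-conservation constraint in step 4 concrete, here is a small sketch. It assumes one token per segment and replaces the Orchestrator's reasoning with an exhaustive search over orderings, which is only practical for short sentences. The function names, the score dictionary, and the example glosses are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def conserves_tokens(candidates: Sequence[str], proposed: Sequence[str]) -> bool:
    """True if the proposal is a reordering of the candidates: nothing added or dropped."""
    return Counter(candidates) == Counter(proposed)


def best_ordering(
    candidates: Sequence[str],
    scores: Dict[Tuple[int, str], float],
) -> List[str]:
    """Pick the ordering with the highest total (segment index, token) score.

    Missing scores count as 0.0. Exhaustive search is a stand-in for the
    agent's reasoning, not the method described in the paper.
    """
    best, best_total = list(candidates), float("-inf")
    for perm in permutations(candidates):
        total = sum(scores.get((i, tok), 0.0) for i, tok in enumerate(perm))
        if total > best_total:
            best, best_total = list(perm), total
    return best


candidates = ["BUY", "RED", "CAR"]
scores = {(0, "CAR"): 0.9, (1, "RED"): 0.8, (2, "BUY"): 0.7}
ordering = best_ordering(candidates, scores)
print(ordering, conserves_tokens(candidates, ordering))
```

Because the search only permutes the candidate list, any ordering it returns conserves tokens by construction; the explicit check is useful for validating output that comes from a free-form generator such as an LLM.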

Task B: ID Glossing (Lexical Variant Identification)

Goal: Identify and group different visual variants of the same lexical sign (e.g., "basketball" signed with one hand vs. two hands) into stable "ID glosses."
Process:
1. Visual Clustering: Initial clustering is performed using video embeddings (SignRep).
2. Refinement: The Orchestrator evaluates candidate clusters using Visual ID Glossing (distance matrices) and Clustered Phonological Analysis (Jaccard overlap of phonological features).
3. Decision Logic: The agent proposes MERGE or KEEP operations based on visual proximity, phonological agreement (handshape/movement/location overlap), and handedness compatibility.
4. Validation: The system ensures every sample is assigned exactly once, rejecting invalid outputs.
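
The phonological-overlap and validation steps above can be illustrated with a short sketch. The Jaccard computation is standard; the fixed thresholds in `propose` are hypothetical and merely stand in for the agent's evidence-based decision, and the feature strings are made up for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propose(
    visual_distance: float,
    feats_a: Set[str],
    feats_b: Set[str],
    handedness_compatible: bool,
    max_distance: float = 0.5,  # hypothetical threshold
    min_overlap: float = 0.6,   # hypothetical threshold
) -> str:
    """Return "MERGE" only if all three signals agree, otherwise "KEEP"."""
    close = visual_distance <= max_distance
    agree = jaccard(feats_a, feats_b) >= min_overlap
    return "MERGE" if close and agree and handedness_compatible else "KEEP"


def is_valid_partition(clusters: List[List[str]], samples: Set[str]) -> bool:
    """Every sample must appear in exactly one cluster."""
    assigned = [s for cluster in clusters for s in cluster]
    return len(assigned) == len(set(assigned)) and set(assigned) == samples


one = {"handshape:5", "movement:arc", "location:neutral"}
other = {"handshape:5", "movement:arc", "location:chest"}
print(jaccard(one, other))             # 2 shared features out of 4 distinct
print(propose(0.3, one, set(one), True))
print(is_valid_partition([["v1", "v2"], ["v3"]], {"v1", "v2", "v3"}))
```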

3. Key Contributions

First Agentic SL Framework: Introduces the first application of agentic reasoning for SL annotation, combining tool-augmented multimodal evidence with knowledge-grounded retrieval (SignGraph).
Incremental Performance Gains: Demonstrates through ablation studies that the agentic reasoning layer provides consistent improvements over fixed-pipeline approaches (e.g., simple lemmatization or GBDT rankers) by resolving conflicting evidence.
Public Data Release: The curated datasets resulting from this process are made publicly available to support linguistically grounded SL research.
Interpretability: Unlike black-box models, SignAgent provides auditable reasoning traces, justifying every annotation decision with specific evidence (visual distances, phonological overlaps).

4. Experimental Results

The framework was evaluated on British Sign Language (BSL) and American Sign Language (ASL) datasets.

Task A: Pseudo-gloss Annotation (BSLCorpus)

Metrics: Longest Common Subsequence (LCS) and Kendall's $\tau$ (rank correlation).
Performance: SignAgent achieved 60.85% LCS and $\tau$ = 0.374 on "Fair" sentences, outperforming the best baseline (Sign2GPT Lemmatization: 57.26% / 0.232).
Significance: On "Poor" (difficult) sentences, SignAgent raised the rank correlation from -0.333 to 0.083, turning an ordering that was anti-correlated with the reference into a weakly positive one. This suggests it copes better with complex reordering decisions where traditional methods fail.
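
For readers unfamiliar with the two metrics, the sketch below computes them on a toy example. The summary does not state how the paper normalizes LCS or handles repeated tokens, so two assumptions are made here: LCS is reported as a percentage of the reference length, and Kendall's $\tau$ is the tau-a variant computed over tokens that appear in both sequences, each assumed unique.

```python
from typing import List, Sequence


def lcs_length(pred: Sequence[str], ref: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            if p == r:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def lcs_percent(pred: Sequence[str], ref: Sequence[str]) -> float:
    """LCS as a percentage of the reference length (assumed normalization)."""
    return 100.0 * lcs_length(pred, ref) / len(ref) if ref else 0.0


def kendall_tau(pred: Sequence[str], ref: Sequence[str]) -> float:
    """Tau-a: (concordant - discordant) / total pairs, over shared tokens."""
    rank = {tok: i for i, tok in enumerate(ref)}
    order: List[int] = [rank[t] for t in pred if t in rank]
    n = len(order)
    if n < 2:
        return 0.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if order[i] < order[j]:
                concordant += 1
            elif order[i] > order[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


ref = ["A", "B", "C", "D"]
pred = ["A", "C", "B", "D"]   # one adjacent swap
print(lcs_percent(pred, ref))  # 75.0
print(kendall_tau(pred, ref))  # 5 concordant, 1 discordant pair out of 6
```

A fully reversed prediction gives $\tau = -1$, which is why a negative score on "Poor" sentences means the predicted order runs against the reference rather than merely being noisy.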

Task B: ID Glossing (ASLCitizen)

Metrics: Cluster entropy (H), Silhouette coefficient, and Calinski-Harabasz index.
Performance: SignAgent reduced the number of IDs per gloss from 4.81 (SignRep baseline) to 2.30, indicating less fragmented clusters.
Quality: Improved Silhouette scores from -0.04 to 0.06 and the Calinski-Harabasz index from 6.75 to 7.58, indicating modestly more coherent and better-separated clusters.
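
The fragmentation figures above can be read more easily with a toy computation. The summary does not define the paper's cluster entropy, so the sketch assumes Shannon entropy (in bits) over the cluster IDs assigned to one gloss, and computes "IDs per gloss" as the mean number of distinct cluster IDs per gloss label. The function names and example data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple


def ids_per_gloss(assignments: Iterable[Tuple[str, Hashable]]) -> float:
    """Mean number of distinct cluster IDs per gloss label."""
    ids: Dict[str, Set[Hashable]] = defaultdict(set)
    for gloss, cluster_id in assignments:
        ids[gloss].add(cluster_id)
    return sum(len(s) for s in ids.values()) / len(ids) if ids else 0.0


def cluster_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy in bits of the cluster-ID distribution (assumed definition)."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


assignments = [
    ("BASKETBALL", 0),
    ("BASKETBALL", 1),
    ("BASKETBALL", 1),
    ("APPLE", 2),
]
print(ids_per_gloss(assignments))      # (2 + 1) / 2 glosses
print(cluster_entropy([0, 0, 1, 1]))   # two equally sized clusters
```

Under this reading, merging the two "BASKETBALL" clusters would lower both numbers: one ID for that gloss and zero entropy within it.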
Qualitative: Successfully merged visually distinct but phonologically identical variants (e.g., one-handed vs. two-handed "basketball") that visual-only baselines failed to group.

5. Significance and Conclusion

SignAgent shifts SL research away from purely visual pattern matching and toward linguistically grounded reasoning.

Scalability: It offers a scalable solution to the annotation bottleneck, enabling the creation of large-scale, phonologically rich datasets.
Collaboration: It is designed as an assistive tool for linguists and curators, providing auditable, evidence-based suggestions rather than replacing expert judgment.
Future Work: The authors note limitations in handling non-manual features and prosody, and identify extending the framework to low-resource sign languages and enriching the toolset as critical next steps.

In summary, SignAgent demonstrates that agentic LLMs, when equipped with specialized linguistic tools and knowledge graphs, can outperform traditional fixed-pipeline methods in both the accuracy and interpretability of sign language data curation.