Imagine you are trying to teach a robot to recognize human movements, like "reading a book," "writing a letter," or "drinking water."
In the past, robots (AI models) were good at telling the difference between big, obvious actions, like "jumping" vs. "sitting." But they struggled with subtle actions that look very similar. For example, "reading" and "writing" both involve holding a hand near a face and moving fingers. To a robot, these might look identical, causing it to get confused.
This paper introduces a new smart teaching method called ACLNet (Affinity Contrastive Learning Network). Here is how it works, using simple analogies:
1. The Problem: The "Confused Student"
Imagine a student taking a test.
- Old Method: The teacher says, "If you get the answer wrong, just remember: 'Reading' is NOT 'Writing'." The student tries to push these two ideas apart in their mind, but they are still stuck together because they look so similar.
- The Flaw: The old method also ignores "tricky" examples. Sometimes a person reads in an unusual way, and their hand movements end up looking like typing. The old method gets tripped up by these weird cases and makes mistakes.
2. The Solution: The "Family Reunion" (Inter-Class Affinity)
The authors realized that instead of just saying "these are different," we should group similar actions into families first.
- The Analogy: Think of a family reunion. You have a "Reading Family" that includes "Reading," "Writing," "Typing," and "Checking a Phone." These actions share a "family trait" (using hands near the face).
- How ACLNet does it: It looks at the robot's mistakes. If the robot keeps confusing "Reading" with "Typing," it puts them in the same Motion Family.
- The Benefit: Instead of just pushing them apart blindly, the robot learns: "Okay, you are in the same family, so you look alike. But now, let's look really closely at the tiny differences between you so you don't get mixed up." It creates a "super-class" to help the robot understand the relationship between similar actions.
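For readers who like code, the family-building idea can be sketched in a few lines: look at the model's confusion matrix and group together any two classes that are often mistaken for each other. The threshold value and the union-find grouping below are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def build_motion_families(confusion, threshold=0.1):
    """Group frequently-confused classes into 'families' (super-classes).

    confusion[i, j] = fraction of class-i samples predicted as class j.
    Uses a simple union-find: merge any pair whose mutual confusion
    exceeds the threshold. Illustrative sketch, not the paper's method.
    """
    n = confusion.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Merge classes that the model mixes up in either direction.
    for i in range(n):
        for j in range(i + 1, n):
            if confusion[i, j] + confusion[j, i] > threshold:
                union(i, j)

    # Map each class to a compact family (super-class) id.
    roots, family = {}, []
    for c in range(n):
        family.append(roots.setdefault(find(c), len(roots)))
    return family

# Toy example: "reading" (0) and "typing" (1) are often confused,
# "jumping" (2) is not confused with either.
conf = np.array([[0.80, 0.15, 0.05],
                 [0.20, 0.75, 0.05],
                 [0.02, 0.03, 0.95]])
families = build_motion_families(conf, threshold=0.1)
```

Here "reading" and "typing" land in the same family, while "jumping" gets its own, so the model knows which pairs deserve the close-up treatment.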
3. The "Strict Coach" for Tricky Cases (Intra-Class Marginal Strategy)
Sometimes, even within the same action (like "Reading"), some people do it weirdly. Maybe someone reads while walking, or holds the book upside down. These are "Anomalous Positive Samples" (weird examples of the right answer).
- The Analogy: Imagine a coach training a runner. Most runners run normally. But one runner runs with a limp or a funny stride.
- Old Method: The coach treats the funny runner the same as the normal ones, which confuses the team.
- ACLNet's Method: The coach says, "We know you are a runner (the right class), but your style is very different from the others. We need to make sure you are still clearly a runner, but we also need to make sure you don't accidentally look like a walker."
- The Result: It creates a "safety zone" (a margin) around the normal examples. It forces the robot to be extra careful with the weird examples so they don't get mixed up with other activities.
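A minimal sketch of the "safety zone" idea, assuming a cosine-similarity setup: a positive example that sits far from its class centroid is flagged as anomalous, and a margin is subtracted from its similarity score, so the model must pull it in extra hard before it counts as "close enough." The anomaly threshold, the margin value, and the detection rule are all made up for illustration; they are not the paper's exact formulation.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def positive_similarity(anchor, positive, class_centroid,
                        margin=0.2, anomaly_threshold=0.7):
    """Similarity score for a positive pair, with an extra margin when
    the positive is 'anomalous' (far from its own class centroid).

    Subtracting the margin means the loss demands a higher raw
    similarity before the pair is considered close -- the safety zone.
    Illustrative assumption, not the paper's exact loss.
    """
    sim = cos(anchor, positive)
    if cos(positive, class_centroid) < anomaly_threshold:
        sim -= margin  # anomalous positive: demand extra similarity
    return sim

centroid  = np.array([1.0, 0.0])   # embedding of a "typical reader"
normal    = np.array([1.0, 0.1])   # reads like everyone else
anomalous = np.array([0.3, 1.0])   # reads in an unusual way
anchor    = np.array([1.0, 0.0])

s_normal = positive_similarity(anchor, normal, centroid)     # no margin
s_anom   = positive_similarity(anchor, anomalous, centroid)  # margin applied
```

The normal example keeps its raw similarity; the anomalous one is penalized, which pushes training to tighten it toward its own class rather than letting it drift toward a neighboring one.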
4. The "Dynamic Temperature" (Adapting the Rules)
The paper also mentions a "dynamic temperature schedule."
- The Analogy: Think of a thermostat.
- If a group of actions is small and rare (like a tiny family), the robot needs to be very strict and pay close attention to every tiny detail (Low Temperature).
- If a group is huge and common (like a massive family), the robot can be a bit more relaxed and focus on the big picture (High Temperature).
- Why it helps: It automatically adjusts how hard the robot tries to separate similar things based on how many examples it has.
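The thermostat idea can be sketched as a simple mapping from class size to temperature: rare classes get a low (strict) temperature, common classes a high (relaxed) one. The linear interpolation and the temperature range below are illustrative stand-ins, not the paper's actual schedule.

```python
import numpy as np

def dynamic_temperature(class_counts, t_min=0.05, t_max=0.5):
    """Assign each class a contrastive temperature based on its size.

    Few samples  -> near t_min (strict, sharp distinctions).
    Many samples -> near t_max (relaxed, big-picture grouping).
    Linear mapping chosen for illustration only.
    """
    counts = np.asarray(class_counts, dtype=float)
    # Normalize counts to [0, 1], then map onto [t_min, t_max].
    scale = (counts - counts.min()) / max(counts.max() - counts.min(), 1e-8)
    return t_min + scale * (t_max - t_min)

# A rare class (10 samples), a mid-sized one, and a common one.
temps = dynamic_temperature([10, 100, 1000])
```

The rare class lands at the strict end of the range and the common class at the relaxed end, matching the thermostat analogy above.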
The Result: A Super-Student
The authors tested this new method on six different "exams" (datasets) involving:
- Action Recognition: Recognizing what someone is doing (e.g., jumping, waving).
- Gait Recognition: Identifying people by how they walk.
- Person Re-Identification: Finding a specific person in a crowd based on their skeleton.
The Outcome: ACLNet beat all previous methods. It became much better at telling the difference between "Reading" and "Writing," or "Drinking" and "Eating," even when the data was messy or incomplete (like if a person's arm was hidden).
In a Nutshell
ACLNet is like a brilliant teacher who doesn't just tell a student "Right vs. Wrong." Instead, it:
- Groups similar-looking actions into families to understand their relationships.
- Creates a strict safety zone for tricky, weird examples so they don't cause confusion.
- Adjusts its teaching style on the fly depending on how hard the lesson is.
This makes the AI much smarter at understanding the subtle nuances of human movement, which is crucial for security, healthcare, and helping robots interact with us naturally.