Imagine you have a brilliant, world-class chef who has spent 20 years mastering the art of cooking with vegetables. This chef knows exactly how to chop, sauté, and season carrots, broccoli, and spinach to make delicious meals. Their skills are so advanced that they can predict how any vegetable will taste just by looking at it.
Now, imagine you want this same chef to cook with spaghetti. But there's a problem: the chef has never seen spaghetti before. They only know how to handle vegetables. If you just hand them a bowl of noodles, they won't know what to do. They might try to chop the noodles like carrots, which would ruin the dish.
This is exactly the problem computer scientists faced with Skeleton Data (the digital lines and dots that represent human movement) and Vision Models (the "chefs" trained on images like photos and videos).
The Problem: Two Different Languages
- Vision Models (The Chefs): These are super-smart AI systems trained on billions of photos. They are experts at recognizing patterns in 2D images (like a cat in a photo).
- Skeleton Data (The Spaghetti): This is data that tracks how a person moves using just a few dots (joints) connected by lines. It's sparse, 3D, and looks nothing like a photograph.
Because the "language" of skeletons is so different from the "language" of photos, the smartest AI chefs couldn't learn from them. Scientists usually had to build new, clumsy kitchens (custom AI models) just for skeletons, which meant they couldn't use the powerful knowledge the chefs already had.
The Solution: "Skeleton-to-Image" (S2I)
The authors of this paper invented a clever translator called Skeleton-to-Image Encoding (S2I).
Think of S2I as a magic kitchen gadget that instantly turns a bowl of spaghetti into a plate of perfectly arranged vegetables.
Here is how the gadget works, step-by-step:
- Sorting the Ingredients: First, the gadget looks at the human skeleton and sorts the joints into five logical groups, just like sorting vegetables:
  - The torso (the main stem)
  - The left arm and the right arm (the upper branches)
  - The left leg and the right leg (the lower branches)
- Arranging the Plate: Instead of leaving the joints scattered in 3D space, the gadget arranges them neatly on a 2D grid, like a chef arranging ingredients on a cutting board. It stacks the movement over time so that the "movie" of the person moving becomes a single, static picture.
- Color Coding: It takes the X, Y, and Z coordinates of the joints and paints them into the Red, Green, and Blue channels of an image. Suddenly, the movement data looks like a colorful, abstract painting.
- Serving the Dish: The result is a standard 224x224 pixel image, the exact input size most pre-trained vision models expect.
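The steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the joint grouping shown assumes a 25-joint layout like NTU RGB+D, and the part ordering, normalization, and resize method are plausible choices, not details confirmed by the paper.

```python
import numpy as np

# Hypothetical five-part grouping for a 25-joint skeleton (NTU-style indices);
# the paper's exact ordering may differ.
PARTS = {
    "torso":     [0, 1, 2, 3, 20],
    "left_arm":  [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def skeleton_to_image(seq, size=224):
    """Encode a skeleton sequence (T frames, J joints, 3 coords) as an RGB image.

    Rows = joints reordered by body part, columns = time,
    channels = (x, y, z) mapped into pixel values.
    """
    order = [j for part in PARTS.values() for j in part]
    grid = seq[:, order, :].transpose(1, 0, 2)         # (J, T, 3): joints x time x xyz
    lo, hi = grid.min(), grid.max()
    img = (grid - lo) / (hi - lo + 1e-8) * 255         # coordinates -> 0..255 "paint"
    # Nearest-neighbor resize so the whole movement fits a 224x224 canvas.
    rows = np.linspace(0, img.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).astype(int)
    return img[np.ix_(rows, cols)].astype(np.uint8)    # (224, 224, 3)

# Example: 64 frames of a 25-joint skeleton become one static "painting".
seq = np.random.rand(64, 25, 3)
print(skeleton_to_image(seq).shape)  # (224, 224, 3)
```

The key design choice is that time runs along one image axis and body parts along the other, so nearby pixels in the output correspond to nearby moments and neighboring joints, the kind of local structure vision models are built to exploit.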
The Result: The Chef Can Cook Again!
Now, when you hand this "skeleton-image" to the world-class Vision Chef (the pre-trained AI), they don't see spaghetti anymore. They see a familiar, structured image.
- The Magic: Because the data now looks like an image, the AI can use its massive, pre-existing knowledge (learned from billions of photos) to understand human movement.
- No New Kitchens Needed: You don't need to build a new model from scratch. You just feed the skeleton data through this "magic gadget" and let the powerful existing AI do the work.
Why This is a Big Deal
- It's Universal: Imagine you have a dataset of people with 25 joints and another with 13 joints. Usually, you'd have to force them to match, losing information. With S2I, it doesn't matter. The gadget turns any skeleton format into the same type of image. It's like a universal adapter for electrical plugs.
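The "universal adapter" point can be made concrete with a minimal sketch (again illustrative, not the paper's implementation): because every sequence is resized to the same canvas, a 25-joint dataset and a 13-joint dataset end up in one shared input format.

```python
import numpy as np

def encode(seq, size=224):
    """Turn any (T, J, 3) skeleton sequence into a fixed-size 224x224x3 image.

    A minimal sketch of the adapter idea: normalize coordinates to pixel
    values, then resize the joints-x-time grid to one common shape.
    """
    grid = seq.transpose(1, 0, 2)                      # joints x time x xyz
    lo, hi = grid.min(), grid.max()
    img = (grid - lo) / (hi - lo + 1e-8) * 255
    r = np.linspace(0, img.shape[0] - 1, size).astype(int)
    c = np.linspace(0, img.shape[1] - 1, size).astype(int)
    return img[np.ix_(r, c)].astype(np.uint8)

# Different joint counts and sequence lengths, identical output format:
a = encode(np.random.rand(64, 25, 3))   # 25-joint dataset
b = encode(np.random.rand(48, 13, 3))   # 13-joint dataset
print(a.shape, b.shape)  # (224, 224, 3) (224, 224, 3)
```

No information is forced to match beforehand; the resize simply stretches whatever joints exist onto the same canvas, which is why no per-dataset model surgery is needed.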
- It Learns Faster: Because the AI is already a master chef, it learns to recognize actions (like "jumping" or "waving") much faster and better than if it had to learn from scratch.
- It Works Everywhere: The paper tested this approach on many different datasets, and it outperformed previous methods, even when switching between different types of skeleton data.
In a Nutshell
The paper says: "Don't build a new brain for skeleton data. Instead, just translate the skeleton data into a language that the smartest brains we already have can understand."
By turning movement into pictures, they unlocked the power of the world's most advanced AI for the world of human motion analysis.