Imagine you are watching a dance performance. You see the dancers moving for the last few seconds, and your brain naturally starts to guess what they will do next. Will they spin? Will they jump? Will they stop?
Human Pose Forecasting is basically teaching a computer to do exactly that: watch a person's past movements and predict their future ones. This is super useful for self-driving cars (to guess where a pedestrian will step), robots working alongside humans, or even creating realistic animations for movies.
However, the paper "Scriboora" argues that the current way scientists are testing these computer brains is broken, messy, and doesn't reflect real life. Here is the story of how they fixed it, using some creative analogies.
1. The "Messy Kitchen" Problem
The authors started by looking at all the different recipes (algorithms) other scientists had published to predict human movement. They found a huge problem: nobody was cooking in the same kitchen.
- Some chefs used different ingredients (data processing).
- Some used different measuring cups (evaluation metrics).
- Some even forgot to write down their recipes (code wasn't open-source).
Because of this, comparing who was the "best chef" was impossible. One paper might claim their robot is 10% better, but that's only because they used a different ruler. The authors cleaned up the kitchen, set up one standard recipe, and re-tested everyone. They found that many "winners" from previous studies actually lost when judged fairly.
2. The "Speech-to-Text" Surprise
Here is the paper's coolest idea. The authors asked: "What if we stopped trying to build a special robot just for dancing, and instead used a robot designed for talking?"
Think of it like this:
- Predicting movement is like listening to the start of a sentence and guessing the next word.
- Transcribing speech is the same kind of task: listening to a stream of sound and working out the words inside it.
Both are just sequences of numbers changing over time. The authors took a state-of-the-art Speech Model (a brain trained to turn audio into text) and taught it to look at human joints instead of sound waves.
The Result: It worked amazingly well! Just as a speech model understands the rhythm of a sentence, this model understood the rhythm of a walk or a wave. They created a new model called MotionConformer, which became the new champion, beating all the specialized dance robots while running super fast (real-time).
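If you like code, here is a tiny sketch of why this borrowing is even possible. A speech model reads a spectrogram: a (time, features) grid of numbers. A motion clip can be flattened into exactly the same shape, one feature vector per frame. The shapes below (50 frames, 17 joints) are illustrative choices of ours, not the paper's exact configuration.

```python
import numpy as np

# Illustrative sizes: a clip of 50 observed frames,
# 17 body joints, 3D coordinates per joint.
frames, joints, dims = 50, 17, 3
pose_sequence = np.random.rand(frames, joints, dims)

# Flatten each frame into one feature vector, so the clip becomes a
# (time, features) matrix -- the same kind of input a speech model
# sees when it reads a spectrogram (time, frequency bins).
features = pose_sequence.reshape(frames, joints * dims)
print(features.shape)  # (50, 51)
```

Once the data looks like this, a sequence model does not care whether the numbers came from a microphone or a motion-capture suit.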
3. The "Blurry Glasses" Reality Check
Most previous studies tested these models using perfect, crystal-clear data (like a high-definition video where every joint is marked perfectly). But in the real world, cameras aren't perfect.
Imagine trying to predict a dancer's next move while wearing foggy, blurry glasses.
- Old studies: Tested the models with perfect vision.
- This study: Put "foggy glasses" on the models. They fed in data from a pose estimator, software that guesses joint positions from ordinary camera footage, and those guesses are often slightly wrong.
The Shock: When the models wore these "foggy glasses," their performance crashed. They got confused and made bad predictions.
The Fix: The authors found a way to train the models to wear these glasses during practice. They taught the models to expect blurry data. Once trained this way, the models could handle the real-world messiness much better. It's like training a driver on a rainy day so they don't panic when it actually rains.
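A rough sketch of the "practice with foggy glasses" idea: jitter the training inputs so the model never sees perfectly clean poses, while still scoring it against the clean future. The paper evaluates on real pose-estimator outputs; simple Gaussian jitter and the noise scale below are stand-ins of ours, not the paper's recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_detection_noise(clean_poses, std=0.01):
    """Simulate 'foggy glasses': jitter joint positions the way an
    imperfect pose detector would. Gaussian noise and std=0.01 are
    illustrative assumptions, not values from the paper."""
    noise = rng.normal(0.0, std, size=clean_poses.shape)
    return clean_poses + noise

# During training: the model reads the noisy past, but its prediction
# is compared against the clean future, so it learns to see through fog.
clean_past = np.zeros((50, 17, 3))
noisy_past = add_detection_noise(clean_past)
```

The design choice is the same as the rainy-day driving lesson: the difficulty is added during practice, not at test time.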
4. New Rules for the Race
The paper also introduced two new ways to judge the models, because speed matters in the real world:
- FADE (Forecast After Delay Error): Imagine you are driving a car. If your computer takes 1 second to calculate where the pedestrian will be, that pedestrian has already moved 1 second further. This metric checks: "Did the model account for the time it took to think?"
- FCE (Fast Change Error): What if the pedestrian suddenly stops and turns around? A slow model will keep predicting they are walking straight, causing a crash. This metric checks how quickly the model can react to sudden changes in direction.
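The FADE idea can be sketched in a few lines: if the model takes some fraction of a second to think, that many frames of its forecast are already in the past by the time the answer arrives, so only the part still ahead should be scored. The function name, signature, and details below are our illustration of the idea, not the paper's exact definition.

```python
import numpy as np

def delay_aware_error(predicted, ground_truth, fps, latency_s):
    """Sketch of a delay-aware forecast error (FADE-style).
    By the time the model finishes computing, latency_s seconds of
    the forecast window have already elapsed, so skip those frames
    and score only what is still in the future."""
    skip = int(round(latency_s * fps))
    pred_future = predicted[skip:]
    true_future = ground_truth[skip:]
    # mean joint-position error over the remaining horizon
    return float(np.mean(np.linalg.norm(pred_future - true_future, axis=-1)))

# Example: a perfect 1-second forecast at 25 fps, with 0.2 s of latency.
pred = np.zeros((25, 17, 3))
true = np.zeros((25, 17, 3))
print(delay_aware_error(pred, true, fps=25, latency_s=0.2))  # 0.0
```

A slow model gets punished automatically here: the larger `latency_s` is, the more of its comfortable near-term predictions are thrown away before scoring.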
The Big Takeaway
The paper concludes that to build truly useful robots and self-driving cars, we need to stop pretending the world is perfect.
- Stop reinventing the wheel: Sometimes, borrowing a brain from a different field (like speech recognition) works better than building a new one from scratch.
- Test in the mud: Don't just test your robot on a clean, perfect track. Test it in the mud, with foggy glasses and bad data.
- Keep it simple: The best solution was a "Speech Model" that was adapted, not a complex, custom-built dance machine.
By fixing the testing rules and using these "speech brains," the authors have given us a much more reliable and realistic way to teach computers how to predict human movement.