CERNet: Class-Embedding Predictive-Coding RNN for Unified Robot Motion, Recognition, and Confidence Estimation

Imagine a robot that doesn't just move its arm like a mindless machine, but actually understands what it's doing, learns from its mistakes in real-time, and even knows when it's unsure.

That's exactly what this paper introduces: CERNet.

Think of CERNet as a "super-brain" for a robot arm. Instead of having three separate brains (one for moving, one for recognizing, and one for judging how sure it is), CERNet combines them all into one unified system.

Here is how it works, broken down with some everyday analogies:

1. The Core Idea: The "Class Embedding" (The Master Key)

Imagine you are learning to write the alphabet. You have a mental "key" for the letter "A," a different key for "B," and so on.

In CERNet: The robot has a special digital "key" (called a Class Embedding Vector) for every letter it knows.
How it works: When the robot wants to write a letter, it grabs the right key. This key acts like a guide, telling the robot's internal gears exactly how to move to draw that specific shape.
The Magic: If the robot is watching someone else write a letter, it tries to find the right key by itself. It keeps turning the key until the movement it "sees" matches the movement it "expects." Once the key fits perfectly, the robot knows, "Ah! That's an 'A'!"

2. The Two Modes: The "Painter" and the "Detective"

CERNet switches between two modes, much like a person who can both paint and critique art.

Mode A: The Painter (Generation)
The robot picks a key (e.g., the letter "G") and starts drawing. It doesn't just follow a pre-recorded video; it predicts where its hand should be next. If the hand drifts slightly, the robot corrects itself instantly to stay on the line.
Mode B: The Detective (Inference)
The robot watches a human (or another robot) draw a letter. It doesn't know what it is yet. It starts with a blank, noisy "key" and keeps reshaping it as it watches the strokes, so that its predictions match what it sees. Eventually, the known key that the reshaped one most resembles reveals the letter's identity.

3. The Secret Sauce: Predictive Coding (The "Expectation vs. Reality" Loop)

This is the most important part. CERNet is built on a concept called Predictive Coding.

The Analogy: Imagine you are walking in the dark. You expect the floor to be flat. You take a step. If your foot hits the floor exactly where you expected, you feel confident and keep walking. But if your foot hits a bump you didn't expect, your brain screams, "Wait, something is wrong!"
In the Robot: The robot constantly predicts where its hand will be.
- If it's right: It moves smoothly.
- If it's wrong (a bump, a push, or a weird angle): The "error signal" spikes. The robot immediately updates its internal map to fix the mistake.
- The Result: If someone pushes the robot's arm off course while it's writing, the robot doesn't crash or stop. It feels the "error," adjusts its internal plan, and smoothly steers back to the correct path. It's like a tightrope walker who wobbles but instantly corrects their balance.
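
To make the "expectation vs. reality" loop concrete, here is a minimal sketch of the idea. It is a toy with a single number standing in for the robot's internal state, not the paper's network; the names `run_loop`, `belief`, and `learning_rate` are made up for illustration.

```python
# Toy "expectation vs. reality" loop (illustrative only, not CERNet itself).
def run_loop(observations, learning_rate=0.5):
    belief = observations[0]   # internal estimate of what will be sensed
    errors = []
    for observed in observations:
        error = observed - belief          # prediction error ("surprise")
        errors.append(abs(error))
        belief += learning_rate * error    # nudge the belief toward reality
    return errors

# A flat floor for five steps, then an unexpected bump that stays.
observations = [0.0] * 5 + [1.0] * 10
errors = run_loop(observations)

# The error is zero on the flat part, jumps to 1.0 at the bump,
# then halves at every step as the belief catches up.
print([round(e, 3) for e in errors])
```

The same pattern (predict, compare, correct) is what lets the error spike when something unexpected happens and then fade as the internal state adapts.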

4. The "Gut Feeling" (Confidence Estimation)

Many robot systems have no built-in way to tell when they are confused; they output a guess either way. CERNet is different.

The Analogy: Think of a student taking a test. If they get an answer right, they feel calm. If they get it wrong, they feel a knot in their stomach.
In CERNet: The robot measures its own "knot in the stomach" using Prediction Error.
- Low Error: The robot's prediction closely matches reality. It says, "I am fairly sure this is a 'B'."
- High Error: The robot is struggling to match its prediction to what it sees. It says, "I'm not sure. Maybe it's a 'B', maybe it's an '8'."
Why this matters: This allows the robot to say, "I don't know," or ask for help, rather than confidently doing the wrong thing.
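
One simple way to turn that "gut feeling" into a decision is a cutoff on the error, sketched below. The function `decide` and the `threshold` value are hypothetical; the summary above does not say how (or whether) the authors pick a cutoff.

```python
def decide(predicted_label, reconstruction_error, threshold=0.1):
    """Turn a prediction error into an answer or a request for help.

    `threshold` is a made-up tuning value, not a number from the paper.
    """
    if reconstruction_error <= threshold:
        return f"I think this is '{predicted_label}'."
    return f"Not sure (maybe '{predicted_label}'); asking for help."

print(decide("B", 0.02))   # low error: commits to the guess
print(decide("B", 0.35))   # high error: defers instead of guessing
```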

5. The Experiment: Teaching a Robot to Write

The researchers taught a real robot (named Reachy) to write all 26 letters of the alphabet by physically guiding its arm (a process called kinesthetic teaching).

The Test: They made the robot write letters while pushing its arm off course (perturbations).
The Result:
- Single-layer models: Often wrote distorted, hard-to-read letters on the real robot.
- CERNet (Multi-layer): Wrote clear, legible letters. When pushed, it wobbled, then corrected itself and returned to the intended path once the push stopped.
- Recognition: When watching someone else write, it correctly guessed the letter 68% of the time on the first try (and 81% if you count the top two guesses).
- Confidence: The letters it guessed correctly had much lower "error signals" (calm stomach) than the ones it guessed wrong (knot in stomach).
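
For readers unfamiliar with "first try" versus "top two" scoring, the sketch below shows how such numbers are typically computed. The guesses and labels are invented purely for illustration and are not data from the experiment.

```python
def top_k_accuracy(ranked_guesses, true_labels, k):
    """Fraction of trials whose true label is among the first k guesses."""
    hits = sum(
        1 for guesses, label in zip(ranked_guesses, true_labels)
        if label in guesses[:k]
    )
    return hits / len(true_labels)

# Made-up example: each trial lists the best guess first, then the runner-up.
ranked = [["A", "H"], ["P", "B"], ["O", "Q"], ["C", "G"]]
truth = ["A", "B", "D", "C"]

print(top_k_accuracy(ranked, truth, k=1))  # 0.5  (A and C right first time)
print(top_k_accuracy(ranked, truth, k=2))  # 0.75 (B rescued by the runner-up)
```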

Summary

CERNet is a breakthrough because it unifies three things that usually require three different systems:

Doing (Moving the arm).
Understanding (Recognizing what is being done).
Self-Awareness (Knowing how confident it is).

It does this using a single, compact brain that learns from its own mistakes in real-time. This brings us one step closer to robots that can work safely alongside humans, understanding our intentions and knowing when they need to ask, "Did you mean for me to do that?"

Here is a detailed technical summary of the paper "CERNet: Class-Embedding Predictive-Coding RNN for Unified Robot Motion, Recognition, and Confidence Estimation."

1. Problem Statement

Robots operating in human-centric environments require three simultaneous capabilities:

Generation: Producing learned movements in real-time.
Recognition: Inferring human intent or task identity from observed behaviors.
Confidence Estimation: Assessing the reliability of their own inferences without external classifiers.

Existing approaches typically treat these functions separately (e.g., using distinct architectures for perception and control) or rely on complex multi-module systems. Furthermore, most predictive-coding (PC) models are validated only in simulation or lack the ability to estimate confidence intrinsically. There is a gap in unified, parameter-efficient models that can perform all three tasks on physical robotic hardware under real-world disturbances.

2. Methodology: CERNet Architecture

The authors propose CERNet (Class-Embedding Predictive-Coding Recurrent Neural Network), a unified hierarchical model based on the Predictive Coding framework.

Core Design Principles

Hierarchical Predictive Coding (PC-RNN): The model uses a multi-layer RNN structure where higher layers maintain abstract motion intentions over longer timescales. It operates by minimizing the discrepancy between top-down predictions and bottom-up sensory inputs.
Class Embedding Vector ( $C$ ): A unique feature of CERNet is a dynamically updated class embedding vector.
- In Generation Mode: $C$ is fixed to a specific class (one-hot), constraining the hidden state dynamics to a class-specific subspace to reproduce the trajectory.
- In Inference Mode: $C$ is initialized with noise and optimized online via gradient descent to minimize the accumulated prediction error over a sliding window of past observations. As the robot observes motion, $C$ drifts toward the latent subspace corresponding to the observed class, effectively performing recognition.
Intrinsic Confidence Estimation: The model does not use a separate classifier. Instead, it uses the magnitude of the internal reconstruction error (prediction error) as a proxy for confidence. Lower error implies higher confidence in the current inference.
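
The inference-mode update of $C$ can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the real model is a hierarchical RNN, whereas this toy predicts an observation as a $C$-weighted mix of fixed sinusoidal "class prototypes", and the names `FREQS`, `prototype`, `predict`, and `infer_embedding` as well as the window size, learning rate, and iteration count are hypothetical. What it does share with the description above is the mechanism: $C$ starts as noise and is updated by gradient descent on the squared prediction error over a sliding window of recent observations.

```python
import math
import random

FREQS = [0.6, 1.3, 2.1]  # hypothetical: one sinusoidal prototype per class

def prototype(k, t):
    return math.sin(FREQS[k] * t)

def predict(c, t):
    # Toy generative model: embedding-weighted mix of class prototypes.
    return sum(c[k] * prototype(k, t) for k in range(len(c)))

def infer_embedding(observed, window=12, lr=0.2, iters=10, seed=0):
    rng = random.Random(seed)
    c = [rng.uniform(-0.01, 0.01) for _ in FREQS]  # noise initialisation
    for now in range(1, len(observed) + 1):
        span = range(max(0, now - window), now)    # sliding window of the past
        for _ in range(iters):
            # Gradient of the mean squared prediction error w.r.t. c.
            grad = [0.0] * len(c)
            for t in span:
                err = predict(c, t) - observed[t]
                for k in range(len(c)):
                    grad[k] += 2.0 * err * prototype(k, t) / len(span)
            c = [ck - lr * gk for ck, gk in zip(c, grad)]
    return c

true_class = 1
observed = [prototype(true_class, t) for t in range(40)]
c = infer_embedding(observed)
recognised = max(range(len(c)), key=lambda k: c[k])
print(recognised, [round(ck, 3) for ck in c])
```

In this toy, the largest component of the optimised embedding identifies the class, mirroring the description of $C$ drifting toward the region associated with the observed motion. The remaining squared error over the window could then be read as the confidence signal.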

Operational Phases

Training: The network learns parameters ( $\theta$ ) by minimizing prediction error loss (equivalent to variational free energy) on labeled trajectory data.
Generation: A target class is specified; the network generates motion in a closed loop, updating internal states to correct for errors but keeping weights fixed.
Inference: The network observes a trajectory segment, updates internal states and the class embedding $C$ online to minimize past reconstruction error, and infers the class identity.

3. Experimental Setup

Platform: Reachy 2021 humanoid robot (7-DoF left arm).
Task: Learning and reproducing 26 English alphabet writing trajectories taught via kinesthetic guidance.
Models Tested: Six variants were trained (3 single-layer, 3 hierarchical) with matched parameter counts (Mini, Standard, Large) to isolate the effect of hierarchy.
Evaluation Metrics:
- Generation: Dynamic Time Warping (DTW) score against ground truth.
- Robustness: Recovery from external physical perturbations.
- Recognition: Top-1 and Top-2 classification accuracy.
- Confidence: Correlation between reconstruction error and classification correctness.
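
As a reference point for the generation metric, the following is a minimal sketch of the classic dynamic-programming form of DTW on 2-D points. It is an assumption that this matches the authors' variant; the summary does not state their step pattern, distance function, or normalisation, and the name `dtw_distance` is illustrative.

```python
import math

def dtw_distance(a, b):
    """Classic DTW cost between two sequences of points (no normalisation)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
slow_copy = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]

print(dtw_distance(reference, slow_copy))  # 0.0: same path, different timing
print(dtw_distance(reference, shifted))    # 4.0: each point is offset by 1
```

The first case shows why DTW suits handwriting trajectories: a stroke drawn more slowly is not penalised, while a genuine spatial offset is.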

4. Key Results

A. Generation Performance (Motion Reproduction)

Hierarchical Advantage: The hierarchical models significantly outperformed single-layer baselines. The best hierarchical model (MultiLarge) achieved a 76% lower trajectory reproduction error (DTW) compared to the parameter-matched single-layer baseline (SingleLarge).
Physical Robustness: While all models degraded when moving from simulation to the real robot, hierarchical models maintained legible character shapes. Single-layer models often produced illegible distortions even where their average DTW scores on the robot were similar, which suggests that DTW alone does not capture legibility and that the hierarchical structure preserves character shape better.

B. Perturbation Resistance

In experiments where external forces pushed the robot's arm off course between timesteps 40 and 45, the hierarchical CERNet demonstrated self-correcting capabilities.
The model detected the deviation via a spike in prediction error, updated its internal states, and autonomously recovered to the intended trajectory once the disturbance ceased.
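
The detect-and-recover behaviour can be illustrated with a toy closed-loop simulation. This is a sketch under strong assumptions: a one-dimensional position, a simple proportional correction standing in for the network's internal state updates, and hypothetical values for `gain`, `push`, and `threshold`. Only the perturbation window (timesteps 40 to 45) is taken from the description above.

```python
import math

def simulate(steps=80, gain=0.8, push=0.3, threshold=0.15):
    push_steps = range(40, 46)   # disturbance applied at timesteps 40..45
    position = 0.0
    errors, flagged = [], []
    for t in range(steps):
        target = math.sin(0.05 * t)       # intended trajectory (toy)
        if t in push_steps:
            position += push              # external force moves the arm
        error = abs(target - position)    # prediction error at this step
        errors.append(error)
        if error > threshold:
            flagged.append(t)             # deviation detected via error spike
        position += gain * (target - position)  # error-driven correction
    return errors, flagged

errors, flagged = simulate()
print("flagged timesteps:", flagged)
print("final tracking error:", round(errors[-1], 4))
```

In this toy the error stays small before the push, rises sharply while the push lasts, and decays again once the disturbance stops, which is the qualitative pattern reported for the hierarchical model.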

C. Recognition and Confidence Estimation

Real-Time Recognition: The model inferred the intended alphabet class from partial trajectory observations with 68% Top-1 accuracy and 81% Top-2 accuracy across 260 trials.
Intrinsic Confidence: A statistical analysis (Mann–Whitney U test) showed that reconstruction error differs significantly between outcome groups and rises as recognition gets worse:
- Top-1 Correct: Lowest reconstruction error.
- Top-2 Correct: Intermediate error.
- Incorrect: Highest error.
This indicates that the model's internal error signal can serve as a confidence metric without external calibration.
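
For intuition about what the Mann–Whitney U test measures, the sketch below computes the U statistic by direct pair counting on invented error values. The numbers are not from the paper, and the function name `mann_whitney_u` is illustrative. In practice one would likely use a library routine such as SciPy's `scipy.stats.mannwhitneyu`, which also returns a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """Count pairs (a, b) with a < b; ties count as half a pair."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Made-up reconstruction errors for illustration only.
correct_errors = [0.10, 0.12, 0.15, 0.11]
incorrect_errors = [0.30, 0.14, 0.28, 0.35]

u = mann_whitney_u(correct_errors, incorrect_errors)
n_pairs = len(correct_errors) * len(incorrect_errors)
print(u)            # 15.0 of 16 pairs have the correct trial lower
print(u / n_pairs)  # 0.9375
```

A ratio near 1 means correct trials almost always have lower error than incorrect ones, which is the kind of group separation that makes error usable as a confidence signal. The test compares groups by rank and does not assume normally distributed errors.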

5. Key Contributions

Unified Architecture: First demonstration of a single PC-RNN that simultaneously handles motor generation, real-time class recognition, and confidence estimation on a physical robot.
Class Embedding Mechanism: Introduction of an online-optimized class embedding vector that unifies the inference and generation processes within the same dynamical system.
Intrinsic Uncertainty: Empirical evidence that prediction error magnitude in a PC framework can serve as a direct, implicit confidence signal, reducing the need for separate uncertainty estimation modules.
Physical Validation: Successful deployment on a humanoid robot (Reachy) showing robustness to real-world noise and external perturbations, validating the theory beyond simulation.

6. Significance

This work bridges the gap between theoretical predictive coding and practical robotics. By integrating generation, recognition, and self-evaluation into a compact, single-loop architecture, CERNet offers a scalable solution for intent-sensitive human-robot collaboration. It enables robots to not only perform tasks but also understand human intent in real-time and assess their own certainty, which is critical for safe and adaptive interaction in unstructured environments. The findings suggest that hierarchical predictive coding is a viable foundation for future embodied AI systems requiring robust motor memory and adaptive perception.