Original authors: Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li, Guosheng Hu

Published 2026-06-03. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a team of workers to predict the future temperature in a room.

The Old Way (Backpropagation):
For decades, the standard method has been like a strict, top-down manager. The manager looks at the final prediction, sees it's wrong, and then walks all the way back through the entire team, telling every single worker exactly how they contributed to the mistake.

The Problem: This requires the manager to remember everything every worker did during the process (which takes up a lot of mental space/memory). Also, no one can fix their mistake until the manager finishes the whole walk-back. It's slow, memory-heavy, and biologically unrealistic (our brains don't work like this).

The Previous "New" Way (Forward-Forward):
A few years ago, a new method called "Forward-Forward" (FF) was proposed. Instead of a manager walking backward, it uses a "local" approach. Each worker judges their own output using only what is in front of them, with no feedback passed back from the end of the line.

How it worked: It was great for classification (questions like "cat or dog?"). The system would show a worker a "good" example (a cat photo correctly labeled "cat") and a "bad" example (the same photo mislabeled "dog"). The worker learned to respond strongly to the good pairing and weakly to the bad one.
The Problem: This works well for picking a cat or a dog, but it breaks down when predicting numbers (Regression), like temperature. You can't easily say "This temperature is good, that one is bad" because temperature is a continuous scale. Is 20°C "bad" if the target is 21°C? What about 100°C? The old method didn't know how to handle the distance between numbers, only whether something was "right" or "wrong."

The New Solution: FFR (Forward-Forward for Regression)
This paper introduces FFR, a new system that finally teaches this "local worker" method to handle continuous numbers like temperature, speed, or price. Here is how they did it, using three clever tricks:

1. The "Tug-of-War" Instead of "Good vs. Bad"

Instead of showing a worker a "good" example and a "bad" example, FFR splits the workers into teams.

The Analogy: Imagine the target temperature is 18°C. The workers are divided into groups: Group A is responsible for 10–15°C, Group B for 15–20°C, Group C for 20–25°C, and so on.
The Trick: The system doesn't just say "Group B is right." It says, "Group B is the winner, but Group A and Group C are close runners-up, while Group Z (100°C) is a total loser."
Why it helps: This teaches the workers not just which group is right, but how close each group is to the right answer. It understands that 19°C is "closer" to 18°C than 10°C is. This replaces the old "Good vs. Bad" game with a "Who is closest?" competition.

2. The "Stratified Ladder" (From Rough to Fine)

The paper builds a special ladder structure where the workers get more precise as they go up.

The Analogy:
- Bottom Rungs (Shallow Layers): These workers are like rough drafters. They just decide if the temperature is "Cold," "Warm," or "Hot." They make a big, coarse guess.
- Top Rungs (Deep Layers): These workers are like fine artists. They take the "Warm" guess from below and refine it to "20.5°C."
The Collaboration: The system doesn't just throw away the rough guesses. It keeps them all. At the very top, a "Head Coach" (a final layer) looks at the rough guesses from the bottom and the fine guesses from the top, mixes them together, and makes the final prediction. This ensures the system doesn't get stuck on a bad guess early on.

3. The "Free Lunch" (Uncertainty)

Usually, to know how confident a computer is in its answer, you have to run the model many times and see how much the answers vary. This is slow.

The FFR Trick: Because the system has workers at every level of the ladder (from rough to fine), it can just ask them all: "What do you think?"
The Result: If the "Rough" workers and the "Fine" workers all agree, the system is very confident. If they are arguing with each other, the system knows, "Hey, I'm not sure about this one."
The Benefit: The system gives you a prediction and a confidence score instantly, without any extra work. It's a "free lunch."

What Did They Prove?

The authors tested this on real-world problems like:

- Predicting energy use in smart homes.
- Predicting how worn machine tools are in factories.
- Predicting indoor location (GPS-free).
- Predicting health metrics from wearables.
- Judging image quality.

The Results:

Accuracy: FFR got about 98.6% of the accuracy of the old, heavy "Backpropagation" method.
Memory: It used only 27% of the memory at moderate depth (depth 8) and 8% at greater depth (depth 32). (Imagine carrying a backpack that stays nearly the same size no matter how many books you add, while the old method's backpack gets heavier with every book.)
Speed: Each training step took about 72% as long as with Backpropagation (roughly 28% less time) because it didn't have to wait for the "backward walk."

In Summary:
FFR takes a method that was previously only good for simple "Yes/No" decisions and upgrades it to handle complex number predictions. It does this by turning the learning process into a "closest guess" competition, building a ladder of workers from rough to fine, and getting a confidence score for free. It proves that you can build smart, efficient AI without needing the heavy, memory-hungry "backward walking" that has dominated the field for decades.

Technical Summary: FFR (Forward-Forward for Regression)

1. Problem Statement

The Forward-Forward (FF) algorithm, proposed by Hinton, offers a biologically plausible and memory-efficient alternative to Backpropagation (BP) by training neural networks through purely local, layer-wise optimization using two forward passes (one on positive data, one on negative data). However, FF is inherently designed for classification tasks, relying on contrastive pairs of "genuine" (positive) and "spurious" (negative) samples. Extending FF to real-world regression presents two fundamental challenges:

Absence of Natural Negatives: In continuous target spaces, there is no natural definition of a "negative" sample. Unlike classification, where a random incorrect label suffices, continuous values (e.g., $y+0.1$ vs. $y+100$) cannot be trivially categorized as equally incorrect, making the construction of contrastive pairs ambiguous.
Magnitude and Ordering Blindness: The standard FF "goodness" function ($g = \|h\|^2$) measures activation magnitude for binary discrimination but carries no information about the target's magnitude or ordinal ordering. This makes it unsuitable for supervising real-valued predictions where the relative distance between values matters.
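
To make the second challenge concrete, here is a minimal NumPy sketch of the classic FF goodness and a logistic positive/negative loss. The threshold `theta` and the softplus form of the loss follow a common formulation of FF and are assumptions here, not details stated in this summary; the activation values are made up for illustration.

```python
import numpy as np

def goodness(h):
    """Sum of squared activations per sample, i.e. g = ||h||^2."""
    return np.sum(h ** 2, axis=-1)

def ff_layer_loss(h_pos, h_neg, theta):
    """Push positive goodness above theta and negative goodness below it.

    Assumed logistic form; softplus(x) = log(1 + exp(x)) is computed
    stably with np.logaddexp.
    """
    loss_pos = np.logaddexp(0.0, theta - goodness(h_pos))
    loss_neg = np.logaddexp(0.0, goodness(h_neg) - theta)
    return float(np.mean(loss_pos) + np.mean(loss_neg))

# Illustrative activations: strong response to a "genuine" sample,
# weak response to a "spurious" one.
h_pos = np.array([[2.0, 1.0, 0.5]])
h_neg = np.array([[0.1, 0.2, 0.1]])

print(goodness(h_pos), goodness(h_neg))
print(ff_layer_loss(h_pos, h_neg, theta=1.0))
```

Nothing in `goodness` or `ff_layer_loss` refers to a target value $y$: the loss only asks whether activity is high or low, which is why it cannot express that one wrong answer is closer than another.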

Existing attempts to bridge this gap have been limited: some cast regression as binary classification over tolerance bands (retaining high overhead and limited accuracy), while others replace the goodness function with directional derivatives (sacrificing accuracy for hardware implementability). None have demonstrated competitive performance on diverse real-world regression datasets compared to BP.

2. Methodology: FFR Framework

The authors propose FFR (Forward-Forward for Regression), a framework that extends FF to regression through three core innovations:

2.1 Ordinal Competitive Goodness Function

Instead of direct Mean Squared Error (MSE) regression or contrastive pairs, FFR treats each hidden layer as an ordinal classifier.

Discretization: The continuous target range $[y_{min}, y_{max}]$ is partitioned into $K_\ell$ ordered bins at layer $\ell$.
Competitive Groups: The neurons in a layer are partitioned into disjoint groups $\{G_{\ell,1}, \dots, G_{\ell,K_\ell}\}$, where each group corresponds to a specific bin.
Ordinal Supervision: Rather than using hard one-hot labels, FFR employs a distance-aware soft label. A Gaussian bump is centered on the true target $y$ and projected onto the bin midpoints. This creates a target distribution $q_{\ell,k}$ where nearby bins receive higher probability mass than distant ones.
Goodness Calculation: The "goodness" of a group is the mean squared activation of its neurons. The group goodness values are normalized across groups into a probability distribution $p_{\ell,k}$. The layer loss is the cross-entropy between the soft label $q$ and the goodness distribution $p$. This preserves local competition while encoding the ordinal structure of the target.
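
The four steps above can be sketched for a single layer and a single sample as follows. This is an illustrative reading, not the authors' implementation: equal-width bins, contiguous equal-sized neuron groups, the Gaussian width `sigma`, the softmax normalization, and all numeric values are assumptions made for the example.

```python
import numpy as np

def soft_ordinal_label(y, y_min, y_max, num_bins, sigma):
    """Gaussian bump centred on y, evaluated at bin midpoints, normalised to sum to 1."""
    edges = np.linspace(y_min, y_max, num_bins + 1)   # assumed equal-width bins
    mids = 0.5 * (edges[:-1] + edges[1:])
    w = np.exp(-0.5 * ((mids - y) / sigma) ** 2)
    return w / w.sum(), mids

def group_goodness(h, num_bins):
    """Mean squared activation of each group (assumed contiguous, equal-sized groups)."""
    groups = h.reshape(num_bins, -1)
    return np.mean(groups ** 2, axis=1)

def layer_loss(h, q):
    """Cross-entropy between soft label q and softmax-normalised group goodness p."""
    g = group_goodness(h, len(q))
    z = g - g.max()                                   # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.sum(q * np.log(p + 1e-12))), p

# Target 18 on the range [10, 30] with 4 bins: midpoints 12.5, 17.5, 22.5, 27.5.
q, mids = soft_ordinal_label(18.0, 10.0, 30.0, num_bins=4, sigma=3.0)

# Eight hypothetical neurons, two per group.
h_good = np.array([0.1, 0.1, 2.0, 2.0, 0.5, 0.5, 0.1, 0.1])  # group 2 most active
h_bad = np.array([2.0, 2.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5])   # group 1 most active

loss_good, _ = layer_loss(h_good, q)
loss_bad, _ = layer_loss(h_bad, q)
print(np.round(q, 3), loss_good, loss_bad)
```

With these numbers the soft label peaks on the second bin and gives the third bin more mass than the fourth, so a layer whose second group is most active incurs a lower loss than one whose first group dominates.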

2.2 Stratified Ladder Architecture

To prevent "representation collapse" (where all layers learn identical coarse features) and enable fine-grained regression:

Stratified Granularity: The number of competitive groups $K_\ell$ doubles with each layer ($K_\ell = 2^{d_0 + \ell - 1}$). Shallow layers learn coarse ordinal discrimination (wide bins), while deeper layers refine these into fine-grained partitions.
Group-wise Normalization: To prevent activation leakage between groups, normalization is applied within each group rather than across the whole layer.
Ladder Aggregation: The goodness values (scalars) from all intermediate layers are concatenated and fed into a terminal linear regression head. This allows for inter-layer collaboration without backpropagating gradients through the intermediate layers, preserving the local-update property of FF.
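
A small sketch of the ladder's bookkeeping may help. It computes $K_\ell$ from the formula above, the resulting bin widths, and the concatenated goodness vector fed to a linear head. The choice $d_0 = 1$, the target range, equal-width bins, the random goodness values, and the head weights are placeholders chosen for illustration, not values from the paper.

```python
import numpy as np

def bins_per_layer(d0, num_layers):
    """K_l = 2^(d0 + l - 1) for l = 1..num_layers."""
    return [2 ** (d0 + l - 1) for l in range(1, num_layers + 1)]

def bin_widths(y_min, y_max, ks):
    """Width of each bin at every layer, assuming equal-width bins."""
    return [(y_max - y_min) / k for k in ks]

def ladder_features(goodness_per_layer):
    """Concatenate the per-group goodness scalars from all layers."""
    return np.concatenate(goodness_per_layer)

ks = bins_per_layer(d0=1, num_layers=4)        # [2, 4, 8, 16]
widths = bin_widths(10.0, 30.0, ks)            # [10.0, 5.0, 2.5, 1.25]

# Placeholder goodness values standing in for real layer outputs.
rng = np.random.default_rng(0)
goodness_per_layer = [rng.random(k) for k in ks]
features = ladder_features(goodness_per_layer)

# Hypothetical terminal linear head: y = w . features + b.
w = rng.normal(size=features.shape[0])
b = 0.0
y_head = float(features @ w + b)
print(ks, widths, features.shape)
```

Because the head only sees goodness scalars, its input size is the sum of the $K_\ell$ values (30 here) rather than the sum of the layer widths.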

2.3 Hierarchical Prediction with Uncertainty Estimation

FFR leverages the multi-scale nature of the ladder architecture to provide robust predictions and uncertainty estimates "for free":

Ensemble Prediction: Each intermediate layer $\ell$ produces a continuous prediction $\mu_\ell$ based on its softmax distribution over bin midpoints. The final prediction $\hat{y}$ is a weighted ensemble of all layer outputs and the terminal head.
Uncertainty as a Free Lunch: Predictive uncertainty is calculated as the weighted dispersion of the layer-wise predictions around the ensemble mean. This provides a confidence metric without requiring Monte Carlo dropout or Bayesian approximations.
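
One way to read these two steps in code is shown below. The summary does not give the ensemble weights or the exact dispersion measure, so uniform weights and a weighted standard deviation are assumptions here, as are the example predictions.

```python
import numpy as np

def layer_prediction(p, mids):
    """Expected value of the bin midpoints under a layer's distribution p."""
    return float(np.sum(p * mids))

def ensemble_with_uncertainty(preds, weights):
    """Weighted mean of predictions and their weighted spread around that mean."""
    preds = np.asarray(preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalise the weights
    y_hat = float(np.sum(w * preds))
    spread = float(np.sqrt(np.sum(w * (preds - y_hat) ** 2)))
    return y_hat, spread

# A single layer's prediction from a distribution over four bin midpoints.
mu = layer_prediction(np.array([0.1, 0.7, 0.2, 0.0]),
                      np.array([12.5, 17.5, 22.5, 27.5]))

# Hypothetical per-layer predictions: one case where layers agree, one where they do not.
y_agree, s_agree = ensemble_with_uncertainty([20.1, 19.9, 20.0], [1, 1, 1])
y_split, s_split = ensemble_with_uncertainty([15.0, 20.0, 25.0], [1, 1, 1])
print(mu, (y_agree, s_agree), (y_split, s_split))
```

Both cases give an ensemble mean near 20, but the second has a much larger spread, which is the signal used as the confidence estimate. Everything is computed from quantities already produced by one forward pass.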

3. Key Contributions

First Real-World FF Regression Framework: FFR is the first framework to successfully extend Forward-Forward learning to real-world regression tasks, demonstrating competitive performance across diverse domains including smart-home IoT, industrial sensing, indoor localization, wearable health, and image quality assessment.
Three Technical Innovations:
- An ordinal competitive goodness function that replaces contrastive pairs with intra-layer competition under distance-aware ordinal supervision.
- A stratified ladder architecture that scales ordinal granularity with depth and aggregates multi-scale features.
- A hierarchical prediction mechanism that yields robust estimates and uncertainty quantification in a single forward pass.
Efficiency and Performance: FFR achieves on average 98.6% of the accuracy of a Backpropagation-trained equivalent (BP-UR) across five real-world benchmarks. It also reduces peak training memory to 27% of BP's at depth 8 and 8% at depth 32, and cuts per-iteration training time to approximately 72% of BP's.

4. Experimental Results

The authors evaluated FFR on:

Synthetic Benchmarks: Sin-Cos, Exp-Trig-Poly, and multi-target variants (MT-A, MT-B).
Real-World Datasets: Appliances Energy, Machine Tool Wear, UJIIndoorLoc, BIDMC (wearable health), and KonIQ-10k (image quality).

Key Findings:

Accuracy: FFR outperformed all BP-free competitors (including FF-MSE, FF-CLF, FF-CAR, FF-Zero, PEPITA, and F3). On several real-world datasets (UJIIndoorLoc, BIDMC, Appliances), FFR even surpassed the standard BP baseline, suggesting the hierarchical ensemble adds complementary signal.
Memory Scaling: Unlike BP, where memory usage grows linearly with depth due to stored activations, FFR's memory usage remains nearly constant as depth increases because intermediate activations are discarded after the local update.
Uncertainty: Visualizations showed that the predictive uncertainty bands correctly widened for difficult or atypical samples, validating the utility of the "free-lunch" uncertainty estimation.
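
The memory-scaling finding can be illustrated with a toy count of stored activations. This is a deliberately simplified model with made-up batch and width values: it ignores parameters, optimizer state, and framework overhead, so it shows the trend only and does not reproduce the paper's 27% and 8% figures.

```python
def bp_activation_floats(depth, batch, width):
    """Backprop keeps the input and every layer's output until the backward pass."""
    return (depth + 1) * batch * width

def local_activation_floats(depth, batch, width):
    """A layer-local update needs only the current layer's input and output."""
    return 2 * batch * width

for depth in (8, 32):
    bp = bp_activation_floats(depth, batch=64, width=256)
    local = local_activation_floats(depth, batch=64, width=256)
    print(depth, bp, local, round(local / bp, 3))
```

In this toy model the backprop count grows linearly with depth while the local count does not depend on depth at all, matching the qualitative behavior described above.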

5. Significance and Claims

The paper claims that FFR shows carefully designed local learning can rival global optimization (BP) at a fraction of the training cost. By resolving the fundamental mismatch between FF's contrastive nature and regression's continuous target space, FFR enables the deployment of biologically plausible, memory-efficient learning on resource-constrained hardware (e.g., IoT sensors, edge controllers, robotics) where BP is infeasible due to memory and update-locking constraints.

The authors acknowledge limitations, noting that current implementations use standard floating-point precision and have not yet been validated on low-bit accelerators or analog/physical computing hardware, leaving those as future work.

FFR: Forward-Forward Learning for Regression