When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning

This paper demonstrates that the saturating similarity clamping used in Contrastive Forward-Forward learning significantly increases training variance on datasets like CIFAR-10 due to gradient truncation at early layers, a dataset-dependent effect that can be eliminated by switching to a gradient-neutral margin subtraction formulation without compromising mean accuracy.

Joshua Steier

Published 2026-03-03

The Big Picture: A Noisy Classroom

Imagine you are teaching a class of students (the computer model) how to recognize different animals. You have a specific rule for how they should learn: "If two pictures are the same animal, they should be very close together. If they are different, they should be far apart."

In this paper, the researchers are studying a specific way of teaching this rule called Contrastive Forward-Forward (CFF). Instead of the teacher correcting the whole class at the end of the lesson (like standard AI training), the teacher checks each student's work immediately after they finish a single step.

The researchers discovered a hidden problem with how some teachers were applying the "closeness" rule. Sometimes, the rule made the students' learning wildly unpredictable, even though the average grade looked fine.


The Problem: The "Hard Ceiling" vs. The "Soft Nudge"

The researchers looked at two different ways to apply the rule that says "same animals must be close."

1. The "Hard Ceiling" Method (Clamping)

Imagine a teacher who says: "You must be within 1 meter of your partner. If you are already 0.8 meters away, I will force you to be exactly 1 meter away, no matter what."

  • The Analogy: This is like hitting a wall. Once you hit the limit (1 meter), the teacher stops caring how much you move. If you try to move closer, the teacher just says, "You're already at the limit," and ignores your effort.
  • The Result: In the computer world, this is called Clamping. When the computer tries to learn, it hits this "wall" and stops receiving useful feedback (gradients) for those specific pairs.
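The "hard ceiling" can be sketched in a few lines of plain Python. The margin value, sign convention, and function names here are illustrative choices for this post, not the paper's actual code:

```python
def clamped_loss(sim, margin=1.0):
    """'Hard ceiling': the similarity is capped at the margin.
    Past the cap, the loss stops changing with sim."""
    return -min(sim, margin)

def clamped_grad(sim, margin=1.0):
    # d(loss)/d(sim): -1 below the ceiling, 0 once the pair hits it.
    # That zero is the truncated feedback described above.
    return -1.0 if sim < margin else 0.0

# Below the ceiling the pair still receives feedback...
print(clamped_grad(0.8))  # -1.0
# ...but at or past the ceiling the gradient is exactly zero.
print(clamped_grad(1.2))  # 0.0
```

The key point is that second return value: once a pair saturates, it contributes nothing to learning until something else moves it back below the margin.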

2. The "Soft Nudge" Method (Subtraction)

Imagine a teacher who says: "You must be within 1 meter. If you are 0.8 meters away, I will just mentally subtract 0.2 meters from your score, but I will still let you move freely."

  • The Analogy: This is a Soft Nudge. The teacher adjusts the score, but the student can still feel the feedback and keep learning. The teacher never stops listening.
  • The Result: This is the Subtraction method. The researchers proved mathematically that this method gives the exact same "average" result as the Hard Ceiling, but it never stops the learning signal.
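The "soft nudge" can be sketched the same way, under the same illustrative margin and sign convention as before. Subtracting the margin only shifts the loss by a constant, so the gradient is the same everywhere and never switches off:

```python
def subtracted_loss(sim, margin=1.0):
    """'Soft nudge': subtract the margin instead of capping.
    The margin shifts the loss by a constant but never flattens it."""
    return -(sim - margin)

def subtracted_grad(sim, margin=1.0):
    # d(loss)/d(sim) is -1 everywhere: the feedback signal never
    # switches off, no matter how far past the margin the pair is.
    return -1.0

print(subtracted_grad(0.8))  # -1.0
print(subtracted_grad(1.2))  # -1.0, where a clamped loss would give 0.0
```

Because a constant offset to the loss changes nothing about which direction the model moves, this formulation is "gradient-neutral": same optimization target below the margin, but no dead zone above it.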

The Discovery: Why the "Hard Ceiling" Causes Chaos

The researchers ran the experiment 7 times with different random starting points (seeds) on a dataset called CIFAR-10 (a set of 60,000 small pictures of cats, dogs, cars, etc.).

  • The Average Grade: Both methods got roughly the same average score (about 78.5%).
  • The Consistency: This is where it got weird.
    • The Soft Nudge group was very consistent. Every student got a score between 78% and 79%.
    • The Hard Ceiling group was all over the place. Some students got 76%, others got 81%. The scores were 6 times more spread out (in variance) than the other group's.

Why?
Because of the "Hard Ceiling." In the early stages of learning, the computer hits that wall so often (about 60% of the time) that it stops learning for those specific pairs. Since the "wall" hits different pairs depending on the random starting point, some students get lucky and learn well, while others get stuck and struggle. It's like rolling dice where one side is glued down; sometimes you get a good roll, sometimes a bad one, purely by chance.
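The "spread" being compared here is just the sample variance of the per-seed accuracies. The numbers below are invented purely to show the computation (they mimic the ranges described above, not the paper's raw data):

```python
import statistics

# Hypothetical per-seed test accuracies -- NOT the paper's numbers,
# just shaped like the behavior described above (7 seeds each).
soft_nudge   = [78.1, 78.9, 78.3, 78.7, 78.2, 78.6, 78.4]
hard_ceiling = [76.3, 80.9, 77.6, 79.8, 76.8, 80.2, 78.1]

# Similar means, very different spreads.
mean_gap = abs(statistics.mean(hard_ceiling) - statistics.mean(soft_nudge))
var_soft = statistics.variance(soft_nudge)    # sample variance
var_hard = statistics.variance(hard_ceiling)

print(mean_gap)             # small: the "average grade" looks fine
print(var_hard > var_soft)  # True: the clamped run is much noisier
```

This is exactly the failure mode the paper warns about: comparing only mean accuracy across seeds hides the instability entirely.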


The Twist: It Depends on the Class

The researchers then tested this on other "classes" (datasets) to see if the problem was universal.

  1. The "Crowded Room" (CIFAR-100):

    • This dataset has 100 categories instead of 10.
    • Result: The "Hard Ceiling" method actually worked better (less variance).
    • Why? Because there are so many types, students rarely see the same animal twice in a single lesson. They rarely hit the "wall." The problem only happens when there are many same-class pairs in one batch.
  2. The "Easy Test" (SVHN & Fashion-MNIST):

    • These datasets are very easy for the computer to solve (it gets 96%+ accuracy).
    • Result: The "Hard Ceiling" method was fine.
    • Why? When the task is so easy, the students figure it out so quickly that the "wall" doesn't matter. They all converge to the same perfect answer regardless of the noise.
  3. The "Sweet Spot" (CIFAR-10):

    • This dataset is "medium difficulty." It's hard enough that the students need to struggle, but easy enough that they have many same-class pairs in every lesson.
    • Result: This is the only place where the "Hard Ceiling" caused chaos.
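The "crowded room" intuition above can be checked with a quick back-of-the-envelope count. Assuming uniformly sampled classes and a batch size of 128 (an illustrative choice, not necessarily the paper's setting), the expected number of same-class pairs per batch drops tenfold going from 10 to 100 classes:

```python
from math import comb

def expected_same_class_pairs(batch_size, num_classes):
    # Each of the C(batch_size, 2) pairs in a batch matches classes
    # with probability ~1/num_classes under uniform sampling.
    return comb(batch_size, 2) / num_classes

print(expected_same_class_pairs(128, 10))   # 812.8 -- many chances to hit the wall
print(expected_same_class_pairs(128, 100))  # 81.28 -- ten times fewer
```

Fewer same-class pairs means fewer opportunities for the clamp to fire, which is why the "hard ceiling" is mostly harmless on CIFAR-100.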

The Solution: A Simple Fix

The paper concludes with a very practical piece of advice for anyone building these AI models:

Check the "Wall Hit Rate."
Before you worry about your results, check how often your model hits that "Hard Ceiling" (saturation) in the very first layer of learning.

  • If it hits the wall often (like in CIFAR-10): Switch to the Soft Nudge (Subtraction) method. It costs nothing in terms of average performance but makes your results much more reliable and repeatable.
  • If it rarely hits the wall (like in CIFAR-100 or easy tasks): You don't need to change anything.
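The advice above amounts to a tiny diagnostic. The function name and threshold below are hypothetical, made up for this sketch; the idea is simply to count what fraction of same-class pair similarities in the first layer are pinned at the ceiling early in training:

```python
def saturation_rate(similarities, margin=1.0):
    """Fraction of same-class pairs whose similarity has hit the
    clamp ceiling. Hypothetical diagnostic, not the paper's code."""
    if not similarities:
        return 0.0
    hits = sum(1 for s in similarities if s >= margin)
    return hits / len(similarities)

# Illustrative first-layer similarities from early in training:
sims = [1.0, 1.0, 0.7, 1.0, 0.4, 1.0, 1.0, 0.9, 1.0, 0.6]
print(saturation_rate(sims))  # 0.6 -- comparable to the ~60% the paper
                              # reports on CIFAR-10; prefer subtraction here
```

If this rate is high early on, switch to the subtraction formulation; if it is near zero, the clamp is doing no harm.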

Summary in One Sentence

Using a "Hard Ceiling" to force AI models to learn similarities creates a lot of random noise and inconsistency in medium-difficulty tasks, but switching to a "Soft Nudge" method fixes this instability without hurting the final score.