The Big Problem: The "One-Bit Wall"
Imagine you are trying to pack a massive library of books (a giant AI model) into a tiny suitcase (a compressed file). To save space, you decide to shrink the books.
In the world of AI, numbers (weights) have two main parts:
- The Magnitude: How big the number is (e.g., 5.43 or 0.002).
- The Sign: Whether the number is positive (+) or negative (-).
For years, researchers have been great at shrinking the magnitudes. Averaged across millions of weights, they can squeeze each magnitude down to a tiny fraction of a bit. But they hit a wall with the sign.
The paper argues that the sign is like a stubborn, random coin flip. Even after the AI learns, the pattern of pluses and minuses looks completely random, like static on an old TV. Because it's random, you can't compress it. You have to store every single sign, which costs exactly one bit per number.
This creates a "One-Bit Wall." No matter how much you shrink the magnitudes, the signs take up so much space that you can't get the total storage below one bit per number.
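The one-bit cost follows from basic information theory: a sequence of independent, fifty-fifty signs has an entropy of one bit per symbol, so no lossless code can do better than storing every sign verbatim. A minimal sketch of this (illustrative only, not the paper's code) using simulated random weights:

```python
import numpy as np

# Simulate a weight array with random magnitudes and random signs,
# mimicking how a trained network's weights empirically look.
rng = np.random.default_rng(0)
weights = rng.standard_normal(1_000_000)

signs = (weights > 0).astype(np.uint8)  # one raw bit per weight

# Empirical Shannon entropy (base 2) of the sign bits.
p = signs.mean()
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print(f"sign entropy: {entropy:.4f} bits per weight")
```

The printed entropy sits essentially at 1.0: a fair-coin sign pattern is incompressible, which is exactly the "One-Bit Wall."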
The Discovery: The Signs Are "Locked In"
The authors discovered something surprising: The AI isn't actually changing the signs very much.
Think of the AI training process like a hiker walking through a foggy mountain range.
- The Magnitudes are the hiker's pace. It changes constantly as the hiker navigates the terrain.
- The Signs are the hiker's starting direction (North or South).
The paper found that once the hiker starts walking, they rarely turn around completely. If they started facing North, they mostly stay facing North. They might stumble a bit, but they don't flip 180 degrees.
The "Sign Lock-In" Theory:
The randomness we see in the final AI model isn't because the AI learned a complex, random pattern. It's because the AI inherited the random pattern from the very first moment it was created (initialization). The training process just "locks" those initial random signs in place.
The authors call this "Sign Lock-In." It's like a door that, once opened a certain way, gets jammed shut. It's very hard to force it open the other way.
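Sign lock-in is easy to check on any model where you have kept both checkpoints: compare each weight's sign at initialization with its sign after training and count the flips. A hypothetical sketch with toy stand-in arrays (the 3% flip rate below is purely illustrative, not a number from the paper):

```python
import numpy as np

def sign_flip_rate(w_init, w_final):
    """Fraction of weights whose sign changed between init and final.
    Under "sign lock-in" this fraction stays small."""
    return float(np.mean(np.sign(w_init) != np.sign(w_final)))

# Toy stand-in for real checkpoints: final weights keep their initial
# signs, except for a small fraction that flips (hypothetical numbers).
rng = np.random.default_rng(1)
w_init = rng.standard_normal(100_000)
flips = rng.random(100_000) < 0.03  # assumed 3% flip rate, for illustration
w_final = np.abs(rng.standard_normal(100_000)) * np.sign(w_init)
w_final[flips] *= -1

print(f"flip rate: {sign_flip_rate(w_init, w_final):.3f}")
```

On a real checkpoint pair, a flip rate far below 50% is the lock-in signature: the final signs are mostly inherited, not learned.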
Why Does This Happen? (The "Zero" Trap)
To flip a sign from Positive (+) to Negative (-), a number has to pass through Zero.
Imagine the number line as a tightrope.
- Positive is on the right side.
- Negative is on the left side.
- Zero is the tiny, slippery pole in the middle.
For a sign to flip, the number has to walk all the way to the pole, touch it, and cross over to the other side. The paper shows that during training, numbers usually stay far away from that slippery pole. They get "locked" on their side of the tightrope. Occasionally, a number might stumble near the pole, but it rarely crosses over and stays there.
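A toy random-walk picture makes the "zero trap" concrete: treat one weight as a value nudged by small noisy updates, and count how often it changes sign. This is a cartoon of training dynamics, not the paper's model:

```python
import numpy as np

# Toy picture of one weight during training: each update nudges it by a
# small random amount. A sign flip requires the trajectory to cross zero.
rng = np.random.default_rng(2)

def count_sign_flips(w0, steps=1000, step_size=0.01):
    """Count how often a random-walk 'weight' changes sign."""
    w, flips = w0, 0
    for _ in range(steps):
        w_new = w + step_size * rng.standard_normal()
        if np.sign(w_new) != np.sign(w):
            flips += 1
        w = w_new
    return flips

flips_near = count_sign_flips(0.0)  # starts right at the slippery pole
flips_far = count_sign_flips(2.0)   # starts with a comfortable gap
print(flips_near, flips_far)
```

A weight that starts at zero flips constantly; one that starts well away from zero essentially never does, because the small steps cannot carry it all the way across.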
The Solution: Building a "No-Zero" Zone
Since the signs are stuck, the authors asked: Can we use this to our advantage?
If the signs are stuck, why not force them to be a pattern we can compress?
They proposed two simple tricks to make the signs even more stubborn:
The "Gap" Start (Gap Initialization):
Instead of starting the hiker right near the slippery pole (Zero), start them far away on the safe side of the mountain. Give them a big "gap" so they can't accidentally stumble into the zero zone. This prevents them from ever flipping in the first place.
The "Repellent" Force (Outer-Drift Regularizer):
Imagine putting a gentle wind blowing away from the pole. If a number starts drifting toward zero, this wind pushes it back to safety. This ensures that even if a number gets close to the edge, it gets pushed back before it can flip.
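In code terms, both tricks are small. The sketch below is a hedged illustration: the names `gap_init` and `outer_drift_penalty`, and the exact penalty shape, are assumptions for the sake of example, not the paper's formulation. Gap initialization samples magnitudes at least `gap` away from zero; the regularizer adds a loss term that grows as a weight's magnitude shrinks below a margin, pushing it back out.

```python
import numpy as np

def gap_init(shape, gap=0.05, scale=0.1, rng=None):
    """Initialize weights with |w| >= gap: random signs, with magnitudes
    shifted away from zero so no weight starts near the 'slippery pole'."""
    rng = rng or np.random.default_rng()
    signs = rng.choice([-1.0, 1.0], size=shape)
    magnitudes = gap + scale * np.abs(rng.standard_normal(shape))
    return signs * magnitudes

def outer_drift_penalty(w, margin=0.05, strength=1e-3):
    """Loss term that penalizes weights inside the margin around zero,
    acting like a 'wind' blowing them back toward safety."""
    return strength * np.sum(np.maximum(0.0, margin - np.abs(w)) ** 2)

w = gap_init((4, 4), rng=np.random.default_rng(3))
print(outer_drift_penalty(w))  # 0.0: no weight starts inside the margin
```

Added to the training loss, a penalty of this shape only activates for weights drifting toward zero, so it leaves the rest of training untouched.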
The Result: Breaking the Wall
By using these tricks, the researchers were able to:
- Freeze the signs into a specific, predictable pattern (like a low-rank template).
- Compress the signs almost to zero cost (because the computer can just "regenerate" the pattern from a tiny seed, rather than storing every single bit).
- Focus all the space on compressing the magnitudes.
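The storage win from a regenerable pattern can be sketched concretely: if the signs are pinned to whatever a known seed produces, you store only the seed plus the compressed magnitudes. This toy seed-based version is a stand-in for the paper's low-rank template, used here only to illustrate the accounting:

```python
import numpy as np

SEED = 42          # a few bytes stored instead of one bit per weight
N = 1_000_000

def sign_template(seed, n):
    """Deterministically regenerate the sign pattern from a tiny seed."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=n)

# At save time: keep only the magnitudes (ready for aggressive
# quantization) plus the seed. At load time: regenerate and recombine.
magnitudes = np.abs(np.random.default_rng(0).standard_normal(N))
weights = sign_template(SEED, N) * magnitudes

restored = sign_template(SEED, N) * magnitudes
assert np.array_equal(weights, restored)
```

The sign cost per weight is the seed size divided by N, which is effectively zero for large models.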
The Analogy:
Imagine you are packing a suitcase.
- Before: You have to pack every single sock individually (the signs), taking up half the suitcase. The clothes (magnitudes) are already folded small.
- After: You realize the socks are all the same color and pattern. You decide to just pack a tiny note that says "All socks are blue." You don't need to pack the socks themselves. Now you have huge space left for the clothes, and the whole suitcase is tiny.
Why This Matters
This paper solves a major bottleneck in making AI models smaller and faster. It proves that the "randomness" of AI signs is an illusion caused by the training process getting stuck. By understanding this "lock-in," we can design AI models that are incredibly small (sub-bit compression) without losing their smarts.
In short: The signs of AI numbers are lazy; they stay exactly where they started. If we start them in a pattern we can regenerate, and nudge them to stay put, we can throw away their storage cost entirely.