Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

This paper theoretically proves and empirically validates that implicit models achieve rich equilibria and scale their expressive power with test-time compute, allowing compact architectures to match or exceed the performance of larger explicit networks across diverse domains.

Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin

Published 2026-03-03

The Big Idea: The "Infinite Staircase" vs. The "Tall Tower"

Imagine you want to build a machine that can solve a very difficult puzzle.

The Old Way (Explicit Models):
Think of a standard AI model (like the ones in your phone or laptop) as a Tall Tower. To make the tower solve harder puzzles, you have to keep adding more floors (layers). If you want it to be super smart, you need a skyscraper.

  • The Problem: Building a skyscraper is expensive. It takes a lot of memory (bricks) and time to build. Once the tower is built, its height is fixed. If you want it to do something even harder later, you have to tear it down and build an even taller one.

The New Way (Implicit Models):
This paper introduces a different approach: The Infinite Staircase.
Instead of building a tall tower, you build a single, simple room with a staircase inside. You start at the bottom, take a step, look at the view, take another step, look again, and keep going until you reach the perfect spot (the "fixed point").

  • The Magic: You only built one room (one set of parameters), but you can climb as many steps as you want. The more steps you take (more "test-time compute"), the more complex the view becomes. You don't need to build a bigger room; you just need to walk further.
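The staircase is just fixed-point iteration: apply one operator over and over until the output stops changing. Here is a minimal toy sketch (not the paper's actual model) of such an implicit forward pass, using a single tanh layer rescaled to be a contraction so the iteration is guaranteed to settle:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
W = 0.5 * W / np.linalg.norm(W, 2)   # rescale so the map is a contraction
U = rng.standard_normal((3, 3))
b = rng.standard_normal(3)

def f(z, x):
    # The one "room": a single layer, reused at every step of the staircase.
    return np.tanh(W @ z + U @ x + b)

def forward(x, steps=50, tol=1e-8):
    # Walk the staircase: iterate z <- f(z, x) until the view stops changing.
    z = np.zeros(3)
    for _ in range(steps):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next            # reached the fixed point
        z = z_next
    return z
```

One set of parameters (`W`, `U`, `b`), an adjustable number of steps: that is the whole trick.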

What Did the Authors Prove?

The authors asked two big questions:

  1. Can this simple staircase do everything the tall tower can do? (Yes.)
  2. Can the staircase do things the tower can't do without getting huge? (Yes!)

They proved mathematically that this "Infinite Staircase" (Implicit Model) can represent incredibly complex, jagged, and difficult functions (like a cliff with a sudden drop) using a very smooth, simple operator.

The Analogy of the "Smooth Painter":
Imagine you want to paint a picture of a jagged, lightning-bolt-shaped mountain.

  • The Explicit Tower: To paint the sharp, jagged edges, you need a massive, complex brush with thousands of tiny bristles (parameters).
  • The Implicit Staircase: You use a simple, smooth brush. But you don't just swipe once. You swipe, then look at the result, adjust your hand slightly, and swipe again. You repeat this 100 times.
    • Step 1: The brush makes a smooth curve.
    • Step 10: The curve gets sharper.
    • Step 100: The curve looks exactly like the jagged lightning bolt.

The paper proves that by repeating this simple action enough times, you can create any shape, even ones that are mathematically "impossible" for a single smooth stroke. The complexity comes from the repetition, not the size of the tool.
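A loose numerical analogy for "smooth steps, sharp limit" (not the paper's construction): each Babylonian step for the square root is a perfectly smooth rational map, yet its fixed point, sqrt(x), has a slope that blows up near zero — a sharpness no single application of the smooth step can produce.

```python
import numpy as np

def step(z, x):
    # One application of a perfectly smooth (rational) map in z and x.
    return 0.5 * (z + x / z)          # Babylonian / Newton step for sqrt(x)

x = np.linspace(1e-4, 1.0, 5)
z = np.ones_like(x)                   # iterate 0: a constant, maximally smooth
for _ in range(30):
    z = step(z, x)                    # every iterate is still smooth in x

# The limit sqrt(x) has slope 1/(2*sqrt(x)), unbounded near x = 0 --
# the sharpness comes from repetition, not from the map itself.
print(np.max(np.abs(z - np.sqrt(x))))
```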

Why Does This Matter? (The "Test-Time Scaling" Secret)

In the old world, if you wanted a smarter AI, you had to train a bigger model (more parameters). This is like buying a bigger car to go faster.

In this new world, you can keep the model small (the same car) but drive it longer (more iterations).

  • Test-Time Scaling: This is the fancy term for "spending more time thinking at the moment of answering."
  • The Result: A small, cheap model can outperform a giant, expensive model if you let the small model "think" for a few more seconds (iterations).

Real-World Examples from the Paper

The authors tested this "Staircase" idea in four different fields to prove it works:

  1. Image Restoration (Fixing Blurry Photos):

    • The Task: Take a blurry photo and make it sharp.
    • The Result: The implicit model started with a blurry guess. With every "step" (iteration), the image got sharper and sharper. Eventually, it produced a clearer image than a much larger, traditional model.
  2. Scientific Computing (Fluid Dynamics):

    • The Task: Predict how air or water flows around an object (like a plane wing).
    • The Result: The model started with a rough guess of the wind. As it "walked" up the stairs, the wind patterns became more detailed and accurate, matching complex physics equations better than larger models.
  3. Operations Research (Solving Math Puzzles):

    • The Task: Solve complex logistics problems (like how to deliver packages to 1,000 stores efficiently).
    • The Result: The model treated the problem as a graph. By iterating, it found better and better solutions, eventually beating larger models that were trained specifically for this.
  4. LLM Reasoning (AI Chatbots):

    • The Task: Answer tricky questions that require deep thinking (e.g., "What is the difference between 'charge' in physics vs. 'charge' in banking?").
    • The Result: At first, the AI just repeated the question. But as it "thought" longer (more iterations), it realized the context shifted from physics to finance and gave the correct, nuanced answer. The "thinking" process allowed it to separate the meanings.

The "Secret Sauce": Why It Works

The paper explains that the "Simple Operator" (the single room) is designed to be stable and smooth. It doesn't try to be complex immediately.

  • The Trap: If you force the operator to be complex from the start, it becomes unstable and hard to train.
  • The Solution: Keep the operator simple. Let the iterations do the heavy lifting. The complexity "emerges" naturally as you keep walking up the stairs.
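The classic way to make "keep walking and you'll arrive" precise is Banach's fixed-point theorem: if the operator is a contraction, the iteration converges from any starting point. A minimal sketch with a linear operator f(z) = Wz + b, where stability reduces to checking that the spectral norm of W is below 1 (an illustrative condition, not the paper's exact recipe):

```python
import numpy as np

def spectral_norm(W):
    # Largest singular value: the Lipschitz constant of z -> W @ z.
    return np.linalg.norm(W, 2)

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))
W = 0.9 * W / spectral_norm(W)        # enforce ||W||_2 = 0.9 < 1: a contraction
b = rng.standard_normal(4)

z = np.zeros(4)
for _ in range(200):                   # walk the stairs
    z = W @ z + b

z_star = np.linalg.solve(np.eye(4) - W, b)   # exact fixed point, for comparison
print(np.linalg.norm(z - z_star))            # tiny residual: the walk arrived
```

This is the "keep the operator simple" discipline in miniature: rescaling the weights buys guaranteed stability, and the iterations supply the complexity.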

The Takeaway for Everyone

This paper tells us that we don't always need bigger AI models.
Sometimes, we just need to let the models think longer.

  • Old Mindset: "Make the model bigger to make it smarter."
  • New Mindset: "Keep the model small and efficient, but let it iterate (think) more times when it needs to solve a hard problem."

It's the difference between hiring a giant team of people to solve a problem instantly versus hiring one very smart person who takes their time to think it through step-by-step. The paper proves that the "one smart person taking their time" can often do a better job than the giant team, using fewer resources.
