The minimal width of universal $p$-adic ReLU neural networks

Imagine you are trying to build a machine that can learn to recognize patterns, like distinguishing a cat from a dog in a photo. In the world of standard computers, we usually teach these machines using Real Numbers (the familiar 1, 2, 3, 3.14, etc.). This paper asks a fascinating question: What if we built these machines using a completely different number system called "p-adic numbers"?

Here is a breakdown of the paper's ideas, translated into everyday language with some creative metaphors.

1. The Setting: A World of "Digital" Distances

In our normal world (Real numbers), distance is like a ruler. If you move a tiny bit, you are still close.
In the p-adic world (specifically $\mathbb{Q}_p$ ), distance works like a digital file system or a family tree.

Two numbers are "close" if they share a long history of common ancestors, meaning their base-$p$ digits agree for a long stretch, read from the lowest-order digit upward.
If they differ early in that shared history, they are considered far apart, no matter how similar they look as ordinary numbers.
This world is totally disconnected. Imagine a forest where every tree is an island; there are no bridges between them. You can't walk smoothly from one tree to another; you have to jump.
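To make the "shared digits" idea concrete, here is a small Python sketch. It assumes we only look at ordinary integers (which sit inside the p-adic numbers), and the helper names `valuation` and `padic_distance` are illustrative, not taken from the paper. The distance between two integers is $p^{-k}$, where $p^k$ is the largest power of $p$ dividing their difference.

```python
from fractions import Fraction

def valuation(n: int, p: int) -> int:
    """Number of times p divides the nonzero integer n."""
    if n == 0:
        raise ValueError("the valuation of 0 is infinite")
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def padic_distance(x: int, y: int, p: int) -> Fraction:
    """p-adic distance |x - y|_p = p^(-valuation(x - y))."""
    if x == y:
        return Fraction(0)
    return Fraction(1, p ** valuation(x - y, p))

# 1 and 82 differ by 81 = 3^4, so their last four base-3 digits agree.
print(padic_distance(1, 82, 3))   # 1/81
# 1 and 2 already differ in the last base-3 digit.
print(padic_distance(1, 2, 3))    # 1
```

Note that 1 and 82 look far apart on a ruler but are close 3-adically, while 1 and 2 are as far apart as two integers can be.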

2. The Tool: The "p-Adic ReLU"

Standard neural networks use an activation function called ReLU (Rectified Linear Unit). Think of it as a gatekeeper:

Standard ReLU: "If the number is positive, let it pass. If it's negative, stop it (make it zero)."
p-Adic ReLU (pReLU): The authors created a version for this digital world.
- The Rule: "If the number belongs to a specific 'safe zone' (the p-adic integers, $\mathbb{Z}_p$ ), let it pass. If it's outside that zone, stop it (make it zero)."
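Here is a tiny sketch of that gatekeeper, assuming we stand in for p-adic numbers with ordinary fractions (an illustration only; `in_Zp` and `prelu` are made-up names). A fraction in lowest terms is a p-adic integer exactly when $p$ does not divide its denominator.

```python
from fractions import Fraction

def in_Zp(x: Fraction, p: int) -> bool:
    """A reduced fraction lies in Z_p iff p does not divide its denominator."""
    return x.denominator % p != 0

def prelu(x: Fraction, p: int) -> Fraction:
    """Let x pass if it is a p-adic integer, otherwise output zero."""
    return x if in_Zp(x, p) else Fraction(0)

p = 3
print(prelu(Fraction(5), p))      # 5
print(prelu(Fraction(1, 2), p))   # 1/2
print(prelu(Fraction(1, 3), p))   # 0
```

For $p = 3$, the fraction 1/2 passes because its denominator has no factor of 3, while 1/3 is blocked.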

3. The Big Question: How Wide Must the Machine Be?

In neural networks, width is like the number of workers in a factory assembly line.

Narrow factory: Few workers.
Wide factory: Many workers working in parallel.

The paper asks: What is the minimum number of workers (width) needed to build a machine that can learn any pattern in this p-adic world?

They found a precise formula:

Minimum Width = Max(Input Size + 1, Output Size)

If you are processing an image with 3 pixels (Input = 3) and want to output a pair of numbers (Output = 2), you need a factory width of 4 (because $3+1=4$ , which is bigger than 2).
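The formula is simple enough to write as a one-line function. The name `minimal_width` is just a label for this sketch.

```python
def minimal_width(input_size: int, output_size: int) -> int:
    """Minimum Width = Max(Input Size + 1, Output Size)."""
    return max(input_size + 1, output_size)

print(minimal_width(3, 2))  # 4
print(minimal_width(1, 5))  # 5
```

The second call shows the other case: with one input and five outputs, the output size is what sets the width.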

4. Why is this different from the Real World?

In the real world, this problem is very tricky. Because real numbers are connected (like a smooth road), there are "topological traps." Sometimes, a narrow factory just physically cannot twist and turn enough to draw a complex shape without getting stuck.

In the p-adic world, there are no traps.
Because the space is "totally disconnected" (like a forest of islands), the machine doesn't need to draw smooth curves. It just needs to jump from island to island.

The Analogy: Imagine you are trying to sort mail. In the real world, you might need a complex conveyor belt to sort letters that are slightly different. In the p-adic world, every letter is already in a separate, distinct bin. You just need to drop the right letter into the right bin. The "jumps" are easy.

5. The Two-Step Strategy (The "Encoder" and "Decoder")

The authors proved that if you have enough width, you can build a universal machine. They did this by showing how to build two special tools:

The Encoder (The "Zipper"):
- Imagine you have a complex 3D object (your input data). You want to flatten it into a single line of numbers without losing information.
- The authors built a "p-Adic Zipper" (an encoding function) that takes your multi-dimensional data and compresses it into a single number, keeping every detail needed to tell inputs apart at a chosen resolution. This requires a width of Input Size + 1.
The Decoder (The "Unzipper"):
- Once the data is compressed, the machine does its math.
- Then, you need to expand that single number back into the original shape (the output).
- The authors built a "p-Adic Unzipper" (a decoding function) that takes that single number and expands it back into the correct output dimensions. This requires Output Size width.

By combining these two tools, they showed that as long as your factory is wide enough to handle the "Zipper" and the "Unzipper," you can approximate any function you want.

6. The "Juggling" Trick

One of the most clever parts of the paper involves a concept they call a "Juggling Function."

Imagine a juggler facing $p^m$ buckets (cosets). The juggler must be able to land a ball in every one of those buckets, so that no target is out of reach.
The authors proved that a neural network with just 2 workers (width 2) can act as this perfect juggler. This allows the network to shuffle data around perfectly to hit every target value.

Summary: The Takeaway

The paper's main message is:

"If you switch from the smooth, connected world of Real numbers to the 'digital,' disconnected world of p-adic numbers, the question of how wide a universal network must be gets a single clean answer. The network only needs the minimal width that lets it act as a 'Zipper' and 'Unzipper' for your data, and that width stays the same however the approximation error is measured."

It suggests that for certain types of classification problems (like sorting distinct categories), p-adic neural networks might be a more natural and efficient fit than the traditional ones we use today.

1. Problem Statement

The paper addresses the problem of determining the minimal width required for a neural network to possess the Universal Approximation Property (UAP) within the context of p-adic analysis.

Context: While UAP and minimal width for real-valued neural networks (using ReLU activations) are well-studied, this work extends the theory to the field of p-adic numbers ( $\mathbb{Q}_p$ ).
Specifics:
- Domain: Continuous functions $f: \mathbb{Z}_p^{d_x} \to \mathbb{Q}_p^{d_y}$ (or more generally on compact open subsets).
- Activation Function: A natural p-adic analogue of ReLU, defined as:
  $\text{pReLU}(x) = \begin{cases} x & \text{if } x \in \mathbb{Z}_p \\ 0 & \text{otherwise} \end{cases}$
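- Weights: The paper explicitly allows weights in $\mathbb{Q}_p$ (not just $\mathbb{Z}_p$ ). With weights and biases restricted to $\mathbb{Z}_p$ , every pre-activation of an input from $\mathbb{Z}_p^{d_x}$ stays in $\mathbb{Z}_p$ , so pReLU acts as the identity and the network computes only affine maps, which rules out universal approximation.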
- Metrics: Approximation is measured in $L_q$ norms ( $1 \le q \le \infty$ ) and the $C^1$ norm (defined as the $L_\infty$ norm in this context).
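A minimal sketch of why weights outside $\mathbb{Z}_p$ matter, modelling elements of $\mathbb{Q}_p$ by exact rationals (an assumption made for illustration; `prelu` and `layer` are hypothetical helper names, not code from the paper). With integer weights the layer below is affine on integer inputs, while the single weight $1/p$ turns the neuron into a detector for $p\mathbb{Z}_p$ .

```python
from fractions import Fraction

P = 3

def prelu(x: Fraction, p: int = P) -> Fraction:
    """pReLU on rationals: identity on Z_p (denominator prime to p), zero elsewhere."""
    return x if x.denominator % p != 0 else Fraction(0)

def layer(weights, biases, x):
    """One pReLU layer: x -> pReLU(Wx + b), applied coordinate-wise."""
    return [
        prelu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]

inputs = [Fraction(n) for n in range(6)]

# Integer weights: pre-activations stay in Z_p, so the layer is the affine map 2x + 1.
affine = [layer([[Fraction(2)]], [Fraction(1)], [x])[0] for x in inputs]

# Weight 1/p: the neuron outputs x/p when p divides x and 0 otherwise.
gated = [layer([[Fraction(1, P)]], [Fraction(0)], [x])[0] for x in inputs]

print([int(v) for v in affine])  # [1, 3, 5, 7, 9, 11]
print([int(v) for v in gated])   # [0, 0, 0, 1, 0, 0]
```

The second output is not an affine function of the input, which is the kind of non-linearity the $\mathbb{Q}_p$ weights make available.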

2. Methodology

The authors employ a combination of p-adic topology, algebraic geometry over local fields, and constructive neural network design.

A. Topological and Algebraic Foundations

Total Disconnectedness: Unlike $\mathbb{R}$ , the topology of $\mathbb{Q}_p$ is totally disconnected. This eliminates many topological obstructions found in real-valued approximation theory.
Locally Constant Functions: The authors leverage the fact that continuous functions on the compact space $\mathbb{Z}_p^n$ can be uniformly approximated by locally constant functions. These functions are constant on cosets of $p^m \mathbb{Z}_p^n$ for sufficiently large $m$ .
Convexity in $\mathbb{Q}_p$ : The paper defines convex subsets of $\mathbb{Q}_p$ vector spaces as cosets of $\mathbb{Z}_p$ -submodules. This algebraic definition is crucial for analyzing the image of neural networks.
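The approximation by locally constant functions can be checked numerically on integers, which are dense in $\mathbb{Z}_p$ . The sketch below is an illustration under assumptions: the target $f(x) = x^2$ and the helper names are chosen here, not taken from the paper. Replacing $x$ by its residue modulo $p^m$ gives a function that is constant on cosets of $p^m \mathbb{Z}_p$ and differs from $f$ by at most $p^{-m}$ in the p-adic absolute value.

```python
def valuation(n: int, p: int) -> int:
    """Number of times p divides the nonzero integer n."""
    if n == 0:
        raise ValueError("the valuation of 0 is infinite")
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def locally_constant(f, p: int, m: int):
    """Return x -> f(x mod p^m), which is constant on each coset of p^m Z_p."""
    modulus = p ** m
    return lambda x: f(x % modulus)

p, m = 3, 4
f = lambda x: x * x
f_m = locally_constant(f, p, m)

# x = r (mod p^m) implies x^2 = r^2 (mod p^m), so |f(x) - f_m(x)|_p <= p^(-m).
for x in range(1000):
    diff = f(x) - f_m(x)
    assert diff == 0 or valuation(diff, p) >= m
```

Increasing `m` tightens the bound, mirroring the "sufficiently large $m$ " in the statement above.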

B. Lower Bound Strategy (Necessity)

To prove that width $w$ must be at least $\max(d_x + 1, d_y)$ , the authors establish obstructions:

Output Dimension ( $w \ge d_y$ ): If $w < d_y$ , the image of the network lies in a proper affine subspace of $\mathbb{Q}_p^{d_y}$ . A target function whose image contains $d_y + 1$ affinely independent points is not close to any such subspace, so the network cannot approximate it arbitrarily well.
Input Dimension ( $w \ge d_x + 1$ ): This is the novel contribution. The authors prove a structural lemma (Theorem 2.13):
- Lemma: For a pReLU-network of width $n$ and input dimension $n$ , either the function is affine on $\mathbb{Z}_p^n$ , or there exists a ball of radius $1/p$ where the function is constant in some direction.
- Contradiction: They construct a homeomorphism $h: \mathbb{Z}_p^{d_x} \to \mathbb{Z}_p$ (which exists due to the topological properties of p-adic spaces) and compose it with a non-linear map (e.g., $x \mapsto x^2$ ). This target function cannot be constant in any direction on a ball, nor can it be affine. Thus, a network with width $w \le d_x$ fails to approximate it.
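One standard way to see that a homeomorphism $\mathbb{Z}_p^{d_x} \to \mathbb{Z}_p$ exists is to interleave base-$p$ digits. The summary does not say which construction the authors use, so the following is only an illustrative sketch on nonnegative integers truncated to $m$ digits per coordinate, with hypothetical helper names `digits`, `interleave`, and `deinterleave`.

```python
def digits(n: int, p: int, m: int) -> list[int]:
    """First m base-p digits of n, least significant first."""
    out = []
    for _ in range(m):
        out.append(n % p)
        n //= p
    return out

def interleave(xs: list[int], p: int, m: int) -> int:
    """Map (x_1, ..., x_d) to the integer whose base-p digit at position
    d*k + i is the k-th digit of x_i."""
    d = len(xs)
    expansions = [digits(x, p, m) for x in xs]
    return sum(
        expansions[i][k] * p ** (d * k + i)
        for k in range(m)
        for i in range(d)
    )

def deinterleave(n: int, d: int, p: int, m: int) -> list[int]:
    """Inverse of interleave on inputs with m digits per coordinate."""
    ds = digits(n, p, d * m)
    return [sum(ds[d * k + i] * p ** k for k in range(m)) for i in range(d)]

p, m = 2, 3
pairs = [(a, b) for a in range(p ** m) for b in range(p ** m)]
codes = [interleave([a, b], p, m) for a, b in pairs]

# Every 2m-digit code is hit exactly once, and the inverse recovers the pair.
assert sorted(codes) == list(range(p ** (2 * m)))
assert all(deinterleave(c, 2, p, m) == [a, b] for c, (a, b) in zip(codes, pairs))
```

Two inputs that agree modulo $p^k$ in every coordinate produce codes that agree modulo $p^{dk}$ , and conversely, which is what makes the digit-interleaving map and its inverse continuous.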

C. Upper Bound Strategy (Sufficiency)

To prove that width $w = \max(d_x + 1, d_y)$ is sufficient, the authors construct explicit networks from the following components:

Encoding (Input Reduction):
- Construct a network of width $d_x + 1$ that maps $\mathbb{Z}_p^{d_x}$ to $\mathbb{Z}_p$ .
- This "encoding function" is constant on cosets of $p^m \mathbb{Z}_p^{d_x}$ but maps distinct cosets to distinct values in $\mathbb{Z}_p$ .
- Mechanism: Uses Lemma 3.5/3.6 to interpolate values on finite sets and Lemma 3.8 to handle coset selection.
Interpolation:
- Once the input is encoded into a single p-adic variable, the problem reduces to interpolating a function on a finite subset of $\mathbb{Z}_p$ .
- Theorem 3.4 shows that a network of width $d_x + 1$ can compute any locally constant function by combining the encoder with a width-2 interpolator.
Decoding (Output Expansion):
- Construct a network of width $d_y$ that maps $\mathbb{Z}_p$ back to $\mathbb{Z}_p^{d_y}$ .
- This "decoding function" (or "juggling function") ensures that for any target vector, there is an input that maps to a value within the correct $p^m$ -coset.
- Mechanism: Uses iterative composition of width-2 networks (Lemma 3.16) to "juggle" values across cosets.
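The data flow of the steps above can be mimicked on truncated integers. This sketch is not a pReLU network and does not reproduce the constructions in the cited lemmas. It only illustrates the encode, interpolate, decode pipeline, with hypothetical helpers `encode` and `decode` and a lookup table standing in for the interpolation step.

```python
from itertools import product

def encode(xs, p, m):
    """Concatenate the residues x_i mod p^m into one integer. The result is
    constant on cosets of p^m Z_p^d and distinct for distinct cosets."""
    q = p ** m
    return sum((x % q) * q ** i for i, x in enumerate(xs))

def decode(code, d, p, m):
    """Split a code back into d residues modulo p^m."""
    q = p ** m
    return tuple((code // q ** i) % q for i in range(d))

def compile_table(f, d_x, p, m):
    """Tabulate f on coset representatives as a map from input code to output code."""
    reps = range(p ** m)
    return {encode(xs, p, m): encode(f(xs), p, m) for xs in product(reps, repeat=d_x)}

p, m, d_x, d_y = 2, 2, 2, 3
# Illustrative target, read modulo p^m in each output coordinate.
target = lambda xs: (xs[0] + xs[1], xs[0] * xs[1], xs[0])
table = compile_table(target, d_x, p, m)

def network(xs):
    """Encode to one number, look up the output code, decode to d_y coordinates."""
    return decode(table[encode(xs, p, m)], d_y, p, m)

print(network((3, 2)))   # (1, 2, 3)
print(network((7, 6)))   # (1, 2, 3), since (7, 6) lies in the same coset as (3, 2)
```

The middle step only ever handles a single number, which is why the width requirement comes from the encoder ( $d_x + 1$ ) and the decoder ( $d_y$ ) rather than from the interpolation itself.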

3. Key Results

Theorem 1.2 (Main Result):
For every $q \in [1, \infty]$ , pReLU-networks of width $w$ have the universal approximation property for continuous functions $f: \mathbb{Z}_p^{d_x} \to \mathbb{Q}_p^{d_y}$ in the $L_q$ norm if and only if:
$w \ge \max(d_x + 1, d_y)$

Key Distinctions from Real-Valued Networks:

No Gap in Bounds: In real-valued ReLU networks, there is often a discrepancy between the minimal width required for $L_q$ approximation and for $C^1$ (uniform) approximation due to topological obstructions (e.g., the "bottleneck" of the ReLU function). In the p-adic case, the bound is identical for all norms ( $L_q$ and $C^1$ ).
Reason: The total disconnectedness of $\mathbb{Q}_p$ allows functions to be approximated by locally constant functions, bypassing the need for smooth transitions that real networks struggle with in low widths.

Theorem 1.5 (Intermediate Result):
Any locally constant function $f: \mathbb{Z}_p^{d_x} \to \mathbb{Q}_p$ can be computed exactly by a pReLU-network of width $d_x + 1$ .

4. Significance and Contributions

Foundational Theory for p-adic AI: The paper provides the first rigorous characterization of the expressive power of p-adic neural networks, establishing a direct analogue to the real-valued case but with distinct structural properties.
Optimal Width Determination: It resolves the minimal width problem for p-adic ReLU networks, showing that the cost of universality is exactly $d_x + 1$ (for input handling) or $d_y$ (for output handling), whichever is larger.
Algebraic vs. Topological Obstructions: The work highlights a fundamental difference between real and p-adic approximation. In $\mathbb{R}$ , minimal width is often limited by topological connectivity issues. In $\mathbb{Q}_p$ , the limitation is purely algebraic: the image of a network of width below $d_y$ is confined to a proper affine subspace, and a network whose width does not exceed the input dimension is either affine or constant in some direction on a small ball.
Practical Implications for Classification: The authors argue that p-adic networks may be more suitable for classification tasks (binary outputs) than real networks because the totally disconnected nature of $\mathbb{Q}_p$ naturally aligns with discrete decision boundaries, potentially offering more efficient architectures for such problems.
Methodological Innovation: The introduction of "encoding" and "decoding" functions using p-adic coset arithmetic provides a new toolkit for constructing deep learning architectures over non-Archimedean fields.

In summary, the paper demonstrates that while p-adic neural networks share the universal approximation capability of their real counterparts, their minimal width requirements are governed by the unique algebraic and topological structure of the p-adic numbers, resulting in a unified bound for all approximation norms.
