$p$-adic Linear Regression for Random Sampling with Digitwise Noise

The Big Picture: A New Way to Find Patterns in "Strange" Numbers

Imagine you are a detective trying to solve a mystery. You have a list of clues (data points), and you suspect they follow a simple rule (a straight line). In the real world, we use Linear Regression to find that line. We draw a line that comes closest to all the dots, minimizing the "distance" between the line and the dots.

But this paper isn't about the real world. It's about p-adic numbers.

What are p-adic numbers?
Think of them as numbers written in reverse.

Real numbers: We read from left to right (e.g., $123.45$). The most important part is the big number on the left.
p-adic numbers: We read from right to left, infinitely. The most important part is the last digit (the "ones" place). The further left you go, the less significant the digit becomes.
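
To see what "reading from the right" means in practice, here is a small Python sketch. The helper names `p_adic_digits` and `shared_low_digits` are made up for this illustration and do not come from the paper. The first lists the base-$p$ digits of a whole number starting from the "ones" place; the second counts how many of those trailing digits two numbers have in common, which is how the p-adic world measures closeness.

```python
def p_adic_digits(n, p, count):
    """Return the first `count` base-p digits of the integer n, ones digit first."""
    digits = []
    for _ in range(count):
        digit = n % p          # the current "ones" digit
        digits.append(digit)
        n = (n - digit) // p   # drop that digit and shift everything right
    return digits


def shared_low_digits(a, b, p):
    """Count how many trailing base-p digits the distinct integers a and b share."""
    if a == b:
        raise ValueError("a and b must differ")
    diff, count = a - b, 0
    while diff % p == 0:       # each factor of p in a - b is one shared digit
        diff //= p
        count += 1
    return count


print(p_adic_digits(100, 7, 4))      # digits of 100 in base 7, ones digit first
print(p_adic_digits(2, 7, 4))        # digits of 2 in base 7, ones digit first
print(p_adic_digits(-1, 7, 4))       # -1 never runs out of digits 7-adically
print(shared_low_digits(100, 2, 7))  # 100 - 2 = 98 = 2 * 7**2
```

In this sketch, 100 and 2 count as fairly close 7-adically because their last two base-7 digits agree, even though they are far apart on the ordinary number line.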

The Problem:
In the real world, if you have a few "bad" data points (noise), you can smooth them out. But in the p-adic world, the usual math tools (like squaring errors and adding them up) break down. It's like trying to measure the height of a mountain by counting how many grains of sand are in a bucket; the math just doesn't add up the same way.

This paper proposes a new, clever way to find the "line" (the rule) even when the data is messy and written in this strange p-adic language.

The Core Idea: "Peeling the Onion"

The author's solution is based on a simple, step-by-step strategy: Don't try to solve the whole infinite number at once. Solve it one digit at a time, starting from the end.

Imagine you are trying to guess a secret 10-digit combination lock code, but you can only see the last digit at first, then the second-to-last, and so on.

Step 1: The "Modulo p" Detective (Finding the Last Digit)

First, the algorithm ignores everything except the very last digit of every number.

The Analogy: Imagine you have a huge pile of mixed-up puzzle pieces. You decide to only look at pieces that are Red. You ignore the Blue, Green, and Yellow pieces for a moment.
The Math: The algorithm looks at the data "modulo $p$." This means it only cares about the remainder when you divide by a prime number $p$ (like 7). It effectively strips away all the "higher" digits and leaves just the last one.
The Noise: Some of your data points are "noisy" (wrong). The algorithm uses a probabilistic trick: it keeps picking random groups of data points and asking, "Do these points fit a straight line?" If a group fits perfectly, it's likely a "clean" group. If it doesn't, it's probably mixed with noise.
The Result: It finds the correct last digit of the secret code (the slope of the line).

Step 2: The "Peeling" Process (Finding the Next Digits)

Once the algorithm knows the last digit, it doesn't stop. It uses that knowledge to "peel" the onion and look at the next digit.

The Analogy: You now know the last digit of the combination is 7. You write that down. Now, you take the original numbers, subtract that 7, and divide by 10 (or by $p$). This shifts the numbers to the right, bringing the second-to-last digit into the "ones" place.
The Math: The algorithm calculates the "residual" (the error) of the first guess. It divides this error by $p$. This creates a new set of data where the "new" last digit is actually the "old" second-to-last digit.
The Loop: It runs the same "Last Digit Detective" algorithm again on this new, shifted data. It finds the next digit.
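Repeat: It does this over and over, digit by digit, until it has reconstructed as many digits as required. A p-adic number can have infinitely many digits, so in practice the process stops at a chosen precision.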
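
To make the peeling concrete, here is a small worked example with made-up numbers (they are not taken from the paper). Take $p = 7$, a secret slope $c = 23$, and a single noise-free data point $x = 3$, $y = c \cdot x = 69$. Dividing by $x$ modulo 7 means multiplying by $3^{-1} \equiv 5$, since $3 \cdot 5 = 15 \equiv 1 \pmod 7$.

$$
\begin{aligned}
\text{Last digit:}\quad & 69 \equiv 6 \pmod 7, \qquad d_0 \equiv 6 \cdot 5 = 30 \equiv 2 \pmod 7 \\
\text{Peel:}\quad & (69 - 2 \cdot 3)/7 = 63/7 = 9 \\
\text{Next digit:}\quad & 9 \equiv 2 \pmod 7, \qquad d_1 \equiv 2 \cdot 5 = 10 \equiv 3 \pmod 7 \\
\text{Peel:}\quad & (9 - 3 \cdot 3)/7 = 0 \\
\text{Result:}\quad & c = d_0 + d_1 \cdot 7 = 2 + 21 = 23
\end{aligned}
$$

With noisy data, the only change is that each "digit" step has to be decided by a group of data points that agree with each other, rather than by a single point.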

Why This is Special (The "Noise" Factor)

In many real-world scenarios, data is messy. Some people lie, some sensors break, and some numbers are just wrong. This is called noise.

The Challenge: If you have 100 data points and 10 of them are lies, a standard computer might get confused and draw a crooked line.
The Paper's Solution: The algorithm is a "smart guesser." It knows that if it picks a random handful of points, there's a good chance most of them are honest. It tries different handfuls. If a handful fits a perfect line, it assumes, "Aha! These are the honest people!" It then uses those honest people to find the next digit.

It's like trying to find the true temperature in a room where some thermometers are broken. Instead of averaging all of them (which would give a wrong answer), you look for a group of thermometers that all agree with each other. Once you find that "truthful group," you trust them to tell you the next piece of the puzzle.

Summary of the Algorithm's Journey

Look at the last digit: Ignore the noise, find the pattern in the last digit of all numbers.
Lock it in: Write down that digit.
Shift the view: Subtract what you found, divide by $p$, and look at the new last digit (which was the second-to-last before).
Repeat: Do this until you have as many digits as you need.

Why Should We Care?

The author mentions that this is useful for Computer Science and Artificial Intelligence.

Neural Networks: Just like we use real numbers to train AI, p-adic numbers might be better for certain types of data (like hierarchical trees or specific types of encryption).
Efficiency: This method is a "probabilistic algorithm," meaning it makes random choices instead of checking every single possibility. It's a heuristic (a smart shortcut) that works well when only a small share of the data is messy, and it slows down sharply as the noise level or the number of unknowns grows.

In a nutshell: This paper teaches computers how to find straight lines in a world where numbers are written backwards and the rules of "closeness" are totally different, by solving the mystery one tiny digit at a time.

1. Problem Statement

The paper addresses the challenge of performing linear regression in the p-adic number system ($\mathbb{Q}_p$ and $\mathbb{Z}_p$) when the observed data contains digitwise noise.

Context: While linear regression is a cornerstone of real-number statistics (using the least squares method), it does not translate directly to p-adic analysis.
The Core Difficulty:
- Non-Archimedean Property: In $\mathbb{R}$, minimizing the sum of squared errors $\sum |y_i - g(x_i)|^2$ ensures each individual error is small. In $\mathbb{Q}_p$, the non-Archimedean property ($|a+b| \le \max(|a|, |b|)$) allows large terms to cancel, so a small sum of errors does not imply small individual errors. Consequently, standard gradient-based optimization and least squares methods fail.
- Lack of Differentiability: Loss functions in p-adic settings are often locally constant almost everywhere, rendering differential calculus (gradients) useless for optimization.
- Computational Complexity: The problem of finding the best-fit hyperplane over $\mathbb{F}_p$ (that is, modulo $p$) is equivalent to the Maximal Feasible Subsystem problem, which is known to be APX-complete. This implies that exact solutions are computationally hard, necessitating heuristic or probabilistic approaches.
Goal: To develop a probabilistic algorithm that estimates the coefficient vector $\vec{c}$ of a linear equation $y = \langle \vec{c}, \vec{x} \rangle$ given noisy samples, where noise is defined as "digitwise" (errors appearing at specific p-adic digits).
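
The non-Archimedean difficulty listed above can be checked numerically. The following is a minimal sketch, with a hypothetical helper `p_adic_abs` and hand-picked error values that are not from the paper. It computes the p-adic absolute value $|n|_p = p^{-v}$, where $p^v$ is the largest power of $p$ dividing $n$, and shows two errors that are each as large as possible for integers but whose sum is p-adically tiny.

```python
from fractions import Fraction


def p_adic_abs(n, p):
    """p-adic absolute value of an integer n: p**(-v), where p**v exactly divides n."""
    if n == 0:
        return Fraction(0)
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return Fraction(1, p ** v)


p = 7
errors = [1, p ** 5 - 1]            # illustrative values, chosen so that they cancel
individual = [p_adic_abs(e, p) for e in errors]
total = p_adic_abs(sum(errors), p)  # the sum is exactly 7**5

print("individual sizes:", individual)
print("size of the sum: ", total)
```

Neither error is divisible by 7, so both have absolute value 1, while their sum is divisible by $7^5$. A loss built from a p-adic sum can therefore look excellent while every individual residual is poor.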

2. Methodology

The proposed solution is a hierarchical, digit-by-digit probabilistic algorithm. It decomposes the p-adic regression problem into a sequence of modulo $p$ regression problems.

A. Modulo $p$ Linear Regression (The Core Subroutine)

The foundation is Algorithm 6, which estimates the linear equation over the finite field $\mathbb{F}_p$.

Assumption: The data contains a "noise-free locus" (a subset of samples that perfectly satisfy the linear equation) comprising a fraction $(1-r)$ of the total data, where $r$ is the noise probability bound.
Mechanism:
1. Iterative Inclusion: The algorithm attempts to build a set of indices $I'$ representing a noise-free subset.
2. Dynamic Gauss Elimination: It uses a dynamic variant of Gaussian elimination (Algorithm 1) to maintain an extended row echelon form of the candidate subset.
3. Probabilistic Verification: It employs a statistical criterion (Algorithm 2) to test whether a candidate subset $I'$ is likely to be noise-free.
  - If $I'$ is noise-free, the fraction of samples in the full dataset that satisfy the equations derived from $I'$ is expected to be high ($\approx 1-r$).
  - If $I'$ is not noise-free, that fraction drops significantly (to $\approx p^{-k}$).
4. Thresholding: The algorithm uses a threshold based on the noise bound $r$ and the dimension $D$ to decide whether to accept a candidate subset. It recursively extends the subset until it spans the required codimension.
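
The accept-or-reject idea behind this mechanism can be sketched with a much simpler stand-in. The code below is not Algorithm 6: it replaces the incremental index set and dynamic Gaussian elimination with a plain random-subset (RANSAC-style) loop. All function names, the toy data, and the threshold value 0.8 are assumptions made for this illustration. It draws $D$ random samples, solves the resulting square system over $\mathbb{F}_p$, and accepts the solution only if a large fraction of the whole dataset agrees with it.

```python
import random


def solve_mod_p(rows, rhs, p):
    """Solve the square system rows * c = rhs over F_p by Gauss-Jordan
    elimination; return None when the matrix is singular modulo p."""
    n = len(rows)
    aug = [[v % p for v in row] + [b % p] for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)  # modular inverse (Python 3.8+)
        aug[col] = [(v * inv) % p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % p for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def agreement(c, xs, ys, p):
    """Fraction of all samples with <c, x> == y (mod p)."""
    hits = sum((sum(ci * xi for ci, xi in zip(c, x)) - y) % p == 0
               for x, y in zip(xs, ys))
    return hits / len(xs)


def consensus_fit_mod_p(xs, ys, p, threshold, max_trials, rng):
    """Try random D-subsets until one yields a solution most samples agree with."""
    dim = len(xs[0])
    for _ in range(max_trials):
        idx = rng.sample(range(len(xs)), dim)
        c = solve_mod_p([xs[i] for i in idx], [ys[i] for i in idx], p)
        if c is not None and agreement(c, xs, ys, p) >= threshold:
            return c
    return None


# Toy data: every tenth sample is corrupted, so the noise rate is r = 0.1.
rng = random.Random(0)
p, dim, n = 7, 3, 200
true_c = [3, 5, 1]
xs = [[rng.randrange(p) for _ in range(dim)] for _ in range(n)]
ys = []
for i, x in enumerate(xs):
    y = sum(c * v for c, v in zip(true_c, x)) % p
    if i % 10 == 0:
        y = (y + rng.randrange(1, p)) % p  # nonzero shift: always wrong mod p
    ys.append(y)

# The threshold 0.8 is an arbitrary choice between 1 - r = 0.9 and 1/p.
estimate = consensus_fit_mod_p(xs, ys, p, threshold=0.8, max_trials=500, rng=rng)
print(estimate, agreement(estimate, xs, ys, p))
```

The gap between the two regimes is what makes a threshold workable: the correct coefficients agree with about $1-r$ of the data, while coefficients derived from a contaminated subset agree with only a small fraction.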

B. Digitwise Linear Regression (The Main Algorithm)

The full p-adic regression is achieved via Algorithm 8, which iteratively refines the solution digit by digit (from the least significant digit to higher powers of $p$).

Step 1 (Least Significant Digit): Apply the modulo $p$ regression (Algorithm 6) to the data reduced modulo $p$ to estimate $\vec{c} \pmod p$.
Step 2 (Residual Calculation): Calculate the residuals $y_i - \langle \vec{c}_{\text{current}}, \vec{x}_i \rangle$.
Step 3 (Noise Filtering & Scaling):
- Filter the dataset to keep only samples where the residual is divisible by the next power of $p$ (effectively identifying the "noise-free" locus for the next digit).
- Divide the residuals by $p$ to shift the problem to the next digit position.
Step 4 (Recursion): Repeat the process for the next digit. The algorithm accumulates the digits to reconstruct the p-adic coefficient vector up to the chosen precision of $E$ digits, that is, $\vec{c} \pmod{p^E}$.
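
The four steps can be sketched for the simplest possible case: a single coefficient ($D = 1$) and integer data. This is a toy under stated assumptions, not Algorithm 8. The majority-vote helper `fit_digit_mod_p` stands in for the modulo $p$ solver, and the function names and synthetic data are invented for the illustration. Each noisy sample is corrupted at exactly one base-$p$ digit, mimicking digitwise noise.

```python
from collections import Counter


def fit_digit_mod_p(samples, p):
    """Stand-in mod-p solver for one coefficient: majority vote over t / x mod p."""
    votes = Counter((t * pow(x, -1, p)) % p for x, t in samples if x % p != 0)
    return votes.most_common(1)[0][0]


def digitwise_fit(xs, ys, p, num_digits):
    """Estimate c modulo p**num_digits from samples y ~ c * x, digit by digit.
    Returns the estimate and the number of samples that survived filtering."""
    c = 0
    samples = list(zip(xs, ys))  # pairs (x, residual already divided by p**k)
    for k in range(num_digits):
        digit = fit_digit_mod_p(samples, p)       # Step 1: regression mod p
        c += digit * p ** k
        samples = [
            (x, (t - digit * x) // p)             # Step 3b: divide residual by p
            for x, t in samples
            if (t - digit * x) % p == 0           # Steps 2 and 3a: residual, filter
        ]
    return c, len(samples)


# Synthetic data: every tenth sample gets a nonzero error at one base-p digit.
p, num_digits, true_c = 7, 4, 1234
xs = list(range(1, 201))
ys = []
for i, x in enumerate(xs):
    y = true_c * x
    if i % 10 == 0:
        y += (1 + i % (p - 1)) * p ** ((i // 10) % num_digits)
    ys.append(y)

estimate, kept = digitwise_fit(xs, ys, p, num_digits)
print(estimate, kept)
```

A sample corrupted at digit $j$ behaves like clean data for every earlier digit and is discarded at step $j$, once its residual stops being divisible by $p$. The filter therefore removes noise gradually rather than all at once.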

3. Key Contributions

New Probabilistic Algorithm: Introduction of Algorithm 8, a novel method for p-adic linear regression that avoids gradient descent and least squares, relying instead on probabilistic inclusion testing and digit-wise refinement.
Modulo $p$ Solver: Development of Algorithm 6, a robust heuristic for solving the APX-complete maximal feasible subsystem problem in $\mathbb{F}_p$ under random sampling assumptions.
Theoretical Framework: Formalization of "digitwise noise" and the conditions under which p-adic regression is feasible (specifically, the requirement that the noise-free locus is sufficiently large and the data is sufficiently random).
Complexity Analysis: The paper provides a theoretical lower bound on the expected number of trials required, showing that the algorithm's efficiency depends heavily on the noise probability $r$ and the dimension $D$.

4. Experimental Results

The author conducted experiments with varying dimensions ($D = 20, 40, \dots, 100$) and noise probabilities ($r = 0.01, 0.03$) using $p=7$.

Metrics: The performance was measured by the number of initialization retries ($c_0$) and the number of search retries for extending the index set ($c_1$).
Findings:
- Low Noise ($r=0.01$): The algorithm performed efficiently, often converging in very few trials (frequently 0 or 1 retry).
- Higher Noise ($r=0.03$): The number of retries increased, particularly as the dimension $D$ grew. For $D=100$ and $r=0.03$, some cases required over 100 retries.
- Failure Modes: The paper notes that if $r$ is too high (e.g., $r=0.1$) or $D$ is very large, the expected number of trials grows exponentially ($\approx (1-r)^{-D}$), potentially causing the algorithm to fail to terminate within practical time limits.
- Correctness: In all terminating cases reported, the algorithm successfully recovered the correct coefficient vector.
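
The growth rate quoted under Failure Modes is easy to tabulate. The short sketch below evaluates the approximation $(1-r)^{-D}$ for the dimensions and noise levels mentioned above; the function name `expected_trials` is a label chosen here, and the printed values are outputs of that formula only, not measurements from the paper. One way to read the formula, offered as an interpretation, is that $(1-r)^D$ is the chance that $D$ independently drawn samples are all noise-free.

```python
def expected_trials(noise_rate, dim):
    """Approximate expected number of random draws, (1 - r) ** (-D)."""
    return (1.0 - noise_rate) ** (-dim)


for r in (0.01, 0.03, 0.1):
    row = ", ".join(
        f"D={d}: {expected_trials(r, d):,.1f}" for d in range(20, 101, 20)
    )
    print(f"r={r}: {row}")
```

At $r = 0.01$ the estimate stays in single digits up to $D = 100$, at $r = 0.03$ it reaches the low twenties, and at $r = 0.1$ it climbs into the tens of thousands. That pattern is consistent with the reported jump in retries between the two tested noise levels.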

5. Significance and Implications

Bridging Number Theory and Machine Learning: This work provides a concrete computational tool for p-adic machine learning, a field gaining traction for its potential in handling hierarchical data (via ultrametric spaces) and cryptographic applications.
Alternative to Gradient Descent: It demonstrates that optimization in non-Archimedean spaces requires fundamentally different strategies than in Euclidean spaces. The digitwise approach leverages the unique structure of p-adic integers (where each additional base-$p$ digit adds one more level of precision) rather than trying to minimize a global error sum.
Applications: The method is relevant for:
- p-adic Neural Networks: Providing a training mechanism for networks based on p-adic activation functions.
- Cryptography: Analyzing or breaking systems based on p-adic equations.
- Data Analysis: Handling data with hierarchical or tree-like structures where p-adic metrics are natural.

In summary, Mihara proposes a mathematically rigorous and computationally feasible way to perform linear regression in the p-adic domain by transforming a difficult global optimization problem into a sequence of simpler, probabilistic finite-field problems, solved iteratively from the least significant digit upwards.
