Imagine you are trying to teach a computer to think like a human brain. For decades, we've built artificial brains (Neural Networks) that are incredibly smart at specific tasks, but they are still clumsy compared to real biology. They are like rigid, one-way streets where information only flows forward, and they only deal in "facts" (single numbers), ignoring the "uncertainty" (how likely something is to be true).
This paper proposes a new type of artificial neuron called HCRNN (Hierarchical Correlation Reconstruction Neural Network). Think of it as upgrading a simple calculator into a sophisticated weather forecaster that lives inside every single brain cell.
Here is the breakdown using simple analogies:
1. The Problem: The One-Way Street vs. The Roundabout
Current AI (MLP/KAN): Imagine a factory assembly line. A part comes in, gets stamped, and moves to the next station. It only goes one way. If you ask the machine, "What if we sent the part backward?" it gets confused. It only knows the final answer, not the "what-ifs."
The Biological Brain: Real neurons are like a busy roundabout or a Swiss Army knife. They can send signals forward, backward, and sideways. They don't just say "It's raining"; they say, "It's 80% likely to rain, but there's a 20% chance of a sudden storm." They handle uncertainty and direction naturally.
2. The Solution: The "Probability Cloud" Neuron
The author suggests replacing the standard "fact-based" neuron with a "Joint Distribution" neuron.
- The Old Way: A neuron outputs a single number, like "5."
- The New Way (HCR): This neuron outputs a cloud of possibilities. Instead of just "5," it says, "The answer is likely 5, but it could be 4 or 6, and here is the exact shape of that probability."
The Analogy:
Imagine you are guessing the weight of a mystery box.
- Old AI: Guesses "10 kg." (Done. No sense of how far off that guess might be.)
- HCR Neuron: Draws a map. "It's probably around 10 kg, very likely between 9 and 11 kg, and very unlikely to be 20 kg." It carries the shape of the uncertainty with it.
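To make the contrast concrete, here is a minimal sketch in plain NumPy, with a histogram standing in for the real HCR machinery (the data is an invented sample of past box weights):

```python
import numpy as np

# A point-estimate "neuron" vs. a distribution "neuron", on the box-weight guess.
rng = np.random.default_rng(42)
weights = 10 + rng.standard_normal(1000)   # past boxes: roughly 10 kg +/- 1 kg

# Old way: one number, and nothing else.
point_guess = weights.mean()               # about 10

# New way: the whole shape, here as a density over candidate weights.
bins = np.linspace(6, 14, 33)
density, _ = np.histogram(weights, bins=bins, density=True)
# density peaks near 10 kg, is smaller near 9 and 11 kg,
# and is essentially zero anywhere near 20 kg
```

The point estimate and the density answer different questions: the first says "what is the weight?", the second says "how confident should I be about every possible weight?"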
3. How It Works: The "Lego Block" of Math
The paper uses a mathematical trick called HCR (Hierarchical Correlation Reconstruction).
Think of a complex relationship between variables (like how temperature, humidity, and wind speed affect a storm) as a giant, messy 3D puzzle.
- Standard AI tries to solve the whole puzzle at once, which is hard and rigid.
- HCR breaks the puzzle down into Lego blocks called "moments."
- Block 1: The average (Expected Value).
- Block 2: How much it varies (Variance).
- Block 3: How skewed or weird it is (Skewness).
- Block 4: How "spiky" the data is (Kurtosis).
The neuron stores these blocks as a simple list of numbers (coefficients). Because they are just blocks, you can rearrange them easily.
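As a rough sketch of what those blocks look like in practice: HCR rescales each variable to [0, 1] and expands its density in orthonormal polynomials there, so each coefficient is literally just an average over the data. The basis below is the standard Legendre basis rescaled to [0, 1]; the Beta-distributed toy data is an assumption for the demo:

```python
import numpy as np

# Orthonormal Legendre polynomials on [0, 1] -- the "Lego blocks".
# The degree-1..4 coefficients play roles analogous to mean, variance,
# skewness, and kurtosis (moment-like, not the classical moments themselves).
def f(j, x):
    if j == 0: return np.ones_like(x)
    if j == 1: return np.sqrt(3.0) * (2*x - 1)
    if j == 2: return np.sqrt(5.0) * (6*x**2 - 6*x + 1)
    if j == 3: return np.sqrt(7.0) * (20*x**3 - 30*x**2 + 12*x - 1)
    if j == 4: return 3.0 * (70*x**4 - 140*x**3 + 90*x**2 - 20*x + 1)
    raise ValueError(j)

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=10_000)   # toy data, already in [0, 1], skewed left

# Each coefficient is just an average -- the neuron's "list of numbers".
a = np.array([f(j, x).mean() for j in range(5)])
# a[0] is always 1 (normalization); a[1] is negative here,
# because this sample's mean sits below the middle of [0, 1]
```

Because estimating a coefficient is just averaging, the blocks can be updated cheaply as new data arrives, which is part of what makes them easy to "rearrange."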
- Multidirectional Propagation: If you know the "Wind" and "Humidity," you can use the blocks to predict "Temperature." But if you know "Temperature" and "Wind," you can flip the blocks and predict "Humidity." It's like having a 3D map where you can walk in any direction, not just forward.
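The "walk in any direction" trick can be sketched in a few lines: estimate a matrix of joint coefficients once, then condition on either variable just by reading the matrix one way or the other (transposing it). This is a simplified two-variable, low-degree sketch with invented toy data, not the paper's full construction:

```python
import numpy as np

# Orthonormal Legendre basis on [0, 1], degrees 0..3.
def f(j, x):
    x = np.asarray(x, dtype=float)
    if j == 0: return np.ones_like(x)
    if j == 1: return np.sqrt(3.0) * (2*x - 1)
    if j == 2: return np.sqrt(5.0) * (6*x**2 - 6*x + 1)
    if j == 3: return np.sqrt(7.0) * (20*x**3 - 30*x**2 + 12*x - 1)
    raise ValueError(j)

rng = np.random.default_rng(1)
x = rng.uniform(size=20_000)
y = np.clip(x + 0.1 * rng.standard_normal(20_000), 0, 1)  # y tracks x, plus noise

# Joint "Lego blocks": a[j, k] = average of f_j(x) * f_k(y).
a = np.array([[(f(j, x) * f(k, y)).mean() for k in range(4)]
              for j in range(4)])

def predict(a, known):
    """Conditional coefficients of the unknown variable, given the known one."""
    fx = np.array([f(j, known) for j in range(4)])
    b = a.T @ fx
    return b / b[0]            # renormalize so the degree-0 coefficient is 1

b_y = predict(a, 0.8)     # forward:  know x = 0.8, describe y's distribution
b_x = predict(a.T, 0.8)   # backward: know y = 0.8, describe x -- just transpose!
# b_y[1] > 0: y's conditional mean shifts above 1/2 when x = 0.8, as expected
```

The same stored matrix serves both directions; which variable is "input" is decided at prediction time, not baked into the model.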
4. The "Information Bottleneck": The Filter
One of the biggest problems in AI is that it gets overwhelmed by too much data (noise).
- The Analogy: Imagine trying to listen to a friend in a noisy concert. You need to filter out the music to hear the voice.
- The HCR Advantage: This new method uses a concept called the Information Bottleneck. It acts like a smart filter that asks: "What information is actually useful for the next step, and what is just noise?"
- Because the neuron understands the shape of the data (the probability distribution), it can filter out the "noise" much better than current AI, which just sees numbers. It's like having noise-canceling headphones that understand the music of the data, not just the volume.
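One simple way to picture such a filter (a sketch of the idea, not the paper's exact bottleneck criterion): rank the neuron's estimated coefficients by magnitude and keep only the strongest few, since the tiny ones are mostly estimation noise:

```python
import numpy as np

def bottleneck(coeffs, keep=3):
    """Keep only the `keep` largest-magnitude coefficients; zero out the rest."""
    out = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs).ravel())[-keep:]  # indices of strongest blocks
    out.ravel()[idx] = coeffs.ravel()[idx]
    return out

# Invented coefficient matrix: a few strong dependencies, lots of tiny noise.
a = np.array([[1.00, 0.02, 0.01],
              [0.60, 0.03, 0.01],
              [0.02, 0.25, 0.02]])
filtered = bottleneck(a, keep=3)   # keeps 1.00, 0.60, 0.25; zeros everything else
```

Only the three dominant blocks survive, so whatever the next layer receives is, by construction, the most informative part of the signal.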
5. Why This Matters: The "Super-Embedding"
The paper suggests this could revolutionize things like Transformers (the tech behind Chatbots).
- Current Embeddings: When a computer reads the word "Adult," it assigns it a single vector (a list of numbers). It's a bit like saying "Adult = 30 years old."
- HCR Embeddings: It realizes "Adult" is a range. It could be 20, 40, or 60. So, instead of a single number, the word "Adult" becomes a probability cloud representing the whole range of adulthood.
- The Result: The AI becomes more flexible. It understands that "Adult" has a wide variance, while "Toddler" has a narrow one. This makes the AI more robust, less likely to make silly mistakes, and better at handling real-world ambiguity.
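A hypothetical toy version of such an embedding (the words, numbers, and one-dimensional Gaussian form are all illustrative assumptions, not the paper's construction): each word stores a mean and a spread, and similarity accounts for both:

```python
import numpy as np

# Hypothetical probabilistic embeddings: each word is a 1-D Gaussian
# (mean, std) instead of a single point. All values are made up for the demo.
embeddings = {
    "adult":   (40.0, 15.0),   # wide concept: large spread
    "toddler": ( 2.0,  1.0),   # narrow concept: small spread
}

def similarity(w1, w2):
    """Bhattacharyya coefficient between two 1-D Gaussians (1.0 = identical)."""
    (m1, s1), (m2, s2) = embeddings[w1], embeddings[w2]
    v1, v2 = s1**2, s2**2
    return np.sqrt(2*s1*s2 / (v1 + v2)) * np.exp(-(m1 - m2)**2 / (4*(v1 + v2)))

print(similarity("adult", "adult"))    # 1.0: identical clouds
print(similarity("adult", "toddler"))  # small: the clouds barely overlap
```

A point embedding would force "adult" and "toddler" to the same kind of object; here the width itself carries meaning, which is exactly the flexibility the paper is after.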
Summary: The "Swiss Army Knife" Upgrade
The paper proposes upgrading our AI neurons from single-purpose hammers (good for hitting one nail in one direction) to Swiss Army Knives (capable of cutting, sawing, and turning screws in any direction).
By teaching these neurons to carry probability clouds instead of just facts, and by letting them flip directions like real biological neurons, we might finally create AI that is as flexible, robust, and adaptable as the human brain. It's not just about making the AI smarter; it's about making it understand the uncertainty of the world, just like we do.