The Big Problem: The "Memory vs. Stability" Dilemma
Imagine you are trying to teach a robot to remember a long story or predict the weather for the next 100 years. Current AI tools face a frustrating choice, like a car with only two gears:
- The "Discrete" Gear (LSTMs): These are like a robot taking giant, jerky steps. They are very expressive and can learn complex things, but because they step so roughly, they often trip over themselves. Over a long time, tiny errors pile up, causing the robot to either spin out of control (exploding gradients) or freeze completely (vanishing gradients).
- The "Continuous" Gear (Neural ODEs): These are like a robot gliding on ice. They move smoothly and are very stable, but they are also "dissipative." This means they slowly lose energy and information as they move. It's like a cup of hot coffee left on a table; eventually, it cools down to room temperature and loses its distinct "hotness." The AI forgets the details of the story to stay stable.
The Goal: The authors want a robot that can glide smoothly without losing its energy or forgetting the story. They want a system that is both stable and remembers everything perfectly.
The Solution: CHLU (The "Clue")
The authors propose a new building block for AI called CHLU (pronounced "Clue"). Think of it as a Physics-Based Time Machine for data.
Instead of just guessing the next step in a sequence, CHLU treats data like a physical object moving through space and time, governed by the laws of physics.
Here are the three main "superpowers" of CHLU:
1. The Speed Limit (Relativistic Kinetic Governor)
- The Analogy: Imagine a car driving on a highway. In normal AI, if the car hits a bump, it might accelerate infinitely fast, fly off the road, and crash.
- How CHLU fixes it: CHLU puts a "speed limit" (the speed of light, c) on the data. No matter how hard the AI tries to accelerate, it physically cannot go faster than this limit.
- The Result: If the AI gets confused or sees a weird noise spike, it doesn't explode. It just smoothly slows down or changes direction. It prevents the "crashes" that happen in other models.
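The "speed limit" idea can be illustrated with a small numerical sketch. This is not the paper's exact governor — the `relativistic_clip` helper, the unit mass, and c = 1 are illustrative assumptions — but it shows the core mechanism: updates act on an unbounded momentum, while the velocity recovered from that momentum saturates strictly below c, so even a huge force cannot make the state "fly off the road."

```python
import numpy as np

def relativistic_clip(force, velocity, dt, c=1.0, mass=1.0):
    """Update a velocity under a force while keeping |v| < c.

    Works in momentum space: momentum p is unbounded, but the mapped
    velocity v = p / sqrt(m^2 + |p|^2 / c^2) saturates at c.
    (Hypothetical helper; the paper's actual governor may differ.)
    """
    # Recover relativistic momentum p = m * gamma * v from the current velocity.
    gamma = 1.0 / np.sqrt(1.0 - np.sum(velocity**2) / c**2)
    p = mass * gamma * velocity
    # An arbitrarily large force is applied to the momentum...
    p = p + force * dt
    # ...but the velocity read back from it still saturates below c.
    return p / np.sqrt(mass**2 + np.sum(p**2) / c**2)

# Push hard in one direction for a long time: the speed approaches
# but never exceeds c = 1.
v = np.zeros(2)
for _ in range(1000):
    v = relativistic_clip(force=np.array([50.0, 0.0]), velocity=v, dt=0.1)
print(np.linalg.norm(v))  # just under 1.0
```

A noise spike in the force changes the momentum a lot but the velocity only a little once it is near saturation, which is exactly the "smoothly slows down instead of exploding" behavior described above.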
2. The Perfect Rollercoaster (Symplectic Integration)
- The Analogy: Imagine a rollercoaster in a perfect vacuum with no friction. If you push the cart up a hill, it will go down, up the next hill, and keep going forever without ever losing height. It never stops, and it never forgets how high it started.
- How CHLU works: Most AI models are like rollercoasters with friction; they lose height (information) over time. CHLU uses a special mathematical trick called Symplectic Integration. This ensures that the "energy" of the data is strictly conserved.
- The Result: The AI can run for an infinite amount of time (infinite horizon) and still remember the exact shape of the path it started on. It doesn't "cool down" or forget.
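The frictionless-rollercoaster behavior can be demonstrated with the simplest symplectic scheme, the leapfrog (velocity Verlet) integrator. The harmonic-oscillator potential below is a toy stand-in for the paper's learned dynamics, but the key property is generic: the energy neither decays nor blows up, even over very long horizons.

```python
import numpy as np

def leapfrog(q, p, grad_V, dt, steps):
    """Symplectic leapfrog integration of dq/dt = p, dp/dt = -grad_V(q).

    Unlike forward Euler (which slowly gains or loses energy), leapfrog
    preserves the phase-space geometry, so the total energy only
    oscillates within a tiny bounded band.
    """
    for _ in range(steps):
        p = p - 0.5 * dt * grad_V(q)  # half kick
        q = q + dt * p                # full drift
        p = p - 0.5 * dt * grad_V(q)  # half kick
    return q, p

# Harmonic oscillator: V(q) = q^2 / 2, total energy E = (p^2 + q^2) / 2.
grad_V = lambda q: q
q0, p0 = 1.0, 0.0
E0 = 0.5 * (p0**2 + q0**2)

q, p = leapfrog(q0, p0, grad_V, dt=0.1, steps=100_000)
print(abs(0.5 * (p**2 + q**2) - E0))  # stays tiny even after 100,000 steps
```

Swapping the half-kick/drift/half-kick structure for a plain Euler update makes the same oscillator spiral outward or inward — the "rollercoaster with friction" failure mode described above.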
3. The Dreaming Machine (Thermodynamic Generation)
- The Analogy: Imagine you have a pile of sand (noise) and you want to build a sandcastle (a picture of a cat).
- Old AI tries to force the sand into a shape.
- CHLU acts like a thermodynamic sculptor. It heats up the sand (adding energy) so the grains can move around freely, and then slowly cools it down (annealing). As it cools, the sand naturally settles into the deepest, most stable "valleys" of the landscape.
- The Result: The AI "crystallizes" random noise into structured images (like digits from the MNIST dataset). It doesn't just memorize; it understands the "shape" of the data so well that it can create new examples from scratch.
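The heat-then-cool idea can be sketched with plain annealed Langevin dynamics on a toy double-well energy landscape. The `annealed_langevin` function, the temperature schedule, and the double well are illustrative stand-ins (the paper works with a learned energy over images, and its exact sampler may differ), but the mechanism is the same: noisy exploration at high temperature, then gradual cooling into the deepest valleys.

```python
import numpy as np

def annealed_langevin(energy_grad, x, temps, step=0.01, n_inner=200, rng=None):
    """Anneal samples toward low-energy states with Langevin dynamics.

    At each temperature T, take noisy gradient steps
        x <- x - step * grad E(x) + sqrt(2 * step * T) * noise,
    then lower T so the samples settle into the deepest valleys.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    for T in temps:
        for _ in range(n_inner):
            noise = rng.standard_normal(x.shape)
            x = x - step * energy_grad(x) + np.sqrt(2 * step * T) * noise
    return x

# Toy double-well energy E(x) = (x^2 - 1)^2, with minima at x = +/-1.
energy_grad = lambda x: 4 * x * (x**2 - 1)

x = np.random.default_rng(1).standard_normal(500)   # start from pure noise
x = annealed_langevin(energy_grad, x, temps=[1.0, 0.3, 0.1, 0.01])
print(np.abs(x).mean())  # settles close to 1.0: samples "crystallize" at the minima
```

At high temperature the noise term dominates and the samples wander freely; as T drops, the gradient term wins and the initially random points condense onto the two wells — the 1-D analogue of noise crystallizing into digits.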
How It Was Tested (The Experiments)
The authors tested CHLU against the old "Discrete" and "Continuous" models:
- The Infinity Loop (Lemniscate): They asked the AI to draw a figure-eight shape forever.
- Old AI: Drifted away or spiraled into a dot.
- CHLU: Drew the figure-eight perfectly, over and over, forever. It respected the geometry.
- The Shaky Wave (Perturbed Sine Wave): They gave the AI a wobbly starting point.
- Old AI: Tried to fix the wobble instantly by accelerating to infinite speed (physically impossible).
- CHLU: Accepted the wobble, smoothed it out, and kept the wave moving at a safe, constant speed.
- The Art Gallery (MNIST Generation): They asked the AI to draw numbers.
- CHLU: Successfully turned random static noise into clear, recognizable numbers by "cooling" the noise down.
The Bottom Line
The paper argues that to build better AI that understands the real world, we shouldn't just try to make the math "smarter." Instead, we should hard-code the laws of physics into the AI's brain.
By treating information like energy and time like a physical dimension, CHLU solves the age-old problem of "stability vs. memory." It creates an AI that is as stable as a rock but remembers as much as a library, all while obeying a strict speed limit to prevent chaos.
In short: CHLU is an AI unit that doesn't just "learn" patterns; it lives inside a physics simulation where energy is never lost, ensuring it never forgets and never breaks.