This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Big Idea: How the Brain Predicts the Future
Imagine you are playing catch. A ball is thrown toward you. Your brain has to solve two problems instantly:
- Where will the ball land? (Physics)
- Where will my hand be when it gets there? (Body awareness)
Most modern AI tries to solve this by taking a picture, squishing it down into a tiny, abstract list of numbers called a latent code (like a secret code), and then guessing what happens next. The problem? In this "secret code" world, objects can teleport. A ball can jump from the top of the screen to the bottom in a single step because the AI doesn't care about the space between them.
This paper argues that the human brain doesn't work that way. Instead of compressing the world into a secret code, the brain keeps the map. It preserves the shape and layout of what it sees. If a ball moves, the brain's prediction moves smoothly across the map, just like a real ball moving through the air.
The authors call this an "Isomorphic World Model." "Isomorphic" is a fancy word meaning "having the same shape." Their model keeps the shape of the real world inside the computer.
The Engine: A "Living" Grid
To build this, the researchers used something called Neural Fields.
The Analogy: The Ripple in a Pond
Imagine a calm pond. If you drop a stone in one corner, a ripple spreads out. It doesn't jump to the other side of the pond instantly; it has to travel through the water, touching every spot in between.
- Old AI (Latent Space): Like a magician pulling a rabbit out of a hat. The rabbit (the ball) disappears from the hat and reappears in your hand instantly. No travel time.
- This New AI (Neural Field): Like the ripple in the pond. The "activity" (the prediction of where the ball is) spreads from neighbor to neighbor. It physically cannot teleport. It has to move through the space, just like a real object.
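The ripple idea can be sketched in a few lines of code. This is a toy illustration of locality, not the paper's actual architecture: each cell of a 1-D "field" is updated only from its immediate neighbors, so activity has to travel through every spot in between. The grid size and kernel weights are assumptions chosen for clarity.

```python
import numpy as np

def local_step(field, kernel):
    """One update: each cell mixes only with its immediate neighbors."""
    return np.convolve(field, kernel, mode="same")

field = np.zeros(20)
field[2] = 1.0                        # the "stone dropped" at position 2
kernel = np.array([0.25, 0.5, 0.25])  # only nearest neighbors contribute

for _ in range(5):
    field = local_step(field, kernel)

# After 5 steps, activity has spread at most 5 cells from the source:
# it is nonzero at cell 7 but still exactly zero at cell 8. The math
# itself forbids teleporting.
```

Because the update kernel only touches adjacent cells, the "ball" can move at most one cell per step, no matter what the model predicts.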
The Secret Sauce: Motor-Gated Channels
The brain doesn't just watch the world; it acts on it. When you move your arm, your brain knows exactly how your vision will change.
The researchers added a special feature called Motor-Gated Channels.
The Analogy: The Dimmer Switch
Imagine a room full of lightbulbs (the neural field). Some of these bulbs are connected to a special dimmer switch controlled by your muscles.
- When you decide to move your arm, you flip the switch.
- This doesn't just turn the lights on or off; it scales them. It makes the lights representing your arm brighter or dimmer based on your movement command.
- This is called Gain Modulation. It's how your brain says, "I am moving my hand, so I expect to see my hand move here."
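Gain modulation is easy to show in miniature. In this hedged sketch (the mask, field values, and gain rule are illustrative assumptions, not the paper's exact equations), a motor command multiplicatively scales only the "arm" part of the field, leaving the rest untouched:

```python
import numpy as np

def gated_update(field, arm_mask, motor_command):
    """Scale arm-related activity by the motor command (gain modulation)."""
    gain = 1.0 + motor_command          # the command brightens/dims the "bulbs"
    return np.where(arm_mask, field * gain, field)

field = np.ones(8)                      # a tiny field, all bulbs at brightness 1
arm_mask = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=bool)  # cells 3-5 are "arm"

moved = gated_update(field, arm_mask, motor_command=0.5)  # arm cells -> 1.5
still = gated_update(field, arm_mask, motor_command=0.0)  # nothing changes
```

Note the key design point: the command doesn't switch bulbs on or off, it rescales them, so the same circuitry can express "move a little" or "move a lot."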
The Three Experiments (The Proof)
The team tested this idea in three ways:
1. The Ballistic Test (No Teleporting)
They showed the AI a ball falling under gravity for three seconds, then turned off the camera.
- The Result: The "Old AI" (VAE-LSTM) got confused. Its prediction of the ball would sometimes jump erratically across the screen (teleport).
- The New AI: Because it uses the "ripple" method (local connections), the ball's prediction moved in a smooth, perfect curve. It couldn't jump because the math forced it to travel through the intermediate spots.
2. The Dream Training (Practicing in Your Head)
This is the coolest part. They trained a robot arm to catch a falling ball entirely inside the AI's imagination.
- They froze the "World Model" (the simulator) and let the robot practice catching balls inside it.
- Then, they took that robot and put it in the real world.
- The Result: The robot trained in the "dream" caught the ball 81.5% of the time. The "Old AI" trained in a dream only caught it 46% of the time.
- Why? Because the "Dream" world looked exactly like the real world. The robot learned the geometry of the catch, not just a secret code. It's like practicing a piano piece in your head; if you visualize the keys correctly, your fingers know where to go.
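The dream-training loop can be sketched abstractly. Everything below is a stand-in: the "frozen world model" is replaced by toy drift physics, and the "policy" is a single proportional gain tuned by trial and error inside the imagination, nothing like the paper's learned networks. The skeleton is the point: the simulator's weights never change while the policy practices.

```python
import random

def frozen_world_model(ball_x, ball_vx):
    """Imagined physics step (toy stand-in for the learned neural field)."""
    return ball_x + ball_vx, ball_vx

def dream_rollout(policy_gain):
    """Let the policy chase an imagined ball for 20 steps; return the final gap."""
    ball_x, ball_vx = random.uniform(-1, 1), random.uniform(-0.05, 0.05)
    hand_x = 0.0
    for _ in range(20):
        ball_x, ball_vx = frozen_world_model(ball_x, ball_vx)
        hand_x += policy_gain * (ball_x - hand_x)   # simple proportional policy
    return abs(ball_x - hand_x)

# "Practice" entirely in imagination: keep the gain whose dreams end
# with the hand closest to the ball, averaged over 50 imagined catches.
random.seed(0)
best_gain = min((g / 10 for g in range(1, 10)),
                key=lambda g: sum(dream_rollout(g) for _ in range(50)))
```

If the dream physics matches the real physics, `best_gain` transfers directly; the paper's 81.5% vs. 46% gap is the claim that a spatially faithful dream transfers far better than a latent-code one.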
3. The Body Discovery (Finding "Me")
Finally, they asked: Can the AI figure out what is "itself" (the arm) and what is "the world" (the ball) without being told?
- They didn't label the arm or the ball. They just let the AI predict what happens when it moves.
- The Result: The AI spontaneously figured it out! The parts of the AI that controlled movement (the motor gates) started lighting up only over the arm, ignoring the ball.
- The Lesson: The AI realized, "When I send a command, this part of the screen moves. That must be me." It discovered its own "body schema" just by trying to predict the future.
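The logic of that discovery can be captured in a tiny sketch. The two-pixel world below (one pixel driven by the motor command, one drifting on its own) is an illustrative assumption, not the paper's setup; it shows the statistical signal the model can latch onto: only "self" pixels co-vary with the commands you issue.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
motor = rng.normal(size=T)                      # random motor commands
arm_pixel = motor + 0.1 * rng.normal(size=T)    # moves when commanded
ball_pixel = rng.normal(size=T)                 # moves on its own

def motor_correlation(pixel, motor):
    """How strongly does this pixel track the motor command?"""
    return abs(np.corrcoef(pixel, motor)[0, 1])

# "This part moves when I command it -- that must be me."
is_me = {name: motor_correlation(p, motor) > 0.5
         for name, p in [("arm", arm_pixel), ("ball", ball_pixel)]}
```

No labels are needed: the arm pixel correlates strongly with the command, the ball pixel doesn't, and the boundary between "me" and "world" falls out of prediction alone.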
Why Does This Matter?
- It's More Like Us: This model mimics how our brains actually work (keeping spatial maps) rather than how current computers work (compressing data).
- Better Learning: Because the AI understands space, it can learn skills in a simulation and use them in the real world much better than current AI.
- Self-Awareness: It suggests that we don't need to be "taught" what our body is. We just learn it by noticing that we are the only things that move when we decide to move.
The Bottom Line
The brain predicts the future not by doing abstract math on a list of numbers, but by running a movie in its head. In this movie, objects move smoothly through space, and when we move our bodies, the movie updates in real-time.
By building AI that works the same way—keeping the map, respecting the distance, and linking movement to vision—we can create machines that learn faster, move more naturally, and perhaps even understand what it means to have a body.