Imagine you have a super-smart robot student. You want to teach it how planets move around the sun. You give it a massive history book of where the planets have been, and you ask it to guess where they will be next.

The big question this paper asks is: Can this robot student just memorize the path, or can it actually understand the laws of physics that cause the movement?

The authors found that without some special "training wheels" (which they call inductive biases), the robot is a brilliant memorizer but a terrible physicist. It learns to draw the path perfectly but has no idea why the planet is moving that way.

Here is the story of how they fixed the robot, broken down into three simple lessons.

The Problem: The Robot is a "Curve-Fitter," Not a "Physicist"

Think of the robot's brain as a giant library.

The Kepler Approach (What the robot did naturally): The robot looks at the last 1,000 points of a planet's journey. It says, "Aha! I see the pattern. It's an oval shape. I will just keep drawing the oval." It's like a child tracing a picture. It gets the picture right, but if you ask, "Why is it an oval?" or "What force is pulling it?", the robot has no answer. It just knows the shape.
The Newton Approach (What we want): We want the robot to say, "The sun is pulling the planet with gravity. If I know the planet's current speed and position, I can calculate the pull and predict the next step." This is understanding the cause, not just the effect.

The paper shows that standard AI models (Transformers) naturally become "tracers" (Kepler) and fail to become "calculators" (Newton). To fix this, the authors added three specific "training wheels."

Lesson 1: The "Pixelated Map" Problem (Spatial Smoothness)

The Analogy: Imagine you are trying to teach a robot to navigate a city.

The Mistake: You give the robot a map where every single street corner is a completely different, random color. "Red" is the corner of 1st and Main. "Blue" is the corner of 1st and 2nd. Even though these corners are right next to each other, the robot sees them as totally unrelated. It has to relearn the relationship between "Red" and "Blue" from scratch every time.
The Diagnosis: The authors realized that chopping the planet's position into tiny "bins" (like pixels) breaks the natural smoothness of space.
The Solution: They made the "bins" bigger (fewer colors) or stopped using bins entirely and just gave the robot the exact coordinates (like a GPS). This allowed the robot to see that "Point A" is right next to "Point B," helping it build a real mental map of space instead of a confusing jumble of random codes.

Lesson 2: The "Domino Effect" Problem (Spatial Stability)

The Analogy: Imagine playing a game of "Telephone" where you whisper a number to the next person.

The Mistake: If the first person whispers "50.1" and the second person hears "50.2," the third person might hear "50.5," and by the time it gets to the end, the number is "100." In physics, if the robot makes a tiny mistake predicting the planet's position, that mistake gets bigger and bigger with every step, until the planet flies off into deep space or crashes into the sun.
The Diagnosis: The authors realized that standard AI training is too "perfect." The robot only ever sees flawless past data, so it never practices recovering from its own mistakes.
The Solution: They started "breaking" the robot's training data on purpose. They added a little random noise (like static on a radio) to the history the robot was reading. This forced the robot to learn how to recover from small mistakes, making it robust enough to predict the future without the errors piling up.

Lesson 3: The "Long Memory" vs. "Short Memory" Problem (Temporal Locality)

The Analogy: Imagine trying to guess where a rollercoaster cart will go next. This is the most important of the three lessons.

The Long Memory (Kepler): Imagine a robot that remembers everything that happened in the last hour. When it tries to guess what happens next, it looks at the whole hour of history to draw a giant curve. It's like looking at a whole rollercoaster track to guess where the cart is going next. It works for the curve, but it doesn't understand the physics.
The Short Memory (Newton): Now, imagine a robot that is only allowed to remember the last two seconds. It can't see the whole track. It must look at where the cart is right now and how fast it's going right now to figure out where it goes next.
The Solution: The authors forced the robot to have a short memory. They told it, "You can only look at the immediate past."
The Result: Because the robot couldn't rely on the "big picture" curve anymore, it was forced to figure out the rules of the game. It had to calculate the invisible "pull" (gravity) acting on the planet right now to predict the next step. Suddenly, the robot stopped drawing ellipses and started calculating forces. It became a physicist.

The Big Takeaway

The paper concludes that how you design the AI's brain determines what it learns.

If you let it look at everything and use a pixelated map, it becomes a curve-fitter (Kepler). It draws pretty pictures but doesn't understand the universe.
If you give it a smooth map, teach it to handle mistakes, and force it to have a short memory, it becomes a physicist (Newton). It discovers the laws of gravity on its own.

The authors show that you don't need to program the laws of physics into the AI. You just need to give it the right "inductive biases" (the right training constraints), and it will discover the laws itself.

Technical Summary: From Kepler to Newton: Inductive Biases Guide Learned World Models in Transformers

1. Problem Statement

The paper addresses a critical gap in the capabilities of general-purpose foundation models (Transformers) regarding scientific discovery. While previous "AI Physicist" approaches have successfully recovered symbolic physical laws, they often rely on strong, domain-specific priors that effectively "bake in" the physics. Conversely, recent work by Vafa et al. (2025) demonstrated that generic Transformers, even at GPT-2 scale, fail to acquire "world models"—causal abstractions that explain why phenomena occur. Instead, these models achieve high predictive accuracy by learning geometric curve-fitting (Keplerian models) without capturing the underlying dynamical laws (Newtonian mechanics).

The central research question is: Why do Transformers fail to learn the Newtonian world model for planetary motion, and how can this be fixed? The authors posit that the failure stems from a lack of specific, minimal inductive biases rather than a fundamental limitation of the architecture.

2. Methodology

The authors systematically investigate the failure modes of Transformers in a controlled setting: predicting 2D planetary motion around a central mass. They introduce three minimal inductive biases to bridge the gap between geometric prediction and physical law discovery.

Problem Setup

The task involves predicting the next position $\vec{r}_{t+1}$ of a planet given a history of positions, formulated as an autoregressive next-token prediction (NTP) problem.

Baseline: The setup follows Vafa et al. (2025), where continuous coordinates are discretized into tokens (bins) and predicted via cross-entropy loss.
Proposed Modifications: The authors test variations in tokenization, loss functions, and attention mechanisms to isolate specific inductive biases.
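
To make the baseline tokenization concrete, here is a minimal sketch of binning a continuous 2D position into discrete tokens. The coordinate range, the number of bins, the per-axis token scheme, and the names `tokenize` and `detokenize` are all illustrative assumptions, not details taken from the paper.

```python
def tokenize(x, y, lo=-1.0, hi=1.0, bins_per_axis=8):
    """Map a continuous 2D position to a pair of integer bin tokens.

    The range [lo, hi] and bins_per_axis are hypothetical values.
    """
    def to_bin(value):
        # Clamp into [lo, hi], then scale to an index in [0, bins_per_axis - 1].
        clamped = min(max(value, lo), hi)
        index = int((clamped - lo) / (hi - lo) * bins_per_axis)
        return min(index, bins_per_axis - 1)  # value == hi goes in the last bin
    return to_bin(x), to_bin(y)


def detokenize(token_x, token_y, lo=-1.0, hi=1.0, bins_per_axis=8):
    """Return the centre of each bin; the offset inside the bin is lost."""
    width = (hi - lo) / bins_per_axis
    return lo + (token_x + 0.5) * width, lo + (token_y + 0.5) * width


# Two nearly identical x positions can receive different token ids.
left = tokenize(-0.01, 0.0)
right = tokenize(0.01, 0.0)
```

The token ids carry no built-in notion of distance: a model with randomly initialized embeddings has to learn from data that adjacent bins are neighbours.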

The Three Inductive Biases

Bias 1: Spatial Smoothness

Failure Mode: Default tokenization discretizes continuous spatial coordinates into independent bins with randomly initialized embeddings. This breaks spatial smoothness; points physically close but in different bins are treated as unrelated. The authors show that even with massive data (20B tokens), the learned embedding space fails to form a coherent spatial map (low linear decodability, $R^2 \approx 0.86$).
Solution:
1. Optimized Tokenization: Reducing the vocabulary size ($V$) significantly improves the emergence of a spatial map. The authors derive a scaling law showing that training data size ($D$) must increase at least as fast as vocabulary size ($V$) to maintain map quality ($1-R^2 \propto D^{-\alpha_D} V^{\alpha_V}$).
2. Continuous Coordinates: Alternatively, using continuous coordinates without discretization inherently provides spatial smoothness, though this introduces stability challenges.
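
The scaling law above can be turned into a quick back-of-the-envelope calculation. Holding $1-R^2$ fixed in $1-R^2 \propto D^{-\alpha_D} V^{\alpha_V}$ gives $D \propto V^{\alpha_V/\alpha_D}$. The sketch below computes that factor; the function names and the exponent values in the example are hypothetical placeholders, not the fitted values from the paper.

```python
def data_multiplier(vocab_ratio, alpha_d, alpha_v):
    """Factor by which D must grow to keep 1 - R^2 fixed when V grows by vocab_ratio."""
    if alpha_d <= 0:
        raise ValueError("alpha_d must be positive for more data to help")
    return vocab_ratio ** (alpha_v / alpha_d)


def probe_error_ratio(data_ratio, vocab_ratio, alpha_d, alpha_v):
    """Ratio of (1 - R^2) after scaling D and V by the given factors."""
    return data_ratio ** (-alpha_d) * vocab_ratio ** alpha_v


# Hypothetical exponents: alpha_d = 0.5, alpha_v = 1.0.
# Quadrupling the vocabulary then calls for 4 ** 2 = 16 times more data.
needed = data_multiplier(4, alpha_d=0.5, alpha_v=1.0)
unchanged = probe_error_ratio(needed, 4, alpha_d=0.5, alpha_v=1.0)
```

Under this form of the law, data has to grow at least linearly with vocabulary size whenever $\alpha_V \ge \alpha_D$.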

Bias 2: Spatial Stability

Failure Mode: Autoregressive models suffer from error accumulation, which is exacerbated when predicting continuous variables (regression) compared to discrete tokens (classification). Without mitigation, small initial errors cause the trajectory to diverge catastrophically (e.g., the planet flying to infinity or into the sun).
Solution: Noisy Context Learning. The authors inject Gaussian noise into the historical context during training. This forces the model to learn robust representations that do not rely on perfect past states.
Result: With noisy context training, regression (using continuous coordinates and MSE loss) consistently outperforms classification (discretized coordinates with cross-entropy loss) across all data scales.
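
As a rough illustration of noisy context learning, the sketch below builds one training example in which Gaussian noise is added to the context positions. Keeping the target clean, and the specific names `make_noisy_example`, `context_len`, and `sigma`, are assumptions made for this example; the summary above does not specify these details.

```python
import random


def make_noisy_example(trajectory, context_len, sigma, rng):
    """Build one (context, target) pair with Gaussian noise on the context only.

    trajectory: list of (x, y) positions, longer than context_len.
    sigma: standard deviation of the injected noise (hypothetical setting).
    """
    start = rng.randrange(len(trajectory) - context_len)
    clean_context = trajectory[start:start + context_len]
    target = trajectory[start + context_len]  # assumption: the target stays clean
    noisy_context = [
        (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
        for x, y in clean_context
    ]
    return noisy_context, target


# Toy straight-line "trajectory", just to exercise the function.
toy_trajectory = [(float(t), 2.0 * t) for t in range(50)]
context, target = make_noisy_example(toy_trajectory, 8, 0.01, random.Random(0))
```

Because the model sees slightly wrong histories paired with correct next positions, it is pushed to pull predictions back toward the true trajectory instead of amplifying small errors at inference time.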

Bias 3: Temporal Locality

Failure Mode: Standard Transformers utilize long context lengths (e.g., 1k+ tokens), allowing the model to access the entire history of the trajectory. This encourages the model to fit global geometric shapes (ellipses) based on all past points—a "Keplerian" approach.
Solution: Restricted Attention Window. The authors restrict the context length to the immediate past (e.g., only the last 2 states). Two consecutive positions are enough to estimate the current position and velocity, so this imposes the physical assumption that the future state depends only on the local state, consistent with Newton's second law (a second-order differential equation).
Result: This constraint forces the model to abandon global curve-fitting and instead learn to estimate local gravitational forces ($|\vec{F}| \propto 1/r^2$) to simulate the trajectory step-by-step, a "Newtonian" approach.
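
One common way to impose a restricted attention window is a causal sliding-window mask. The sketch below is an assumption about how such a restriction could be realized; the paper may simply truncate the input instead, and the name `local_causal_mask` is hypothetical.

```python
def local_causal_mask(seq_len, window):
    """Return mask[i][j], True when position i may attend to position j.

    Position i sees itself and the (window - 1) positions before it,
    and never sees the future.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# With window=2, each position attends only to itself and its predecessor.
mask = local_causal_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `window` equal to `seq_len` recovers the ordinary causal mask, which corresponds to the long-context setting.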

3. Key Results

Spatial Map Emergence: The quality of the learned spatial map in tokenized models is highly sensitive to vocabulary size. Large vocabularies (e.g., $V=7000$) require impractical amounts of data to learn a coherent map. Reducing $V$ or using continuous coordinates resolves this.
Regression vs. Classification: Contrary to Vafa et al.'s findings, the authors demonstrate that regression with continuous coordinates is superior to classification, provided that noisy context learning is used to stabilize inference.
Keplerian vs. Newtonian Models:
- Long Context (Keplerian): The model learns to fit the global elliptical trajectory using all past states. It predicts by continuing the curve.
- Short Context (Newtonian): When restricted to local states, the model discovers the underlying force law. It predicts by simulating the differential equation $F=ma$.
Inductive Bias Hierarchy: The paper demonstrates that simple architectural choices (tokenization strategy, context length) determine whether an AI acts as a "curve-fitter" (Kepler) or a "physicist" (Newton).
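
To show what the short-context (Newtonian) prediction amounts to, here is a hand-written sketch of a next-position rule that uses only the last two positions and an inverse-square force. It uses position Verlet integration, which is an illustrative choice and not a description of what the trained Transformer computes internally; `newtonian_step`, `rollout`, `gm`, and `dt` are hypothetical names.

```python
import math


def newtonian_step(r_prev, r_curr, dt, gm=1.0):
    """Predict the next position from the last two positions (position Verlet)."""
    x, y = r_curr
    dist = math.hypot(x, y)
    # Inverse-square attraction toward the central mass at the origin.
    ax, ay = -gm * x / dist**3, -gm * y / dist**3
    # 2*r_curr - r_prev carries the velocity implied by the two states.
    return (2 * x - r_prev[0] + ax * dt**2,
            2 * y - r_prev[1] + ay * dt**2)


def rollout(r_prev, r_curr, dt, steps, gm=1.0):
    """Feed each prediction back in as context, as an autoregressive model would."""
    path = [r_prev, r_curr]
    for _ in range(steps):
        path.append(newtonian_step(path[-2], path[-1], dt, gm))
    return path


# Circular orbit of radius 1 with gm = 1 (angular velocity 1).
dt = 0.01
path = rollout((math.cos(-dt), math.sin(-dt)), (1.0, 0.0), dt, steps=1000)
```

A long-context curve-fitter would instead extrapolate the ellipse traced by all earlier points; the rule above never looks further back than one step.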

4. Significance and Claims

The paper claims that simple architectural choices are the determining factor in whether a general-purpose AI discovers physical laws or merely fits data.

Bridging the Gap: The work bridges the divide between "AI Physicist" models (which use strong priors) and generic Transformers (which fail to learn physics). It shows that generic Transformers can learn world models if equipped with minimal, domain-agnostic inductive biases (smoothness, stability, locality).
Automated Scientific Discovery: The results serve as a "critical litmus test" for the vision of "AI Scientists." If general-purpose architectures cannot recover the known laws of classical mechanics without specific engineering, they cannot be trusted to discover unknown laws.
Mechanism of Failure: The paper clarifies that the failure of previous large-scale models was not due to a lack of capacity, but due to the absence of specific inductive biases (specifically temporal locality and spatial stability) required to force the emergence of causal abstractions over geometric correlations.

The authors conclude that by systematically introducing these biases, Transformers can transition from predicting what happens next to understanding why it happens, marking a step toward automated scientific discovery.
