Imagine you have a brilliant, super-smart robot that has spent years watching millions of hours of videos. It has learned to understand how the world works: how gravity pulls things down, how a person throws a ball, or how a kite flies. But there's a catch: this robot thinks in a secret, continuous language of numbers that humans can't read. It's like it's speaking a language of pure math, and we have no dictionary to translate it.
This paper is about building a translator for that secret language, but with a very specific, clever twist.
The Problem: The "Black Box" Robot
The robot in question (called V-JEPA 2) is a "world model." Instead of trying to redraw the video pixel-by-pixel (like a generative AI that draws pictures), it predicts what happens next in a hidden, abstract space. This makes it incredibly good at understanding physics and motion.
However, because it never draws the picture back out, we can't see what it actually learned. It's like having a genius who can solve complex equations in their head but refuses to write them down. We know they are smart, but we can't inspect their notes to see if they truly understand "physics" or if they are just guessing patterns.
The Old Way vs. The New Way
- The Old Way (The "Active" Translator): Usually, researchers try to attach a second AI (like a language bot) to the robot's brain to ask it questions. But this is messy. If the language bot gives a good answer, we don't know if it's because the robot actually understood the concept, or if the language bot just used its own knowledge to fill in the blanks. It's like asking a student a math question while a tutor is whispering the answers in their ear. You can't tell who did the work.
- The New Way (The "Passive" Translator): The authors propose a new method called AIM (AI Mother Tongue). Instead of a smart language bot, they attach a simple, dumb "quantizer." Think of this as a stamping machine.
- The robot's secret math numbers come out.
- The stamping machine doesn't understand math; it just looks at the numbers and says, "This looks like a '5', that looks like a '3'."
- Crucially, the robot's brain is frozen. It cannot change its mind to help the stamping machine. The stamping machine has no dictionary and no pre-set rules. It just groups similar numbers together.
If the stamping machine starts grouping "Archery" videos into one bucket and "Bowling" videos into another, we know for a fact that the robot made those two things look different in its brain. The stamping machine didn't force them apart; it just revealed the difference that was already there.
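The "stamping machine" idea can be sketched in a few lines. This is an illustrative toy, not the paper's actual AIM implementation: a from-scratch k-means quantizer plays the role of the passive translator, the synthetic `archery`/`bowling` arrays stand in for a frozen encoder's outputs, and all names, shapes, and numbers are made up.

```python
# Toy sketch of the "passive translator": the encoder is frozen, and a
# simple, label-free quantizer (here, plain k-means) stamps each embedding
# with a discrete symbol. Everything here is illustrative; the paper's
# AIM quantizer may differ in detail.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs: two actions whose embeddings
# occupy slightly different regions of the same latent space.
archery = rng.normal(loc=0.0, scale=0.3, size=(50, 8))
bowling = rng.normal(loc=0.6, scale=0.3, size=(50, 8))
embeddings = np.vstack([archery, bowling])

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means: no labels, no dictionary, no pre-set rules."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest "symbol" (codebook entry).
        dists = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each symbol's center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels

centers, symbols = kmeans(embeddings, k=4)

# If the two actions were already separated in the frozen latent space,
# their symbol histograms will differ; the quantizer only reveals it.
hist_archery = np.bincount(symbols[:50], minlength=4)
hist_bowling = np.bincount(symbols[50:], minlength=4)
print("archery symbols:", hist_archery)
print("bowling symbols:", hist_bowling)
```

Because the quantizer never sees labels and the embeddings never change, any difference between the two histograms had to come from the encoder itself.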
The Experiment: Testing the Translator
The researchers tested this on a small dataset of videos (Kinetics-mini) involving five actions: archery, bowling, flying a kite, high jumping, and marching.
They set up three specific tests to see if the stamping machine could detect physical differences:
- Grip Angle: Archery (pulling a bowstring) vs. Bowling (holding a ball).
- Object Shape: Flying a kite (long, thin object) vs. High Jump (no object, just a body).
- Time/Motion: Marching (steady, rhythmic walking) vs. Archery (slow build-up, then a quick release).
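To check that two actions' symbol distributions differ by more than chance, a standard tool is a chi-square test on the symbol counts. The sketch below (stdlib only) computes the Pearson chi-square statistic for a two-row contingency table; the counts are invented for illustration and are not the paper's data.

```python
# Hedged sketch of the kind of significance check described in the text:
# given symbol counts for two actions, the Pearson chi-square statistic
# measures how far their symbol distributions are from what pure chance
# (independence) would produce. Counts below are illustrative only.

def chi_square_stat(counts_a, counts_b):
    """Pearson chi-square statistic for a 2-row contingency table."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for ca, cb in zip(counts_a, counts_b):
        col = ca + cb
        if col == 0:
            continue
        # Expected counts if action and symbol were independent.
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        stat += (ca - exp_a) ** 2 / exp_a + (cb - exp_b) ** 2 / exp_b
    return stat

# Illustrative symbol histograms over a 4-symbol codebook.
archery = [40, 5, 3, 2]
bowling = [10, 30, 5, 5]
print(round(chi_square_stat(archery, bowling), 2))
```

A large statistic (compared against the chi-square distribution with the appropriate degrees of freedom) means the two actions really do land on different symbols; identical histograms give a statistic of zero.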
The Results: A "Compact" Brain
The results were fascinating.
- It Worked: The stamping machine successfully grouped the videos. The "Archery" videos got mostly one symbol (let's call it Symbol #5), while "Bowling" got mostly Symbol #5 but with a little bit of Symbol #4 mixed in. The statistical tests showed these differences were significant, not random noise.
- The "Collision" Surprise: The most interesting finding was that almost every action mapped to the same main symbol (#5).
- Wait, didn't it fail? No! The authors explain this with a great metaphor: Imagine a hotel.
- In a conventional hotel, every guest gets their own separate room (full categorical separation: one symbol per action).
- In this robot's brain, all the guests (actions) stay in the same giant lobby (the "compact" latent space). They share the same core features (gravity, human movement, space).
- However, they don't all stand in the exact same spot. "Marching" guests are clustered near the door, "Archery" guests are near the bar, and "Bowling" guests are near the elevator.
- The stamping machine (AIM) couldn't give them different rooms, but it could detect that they were standing in slightly different corners of the lobby.
This "compactness" is actually a sign of success. It means the robot has learned the universal laws of physics that apply to all these actions, rather than just memorizing that "archery" is one thing and "bowling" is another.
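The lobby picture can be rendered numerically. In this toy sketch (all numbers invented, not from the paper), every action's embeddings sit close to one shared center, so a coarse quantizer would stamp them all with the same main symbol, yet the per-action means remain measurably apart.

```python
# Numeric rendering of the "hotel lobby": all actions cluster around one
# shared center (one lobby, one dominant symbol), but each action's mean
# sits in a slightly different corner. Offsets and scales are invented.
import numpy as np

rng = np.random.default_rng(1)
shared_center = np.zeros(8)
offsets = {"archery": 0.2, "bowling": -0.2, "marching": 0.15}

clouds = {
    name: shared_center + off + rng.normal(scale=0.05, size=(30, 8))
    for name, off in offsets.items()
}

# All points stay near the shared center ("everyone is in the lobby")...
for name, pts in clouds.items():
    dist = float(np.linalg.norm(pts, axis=1).mean())
    print(name, "mean distance to shared center:", round(dist, 2))

# ...but the per-action means are still separable ("different corners").
means = {name: pts.mean(axis=0) for name, pts in clouds.items()}
gap = float(np.linalg.norm(means["archery"] - means["bowling"]))
print("archery-bowling mean gap:", round(gap, 2))
```

This is why a shared dominant symbol is compatible with statistically real differences: the separation lives in the fine structure around the shared center, not in disjoint regions.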
Why This Matters
This paper shows that we can peek inside these "black box" AI brains without breaking them or confusing the results.
- For Science: It gives us a way to audit AI. We can ask, "Does this AI actually understand physics, or is it just faking it?"
- For Safety: If we can turn the AI's secret thoughts into a list of simple symbols (like a code), we can monitor it for dangerous patterns. If the code suddenly shifts in a weird way, we know something is wrong before the AI makes a mistake.
- For the Future: This is just Stage 1 of a four-stage plan.
- Stage 1 (Done): Proved the translator works on a frozen brain.
- Stage 2: Make the translator more detailed (more symbols).
- Stage 3: Let the robot and translator learn together.
- Stage 4: Build a robot that can plan actions using this new symbolic language.
The Bottom Line
The authors built a simple, passive tool that acts like a mirror for a complex AI. By freezing the AI and just observing how it groups its own thoughts, they showed that the AI has indeed learned a structured, physical understanding of the world. It's not just a pattern-matching machine; it has a "world model" inside, and for the first time, we have a way to read its notes.