LiTo: Surface Light Field Tokenization

The paper proposes LiTo, a unified 3D latent representation that tokenizes surface light fields from RGB-depth images to jointly model geometry and view-dependent appearance, enabling high-fidelity 3D object generation with realistic lighting effects via a latent flow matching model.

Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel

Published Thu, 12 Ma

Imagine you want to teach a computer to understand a 3D object, like a shiny red apple or a dusty old vase.

Most current AI models are like a child who only sees the object under a single, flat light. They can tell you the apple is round (geometry) and red (color), but they can't explain why the apple looks different when you walk around it. They miss the shiny highlight on the skin, the way the color deepens in the shadows, or how the reflection shifts as you move. They treat the object as a static, flat painting wrapped around a shape.

LiTo (surface Light field Tokenization) is a new method that teaches the AI to see the object the way our eyes actually do: as a dynamic, living thing that changes based on where you stand and how the light hits it.

Here is how it works, broken down into simple concepts:

1. The "Surface Light Field" (The Infinite Library)

Imagine the surface of an object isn't just a wall; it's a massive library. Every single point on that surface has a book.

  • Old AI: Only reads the book that says "This point is red."
  • LiTo: Reads the entire library. It knows that if you look at that red point from the left, it's dark red. From the right, it's bright red. From above, it might have a white sparkle (specular highlight).

This collection of "what color is this point from every possible angle?" is called a Surface Light Field. It's a huge amount of data, like trying to memorize every possible photo of an object.
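The core idea is simply that color is a function of both the surface point and the viewing direction. Here is a toy stand-in for that function (the shiny red apple from earlier): everything in it (the base color, the fixed "mirror" direction, the specular exponent) is made up for illustration, not taken from the paper, which learns this function from data.

```python
import numpy as np

def toy_surface_light_field(point, view_dir):
    """Toy surface light field: color = f(surface point, view direction).

    Hypothetical stand-in for the learned function: a diffuse red base
    plus a specular highlight that appears only when the view direction
    lines up with a fixed 'mirror' direction.
    """
    base = np.array([0.6, 0.1, 0.1])          # diffuse red, same from every angle
    mirror = np.array([0.0, 0.0, 1.0])        # direction of the shiny sparkle
    v = np.asarray(view_dir, dtype=float)
    v = v / np.linalg.norm(v)
    spec = max(np.dot(v, mirror), 0.0) ** 16  # sharp specular lobe
    return np.clip(base + spec, 0.0, 1.0)     # color depends on where you stand

# Same surface point, two viewpoints: head-on sees the white sparkle,
# a grazing view sees only the dark red base.
head_on = toy_surface_light_field([0, 0, 0], [0, 0, 1])
grazing = toy_surface_light_field([0, 0, 0], [1, 0, 0])
```

The point of the sketch: ask the same surface point two different questions ("how do you look from here?") and get two different answers. That is the entire library.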

2. The "Tokenization" (The Zipper)

Because that library is too big to carry around, LiTo uses a clever trick called Tokenization.
Think of the Surface Light Field as a massive, uncompressed video file. LiTo is the "ZIP file" compressor.

  • It takes a random sampling of those millions of "books" (images from different angles).
  • It feeds them into a smart encoder (a compression algorithm).
  • It spits out a tiny, compact list of numbers (a "latent representation") that contains all the information needed to reconstruct the full library.

It's like taking a 4K movie and compressing it into a tiny file that, when opened, can faithfully recreate the movie, including all the lighting effects.
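The shape of that compression step can be sketched in a few lines. Every concrete number here (64 views, 768-dim features, a random linear projection, a 32-dim latent) is a made-up stand-in for the paper's actual encoder, but the operation is the same in spirit: many view observations go in, one small latent code comes out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 64 sampled views of one object, each flattened
# to a 768-dim feature vector (stand-ins for the RGB-depth inputs).
views = rng.normal(size=(64, 768))

# Toy "encoder": a fixed random linear projection plus mean pooling over
# views -- a sketch of squeezing many observations into one compact code,
# not the paper's actual learned encoder.
proj = rng.normal(size=(768, 32)) / np.sqrt(768)
per_view_tokens = views @ proj          # (64, 32): each view becomes a token
latent = per_view_tokens.mean(axis=0)   # (32,): one tiny code for the object

compression = views.size / latent.size  # how much smaller the "ZIP file" is
```

Even this toy version compresses the input by a factor of 1536; the real encoder is trained so that the decoder can reconstruct the full light field from the code, rather than discarding information at random.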

3. The "Decoder" (The 3D Painter)

Once the AI has this tiny "ZIP file," it needs to show you the object. It uses two specialized painters:

  • The Geometry Painter: This one figures out the shape. Is it a sphere? A cube? It draws the 3D skeleton.
  • The Appearance Painter: This is the magic part. Instead of just painting the object red, it uses 3D Gaussians (think of them as glowing, fuzzy clouds of color) that are programmed with Spherical Harmonics.
    • Analogy: Imagine the object is covered in thousands of tiny, glowing fireflies. Each firefly knows exactly how to change its color and brightness depending on where you are standing. If you walk to the left, the fireflies on the left side glow brighter to simulate a reflection. If you walk to the right, they dim.

This allows the AI to recreate realistic effects like specular highlights (the shiny spot on a wet car) and Fresnel reflections (the way a window looks see-through head-on but mirror-like at a grazing angle) without needing to know the physics formulas beforehand. It just learns them from the data.
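The "firefly" mechanism is the standard spherical-harmonics trick from 3D Gaussian splatting: each Gaussian stores SH coefficients, and its color is evaluated per viewing direction. Below is a degree-1 evaluation; the basis constants are the real SH values, but the coefficients are made up for illustration.

```python
import numpy as np

# Real degree-0/1 spherical-harmonic basis constants
# (the same ones used in 3D Gaussian splatting renderers).
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def sh_color(coeffs, view_dir):
    """Evaluate a degree-1 SH color for one Gaussian.

    coeffs: (4, 3) array -- one constant RGB term plus three linear RGB
    terms. The result varies smoothly with view direction, which is how
    each 'firefly' shifts its color as you walk around the object.
    """
    x, y, z = np.asarray(view_dir, dtype=float) / np.linalg.norm(view_dir)
    basis = np.array([C0, -C1 * y, C1 * z, -C1 * x])
    return basis @ coeffs  # (3,) RGB, before any activation or clamping

# Made-up coefficients: grey base color with a blue lobe toward +z.
coeffs = np.zeros((4, 3))
coeffs[0] = [1.0, 1.0, 1.0]   # view-independent base color
coeffs[2] = [0.0, 0.0, 0.8]   # extra blue when viewed from +z

front = sh_color(coeffs, [0, 0, 1])   # bluer from the front
side  = sh_color(coeffs, [1, 0, 0])   # plain grey from the side
```

Nothing here hard-codes "specular highlight" or "Fresnel"; the decoder just learns whatever coefficients reproduce the training views, and those effects emerge from the view-dependent terms.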

4. The "Generative Model" (The Imagination Engine)

Finally, the authors trained a "generative" model (like a creative artist) on these tiny ZIP files.

  • Input: You show the AI a single photo of a chair.
  • Process: The AI looks at the photo, figures out the "ZIP code" (the latent vector) that represents that chair's shape and its specific lighting/materials.
  • Output: It generates a full 3D model of that chair. You can rotate it, walk around it, and the shiny wood grain and the way the light hits the armrest will look real and consistent, just like in your original photo.
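The generation step uses flow matching in the latent space: start from random noise and follow a learned velocity field until it lands on a valid "ZIP file". The sketch below is a minimal illustration with assumptions: the trained network is replaced by the closed-form velocity of a straight-line path to a made-up target latent, and the integrator is plain Euler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target latent (the "ZIP file" the generator should produce).
target = rng.normal(size=(32,))

def velocity(z, t):
    """Toy velocity field for rectified flow matching.

    With straight-line paths z_t = (1 - t) * noise + t * target, the ideal
    velocity is (target - z_t) / (1 - t). A trained network predicts this
    from (z, t); here we cheat and use the formula directly.
    """
    return (target - z) / (1.0 - t)

# Sampling: start from pure noise, Euler-integrate dz/dt = v(z, t)
# from t = 0 to t = 1.
z = rng.normal(size=(32,))
steps = 100
for i in range(steps):
    t = i / steps
    z = z + (1.0 / steps) * velocity(z, t)

error = np.linalg.norm(z - target)  # noise has flowed onto the target latent
```

In the real system the starting noise is shaped by the input photo (via conditioning), so the flow lands on a latent matching that chair rather than an arbitrary one, and the decoder then turns the latent into the 3D model.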

Why is this a big deal?

  • Realism: Previous methods made objects look like plastic toys or flat paintings. LiTo makes them look like real materials (metal, glass, fabric).
  • Efficiency: It doesn't need to store millions of images. It stores a tiny, efficient code that can recreate the infinite possibilities of light.
  • One-Shot Learning: It can take a single picture and turn it into a fully explorable 3D object that respects the lighting and materials of that specific photo.

In short: LiTo teaches AI to stop seeing 3D objects as static statues and start seeing them as dynamic scenes where light, angle, and material interact in real-time. It's the difference between a drawing of a ball and a real ball you can roll around in the light.