Tokenizing Semantic Segmentation with RLE

This paper introduces a unified language-modeling approach to semantic and panoptic segmentation in images and videos: masks are discretized into run-length-encoded tokens, and novel compression strategies keep the resulting sequences short enough for autoregressive generation despite computational constraints.

Abhineet Singh, Justin Rozeboom, Nilanjan Ray

Published 2026-03-10

Imagine you are trying to teach a robot to look at a picture and tell you exactly what is where. Usually, robots do this by painting a giant grid over the image, coloring every single pixel one by one. It's like trying to describe a painting by listing the color of every single grain of sand on a beach. It works, but it's slow, messy, and takes up a lot of memory.

This paper proposes a smarter, more efficient way: talking to the robot in "chunks" instead of pixels.

Here is the breakdown of their new method, Tokenizing Semantic Segmentation, using some everyday analogies.

1. The Old Way vs. The New Way

  • The Old Way (Pixel-by-Pixel): Imagine you are describing a long line of red and blue beads to a friend. You say, "Red, red, red, red, red, blue, blue, red..." If you have 1,000 beads, that's 1,000 words.
  • The New Way (Run-Length Encoding): Instead, you use a shorthand. You say, "Five reds, two blues, one red..." This is called Run-Length Encoding (RLE). You are compressing the information. Instead of listing every bead, you list the groups of beads.
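The bead analogy maps directly onto code. Here is a minimal sketch of run-length encoding for one row of mask labels (the paper's actual tokenization is more involved, but the compression principle is the same):

```python
def rle_encode(labels):
    """Compress a row of per-pixel labels into (label, run_length) pairs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([label, 1])   # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (label, run_length) pairs back into per-pixel labels."""
    return [label for label, n in runs for _ in range(n)]

row = ["red"] * 5 + ["blue"] * 2 + ["red"]
runs = rle_encode(row)
print(runs)                     # [('red', 5), ('blue', 2), ('red', 1)]
assert rle_decode(runs) == row  # lossless round trip
```

Eight beads become three "words," and nothing is lost: decoding the runs reproduces the row exactly.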

The authors realized that computer vision models (like the famous "Pix2Seq") are actually great at predicting the next word in a sentence (like a text predictor on your phone). So, they decided to treat the "groups of beads" (the mask) as a sentence. The robot doesn't paint pixels; it writes a story about where the objects are.
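"Writing a story" here means autoregressive generation: the model emits one token at a time, each conditioned on everything it has written so far, exactly like a phone's text predictor. A toy sketch of that loop (the predictor below is a canned stand-in, not the paper's model):

```python
def generate(predict_next, prompt, max_len=10, end_token="<end>"):
    """Autoregressive loop: feed the sequence so far, append the prediction."""
    seq = list(prompt)
    for _ in range(max_len):
        tok = predict_next(seq)
        seq.append(tok)
        if tok == end_token:
            break
    return seq

# Toy predictor that always continues one fixed "mask sentence".
canned = ["5xred", "2xblue", "1xred", "<end>"]
print(generate(lambda seq: canned[len(seq)], []))
# ['5xred', '2xblue', '1xred', '<end>']
```

Swap the canned predictor for a trained transformer and the loop is the same: the mask is just the sentence the model decides to write.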

2. The "Zipper" Problem (Compression)

Even with the "Five reds, two blues" method, if you have a high-resolution image, the list of groups can still get too long. It's like trying to zip up a jacket that is too full; it just won't close.

To fix this, the authors invented some clever "compression tricks":

  • Lengths-As-Class (LAC): Imagine you are ordering pizza. Instead of saying "Pepperoni, size 12" and "Pepperoni, size 14" as two separate instructions, you create a special menu item called "Pepperoni-12" and another called "Pepperoni-14." You combine the size and the type into one single token. This shortens the sentence significantly.
  • Time-As-Class (TAC): When dealing with videos (moving pictures), the problem gets harder. You have to describe the same object in Frame 1, Frame 2, and Frame 3.
    • The Bad Way: Describe Frame 1, then Frame 2, then Frame 3. The list gets huge.
    • The Smart Way: Create a "super-token" that means "The car is in the left spot in Frame 1, and the middle spot in Frame 2." It's like giving the object a "time-traveling ID card" that tells the robot where it is across the whole timeline in one go.
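One way to picture the Lengths-As-Class trick in code (a sketch under assumed class names and a made-up run-length cap; the paper's vocabulary design differs in detail): instead of emitting a class token followed by a length token, every (class, run length) pair is fused into a single vocabulary entry.

```python
# Hypothetical fused LAC vocabulary: every (class, run_length) pair
# gets exactly one token ID, e.g. "car-3" -> 5.
CLASSES = ["road", "car", "sky"]
MAX_RUN = 4  # cap run lengths so the vocabulary stays finite

lac_vocab = {f"{c}-{n}": i
             for i, (c, n) in enumerate((c, n)
                                        for c in CLASSES
                                        for n in range(1, MAX_RUN + 1))}

def lac_tokenize(runs):
    """One fused token per run (class token + length token become one),
    splitting runs longer than MAX_RUN into several fused tokens."""
    tokens = []
    for cls, run in runs:
        while run > 0:
            n = min(run, MAX_RUN)
            tokens.append(lac_vocab[f"{cls}-{n}"])
            run -= n
    return tokens

# "Five roads, two cars": 3 fused tokens instead of 4 (class + length per run).
print(lac_tokenize([("road", 5), ("car", 2)]))
```

The trade-off is a bigger menu (vocabulary) in exchange for shorter sentences, which is exactly the zipper problem being solved.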

3. Handling Videos (The Movie Theater Analogy)

The paper also tackles video segmentation. Imagine you are watching a movie and trying to track a specific actor.

  • In a traditional video model, the computer looks at Frame 1, then Frame 2, then Frame 3, trying to figure out if the actor moved.
  • In this new model, the computer treats the whole sequence of frames as a single, long sentence. It uses the Time-As-Class trick to say, "This token represents the actor's position across the last 5 seconds." This allows the model to understand motion and continuity without getting overwhelmed by the sheer amount of data.
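The "time-traveling ID card" can be sketched very simply. Assume each object has a coarse position per frame (a toy stand-in for its runs); a Time-As-Class-style super-token then bundles the whole timeline into one entry instead of one token per frame:

```python
# Hypothetical TAC sketch: one super-token per object for the whole clip.
frames = [
    {"car": "left",   "tree": "right"},  # frame 1
    {"car": "middle", "tree": "right"},  # frame 2
    {"car": "right",  "tree": "right"},  # frame 3
]

def tac_tokens(frames):
    """Collect each object's positions across time into one token string."""
    tracks = {}
    for positions in frames:
        for obj, pos in positions.items():
            tracks.setdefault(obj, []).append(pos)
    # e.g. "car@left|middle|right" describes the whole timeline at once
    return [f"{obj}@{'|'.join(track)}" for obj, track in tracks.items()]

print(tac_tokens(frames))
# ['car@left|middle|right', 'tree@right|right|right']
```

Three frames collapse into two tokens, one per object, so sequence length grows with the number of objects rather than the number of frames.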

4. The "Panoptic" Twist (Who is Who?)

The paper also shows how to do Panoptic Segmentation. This is a fancy term for answering two questions at once:

  1. What is it? (Semantic: "That is a dog.")
  2. Which one is it? (Instance: "That is my dog, Fido, not the neighbor's dog.")

They do this by adding a "name tag" to their token sentences. Instead of just saying "Dog, here, here, here," the model says "Dog-Fido, here, here" and "Dog-Neighbor, there, there." It separates the "what" from the "who" within the same compressed list.
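The "name tag" idea in miniature (a sketch with invented tags; the paper's actual token format differs): each run carries both a semantic class (the "what") and an instance ID (the "who"), fused into one token.

```python
# Hypothetical panoptic runs: (class, instance_id, run_length).
runs = [("dog", 1, 3), ("dog", 2, 2), ("grass", 0, 10)]

def panoptic_tokens(runs):
    """Fuse class and instance into one 'name tag' token per run."""
    return [f"{cls}#{inst}x{n}" for cls, inst, n in runs]

print(panoptic_tokens(runs))
# ['dog#1x3', 'dog#2x2', 'grass#0x10']
```

The two dogs share a class but get distinct tags, so "what" and "who" live in the same compressed list.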

5. The Reality Check (The Hardware Bottleneck)

The authors are very honest about the limitations. While this method is brilliant and saves a lot of space, it's like trying to drive a Ferrari on a dirt road.

  • The Good News: The model is competitive with the best existing models. It works well on specific datasets (like tracking ice on a river or cells under a microscope).
  • The Bad News: They ran out of computer power (GPU memory). The method requires a lot of memory to train, so they couldn't test it on massive, real-world datasets (like Cityscapes or COCO); their hardware simply couldn't handle the load.

The Big Picture

Think of this paper as a new language for robots to describe images.

Instead of speaking in a million tiny, repetitive pixels, the robot learns to speak in efficient, compressed "chunks" (tokens). It's like switching from writing a novel by hand, letter by letter, to using a sophisticated shorthand that captures the whole story in a few paragraphs.

Why does this matter?
If we can make this language efficient enough to run on standard computers, we could have AI that understands video and complex scenes much faster and with less energy. It opens the door to AI that can watch a video and describe the entire scene in a single, fluid sentence, rather than a million disjointed facts.

In short: They taught a robot to stop counting every pixel and start telling a story about the image, using a special shorthand that saves time, space, and energy.