Tokenizing Semantic Segmentation with RLE

This paper introduces a unified language-modeling approach to semantic and panoptic segmentation in images and videos: masks are discretized into run-length-encoded tokens, and novel compression strategies keep the resulting sequences short enough for autoregressive generation despite computational constraints.

Abhineet Singh, Justin Rozeboom, Nilanjan Ray

Published 2026-03-10

Imagine you are trying to teach a robot to look at a picture and tell you exactly what is where. Usually, robots do this by painting a giant grid over the image, coloring every single pixel one by one. It's like trying to describe a painting by listing the color of every single grain of sand on a beach. It works, but it's slow, messy, and takes up a lot of memory.

This paper proposes a smarter, more efficient way: talking to the robot in "chunks" instead of pixels.

Here is the breakdown of their new method, Tokenizing Semantic Segmentation, using some everyday analogies.

1. The Old Way vs. The New Way

  • The Old Way (Pixel-by-Pixel): Imagine you are describing a long line of red and blue beads to a friend. You say, "Red, red, red, red, red, blue, blue, red..." If you have 1,000 beads, that's 1,000 words.
  • The New Way (Run-Length Encoding): Instead, you use a shorthand. You say, "Five reds, two blues, one red..." This is called Run-Length Encoding (RLE). You are compressing the information. Instead of listing every bead, you list the groups of beads.
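The bead analogy maps directly onto code. Here is a minimal sketch of run-length encoding for one row of mask labels (the paper's actual tokenization is more involved, but the compression principle is the same):

```python
def rle_encode(labels):
    """Compress a row of per-pixel labels into (label, run_length) pairs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([label, 1])   # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (label, run_length) pairs back into per-pixel labels."""
    return [label for label, n in runs for _ in range(n)]

row = ["red"] * 5 + ["blue"] * 2 + ["red"]
runs = rle_encode(row)
print(runs)                     # [('red', 5), ('blue', 2), ('red', 1)]
assert rle_decode(runs) == row  # lossless round trip
```

Eight beads become three "words," and nothing is lost: decoding the runs reproduces the row exactly.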

The authors realized that computer vision models (like the famous "Pix2Seq") are actually great at predicting the next word in a sentence (like a text predictor on your phone). So, they decided to treat the "groups of beads" (the mask) as a sentence. The robot doesn't paint pixels; it writes a story about where the objects are.
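"Writing a story" here means autoregressive generation: the model emits one token at a time, each conditioned on everything it has written so far, exactly like a phone's text predictor. A toy sketch of that loop (the predictor below is a canned stand-in, not the paper's model):

```python
def generate(predict_next, prompt, max_len=10, end_token="<end>"):
    """Autoregressive loop: feed the sequence so far, append the prediction."""
    seq = list(prompt)
    for _ in range(max_len):
        tok = predict_next(seq)
        seq.append(tok)
        if tok == end_token:
            break
    return seq

# Toy predictor that always continues one fixed "mask sentence".
canned = ["5xred", "2xblue", "1xred", "<end>"]
print(generate(lambda seq: canned[len(seq)], []))
# ['5xred', '2xblue', '1xred', '<end>']
```

Swap the canned predictor for a trained transformer and the loop is the same: the mask is just the sentence the model decides to write.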

2. The "Zipper" Problem (Compression)

Even with the "Five reds, two blues" method, if you have a high-resolution image, the list of groups can still get too long. It's like trying to zip up a jacket that is too full; it just won't close.

To fix this, the authors invented some clever "compression tricks":

  • Lengths-As-Class (LAC): Imagine you are ordering pizza. Instead of saying "Pepperoni, size 12" and "Pepperoni, size 14" as two separate instructions, you create a special menu item called "Pepperoni-12" and another called "Pepperoni-14." You combine the size and the type into one single token. This shortens the sentence significantly.
  • Time-As-Class (TAC): When dealing with videos (moving pictures), the problem gets harder. You have to describe the same object in Frame 1, Frame 2, and Frame 3.
    • The Bad Way: Describe Frame 1, then Frame 2, then Frame 3. The list gets huge.
    • The Smart Way: Create a "super-token" that means "The car is in the left spot in Frame 1, and the middle spot in Frame 2." It's like giving the object a "time-traveling ID card" that tells the robot where it is across the whole timeline in one go.
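One way to picture the Lengths-As-Class trick in code (a sketch under assumed class names and a made-up run-length cap; the paper's vocabulary design differs in detail): instead of emitting a class token followed by a length token, every (class, run length) pair is fused into a single vocabulary entry.

```python
# Hypothetical fused LAC vocabulary: every (class, run_length) pair
# gets exactly one token ID, e.g. "car-3" -> 5.
CLASSES = ["road", "car", "sky"]
MAX_RUN = 4  # cap run lengths so the vocabulary stays finite

lac_vocab = {f"{c}-{n}": i
             for i, (c, n) in enumerate((c, n)
                                        for c in CLASSES
                                        for n in range(1, MAX_RUN + 1))}

def lac_tokenize(runs):
    """One fused token per run (class token + length token become one),
    splitting runs longer than MAX_RUN into several fused tokens."""
    tokens = []
    for cls, run in runs:
        while run > 0:
            n = min(run, MAX_RUN)
            tokens.append(lac_vocab[f"{cls}-{n}"])
            run -= n
    return tokens

# "Five roads, two cars": 3 fused tokens instead of 4 (class + length per run).
print(lac_tokenize([("road", 5), ("car", 2)]))
```

The trade-off is a bigger menu (vocabulary) in exchange for shorter sentences, which is exactly the zipper problem being solved.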

3. Handling Videos (The Movie Theater Analogy)

The paper also tackles video segmentation. Imagine you are watching a movie and trying to track a specific actor.

  • In a traditional video model, the computer looks at Frame 1, then Frame 2, then Frame 3, trying to figure out if the actor moved.
  • In this new model, the computer treats the whole sequence of frames as a single, long sentence. It uses the Time-As-Class trick to say, "This token represents the actor's position across the last 5 seconds." This allows the model to understand motion and continuity without getting overwhelmed by the sheer amount of data.
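The "time-traveling ID card" can be sketched very simply. Assume each object has a coarse position per frame (a toy stand-in for its runs); a Time-As-Class-style super-token then bundles the whole timeline into one entry instead of one token per frame:

```python
# Hypothetical TAC sketch: one super-token per object for the whole clip.
frames = [
    {"car": "left",   "tree": "right"},  # frame 1
    {"car": "middle", "tree": "right"},  # frame 2
    {"car": "right",  "tree": "right"},  # frame 3
]

def tac_tokens(frames):
    """Collect each object's positions across time into one token string."""
    tracks = {}
    for positions in frames:
        for obj, pos in positions.items():
            tracks.setdefault(obj, []).append(pos)
    # e.g. "car@left|middle|right" describes the whole timeline at once
    return [f"{obj}@{'|'.join(track)}" for obj, track in tracks.items()]

print(tac_tokens(frames))
# ['car@left|middle|right', 'tree@right|right|right']
```

Three frames collapse into two tokens, one per object, so sequence length grows with the number of objects rather than the number of frames.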

4. The "Panoptic" Twist (Who is Who?)

The paper also shows how to do Panoptic Segmentation. This is a fancy term for answering two questions at once:

  1. What is it? (Semantic: "That is a dog.")
  2. Which one is it? (Instance: "That is my dog, Fido, not the neighbor's dog.")

They do this by adding a "name tag" to their token sentences. Instead of just saying "Dog, here, here, here," the model says "Dog-Fido, here, here" and "Dog-Neighbor, there, there." It separates the "what" from the "who" within the same compressed list.
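The "name tag" idea in miniature (a sketch with invented tags; the paper's actual token format differs): each run carries both a semantic class (the "what") and an instance ID (the "who"), fused into one token.

```python
# Hypothetical panoptic runs: (class, instance_id, run_length).
runs = [("dog", 1, 3), ("dog", 2, 2), ("grass", 0, 10)]

def panoptic_tokens(runs):
    """Fuse class and instance into one 'name tag' token per run."""
    return [f"{cls}#{inst}x{n}" for cls, inst, n in runs]

print(panoptic_tokens(runs))
# ['dog#1x3', 'dog#2x2', 'grass#0x10']
```

The two dogs share a class but get distinct tags, so "what" and "who" live in the same compressed list.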

5. The Reality Check (The Hardware Bottleneck)

The authors are very honest about the limitations. While this method is brilliant and saves a lot of space, it's like trying to drive a Ferrari on a dirt road.

  • The Good News: The model is competitive with the best existing models. It works well on specific datasets (like tracking ice on a river or cells under a microscope).
  • The Bad News: They ran out of computer power (GPU memory). The method requires a lot of memory to train, so they couldn't test it on massive, real-world datasets (like Cityscapes or COCO); their hardware simply couldn't handle the load.

The Big Picture

Think of this paper as a new language for robots to describe images.

Instead of speaking in a million tiny, repetitive pixels, the robot learns to speak in efficient, compressed "chunks" (tokens). It's like switching from writing a novel by hand, letter by letter, to using a sophisticated shorthand that captures the whole story in a few paragraphs.

Why does this matter?
If we can make this language efficient enough to run on standard computers, we could have AI that understands video and complex scenes much faster and with less energy. It opens the door to AI that can watch a video and describe the entire scene in a single, fluid sentence, rather than a million disjointed facts.

In short: They taught a robot to stop counting every pixel and start telling a story about the image, using a special shorthand that saves time, space, and energy.