A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

The paper introduces Wallaroo, a simple autoregressive model that unifies multi-modal understanding, image generation, and image editing through plain next-token prediction. It supports multiple resolutions and both English and Chinese, and achieves competitive performance through decoupled visual encoding and a four-stage training strategy.

Jie Zhu, Hanghang Ma, Jia Wang, Yayong Guan, Yanbing Zeng, Lishuai Gao, Junqiang Wu, Jie Hu, Leye Wang

Published 2026-03-06

Imagine you have a super-smart assistant who can do three things: look at a picture and describe it, draw a new picture from scratch, and edit an existing picture (like changing a dog into a cat).

Usually, AI researchers build separate "brains" for each of these jobs. One brain is great at understanding, another is great at drawing, and a third is great at editing. But what if you could build one single brain that does all three perfectly, using the same simple logic for everything?

That is exactly what the researchers behind Wallaroo have done.

Here is the story of Wallaroo, explained simply:

1. The Core Idea: The "Next Word" Game

Most AI models are like a person playing a game of "fill in the blank." If you say, "The sky is...", the AI guesses the next word, "blue." Then it guesses the next word after that.

Wallaroo uses this exact same game, but for everything.

  • If you show it a picture and ask a question, it guesses the next word of the answer.
  • If you ask it to draw a cat, it guesses the next "piece" of the image (a discrete code that stands for a small patch of pixels), building the picture one piece at a time.
  • If you want to edit a photo, it guesses the next piece of the new photo.

The researchers call this "Vanilla Next-Token Prediction." Think of it like a master chef who uses the exact same knife skills to chop vegetables, slice meat, and garnish a cake. They don't need three different sets of tools; they just need one set of skills applied to different ingredients.
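
To make the "one game for everything" idea concrete, here is a toy sketch in Python. It is purely illustrative: the real model is a large transformer, and the token IDs, the `DummyModel` stand-in, and the ID ranges below are made up for this example. The point is that text words and image codes live in one shared vocabulary, and a single loop predicts the next token regardless of which modality it belongs to.

```python
# Illustrative only: token IDs below IMAGE_CODE_OFFSET are "words",
# IDs at or above it are discrete image codes from a VQ-style tokenizer.
IMAGE_CODE_OFFSET = 1000

def is_image_token(token_id):
    return token_id >= IMAGE_CODE_OFFSET

def generate(model, prompt_tokens, max_new_tokens):
    """One loop serves understanding, generation, and editing: the model
    just keeps predicting the next token ID, and whether that ID is a
    word or an image-patch code is decided by its numeric range."""
    seq = list(prompt_tokens)
    for _ in range(max_new_tokens):
        seq.append(model(seq))  # append the predicted next token
    return seq

class DummyModel:
    """A stand-in 'model' that replays a fixed continuation,
    just so the loop above can run end to end."""
    def __init__(self, continuation):
        self.continuation = list(continuation)
    def __call__(self, seq):
        return self.continuation.pop(0)

# Mixed continuation: two text tokens followed by two image codes.
model = DummyModel([5, 7, 1003, 1042])
out = generate(model, [1, 2], max_new_tokens=4)
modalities = ["image" if is_image_token(t) else "text" for t in out[2:]]
```

The same `generate` loop would handle a caption (all text tokens), a drawing (all image codes), or an edit (image codes conditioned on an input image) without any change to its logic.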

2. The Problem: The "Translator" Conflict

In the past, trying to make one AI do both "understanding" (reading a picture) and "generating" (drawing a picture) was like trying to speak two languages at once.

  • Understanding needs to see the "big picture" (Is that a dog or a wolf?).
  • Generating needs to see the tiny details (Where exactly is the fur texture?).

Usually, these two goals pull in opposite directions. Force them to share one set of eyes, and the AI gets worse at both.

Wallaroo's Solution: They built a two-lane highway.

  • Lane A (Understanding): Uses a special camera lens (called NaViT) to look at the image and understand the story.
  • Lane B (Generation): Uses a different tool (a VQ Tokenizer) that breaks the image down into tiny puzzle pieces (codes) to rebuild it.

These lanes run side-by-side but don't crash into each other. They only meet in the middle (the main brain) to decide what to do next.
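
A rough sketch of the two-lane idea, in Python. The names NaViT and VQ tokenizer come from the paper, but everything below is a toy stand-in (patch averaging instead of a real vision transformer, a random 8-entry codebook instead of a trained one); the shapes and numbers are illustrative assumptions. What it shows is the split itself: Lane A produces continuous features for "looking", while Lane B produces discrete code indices for "drawing".

```python
import numpy as np

rng = np.random.default_rng(0)

def understanding_encoder(image):
    """Lane A (stand-in for a NaViT-style encoder): continuous semantic
    features. Here we just average each 16x16 patch of a 64x64 image."""
    patches = image.reshape(4, 16, 4, 16, 3).mean(axis=(1, 3))  # 4x4 grid
    return patches.reshape(-1, 3)  # (16, 3) continuous feature vectors

# Toy VQ codebook: 8 random "puzzle piece" prototypes.
CODEBOOK = rng.random((8, 3))

def generation_tokenizer(image):
    """Lane B (stand-in for a VQ tokenizer): discrete codes. Each patch
    is snapped to the index of its nearest codebook entry."""
    feats = understanding_encoder(image)  # reuse the patching for simplicity
    dists = ((feats[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # (16,) integer code indices

image = rng.random((64, 64, 3))
features = understanding_encoder(image)  # continuous -> fed to the brain for "looking"
codes = generation_tokenizer(image)      # discrete  -> predicted one by one for "drawing"
```

The two lanes never mix their representations; only the central model ("the main brain") consumes both, which is what lets each lane stay good at its own job.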

3. The Training: A Four-Step School Curriculum

You can't just turn on a supercomputer and expect it to know everything. Wallaroo went to school in four specific stages:

  1. Kindergarten (Preliminary Alignment): The AI learns how to talk about images and how to draw simple shapes. It's just getting comfortable with the new tools.
  2. Elementary School (Joint Pretraining): It studies hard. It reads millions of picture descriptions and draws millions of pictures. It learns to switch between "looking" and "drawing" without getting dizzy.
  3. High School (Multi-Resolution): This is the tricky part. The AI learns to handle images of all sizes. It learns that a tiny 100x100 pixel image is different from a huge 1000x1000 pixel image. It learns to say, "Okay, I need to draw a small square here, or a big rectangle there."
  4. Graduate School (Unified Fine-Tuning): Finally, it learns the art of editing. It learns how to take a photo of a beach, erase the sand, and replace it with a pool, all while keeping the sky the same.
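
The four stages above can be written down as a training schedule. The stage names follow the paper, but the task mixes and resolutions below are illustrative placeholders, not the paper's actual hyperparameters. The sketch captures the curriculum's shape: each stage starts from the previous stage's checkpoint, resolutions grow in stage 3, and editing only appears in the final stage.

```python
# Hedged sketch of the four-stage curriculum; task lists and resolutions
# are assumptions for illustration, not the paper's real settings.
CURRICULUM = [
    {"stage": "preliminary_alignment",
     "tasks": ["captioning", "text_to_image"],
     "resolutions": [(256, 256)]},
    {"stage": "joint_pretraining",
     "tasks": ["captioning", "vqa", "text_to_image"],
     "resolutions": [(256, 256)]},
    {"stage": "multi_resolution",
     "tasks": ["captioning", "vqa", "text_to_image"],
     "resolutions": [(256, 256), (512, 512), (1024, 1024)]},
    {"stage": "unified_fine_tuning",
     "tasks": ["captioning", "vqa", "text_to_image", "image_editing"],
     "resolutions": [(256, 256), (512, 512), (1024, 1024)]},
]

def run_curriculum(train_one_stage):
    """Run the stages in order; each stage continues from the
    checkpoint produced by the one before it."""
    completed = []
    for cfg in CURRICULUM:
        train_one_stage(cfg)
        completed.append(cfg["stage"])
    return completed

stages = run_curriculum(lambda cfg: None)  # no-op trainer, just walks the schedule
```

Keeping the schedule as plain data like this also makes the curriculum easy to inspect: you can check at a glance which skills are introduced at which stage.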

4. Why is this a Big Deal?

  • Simplicity: Previous models were like Swiss Army knives with 50 different tools that were hard to use. Wallaroo is like a simple, sharp knife that does everything well because it's designed for one specific job: predicting the next step.
  • Bilingual: It speaks both English and Chinese fluently.
  • Versatile: It can handle any image size, from a tiny thumbnail to a giant poster.

5. The Catch (The "Pixelated" Reality)

The paper admits one weakness. Because Wallaroo builds images by guessing "puzzle pieces" (codes) rather than painting with smooth brushstrokes (like diffusion models do), the images can sometimes look a tiny bit blocky or lose fine details.

The Analogy:

  • Diffusion Models are like a painter mixing colors on a palette to create a smooth, realistic sunset.
  • Wallaroo is like a mosaic artist placing tiny tiles one by one to create the sunset. It's incredibly fast and flexible, but if you look very closely, you might see the individual tiles.
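
The mosaic analogy can be shown in a few lines of Python. This is a deliberately tiny illustration, not the model's actual tokenizer: the five-value "palette" below plays the role of a small VQ codebook. With only a few tiles available, nearby-but-different pixel values get snapped to the same tile, which is exactly how fine detail gets lost.

```python
# A 5-entry "mosaic tile" palette standing in for a VQ codebook
# (values are arbitrary, chosen only for this illustration).
PALETTE = [0.0, 0.25, 0.5, 0.75, 1.0]

def quantize(x):
    """Snap a pixel value to its nearest tile in the palette."""
    return min(PALETTE, key=lambda tile: abs(tile - x))

pixels = [0.10, 0.11, 0.12, 0.60]
tiles = [quantize(p) for p in pixels]
# The three slightly different values 0.10/0.11/0.12 all map to the same
# tile (0.0), so the subtle gradient between them vanishes on reconstruction.
```

A real codebook has thousands of entries rather than five, so the effect is far subtler, but the mechanism behind the occasional blockiness is the same.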

The Bottom Line

Wallaroo proves that you don't need complex, messy systems to build a "General AI" that can see, draw, and edit. Sometimes, the simplest approach—just predicting the next piece of the puzzle—is the most powerful. It's a major step toward an AI that can truly understand our world and create new ones with a single, unified mind.