A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction
The paper introduces Wallaroo, a simple autoregressive model that unifies multi-modal understanding, image generation, and editing through next-token prediction. It supports multiple resolutions and has bilingual capabilities, and it achieves competitive performance using decoupled visual encoding and a four-stage training strategy.
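To make the next-token prediction objective concrete, here is a minimal sketch of how text and image tokens could share one sequence and one cross-entropy loss. It is not the paper's implementation: the vocabulary sizes, the id-offset scheme, and the names `build_sequence` and `next_token_loss` are assumptions chosen for illustration.

```python
import math

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 8    # number of text token ids
IMAGE_VOCAB = 4   # number of discrete image-code ids


def build_sequence(text_ids, image_ids):
    """Concatenate text and image tokens into one shared-vocabulary sequence."""
    # Image codes are shifted past the text vocabulary so ids do not collide.
    return list(text_ids) + [TEXT_VOCAB + i for i in image_ids]


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def next_token_loss(logits, tokens):
    """Mean cross-entropy where logits[t] predicts tokens[t + 1]."""
    assert len(logits) == len(tokens) - 1
    total = 0.0
    for step_scores, target in zip(logits, tokens[1:]):
        total -= log_softmax(step_scores)[target]
    return total / len(logits)


# Three text tokens followed by two image codes.
tokens = build_sequence([1, 5, 2], [0, 3])

# Stand-in for model outputs: uniform scores over the joint vocabulary.
uniform_logits = [[0.0] * (TEXT_VOCAB + IMAGE_VOCAB) for _ in tokens[:-1]]
loss = next_token_loss(uniform_logits, tokens)
```

In this sketch the same loss applies at every position regardless of modality, which is the sense in which a single next-token objective can cover understanding, generation, and editing data.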