How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
This paper introduces UniLongGen, a training-free inference strategy for long-horizon interleaved image generation in unified multimodal models. It dynamically curates the generation context to discard accumulated visual noise, countering the reliability collapse that dense visual token interference causes as the interleaved sequence grows.
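To make the idea of context curation concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a hypothetical interleaved history of text and image items, where each image carries a hypothetical relevance score for the next generation step; how such a score would actually be computed is not specified in the summary above. The names `Item`, `score`, `curate_context`, and `max_images` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """One entry in a hypothetical interleaved text/image history."""
    kind: str           # "text" or "image"
    content: str        # stand-in for the item's tokens
    score: float = 0.0  # assumed relevance to the next step (images only)


def curate_context(history: List[Item], max_images: int) -> List[Item]:
    """Keep every text item and only the `max_images` highest-scoring images.

    The surviving items stay in their original order, so the curated
    context is still a valid interleaved sequence.
    """
    image_indices = [i for i, item in enumerate(history) if item.kind == "image"]
    # Rank images by the assumed relevance score and keep the top ones.
    ranked = sorted(image_indices, key=lambda i: history[i].score, reverse=True)
    keep = set(ranked[:max_images])
    return [
        item
        for i, item in enumerate(history)
        if item.kind == "text" or i in keep
    ]


history = [
    Item("text", "t1"),
    Item("image", "i1", score=0.2),
    Item("text", "t2"),
    Item("image", "i2", score=0.9),
    Item("image", "i3", score=0.5),
]
curated = curate_context(history, max_images=2)
print([item.content for item in curated])
```

The sketch drops the lowest-scoring image (`i1`) while leaving all text in place, which mirrors the stated goal of discarding accumulated visual tokens without any retraining. A fixed image budget is only one possible policy; a threshold on the score or a recency rule would fit the same interface.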