InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
The paper introduces InternVL-U, a lightweight 4B-parameter unified multimodal model that democratizes advanced understanding, reasoning, generation, and editing capabilities by employing a modular architecture and a reasoning-centric data synthesis pipeline, achieving superior performance-efficiency balance that outperforms significantly larger baselines like BAGEL.