Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?
This paper introduces VGUBench to show that Unified Multimodal Large Language Models, despite strong textual reasoning and visual rendering capabilities in isolation, fail to maintain semantic equivalence when required to produce visual answers. This failure reveals a critical breakdown in cross-modal semantic alignment rather than a lack of generation fidelity.