The Big Idea: The "Bilingual" Artist Who Forgets What They Said
Imagine you hire a brilliant artist who is also a genius logician. You call them a "Unified Multimodal Model." Their job is to do two things at once:
- Understand a question (like a detective).
- Create an answer (like a painter).
The big promise of these new AI models is that they are "unified." This means the same brain that solves the puzzle should also be able to paint the solution. Theoretically, if they solve a math problem in their head, they should be able to write that answer on a canvas perfectly.
The Problem: The researchers in this paper found that these artists are suffering from a severe case of amnesia.
They are great at talking about the answer, but when asked to draw the answer, they forget what they just figured out. They might draw a picture of a cat when the answer was "2 + 2 = 4," or they might draw gibberish that looks like text but isn't.
The New Test: VGUBench (The "Truth-O-Meter")
To prove this, the researchers built a special test called VGUBench. Think of it as a three-part exam designed to catch the model lying to itself.
Here is how the test works, using the analogy of a Chef:
1. The Text Test (TGU): "The Recipe Book"
- The Task: The Chef is asked, "What happens when you mix red and blue paint?"
- The Result: The Chef writes a perfect sentence: "You get purple."
- The Verdict: The Chef is smart! They understand the logic perfectly.
2. The Rendering Test (Render): "The Copy Machine"
- The Task: The Chef is not asked to think. They are just handed a piece of paper that says "Purple" and told, "Please write this word on a blackboard."
- The Result: The Chef writes "Purple" on the board. It's a bit messy, but you can read it.
- The Verdict: The Chef has decent handwriting skills. They can turn text into an image if they don't have to think too hard.
3. The Visual Understanding Test (VGU): "The Magic Canvas"
- The Task: The Chef is asked the original question again: "What happens when you mix red and blue paint?" But this time, they must draw the answer on a canvas.
- The Result: The Chef draws a picture of a purple elephant, or writes "Purplz," or draws a mess that looks like a cat.
- The Verdict: Catastrophic Failure. Even though the Chef knew the answer in Step 1 and could write the word in Step 2, they completely forgot the answer when asked to paint it.
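The three tracks above can be sketched as a tiny data structure. The track names (TGU, Render, VGU) come from the benchmark as described here; the prompts, field names, and layout are purely illustrative assumptions, not the paper's actual evaluation code:

```python
# Sketch of the three VGUBench-style tracks. Key design point: TGU and
# VGU ask the SAME question -- only the required output modality changes.
QUESTION = "What happens when you mix red and blue paint?"

tracks = {
    # Reason about the question, answer in text.
    "TGU": {"prompt": QUESTION, "output": "text"},
    # No reasoning needed: just copy a given word into an image.
    "Render": {"prompt": "Write the word 'Purple'.", "output": "image"},
    # Reason about the question, answer by generating an image.
    "VGU": {"prompt": QUESTION, "output": "image"},
}

for name, spec in tracks.items():
    print(f"{name}: answer as {spec['output']} -> {spec['prompt']}")
```

Comparing TGU and VGU scores isolates the modality switch: any score drop between them cannot be blamed on the question, because the question is identical.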
The Big Discovery: It's Not a Handwriting Problem
The researchers wanted to know why this happens. They suspected the models just had bad "handwriting" (bad image generation skills).
So, they ran a correlation test. They compared how well the models did at Step 2 (just copying given text into an image) vs. Step 3 (answering a question by generating an image).
The Shocking Result: There was zero connection between the two.
- Some models had great handwriting (Step 2) but still failed the logic test (Step 3).
- Some models had okay handwriting but failed the logic test even harder.
The Analogy: Imagine a student who can perfectly copy a sentence from a textbook onto a piece of paper. But when you ask them to solve a math problem and write the answer, they write the wrong number.
- Old Theory: "The student has bad handwriting."
- New Finding: "No, the student's handwriting is fine. The problem is that their brain disconnects the moment they switch from thinking to drawing."
Why This Matters
This paper reveals a "blind spot" in how we test AI.
- Current Tests: We test if the AI can read a picture (Understanding) and if it can draw a picture (Generation) separately.
- The Assumption: If an AI is good at both, it must be "Unified."
- The Truth: Being "Unified" in architecture doesn't mean the AI is "Unified" in its brain. It's like having a car with a steering wheel and an engine connected by a rubber band. When you turn the wheel, the engine doesn't always know to move.
The Takeaway
The paper concludes that current AI models are not truly unified. They have a "semantic gap." They can reason in text, and they can draw images, but they cannot reason in images.
If you ask them to explain a complex idea using a picture, they will likely hallucinate nonsense. To build truly intelligent AI, we need to fix this "disconnect" so that the model's logic stays consistent, whether it's speaking, writing, or painting.