Imagine you are talking to a very smart, futuristic robot assistant. In the real world, you don't just talk to it with words. You might show it a photo of a broken engine, play a recording of the strange noise it's making, hand it a PDF manual, and ask it to write some code to fix the part, all in one conversation.
Currently, most AI assistants are like people who only speak one language or can only handle one type of object at a time. They might understand text well but get confused if you throw a video or a 3D model at them. They struggle to weave these different things together into a single, coherent story.
This paper introduces UNIM, a new "exam" and a new "student" designed to fix that.
1. The Problem: The "Jigsaw Puzzle" of Reality
Think of the real world as a giant, messy jigsaw puzzle where the pieces are different shapes and materials: some are words, some are pictures, some are sounds, some are 3D objects.
- Old AI: Tries to solve the puzzle by only looking at the blue pieces (text) or only the square pieces (images). It can't see how the sound piece fits with the 3D piece.
- The Goal: We need an AI that can look at a pile of mixed-up pieces (text, video, audio, code, 3D models) and instantly understand how they all fit together to solve a problem.
2. The Solution: The UNIM Benchmark (The "Ultimate Exam")
The authors created UNIM, the first large, high-quality test that forces an AI to handle any mix of inputs and outputs.
- The Dataset: They gathered 31,000 complex questions. These aren't simple "What is this?" questions. They are like: "Here is a video of a car crash, an audio recording of the impact, a 3D scan of the damage, and a legal document. Based on all of these, write a repair plan and generate a new video showing the fix."
- The Variety: It covers 7 types of "ingredients": Text, Images, Audio, Video, Documents, Code, and 3D models.
- The Difficulty: The exam has three levels (Easy, Medium, Hard). The "Hard" level requires the AI to do deep reasoning, like a detective connecting clues from a video, a voice note, and a blueprint simultaneously.
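To make the shape of one exam question concrete, here is a toy sketch of how such a benchmark item might be represented. The class and field names are invented for illustration and are not the paper's actual data format; the seven modalities are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum

class Modality(Enum):
    """The 7 'ingredients' the benchmark covers."""
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"
    CODE = "code"
    MODEL_3D = "3d"

@dataclass
class BenchmarkItem:
    question: str
    difficulty: str                      # "easy" | "medium" | "hard"
    inputs: list[Modality]               # what the AI is shown
    expected_outputs: list[Modality]     # what the answer must contain

# The car-crash example from above, as one hypothetical item:
item = BenchmarkItem(
    question="Based on all of these, write a repair plan and "
             "generate a new video showing the fix.",
    difficulty="hard",
    inputs=[Modality.VIDEO, Modality.AUDIO,
            Modality.MODEL_3D, Modality.DOCUMENT],
    expected_outputs=[Modality.TEXT, Modality.VIDEO],
)
```

Note that a single item can demand several output types at once, which is exactly what trips up single-modality models.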
3. The Grading System: The "Three-Legged Stool"
How do you grade an AI that gives you a mix of text, a video, and a song? You can't just check if the answer is "right" or "wrong." The authors invented a new grading system with three legs:
- Did it make sense? (Semantic Correctness): If the AI says "The car is red" but the video shows a blue car, it fails.
- Did it follow the rules? (Structure Integrity): If the question asked for two images and one audio file, and the AI gave you three images and no audio, it fails, even if the content was good.
- Did it flow well? (Interleaved Coherence): This is the most important one. Imagine a story where the sentences are interrupted by random, unrelated pictures. That's bad. The AI needs to weave the text and media together so smoothly that it feels like a natural conversation.
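Of the three legs, structure integrity is the easiest to picture as code: count what was asked for, count what was delivered, and demand an exact match. This is a minimal sketch of that idea, not the paper's actual scoring code.

```python
from collections import Counter

def structure_integrity(required: dict[str, int], produced: list[str]) -> bool:
    """Pass only if the answer contains exactly the requested number
    of each modality -- no extras, nothing missing."""
    return Counter(produced) == Counter(required)

# The question asked for two images and one audio file:
required = {"image": 2, "audio": 1}

structure_integrity(required, ["image", "image", "audio"])  # True: exact match
structure_integrity(required, ["image", "image", "image"])  # False: wrong mix
```

Semantic correctness and interleaved coherence are much harder to automate and typically need a judge model rather than a counter.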
4. The Star Student: UNIMA
To prove this exam is hard, the authors built their own AI model called UNIMA to take the test.
- How it works: Instead of just guessing, UNIMA acts like a project manager.
- Step 1: It reads the messy inputs and takes notes (creating a "dense caption").
- Step 2: It plans the answer. It asks, "Do I need to do math? Do I need to write code? Do I need to generate a video?"
- Step 3: It double-checks its own work. It asks, "Did I include the right number of images? Did I mix the audio with the right part of the text?"
- Step 4: It builds the final answer, piece by piece.
- The Result: While other well-known AI models (like AnyGPT or NExT-GPT) scored very low (often failing even to include the right number of images), UNIMA scored much higher. It proved that with the right "thinking process," an AI can handle this chaotic, mixed-media world.
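The four-step "project manager" loop above can be sketched in a few lines. Everything here is a toy stand-in: the function names are invented, and the real model would use learned components instead of these placeholder rules.

```python
def dense_caption(inputs: list[tuple[str, str]]) -> str:
    # Step 1: take notes -- summarize every input into one text description.
    return " ".join(f"[{modality}] {desc}" for modality, desc in inputs)

def plan(question: str) -> list[str]:
    # Step 2: decide which output modalities the answer needs.
    # (Toy rule: always answer in text; add a video if one was requested.)
    needed = ["text"]
    if "video" in question.lower():
        needed.append("video")
    return needed

def generate(modality: str, notes: str) -> tuple[str, str]:
    # Step 4's building block: produce one placeholder artifact.
    return (modality, f"<generated {modality} based on: {notes[:30]}...>")

def verify(outputs: list[tuple[str, str]], planned: list[str]) -> bool:
    # Step 3: double-check -- was every planned modality actually produced?
    return [modality for modality, _ in outputs] == planned

def answer(question: str, inputs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    notes = dense_caption(inputs)                      # Step 1
    planned = plan(question)                           # Step 2
    outputs = [generate(m, notes) for m in planned]    # Step 4, piece by piece
    assert verify(outputs, planned), "self-check failed"  # Step 3
    return outputs

result = answer(
    "Write a repair plan and generate a new video showing the fix.",
    [("video", "crash footage"), ("audio", "impact sound")],
)
```

The key design choice is that verification happens before the answer is returned, so a malformed response (wrong count or mix of media) is caught by the model itself rather than by the grader.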
The Big Picture
Think of the current state of AI as a chef who can only cook soup. They are great at soup (text-to-text or text-to-image). But the real world is a five-course banquet where you need to serve soup, steak, a salad, a dessert, and a drink, all at the same time, and they all need to taste good together.
UNIM is the new kitchen that forces chefs to learn how to cook the whole banquet. UNIMA is the first chef who actually learned the recipe. This paper shows us that while current AI is still struggling with the full banquet, we now have a map (the benchmark) and a prototype (the model) to get there.