KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

This paper introduces KMMMU, a comprehensive native Korean benchmark of 3,466 multimodal exam questions across nine disciplines. It reveals significant performance gaps in current AI models, driven by challenges in understanding local conventions, standards, and domain-specific knowledge rather than by limits in reasoning depth.

Nahyun Lee, Guijin Son, Hyunwoo Ko, Chanyoung Kim, JunYoung An, Kyubeen Han, Il-Youp Kwak

Published 2026-04-16

Imagine you are trying to test how smart a group of robots is. You've been giving them tests in English, and they are getting pretty good scores. They can read signs, look at pictures, and solve math problems. But now, you want to see if they can handle a test written in Korean, specifically one that deals with Korean laws, local customs, and technical exams that only exist in South Korea.

The paper introduces KMMMU, which is exactly that: a new, tough exam designed specifically to test AI models on their ability to understand the Korean world, not just the English one.

Here is the story of the paper, broken down into simple concepts:

1. The Problem: The "Translated" Trap

Think of previous AI tests like translated menus. If you take a menu from a French restaurant and translate it into English, you might get the words right, but you lose the flavor. You might miss that "escargot" is a specific cultural dish, or that the portion sizes are different.

Most AI tests today are like that. They are either written in English or translated from English. This means the AI is just recognizing patterns it already knows. It hasn't actually learned how to navigate the specific rules, laws, and visual styles of a Korean office, a Korean engineering blueprint, or a Korean legal document.

KMMMU is the "authentic local menu." It wasn't translated; it was written natively in Korean using real exams from Korean civil service tests, engineering certifications, and university Olympiads.

2. The Exam: A "Giant Jigsaw Puzzle"

The researchers gathered 3,466 questions. Imagine a giant jigsaw puzzle where every piece is a different type of challenge:

  • The Subjects: It covers 9 different fields, from Engineering and Law to Art and Design.
  • The Visuals: It's not just text. The AI has to look at circuit diagrams, maps, thermal camera photos, and complex tables.
  • The "Korean-Only" Pieces: There is a special section of 300 questions that are impossible to answer without knowing specific Korean laws (like how to define a "small vehicle" under Korean traffic rules) or cultural context.

3. The Results: The Robots Hit a Wall

The researchers tested the smartest AI models available (both free open-source ones and expensive "pro" ones) on this exam.

  • The Score: Even the best AI scored only about 52% on the hardest questions. That's a failing grade for a high schooler, let alone a "super-intelligent" robot.
  • The Gap: When the questions required specific Korean knowledge (like local laws), the AI's performance dropped significantly. It's like a tourist who knows how to order coffee in Seoul but gets lost when asked to fill out a tax form.
  • The Bottlenecks: The AI struggled most in Law & Ethics and Arts & Design. Why? Because these fields rely on memorizing very specific, rigid rules and labels that don't exist in the general "world knowledge" the AI learned from the internet.
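The headline numbers above are ordinary accuracy scores, broken down by discipline. As a minimal sketch of how such per-discipline scores are computed (the discipline names and the toy results here are illustrative, not data from the paper):

```python
from collections import defaultdict

# Hypothetical per-question grading results: (discipline, was_the_model_correct).
# These values are made up for illustration only.
results = [
    ("Law & Ethics", False),
    ("Law & Ethics", False),
    ("Law & Ethics", True),
    ("Engineering", True),
    ("Engineering", True),
    ("Arts & Design", False),
]

def accuracy_by_discipline(results):
    """Group correctness flags by discipline and return the accuracy of each group."""
    buckets = defaultdict(list)
    for discipline, correct in results:
        buckets[discipline].append(correct)
    return {d: sum(flags) / len(flags) for d, flags in buckets.items()}

scores = accuracy_by_discipline(results)
overall = sum(correct for _, correct in results) / len(results)
print(scores)   # per-discipline accuracy, e.g. {"Engineering": 1.0, ...}
print(overall)  # overall accuracy across all questions: 0.5
```

Reporting both an overall score and per-discipline scores is what lets the authors spot the specific bottlenecks (Law & Ethics, Arts & Design) rather than just a single average.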

4. Why Did They Fail? (The "Why" Behind the Score)

The researchers looked at how the AI failed, and it wasn't because the robots were "dumb." They failed for three specific reasons:

  • The "Dictionary" Problem: The AI could read the Korean words, but it didn't know the official definition.
    • Analogy: Imagine a robot sees a picture of a car. It knows it's a "car." But the Korean law says, "If it has an engine between 1000cc and 1600cc, it is a 'Small Vehicle' and has a different tax rate." The AI sees "car" but misses the specific legal label "Small Vehicle." It's like knowing what a "dog" is, but not knowing the specific breed required for a dog show.
  • The "Pattern" Problem: Some questions asked the AI to figure out a secret rule from a few examples (like a logic puzzle).
    • Analogy: If you show a robot three pictures of a "happy" face and one "sad" face, and ask it to guess the rule, it might guess "smiles mean happy." But if the rule is actually "blue eyes mean happy," the robot gets confused because it's guessing based on what it thinks is common, not the specific rule in front of it.
  • The "Translation" Noise: Sometimes, the AI tried to translate the Korean question into English in its head to solve it, and in doing so, it lost the nuance.
    • Analogy: It's like trying to solve a riddle written in a dialect you don't speak by translating it word-for-word into your native language. The joke falls flat because the cultural context is lost.

5. The "Hard" Mode

The researchers also created a "Hard Subset" of 627 questions that even the smartest models got wrong. They wanted to see if the AI could learn from its mistakes.

  • The Result: Even with "thinking" models (AI that talks to itself before answering), they didn't get much better. This proves that the problem isn't that the AI isn't "thinking hard enough." The problem is that it doesn't have the right information (the local rules) or can't map the visual clues to the right labels.

The Big Takeaway

This paper is a wake-up call. It tells us that being good at English and general science doesn't make an AI an expert in a specific country.

If we want AI to help doctors, lawyers, and engineers in Korea, we can't just translate English tests. We need to build systems that understand the local culture, the specific laws, and the unique visual language of that country. KMMMU is the first step in building that bridge, ensuring that future AI isn't just a "global tourist," but a "local expert."
