KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

This paper introduces KMMMU, a comprehensive native Korean benchmark of 3,466 multimodal exam questions across nine disciplines. It reveals significant performance gaps in current AI models, driven by challenges in understanding local conventions, standards, and domain-specific knowledge rather than by limits in reasoning depth.

Nahyun Lee, Guijin Son, Hyunwoo Ko, Chanyoung Kim, JunYoung An, Kyubeen Han, Il-Youp Kwak

Published 2026-04-16

Imagine you are trying to test how smart a group of robots is. You've been giving them tests in English, and they are getting pretty good scores. They can read signs, look at pictures, and solve math problems. But now, you want to see if they can handle a test written in Korean, specifically one that deals with Korean laws, local customs, and technical exams that only exist in South Korea.

The paper introduces KMMMU, which is exactly that: a new, tough exam designed specifically to test AI models on their ability to understand the Korean world, not just the English one.

Here is the story of the paper, broken down into simple concepts:

1. The Problem: The "Translated" Trap

Think of previous AI tests like translated menus. If you take a menu from a French restaurant and translate it into English, you might get the words right, but you lose the flavor. You might miss that "escargot" is a specific cultural dish, or that the portion sizes are different.

Most AI tests today are like that. They are either written in English or translated from English. This means the AI is just recognizing patterns it already knows. It hasn't actually learned how to navigate the specific rules, laws, and visual styles of a Korean office, a Korean engineering blueprint, or a Korean legal document.

KMMMU is the "authentic local menu." It wasn't translated; it was written natively in Korean using real exams from Korean civil service tests, engineering certifications, and university Olympiads.

2. The Exam: A "Giant Jigsaw Puzzle"

The researchers gathered 3,466 questions. Imagine a giant jigsaw puzzle where every piece is a different type of challenge:

  • The Subjects: It covers 9 different fields, from Engineering and Law to Art and Design.
  • The Visuals: It's not just text. The AI has to look at circuit diagrams, maps, thermal camera photos, and complex tables.
  • The "Korean-Only" Pieces: There is a special section of 300 questions that are impossible to answer without knowing specific Korean laws (like how to define a "small vehicle" under Korean traffic rules) or cultural context.

3. The Results: The Robots Hit a Wall

The researchers tested the smartest AI models available (both free open-source ones and expensive "pro" ones) on this exam.

  • The Score: Even the best AI scored only about 52% on the hardest questions. That's a failing grade for a high schooler, let alone a "super-intelligent" robot.
  • The Gap: When the questions required specific Korean knowledge (like local laws), the AI's performance dropped significantly. It's like a tourist who knows how to order coffee in Seoul but gets lost when asked to fill out a tax form.
  • The Bottlenecks: The AI struggled most in Law & Ethics and Arts & Design. Why? Because these fields rely on memorizing very specific, rigid rules and labels that don't exist in the general "world knowledge" the AI learned from the internet.
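The headline numbers above are ordinary accuracy scores, broken down by discipline. As a minimal sketch of how such per-discipline scores are computed (the discipline names and the toy results here are illustrative, not data from the paper):

```python
from collections import defaultdict

# Hypothetical per-question grading results: (discipline, was_the_model_correct).
# These values are made up for illustration only.
results = [
    ("Law & Ethics", False),
    ("Law & Ethics", False),
    ("Law & Ethics", True),
    ("Engineering", True),
    ("Engineering", True),
    ("Arts & Design", False),
]

def accuracy_by_discipline(results):
    """Group correctness flags by discipline and return the accuracy of each group."""
    buckets = defaultdict(list)
    for discipline, correct in results:
        buckets[discipline].append(correct)
    return {d: sum(flags) / len(flags) for d, flags in buckets.items()}

scores = accuracy_by_discipline(results)
overall = sum(correct for _, correct in results) / len(results)
print(scores)   # per-discipline accuracy, e.g. {"Engineering": 1.0, ...}
print(overall)  # overall accuracy across all questions: 0.5
```

Reporting both an overall score and per-discipline scores is what lets the authors spot the specific bottlenecks (Law & Ethics, Arts & Design) rather than just a single average.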

4. Why Did They Fail? (The "Why" Behind the Score)

The researchers looked at how the AI failed, and it wasn't because the robots were "dumb." They failed for three specific reasons:

  • The "Dictionary" Problem: The AI could read the Korean words, but it didn't know the official definition.
    • Analogy: Imagine a robot sees a picture of a car. It knows it's a "car." But the Korean law says, "If it has an engine between 1000cc and 1600cc, it is a 'Small Vehicle' and has a different tax rate." The AI sees "car" but misses the specific legal label "Small Vehicle." It's like knowing what a "dog" is, but not knowing the specific breed required for a dog show.
  • The "Pattern" Problem: Some questions asked the AI to figure out a secret rule from a few examples (like a logic puzzle).
    • Analogy: If you show a robot three pictures of a "happy" face and one "sad" face, and ask it to guess the rule, it might guess "smiles mean happy." But if the rule is actually "blue eyes mean happy," the robot gets confused because it's guessing based on what it thinks is common, not the specific rule in front of it.
  • The "Translation" Noise: Sometimes, the AI tried to translate the Korean question into English in its head to solve it, and in doing so, it lost the nuance.
    • Analogy: It's like trying to solve a riddle written in a dialect you don't speak by translating it word-for-word into your native language. The joke falls flat because the cultural context is lost.

5. The "Hard" Mode

The researchers also created a "Hard Subset" of 627 questions that even the smartest models got wrong. They wanted to see if the AI could learn from its mistakes.

  • The Result: Even with "thinking" models (AI that talks to itself before answering), they didn't get much better. This proves that the problem isn't that the AI isn't "thinking hard enough." The problem is that it doesn't have the right information (the local rules) or can't map the visual clues to the right labels.

The Big Takeaway

This paper is a wake-up call. It tells us that being good at English and general science doesn't make an AI an expert in a specific country.

If we want AI to help doctors, lawyers, and engineers in Korea, we can't just translate English tests. We need to build systems that understand the local culture, the specific laws, and the unique visual language of that country. KMMMU is the first step in building that bridge, ensuring that future AI isn't just a "global tourist," but a "local expert."
