HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

This paper introduces HSSBench, a comprehensive multilingual benchmark of more than 13,000 samples built through a novel expert-agent collaboration pipeline. It is designed to evaluate, and ultimately help address, the current limitations of Multimodal Large Language Models on the interdisciplinary, abstract reasoning tasks characteristic of the Humanities and Social Sciences.

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li

Published 2026-03-04

Imagine you've built a brilliant robot student. You've tested it on math problems, science experiments, and coding challenges, and it's aced them all. It can solve complex equations faster than a calculator and write code without typos. You think, "This robot is a genius!"

But then, you hand it a picture of an old, dusty painting and ask, "What is this artist trying to tell us about the French Revolution?" Or you show it a graph of a company's profits and ask, "Why did this business fail, and what does it say about the economy of that time?"

Suddenly, the robot freezes. It might guess the wrong answer, make up facts, or stare blankly. It's like a brilliant chess player who suddenly can't understand a poem.

This is the problem HSSBench solves.

The Problem: The "STEM vs. Humanities" Gap

For a long time, we've been testing AI on STEM (Science, Technology, Engineering, Math). These subjects are like climbing a ladder: you go step-by-step, A leads to B, which leads to C, until you reach the single correct answer at the top. AI is great at climbing ladders.

But Humanities and Social Sciences (HSS)—like history, art, culture, and economics—are more like navigating a foggy forest. There isn't just one path. To understand a historical event or a piece of art, you need to connect dots across different fields, understand human emotions, cultural context, and "read between the lines." It requires horizontal thinking (connecting many different things) rather than vertical thinking (climbing one straight line).

Current AI models are like expert ladder-climbers dropped into a forest; they get lost because they don't know how to wander and connect the dots.

The Solution: HSSBench (The "Forest Map")

The authors of this paper created a new test called HSSBench. Think of it as a massive, multi-language map of that foggy forest, designed specifically to see if AI can actually navigate it.

  • The Size: It's huge. It contains over 13,000 questions covering six major areas: Geography, Art, Culture, Social Sciences, History, and Economics.
  • The Languages: It's not just in English. It's in the six official languages of the United Nations (English, Chinese, French, Russian, Spanish, and Arabic), because culture and history look different depending on where you are in the world.
  • The Format: Instead of just text, it uses images. You might see a photo of a rock formation and have to infer its geological age, or look at a painting and interpret the cultural significance of the costumes. (A sketch of what one record might look like follows this list.)
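
To make that format concrete, here's a minimal sketch of what a single benchmark record might look like. The field names below are illustrative assumptions for exposition, not the paper's actual schema:

```python
# Hypothetical illustration of one benchmark record. These field
# names are assumptions for exposition, NOT the paper's actual schema.
sample = {
    "domain": "Art",                      # one of the six HSS areas
    "language": "es",                     # one of the six UN languages
    "image_path": "images/painting_0042.png",
    "question": "What do the costumes in this painting suggest "
                "about the social hierarchy of the period?",
    "question_type": "multiple_choice",   # or "open_ended"
    "choices": ["A rigid class divide", "A merchant revolution",
                "Religious uniformity", "Military rule"],
    "answer": "A",
}
```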

How They Built It: The "Expert Kitchen"

You can't just ask a computer to write these questions; it would be like asking a robot to write a novel about human heartbreak without ever feeling love. The data would be shallow and full of errors.

So, the team built a special "kitchen" to cook up the data:

  1. The Chefs (Human Experts): Real professors and specialists in history, art, and economics gathered raw ingredients (old photos, textbooks, maps).
  2. The Sous-Chefs (AI Agents): They used AI to help chop the ingredients, organize the recipes, and generate draft questions.
  3. The Tasting Committee: Both the human experts and the AI checked every single question to make sure it wasn't a trick, that the answer was correct, and that it genuinely required looking at the picture to solve.

This ensured the test was high-quality and fair.
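
If you like to think in code, here's a minimal sketch of that kitchen as a data pipeline. The helper functions (`draft_question`, `passes_review`) are hypothetical stand-ins, not the authors' actual implementation:

```python
# Minimal sketch of the "expert kitchen" as a pipeline. The helpers
# are hypothetical stand-ins, not the authors' actual code.

def draft_question(material: dict) -> dict:
    """Sous-chef step: an AI agent turns expert-curated material
    (a photo, a map, a textbook excerpt) into a draft Q&A pair.
    A real pipeline would call an LLM here."""
    return {
        "image": material["image"],
        "question": f"What does this {material['topic']} image tell us?",
        "answer": material["expert_notes"],
    }

def passes_review(draft: dict) -> bool:
    """Tasting-committee step: human experts and AI validators jointly
    check that the answer is correct, the question isn't a trick, and
    the image is genuinely required. Toy criterion for this sketch."""
    return bool(draft["answer"]) and "image" in draft

def build_dataset(materials: list[dict]) -> list[dict]:
    """Cook every raw ingredient; keep only drafts that pass review."""
    dataset = []
    for material in materials:        # the chefs supply the ingredients
        draft = draft_question(material)
        if passes_review(draft):
            dataset.append(draft)
    return dataset
```

The key design choice here: nothing enters the dataset on the AI's word alone. Every draft must survive the joint review step before it gets served.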

The Results: The Robot is Still a Toddler in the Forest

When they ran the test on over 20 of the smartest AI models in the world (including giants like GPT-4 and Qwen), the results were surprising: The robots struggled.

  • The Score: Even the best models only got about 40-50% of the answers right. That's a failing grade on most exams!
  • The "Human" Score: Real human experts got over 93% right.
  • The Hardest Part: The models did especially poorly on Economics and Open-ended questions (where you have to explain why without multiple-choice hints).
  • The "Hallucination" Issue: When the models tried to "think step-by-step" (a technique called Chain-of-Thought), they often got worse. It's like when you try to explain a joke to someone, and in trying to explain the logic, you ruin the punchline. The AI got confused by its own reasoning.

Why This Matters

This paper is a wake-up call. It tells us that just because an AI can solve a math problem doesn't mean it "understands" the world.

  • Real-world impact: If we want AI to help doctors, lawyers, historians, or policymakers, it needs to understand human culture and context, not just numbers.
  • The Future: HSSBench gives researchers a target. Now that we have a map of where the AI is failing, we can start building better models that can truly "think" like humans, connecting art, history, and economics into a coherent picture of our world.

In short: HSSBench is the test that finally asks, "Can you understand the human story?" and currently, the answer from our smartest robots is a hesitant, "Not really, not yet."