HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

This paper introduces HSSBench, a comprehensive multilingual benchmark of more than 13,000 samples built through a novel expert-agent collaboration pipeline. It is designed to evaluate, and ultimately help address, the current limitations of Multimodal Large Language Models on the interdisciplinary, abstract reasoning tasks characteristic of the Humanities and Social Sciences.

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li

Published 2026-03-04

Imagine you've built a brilliant robot student. You've tested it on math problems, science experiments, and coding challenges, and it's aced them all. It can solve complex equations faster than a calculator and write code without typos. You think, "This robot is a genius!"

But then, you hand it a picture of an old, dusty painting and ask, "What is this artist trying to tell us about the French Revolution?" Or you show it a graph of a company's profits and ask, "Why did this business fail, and what does it say about the economy of that time?"

Suddenly, the robot freezes. It might guess the wrong answer, make up facts, or stare blankly. It's like a brilliant chess player who suddenly can't understand a poem.

This is the problem HSSBench solves.

The Problem: The "STEM vs. Humanities" Gap

For a long time, we've been testing AI on STEM (Science, Technology, Engineering, Math). These subjects are like climbing a ladder: you go step-by-step, A leads to B, which leads to C, until you reach the single correct answer at the top. AI is great at climbing ladders.

But Humanities and Social Sciences (HSS)—like history, art, culture, and economics—are more like navigating a foggy forest. There isn't just one path. To understand a historical event or a piece of art, you need to connect dots across different fields, understand human emotions, cultural context, and "read between the lines." It requires horizontal thinking (connecting many different things) rather than vertical thinking (climbing one straight line).

Current AI models are like expert ladder-climbers dropped into a forest; they get lost because they don't know how to wander and connect the dots.

The Solution: HSSBench (The "Forest Map")

The authors of this paper created a new test called HSSBench. Think of it as a massive, multi-language map of that foggy forest, designed specifically to see if AI can actually navigate it.

  • The Size: It's huge. It contains over 13,000 questions covering six major areas: Geography, Art, Culture, Social Sciences, History, and Economics.
  • The Languages: It's not just in English. It's in the six official languages of the United Nations (English, Chinese, French, Russian, Spanish, and Arabic), because culture and history look different depending on where you are in the world.
  • The Format: Instead of just text, it uses images. You might see a photo of a rock formation and have to infer its geological age, or look at a painting and interpret the cultural significance of the costumes. (A sketch of what one record might look like follows this list.)
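
To make that format concrete, here's a minimal sketch of what a single benchmark record might look like. The field names below are illustrative assumptions for exposition, not the paper's actual schema:

```python
# Hypothetical illustration of one benchmark record. These field
# names are assumptions for exposition, NOT the paper's actual schema.
sample = {
    "domain": "Art",                      # one of the six HSS areas
    "language": "es",                     # one of the six UN languages
    "image_path": "images/painting_0042.png",
    "question": "What do the costumes in this painting suggest "
                "about the social hierarchy of the period?",
    "question_type": "multiple_choice",   # or "open_ended"
    "choices": ["A rigid class divide", "A merchant revolution",
                "Religious uniformity", "Military rule"],
    "answer": "A",
}
```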

How They Built It: The "Expert Kitchen"

You can't just ask a computer to write these questions; it would be like asking a robot to write a novel about human heartbreak without ever feeling love. The data would be shallow and full of errors.

So, the team built a special "kitchen" to cook up the data:

  1. The Chefs (Human Experts): Real professors and specialists in history, art, and economics gathered raw ingredients (old photos, textbooks, maps).
  2. The Sous-Chefs (AI Agents): They used AI to help chop the ingredients, organize the recipes, and generate draft questions.
  3. The Tasting Committee: Both the human experts and the AI checked every single question to make sure it wasn't a trick, that the answer was correct, and that it genuinely required looking at the picture to solve.

This ensured the test was high-quality and fair.
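
If you like to think in code, here's a minimal sketch of that kitchen as a data pipeline. The helper functions (`draft_question`, `passes_review`) are hypothetical stand-ins, not the authors' actual implementation:

```python
# Minimal sketch of the "expert kitchen" as a pipeline. The helpers
# are hypothetical stand-ins, not the authors' actual code.

def draft_question(material: dict) -> dict:
    """Sous-chef step: an AI agent turns expert-curated material
    (a photo, a map, a textbook excerpt) into a draft Q&A pair.
    A real pipeline would call an LLM here."""
    return {
        "image": material["image"],
        "question": f"What does this {material['topic']} image tell us?",
        "answer": material["expert_notes"],
    }

def passes_review(draft: dict) -> bool:
    """Tasting-committee step: human experts and AI validators jointly
    check that the answer is correct, the question isn't a trick, and
    the image is genuinely required. Toy criterion for this sketch."""
    return bool(draft["answer"]) and "image" in draft

def build_dataset(materials: list[dict]) -> list[dict]:
    """Cook every raw ingredient; keep only drafts that pass review."""
    dataset = []
    for material in materials:        # the chefs supply the ingredients
        draft = draft_question(material)
        if passes_review(draft):
            dataset.append(draft)
    return dataset
```

The key design choice here: nothing enters the dataset on the AI's word alone. Every draft must survive the joint review step before it gets served.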

The Results: The Robot is Still a Toddler in the Forest

When they ran the test on over 20 of the smartest AI models in the world (including giants like GPT-4 and Qwen), the results were surprising: The robots struggled.

  • The Score: Even the best models only got about 40-50% of the answers right. That's a failing grade on most exams!
  • The "Human" Score: Real human experts got over 93% right.
  • The Hardest Part: The models did especially poorly on Economics and Open-ended questions (where you have to explain why without multiple-choice hints).
  • The "Hallucination" Issue: When the models tried to "think step-by-step" (a technique called Chain-of-Thought), they often got worse. It's like when you try to explain a joke to someone, and in trying to explain the logic, you ruin the punchline. The AI got confused by its own reasoning.

Why This Matters

This paper is a wake-up call. It tells us that just because an AI can solve a math problem doesn't mean it "understands" the world.

  • Real-world impact: If we want AI to help doctors, lawyers, historians, or policymakers, it needs to understand human culture and context, not just numbers.
  • The Future: HSSBench gives researchers a target. Now that we have a map of where the AI is failing, we can start building better models that can truly "think" like humans, connecting art, history, and economics into a coherent picture of our world.

In short: HSSBench is the test that finally asks, "Can you understand the human story?" and currently, the answer from our smartest robots is a hesitant, "Not really, not yet."