SommBench: Assessing Sommelier Expertise of Language Models

Imagine you have a super-smart robot that has read almost every book, website, and article on the internet. It can write poetry, solve math problems, and chat in dozens of languages. But there's a big question: Does this robot actually understand the world, or is it just a really good parrot repeating what it's heard?

Specifically, can this robot act like a sommelier: a wine expert who knows not just facts about wine, but also how to describe its taste and smell, and which foods it goes best with?

This paper introduces SommBench, a giant "final exam" designed to test if AI can truly be a wine expert. Here's the breakdown in simple terms:

🍷 The Exam: Three Different Tests

The researchers didn't just ask the AI one question. They gave it three distinct challenges, like a sommelier training program:

The Trivia Test (Wine Theory):
- The Task: Answer multiple-choice questions about wine facts (e.g., "What grape is used in Chianti?").
- The Twist: The questions are asked in 8 different languages (English, German, Spanish, etc.).
- The Goal: To see if the AI knows the facts no matter what language you speak.
- The Result: The AI aced this! The smartest models got about 97% right. It's like a student who memorized the entire encyclopedia.

The Fill-in-the-Blanks Test (Wine Features):
- The Task: You give the AI a partial profile of a wine (e.g., "It's from France, it's red, and has high alcohol...") and ask it to guess the missing details (e.g., "What is the grape variety?").
- The Twist: It has to do this in different languages and output the answer in a strict, structured format.
- The Goal: To see if the AI can connect the dots and infer missing information logically.
- The Result: This was harder. The best AI got about 65% right. It's like a student who knows the facts but sometimes guesses the wrong details when the clues are tricky.

The Dinner Party Test (Food & Wine Pairing):
- The Task: You give the AI a recipe (e.g., "Spicy Thai Curry") and a bottle of wine. The AI must say: "Yes, this is a great match" or "No, this is a terrible match."
- The Twist: This requires taste and judgment, not just facts. It's subjective.
- The Goal: To see if the AI has "good taste" or if it's just guessing.
- The Result: This was the biggest failure. The best AI only got about 39% of the "right" answers (using a special scoring method). Many models were worse than random guessing.
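
For the curious, here is a rough idea of how a multiple-choice exam like the Trivia Test could be graded one language at a time. This is only a sketch: the function name `accuracy_by_language`, the record layout, and the sample answers are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return the share of correct answers for each language.

    `records` is an iterable of (language, predicted_choice, correct_choice)
    tuples. This layout is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for language, predicted, correct in records:
        totals[language] += 1
        if predicted == correct:
            hits[language] += 1
    return {language: hits[language] / totals[language] for language in totals}


# Made-up answers for illustration only (not results from the paper).
sample = [
    ("en", "B", "B"),
    ("en", "A", "A"),
    ("sk", "C", "B"),
    ("sk", "A", "A"),
]

scores = accuracy_by_language(sample)
print(scores)
```

Grading each language separately is what lets you notice a model that answers the same question correctly in one language and wrongly in another.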

🤖 The Big Findings

1. The "Parrot" Problem (Language vs. Knowledge)
The smartest AI models (the "closed" ones, like the ones you pay for) were great at knowing wine facts in English. But when you switched to languages like Slovak or Finnish, some of the "open" models (free to use) suddenly forgot everything.

Analogy: Imagine a student who is a genius in English class but suddenly acts like they've never heard of "France" when you ask the question in German. It turns out, for some AIs, their knowledge is tied to the language they learned it in, not the actual concept.

2. The "Yes-Man" Bias
When it came to pairing food and wine, the AI had a weird personality flaw: It was too nice.

The Issue: If you asked, "Does this terrible wine go with this fancy steak?" the AI would often say "Yes!" just to be helpful.
Analogy: It's like a waiter who is so afraid of offending you that they tell you a burnt steak is "crispy and delicious." The AI has a "positivity bias"—it prefers to agree and recommend things rather than say "No, that's a bad idea."
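
Here is a small sketch of why a "Yes-Man" can look decent on plain accuracy and still earn nothing on a score that corrects for lucky guessing. The summary above only mentions "a special scoring method"; the Matthews correlation coefficient (MCC) used here is one common chance-corrected score and is an assumption for this example. The labels are made up, too.

```python
import math


def accuracy(predictions, labels):
    """Plain share of answers that match the expert label."""
    matches = sum(1 for p, l in zip(predictions, labels) if p == l)
    return matches / len(labels)


def yes_rate(predictions):
    """Share of pairings the model approves (True means "good match")."""
    return sum(1 for p in predictions if p) / len(predictions)


def mcc(predictions, labels):
    """Matthews correlation coefficient for yes/no answers.

    Returns 0.0 when the score is undefined, e.g. when the model
    gives the same answer every time.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up expert labels: three good pairings, two bad ones.
labels = [True, True, True, False, False]
# A "Yes-Man" model that approves every pairing.
always_yes = [True] * len(labels)

print(yes_rate(always_yes))          # approves everything
print(accuracy(always_yes, labels))  # looks passable
print(mcc(always_yes, labels))       # but no credit after chance correction
```

In this toy case the always-yes model is right 60% of the time simply because most pairings happen to be good ones, yet its chance-corrected score is zero.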

3. Facts vs. Feelings
The study suggests that AI is great at facts (reciting data) but much weaker at feelings (judging taste).

Analogy: An AI can tell you the chemical formula of a strawberry, but it can't tell you if a strawberry tastes better with cream or with salt. It hasn't actually tasted anything; it's just reading descriptions of tasting.

🏁 The Verdict: Should You Trust an AI Sommelier?

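For facts? Mostly yes. If you want textbook knowledge, like which grape goes into a famous wine, the best AI models are very reliable (about 97% on the Trivia Test). They are shakier when they have to work out the details of one specific wine from partial clues (about 65% on the Fill-in-the-Blanks Test).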

For dinner advice? No. If you ask an AI, "What wine should I drink with my spicy fish tacos?" it might give you a recommendation that sounds good but tastes awful. It lacks the human "gut feeling" and sensory experience that makes a real sommelier valuable.

In short: SommBench shows us that while AI is getting smarter, it's still a bit like a bookish student who has never actually eaten a meal. It knows the menu, but it doesn't know the flavor.
