La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America

María Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González, Gonzalo Martínez, Gonzalo Santamaría, Rodrigo Agerri, Nuria Aldama, Luis Chiruzzo, Javier Conde, Helena Gómez, Marta Guerrero, Guido Ivetta, Natalia López, Flor Miriam Plaza-del-Arco, María Teresa Martín-Valdivia, Helena Montoro, Carmen Muñoz, Pedro Reviriego, Leire Rosado, Alejandro Vaca, María Estrella Vallecillo-Rodríguez, Jorge Vallego, Irune Zubiaga

Published 2026-03-06



Imagine you have a giant library of books written in many different languages. For a long time, the "smartest" books in the library were only written in English. If you wanted to know which book was the best, you'd ask a librarian who only spoke English. If you asked about a book in Spanish, Basque, or Catalan, the librarian might just guess, or worse, translate the English questions poorly, missing the local jokes, cultural references, and unique ways of speaking.

"La Leaderboard" is a new, community-built library guide designed to fix this. It's the first open-source "scoreboard" specifically for testing how well Artificial Intelligence (AI) understands and speaks the many varieties of Spanish and the other languages of Spain (like Basque, Catalan, and Galician).

Here is a breakdown of how it works, using some everyday analogies:

1. The Problem: The "One-Size-Fits-All" Trap

Think of current AI models like a universal translator that studied hard in English but only skimmed the other languages.

The Issue: Most AI tests are like a driving test written in English and then poorly translated into Spanish. The questions might make sense grammatically, but they miss the local road signs, the slang, and the cultural context.
The Result: An AI might get a perfect score on a translated test but fail miserably when a real person in Mexico City or Buenos Aires asks it a question about local laws or humor.

2. The Solution: A "Taste-Test" for AI

The creators of La Leaderboard decided to stop using translated tests. Instead, they organized a massive taste-test.

The Ingredients: They gathered 66 different "recipes" (datasets). Some were donated by researchers, and some were cooked up specifically for this event.
The Menu: The menu covers everything from medical advice (like a doctor diagnosing a patient) to legal questions (like a lawyer arguing a case), humor (can the AI get a joke?), and reading comprehension.
The Languages: The test isn't just in "Spanish." It's in the specific dialects of Spain, Mexico, Argentina, Chile, and Uruguay, plus Basque, Catalan, and Galician. It's like testing a chef not just on "Italian food," but specifically on Neapolitan pizza vs. Roman pasta.

3. The Contestants: The AI Models

They invited 50 different AI models to take the test.

The Big Giants: Some are famous heavy hitters from big tech companies (like Meta's Llama or Google's Gemma). These are like professional athletes who have trained in every sport in the world.
The Local Heroes: Others are smaller, specialized models built by European and Spanish researchers (like Salamandra or EuroLLM). These are like local champions who know the neighborhood streets better than anyone else.
The Results: The scoreboard shows who wins. The big giants often do well but, perhaps surprisingly, the local heroes sometimes beat them in specific areas, proving that you don't always need a giant engine to win a local race.
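To make the "giants versus local heroes" idea concrete, here is a minimal sketch of how a scoreboard could rank models both overall and per language. The model names, languages, scores, and the simple averaging are all hypothetical placeholders for illustration; this summary does not describe how La Leaderboard actually aggregates its results.

```python
from collections import defaultdict

# Hypothetical (model, language, score) rows; illustrative values only,
# not results from La Leaderboard.
RESULTS = [
    ("big-model", "es", 0.85),
    ("big-model", "eu", 0.50),
    ("local-model", "es", 0.70),
    ("local-model", "eu", 0.60),
]


def rank_by_language(rows):
    """Return {language: [(model, score), ...]} sorted from best to worst."""
    by_lang = defaultdict(list)
    for model, lang, score in rows:
        by_lang[lang].append((model, score))
    return {
        lang: sorted(entries, key=lambda e: e[1], reverse=True)
        for lang, entries in by_lang.items()
    }


def overall_average(rows):
    """Return {model: mean score across all languages} (assumed aggregation)."""
    scores = defaultdict(list)
    for model, _lang, score in rows:
        scores[model].append(score)
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


rankings = rank_by_language(RESULTS)
averages = overall_average(RESULTS)
best_overall = max(averages, key=averages.get)
```

With these made-up numbers, the larger model has the higher average, yet the smaller model tops the Basque ("eu") ranking, which is the kind of pattern a per-language breakdown can reveal and a single overall score can hide.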

4. The "Eco-Friendly" Twist

Usually, testing AI is like burning a ton of coal to light a single candle. It takes massive amounts of electricity and time.

The Innovation: The team decided to be smarter. Instead of asking each model to read a long list of solved examples before answering (like a student cramming for a test), they often asked the models to answer without any examples (zero-shot) or with very few.
The Benefit: This saves a huge amount of energy (like turning off the lights when you leave the room) and makes it easier for smaller researchers to run their own tests without needing a supercomputer.
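The difference between zero-shot and few-shot prompting can be sketched in a few lines. The prompt template, the example questions, and the word count used as a rough proxy for token count are assumptions made for illustration; they are not the project's actual prompts or measurements.

```python
def build_prompt(question, examples, k):
    """Build a prompt with the first k solved examples, then the question."""
    shots = [f"Pregunta: {q}\nRespuesta: {a}" for q, a in examples[:k]]
    return "\n\n".join(shots + [f"Pregunta: {question}\nRespuesta:"])


# Hypothetical solved examples; not taken from any La Leaderboard dataset.
EXAMPLES = [
    ("¿Cuál es la capital de Uruguay?", "Montevideo"),
    ("¿Cuál es la capital de Chile?", "Santiago"),
    ("¿Cuál es la capital de México?", "Ciudad de México"),
]

QUESTION = "¿Cuál es la capital de Argentina?"

zero_shot = build_prompt(QUESTION, EXAMPLES, k=0)  # no examples at all
few_shot = build_prompt(QUESTION, EXAMPLES, k=3)   # three solved examples first

# Whitespace-separated word count as a crude stand-in for token count.
zero_len = len(zero_shot.split())
few_len = len(few_shot.split())
```

Every extra example lengthens the text the model must process for each question, so fewer examples means less computation per test item. Multiplied across many datasets and models, that is where the energy saving comes from.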

5. Why This Matters

Think of AI as a new employee joining a global company.

If you test them only in English, they might seem smart, but they will fail when talking to the team in Madrid, Mexico City, or San Salvador.
La Leaderboard ensures that the AI is culturally aware. It checks if the AI understands that a "joke" in Argentina might be different from a "joke" in Spain, or that a legal term in Mexico has a specific meaning.

The Bottom Line

La Leaderboard is a community-driven project that says: "We want AI that speaks our language, understands our culture, and respects our diversity."

It's not just about which AI is the "smartest" in the world; it's about which is the most helpful and respectful for the 600 million people who speak Spanish and the other languages of the Iberian Peninsula. By making this scoreboard open to everyone, the authors hope to inspire other communities (like those speaking French, Arabic, or Indigenous languages) to build their own scoreboards, ensuring no one is left behind in the AI revolution.
