WTMAD-4: A Fair Weighting Scheme for GMTKN55

This paper identifies a significant flaw in WTMAD-2, the existing weighting scheme for the GMTKN55 benchmark set: it underweights certain component benchmarks. The authors propose a new metric, WTMAD-4, whose weights are based on the typical errors of dispersion-corrected functionals, so that every benchmark is evaluated fairly. Applying it reveals performance issues in functionals that were previously optimized against the flawed metric.

Original authors: Kyle R. Bryenton, Erin R. Johnson

Published 2026-06-18

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a judge at a massive cooking competition. The goal is to find the "best chef": a computational method based on Density Functional Theory (DFT) that can predict how chemical reactions behave.

To do this, you have a giant scorecard called GMTKN55. This scorecard isn't just one dish; it's a collection of 55 different challenges, ranging from simple tasks like baking a small cookie (small molecules) to complex feats like building a skyscraper (large molecules) or predicting how two magnets stick together (non-covalent interactions).

The Problem: A Broken Scorecard

For years, the judges used a specific way to calculate the final score, called WTMAD-2. Think of this as a grading system where the score for each challenge is weighted by how many dishes it contains and rescaled according to how "big" the energies involved typically are.

The paper argues that this old system was fundamentally unfair. Here is the analogy:

Imagine the competition has two types of challenges:

  1. The "Big" Challenge: A massive banquet with 76 dishes (called BH76).
  2. The "Small" Challenge: A tiny appetizer with only 16 bites (called IL16).

Under the old WTMAD-2 rules, the banquet (BH76) was worth so much more than the appetizer (IL16) that if a chef messed up the appetizer, it barely changed their final score. But if they messed up the banquet, their score tanked.

According to the paper, the banquet was in effect weighted nearly 200 times more heavily than the appetizer. This meant a chef could be terrible at the appetizer and still win the whole competition just because they were good at the banquet. The old system was "over-weighting" the big challenges and "under-weighting" the small ones, making the results misleading.
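To see how lopsided weights can hide a failure, here is a minimal toy sketch. It is not the WTMAD-2 formula itself: the function name `weighted_score` and all error values are hypothetical, and the only number taken from the text above is the roughly 200-to-1 weight ratio between BH76 and IL16.

```python
def weighted_score(errors, weights):
    """Weighted average of per-challenge errors (lower is better)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * errors[name] for name in errors) / total_weight


# Relative weights: the banquet counts roughly 200x the appetizer.
weights = {"BH76": 200.0, "IL16": 1.0}

# Hypothetical errors for two chefs (made-up numbers, arbitrary units).
balanced_chef = {"BH76": 1.0, "IL16": 1.0}   # equally decent at both
lopsided_chef = {"BH76": 0.9, "IL16": 10.0}  # slightly better banquet, awful appetizer

balanced = weighted_score(balanced_chef, weights)
lopsided = weighted_score(lopsided_chef, weights)

print(f"balanced chef: {balanced:.3f}")
print(f"lopsided chef: {lopsided:.3f}")
```

With these made-up numbers the lopsided chef gets the lower (better) score, even though their appetizer error is ten times worse. A small gain on the heavily weighted challenge outweighs a large failure on the lightly weighted one.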

The Solution: WTMAD-4 (The Fair Scorecard)

The authors, Kyle Bryenton and Erin Johnson, propose a new way to score the competition called WTMAD-4.

Instead of weighing the challenges based on their size or energy cost, they decided to weigh them based on how hard they are for a typical, reliable chef to get right.

  • The Old Way: "This challenge is huge, so it counts for 50% of your grade."
  • The New Way (WTMAD-4): "We asked 10 expert chefs how hard this challenge usually is. Since it's usually hard, it counts for a fair share of the grade. Since that other challenge is usually easy, it counts for a smaller share, but not zero."

By using this new method, every single one of the 55 challenges gets a fair voice. No single challenge can dominate the final score, and no challenge is ignored.
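One simple way to picture weighting by typical difficulty is sketched below. This is an illustrative assumption, not the paper's actual WTMAD-4 formula: the helper names `typical_errors` and `fair_score`, the two-method reference panel, and every number are hypothetical. The idea shown is to divide each challenge's error by the error a panel of reference methods typically makes on it, then average, so that every challenge is measured on the same relative scale.

```python
from statistics import mean


def typical_errors(reference_results):
    """Average error per challenge across a panel of reference methods."""
    challenges = next(iter(reference_results.values())).keys()
    return {c: mean(r[c] for r in reference_results.values()) for c in challenges}


def fair_score(errors, typical):
    """Mean of errors after dividing each by that challenge's typical error."""
    return mean(errors[c] / typical[c] for c in typical)


# Hypothetical reference panel (made-up errors, arbitrary units).
panel = {
    "ref_A": {"BH76": 4.0, "IL16": 0.5},
    "ref_B": {"BH76": 6.0, "IL16": 1.5},
}
typical = typical_errors(panel)

# A hypothetical candidate: twice as good as typical on BH76,
# three times worse than typical on IL16.
candidate = {"BH76": 2.5, "IL16": 3.0}
score = fair_score(candidate, typical)

print(typical)
print(f"candidate score: {score:.2f}")
```

In this sketch a score of 1.0 means "as good as the typical reference method on average". The candidate's poor IL16 result pulls its score well above 1.0 despite the strong BH76 result, because neither challenge can drown out the other.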

What Happened When They Re-Scored?

The authors took 115 different "chefs" (computer methods) and re-ran the scores using the new WTMAD-4 system. The results were surprising:

  1. The Rankings Changed: Some chefs who were previously ranked at the very top dropped down the list. Others who were in the middle moved up.
  2. The "Overfitting" Trap: They found a specific chef (called XYG8) who was ranked #3 under the old rules. Why? Because this chef was incredibly good at the "Big Banquet" (BH76) but terrible at the "Small Appetizers." Under the old rules, the chef's greatness at the banquet hid their failures elsewhere. Under the new WTMAD-4 rules, their failures at the small challenges were finally counted, and their rank dropped significantly.
  3. The Lesson: The paper warns that if you design a chef to win only under the old, unfair rules, the chef may end up "overfit": a specialist at one type of dish who fails at everything else. The new WTMAD-4 system ensures that a "best chef" is actually good at everything, not just the big, loud challenges.

The Bottom Line

The paper doesn't invent a new cooking method or a new ingredient. Instead, it fixes the scorecard.

It argues that for a long time, scientists were using a ruler that stretched and shrank depending on what they were measuring. This new WTMAD-4 metric is a straight, honest ruler that treats every chemical challenge fairly, ensuring that the "best" computer methods are truly the most reliable for all types of chemistry, not just the big ones.
