WTMAD-4: A Fair Weighting Scheme for GMTKN55

Imagine you are a judge at a massive cooking competition. The goal is to find the "best chef" (a computer program called a Density Functional Theory, or DFT, method) who can predict how chemical reactions behave.

To do this, you have a giant scorecard called GMTKN55. This scorecard isn't just one dish; it's a collection of 55 different challenges, ranging from simple tasks like baking a small cookie (small molecules) to complex feats like building a skyscraper (large molecules) or predicting how two magnets stick together (non-covalent interactions).

The Problem: A Broken Scorecard

For years, the judges used a specific way to calculate the final score, called WTMAD-2. Think of this like a grading system where the score for each challenge is weighted by how "expensive" or "big" the challenge is.

The paper argues that this old system was fundamentally unfair. Here is the analogy:

Imagine the competition has two types of challenges:

The "Big" Challenge: A massive banquet with 76 dishes (called BH76).
The "Small" Challenge: A tiny appetizer with only 16 bites (called IL16).

Under the old WTMAD-2 rules, the banquet (BH76) was worth so much more than the appetizer (IL16) that if a chef messed up the appetizer, it barely changed their final score. But if they messed up the banquet, their score tanked.

In reality, the paper found that the banquet was worth nearly 200 times more than the appetizer. This meant a chef could be terrible at the appetizer and still win the whole competition just because they were good at the banquet. The old system was "over-weighting" the big challenges and "under-weighting" the small ones, making the results misleading.

The Solution: WTMAD-4 (The Fair Scorecard)

The authors, Kyle Bryenton and Erin Johnson, propose a new way to score the competition called WTMAD-4.

Instead of weighing the challenges based on their size or energy cost, they decided to weigh them based on how hard they are for a typical, reliable chef to get right.

The Old Way: "This challenge is huge, so it counts for 50% of your grade."
The New Way (WTMAD-4): "We asked 10 reliable chefs how big a mistake is typical on each challenge, and we now judge every mistake relative to that. A typical performance costs about the same number of points on every challenge, so each one gets a fair share of the grade."

By using this new method, every single one of the 55 challenges gets a fair voice. No single challenge can dominate the final score, and no challenge is ignored.

What Happened When They Re-Scored?

The authors took 115 different "chefs" (computer methods) and re-ran the scores using the new WTMAD-4 system. The results were surprising:

The Rankings Changed: Some chefs who were previously ranked at the very top dropped down the list. Others who were in the middle moved up.
The "Overfitting" Trap: They found a specific chef (called XYG8) who was ranked #3 under the old rules. Why? Because this chef was incredibly good at the "Big Banquet" (BH76) but terrible at the "Small Appetizers." Under the old rules, the chef's greatness at the banquet hid their failures elsewhere. Under the new WTMAD-4 rules, their failures at the small challenges were finally counted, and their rank dropped significantly.
The Lesson: The paper warns that if you design a chef to only win based on the old, unfair rules, they might be "overfitting." They become a specialist at one type of dish but fail at everything else. The new WTMAD-4 system ensures that a "best chef" is actually good at everything, not just the big, loud challenges.

The Bottom Line

The paper doesn't invent a new cooking method or a new ingredient. Instead, it fixes the scorecard.

It argues that for a long time, scientists were using a ruler that stretched and shrank depending on what they were measuring. This new WTMAD-4 metric is a straight, honest ruler that treats every chemical challenge fairly, ensuring that the "best" computer methods are truly the most reliable for all types of chemistry, not just the big ones.

Technical Summary: WTMAD-4: A Fair Weighting Scheme for GMTKN55

Problem Identification
The GMTKN55 database is a standard benchmark collection in molecular quantum chemistry, comprising 55 subsets covering thermochemistry, reaction barriers, and non-covalent interactions (NCI) across small and large molecules. To aggregate performance across these chemically diverse subsets, the community uses the Weighted Total Mean Absolute Deviation (WTMAD). However, this paper identifies a critical flaw in the widely used WTMAD-2 and WTMAD-3 metrics. These schemes weight each benchmark by the ratio of the average absolute reference energy across the database to that benchmark's own mean absolute reference energy ($|\Delta E|_i$), scaled by the number of data points ($N_i$).

The authors demonstrate that this approach leads to disproportionate weighting. Benchmarks with large numbers of reactions (e.g., BH76 with 76 reactions) or specific energy scales dominate the total error metric, while benchmarks with fewer systems or different energy scales (e.g., IL16, DIPCS10) contribute negligibly (orders of magnitude less). Consequently, optimizing a Density Functional Approximation (DFA) to minimize WTMAD-2 may result in a functional that performs exceptionally well on a few large subsets but fails significantly on marginalized benchmarks. This issue is exacerbated by the fact that reference data updates have led to inconsistencies in the average energy values used in literature, further complicating comparisons.
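To make the imbalance concrete, here is a minimal sketch of a WTMAD-2-style weight. It assumes the weight takes the form $N_i \times$ (average energy / benchmark mean energy), with the average taken as the plain mean of the per-benchmark values. The benchmark names and energies are hypothetical placeholders chosen for illustration, not GMTKN55 data; only the point counts 76 and 16 echo the subset sizes mentioned above.

```python
# Illustrative only: benchmark names and energies are hypothetical placeholders,
# not values taken from GMTKN55.

def wtmad2_style_weights(n_points, mean_abs_energy):
    """Return per-benchmark weights of the assumed form N_i * (E_avg / E_i).

    n_points:        dict benchmark -> number of data points N_i
    mean_abs_energy: dict benchmark -> mean absolute reference energy E_i
    E_avg is the plain average of the per-benchmark E_i (an assumption of this sketch).
    """
    e_avg = sum(mean_abs_energy.values()) / len(mean_abs_energy)
    return {
        name: n_points[name] * e_avg / mean_abs_energy[name]
        for name in n_points
    }

n_points = {"many_small_energies": 76, "few_large_energies": 16}
mean_abs_energy = {"many_small_energies": 20.0, "few_large_energies": 100.0}  # kcal/mol, made up

weights = wtmad2_style_weights(n_points, mean_abs_energy)
ratio = weights["many_small_energies"] / weights["few_large_energies"]
print(f"weight ratio: {ratio:.2f}")
```

In this toy case, the benchmark with more points and smaller energies is weighted over twenty times more heavily than the other, even though neither factor says anything about chemical importance.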

Methodology
To address these disparities, the authors propose a new metric, WTMAD-4. The methodology involves the following steps:

Data Reassessment: The authors re-evaluated 115 previously studied dispersion-corrected DFAs (DC-DFAs) using updated reference data from the revised GMTKN55 set.
Weight Derivation: Unlike WTMAD-2, which relies on reference energy scales, WTMAD-4 weights are derived from the expected performance of a representative set of ten "minimally empirical" and well-behaved hybrid functionals (e.g., PBE0-D3(BJ), B3LYP-D3(BJ)).
Weight Calculation: The weight for each benchmark $i$ is defined as:
$w_i^{\text{WTMAD-4}} = \frac{100}{N_{\text{bench}}} \left( \frac{3.5}{\overline{\text{MAD}}_i} \right)$
where $N_{\text{bench}}$ is the number of benchmarks and $\overline{\text{MAD}}_i$ is the mean absolute deviation for benchmark $i$, averaged over the ten reference functionals. The factor of 3.5 scales the metric to be comparable in magnitude to WTMAD-2.
Rationale: By using the inverse of the mean error of robust functionals as the weight, benchmarks where typical functionals struggle (high $\overline{\text{MAD}}_i$) receive lower weights, while benchmarks where they perform well (low $\overline{\text{MAD}}_i$) receive higher weights. Because a benchmark's contribution to the total is its weight multiplied by the functional's error on it, a functional with typical errors contributes roughly equally from every benchmark. This ensures that no single benchmark is marginalized due to its size or energy scale; each error is instead judged relative to the typical difficulty of the chemical problem the benchmark represents.
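The weight formula above can be sketched in a few lines of Python. The benchmark names, the three "reference functionals", and all MAD values are hypothetical and only illustrate the arithmetic. The helper `contribution_shares` is also an assumption of this sketch: it reports each benchmark's fraction of the weighted error, which does not depend on how the final sum is normalized.

```python
def wtmad4_weights(reference_mads, scale=3.5):
    """w_i = (100 / N_bench) * (scale / mean MAD_i over the reference functionals)."""
    n_bench = len(reference_mads)
    weights = {}
    for bench, mads in reference_mads.items():
        mean_mad = sum(mads) / len(mads)
        weights[bench] = (100.0 / n_bench) * (scale / mean_mad)
    return weights

def contribution_shares(weights, mads):
    """Fraction of the total weighted error that comes from each benchmark."""
    weighted = {b: weights[b] * mads[b] for b in weights}
    total = sum(weighted.values())
    return {b: v / total for b, v in weighted.items()}

# Hypothetical MADs (kcal/mol): three made-up benchmarks, three made-up reference functionals.
reference_mads = {
    "bench_easy": [0.4, 0.5, 0.6],
    "bench_mid": [1.5, 2.0, 2.5],
    "bench_hard": [7.0, 8.0, 9.0],
}
weights = wtmad4_weights(reference_mads)

# A functional whose MADs equal the reference averages on every benchmark.
typical = {b: sum(m) / len(m) for b, m in reference_mads.items()}
shares = contribution_shares(weights, typical)
for bench, share in shares.items():
    print(f"{bench}: weight={weights[bench]:.2f}, share={share:.3f}")
```

The easy benchmark receives the largest weight and the hard one the smallest, yet the "typical" functional draws the same share of its total error from each, which is the balancing property the rationale describes.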

Key Results

Distribution of Contributions: Analysis of the 115 DC-DFAs reveals that WTMAD-2 and WTMAD-3 produce highly skewed distributions where some benchmarks contribute up to ~10% of the total error, while others contribute less than 0.1%. In contrast, WTMAD-4 produces a much tighter, more centralized distribution. The interquartile range (IQR) of contributions drops from ~1.6–1.9% for previous metrics to 0.97% for WTMAD-4.
Reordering of Functionals: The shift to WTMAD-4 significantly alters the ranking of DFAs:
- GGA and Meta-GGA: Rankings show minor shifts, though meta-GGAs generally perform less favorably relative to GGAs under WTMAD-4 compared to WTMAD-2.
- Hybrid Functionals: Significant reordering occurs. For instance, PW6B95-D3(BJ) improves from 7th to 2nd place, while $\omega$B97X-V, though still top-ranked, shows a larger gap between its WTMAD-2 and WTMAD-4 scores. The authors attribute this to $\omega$B97X-V's poor performance on specific "Iso + Large" benchmarks (C60ISO, MB16-43) that are under-weighted in WTMAD-2 but fairly represented in WTMAD-4.
- Double Hybrids: The ranking changes are attributed to the reduced weight of the BH76 barrier set and increased weight of other subsets. Notably, XYG8, which was ranked 3rd by WTMAD-2, drops to 17th by WTMAD-4. The authors note that XYG8's parameters were fitted specifically to minimize WTMAD-2, suggesting it overfit to the BH76 subset at the expense of other benchmarks. Conversely, revDH23 and DH24 remain top performers under both metrics, indicating greater robustness.
Outliers: The only significant outliers in WTMAD-4 contributions are for the ADIM6 benchmark (n-alkane dimers), where specific Minnesota functionals (MN15L, M06, MN15) show systematic overbinding, leading to high contributions. This is consistent with known limitations of these functionals regarding dispersion.
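The contribution spread reported in the first result can be measured as follows. This is a sketch under stated assumptions: the MADs and both weight sets are invented, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ from whatever quantile convention the authors used.

```python
import statistics

def percent_contributions(weights, mads):
    """Percentage of the weighted error that each benchmark contributes."""
    weighted = {b: weights[b] * mads[b] for b in mads}
    total = sum(weighted.values())
    return {b: 100.0 * v / total for b, v in weighted.items()}

def interquartile_range(values):
    """IQR using the default (exclusive) method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical MADs of one functional on five made-up benchmarks.
mads = {"a": 0.5, "b": 1.0, "c": 2.0, "d": 4.0, "e": 8.0}

# A deliberately skewed weight set versus inverse-typical-error weights (both made up).
skewed_weights = {"a": 50.0, "b": 1.0, "c": 10.0, "d": 0.5, "e": 0.2}
typical_mads = {"a": 0.4, "b": 1.2, "c": 2.0, "d": 5.0, "e": 6.0}
balanced_weights = {b: 1.0 / m for b, m in typical_mads.items()}

contrib_skewed = percent_contributions(skewed_weights, mads)
contrib_balanced = percent_contributions(balanced_weights, mads)
iqr_skewed = interquartile_range(list(contrib_skewed.values()))
iqr_balanced = interquartile_range(list(contrib_balanced.values()))
print(f"IQR skewed: {iqr_skewed:.1f}%  IQR balanced: {iqr_balanced:.1f}%")
```

A smaller IQR means the per-benchmark contributions cluster more tightly around an equal share, which is the behavior reported for WTMAD-4 relative to the earlier metrics.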

Significance and Claims
The paper claims that WTMAD-4 provides a "fair treatment across all benchmarks" by ensuring each of the 55 subsets contributes meaningfully to the overall error metric. The authors argue that the previous reliance on WTMAD-2 allowed for the marginalization of chemically important but numerically smaller subsets.

The primary significance of this work is the demonstration that minimizing WTMAD-2 can lead to functionals that are overfitted to specific subsets (like BH76) while underperforming on others. By using WTMAD-4, developers can identify functionals that are more robust across the entire chemical space of GMTKN55. The authors caution against a "Goodhart's law" effect in functional development: once a single, unbalanced metric becomes the optimization target, it ceases to be a good measure of general performance. They advocate for the use of WTMAD-4 to reduce the likelihood of such overfitting, particularly in the context of AI-guided DFA development, while emphasizing that multiple statistical measures should still be considered rather than relying on a single target number.
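A toy example shows how a ranking can flip when the weighting changes. Everything here is hypothetical: the functional names, the two benchmarks, the MADs, and both weight sets are invented to illustrate the overfitting argument and do not reproduce any result from the paper.

```python
def weighted_score(weights, mads):
    """Weighted sum of per-benchmark MADs (lower is better)."""
    return sum(weights[b] * mads[b] for b in weights)

def ranking(weights, functionals):
    """Functional names ordered from best (lowest score) to worst."""
    return sorted(functionals, key=lambda name: weighted_score(weights, functionals[name]))

# Hypothetical MADs (kcal/mol) on a heavily weighted and a lightly weighted benchmark.
functionals = {
    "func_specialist": {"heavy": 0.5, "light": 6.0},  # tuned to the heavy benchmark
    "func_generalist": {"heavy": 1.5, "light": 1.5},
    "func_average": {"heavy": 2.0, "light": 2.0},
}

skewed = {"heavy": 0.9, "light": 0.1}    # one benchmark dominates the score
balanced = {"heavy": 0.5, "light": 0.5}  # both benchmarks count equally

rank_skewed = ranking(skewed, functionals)
rank_balanced = ranking(balanced, functionals)
print("skewed:  ", rank_skewed)
print("balanced:", rank_balanced)
```

Under the skewed weights the specialist ranks first because its large error on the lightly weighted benchmark is almost invisible; under balanced weights that error counts in full and the specialist falls to last place.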
