Count your bits: fingerprint benchmarking to assess… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to organize a massive library containing millions of books. But instead of titles and authors, these "books" are chemical molecules. Your goal is to find books that are similar to each other, whether they share the same plot (structure) or the same genre (chemical properties).

To do this, you need a system to summarize each book into a short "ID card." In the world of chemistry, these ID cards are called fingerprints.

This paper is like a massive report card for different types of ID cards. The authors, Florian Huber and Julian Pollmann, tested dozens of these fingerprint systems to see which ones actually work well when you have a huge, messy library of chemicals.

Here is the breakdown of their findings using simple analogies:

1. The Problem: The "Folding" Trap

Most fingerprint systems try to squeeze a molecule's complex structure into a fixed-size box (like a 4,096-bit vector). Think of this like trying to fit a giant, detailed map of a city into a tiny postcard.

The Issue: To make it fit, you have to "fold" the map. Sometimes, two completely different streets get squished onto the same spot on the postcard.
The Result: When two different features land on the same spot, that is called a bit collision. With enough collisions, the computer thinks two very different molecules look alike, sometimes even like identical twins, because their "postcards" overlap.
The Fix: The authors found that for some fingerprints (like RDKit and MAP4), you shouldn't fold the map at all. You should use an unfolded version (a huge scroll) so every street gets its own unique spot. This stops the computer from getting confused.
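To make the folding trap concrete, here is a toy sketch. It assumes features are plain integer IDs and that folding is a simple modulo operation; real toolkits hash substructures first, and the feature IDs and the 8-bit size below are made up for illustration.

```python
def fold(feature_ids, n_bits):
    """Fold integer feature IDs into a fixed number of bit positions (toy modulo folding)."""
    return {fid % n_bits for fid in feature_ids}


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' positions."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical molecules that share no features at all.
mol_a = {3, 17, 42}
mol_b = {11, 25, 50}

unfolded_similarity = tanimoto(mol_a, mol_b)
folded_similarity = tanimoto(fold(mol_a, 8), fold(mol_b, 8))

print(unfolded_similarity)  # 0.0: no shared features
print(folded_similarity)    # 1.0: every feature collides after folding to 8 bits
```

The example is deliberately extreme: with only 8 positions, all three features of each molecule land on the same spots, so two unrelated molecules look identical.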

2. The "Yes/No" vs. The "Count"

There are two main ways to write these ID cards:

Binary (Yes/No): "Does this molecule have a benzene ring? Yes."
Count (How Many?): "This molecule has three benzene rings."

The Finding: The "Count" method is almost always better.

Analogy: Imagine two people. One has one red shirt; the other has ten red shirts.
- The Binary card says: "Both have red shirts." (They look the same).
- The Count card says: "One has one, the other has ten." (They look different).
In chemistry, knowing how many parts a molecule has helps the computer understand the difference between a small molecule and a giant one. The "Count" method also handles the "folding" problem much better than the "Yes/No" method.
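The red-shirt analogy can be turned into numbers with a short sketch. The count version below uses the min/max generalization of the Tanimoto score, which is one common choice; whether the paper uses exactly this form is an assumption, and the "wardrobe" data is invented.

```python
from collections import Counter


def tanimoto_binary(a, b):
    """Similarity based only on which items are present."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tanimoto_count(a, b):
    """Similarity that also compares how many of each item are present."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)  # Counter gives 0 for missing keys
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 0.0


person_1 = Counter({"red_shirt": 1, "blue_jeans": 1})
person_2 = Counter({"red_shirt": 10, "blue_jeans": 1})

binary_score = tanimoto_binary(person_1, person_2)
count_score = tanimoto_count(person_1, person_2)

print(binary_score)           # 1.0: the yes/no card cannot tell them apart
print(round(count_score, 2))  # 0.18: the count card sees a large difference
```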

3. The "Log" Scaling

Sometimes, having 21 red shirts isn't much different from having 20, but having 2 red shirts instead of 1 is a big difference.

The authors suggest using a special math trick called log-scaling. It's like turning down the volume on the loud numbers so the computer pays more attention to the small, important differences. This works great for visualizing chemical spaces (making pretty maps of where molecules live).
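A minimal sketch of the idea, assuming the scaling is log(1 + count). That exact form is an assumption; the paper may use a different variant.

```python
import math


def log_scale(count):
    """Compress large counts; log(1 + count) keeps a count of 0 at exactly 0."""
    return math.log1p(count)


gap_small = log_scale(2) - log_scale(1)    # difference between 1 and 2 shirts
gap_large = log_scale(21) - log_scale(20)  # difference between 20 and 21 shirts

print(round(gap_small, 3))  # 0.405
print(round(gap_large, 3))  # 0.047
```

After scaling, the step from 1 to 2 counts for roughly nine times as much as the step from 20 to 21, which matches the intuition in the shirt example.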

4. The "Goldilocks" Fingerprints

The paper tested many different fingerprint types (Morgan, RDKit, MAP4, etc.). Here is what they found:

Morgan/FCFP: These are like the reliable, standard ID cards. They work well, especially if you use a larger "radius" (looking at a bigger neighborhood of the molecule) and use the "Count" method.
RDKit & MAP4: These are very detailed and powerful, but they are especially prone to the "folding" trap. If you use them on a huge, diverse dataset, you should use the "unfolded" version (the full scroll rather than the postcard), or the similarity scores become unreliable.
Dictionary-based (MACCS, PubChem): These are like pre-printed checklists. They are easy to read, but they often miss the nuance of unique molecules, making them less specific for broad searches.

5. The New Tool: `chemap`

To help everyone else do this testing without reinventing the wheel, the authors built a free software tool called chemap.

Think of this as a "Swiss Army Knife" for chemists. It lets you easily generate these different ID cards (folded, unfolded, counted, or scaled) and compare them instantly.

The Big Takeaway

For a long time, chemists just picked a fingerprint type and used the default settings (usually "folded" and "binary"). This paper says: "Stop guessing!"

If you are working with a huge, diverse mix of chemicals (like natural products or metabolites), don't fold your fingerprints.
Use counts instead of just yes/no.
If you use RDKit or MAP4, you absolutely need the unfolded version to avoid false matches.

By choosing the right ID card and the right way to write it, you can make your chemical searches, drug discoveries, and data visualizations much more accurate and reliable.

1. Problem Statement

Molecular similarity quantification is fundamental to cheminformatics, underpinning virtual screening, chemical space visualization, and machine learning (ML) model training. While the Tanimoto coefficient applied to 2D fingerprints is the de facto standard, its practical behavior is highly sensitive to:

Fingerprint Type: Dictionary-based vs. circular (Morgan/FCFP) vs. path-based (RDKit) vs. distance-encoded (MAP4).
Representation: Binary (presence/absence) vs. Count (frequency).
Vector Folding: Fixed-length vectors (e.g., 1024 or 4096 bits) vs. Unfolded (variable length).
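For reference, the Tanimoto coefficient has a standard definition for binary fingerprints, where $A$ and $B$ are the sets of bits switched on in each molecule. A common generalization to count vectors $x$ and $y$ replaces intersection and union with element-wise minimum and maximum. The symbols are introduced here for illustration, and whether the paper uses exactly this count form is an assumption.

$$T(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

$$T_{\text{count}}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$

When every count is 0 or 1, the second expression reduces to the first.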

Key Issues Identified:

Bit Collisions: Folding high-occupancy fingerprints into fixed-length vectors causes distinct substructures to map to the same bit, artificially inflating similarity scores and distorting chemical space.
Arbitrary Defaults: Common default settings (binary, folded, specific radii) are often chosen arbitrarily without systematic benchmarking, leading to suboptimal performance in diverse tasks.
Task Mismatch: Traditional benchmarks focus heavily on virtual screening (retrieval of active compounds), neglecting other critical applications like chemical space visualization, dimensionality reduction, and ML target prediction where the full distribution of similarity scores matters.

2. Methodology

The authors introduced a multi-criteria benchmarking framework and a new open-source Python library, chemap, to evaluate fingerprint variants across large, heterogeneous datasets.

Datasets:

ms2structures (37k compounds): Curated mass spectrometry data.
biostructures (718k compounds): Large, biologically relevant, chemically heterogeneous dataset (stress test for broad representation).
Subclass Datasets: Balanced sets of 25 and 120 chemical subclasses for classification tasks.
rascalMCES (5.4M pairs): A large set of compound pairs used to compare fingerprint similarities against a graph-based reference (Maximum Common Edge Subgraph).

Fingerprint Variants Tested:

Types: MACCS, PubChem, Klekota-Roth, Biosynfoni (Dictionary-based); Morgan, FCFP (Circular); RDKit (Path-based); Atom Pair, MAP4 (Distance-based); LINGO, Avalon.
Configurations: Binary vs. Count (and log-count); Folded (4096-bit default) vs. Unfolded (sparse/hash-based); Frequency-folded (retaining the 4096 most frequent bits).
Scaling: Application of TF-IDF and log-scaling to count vectors.
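The frequency-folded configuration can be sketched as follows. This is not the chemap API: the function name `frequency_fold` is hypothetical, and it assumes that "most frequent" means "present in the most molecules" and that each unfolded fingerprint is a dict mapping a feature ID to its count.

```python
from collections import Counter


def frequency_fold(fingerprints, n_keep):
    """Keep the n_keep features found in the most molecules; drop the rest."""
    doc_freq = Counter()
    for fp in fingerprints:
        doc_freq.update(fp.keys())  # each feature counted once per molecule

    # Sort by descending frequency, then by feature ID so ties are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [fid for fid, _ in ranked[:n_keep]]
    column = {fid: i for i, fid in enumerate(kept)}

    vectors = []
    for fp in fingerprints:
        vec = [0] * len(kept)
        for fid, count in fp.items():
            if fid in column:
                vec[column[fid]] = count
        vectors.append(vec)
    return kept, vectors


# Hypothetical unfolded count fingerprints for three molecules.
fps = [
    {101: 2, 202: 1, 303: 1},
    {101: 1, 202: 3},
    {101: 1, 404: 5},
]
kept, vectors = frequency_fold(fps, n_keep=2)
print(kept)     # [101, 202]
print(vectors)  # [[2, 1], [1, 3], [1, 0]]
```

Unlike hash-based folding, each retained column corresponds to exactly one feature, so no collisions occur. The cost is that rare features (303 and 404 here) are discarded entirely.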

Evaluation Metrics:

Specificity: Duplicate fingerprint rates and mass discrepancies between identical fingerprints.
Score Behavior: Distribution of similarity scores and dependence on compound size (mass).
Ranking Agreement: Top-$k$ overlap between different fingerprint methods.
Graph-Based Alignment: Spearman correlation with RascalMCES (a computationally expensive but structurally rigorous reference).
Downstream Tasks:
- Multi-label bioactivity prediction (Neural Networks).
- Chemical subclass classification.
- Subclass neighborhood consistency (k-NN analysis).
- Chemical space visualization (UMAP).
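As an illustration of the ranking-agreement metric, a top-$k$ overlap can be computed as below. The scores and the names `morgan_like` and `map4_like` are invented, and the paper's exact procedure (for example, how ties are handled) is not known here.

```python
def top_k(scores, k):
    """IDs of the k highest-scoring candidates; ties broken by ID for determinism."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {cid for cid, _ in ranked[:k]}


def top_k_overlap(scores_a, scores_b, k):
    """Number of candidates that both methods place in their top k."""
    return len(top_k(scores_a, k) & top_k(scores_b, k))


# Hypothetical similarity of one query to five candidates under two fingerprints.
morgan_like = {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.2, "c5": 0.1}
map4_like = {"c1": 0.6, "c2": 0.1, "c3": 0.2, "c4": 0.7, "c5": 0.5}

overlap = top_k_overlap(morgan_like, map4_like, k=3)
print(overlap)  # 1: only c1 appears in both top-3 lists
```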

3. Key Contributions

The chemap Library: An open-source Python library providing unified, optimized computation for folded, unfolded, and frequency-folded fingerprints, enabling reproducible benchmarking.
Systematic Benchmarking Framework: Moves beyond simple retrieval tasks to evaluate specificity, score distributions, size dependence, and structural alignment with graph-based references.
Empirical Evidence on Bit Collisions: Demonstrated that folding high-occupancy fingerprints (like RDKit and MAP4) into standard sizes (4096 bits) causes severe artifacts, particularly in heterogeneous datasets.
Recommendation for Count Variants: Provided robust evidence that count (and log-count) variants generally outperform binary variants across specificity, structural alignment, and downstream ML tasks.

4. Key Results

A. Bit Collisions and Folding Artifacts

High Occupancy Fingerprints: RDKit and MAP4 fingerprints exhibit very high bit occupation rates. When folded to 4096 bits, they suffer from massive bit collisions.
Impact: Folding leads to artificially high similarity scores for dissimilar compounds. Unfolded variants of RDKit and MAP4 showed significantly better correlation with the graph-based RascalMCES reference (e.g., MAP4 correlation jumped from ~0 to ~0.59 when unfolded).
Recommendation: For RDKit and MAP4 on heterogeneous datasets, unfolded or sparse variants are essential. For low-occupancy fingerprints (Morgan, FCFP), folding has minimal impact.

B. Binary vs. Count Representations

Specificity: Count variants drastically reduce fingerprint duplicates and mass discrepancies compared to binary variants.
Structural Alignment: Count variants correlate better with RascalMCES scores.
Downstream Performance:
- Bioactivity Prediction: Binary and count variants performed similarly (presence/absence was sufficient).
- Chemical Class Prediction & Neighborhood Consistency: Count variants (especially log-count) significantly outperformed binary variants.
Conclusion: Count variants should be the default unless specific computational constraints dictate otherwise.
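The specificity argument can be illustrated with a small duplicate-rate calculation. The definition used here (the fraction of molecules whose fingerprint is shared with at least one other molecule) is an assumption about how such a rate might be computed, and the fingerprints are made up.

```python
from collections import Counter


def duplicate_rate(fingerprints):
    """Fraction of molecules whose fingerprint is identical to another molecule's."""
    keys = [tuple(sorted(fp.items())) for fp in fingerprints]
    occurrences = Counter(keys)
    return sum(1 for key in keys if occurrences[key] > 1) / len(keys)


def to_binary(fp):
    """Collapse counts to presence/absence."""
    return {fid: 1 for fid, count in fp.items() if count > 0}


# Hypothetical count fingerprints: the first three share the same building
# blocks (features 7 and 9) but in different amounts.
count_fps = [
    {7: 1, 9: 2},
    {7: 1, 9: 6},
    {7: 3, 9: 2},
    {5: 1},
]
binary_fps = [to_binary(fp) for fp in count_fps]

print(duplicate_rate(binary_fps))  # 0.75: three molecules collapse to one fingerprint
print(duplicate_rate(count_fps))   # 0.0: counts keep all four distinct
```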

C. Compound Size Dependence

Folded fingerprints (especially RDKit and MAP4) showed strong dependence on molecular mass; larger molecules received artificially inflated similarity scores due to bit collisions.
Unfolded and count variants largely eliminated this size bias, providing more consistent similarity metrics across small and large molecules.
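The size effect can be reproduced with a small simulation. It assumes that unrelated molecules have random, independent feature IDs and that folding is a modulo operation; both are simplifications, and the feature counts (50 and 2000) are arbitrary stand-ins for small and large molecules.

```python
import random

N_BITS = 4096
FEATURE_SPACE = 10**9  # assumed size of the unfolded feature space


def random_folded_fp(rng, n_features):
    """Binary folded fingerprint of a molecule with random feature IDs."""
    return {rng.randrange(FEATURE_SPACE) % N_BITS for _ in range(n_features)}


def tanimoto(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_random_similarity(n_features, n_pairs=100, seed=0):
    """Average folded similarity between pairs of unrelated random molecules."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        fp_a = random_folded_fp(rng, n_features)
        fp_b = random_folded_fp(rng, n_features)
        total += tanimoto(fp_a, fp_b)
    return total / n_pairs


small = mean_random_similarity(50)    # few features: sparse fingerprint
large = mean_random_similarity(2000)  # many features: crowded fingerprint

print(f"small molecules: {small:.3f}")
print(f"large molecules: {large:.3f}")
```

Under these assumptions the unrelated "large" molecules score far higher than the unrelated "small" ones, purely because more of the 4096 positions are occupied. Their unfolded counterparts would score close to zero, since random IDs drawn from such a large space almost never coincide.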

D. Ranking and Visualization

Ranking Disagreement: Different fingerprint types often select completely different "top-10" neighbors (median top-10 overlap was only ~3.5 out of 10).
Visualization: Unfolded log-count variants provided the most consistent subclass neighborhood structures in UMAP visualizations.
Best Performers: RDKit and MAP4 (when unfolded) and Morgan/FCFP (with larger radii, e.g., 9, and count variants) performed best across the broadest range of tasks.

5. Significance and Implications

Rethinking Defaults: The paper challenges the community to move away from arbitrary defaults (e.g., binary Morgan-2/3, folded 1024/2048 bits). It suggests Morgan-9/FCFP-9 with count/log-count representations as robust starting points for broad chemical space analysis.
Critical for ML: For tasks using similarity as a loss function or target (e.g., predicting structures from spectra), using folded fingerprints with bit collisions introduces noise that degrades model performance.
Scalability: The chemap library enables the handling of massive datasets (millions of compounds) with optimized similarity calculations, facilitating large-scale chemical space exploration.
Guidance for Future Research: The authors argue that "one size fits all" does not exist; however, for general-purpose representation of heterogeneous chemical spaces, unfolded count variants of path-based or distance-encoded fingerprints offer the best trade-off between specificity, structural fidelity, and computational feasibility.

In summary, the paper provides a rigorous, data-driven argument for adopting count-based, unfolded (or frequency-folded) fingerprints to avoid the distortions caused by bit collisions, particularly when analyzing large, diverse chemical datasets.

Count your bits: fingerprint benchmarking to assess broad chemical space representation