Best practices to cluster large molecular libraries

This paper presents a data-driven strategy to optimize BitBIRCH clustering parameters for large molecular libraries, demonstrating that specific similarity thresholds and high branching factors, combined with an iterative re-clustering procedure, effectively mitigate issues with singletons and oversized clusters to enhance the algorithm's robustness and usability.

Lopez Perez, K., Miranda Quintana, R. A.

Published 2026-04-08

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a massive library containing billions of books, but instead of stories, these books are actually tiny, complex instructions for building molecules (the stuff that makes up medicines and materials). Your goal is to organize this chaotic mountain of books into neat piles so you can find similar ones quickly.

This is the challenge scientists face with BitBIRCH, a smart computer program designed to sort these molecular "books." However, the program has two annoying habits:

  1. It sometimes leaves too many books alone on the shelves because it thinks they are too unique (these are called singletons).
  2. Other times, it shoves almost all the books into one giant, messy pile (a disproportionately large cluster), making it useless for finding specific similarities.
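Both habits can be put into numbers. The following is a minimal Python sketch, not part of BitBIRCH: the function name `cluster_diagnostics` and the two metrics are illustrative assumptions about how one might measure "too many singletons" and "one giant pile" from a list of cluster sizes.

```python
def cluster_diagnostics(cluster_sizes):
    """Summarize the two failure modes from a list of cluster sizes.

    Hypothetical helper for illustration only; it is not a BitBIRCH function.
    """
    if not cluster_sizes:
        raise ValueError("need at least one cluster")
    total = sum(cluster_sizes)  # total number of molecules
    singletons = sum(1 for size in cluster_sizes if size == 1)
    return {
        # share of molecules left alone on the shelf
        "singleton_fraction": singletons / total,
        # share of molecules swallowed by the biggest pile
        "largest_cluster_share": max(cluster_sizes) / total,
    }


if __name__ == "__main__":
    # Toy example: three lonely molecules and one pile of seven.
    print(cluster_diagnostics([1, 1, 1, 7]))
```

A well-tuned run would keep both numbers low; which values count as "low enough" depends on the library and is not specified here.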

The paper works like a user manual that teaches you how to tune this program so it behaves well. Here is the breakdown using some everyday analogies:

1. The "Goldilocks" Setting (Similarity Thresholds)

Think of the "similarity threshold" as the rule for how much two books need to look alike to be put in the same pile.

  • Too strict: If the rule is too picky, every book gets its own shelf (too many singletons).
  • Too loose: If the rule is too easy, you end up with one giant pile containing everything from cookbooks to encyclopedias.

The authors tested millions of settings and found the "Goldilocks zone." They discovered that the rule works best when it is set 3 to 4 "steps" (standard deviations) stricter than, that is, above, the average similarity in the library.

  • The Analogy: Imagine you are sorting people by height. If you say "only group people who are exactly the same height," you get almost no groups. If you say "group anyone taller than a shoe," you get one giant group. The sweet spot depends on the crowd itself: first measure how much heights typically differ, then only group people whose heights match far more closely than a typical pair's do. This creates neat, manageable groups, with few people left alone and no giant, messy group.
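For readers who want to see the arithmetic, here is a minimal Python sketch of the "3 to 4 steps above the average" rule. It assumes that "the average" means the mean pairwise Tanimoto similarity of a random sample of fingerprints, which this summary does not spell out. The function names, the fingerprints stored as Python integers, and the default `k=3.5` are illustrative choices, not BitBIRCH's API or the authors' code.

```python
import random
import statistics
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bitsets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return bin(a & b).count("1") / union


def suggest_threshold(fingerprints, k=3.5, sample_size=200, seed=0):
    """Return (threshold, mean, std) with threshold = mean + k * std, capped at 1.

    Assumption: mean and std are taken over pairwise similarities of a
    random sample, so the cost stays manageable for large libraries.
    """
    if len(fingerprints) < 2:
        raise ValueError("need at least two fingerprints")
    rng = random.Random(seed)
    if len(fingerprints) > sample_size:
        sample = rng.sample(fingerprints, sample_size)
    else:
        sample = list(fingerprints)
    sims = [tanimoto(a, b) for a, b in combinations(sample, 2)]
    mean = statistics.fmean(sims)
    std = statistics.pstdev(sims)
    return min(1.0, mean + k * std), mean, std


if __name__ == "__main__":
    # Toy data: random 64-bit patterns standing in for real fingerprints.
    rng = random.Random(42)
    toy = [rng.getrandbits(64) for _ in range(500)]
    threshold, mean, std = suggest_threshold(toy)
    print(f"mean={mean:.3f} std={std:.3f} suggested threshold={threshold:.3f}")
```

Trying `k=3` and `k=4` on the same sample gives the two ends of the recommended range, which is a quick way to see how sensitive the threshold is for a given library.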

2. The "Super-Organizer" (Branching Factor)

The "branching factor" is like how many sub-piles a main pile can split into at once.

  • The Finding: The more sub-piles you allow, the better. The authors recommend cranking this number up as high as your computer can handle (up to 1,024).
  • The Analogy: Imagine a librarian trying to sort books. If they can only put books into 2 boxes at a time, they will leave many books on the floor (singletons). But if they have 1,024 boxes ready to go, they can quickly sort almost every book into a specific spot, leaving very few books stranded on the floor.
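One way to build intuition for the box count: BitBIRCH follows the BIRCH family of algorithms, which keep clusters in a tree where the branching factor caps how many entries a node can hold. The sketch below is only back-of-the-envelope arithmetic under that assumption. The helper `min_tree_levels` is hypothetical, and it shows how the tree gets shallower as the branching factor grows; it does not reproduce the paper's finding about singletons.

```python
def min_tree_levels(n_entries: int, branching_factor: int) -> int:
    """Fewest tree levels needed to index n_entries when each node
    holds at most `branching_factor` entries (idealized, fully packed tree)."""
    if n_entries < 1 or branching_factor < 2:
        raise ValueError("need n_entries >= 1 and branching_factor >= 2")
    levels, capacity = 1, branching_factor
    while capacity < n_entries:
        capacity *= branching_factor  # each extra level multiplies capacity
        levels += 1
    return levels


if __name__ == "__main__":
    for bf in (2, 50, 1024):
        print(bf, min_tree_levels(10**9, bf))
```

For a billion entries this gives 30 levels with 2 boxes per node, 6 levels with 50, and 3 levels with 1,024, so each book passes through far fewer sorting decisions when more boxes are available.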

3. The "Second Chance" Round (Iterative Re-clustering)

Sometimes, even with the best settings, a few books still end up in the wrong place or alone.

  • The Solution: The authors suggest a "do-over" strategy. After the first big sort, you take the lonely books and the tiny, weird piles and run them through the sorter again, but this time with slightly looser rules.
  • The Analogy: Think of it like a dance party. First, you ask people to find partners who share exactly the same dance style. Some people are left standing alone. Then, you say, "Okay, let's try again, but this time, if you just kinda like the same music, you can dance together." This brings the lonely dancers into groups without ruining the whole party.
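The "second chance" idea can be sketched in a few lines of Python. To keep the example self-contained, it uses a simple greedy "leader" clusterer as a stand-in, which is not BitBIRCH, and the names `leader_cluster` and `recluster_leftovers`, the `step` by which the rule is loosened, and the `min_size` cutoff for "tiny" piles are all assumptions made for illustration.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bitsets."""
    union = bin(a | b).count("1")
    return 1.0 if union == 0 else bin(a & b).count("1") / union


def leader_cluster(fps, threshold):
    """Stand-in clusterer (NOT BitBIRCH): each item joins the first cluster
    whose leader (its first member) is at least `threshold` similar."""
    clusters = []  # lists of indices into fps
    for i, fp in enumerate(fps):
        for members in clusters:
            if tanimoto(fps[members[0]], fp) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def recluster_leftovers(fps, threshold, step=0.05, min_size=2, max_rounds=3):
    """Cluster, then re-run only the leftovers with a looser threshold."""
    kept = []                        # clusters accepted so far
    pending = list(range(len(fps)))  # indices still waiting for a home
    current = threshold
    for _ in range(max_rounds):
        if not pending:
            break
        sub = [fps[i] for i in pending]
        # Map cluster members back to indices in the original list.
        clusters = [[pending[j] for j in c] for c in leader_cluster(sub, current)]
        kept.extend(c for c in clusters if len(c) >= min_size)
        pending = [i for c in clusters if len(c) < min_size for i in c]
        current -= step              # loosen the rule for the next round
    kept.extend([i] for i in pending)  # anything still alone stays a singleton
    return kept


if __name__ == "__main__":
    toy = [0b11110000, 0b11110001, 0b11110011, 0b00001111]
    print(recluster_leftovers(toy, threshold=0.9, step=0.15))
```

In this toy run the first pass at 0.9 leaves every fingerprint alone, and the looser second pass pairs up the first two. Note that leftovers are only compared with each other here; merging them back into clusters accepted in earlier rounds would be a different design.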

The Bottom Line

This paper isn't just about math; it's about practical advice. It tells scientists: "Don't just guess how to use BitBIRCH. Use these specific settings (the 3-4 step rule and the high box count) and do a quick second pass if needed."

By following these guidelines, scientists can finally organize their massive libraries of molecules efficiently, making it easier to discover new medicines and materials without getting lost in the data.
