Protein solubility depends on centrifugation: Aiki-Sol, a per-regime predictor for E. coli

The paper introduces Aiki-Sol, a protein solubility predictor that overcomes the performance plateau of existing models by explicitly accounting for centrifugation regimes as a critical feature rather than noise, achieving significant accuracy gains on a newly released, stringency-annotated E. coli dataset.

Original authors: Rajagopalan, R., Meda, R. S., Shastry, S., Mysore, V.

Published 2026-05-14

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a computer to predict whether a specific protein (a tiny building block of life) will dissolve nicely in water or clump up into a solid mess when made inside a bacterium called E. coli. For the last eight years, scientists have been using advanced AI to make these predictions, but they've hit a wall. The predictions aren't getting any better, no matter how sophisticated the models become.

The Hidden Problem: The "Spin" Confusion
The paper argues that the computers aren't failing because they aren't smart enough; they are failing because they are being tricked by a hidden variable: centrifugation.

Think of making a protein like making a smoothie with chunks of fruit.

  • If you spin the smoothie slowly, only the big chunks sink to the bottom, and the tiny bits stay suspended in the liquid on top. Everything still in the liquid gets called "soluble."
  • If you spin it super fast, even the tiny bits get forced to the bottom. Material that counted as "soluble" a moment ago now gets called "insoluble."

The protein itself hasn't changed. It's the same smoothie. But the method used to separate the liquid from the solids (the "centrifugation regime") changes the result.

For years, scientists have been feeding their AI models data where the "spin speed" was hidden. Every protein was simply labeled "soluble" or "insoluble." It's like trying to teach a student to predict the weather while hiding the fact that some data comes from a sunny beach and some comes from a rainy mountain. The student gets confused because the rules seem to change randomly. The paper calls this a "latent confound": a hidden variable that changes the label without appearing anywhere in the data.
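To make the confound concrete, here is a minimal toy sketch. It assumes each protein can be summarized by a single made-up number, `pellets_at`, the spin stringency at which it drops out of the liquid. The class, the function, and every number are hypothetical illustrations, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    # Hypothetical score: the lowest spin stringency (arbitrary units)
    # at which this protein ends up in the pellet instead of the liquid.
    pellets_at: float


def label(protein: Protein, spin_stringency: float) -> str:
    """Label a protein the way an assay would at one spin stringency."""
    return "insoluble" if spin_stringency >= protein.pellets_at else "soluble"


proteins = [Protein("A", 5.0), Protein("B", 50.0), Protein("C", 500.0)]
gentle, harsh = 10.0, 100.0  # illustrative values, not taken from the paper

for p in proteins:
    # Protein B is the interesting case: same protein, two different labels.
    print(p.name, label(p, gentle), label(p, harsh))
```

In this toy, protein B is "soluble" under the gentle spin and "insoluble" under the harsh one. A model that only ever sees the label, and never the spin, has no way to learn why.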

The Solution: Aiki-Sol and the New Dataset
The researchers fixed this by creating a massive new library of data called the Aiki-Sol Dataset. Instead of just saying "soluble" or "insoluble," they tagged every single protein with exactly how hard it was spun (the "stringency").

They organized this into three tiers:

  1. The Benchmark: A strict, high-quality set of about 85,000 proteins where the spin speed is known.
  2. The Extension: A larger set of about 147,000 proteins with just the basic labels.
  3. The Research Pool: A huge collection of about 229,000 proteins from various sources.
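One way to picture the tiers is as a tag on each record. The sketch below is an assumption about how such records could be represented. The field names, the `Tier` enum, and the helper function are hypothetical and do not describe the released dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    BENCHMARK = "benchmark"
    EXTENSION = "extension"
    RESEARCH_POOL = "research_pool"


@dataclass(frozen=True)
class Record:
    sequence: str                 # amino-acid sequence
    soluble: bool                 # the basic soluble / insoluble label
    tier: Tier
    regime: Optional[int] = None  # None means the spin regime was not recorded


def usable_for_per_regime_eval(record: Record) -> bool:
    """Keep only benchmark records whose spin regime is known."""
    return record.tier is Tier.BENCHMARK and record.regime is not None


example = Record("MKTAYIAK", True, Tier.BENCHMARK, regime=2)
print(usable_for_per_regime_eval(example))
```

Keeping the regime as an optional field lets records with only the basic label sit in the same structure without pretending their spin speed is known.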

The Results: It's About the Rules, Not the Brain
When they tested old AI models on this new, honest data, the results were shocking. On the "high-speed spin" group, the best existing models actually performed worse than random guessing (like flipping a coin). They were so confused by the hidden spin speeds that they got it wrong more often than right.
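A pooled score can hide exactly this kind of failure, which is why scoring each spin group separately matters. The sketch below uses made-up toy predictions and plain accuracy, where 0.5 is chance only if the two classes are balanced. The paper's actual metric and numbers are not reproduced here.

```python
from collections import defaultdict


def accuracy_by_regime(rows):
    """rows: iterable of (regime, true_label, predicted_label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for regime, truth, pred in rows:
        totals[regime] += 1
        hits[regime] += int(truth == pred)
    return {regime: hits[regime] / totals[regime] for regime in totals}


# Made-up toy predictions, for illustration only.
rows = (
    [("low", 1, 1)] * 8 + [("low", 0, 1)] * 2      # 8 of 10 correct
    + [("high", 0, 1)] * 6 + [("high", 0, 0)] * 4  # 4 of 10 correct
)

per_regime = accuracy_by_regime(rows)
pooled = sum(truth == pred for _, truth, pred in rows) / len(rows)

print(per_regime)  # accuracy within each spin group
print(pooled)      # accuracy with the groups mixed together
```

In this toy, the pooled accuracy is 0.6, which looks passable, while the "high" group alone sits at 0.4, below a coin flip.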

Then, they built a new model called Aiki-Sol.

  • The Trick: Instead of trying to guess one single answer, Aiki-Sol is trained to give five different answers depending on how hard the protein is spun, plus one answer if the spin speed is unknown.
  • The Surprise: They found that making the AI "bigger" (adding more brainpower or using complex 3D structures) didn't help. The magic wasn't in the architecture; it was in curation. By teaching the AI to pay attention to the "spin speed" rules, a standard-sized model suddenly became much smarter.
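The "five answers plus one" idea can be sketched as six output heads on top of shared features, with the head chosen by the spin regime. This is a deliberately tiny linear toy under stated assumptions: the weights, biases, feature vector, and function names are all hypothetical, and the paper's actual architecture is not described here.

```python
import math
from typing import Optional, Sequence

N_REGIMES = 5        # five spin regimes, as described above
UNKNOWN = N_REGIMES  # index of the extra "regime unknown" head


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(features: Sequence[float],
            head_weights: Sequence[Sequence[float]],
            head_biases: Sequence[float],
            regime: Optional[int] = None) -> float:
    """Return P(soluble) from the head for `regime`, or from the unknown head."""
    if regime is not None and not 0 <= regime < N_REGIMES:
        raise ValueError(f"regime must be in 0..{N_REGIMES - 1} or None")
    head = UNKNOWN if regime is None else regime
    score = sum(w * x for w, x in zip(head_weights[head], features))
    return sigmoid(score + head_biases[head])


# Toy parameters: identical weights per head, biases that fall as the spin gets harsher.
weights = [[1.0, -0.5]] * (N_REGIMES + 1)
biases = [2.0, 1.0, 0.0, -1.0, -2.0, 0.0]
features = [0.4, 0.8]  # stand-in for a shared representation of one protein

probs = [predict(features, weights, biases, r) for r in range(N_REGIMES)]
print(probs)                                # one answer per spin regime
print(predict(features, weights, biases))   # answer when the regime is unknown
```

The same protein gets a lower predicted chance of being "soluble" as the toy spin gets harsher, while the extra head gives a fallback answer when the lab conditions were never recorded.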

The Outcome
When tested on new groups of proteins that the AI had never seen before, the success rate rose from about 70% to over 82% with Aiki-Sol. Even more impressively, on groups where the AI had zero prior knowledge of the specific proteins, it still improved by a huge margin.

In a Nutshell
The paper claims that for years, protein solubility predictors were stuck because they ignored the "spin speed" used in the lab. By creating a new dataset that respects these different lab conditions and teaching the AI to adapt its predictions based on them, they broke the performance plateau. The key wasn't building a bigger, more complex brain, but rather teaching the existing brain to understand the specific rules of the game.
