TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

The paper introduces TIGER, a text-informed framework that leverages protein-to-text generation models and a dynamic gating network to create generalized enzyme representations, significantly outperforming existing methods in bidirectional enzyme-reaction retrieval across diverse distributions and tasks.

Original authors: Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song

Published 2026-05-26

Original paper dedicated to the public domain under CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a librarian trying to match two very different types of books: Enzymes (which are like tiny, complex biological machines made of proteins) and Reactions (which are like chemical recipes describing what those machines do).

For a long time, scientists have tried to build a computer system that can look at a protein and guess its recipe, or look at a recipe and guess which protein made it. But existing systems have been like clumsy librarians:

  1. They are biased: They are great at finding the recipe if you give them the protein, but terrible at finding the protein if you give them the recipe.
  2. They are fragile: If you change the way you organize the books (the data), the librarian suddenly forgets everything.
  3. They are shallow: They only look at the spine of the book (the raw sequence of letters) and ignore the summary on the back cover (the text description of what the machine actually does).

Enter TIGER (Text-Informed Generalized Enzyme-Reaction Retrieval). Think of TIGER as a super-smart, bilingual librarian who has learned to read both the "spine" and the "summary" to make perfect matches.

Here is how TIGER works, broken down into simple parts:

1. The "Translator" (Protein-to-Text)

Traditional systems only read the raw code of the enzyme (a long string of amino-acid letters like M-K-T-L...). It's like trying to understand a machine just by looking at its serial number.
TIGER uses a special AI tool to translate that serial number into a plain English summary. It reads the protein and writes a sentence like: "This machine grabs a specific molecule and turns it into something else."

  • Why this helps: It adds "common sense" and context that the raw code misses, making it easier to match the machine to its recipe.

2. The "Quality Control Manager" (Dynamic Gating Network)

Here is the catch: The AI writing the English summaries isn't perfect. Sometimes it hallucinates or gets things slightly wrong (like a student who studied hard but still made a few mistakes on the test).
TIGER has a built-in Quality Control Manager (the Dynamic Gating Network).

  • When the AI generates a summary, this manager checks: "Does this summary make sense compared to the raw protein data?"
  • If the summary is good, the manager says, "Use this!" and boosts its importance.
  • If the summary is nonsense or noisy, the manager says, "Ignore that," and turns down the volume.
  • Result: The system learns to trust the good text and ignore the bad text, making it much more reliable.
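To make the "volume knob" idea concrete, here is a minimal sketch of a gate that blends a sequence embedding with a text embedding. This summary does not give the exact form of TIGER's Dynamic Gating Network, so the scalar sigmoid gate, the additive fusion, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are all assumptions made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(seq_emb, text_emb, gate_weights, gate_bias=0.0):
    """Blend a sequence embedding with a text embedding via a scalar gate.

    The gate looks at both embeddings and returns a value in (0, 1):
    values near 1 keep the text signal, values near 0 suppress it.
    This is an assumed, simplified gate, not the paper's architecture.
    """
    joint = list(seq_emb) + list(text_emb)
    score = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
    gate = sigmoid(score)
    fused = [s + gate * t for s, t in zip(seq_emb, text_emb)]
    return fused, gate

# Hypothetical toy values; in a real model gate_weights would be learned.
seq_emb = [0.5, -0.2]
text_emb = [0.4, 0.1]
gate_weights = [1.0, 1.0, 1.0, 1.0]

# A large positive bias stands in for "the summary looks trustworthy",
# a large negative bias for "the summary looks noisy".
fused_open, gate_open = gated_fusion(seq_emb, text_emb, gate_weights, gate_bias=5.0)
fused_shut, gate_shut = gated_fusion(seq_emb, text_emb, gate_weights, gate_bias=-5.0)
```

With the gate nearly shut, the fused vector stays close to the raw sequence embedding, which is the "ignore that" behavior described above.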

3. The "Universal Translator" (Structure-Shared Feature Projector)

Even with good summaries, the "Protein Language" and the "Chemical Recipe Language" are still different dialects.
TIGER uses a Universal Translator (the Structure-Shared Feature Projector). It takes the protein's data and the reaction's data and forces them to speak the same language in a shared "meeting room" (a unified space).

  • This ensures that when the system looks for a match, it's comparing apples to apples, not apples to oranges. This fixes the "bias" problem, making the system just as good at finding recipes from proteins as it is at finding proteins from recipes.
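The "meeting room" can be sketched as a single projection applied to both sides, followed by a cosine-similarity comparison. Reading "structure-shared" as one projector with the same weights for enzymes and reactions is an assumption here, as are the linear form, the toy dimensions, and the names `project` and `shared_weights`.

```python
import math

def project(vec, weights):
    """Apply a linear map (a list of rows) and L2-normalize the result."""
    out = [sum(w * x for w, x in zip(row, vec)) for row in weights]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]

def cosine(a, b):
    """Dot product of two unit vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical shared weights (3-d input -> 2-d shared space).
shared_weights = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
]

enzyme_feat = [0.9, 0.1, 0.2]     # toy enzyme features
matching_rxn = [0.8, 0.2, 0.1]    # toy features of its reaction
unrelated_rxn = [-0.7, 0.9, 0.0]  # toy features of another reaction

# Both modalities pass through the same projector before comparison.
enzyme_vec = project(enzyme_feat, shared_weights)
score_match = cosine(enzyme_vec, project(matching_rxn, shared_weights))
score_other = cosine(enzyme_vec, project(unrelated_rxn, shared_weights))
```

Because both vectors land in the same normalized space, the same similarity score can rank reactions for an enzyme or enzymes for a reaction.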

4. The "Double-Check" (Bidirectional Training)

Most systems train themselves to only go one way (Protein → Recipe). TIGER trains itself to go both ways simultaneously. It constantly practices:

  • "Given this protein, find the recipe."
  • "Given this recipe, find the protein."

This double-checking makes the system robust. It doesn't matter if you throw a new, weird protein at it or a strange new recipe; the system has learned the relationship between them, not just a memorized list.
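One common way to train in both directions at once is a symmetric contrastive loss over a similarity matrix. The sketch below assumes that form; this summary does not state TIGER's actual loss, and the temperature value and the function names `cross_entropy_rows` and `bidirectional_loss` are illustrative.

```python
import math

def cross_entropy_rows(sim, temperature=0.1):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(sim)

def bidirectional_loss(sim, temperature=0.1):
    """Average the enzyme->reaction and reaction->enzyme losses."""
    transposed = [list(col) for col in zip(*sim)]
    e2r = cross_entropy_rows(sim, temperature)         # protein -> recipe
    r2e = cross_entropy_rows(transposed, temperature)  # recipe -> protein
    return 0.5 * (e2r + r2e)

# Toy similarity matrices: sim[i][j] = similarity(enzyme i, reaction j).
aligned = [[0.9, 0.1], [0.2, 0.8]]     # true pairs score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]  # true pairs score lowest

loss_good = bidirectional_loss(aligned)
loss_bad = bidirectional_loss(misaligned)
```

The loss is small only when the true pair wins in both the rows and the columns, so a model cannot do well by being good in one direction alone.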

The Results: A Super-Librarian

The authors tested TIGER on a massive dataset called ReactZyme (a giant library of enzyme-reaction pairs). They challenged it with three difficult scenarios:

  1. Time-based: Newer data the system had never seen before.
  2. Similarity-based: Proteins that look very different from anything in the training set.
  3. Reaction-based: Chemical reactions that were completely new.
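Two of these scenarios are easy to sketch as data splits. The record fields (`enzyme`, `reaction`, `added`) and the cutoff date below are hypothetical and do not reflect ReactZyme's actual schema or the authors' split procedure. The similarity-based split is left out because it needs a sequence-clustering step that this summary does not describe.

```python
from datetime import date

def time_split(records, cutoff):
    """Train on pairs recorded before the cutoff, test on the rest."""
    train = [r for r in records if r["added"] < cutoff]
    test = [r for r in records if r["added"] >= cutoff]
    return train, test

def reaction_split(records, held_out_reactions):
    """Keep every pair of a held-out reaction out of training."""
    train = [r for r in records if r["reaction"] not in held_out_reactions]
    test = [r for r in records if r["reaction"] in held_out_reactions]
    return train, test

# Hypothetical records; the field names are illustrative only.
records = [
    {"enzyme": "E1", "reaction": "R1", "added": date(2019, 3, 1)},
    {"enzyme": "E2", "reaction": "R1", "added": date(2021, 6, 15)},
    {"enzyme": "E3", "reaction": "R2", "added": date(2023, 1, 10)},
    {"enzyme": "E4", "reaction": "R3", "added": date(2024, 8, 5)},
]

time_train, time_test = time_split(records, cutoff=date(2023, 1, 1))
rxn_train, rxn_test = reaction_split(records, held_out_reactions={"R2"})
```

In the reaction-based case, no reaction in the test set appears anywhere in training, which is what makes those recipes "completely new" to the model.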

The Outcome:
TIGER crushed the competition. While other systems stumbled and failed when the data changed, TIGER kept performing at a high level.

  • It improved accuracy by huge margins (sometimes doubling or tripling the success rate of previous methods).
  • It fixed the "bias" problem, performing equally well in both directions.
  • It proved that adding text descriptions (and filtering out the bad ones) is the secret sauce to understanding how biological machines work.
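A simple way to see what "performing equally well in both directions" means is to score retrieval separately for each direction. The sketch below uses top-k accuracy on a toy similarity matrix; the metric choice and the numbers are assumptions for illustration and are not results from the paper.

```python
def top_k_accuracy(sim, k=1):
    """Fraction of queries (rows) whose true match (column i) is in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy scores: sim[i][j] = similarity(enzyme i, reaction j); true pairs sit on the diagonal.
sim = [
    [0.9, 0.3, 0.1],
    [0.2, 0.4, 0.6],
    [0.1, 0.2, 0.8],
]

# Rows as queries: enzyme -> reaction. Columns as queries: reaction -> enzyme.
e2r_top1 = top_k_accuracy(sim, k=1)
r2e_top1 = top_k_accuracy([list(col) for col in zip(*sim)], k=1)
```

In this toy matrix the two directions disagree (two of three enzyme queries succeed, while all three reaction queries do), which is the kind of directional gap a balanced system aims to close.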

In short, TIGER is a system that doesn't just memorize data; it reads the "story" behind the data, filters out the lies, and learns the true connection between biological machines and their chemical recipes.
