LAFA: A Framework for Reproducible Longitudinal Assessment of Protein Function Annotation Models: Plain-Language Explanation

Original authors: An Phan, Yanli Wang, Frimpong Boadu, Maxat Kulmanov, Robert Hoehndorf, Jianlin Cheng, Predrag Radivojac, Iddo Friedberg

Published 2026-04-23



CC BY 4.0


Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: From a "Pop-Up Exam" to a "Living Gym"

Imagine the world of protein function prediction (figuring out what a specific protein does in our bodies) as a massive, complex gym. Scientists build different "machines" (computer programs) to guess what these proteins do.

For years, the only way to see which machine was the best was through CAFA (Critical Assessment of Function Annotation). Think of CAFA as a pop-up exam that happens only once every three years.

The Problem: Between these exams, no one knows if a new machine is better or worse. Also, the "answer key" (the real biological data) keeps changing. If a machine was trained on old textbooks, it might fail the new exam, but we wouldn't know until three years later. Plus, once the exam is over, the machines often get locked away or become hard to use.

LAFA (Longitudinal Assessment of Protein Function Annotation Models) is the solution. It is like turning that pop-up exam into a 24/7 Living Gym with a Continuous Scoreboard.

How LAFA Works (The Analogy)

1. The "Time-Travel" Training Camp

In the old CAFA system, there was a risk that the machines might "cheat" by peeking at the answer key while they were studying.

LAFA's Fix: LAFA uses containers (think of these as sealed, time-traveling bubbles).
When a scientist puts their prediction machine into a container, it gets sealed shut. It is given a set of protein sequences to analyze, but the bubble is locked so it cannot see any new biological data that gets discovered after the start date.
This ensures the machine is judged fairly on what it knew at that specific moment, not on what it learned later.

2. The "Living" Answer Key

In biology, our understanding of proteins is like a Wikipedia page that is constantly being edited. New facts are added, and old facts are sometimes corrected or removed.

The Old Way: You took a snapshot of the Wikipedia page, ran your test, and then waited three years to see how you did. By then, the page had changed so much the test felt outdated.
The LAFA Way: LAFA keeps the Wikipedia page open. It continuously checks how well the machines predict the new facts as they appear. It creates "time windows" (like checking your progress every couple of months instead of every three years) to see how the machines handle the evolving truth.

3. The "Replay" Button (Reproducibility)

One of the biggest headaches in science is that if you try to run someone else's code three years later, it often breaks because the software environment has changed.

LAFA's Fix: Because every method is inside a container, it's like putting the machine, its tools, and its manual into a single, indestructible box.
You can open that box five years from now, and the machine will run exactly the same way it did today. This means anyone can verify the results, ensuring the science is honest and reproducible.

4. The "Dashboard" (The Front-End)

LAFA isn't just a backend computer crunching numbers; it has a public website.

Imagine a sports scoreboard that updates as new results come in. You can visit the site and see:
- Which machine is currently the fastest?
- Which machine is the most accurate at guessing "Molecular Functions" vs. "Cellular Locations"?
- How did Machine A improve after its developers retrained it with new data?
It allows scientists to compare different "time windows" (e.g., "How did Machine A do in the last 4 months vs. the last 8 months?").

Why Does This Matter?

No More "Stale" Science: Instead of waiting three years to see if a new method works, developers can get feedback with each new data release (roughly every 8 weeks). If their model is failing, they can fix it without waiting for the next CAFA round.
Fair Play: Because the data is constantly updating, LAFA can show which models are "aging well" and which ones are becoming obsolete because they rely on old training data.
Open Access: Anyone can see the results, and anyone can submit their own "containerized" machine to be tested. It turns protein prediction from a closed club into an open community effort.

The Bottom Line

LAFA is a new, permanent platform that treats protein function prediction like a continuous marathon rather than a one-time sprint. It ensures that the tools we use to understand life are constantly tested against the latest scientific discoveries, are easy to reuse, and are judged fairly in a transparent, open environment.

The authors are essentially saying: "Stop waiting for the exam every three years. Let's put all the machines in a gym, give them a live scoreboard, and watch them run together every day."

1. Problem Statement

Protein function prediction is a critical yet challenging task in computational biology. While the Critical Assessment of protein Function Annotation (CAFA) provides a prestigious, triennial benchmark for evaluating these methods, it suffers from several limitations that hinder continuous scientific progress:

Infrequency: CAFA occurs only every three years, leaving a gap where new methods cannot be continuously evaluated against evolving ground truth.
The "Open World" Assumption: Annotations accumulate continuously. Methods evaluated in a short CAFA window (4–6 months) may miss valid labels that appear later, leading to mis-evaluation. Conversely, models trained on old data suffer from performance decay as the "ground truth" changes (e.g., GO terms becoming obsolete or being corrected).
Reproducibility and Usability: Many CAFA participants' methods are difficult to install, available only as web servers, or lack robust maintenance in public repositories, making independent verification difficult.
Data Stagnation: Training sets used by models often become outdated between CAFA rounds, failing to reflect the most recent biological knowledge.
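
The effect of incomplete ground truth on scoring can be made concrete with a toy example. The sketch below is only an illustration: the predictions and the annotations at each window are made up, and the `precision` helper is a hypothetical name, not part of LAFA or CAFA-evaluator. It shows how the same frozen predictions can look worse under a short evaluation window than under a longer one, simply because fewer true labels have been recorded yet.

```python
def precision(predicted: set[str], truth: set[str]) -> float:
    """Fraction of predicted terms that the ground truth confirms."""
    if not predicted:
        return 0.0
    return len(predicted & truth) / len(predicted)


# Made-up GO term predictions for one protein, frozen at t0.
predicted = {"GO:0003677", "GO:0005634", "GO:0006355"}

# Made-up experimental annotations accumulated after t0.
truth_after_4_months = {"GO:0003677"}
truth_after_8_months = {"GO:0003677", "GO:0005634"}

p4 = precision(predicted, truth_after_4_months)
p8 = precision(predicted, truth_after_8_months)

# The predictions did not change; only the recorded knowledge did.
print(f"precision after 4 months: {p4:.2f}")
print(f"precision after 8 months: {p8:.2f}")
```

In this toy case one "false positive" from the shorter window turns out to be a correct prediction once the annotation catches up, which is the mis-evaluation the open-world point describes.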

2. Methodology: The LAFA Framework

To address these issues, the authors developed LAFA (Longitudinal Assessment of Protein Function Annotation Models), a persistent, containerized benchmarking server.

Core Architecture

LAFA merges the concept of prospective evaluation with continuous data updates. It consists of two main components:

Back-end (Computation): Hosted on a high-performance computing (HPC) cluster using Nextflow workflows. It handles data acquisition, prediction generation, and evaluation.
Front-end (Visualization): A lightweight Dockerized website for interactive visualization of results, allowing users to browse, compare, and analyze performance across different time windows.

Operational Workflow

LAFA operates on a continuous timeline defined by UniProt-GOA data releases (approx. every 8 weeks):

Time Point Build ($t_0$):
- External data (UniProt-GOA, UniProt Knowledgebase, GO Graph) is acquired.
- A Test Set is constructed from high-quality, manually annotated SwissProt sequences (approx. 550k–580k proteins) that remain stable across time points.
- A Training Set is built from experimentally validated annotations available up to $t_0$.
- Containerized Methods (hosted on DockerHub) are pulled and executed offline to generate predictions for the test set. This ensures no data leakage, as the containers cannot access data released after $t_0$.
Time Window Evaluation ($t_0 \to t_1$):
- A time window is defined between two data releases (e.g., Sep 2025 to Nov 2025).
- Ground Truth Construction: New experimental annotations accumulated between $t_0$ and $t_1$ for the test set proteins are extracted.
- Evaluation: Predictions generated at $t_0$ are compared against the new ground truth using the CAFA-evaluator package.
- Metrics: Precision, Recall, and F1-scores are calculated across various thresholds. The best F1-score is reported.
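
The evaluation steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the CAFA-evaluator implementation: the function names, the data layout (protein ID mapped to GO terms or to scored terms), the 100-step threshold grid, and the toy data are all hypothetical. It uses the protein-centric convention common in CAFA publications (precision averaged over proteins with at least one prediction at the threshold, recall averaged over all ground-truth proteins) and leaves out ontology propagation and any weighting.

```python
Annotations = dict[str, set[str]]  # protein ID -> set of GO term labels


def new_ground_truth(at_t0: Annotations, at_t1: Annotations,
                     test_set: set[str]) -> Annotations:
    """Terms recorded at t1 but not at t0, restricted to test-set proteins."""
    truth: Annotations = {}
    for protein in test_set:
        gained = at_t1.get(protein, set()) - at_t0.get(protein, set())
        if gained:  # proteins that gained nothing are not evaluated
            truth[protein] = gained
    return truth


def best_f1(predictions: dict[str, dict[str, float]], truth: Annotations,
            steps: int = 100) -> tuple[float, float]:
    """Return (best F1, threshold) over an evenly spaced threshold grid."""
    best = (0.0, 0.0)
    for i in range(1, steps + 1):
        tau = i / steps
        precisions, recall_sum = [], 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            called = {t for t, s in scores.items() if s >= tau}
            hits = len(called & true_terms)
            if called:
                precisions.append(hits / len(called))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            f1 = 2 * pr * rc / (pr + rc)
            if f1 > best[0]:
                best = (f1, tau)
    return best


# Toy data with placeholder term labels.
at_t0 = {"P1": {"T_a"}, "P2": set()}
at_t1 = {"P1": {"T_a", "T_b"}, "P2": {"T_c"}, "P3": {"T_d"}}
truth = new_ground_truth(at_t0, at_t1, test_set={"P1", "P2"})

predictions = {"P1": {"T_b": 0.9, "T_e": 0.4}, "P2": {"T_c": 0.3, "T_f": 0.8}}
f1, tau = best_f1(predictions, truth)
print(truth, round(f1, 3))
```

Note that `P3` is ignored because it is outside the test set, and `T_a` is excluded for `P1` because it was already known at $t_0$.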

Implemented Models

The current implementation includes three state-of-the-art methods and four baselines:

TransFew: Combines protein representations with semantic GO term representations.
FunBind: A multimodal foundation model (currently using sequence modality only).
DeepGOPlus: Combines deep convolutional neural networks with sequence similarity.
Baselines: Naive (frequency-based), Non-Experimental GOA, BLAST (homology-based), and Embedding Similarity.
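
To show what a frequency-based baseline looks like, here is a minimal sketch. The text above only says the Naive baseline is "frequency-based"; the normalization used here (fraction of training proteins carrying each term) follows the usual CAFA convention and is an assumption about LAFA's version. The function names and toy training set are hypothetical.

```python
from collections import Counter


def fit_naive(training: dict[str, set[str]]) -> dict[str, float]:
    """Score each term by the fraction of training proteins annotated with it."""
    counts = Counter(term for terms in training.values() for term in terms)
    n = len(training)
    return {term: count / n for term, count in counts.items()}


def predict_naive(term_scores: dict[str, float],
                  proteins: list[str]) -> dict[str, dict[str, float]]:
    """Every query protein gets the same scores, regardless of its sequence."""
    return {protein: dict(term_scores) for protein in proteins}


# Toy training set with placeholder term labels.
training = {
    "P1": {"T1", "T2"},
    "P2": {"T1"},
    "P3": {"T1", "T3"},
    "P4": {"T2"},
}
scores = fit_naive(training)
preds = predict_naive(scores, ["Q1", "Q2"])
print(scores)
```

Because this baseline never looks at the query sequence, it mainly measures how far a method gets by predicting common terms, which is why it is a useful floor for comparison.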

3. Key Contributions

Continuous Benchmarking: LAFA shifts protein function evaluation from a periodic (triennial) event to a continuous, longitudinal process, allowing method performance to be tracked with each new data release.
Reproducibility via Containerization: By requiring methods to be submitted as Docker containers, LAFA ensures that methods are portable, reproducible, and isolated from external data changes during execution.
Dynamic Training Data Analysis: The framework allows researchers to retrain models with updated data and immediately compare the new version against the old version on the same time window. This quantifies the impact of training data recency on performance.
Open Infrastructure: The entire pipeline (Nextflow workflows, helper scripts, and containerization guides) is open-source, fostering community contribution and transparency.
Granular Time-Window Comparison: Users can compare performance across different accumulation periods (e.g., 4-month vs. 8-month windows) to assess robustness.
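
The time-window comparison can be pictured as choosing pairs of data releases. The sketch below assumes that any earlier release can serve as $t_0$ for any later one; the text does not confirm that LAFA enumerates windows this way. Only the September 2025 to March 2026 range and the roughly 8-week cadence come from the text, so the exact release days are placeholders, and `months_between` is a hypothetical helper.

```python
from datetime import date
from itertools import combinations

# Placeholder release dates spanning the range mentioned in the text.
releases = [date(2025, 9, 1), date(2025, 11, 1), date(2026, 1, 1), date(2026, 3, 1)]


def months_between(t0: date, t1: date) -> int:
    """Whole calendar months from t0 to t1."""
    return (t1.year - t0.year) * 12 + (t1.month - t0.month)


# combinations() on a sorted list yields every (t0, t1) pair with t0 < t1.
windows = [
    (t0, t1, months_between(t0, t1))
    for t0, t1 in combinations(sorted(releases), 2)
]

# Group windows by accumulation period so equal-length windows can be compared.
by_length: dict[int, list[tuple[date, date]]] = {}
for t0, t1, months in windows:
    by_length.setdefault(months, []).append((t0, t1))

for months in sorted(by_length):
    for t0, t1 in by_length[months]:
        print(f"{months}-month window: {t0:%b %Y} -> {t1:%b %Y}")
```

With four releases there are six possible windows; longer windows such as the 8-month example become available as more releases accumulate.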

4. Results and Current Status

Pilot Implementation: The system has been tested with a "ground truth target set" of 7,401 proteins (a subset of SwissProt) due to current compute resource limitations.
Data Integration: The system successfully ingested data releases from September 2025 through March 2026 and generated predictions for the participating methods.
Visualization: The front-end successfully displays F1-scores and Precision-Recall curves for Molecular Function, Biological Process, and Cellular Component aspects.
Reproducibility: The study demonstrates that predictions from static models (like TransFew and DeepGOPlus) remain identical if input sequences are unchanged, validating the stability of the containerized approach.
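
One simple way to check that two runs produced identical predictions is to compare a digest of a canonical rendering of each run. The text does not say how the authors verified this, so the following is only an assumed approach; the `prediction_digest` name, the tab-separated layout, and the six-decimal rounding are illustrative choices.

```python
import hashlib


def prediction_digest(predictions: dict[str, dict[str, float]]) -> str:
    """SHA-256 over a sorted, fixed-precision text rendering of predictions."""
    h = hashlib.sha256()
    for protein in sorted(predictions):
        for term in sorted(predictions[protein]):
            score = predictions[protein][term]
            h.update(f"{protein}\t{term}\t{score:.6f}\n".encode("utf-8"))
    return h.hexdigest()


# Two toy runs with the same content in a different order, plus one changed run.
run_a = {"P1": {"T1": 0.9, "T2": 0.4}, "P2": {"T3": 0.7}}
run_b = {"P2": {"T3": 0.7}, "P1": {"T2": 0.4, "T1": 0.9}}
run_c = {"P1": {"T1": 0.9, "T2": 0.5}, "P2": {"T3": 0.7}}

same = prediction_digest(run_a) == prediction_digest(run_b)
changed = prediction_digest(run_a) != prediction_digest(run_c)
print(same, changed)
```

Sorting before hashing makes the check insensitive to output order, so only a real change in a protein, term, or score alters the digest.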

5. Significance and Future Directions

LAFA represents a paradigm shift in computational biology benchmarking:

For Developers: It provides a platform to test model robustness against the "drift" of biological knowledge and offers immediate feedback on how training data updates affect performance.
For the Community: It creates a persistent, transparent record of progress in protein function prediction, moving beyond the "snapshot" nature of CAFA.
Future Work: The authors plan to scale the system to the full SwissProt dataset (~550k sequences), collaborate with UniProt-GOA for private hold-out sets to enable immediate evaluation, and introduce new metrics (e.g., GO-slim based) to reduce sensitivity to ontology artifacts.

Conclusion: LAFA establishes a sustainable, open, and reproducible ecosystem for the longitudinal assessment of protein function prediction, addressing the critical need for continuous evaluation in a rapidly evolving biological data landscape.

LAFA: A Framework for Reproducible Longitudinal Assessment of Protein Function Annotation Models