LongevityBench: Are SotA LLMs ready for aging research?

This paper introduces LongevityBench, a comprehensive benchmark that evaluates state-of-the-art large language models on their ability to interpret diverse aging-related biodata and grasp the fundamental biological principles of aging. It reveals the models' current limitations and outlines strategies to make them more useful in longevity research.

Zhavoronkov, A., Sidorenko, D., Naumov, V., Pushkov, S., Zagirova, D., Aladinskiy, V., Unutmaz, D., Aliper, A., Galkin, F.

Published 2026-04-15

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a group of very smart, super-advanced robots (the AI models) that can read almost everything ever written on the internet. They can write poetry, solve math problems, and chat like humans. But here's the big question: Do they actually understand how living things work, or are they just really good at guessing based on patterns they've seen before?

To find out, the scientists at Insilico Medicine created a special "final exam" called LongevityBench.

Think of this exam like a driver's test for aging. Just as a driving test checks if you can actually handle a car in rain, snow, and traffic—not just recite the rules of the road—LongevityBench checks if these AI robots truly understand the biology of getting old.

The Test Subjects: The "Super-Students"

The researchers picked 15 of the smartest AI models currently available (like the latest versions of GPT, Gemini, Claude, and others). They threw a massive pile of biology homework at them. This wasn't just reading a textbook; it was real-world data.

The homework covered five different "subjects":

  1. The Crystal Ball (Survival): Looking at a person's medical records and blood tests, can the AI guess how long they will live?
  2. The Time Machine (Age Prediction): Looking at a sample of DNA or blood proteins, can the AI tell you exactly how old the person is?
  3. The Genetic Puzzle (Mutations): If you change a specific gene in a fly or a mouse, will they live longer or shorter? Can the AI predict the outcome?
  4. The Cancer Detective: Looking at tumor data, can the AI guess which patient will stay healthy longer?
  5. The Gene Writer: Can the AI look at a partial list of active genes and "fill in the blanks" with the rest of the list correctly?

The Results: Who Passed?

The results were a mix of "brilliant" and "confused."

  • The Top Performers: The Google Gemini 3 Pro and OpenAI's GPT-5/o3 models came out on top. They were like the valedictorians of the class, getting the highest average scores. They were particularly good at looking at medical records and guessing survival times.
  • The "One-Trick Ponies": Some models were amazing at one thing but terrible at another. For example, Claude Sonnet was the best at predicting cancer survival but struggled with other tasks. It's like a student who is a genius at math but fails history.
  • The "Random Guessers": When the test got really hard—like trying to predict age just by looking at a list of proteins (proteomics)—most of the AIs dropped to near-random guessing. It was as if they were just flipping a coin.

The Big Surprise: The "Question Trick"

The most interesting part of the paper wasn't just who got the best grades, but how they took the test.

The researchers noticed that the models' performance changed wildly depending on how the question was asked.

  • The Analogy: Imagine asking a student, "Who is taller: Alice or Bob?" They might get it right. But if you ask, "Who is shorter: Alice or Bob?" they might get it wrong, even though the answer is the same.
  • The Reality: The AI models were very sensitive to the wording. If you asked them to pick the older person between two people (a "pairwise" choice), they often failed. But if you asked them to guess the age group (e.g., "Is this person in their 60s or 70s?"), they did much better.

This suggests the AI models aren't building a deep, internal "map" of how aging works. Instead, they are often just recognizing specific patterns in the question format. Change the format, and their "knowledge" seems to vanish.
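For technically inclined readers, one way to quantify this format sensitivity is to score the same underlying age comparisons under both phrasings and look at the accuracy gap. The sketch below is purely illustrative: the function names, prompt wordings, and accuracy numbers are invented for the example and are not taken from the paper's actual evaluation code.

```python
# Illustrative sketch: the same fact ("sample A is older than sample B")
# can be posed in two formats, and a model's accuracy on each can be
# compared. All names and numbers here are hypothetical.

def pairwise_prompt(a: str, b: str) -> str:
    """Ask the model to pick the older of two samples (the format
    the models reportedly struggled with)."""
    return f"Which sample is from the older person, A or B?\nA: {a}\nB: {b}"

def categorical_prompt(a: str) -> str:
    """Ask the model to bin a single sample into an age group (the
    format the models reportedly handled better)."""
    return f"Is this person in their 60s or 70s?\nSample: {a}"

def format_sensitivity(acc_pairwise: float, acc_categorical: float) -> float:
    """A model with a stable internal notion of age should score
    similarly on both formats; a large gap suggests it is matching
    question patterns rather than understanding the biology."""
    return abs(acc_pairwise - acc_categorical)

# Hypothetical accuracies: near-random on pairwise, decent on categorical.
gap = format_sensitivity(0.52, 0.78)
print(f"format gap: {gap:.2f}")
```

If the gap stays large across many question pairs, that is evidence of the pattern-matching behavior described above rather than a one-off fluke.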

The "Compression" Problem

There was another weird glitch. When the AI models tried to guess exactly how many months a person had left to live, they almost always underestimated.

  • The Metaphor: It's like a weather forecaster who, no matter the data, always predicts it will rain tomorrow. Even if the sky is blue, they say "rain."
  • The Cause: The AI seems to be trained on so much text about "disease" and "death" that whenever it sees a medical record, it panics and assumes the worst, ignoring the fact that many people live long lives despite having health issues.
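This kind of systematic pessimism is easy to detect with a simple statistic: the average of (predicted minus observed) survival time, sometimes called the mean signed error. The sketch below uses made-up numbers purely to show the idea; it is not the paper's metric or data.

```python
# Minimal sketch of detecting a "compression" bias: if the mean signed
# error is negative across many patients, the model is consistently
# guessing too few months. All numbers below are invented.

def mean_signed_error(predicted, observed):
    """Average of (predicted - observed). A value well below zero
    means the model systematically underestimates survival."""
    diffs = [p - o for p, o in zip(predicted, observed)]
    return sum(diffs) / len(diffs)

predicted_months = [12, 18, 24, 30]   # hypothetical model guesses
observed_months  = [36, 40, 28, 50]   # hypothetical actual outcomes

bias = mean_signed_error(predicted_months, observed_months)
print(f"mean signed error: {bias:.1f} months")  # negative = pessimistic
```

A single wrong guess proves nothing, but a sign that stays negative over a whole cohort is exactly the "always predicts rain" pattern the paper describes.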

The Verdict: Are They Ready for Science?

Not quite yet.

The paper concludes that while these AI models are incredibly powerful tools for writing code, summarizing papers, or brainstorming ideas, they are not yet ready to be trusted as independent scientists in the field of aging.

  • They are great at: Reading medical records and giving a "best guess" on survival.
  • They are bad at: Understanding the deep, complex mechanics of how genes and proteins interact to cause aging, especially when the data is messy or the question is phrased differently.

The Takeaway

The scientists aren't saying "stop using AI." They are saying, "Use AI, but check its work."

Think of LongevityBench as a report card. It tells researchers, "Hey, this AI is good at math, but don't let it drive the car yet." The goal now is to use these test results to train the next generation of AI so they don't just memorize facts, but actually understand the biology of life and death.

In short: The robots are smart, but they still need a human teacher to make sure they aren't just making things up.
