The Big Idea: The Problem of the "Old Menu"

Imagine you walk into an upscale restaurant in 2026. You ask the waiter: "What can this kitchen do?" The waiter hands you a menu, but it is a menu from 2023. It lists dishes prepared with ingredients that no longer exist and with cooking techniques that have been replaced by faster, smarter methods.

If you read the menu, you might conclude: "This restaurant cannot prepare good food." But that is not true. The restaurant can prepare good food; they simply haven't updated the menu you are reading.

This paper argues that scientific research on AI does exactly that.

Researchers test AI models that are already "old" (from one or two years ago) and test them in a "simple" way (without their newest, smartest features). Then they write papers stating: "AI cannot do X." But because they tested neither the current AI nor its current settings, the conclusion is misleading. It is like judging a 2026 Ferrari by driving a 2023 economy car.

The Three Ways the "Menu" Is Outdated

The authors found that the gap between what AI can actually do right now and what the papers claim is huge. They divided this gap into three parts:

1. The Time Lag (The Problem of "Yesterday's News")

The Analogy: Imagine a tech reviewer testing a new smartphone. Instead of testing the model released today, they test a model released 18 months ago.
The Result: The median paper in this study tested an AI model that was approximately one major generation behind the best available AI at the time of the study. In the terms of the analogy, the papers mostly reviewed last year's phone while this year's was already in stores.

2. The Distribution Lag (The Problem of the "Budget Version")

The Analogy: Imagine a car manufacturer releases two cars: a "Pro" model with a turbo engine and a "Mini" model with a standard engine. A reviewer buys the "Mini" because it is cheaper, drives a few laps around the block, and writes a report stating: "This car brand is slow." They never drove the "Pro."
The Result: Even when researchers used the "right" AI family (like GPT or Claude), they often tested the cheaper, weaker version (like "Mini" or "Flash"), while a much stronger "Pro" or "Opus" version was already available.

3. The Configuration Lag (The Problem of the "Switched-Off Light")

The Analogy: Imagine you are testing a high-tech robot that can think, use tools, and solve puzzles. But you test it with the "thinking" switch turned off, the tool kit locked, and you ask it only a simple question without giving it any hints. Then you conclude: "This robot is useless."
The Result: This is the biggest surprise. Modern AI has a "reasoning mode" (like a deep thinking process) and can use tools (like web search or code editors).
- Only 3.2% of the abstracts (and 21.2% of the full texts) of papers testing these "thinking" models actually stated whether the reasoning mode was turned on or off.
- Most papers tested the AI in "Zero-Shot" mode (just a single question), instead of giving it time to think or tools to help.
- Result: They test the AI with its hands tied and then claim it cannot complete the task.

The "Generalization" Trap

The paper found that 52.5% of the abstracts (the short summaries at the beginning of papers) made a dangerous error.

What they did: They tested a specific, older, weaker AI.
What they wrote: They concluded that "AI" (as an entire category) cannot handle the task.
The Analogy: It is like testing a specific, defective bicycle and writing a headline: "Bicycles are dangerous." The headline ignores the fact that they only tested a single defective bicycle, not all bicycles.

Because these headlines are cited by doctors, lawyers, and policymakers, readers come away believing that AI is less capable than it currently is.

Why Does This Happen? (It Is Not Malice)

The authors carefully emphasize: The researchers are not lying. They are doing their best with the tools they have.

Money: Running the newest, smartest AI models is incredibly expensive. Academic researchers often cannot afford the "Pro" versions, so they use the free or cheaper versions.
Time: It can take months or even years to publish a paper. By the time a paper appears, the world of AI has moved on.
Habit: The rules for writing these papers were written before AI had "reasoning modes" or "toolkits." Researchers follow old rules that do not fit the new technology.

The Solution: A New "Label" System

The paper proposes a simple correction called versio-ai. It is like a new nutrition label for AI papers. Before a paper is published, the authors must clearly state:

Exactly which model they used (e.g., "GPT-5.5 Pro," not just "GPT").
When they tested it.
How they tested it (Did they turn the "reasoning" mode on? Did they give it tools?).

If these three points are missing, the paper should be rejected. This does not make AI smarter, but it prevents us from reading the "old menu" and thinking the restaurant has stopped cooking.

Summary

The scientific literature currently shows us only a shadow of what AI can do, not the reality. It is a shadow cast by older, weaker models that were tested in simple ways. The gap between this shadow and real AI grows larger every year. The paper argues that the world will continue to underestimate the capabilities of AI unless researchers become more specific about exactly what they tested.

Technical Summary: Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

1. Problem Statement

The applied literature evaluating Large Language Models (LLMs) in domains such as medicine, law, programming, education, and scientific reasoning systematically misrepresents current AI capabilities. The examination identifies a structural discrepancy between the systems tested in scientific papers and the contemporary "frontier" (peak) of AI capabilities.

This discrepancy, termed the publication elicitation gap, arises from three compounding factors:

Temporal Lag: Papers evaluate models released months or years prior to the publication date, thereby missing subsequent generations.
Tier Lag: Papers frequently test weaker tiers of a model family (e.g., "mini" or "Flash" versions), while stronger sibling models (e.g., "Pro" or "Opus") are already publicly available.
Configuration Underspecification: Method sections often omit critical elicitation details (reasoning mode, tool access, scaffolding, sampling parameters), leading to a "naive" evaluation that fails to capture the model's full potential.

Consequently, abstracts and subsequent citations generalize specific, underspecified results to the class of "AI," creating a misleading narrative for clinicians, policymakers, and downstream consumers regarding what AI can currently achieve.

2. Methodology

The study is a preregistered bibliometric examination conducted on a corpus of academic literature from January 1, 2022, to April 1, 2026.

Corpus Construction

Source: OpenAlex Snapshot (March 2026).
Scope: 112,303 records matched via keywords ("LLM", "GPT", "Claude", etc.) across five domains: medicine, law, programming, education, and scientific reasoning.
Inclusion Criteria: 18,574 papers met eligibility criteria (empirical evaluation of a named LLM on an applied task, quantitative results, peer-reviewed or frontier preprint).
Capture Validation: A stratified random sample from a remaining pool estimated the capture rate at approximately 80%, with no significant bias in primary outcomes (gap size, valence, framing).

Measurement Framework

The examination assesses papers across three dimensions:

Capability Dimension: Measured via the Epoch AI Capabilities Index (eci). The primary outcome is the eci_gap, defined as the difference between the contemporary frontier (the highest eci model available on the evaluation date) and the model tested in the paper.
- Imputation: If the evaluation date is not disclosed, it is imputed as max(publication date - 180 days, model release date).
- Sensitivity: Results are validated against independent scales: Chatbot Arena Elo and the Artificial Analysis Intelligence Index.
Elicitation Dimension: Assesses the disclosure of configuration details (reasoning mode, thinking effort, tool usage, scaffolding, multi-agent architecture, prompting strategy).
Interpretation Dimension: Measures whether conclusions are generalized from the specific tested model to the class of "AI" (ai_generic framing) and whether human/professional comparison groups are present.
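
The capability measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the `Model` record, the function names, and every eci score in the usage example are hypothetical. Only the imputation rule and the definition of eci_gap are taken from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    release_date: date
    eci: float  # capability score; the values used below are invented


def impute_evaluation_date(publication: date, model_release: date,
                           disclosed: Optional[date] = None) -> date:
    """Use the disclosed date if given, else max(publication - 180 days, release)."""
    if disclosed is not None:
        return disclosed
    return max(publication - timedelta(days=180), model_release)


def eci_gap(tested: Model, catalog: list[Model], evaluation_date: date) -> float:
    """Highest eci available on the evaluation date minus the tested model's eci."""
    available = [m for m in catalog if m.release_date <= evaluation_date]
    frontier = max((m.eci for m in available), default=tested.eci)
    return frontier - tested.eci


# Hypothetical catalog: three invented models with invented scores.
catalog = [
    Model("model-a", date(2024, 1, 10), 120.0),
    Model("model-b", date(2024, 9, 1), 131.0),
    Model("model-c", date(2025, 6, 1), 140.0),
]
tested = catalog[0]
eval_date = impute_evaluation_date(date(2025, 3, 1), tested.release_date)  # 2024-09-02
gap = eci_gap(tested, catalog, eval_date)  # 11.0 (model-b is the frontier on that date)
```

Note that model-c does not count toward the frontier here because it was released after the imputed evaluation date. The gap is measured against what was available when the paper was run, not against what exists today.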

Extraction and Validation

Pipeline: Automated extraction using a frontier LLM (V4F-Max) for eligibility classification and field extraction, validated against a double human gold standard (n=300) and cross-family triads (GPT-5, Claude Opus, Gemini).
Validation: Cohen's $\kappa$ values exceeded preregistered thresholds (e.g., 0.896 for the primary model, 0.767 for conclusion valence).
Hypothesis Testing: Preregistered confirmatory tests (H1, H3, H6) use the Holm step-down correction ($\alpha = 0.05$) against null hypotheses of structural zeros. Descriptive magnitudes (H2, H4, H5) use simultaneous 95% confidence intervals.
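
For readers unfamiliar with the Holm procedure, the following sketch shows how it controls the family-wise error rate across the three confirmatory tests. It is a generic textbook implementation, not code from the paper, and the p-values in the usage example are invented for illustration; the function name `holm_reject` is likewise an assumption.

```python
def holm_reject(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Holm step-down: compare the k-th smallest p-value with alpha / (m - k)."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    decisions = {name: False for name in p_values}
    for k, (name, p) in enumerate(ordered):  # k = 0 for the smallest p-value
        if p > alpha / (m - k):
            break  # first failure: this and all larger p-values are retained
        decisions[name] = True
    return decisions


# Invented p-values, for illustration only.
result = holm_reject({"H1": 0.001, "H3": 0.02, "H6": 0.40})
# result == {"H1": True, "H3": True, "H6": False}
```

With three tests, the smallest p-value is compared with 0.05/3, the next with 0.05/2, and the largest with 0.05. Testing stops at the first p-value that exceeds its threshold.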

3. Key Contributions

Quantification of the Publication Elicitation Gap: The examination provides the first domain-spanning, preregistered measurement of the distance between academic evaluations and the frontier, broken down into temporal, tier, and configuration components.
Definition of "Combined Failure": It operationalizes a metric for papers that fail simultaneously on capabilities (lagging behind the frontier), elicitation (missing configuration details), and interpretation (overly generalizing claims).
versio-ai v1.2 Checklist: A 13-point reporting checklist aimed at extending existing frameworks (CONSORT-AI, TRIPOD-LLM, etc.) by mandating disclosure of the "elicitation surface" (model snapshot, evaluation date, reasoning mode, tool access, etc.).
frontierlag Tool: A live Python package and web tool enabling users to input a DOI and receive an audit report detailing the paper's distance to the frontier and its disclosure status.
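
The idea of a "combined failure" can be illustrated with a small sketch. Everything here is an assumption made for illustration: the field names, the choice of which disclosures count, and the `gap_threshold` parameter are hypothetical, and the paper's actual conservative and inclusive operationalizations are not reproduced in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperAudit:
    eci_gap: float                   # frontier eci minus tested-model eci
    discloses_reasoning_mode: bool   # elicitation detail reported?
    discloses_evaluation_date: bool  # evaluation date reported?
    ai_generic_framing: bool         # conclusion framed at the level of "AI"?


def combined_failure(paper: PaperAudit, gap_threshold: float = 0.0) -> bool:
    """True only if the paper fails on all three dimensions at once."""
    lags = paper.eci_gap > gap_threshold                      # capability
    underspecified = not (paper.discloses_reasoning_mode
                          and paper.discloses_evaluation_date)  # elicitation
    overgeneralizes = paper.ai_generic_framing                # interpretation
    return lags and underspecified and overgeneralizes


# Hypothetical paper: lags the frontier, omits details, and generalizes to "AI".
example = PaperAudit(eci_gap=8.0, discloses_reasoning_mode=False,
                     discloses_evaluation_date=False, ai_generic_framing=True)
flag = combined_failure(example)  # True
```

Because the three conditions are joined with a logical AND, the resulting rate is sensitive to how strictly each one is defined. Raising `gap_threshold` or requiring fewer disclosures would lower or raise the share of flagged papers accordingly.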

4. Key Findings

Significant and Widening Lag (H1, H2):
- The median paper evaluates a model +10.85 eci behind the contemporary frontier. This gap corresponds to approximately 1.4 times the distance between Claude Sonnet 3.7 and Opus 4.5 (a major tier jump).
- The gap widens at a rate of +5.53 eci/year, indicating that literature falls behind the frontier faster than publication cycles can renew the corpus.
Tier Lag (H3):
- For papers where a stronger sibling model was publicly available within 90 days, the median tier lag is +12.63 eci.
Configuration Underspecification (H4):
- Only 3.2% of abstracts and 21.2% of full texts disclose the reasoning mode for reasoning-capable models.
- The evaluation date is disclosed in only 18.4% of full-text papers.
Class-Level Generalization (Descriptive):
- 52.5% of abstracts frame conclusions at the level of "AI" rather than the specific tested model.
- This tendency is increasing: the odds of class-level framing rise by a factor of 1.23 per year (OR = 1.23).
Combined Failure Rate (H5):
- Under a conservative operationalization, 9.2% of eligible papers fail all three examination dimensions simultaneously.
- Under an inclusive sensitivity analysis, this rate rises to 38.3%.
Valence Asymmetry (H6):
- No significant correlation was found between the magnitude of the lag and the valence (positive/negative) of the paper's conclusion.
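
An odds ratio per year is easy to misread as a change in percentage points, so a short worked example may help. The sketch below converts a starting share into odds, multiplies by the odds ratio once per year, and converts back. The function name and the 50% starting share are assumptions chosen for illustration; only the value OR = 1.23 comes from the findings above.

```python
def share_after(p0: float, odds_ratio: float, years: float) -> float:
    """Share after `years`, if the odds multiply by `odds_ratio` each year."""
    odds = p0 / (1.0 - p0) * odds_ratio ** years
    return odds / (1.0 + odds)


# Hypothetical baseline of 50%: odds of 1.0 become 1.23 after one year.
one_year = share_after(0.50, 1.23, 1)  # 1.23 / 2.23, roughly 0.552
```

From a 50% baseline, an odds ratio of 1.23 therefore corresponds to a rise of about five percentage points in one year, not 23.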

5. Implications and Claims

The work argues that the academic record as a whole increasingly fails to inform readers about which AI it is discussing.

Structural, Not Individual: The examination explicitly states it does not accuse individual authors of malicious intent. The pattern is a predictable equilibrium of peer-review cycles, cost-constrained API access, and reporting standards inherited from a pre-reasoning-model era.
Misrepresentation vs. Truth: The examination measures "distance to the frontier," not "distance to truth." It does not claim that repeating these experiments with frontier models would necessarily reverse the results, but rather that the published claims are decoupled from the current state of the art.
Downstream Impacts: The findings suggest that policy briefs, clinical procurement decisions, and safety research based on these papers operate with outdated and underspecified data.
Remediation: The work proposes shared responsibility among authors, editors, and funders:
- Authors: Apply the versio-ai checklist to disclose the configuration surface.
- Editors/Reviewers: Enforce disclosure of model snapshots, evaluation dates, and reasoning modes.
- Funders: Tie grants to disclosure and provide budgets for API access so academic groups can evaluate configurations near the frontier rather than relying exclusively on cheaper, outdated alternatives.

The work concludes that while no single paper "answers its own question wrong," the collective literature paints a distorted picture of AI capabilities that requires structural intervention to correct.
