Imagine the world of academic research as a massive, bustling library where millions of scholars write papers every year. For decades, the "voice" of these papers was consistent: a specific style of English, with certain words appearing frequently and others rarely.
Then, a new group of invisible scribes arrived: Large Language Models (LLMs) like ChatGPT, Claude, and Gemini. These AI tools are incredibly smart, but they have their own unique "accents" and habits.
This paper, titled "Beyond Via," is like a linguistic detective story. The authors went into the library (specifically, the arXiv preprint server) to answer two big questions:
- How are these AI scribes changing the way scholars write?
- Can we tell which specific AI wrote a sentence, or are they all starting to sound the same?
Here is the breakdown of their findings, explained with some everyday analogies.
1. The "Accent" of the AI Scribes
Just as a person from New York might say "bodega" and someone from London might say "corner shop," different AI models have developed distinct linguistic habits.
- The "Via" and "Beyond" Craze: The authors noticed that newer AI models love using the words "via" and "beyond" in paper titles. It's as if every new chef suddenly started garnishing every dish with the same fancy herb that no one used before.
- The Analogy: Imagine a fashion trend where suddenly, every single person in town starts wearing a specific type of hat. The researchers saw this happening with words in academic titles.
- The Disappearing "The" and "Of": Conversely, the most common words in English, like "the" and "of," are becoming less frequent in abstracts. The AI seems to be trying to sound more "efficient" or "dense," skipping the little connecting words humans use naturally.
- The Analogy: It's like a text message where you drop all the vowels and just send "Wnt2g2st." The AI is doing something similar with academic writing, stripping away the "glue" words.
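The kind of word-frequency tracking behind these observations can be sketched in a few lines. The titles and years below are invented for illustration, not data from the paper:

```python
from collections import Counter

# Toy corpora of paper titles by year (invented examples, not real data)
titles_by_year = {
    2019: ["A Study of Graph Networks",
           "The Analysis of Protein Folding"],
    2025: ["Robust Learning via Contrastive Objectives",
           "Beyond Attention: Scaling via Sparse Mixtures"],
}

def word_rate(titles, word):
    """Fraction of all title words that equal `word` (case-insensitive)."""
    counts = Counter(w.lower() for t in titles for w in t.split())
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

for year, titles in titles_by_year.items():
    print(year, round(word_rate(titles, "via"), 3))
```

Run over millions of real titles instead of four toy ones, the same count-and-divide logic is what surfaces a word like "via" suddenly trending.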
2. The Chameleon Effect (Models Changing Over Time)
The paper highlights that AI isn't static; it evolves. The "voice" of an AI model from 2023 is different from the "voice" of a model in 2025.
- The "Delve" vs. "Together" Shift: In the early days of ChatGPT, the word "delve" (as in "delve into the data") was a huge red flag for AI writing. But newer models have stopped using it so much. Meanwhile, the word "together" was once rare in AI text but has recently spiked in popularity.
- The Analogy: Think of it like pop music. In 2020, everyone was singing a specific type of ballad. By 2024, the trend shifted to upbeat pop. If you hear a song, you can guess the year it was made based on the style. The authors found that AI writing styles shift just as fast as pop music trends.
3. The "Whodunit" Problem (Can We Detect the AI?)
The researchers tried to build a "detector" to figure out which specific AI wrote a text. They set up a game: "Was this paragraph written by GPT-4, DeepSeek, or a human?"
- The Result: The detectors were good at spotting whether AI was used, but terrible at guessing which one.
- The Analogy: Imagine a police lineup where the suspects are all wearing identical grey jumpsuits. You can easily tell they aren't the regular townspeople (humans), but you can't tell one suspect from another. The different AI models are becoming so similar that they are blurring together.
- The Homogenization: The paper suggests that as AI gets better, it's becoming more "human-like," but in a way that makes all the AIs sound like the same generic "super-human." This makes it harder to distinguish between them.
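One way to see why per-model detection fails is to compare word-frequency "fingerprints" directly. In this sketch, the numbers are invented for illustration (not measurements from the paper), but they show the pattern the authors describe: two AI fingerprints sit nearly on top of each other, while both sit apart from the human one.

```python
import math

# Hypothetical relative frequencies of a few marker words
# (all numbers invented for illustration, not taken from the paper)
fingerprints = {
    "human":   {"the": 0.060, "of": 0.035, "via": 0.001, "beyond": 0.001},
    "model_a": {"the": 0.045, "of": 0.025, "via": 0.006, "beyond": 0.004},
    "model_b": {"the": 0.046, "of": 0.026, "via": 0.005, "beyond": 0.005},
}

def cosine(u, v):
    """Cosine similarity between two word-frequency dicts (1.0 = identical direction)."""
    keys = u.keys() & v.keys()
    dot = sum(u[k] * v[k] for k in keys)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# The two AI fingerprints are nearly parallel, so a classifier that easily
# separates them from human text still struggles to tell them apart.
print("model_a vs model_b:", cosine(fingerprints["model_a"], fingerprints["model_b"]))
print("model_a vs human:  ", cosine(fingerprints["model_a"], fingerprints["human"]))
```

The gap between "AI vs AI" similarity and "AI vs human" similarity is exactly the signal the detectors exploit, and it is the first of those gaps that is shrinking.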
4. The "Crystal Ball" Method (Estimating Impact)
Since the "detective" approach (classifiers) is getting confused, the authors used a simpler, more transparent method: Word Counting.
They treated the academic library like a garden.
- The Baseline: They looked at how often certain weeds (words like "the" or "furthermore") grew in the garden before the AI scribes arrived (2015–2021). They drew a straight line predicting how many weeds should be there in 2025 if humans were still writing alone.
- The Deviation: Then, they looked at the actual garden in 2025. They saw that the "weeds" (specific words) were growing way faster or slower than the line predicted.
- The Conclusion: By measuring this "overgrowth" or "undergrowth," they could estimate how much of the garden was being tended to by AI. They found that by 2025, a significant portion of academic writing has been touched by AI, and this number is growing fast.
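The baseline-and-deviation idea above can be sketched as a simple least-squares line fit. All numbers here are invented for illustration (the paper reports its own frequencies); the structure of the calculation is the point:

```python
# Fit a straight line to pre-LLM word frequencies, extrapolate it to 2025,
# and measure the gap between prediction and observation.
# All frequencies below are invented for illustration.
years = [2015, 2016, 2017, 2018, 2019, 2020, 2021]
freq  = [0.0010, 0.0011, 0.0010, 0.0012, 0.0011, 0.0012, 0.0013]  # e.g. "via"

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    return slope, my - slope * mx

slope, intercept = fit_line(years, freq)
expected_2025 = slope * 2025 + intercept   # what the human-only trend predicts
observed_2025 = 0.0030                     # invented "actual" 2025 frequency

# The excess over the baseline is what gets attributed to LLM influence.
print(f"expected {expected_2025:.4f}, observed {observed_2025:.4f}, "
      f"excess {observed_2025 - expected_2025:+.4f}")
```

The appeal of this method is its transparency: anyone can re-fit the line and re-measure the gap, with no black-box classifier in the loop.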
The Big Takeaway
The paper concludes that AI is reshaping the landscape of academic writing, not just by writing the papers, but by subtly changing the vocabulary and style of the entire field.
- The Warning: If we rely only on complex "black box" detectors, we might miss the nuance because the AI models are becoming too similar to each other.
- The Insight: Simple tools—like watching which words are becoming trendy or disappearing—are actually very powerful for understanding how technology is changing human communication.
In short: The authors are telling us that the "sound" of academic research is changing. It's becoming slightly more efficient, slightly more "AI-accented," and the models are all starting to sound like the same person. We need to keep our eyes on the little details (like the word "via") to understand the big picture.