Founder effects shape the evolutionary dynamics of multimodality in open LLM families

This paper analyzes over 1.8 million models and finds that multimodality in open LLM families emerges through rare "founder events" rather than gradual cross-modal transfer. Once a founder appears, the capability expands rapidly within its descendant lineages, producing a punctuated adoption pattern dominated by image-text tasks.

Manuel Cebrian

Published 2026-03-25

Imagine the world of Open Source AI models (like Llama, Gemma, or Mistral) as a massive, bustling family tree where thousands of "children" models are constantly being born from "parent" models.

For a long time, these families were like text-only book clubs. They were great at writing stories, answering questions, and chatting, but they were blind. They couldn't see images.

Recently, the tech world started creating multimodal models—AI that can both read text and see pictures (like describing a photo or solving a math problem from a picture). The big question this paper asks is: How did these "sighted" models enter the text-only families?

Did the text-only models slowly learn to see over time, like a person learning to paint? Or did a few special "founders" suddenly appear with the ability to see, and then their children inherited that power?

Here is the simple breakdown of what the research found:

1. The "Blind" Family vs. The "Sighted" Ecosystem

Think of the entire AI world (Hugging Face) as a giant city. In this city, there are many small, independent workshops making "sighted" robots (multimodal models) long before the big, famous families (like Llama or Gemma) started making them.

  • The Finding: The city had plenty of sighted robots years ago. But inside the famous families, the members remained "blind" until very recently (late 2024/2025).
  • The Analogy: It's like how electric cars were being built by small startups for years, but the big legacy car companies (Ford, GM) didn't start making electric versions of their popular sedans until much later. The technology existed, but the big families didn't adopt it immediately.

2. The "Magic Leap" (Founder Effects)

The researchers looked at the "birth certificates" of these models to see who their parents were. They wanted to know: Did a text-only parent suddenly have a child that could see?

  • The Finding: Almost never. It is incredibly rare (less than 1 in 500 times) for a text-only parent to "mutate" into a sighted child just by being tweaked or fine-tuned.
  • The Analogy: Imagine a family of blind bats. If you take a blind bat and train it really hard, it doesn't suddenly grow eyes. It stays a blind bat.
    • Instead, a new species of "sighted bat" (a Vision-Language Model or VLM) had to be introduced from the outside.
    • Once this one "sighted founder" appeared, it had many, many children. Those children were also sighted, and they had sighted children, and so on.
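The measurement behind the "1 in 500" claim can be sketched as a simple count over parent-child edges in the model family tree. This is a minimal toy illustration with hypothetical model names and hand-assigned modality labels, not the paper's actual pipeline:

```python
# Toy sketch (hypothetical data): how often does a text-only parent
# produce a multimodal child via fine-tuning?

# Each model maps to a modality label; edges are (parent, child) links.
modality = {
    "base-7b": "text", "base-7b-chat": "text", "base-7b-code": "text",
    "vlm-founder": "multimodal", "vlm-founder-ft": "multimodal",
}
edges = [
    ("base-7b", "base-7b-chat"),
    ("base-7b", "base-7b-code"),
    ("vlm-founder", "vlm-founder-ft"),
]

# Restrict to edges whose parent is text-only, then count "mutations":
# cases where the child is nonetheless multimodal.
text_parent_edges = [(p, c) for p, c in edges if modality[p] == "text"]
mutations = [(p, c) for p, c in text_parent_edges
             if modality[c] == "multimodal"]

rate = len(mutations) / len(text_parent_edges)
print(f"text->multimodal transition rate: {rate:.3f}")
```

In this toy graph the rate is zero, echoing the paper's finding that such transitions are vanishingly rare at scale.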

3. The "Founder Effect" in Action

The paper uses a biological concept called the Founder Effect. This happens when a new population is started by a very small number of individuals.

  • How it works here:
    1. A new "sighted" model appears (the Founder). It might be a new model released by a company like Google or Meta, or a clever combination of parts.
    2. This model has no recorded parent in the text-only family tree. It's a "root" node.
    3. Suddenly, hundreds of new models are created by tweaking this specific sighted model.
    4. Because the parent could see, all the children can see.
  • The Result: The ability to "see" didn't spread slowly from text-model to text-model. It arrived in bursts. One day, a few "sighted founders" showed up, and the next year, their descendants took over the family tree.
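The founder-detection logic described above can also be sketched in a few lines: a founder is a multimodal model whose parent is either absent (a root node) or text-only, and its influence is the size of its descendant subtree. Again, the model names and lineage below are hypothetical, purely to illustrate the idea:

```python
from collections import defaultdict, deque

# Toy lineage (hypothetical names). child -> parent; roots are absent.
modality = {
    "text-base": "text", "text-chat": "text",
    "vlm-a": "multimodal", "vlm-a-ft1": "multimodal",
    "vlm-a-ft2": "multimodal", "vlm-a-ft1-dpo": "multimodal",
}
parent = {
    "text-chat": "text-base",
    "vlm-a-ft1": "vlm-a", "vlm-a-ft2": "vlm-a",
    "vlm-a-ft1-dpo": "vlm-a-ft1",
}

# Invert the parent map so we can walk down the tree.
children = defaultdict(list)
for c, p in parent.items():
    children[p].append(c)

def is_founder(m):
    """Multimodal model with no multimodal parent: a root or a rare mutation."""
    if modality[m] != "multimodal":
        return False
    p = parent.get(m)
    return p is None or modality[p] == "text"

def descendants(m):
    """All models reachable downward from m (breadth-first)."""
    seen, queue = set(), deque(children[m])
    while queue:
        x = queue.popleft()
        if x not in seen:
            seen.add(x)
            queue.extend(children[x])
    return seen

founders = [m for m in modality if is_founder(m)]
for f in founders:
    print(f, "->", len(descendants(f)), "sighted descendants")
```

Here `vlm-a` is the lone founder, and every multimodal model in the family descends from it: the burst pattern in miniature.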

4. Why Didn't Text Models Just "Learn" to See?

You might wonder, "Why couldn't a text model just add a camera to its brain?"

  • The Reality: It's not just a software update; it's a structural overhaul.
  • The Analogy: Think of a text-only model as a radio. It's amazing at processing sound. To make it a television, you can't just turn a dial. You have to completely rebuild the chassis, add a screen, install a camera, and rewrite the wiring.
  • Because this is so hard, the "radio" families didn't just slowly turn into TVs. Instead, someone built a TV from scratch (the Founder), and then everyone started copying the TV design.

Summary: The "Punctuated" Evolution

The paper concludes that the evolution of AI isn't a smooth, slow line. It's punctuated.

  • Smooth: Text models getting slightly smarter at writing.
  • Punctuated: A sudden, rare event where a "sighted" model is born. Once born, it explodes in popularity within its own family line.

The Takeaway:
If you want to see the future of AI vision, don't look at the text-only models trying to adapt. Look for the new "Founders"—the rare, new models that appear out of nowhere with the ability to see. Once they appear, they will quickly dominate their family trees, while the old text-only families will remain blind until the next "Founder" arrives.