Founder effects shape the evolutionary dynamics of multimodality in open LLM families

This paper analyzes over 1.8 million models and finds that multimodality in open LLM families emerges through rare "founder events" rather than gradual cross-modal transfer. Once a founder appears, the capability expands rapidly within its descendant lineages, producing a punctuated adoption pattern dominated by image-text tasks.

Manuel Cebrian

Published 2026-03-25

Imagine the world of Open Source AI models (like Llama, Gemma, or Mistral) as a massive, bustling family tree where thousands of "children" models are constantly being born from "parent" models.

For a long time, these families were like text-only book clubs. They were great at writing stories, answering questions, and chatting, but they were blind. They couldn't see images.

Recently, the tech world started creating multimodal models—AI that can both read text and see pictures (like describing a photo or solving a math problem from a picture). The big question this paper asks is: How did these "sighted" models enter the text-only families?

Did the text-only models slowly learn to see over time, like a person learning to paint? Or did a few special "founders" suddenly appear with the ability to see, and then their children inherited that power?

Here is the simple breakdown of what the research found:

1. The "Blind" Family vs. The "Sighted" Ecosystem

Think of the entire AI world (Hugging Face) as a giant city. In this city, there are many small, independent workshops making "sighted" robots (multimodal models) long before the big, famous families (like Llama or Gemma) started making them.

  • The Finding: The city had plenty of sighted robots years ago. But inside the famous families, the members remained "blind" until very recently (late 2024/2025).
  • The Analogy: It's like how electric cars were being built by small startups for years, but the big legacy car companies (Ford, GM) didn't start making electric versions of their popular sedans until much later. The technology existed, but the big families didn't adopt it immediately.

2. The "Magic Leap" (Founder Effects)

The researchers looked at the "birth certificates" of these models to see who their parents were. They wanted to know: Did a text-only parent suddenly have a child that could see?

  • The Finding: Almost never. It is incredibly rare (less than 1 in 500 times) for a text-only parent to "mutate" into a sighted child just by being tweaked or fine-tuned.
  • The Analogy: Imagine a family of blind bats. If you take a blind bat and train it really hard, it doesn't suddenly grow eyes. It stays a blind bat.
    • Instead, a new species of "sighted bat" (a Vision-Language Model or VLM) had to be introduced from the outside.
    • Once this one "sighted founder" appeared, it had many, many children. Those children were also sighted, and they had sighted children, and so on.
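The measurement behind the "1 in 500" claim can be sketched as a simple count over parent-child edges in the model family tree. This is a minimal toy illustration with hypothetical model names and hand-assigned modality labels, not the paper's actual pipeline:

```python
# Toy sketch (hypothetical data): how often does a text-only parent
# produce a multimodal child via fine-tuning?

# Each model maps to a modality label; edges are (parent, child) links.
modality = {
    "base-7b": "text", "base-7b-chat": "text", "base-7b-code": "text",
    "vlm-founder": "multimodal", "vlm-founder-ft": "multimodal",
}
edges = [
    ("base-7b", "base-7b-chat"),
    ("base-7b", "base-7b-code"),
    ("vlm-founder", "vlm-founder-ft"),
]

# Restrict to edges whose parent is text-only, then count "mutations":
# cases where the child is nonetheless multimodal.
text_parent_edges = [(p, c) for p, c in edges if modality[p] == "text"]
mutations = [(p, c) for p, c in text_parent_edges
             if modality[c] == "multimodal"]

rate = len(mutations) / len(text_parent_edges)
print(f"text->multimodal transition rate: {rate:.3f}")
```

In this toy graph the rate is zero, echoing the paper's finding that such transitions are vanishingly rare at scale.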

3. The "Founder Effect" in Action

The paper uses a biological concept called the Founder Effect. This happens when a new population is started by a very small number of individuals.

  • How it works here:
    1. A new "sighted" model appears (the Founder). It might be a new model released by a company like Google or Meta, or a clever combination of parts.
    2. This model has no recorded parent in the text-only family tree. It's a "root" node.
    3. Suddenly, hundreds of new models are created by tweaking this specific sighted model.
    4. Because the parent could see, all the children can see.
  • The Result: The ability to "see" didn't spread slowly from text-model to text-model. It arrived in bursts. One day, a few "sighted founders" showed up, and the next year, their descendants took over the family tree.
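The founder-detection logic described above can also be sketched in a few lines: a founder is a multimodal model whose parent is either absent (a root node) or text-only, and its influence is the size of its descendant subtree. Again, the model names and lineage below are hypothetical, purely to illustrate the idea:

```python
from collections import defaultdict, deque

# Toy lineage (hypothetical names). child -> parent; roots are absent.
modality = {
    "text-base": "text", "text-chat": "text",
    "vlm-a": "multimodal", "vlm-a-ft1": "multimodal",
    "vlm-a-ft2": "multimodal", "vlm-a-ft1-dpo": "multimodal",
}
parent = {
    "text-chat": "text-base",
    "vlm-a-ft1": "vlm-a", "vlm-a-ft2": "vlm-a",
    "vlm-a-ft1-dpo": "vlm-a-ft1",
}

# Invert the parent map so we can walk down the tree.
children = defaultdict(list)
for c, p in parent.items():
    children[p].append(c)

def is_founder(m):
    """Multimodal model with no multimodal parent: a root or a rare mutation."""
    if modality[m] != "multimodal":
        return False
    p = parent.get(m)
    return p is None or modality[p] == "text"

def descendants(m):
    """All models reachable downward from m (breadth-first)."""
    seen, queue = set(), deque(children[m])
    while queue:
        x = queue.popleft()
        if x not in seen:
            seen.add(x)
            queue.extend(children[x])
    return seen

founders = [m for m in modality if is_founder(m)]
for f in founders:
    print(f, "->", len(descendants(f)), "sighted descendants")
```

Here `vlm-a` is the lone founder, and every multimodal model in the family descends from it: the burst pattern in miniature.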

4. Why Didn't Text Models Just "Learn" to See?

You might wonder, "Why couldn't a text model just add a camera to its brain?"

  • The Reality: It's not just a software update; it's a structural overhaul.
  • The Analogy: Think of a text-only model as a radio. It's amazing at processing sound. To make it a television, you can't just turn a dial. You have to completely rebuild the chassis, add a screen, install a camera, and rewrite the wiring.
  • Because this is so hard, the "radio" families didn't just slowly turn into TVs. Instead, someone built a TV from scratch (the Founder), and then everyone started copying the TV design.

Summary: The "Punctuated" Evolution

The paper concludes that the evolution of AI isn't a smooth, slow line. It's punctuated.

  • Smooth: Text models getting slightly smarter at writing.
  • Punctuated: A sudden, rare event where a "sighted" model is born. Once born, it explodes in popularity within its own family line.

The Takeaway:
If you want to see the future of AI vision, don't look at the text-only models trying to adapt. Look for the new "Founders"—the rare, new models that appear out of nowhere with the ability to see. Once they appear, they will quickly dominate their family trees, while the old text-only families will remain blind until the next "Founder" arrives.