Imagine you are trying to teach a robot to recognize animals in the wild. You show it millions of photos of birds, bugs, and flowers. But here's the problem: for almost all of these photos, you only have a name tag, like "Calliope Hummingbird." You don't have a description of what the bird actually looks like in that specific picture.
This is the challenge the paper BIOCAP tackles. It asks: How do we teach a robot to see the details, not just memorize the names?
Here is the story of how they solved it, using simple analogies.
1. The Problem: The "Name-Only" Teacher
Think of a standard AI model (like the one behind Google Images) as a student who is only allowed to study flashcards.
- Front of card: A photo of a bird.
- Back of card: The word "Hummingbird."
The student gets really good at matching the photo to the word. But if you show them a photo of a bird they haven't seen before, or if the bird is hiding in the bushes, the student gets confused. They haven't learned what a hummingbird actually is (e.g., "it has a long beak and green feathers"); they've just memorized the pattern of the word "Hummingbird."
In the world of biology, we have millions of photos, but almost no detailed descriptions. Experts (biologists) are too busy to write a paragraph for every single photo of a beetle or a mushroom.
2. The Trap: The "Hallucinating" Robot
The researchers tried a clever shortcut: they asked a super-smart AI (a Multimodal Large Language Model, or MLLM) to look at the photos and write the descriptions for them.
But this AI had a bad habit: it hallucinated.
- The Scenario: You show the AI a photo of a female hummingbird.
- The AI's Mistake: Because it knows "male hummingbirds have red throats" from its training data, it confidently writes: "This bird has a red throat."
- The Reality: The bird in the photo is a female with a plain throat.
If you teach your robot student with these fake descriptions, the student learns the wrong things. It starts associating "red throat" with "female hummingbird," which is a disaster.
3. The Solution: The "Expert Librarian" and the "Style Guide"
The BIOCAP team realized they couldn't just let the AI write freely. They needed to give it context, like a strict editor. They built a pipeline with two special tools:
A. The Wikipedia Librarian (Fact-Checking)
Before the AI writes a description, the system pulls up the Wikipedia page for that specific species.
- Analogy: Imagine the AI is a student taking a test. Before it answers, a librarian hands it a cheat sheet that says: "Calliope Hummingbirds have green backs and white bellies. Males have red streaks; females do not."
- The Result: The AI can no longer make up facts. It has to stick to the "truth" provided by the encyclopedia.
B. The Style Guide (Formatting)
The researchers also gave the AI examples of how a biologist should write.
- Analogy: It's like giving the student a template: "Start by describing the color, then the shape, then the tail. Don't talk about the weather or what the bird is eating."
- The Result: The AI stops writing vague sentences like "A bird is sitting there" and starts writing precise ones like "A small bird with a glossy green back and a white throat is perched on a branch."
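The two tools above, the fact sheet and the style guide, ultimately come together in a single prompt handed to the captioning model. The fact sheet contents, exemplar captions, helper name, and prompt wording below are all illustrative assumptions, not the paper's actual prompt; this is just a minimal sketch of the idea.

```python
# Sketch of assembling a grounded, style-guided captioning prompt.
# FACT_SHEET and STYLE_EXEMPLARS are toy stand-ins for the retrieved
# Wikipedia text and the biologist-written example captions.

FACT_SHEET = {
    "Calliope Hummingbird": (
        "Green back and white belly. Males have magenta throat streaks; "
        "females have a plain, unmarked throat."
    )
}

STYLE_EXEMPLARS = [
    "A small bird with a glossy green back and a white throat is "
    "perched on a bare branch.",
]

def build_caption_prompt(species: str) -> str:
    """Combine encyclopedia facts (the 'librarian') with example
    captions (the 'style guide') into one captioning prompt."""
    facts = FACT_SHEET[species]
    examples = "\n".join(f"- {ex}" for ex in STYLE_EXEMPLARS)
    return (
        f"You are describing a photo of a {species}.\n"
        f"Known facts about this species:\n{facts}\n"
        f"Write one caption in the style of these examples:\n{examples}\n"
        "Describe only what is visible in the image; do not assume "
        "traits this individual may not show."
    )

prompt = build_caption_prompt("Calliope Hummingbird")
print(prompt)
```

The key design choice is that the facts travel inside the prompt, so the model is constrained to the cheat sheet rather than its own (possibly wrong) memory.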
4. The Result: BIOCAP
By combining these two tools, they generated millions of synthetic captions that were accurate, detailed, and specific to the image. They then trained their new model, BIOCAP, using both the name tags and these new descriptions.
What changed?
- Old Model: Saw a photo and thought, "That looks like the 'Hummingbird' category."
- BIOCAP: Saw a photo and thought, "That bird has a green back, a white throat, and is hovering. That matches the description of a female Calliope Hummingbird."
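Training on both the name tag and the caption can be pictured as a contrastive objective with two text views per image. The sketch below is an assumption about the general recipe (a CLIP-style InfoNCE loss), not the paper's exact formulation, and the embeddings are random placeholders where a real image/text encoder would go.

```python
# Sketch: contrastive training with two text views per image,
# the taxonomic name and the synthetic caption.
import numpy as np

rng = np.random.default_rng(0)

def info_nce(img, txt, temperature=0.07):
    """Image-to-text InfoNCE loss over a batch of embeddings;
    matching image/text pairs sit on the diagonal."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix
    idx = np.arange(len(img))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, idx].mean()

B, D = 4, 8
img_emb = rng.normal(size=(B, D))
name_emb = rng.normal(size=(B, D))     # embeddings of labels like "Calliope Hummingbird"
caption_emb = rng.normal(size=(B, D))  # embeddings of the synthetic captions

# Supervise the image tower with both text views at once.
loss = 0.5 * (info_nce(img_emb, name_emb) + info_nce(img_emb, caption_emb))
print(float(loss))
```

The name tags teach the model the category boundaries; the captions force its image features to also line up with descriptive phrases like "green back" and "white throat."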
5. Why This Matters
The paper shows that BIOCAP is much better at two things:
- Classification: It is much better at fine-grained distinctions, telling apart very similar species, and even the male and female of the same species, than models trained on name tags alone.
- Search: If you type "Find me a bird with a red tail," BIOCAP can actually find it, because it learned what a "red tail" looks like, rather than just matching the keyword "red" against stored labels.
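That kind of search boils down to ranking images by similarity to the query in a shared embedding space. The 3-D "feature axes" below are hand-made toys for illustration; a trained model like BIOCAP would produce high-dimensional embeddings from real images and queries.

```python
# Sketch of text-to-image search via cosine similarity in a
# shared embedding space. All vectors here are toy values.
import numpy as np

def rank_images(query_emb, image_embs):
    """Return image indices sorted best-first by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    ims = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = ims @ q
    return np.argsort(-sims)

# Toy axes: 0 ~ "red tail", 1 ~ "green back", 2 ~ "white belly".
images = np.array([
    [0.9, 0.1, 0.2],   # image 0: bird with a red tail
    [0.1, 0.8, 0.3],   # image 1: green-backed bird
    [0.0, 0.2, 0.9],   # image 2: white-bellied bird
])
query = np.array([1.0, 0.0, 0.1])  # "find me a bird with a red tail"

order = rank_images(query, images)
print(order)  # image 0 should rank first
```

A model trained only on name tags has no reason to place "red tail" near red-tailed images in this space; caption supervision is what gives those phrase-level directions meaning.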
The Big Picture
This isn't just about birds. It's about teaching computers to understand the real world using the language of experts.
In many fields (medicine, geology, astronomy), we have tons of pictures but very few descriptions. BIOCAP shows us a recipe: Don't just let AI guess. Give it the facts from reliable sources (like Wikipedia) and show it how to write like a pro. This turns a "guessing robot" into a "knowledgeable assistant" that can actually help scientists discover new things.