Imagine you walk into a library and pick up a scientific textbook. You open it to a complex page filled with a "compound figure"—a single image that is actually a collage of six or seven smaller pictures (panels), each showing a different experiment, graph, or microscope shot.
Usually, there is one big caption at the bottom of the whole page that says something vague like, "Figure 1: Results of the study." It doesn't tell you which part is which. If you want to understand the specific experiment in the top-left corner, you have to guess, or hope the text nearby explains it.
FigEx2 is a new AI tool designed to solve this exact problem. Think of it as a super-smart, bilingual librarian who can look at that messy collage and instantly:
- Draw boxes around every single small picture to separate them.
- Write a unique, detailed story for each small picture, explaining exactly what it shows.
Here is how the paper explains the magic behind this tool, broken down into simple concepts:
1. The Problem: The "Missing Manual"
In the real world, these detailed instructions are often missing. Sometimes the caption is absent entirely, or it's just a high-level summary. Previous AI tools tried to fix this by reading the big caption and guessing which part of the image matched which sentence. But this is like trying to assemble a puzzle using only a blurry photo of the finished box. If the text is missing or vague, the AI gets confused.
FigEx2's Solution: Instead of waiting for a text manual, FigEx2 looks only at the pictures. It says, "I don't need the text to know what this graph is about; I can see it." It acts like a detective who solves the case based on visual clues alone.
2. The "Traffic Light" System (Gated Fusion)
One of the hardest parts of this job is that the AI has to do two things at once: write the story and find the picture.
- The Challenge: Writing a story is creative and messy. Sometimes the AI might say, "This graph shows a red line," and other times, "The red line indicates growth." This variety in language can confuse the part of the AI trying to draw the box. It's like trying to drive a car while the passenger is shouting different, conflicting directions.
- The Fix: The authors built a "Noise-Aware Gated Fusion" module. Imagine this as a smart traffic light or a bouncer at a club.
- As the AI generates words, this "bouncer" checks them.
- If the words are clear and helpful for finding the box, the gate opens, and the information flows to the detector.
- If the words are noisy, repetitive, or confusing, the gate closes or filters them out.
- This ensures the "box-drawing" part of the brain stays calm and focused, even if the "story-writing" part is being creative.
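The gating idea above can be sketched in a few lines. This is not the paper's actual Noise-Aware Gated Fusion module, just a minimal illustration of the principle: a learned gate (here driven by a hypothetical `noise_score`) scales how much text information is allowed to flow into the detector's features.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(visual_feat, text_feat, noise_score):
    """Blend text features into visual features, scaled by a gate
    that closes as the text looks noisier (illustrative sketch)."""
    gate = sigmoid(-noise_score)  # high noise -> gate near 0
    return [v + gate * t for v, t in zip(visual_feat, text_feat)]

# Clean text (low noise): the gate opens, text information flows in.
fused_clean = gated_fuse([1.0, 2.0], [0.5, 0.5], noise_score=-4.0)
# Noisy text (high noise): the gate closes, detector features stay
# almost untouched, so the "box-drawing" side remains stable.
fused_noisy = gated_fuse([1.0, 2.0], [0.5, 0.5], noise_score=4.0)
```

In a real system the gate and the noise estimate would both be learned layers operating on feature vectors; the sigmoid bottleneck is what lets the model smoothly interpolate between "trust the text" and "ignore the text."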
3. The "Coach" (Reinforcement Learning)
Training an AI to do this is hard. If you just tell it, "Do better," it doesn't know what "better" means.
- The Strategy: The researchers used a technique called Reinforcement Learning (RL), which is like having a strict coach.
- How it works: The AI writes a caption and draws a box. The coach then checks two things:
- Did you get the meaning right? (Using a tool called BERTScore, which checks how closely the generated words match a reference description.)
- Does the picture match the words? (Using a tool called CLIP, which measures whether the image and the text are actually describing the same thing.)
- If the AI gets it right, it gets a "reward" (a treat). If it hallucinates (makes things up) or draws the box in the wrong place, it gets a penalty. Over time, the AI learns to produce captions that faithfully match the image.
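The reward described above can be sketched as a weighted sum. The weights, the penalty term, and the function name here are illustrative assumptions, not the paper's exact formulation; in practice the two scores would come from real BERTScore and CLIP models.

```python
def rl_reward(bertscore_f1, clip_similarity, hallucination_penalty=0.0,
              w_text=0.5, w_align=0.5):
    """Combine a text-quality score (BERTScore-style) and an image-text
    alignment score (CLIP-style) into one scalar reward for the RL
    'coach'. Weights and penalty are illustrative, not from the paper."""
    return (w_text * bertscore_f1
            + w_align * clip_similarity
            - hallucination_penalty)

good = rl_reward(bertscore_f1=0.9, clip_similarity=0.8)
# A caption can be fluent (high BERTScore) yet hallucinated (low CLIP
# alignment) — the combined reward punishes that, plus an extra penalty.
bad = rl_reward(bertscore_f1=0.9, clip_similarity=0.2,
                hallucination_penalty=0.3)
```

The key design point is that neither score alone suffices: BERTScore rewards plausible wording, while CLIP anchors the wording to what is actually visible in the panel.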
4. The "Universal Translator" (Zero-Shot Transfer)
The most impressive part of FigEx2 is its ability to learn from one subject and apply it to another without extra training.
- The Analogy: Imagine you teach a student how to read a Biology textbook. Then, you hand them a Physics textbook they've never seen before. Most students would be lost.
- FigEx2's Superpower: Because it learned the structure of scientific figures (graphs, heatmaps, labels) rather than just memorizing biology words, it can instantly handle Physics and Chemistry figures. It didn't need to be retrained; it just applied its logic to the new domain. This is called Zero-Shot Transfer.
5. The Result: A New Benchmark
The team created a new dataset called BioSci-Fig-Cap to teach the AI: a high-quality "training manual" in which every single panel has its own description. They tested FigEx2 against other top AI models (like Qwen3-VL) and found that FigEx2 was significantly better at:
- Finding the panels: It drew the boxes much more accurately.
- Writing the captions: It used better vocabulary and matched the images more faithfully.
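"Drawing the boxes accurately" is typically scored with Intersection-over-Union (IoU): the overlap between the predicted panel box and the true one, divided by their combined area. A minimal sketch (a standard metric, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes — the
    standard way to score how accurately a panel box was drawn."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box shifted halfway off the true panel scores 1/3.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A score of 1.0 means a perfect box; benchmarks usually count a detection as correct when IoU clears a threshold such as 0.5.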
Summary
FigEx2 is a tool that takes a messy, multi-panel scientific image and automatically breaks it down into neat, labeled sections, writing a clear explanation for each one. It uses a "traffic light" system to keep its logic stable and a "coach" to ensure it learns the right way. Best of all, it's so smart that it can take what it learned from biology and immediately start helping with physics and chemistry, acting as a universal translator for scientific visual data.