Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

This paper proposes a structure-aware contrastive learning paradigm that uses hard samples and specialized loss functions to improve vision-language models' understanding of the structured, symbolic information in diagrams, outperforming standard CLIP approaches on flowchart-based tasks.

Hiroshi Sasaki

Published 2026-03-02

The Big Problem: AI is Good at Photos, Bad at Diagrams

Imagine you have a very smart student (an AI model called CLIP) who has read millions of books and looked at millions of photos. If you show this student a picture of a golden retriever and ask, "Is this a dog?" or "Show me the picture of the dog," they will get it right almost every time. They are experts at natural images (photos of cats, sunsets, people).

However, if you show this student a flowchart (a diagram with boxes, arrows, and text like "Start" → "Check Password" → "End"), they get confused.

Why?

  • Photos are about what things look like (fur, color, shape).
  • Diagrams are about how things connect (logic, flow, structure).

The student sees the boxes and arrows but misses the story they tell. They might think a flowchart where the arrows go in the wrong order is the same as the correct one, because the boxes look identical.

The Solution: A Specialized Training Camp

The author, Hiroshi Sasaki, created a new training method to turn this "general student" into a "diagram expert." They call it Structure-Aware Contrastive Learning.

Think of this as a special boot camp designed to teach the AI how to read the logic of a diagram, not just the pictures.

Step 1: Breaking it Down (Granulation)

Diagrams can be complex. To teach the AI better, the researchers break big flowcharts into tiny, bite-sized pieces.

  • Analogy: Imagine a long, complicated recipe. Instead of showing the AI the whole cookbook, they show them just one step at a time: "Take the egg," then "Crack the egg," then "Whisk the egg."
  • They take the code behind the diagram (like a blueprint) and chop it into small triplets of steps. This makes it easier for the AI to learn the specific rules of how arrows connect boxes.
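The paper doesn't spell out its granulation procedure here, but assuming the flowchart's "blueprint" is an ordered edge list, chopping it into step triplets might look like this sliding-window sketch (function and variable names are illustrative, not the paper's):

```python
# Hypothetical sketch: chop a linear flowchart's edge list into
# small "triplets" of consecutive steps. The paper's actual
# diagram representation may differ.

def granulate(edges, size=3):
    """Slide a window over a chain of steps, yielding small
    sub-charts of `size` consecutive nodes plus their arrows."""
    # Recover the linear order of nodes from the edge list.
    order = [edges[0][0]] + [dst for _, dst in edges]
    # Emit every window of `size` consecutive nodes with the
    # arrows that connect them.
    return [
        (order[i:i + size], edges[i:i + size - 1])
        for i in range(len(order) - size + 1)
    ]

chart = [("Start", "Check Password"), ("Check Password", "End")]
pieces = granulate(chart)
# pieces[0] pairs the node triplet with its two connecting arrows.
```

Each piece is small enough for the model to learn the local rule "this arrow connects these two boxes" without the distraction of the whole chart.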

Step 2: The "Tricky" Practice Tests (Hard Samples)

In normal training, the AI learns by matching a picture to the right description. But to get really smart, it needs to face tricky questions.

The researchers create two types of "fake" or "tricky" examples:

  1. Hard Positives (The "Same Thing, Different Look"):

    • They take a correct flowchart and flip it upside down or change the colors, but keep the logic exactly the same.
    • Analogy: It's like showing the AI a photo of a red car and then a photo of the same red car, but taken from the back. The AI must learn: "These look different, but they are the SAME car."
    • Goal: Teach the AI that the structure matters more than the visual orientation.
  2. Hard Negatives (The "Look-Alike" Traps):

    • They take a correct flowchart and make tiny, sneaky changes: swapping two boxes, reversing an arrow, or deleting a step.
    • Analogy: Imagine a "Spot the Difference" puzzle. The AI sees a flowchart that looks 99% identical to the correct one, but one arrow points the wrong way. The AI must learn: "These look almost the same, but they are TOTALLY different stories."
    • Goal: Force the AI to pay attention to the tiny details that change the meaning.
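On an edge-list representation, the three "sneaky changes" described above are each a one-line structural edit. A minimal sketch, assuming that representation (the helper names are illustrative, not the paper's):

```python
# Hypothetical sketch of hard-negative generation: tiny structural
# edits that keep a flowchart looking almost identical while
# changing its meaning.

def reverse_edge(edges, i):
    """Flip the direction of arrow i: A -> B becomes B -> A."""
    out = list(edges)
    src, dst = out[i]
    out[i] = (dst, src)
    return out

def swap_nodes(edges, a, b):
    """Exchange two box labels everywhere they appear."""
    relabel = {a: b, b: a}
    return [(relabel.get(s, s), relabel.get(d, d)) for s, d in edges]

def delete_step(edges, i):
    """Remove the box between arrows i and i+1, reconnecting
    its neighbours directly."""
    out = list(edges)
    (src, _), (_, dst) = out[i], out[i + 1]
    out[i:i + 2] = [(src, dst)]
    return out

chart = [("Start", "Check"), ("Check", "End")]
negative = reverse_edge(chart, 1)  # "Check -> End" becomes "End -> Check"
```

Hard positives are the complementary operation: re-render the *same* edge list with a different layout or color scheme, so the pixels change but the structure does not.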

Step 3: The Two Special Rules (The Loss Functions)

To make sure the AI learns these lessons without getting confused, the researchers added two special rules to the training math:

  1. The "Group Hug" Rule (Structure-Aware Contrastive Loss):

    • This rule tells the AI: "Pull all the 'correct' versions of this diagram closer together in your brain, and push all the 'wrong' versions far away."
    • It ensures the AI understands that flipping a diagram upside down doesn't change its meaning, but swapping two steps does.
  2. The "Keep the Good Stuff" Rule (Distinct Factor Orthogonal Loss):

    • This is the cleverest part. When the AI looks at a "correct" diagram and a "tricky wrong" one, they often share some parts (like the same box names).
    • Analogy: Imagine you have a friend who loves pizza. You also have a "fake" friend who loves pizza but is actually a spy.
    • The "Group Hug" rule might make the AI forget that the spy is a spy because they both love pizza.
    • The "Keep the Good Stuff" rule says: "Okay, acknowledge that they both like pizza (shared info), but make sure you clearly separate the part of their brain that knows they are friends from the part that knows the spy is a spy."
    • It forces the AI to separate the shared features (the boxes) from the unique features (the arrow directions) so it doesn't get confused.
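The two rules above can be sketched numerically. This is a plausible reconstruction, not the paper's exact equations: the "group hug" rule is written here in the standard InfoNCE contrastive form over cosine similarities, and the "keep the good stuff" rule as a penalty that pushes the distinguishing feature directions toward a zero dot product (orthogonality):

```python
# Hypothetical sketch of the two training rules. Real training
# uses learned embeddings; here they are plain vectors, and both
# loss forms are assumptions, not the paper's exact formulas.
import numpy as np

def contrastive_loss(anchor, positives, negatives, temp=0.1):
    """"Group hug" rule (structure-aware contrastive loss):
    score the anchor high against every positive and low against
    every negative, InfoNCE-style."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.array([sim(anchor, p) for p in positives]) / temp
    neg = np.array([sim(anchor, n) for n in negatives]) / temp
    scores = np.concatenate([pos, neg])
    # -log of the probability mass assigned to each positive,
    # averaged over the positives.
    return float(np.mean(-pos + np.log(np.sum(np.exp(scores)))))

def orthogonal_loss(factor_correct, factor_negative):
    """"Keep the good stuff" rule (distinct factor orthogonal
    loss): the feature directions that separate a correct diagram
    from its look-alike negative should be orthogonal, so shared
    content (same boxes) can't blur the distinction."""
    f1 = factor_correct / np.linalg.norm(factor_correct)
    f2 = factor_negative / np.linalg.norm(factor_negative)
    # Squared cosine: zero exactly when the factors are orthogonal.
    return float((f1 @ f2) ** 2)
```

Intuitively, the contrastive term shapes *where* embeddings sit, while the orthogonality term shapes *which directions* in the embedding carry the structural differences, so the shared "pizza" features can't wash out the "spy" signal.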

The Results: Did it Work?

The researchers tested this new "diagram student" on flowcharts.

  • Matching Game: When asked to match a flowchart image to its text description, the new method was much better than the standard AI. It could spot the tricky "look-alike" diagrams that confused the others.
  • Question Answering: When integrated into a larger AI system to answer questions about diagrams (e.g., "What happens if the password is wrong?"), the new method gave much more accurate answers.

The Bottom Line

This paper is about teaching AI to stop just "looking" at diagrams and start "reading" them. By using tricky practice tests (hard samples) and special rules to separate similar-looking things, the AI learns to understand the logic and flow of diagrams, not just the pictures.

It's like taking a student who only knows how to recognize a car and teaching them how to drive one.
