Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning

Imagine you are a detective trying to solve a mystery about the history of human writing. You have thousands of ancient drawings (glyphs) from different cultures, and you want to figure out which ones are "cousins" (related by history) and which are strangers.

The problem? We don't have the family tree.

For some made-up writing systems (like the alien alphabet in Futurama or Tolkien's Elvish), we know exactly which letters are the same and which are different. But for real historical scripts, the relationships between writing systems are often uncertain, debated, or simply undocumented. If you teach a computer by saying, "These two are definitely not related," you may be building in a claim that the historical record cannot support.

This paper proposes a clever two-step solution to teach a computer how to understand these ancient scripts without getting stuck on the arguments. Think of it as a "Master Class followed by an Exploration Trip."

The Two-Stage Framework

Stage 1: The Master Class (The "Teacher")

First, the researchers train a smart AI model (the Teacher) on made-up alphabets where the answers are 100% clear.

The Analogy: Imagine a strict art teacher giving a student a set of perfectly distinct shapes: a red circle, a blue square, and a green triangle. The teacher says, "If you see two red circles, they are the same. If you see a red circle and a blue square, they are totally different."
The Goal: The AI learns to recognize the shape of a character regardless of how messy the handwriting is (is it tilted? zoomed in?). It becomes an expert at spotting differences and similarities in a "clean" world where there are no historical mysteries.

Stage 2: The Exploration Trip (The "Student")

Now, the researchers take that smart Teacher and use it to guide a new model (the Student) on real, ancient, messy scripts where the history is unclear.

The Analogy: The Teacher is like a tour guide who knows the rules of the game. The Student is a traveler exploring a new city. The Teacher says, "Here is how you recognize a shape. Go explore these ancient ruins. Don't worry if you aren't sure if two ruins are related; just look for patterns and similarities based on what I taught you."
The Twist: Unlike other methods that force the AI to guess which scripts are "enemies" (negative pairs), this Student is allowed to be flexible. It learns to group things that look similar, even if we don't know for sure if they are historically related. It discovers hidden connections on its own, guided by the Teacher's strong foundation but free to find new truths.

Why This is a Big Deal

Standard contrastive methods need examples of what counts as "different." For historical scripts, that forces the AI to label some characters as unrelated when they might actually share a history.

This paper's approach is like learning to ride a bike with training wheels, then taking them off.

Training Wheels (Stage 1): You learn balance and steering on a flat, safe track (invented alphabets).
Taking them off (Stage 2): You ride on the bumpy, real roads (ancient scripts). You still have the balance you learned, but now you can navigate the real world's curves and hills without being told exactly where every pothole is.

The Results

When they tested this method:

It recognized individual letters about as well as the strongest existing methods, and sometimes better (like a human recognizing a messy "A" vs. a messy "B").
It grouped scripts better: When asked to rank how similar two writing systems are (e.g., "How similar is Greek to Cyrillic?"), this method got rankings closer to what linguists believe than other AI methods did on most of the models tested.
It found hidden patterns: The AI didn't just memorize; it actually reorganized the "map" of writing systems to show that historically related scripts (like Greek and Latin) naturally clustered together, while unrelated ones stayed far apart.

The Bottom Line

This paper solves a tricky problem: How do you teach a computer about history when the history books are missing pages?

By first teaching the computer the rules using a "fake" world where everything is known, and then letting it explore the "real" world with those rules as a guide, the AI can discover the true relationships between ancient scripts without being forced to make up false facts. It's a bridge between what we know for sure and what we are still trying to discover.

A more detailed technical summary of the paper follows.

1. Problem Statement

The paper addresses the challenge of learning similarity metrics for ancient and invented writing systems (glyphs and scripts). The core difficulty lies in asymmetric supervision:

Glyph Level: Individual characters within a specific script (especially invented ones like Tolkien's Tengwar) can be reliably labeled. Different renderings of the same character are positive pairs, while different characters are negative pairs.
Script Level: Historical relationships between different writing systems are often uncertain, debated, or undocumented. Defining "negative pairs" (asserting two characters from different scripts are unrelated) risks baking in unverifiable linguistic or archaeological assumptions.
The Gap: Existing visual representation learning frameworks (like standard contrastive learning) rely on clear negative pairs. This is problematic for historical scripts, where visual similarity between glyphs from different scripts may reflect borrowing or common ancestry, so treating them as negatives could be wrong.

2. Methodology

The authors propose a two-stage framework that decouples reliable character supervision from uncertain script relations.

Stage 1: Supervised Contrastive Learning (Teacher Training)

Data: Trained on labeled invented alphabets (e.g., Omniglot's fictional scripts) where character identities are unambiguous and historically independent.
Objective: Train a "Teacher" encoder ($f^*_\phi$) using Supervised Contrastive Loss (SupCon).
Mechanism:
- All instances of the same character (including augmentations) are treated as positives.
- All distinct character classes are treated as negatives.
- This establishes a robust, discriminative embedding space with clear intra-class clustering and inter-class separation.
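
The Stage 1 mechanism above can be sketched as a small supervised contrastive loss. This is an illustrative pure-Python version under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.1, and the choice to average over anchors that have at least one positive are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors with at least one positive.

    Positives: other samples with the same label as the anchor.
    Negatives: every sample with a different label.
    The temperature default is an assumed value, not taken from the paper.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled similarity of the anchor to every other sample.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(n) if a != i}
        # Log-sum-exp over all other samples, computed stably.
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[p] - log_denominator for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy check: embeddings that cluster by label should give a lower loss.
labels = ["a", "a", "b", "b"]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(supcon_loss(clustered, labels) < supcon_loss(mixed, labels))
```

In practice the same computation would be vectorized in a deep learning framework and applied to encoder outputs for augmented glyph images; the loop form is used here only to make the roles of positives and negatives explicit.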

Stage 2: Unsupervised Teacher-Student Distillation (Adaptation)

Data: Adapted to unlabeled historical scripts (e.g., real ancient alphabets) where cross-script relationships are unknown.
Objective: Transfer the discriminative structure from the Teacher to a "Student" network ($f_\theta$) without imposing cross-script negatives.
Framework: Based on BYOL (Bootstrap Your Own Latent) but with three critical adaptations:
1. Initialization: Both the Student and the Target (EMA) networks are initialized from the Stage 1 Teacher, providing a semantic prior rather than random initialization.
2. Architecture: The projection MLP is removed; the predictor operates directly on the backbone embeddings to avoid overfitting on moderate-sized datasets.
3. Data Strategy: Instead of augmenting a single image twice, the method leverages multiple genuine handwritten instances per character class alongside geometric augmentations.
Loss Function: Minimizes the negative cosine similarity between the Student's prediction and the Target's representation (with stop-gradient), allowing the Student to reorganize representations to fit historical data while preserving the Teacher's geometric regularities.
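
The three Stage 2 ingredients described above (pairing genuine instances, a negative cosine loss, and an EMA target) can be sketched as follows. This is a rough illustration on plain Python lists rather than the authors' training code: the helper names, the momentum value of 0.99, and the flat parameter list are assumptions for the example, and stop-gradient is represented only by treating the target vector as a constant.

```python
import math
import random

def sample_view_pair(instances_by_char, rng):
    """Pick two different handwritten instances of one randomly chosen character.

    This replaces "augment one image twice" with two genuine samples of the
    same class; geometric augmentations would be applied to each afterwards.
    """
    char = rng.choice(sorted(instances_by_char))
    first, second = rng.sample(instances_by_char[char], 2)
    return char, first, second

def negative_cosine(prediction, target):
    """Negative cosine similarity between the student's prediction and the
    target's representation. The target is treated as a constant (stop-gradient)."""
    dot = sum(p * t for p, t in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_t = math.sqrt(sum(t * t for t in target))
    return -dot / (norm_p * norm_t)

def ema_update(target_params, student_params, momentum=0.99):
    """Move each target parameter slightly toward the student parameter.
    The momentum default is an assumed value, not taken from the paper."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(target_params, student_params)]

# Toy usage with hypothetical file names and parameters.
rng = random.Random(0)
instances = {"alpha": ["a1.png", "a2.png", "a3.png"], "beta": ["b1.png", "b2.png"]}
char, view_a, view_b = sample_view_pair(instances, rng)
loss = negative_cosine([1.0, 0.0], [2.0, 0.0])   # perfectly aligned vectors
target = ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
```

Because both networks start from the Stage 1 Teacher, the target already encodes a meaningful geometry at the first step, so the loss pulls genuine instances of the same character together without ever pushing characters from different scripts apart.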

3. Key Contributions

Two-Stage Strategy: A novel approach that separates reliable character-level supervision from uncertain script-level relations, avoiding the need to define unverifiable negative pairs for historical scripts.
Teacher-Initialized Self-Distillation: A modification of BYOL that uses a supervised contrastive teacher to guide unsupervised learning, enabling the model to inherit discriminative structure while discovering latent cross-script similarities.
Comprehensive Evaluation Protocol: A dual-metric evaluation system:
- Glyph Level: 20-way 1-shot retrieval accuracy.
- Script Level: Script-to-script distance ranking evaluated via NDCG@10 (Normalized Discounted Cumulative Gain) and Spearman's Rank Correlation against curated linguistic similarity levels.
New Dataset: Construction of a Unicode-based dataset for pre-19th-century writing systems with ground-truth similarity levels derived from historical typology.
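
The two script-level metrics named above can be sketched in a few lines. This is a minimal illustration, assuming a linear gain in the DCG sum (some NDCG variants use an exponential gain instead) and average ranks for ties in Spearman's correlation; the paper's exact variants are not stated here. The script names, distances, and similarity levels in the toy data are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(distances, relevance, k=10):
    """NDCG@k for one query script.

    distances: {script: model distance to the query script}
    relevance: {script: curated similarity level, higher means more related}
    """
    ranked = sorted(distances, key=distances.get)  # closest script first
    gains = [relevance[s] for s in ranked]
    ideal = sorted((relevance[s] for s in distances), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def average_ranks(values):
    """1-based ranks in ascending order, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data for a single query script.
toy_distances = {"ScriptA": 0.2, "ScriptB": 0.3, "ScriptC": 0.9}
toy_relevance = {"ScriptA": 2, "ScriptB": 2, "ScriptC": 0}
score = ndcg_at_k(toy_distances, toy_relevance, k=10)
rho = spearman([-d for d in toy_distances.values()], list(toy_relevance.values()))
```

Distances are negated before the Spearman call so that a positive correlation means "closer in the embedding space goes with more related in the curated levels."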

4. Experimental Results

The framework was evaluated on Omniglot and a custom Unicode dataset across five backbone architectures (Simple CNN, Siamese CNN, ResNet-18/34/50) and compared against baselines (SupCon, BYOL, Barlow Twins, DINOv2).

Script-Level Ranking (Primary Metric):
- The hybrid approach achieved the best NDCG@10 on three out of five backbones (Simple CNN, ResNet-34, ResNet-50), outperforming the purely self-supervised methods on those backbones.
- On ResNet-50, the hybrid method achieved an NDCG@10 of 0.3178, beating Barlow Twins (0.2997) and BYOL (0.2708).
- This indicates the model successfully organizes the embedding space to reflect historical linguistic relationships.
Glyph-Level Retrieval:
- The method remained competitive or superior on 20-way 1-shot retrieval (Top-1/Top-5 accuracy), particularly on Simple CNN and ResNet-50.
- On mid-sized ResNets (18/34), pure self-supervised methods sometimes achieved higher Top-1 accuracy, but the hybrid method maintained superior script-level coherence (NDCG).
Geometric Analysis:
- Separability Ratio ($R$): The student model reduced the ratio of the distance between related scripts (Greek and Latin) to the distance to unrelated scripts (CJK) by about 35% relative to the teacher alone ($R$ dropped from 0.323 to 0.210).
- t-SNE Visualizations: Confirmed that Stage 2 does not merely compress the space but selectively accentuates historically grounded proximities.
Baselines: Large pre-trained foundation models (DINOv2) performed poorly, highlighting the necessity of domain-adapted training for ancient scripts.
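
The separability ratio reported above can be illustrated with a short sketch. The exact definition of $R$ is not given in this summary, so the version below is an assumption: each script is represented by the centroid of its glyph embeddings, and $R$ is the mean centroid distance over related script pairs divided by the mean over unrelated pairs. The function names and toy embeddings are hypothetical; only the values 0.323 and 0.210 come from the text.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def separability_ratio(script_embeddings, related_pairs, unrelated_pairs):
    """Assumed definition: mean related-pair distance / mean unrelated-pair distance.
    Smaller values mean related scripts sit closer relative to unrelated ones."""
    centroids = {name: centroid(vs) for name, vs in script_embeddings.items()}

    def mean_distance(pairs):
        return sum(euclidean(centroids[a], centroids[b]) for a, b in pairs) / len(pairs)

    return mean_distance(related_pairs) / mean_distance(unrelated_pairs)

def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

# Toy embeddings (hypothetical), one glyph per script for brevity.
toy = {"Greek": [[0.0, 0.0]], "Latin": [[1.0, 0.0]], "CJK": [[10.0, 0.0]]}
ratio = separability_ratio(toy, [("Greek", "Latin")],
                           [("Greek", "CJK"), ("Latin", "CJK")])

# Arithmetic check on the reported values: (0.323 - 0.210) / 0.323 is about 0.35.
drop = relative_reduction(0.323, 0.210)
```

The arithmetic check confirms that the reported change in $R$ corresponds to a relative reduction of roughly 35%.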

5. Significance

Epistemological Solution: The paper provides a computational solution to a fundamental problem in linguistics and archaeology: how to learn from data where ground truth is contested. It avoids forcing false negatives on historical data.
Bridging Supervision and Discovery: It successfully bridges the gap between supervised contrastive learning (which creates hard boundaries) and unsupervised discovery (which finds soft similarities), allowing for both distinct system separation and the detection of potential historical influences.
Future Applications: The resulting script distances can serve as a foundation for phylogenetic analyses of writing systems, potentially enabling tree- or network-based reconstructions of script lineages on a global scale. The methodology is also applicable to other domains where within-class identity is known but cross-category relations are uncertain.