Imagine you are trying to guess what a person is talking about, but they only give you a few words, like "Hospital" or "Battery." In English, this is already tricky. But in Korean, it's even harder because the language is like a Lego set where words are built by snapping small pieces (morphemes) together, and the order you snap them in can change the whole meaning. Plus, people often leave out the "connector pieces" (particles) in short texts like tweets or headlines.
This paper introduces a new AI system called LIGRAM that is specifically designed to solve this puzzle for Korean short texts. Here is how it works, broken down into simple concepts:
1. The Problem: The "Missing Context" Puzzle
Short texts are like post-it notes with very little information.
- The English Problem: If you see "Apple," is it the fruit or the tech company?
- The Korean Problem: It's worse. Because Korean is "agglutinative" (words are glued together), a single word can contain a noun, a verb, and a tense all at once. If you chop the word apart incorrectly, you lose the meaning. Also, Korean speakers often skip the "glue" (particles) in short messages, making sentences look fragmented.
- The Result: Standard AI models, which were mostly trained on English, get confused and make mistakes because they don't understand the unique "glue" and structure of Korean.
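To make the agglutination point concrete, here is a tiny illustrative sketch. The "analyzer" is faked with a hardcoded lookup table (real systems use morphological analyzers such as MeCab-ko), but the decompositions shown are standard Korean morphology:

```python
# Illustrative only: a hardcoded lookup standing in for a real Korean
# morphological analyzer. Each surface word maps to its morpheme pieces.
MORPHEME_TABLE = {
    # "갔다" ("went") = stem 가- (go) + past tense -았- + ending -다
    "갔다": [("가", "verb stem: go"), ("았", "past tense"), ("다", "declarative ending")],
    # "병원에서" ("at the hospital") = noun 병원 + locative particle 에서
    "병원에서": [("병원", "noun: hospital"), ("에서", "particle: at/in")],
}

def analyze(word):
    """Split one surface word into its (morpheme, role) pieces."""
    return MORPHEME_TABLE.get(word, [(word, "unknown")])

for w in ["갔다", "병원에서"]:
    print(w, "->", analyze(w))
```

Note the two failure modes this illustrates: splitting 갔다 in the wrong place destroys the "go + past" meaning, and in a short text the particle 에서 is often dropped entirely, leaving just 병원.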
2. The Solution: LIGRAM (The Three-Layer Detective)
Instead of just reading the text as a flat list of words, LIGRAM acts like a detective who builds three different maps of the same crime scene to find the truth. These maps are called "subgraphs."
- Map 1: The Morpheme Map (The Bricks)
- Analogy: Imagine taking a Lego castle apart to see the individual bricks.
- What it does: It breaks Korean words down to their smallest meaningful pieces. This helps the AI understand the core meaning even if the word order is weird or parts are missing.
- Map 2: The POS Map (The Grammar Skeleton)
- Analogy: Imagine looking at a skeleton to see how the bones connect, ignoring the skin.
- What it does: It tracks the "Part of Speech" (is this a noun? a verb?). Since Korean often hides the "glue" words, this map acts as a safety net, reminding the AI how the sentence should be structured grammatically.
- Map 3: The Entity Map (The Landmarks)
- Analogy: Looking for famous landmarks in a city to figure out where you are.
- What it does: It highlights specific names like "Samsung," "Seoul," or "Doctor." These are strong clues that help the AI guess the topic even if the rest of the sentence is vague.
The Magic: LIGRAM doesn't just look at these maps separately; it stacks them on top of each other (hierarchical integration). This gives the AI a 3D view of the text, filling in the gaps left by the short length.
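The three maps and their stacking can be sketched in miniature. This is a toy illustration in my own notation (plain sets of nodes and edges), not the paper's actual graph construction or API:

```python
# Toy sketch: each "map" is a set of nodes plus edges; hierarchical
# integration stacks them by linking the shared word nodes into every layer.
# All names and the example sentence ("병원 갔다" = "went (to the) hospital",
# particle omitted) are illustrative.

def build_layer(name, nodes, edges):
    return {"name": name, "nodes": set(nodes), "edges": set(edges)}

words = ["병원", "갔다"]  # the raw short text, particle dropped

morpheme_layer = build_layer(
    "morpheme", ["병원", "가", "았", "다"],
    [("가", "았"), ("았", "다")])           # the Lego bricks of 갔다, in order

pos_layer = build_layer(
    "pos", ["NOUN", "VERB"], [("NOUN", "VERB")])  # the grammar skeleton

entity_layer = build_layer(
    "entity", ["HOSPITAL"], [])             # the landmark: a known entity

def integrate(word_nodes, layers, links):
    """Hierarchical integration: one combined graph whose word nodes
    connect into every layer via the given cross-layer links."""
    graph = {"nodes": set(word_nodes), "edges": set()}
    for layer in layers:
        graph["nodes"] |= layer["nodes"]
        graph["edges"] |= layer["edges"]
    graph["edges"] |= set(links)            # word -> layer-node connections
    return graph

g = integrate(
    words, [morpheme_layer, pos_layer, entity_layer],
    [("병원", "NOUN"), ("갔다", "VERB"),
     ("갔다", "가"), ("병원", "HOSPITAL")])

print(len(g["nodes"]), len(g["edges"]))  # → 8 7
```

The payoff of stacking is visible even in the toy: 병원 alone is ambiguous, but once it is simultaneously connected to a NOUN slot in the skeleton and a HOSPITAL landmark, the combined graph carries far more signal than the two-word text does.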
3. The Secret Sauce: "Semantic Contrastive Learning" (The Grouping Game)
Even with the maps, the AI might still be unsure if two short texts are similar.
- The Old Way: The AI might think two sentences are different just because they use different words, even if they mean the same thing.
- The LIGRAM Way (SemCon): Imagine a teacher sorting students into groups based on their interests, not just their names.
- The AI looks at a document and guesses its "topic distribution" (e.g., "80% Politics, 20% Sports").
- It then says, "Hey, Document A and Document B both have high 'Politics' scores. Let's pull them closer together in the AI's brain."
- It pushes documents with different topics further apart.
- This creates clearer boundaries between categories, making it much harder for the AI to get confused.
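The grouping game above can be sketched as a toy contrastive score. This is a simplified stand-in for the paper's SemCon objective (the function names, margin value, and two-topic distributions are all my own illustration):

```python
# Toy sketch of semantic contrastive learning: documents with similar
# predicted topic distributions are pulled together (low loss when close),
# documents with different distributions are pushed apart.
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Predicted topic distributions over [Politics, Sports] for three docs.
doc_a = [0.8, 0.2]
doc_b = [0.7, 0.3]   # similar mix to doc_a -> treated as a positive pair
doc_c = [0.1, 0.9]   # different mix       -> treated as a negative pair

def contrastive_score(anchor, other, positive, margin=0.5):
    """Positive pairs pay for being far apart; negative pairs pay only
    when they creep closer than the margin."""
    sim = cosine(anchor, other)
    return (1.0 - sim) if positive else max(0.0, sim - margin)

loss = (contrastive_score(doc_a, doc_b, positive=True)
        + contrastive_score(doc_a, doc_c, positive=False))
print(round(loss, 3))
```

Because doc_a and doc_b both lean "Politics," their positive-pair term is near zero, while doc_c is already far enough away that the negative term contributes nothing: exactly the "clear boundaries" effect described above.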
4. The Results: Why It Matters
The researchers tested LIGRAM on four different Korean datasets (news headlines, movie reviews, search snippets, and shopping reviews).
- The Outcome: LIGRAM crushed the competition. It beat standard AI models, deep learning models, and even some massive "Large Language Models" (LLMs) on complex tasks.
- The Takeaway: You don't need a giant, expensive supercomputer to understand Korean short text. You just need a model that respects the language's unique structure. By building maps of the grammar and meaning, and then teaching the AI to group similar ideas together, LIGRAM solves the "short text" problem efficiently and accurately.
In a nutshell: LIGRAM is like a translator who doesn't just read the words, but understands the grammar, the building blocks, and the hidden context of Korean, allowing it to make sense of even the shortest, most confusing notes.