Imagine you are trying to teach a robot to understand metaphors in Chinese. You know the robot is smart, but when it says, "This sentence is a metaphor," it can't tell you why. It's like a student who gets the right answer on a math test but can't show their work. You don't know if they guessed, if they memorized the answer, or if they actually understood the logic.
This paper is about fixing that "black box" problem. The researchers built a new kind of robot that doesn't just guess; it follows a step-by-step recipe (a rule script) that humans can read, check, and even edit.
Here is the breakdown of their work using simple analogies:
1. The Problem: The "Magic 8-Ball" vs. The "Detective"
Most current AI models are like Magic 8-Balls. You ask, "Is this a metaphor?" and it shakes and says "Yes." But if you ask, "Why?", it stays silent. This is a big problem because Chinese has almost no inflectional morphology: none of the little grammatical flags (like the endings English verbs carry) that hint at how a word is being used. You have to rely on context and deep cultural knowledge.
The researchers wanted to build a Detective instead. A Detective doesn't just shout "Guilty!"; they present a file with evidence: "I know this is a metaphor because the word 'deep' usually means physical depth, but here it describes a 'profound idea,' which is a clash."
2. The Solution: Four Different Detective Manuals
The team didn't just build one detective; they built four different teams, each using a different "Detective Manual" (protocol) to find metaphors. They turned these manuals into computer code that calls a large language model (LLM) only for specific, small tasks, like looking up a word's meaning.
- Team A (The Dictionary Detective): This team follows the classic rule: "Does this word have a basic, physical meaning that is different from how it's used here?" (e.g., A "bright" future isn't actually glowing).
- Team B (The Map Maker): This team looks for the "skeleton" of a metaphor: The Target (what is being described), the Vehicle (the image used), and the Ground (the shared trait). If they can draw a clear map connecting these three, it's a metaphor.
- Team C (The Emotion Sensor): This team asks, "Does this sentence feel emotionally weird?" Metaphors often mix emotions that don't usually go together (e.g., "A joyful scream"). If the emotion feels incongruous, it's likely a metaphor.
- Team D (The "Like" Hunter): This team only looks for the word "like" (or its Chinese equivalents, such as 像 or 好像). If it sees "A is like B," it checks whether A and B are totally different kinds of thing. If so, it's a simile (a type of metaphor).
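To make the "rule script" idea concrete, here is a minimal sketch of a Team D-style rule in Python. The marker list and the semantic-category table are invented stand-ins (the real system delegates the category lookup to an LLM); this is an illustration of the approach, not the paper's actual code.

```python
# Hypothetical sketch of a "Team D"-style simile rule.
# Marker list and category table are illustrative, not from the paper.
SIMILE_MARKERS = ["像", "好像", "如同", "仿佛"]  # common Chinese "like" words

# Stand-in for the small LLM lookup the paper describes:
# map a word to a coarse semantic category.
CATEGORY = {
    "时间": "abstract",   # "time"
    "流水": "physical",   # "flowing water"
    "苹果": "physical",   # "apple"
    "水果": "physical",   # "fruit"
}

def is_simile(sentence: str) -> bool:
    """Flag 'A like B' patterns where A and B are different kinds of thing."""
    # Try longer markers first so "好像" isn't split on the shorter "像".
    for marker in sorted(SIMILE_MARKERS, key=len, reverse=True):
        if marker in sentence:
            left, right = sentence.split(marker, 1)
            cat_a = CATEGORY.get(left.strip("，。 "))
            cat_b = CATEGORY.get(right.strip("，。 "))
            # A same-category comparison ("an apple is like fruit") is literal;
            # a category clash ("time is like flowing water") is a simile.
            return cat_a is not None and cat_b is not None and cat_a != cat_b
    return False

print(is_simile("时间像流水"))  # "Time is like flowing water" -> True
print(is_simile("苹果像水果"))  # same category -> False
print(is_simile("时间很宝贵"))  # no "like" marker -> False
```

Because the rule is an explicit script, its strictness is visible at a glance: no marker, no verdict, which is exactly why Team D misses everything that isn't an overt comparison.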
3. The Big Discovery: The Rulebook Matters More Than the Robot
The researchers tested all four teams on the same pile of Chinese text. The results were shocking:
- Team A found a lot of metaphors (high recall) but sometimes flagged things that weren't metaphors (lower precision).
- Team D was very strict. It only found the obvious "like" comparisons: almost everything it flagged was a real metaphor (high precision), but it missed almost everything else (low recall).
- The Shocking Result: Team B and Team C agreed with each other almost 100% of the time, but Team A and Team D almost never agreed.
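The agreement result is easy to make concrete. Assuming each team outputs one yes/no verdict per sentence, pairwise agreement is just the fraction of sentences on which two teams match; the label lists below are invented purely to mirror the pattern the paper reports.

```python
# Toy pairwise-agreement calculation; the label lists are invented,
# chosen only to illustrate the pattern the paper reports.
def agreement(x, y):
    """Fraction of items on which two label lists give the same answer."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# 1 = "metaphor", 0 = "not a metaphor", over ten example sentences
team_b = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
team_c = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # near-identical to Team B
team_a = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # broad definition: flags almost everything
team_d = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # strict definition: flags almost nothing

print(agreement(team_b, team_c))  # 1.0 -> same rulebook, same verdicts
print(agreement(team_a, team_d))  # 0.2 -> different definitions, different lists
```

Note that both Team A and Team D can be internally consistent and "correct" by their own manuals while still disagreeing on most sentences: the gap comes from the definitions, not from noise.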
The Analogy: Imagine you are looking for "fruit" in a kitchen.
- Team A is a botanist who counts everything that grows on a plant (including tomatoes and cucumbers).
- Team D is a baker who only counts things that are sweet and red (only apples and strawberries).
- If you ask them to list the "fruit" in the kitchen, they will produce two completely different lists. They aren't disagreeing because one is smarter than the other; they are disagreeing about the definition of fruit.
The paper proves that how you define a metaphor matters more than how smart your AI is.
4. Why This is a Game-Changer: The "Editability"
Because these teams follow written rules (scripts) rather than just "thinking" like a black box, humans can fix them easily.
- The Old Way: If a neural network makes a mistake, you have to retrain the whole model, which is like rebuilding a car engine just to fix a flat tire.
- The New Way: If Team A keeps making a mistake with the word "deep," a human can just open the script, change one line of code, and say, "Okay, from now on, 'deep' in this context is literal, not metaphorical."
The researchers showed that their system is 100% reproducible (run it twice on the same input and you get exactly the same result) and fully editable.
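Here is a minimal sketch of what that one-line fix might look like, assuming the Team A rule is an ordinary Python function with a human-editable exception table (the words, senses, and the clash check are all illustrative, not the paper's actual script):

```python
# Hypothetical "Team A"-style rule with a human-editable exception table.
# BASIC_SENSES stands in for the dictionary lookup the paper delegates
# to an LLM; all entries here are invented for illustration.
BASIC_SENSES = {
    "深": "physical depth",   # "deep"
    "亮": "emitting light",   # "bright"
}

# The human fix: one editable line declaring contexts where a word is literal.
LITERAL_EXCEPTIONS = {
    ("深", "水"): True,   # "deep water" is literal, not metaphorical
}

def word_is_metaphorical(word: str, context: str) -> bool:
    """Team A rule: metaphorical if the basic sense clashes with the context,
    unless a human has marked this (word, context) pair as literal."""
    if LITERAL_EXCEPTIONS.get((word, context)):
        return False
    # Sketch of the clash check: the word has a basic physical sense,
    # and the context is not on a small whitelist of physical contexts.
    return word in BASIC_SENSES and context not in ("水", "洞")

print(word_is_metaphorical("深", "思想"))  # "deep thought" -> True (metaphor)
print(word_is_metaphorical("深", "水"))    # "deep water"  -> False (exception)
```

Adding one entry to `LITERAL_EXCEPTIONS` changes the system's behavior immediately and visibly, with no retraining, which is the editability claim in miniature.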
5. The Trade-off
There is a price to pay. The "Detective" system (Team A) scored 0.47 on a standard benchmark, while a powerful but unexplainable AI (a fine-tuned BERT model) scored 0.65.
However, the researchers argue: Would you rather have a robot that gets the answer right but can't explain why, or one that gets it mostly right but can show you its homework?
For education, law, or linguistics, the ability to explain why can be worth the drop in raw score.
Summary
This paper is a call to stop treating metaphor detection as a simple "guessing game." Instead, it proposes building transparent, rule-based systems where humans can see the logic, fix the errors, and understand that "metaphor" isn't one single thing—it depends entirely on which rulebook you choose to use.