CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language

Imagine you are a master chef who has spent years cooking in a massive, well-stocked kitchen filled with ingredients from every country (this is like Large Language Models trained on English, Python, and C++). You can whip up a perfect French soufflé or a complex Italian pasta dish with your eyes closed.

But then, someone hands you a recipe book for a brand-new, tiny island cuisine called Cangjie. You've never heard of it. There are no cookbooks, no YouTube tutorials, and almost no one has ever cooked it before. The ingredients are strange, the tools are different, and the rules are strict.

This paper, CANGJIEBENCH, is about testing how well these "master chefs" (AI models) can cook this new, obscure cuisine without any prior training.

Here is the breakdown of their experiment using simple analogies:

1. The Problem: The "Empty Pantry"

Most AI models are great at common languages (like Python) because they've read billions of recipes. But Cangjie is a new programming language created by Huawei for their HarmonyOS. It's so new that the AI has seen little or no Cangjie code during its training.

The Challenge: If you ask the AI to write code in Cangjie, it usually just hallucinates. It tries to mix Python rules with Cangjie words, resulting in "gibberish" that doesn't work. It's like trying to bake a cake using a hammer because you forgot the recipe.

2. The Solution: Building a "Taste Test" (The Benchmark)

Since there was no existing data to test the AI, the researchers had to create their own "Taste Test."

The Translation Trick: They took famous, difficult cooking challenges from the "Python Kitchen" (called HumanEval and ClassEval) and manually translated them into Cangjie.
Why Manual? They didn't just scrape the internet (because there's nothing there). They had experts rewrite the problems by hand. This ensures the AI isn't just "cheating" by remembering old answers; it actually has to learn the new rules on the spot.
The Result: A clean, contamination-free test with 248 problems, ranging from simple "stir-fry" tasks (functions) to complex "banquet" preparations (classes).

3. The Four Cooking Strategies

The researchers tested four different ways to help the AI cook this new dish:

Strategy A: The "Guess and Check" (Direct Generation)
- The Setup: You hand the AI the recipe and say, "Cook this." No help.
- The Result: Disaster. Most models fail more than 95% of the time. They don't know the basic rules of the new kitchen.
Strategy B: The "Cheat Sheet" (Syntax-Constrained Generation)
- The Setup: You give the AI a one-page cheat sheet with the most important rules of Cangjie (e.g., "Use a semicolon here," "This is how you make a list").
- The Result: Magic. GPT-5's success rate jumped from about 4% to over 50%. It turns out the AI already knows how to cook (the logic); it just needed to know which utensils to use (the syntax). This was the best balance between effort and results.
Strategy C: The "Library Research" (RAG)
- The Setup: You give the AI access to a library of Cangjie cookbooks and tell it, "Look up the answer before you cook."
- The Result: It helped a little, but not as much as the cheat sheet. The AI got confused by too much information or couldn't find the right page in the library.
Strategy D: The "Intern with a Walkie-Talkie" (Agent)
- The Setup: You give the AI a robot assistant (an Agent) that can walk around the kitchen, open drawers, read manuals, and ask for help if it gets stuck. It can try, fail, check the manual, and try again.
- The Result: This produced the highest accuracy (the best dishes). However, it was extremely expensive and slow. It took the AI a huge amount of time and "brain power" (tokens) to read all those manuals. It's like hiring a team of 10 people to cook one meal.

4. The Big Surprise: The "Translation Trap"

The researchers also tried a second task: Code-to-Code Translation. Instead of asking the AI to cook from a description, they gave it a Python recipe and said, "Translate this to Cangjie."

The Expectation: "If I give you the source code, it should be easier!"
The Reality: It was actually harder.
The Analogy: When the AI sees the Python code, it gets "stuck" on the Python style. It tries to force Python habits onto the Cangjie language, like trying to wear a suit over a swimsuit. It's better to let the AI cook from scratch (Text-to-Code) than to let it try to translate, because the old habits get in the way.

The Takeaway

This paper teaches us three main things:

Logic is Universal, Syntax is Local: AI models already know how to solve problems; they just need a quick "cheat sheet" to learn the new language's rules.
Don't Overcomplicate: For new languages, a simple cheat sheet (Syntax-Constrained) is often a more practical choice than a complex research team (Agents): it gives up some accuracy but is far faster and cheaper.
Translation is Tricky: Sometimes, seeing the original code makes it harder to learn a new language because the AI gets confused by the old habits.

In short, CANGJIEBENCH is a map showing us how to teach AI new skills quickly without needing to retrain the whole brain, just by giving it the right rulebook.

1. Problem Statement

Large Language Models (LLMs) have achieved remarkable proficiency in high-resource programming languages (e.g., Python, C++) but struggle significantly with low-resource general-purpose languages. Existing research on low-resource languages primarily focuses on Domain-Specific Languages (DSLs) like Verilog or Solidity. However, DSLs conflate syntactic challenges with domain-specific knowledge, making it difficult to isolate whether a model's failure is due to a lack of syntax understanding or domain expertise.

Furthermore, there is a lack of contamination-free benchmarks for general-purpose low-resource languages. Most "low-resource" languages (e.g., Lua, R) are still well represented in massive pre-training corpora, so benchmarks for them risk data leakage: models may simply recall memorized solutions rather than generalize.

Cangjie, a modern general-purpose language developed by Huawei for the HarmonyOS ecosystem, represents an ideal testbed. It is nascent (released July 2025), lacks a large-scale public corpus, and possesses a unique syntax distinct from mainstream languages, offering a rigorous environment to test LLM generalization without pre-training bias.

2. Methodology

A. Dataset Construction: CANGJIEBENCH

To address the scarcity of Cangjie code, the authors constructed CANGJIEBENCH using a translation-based strategy rather than web crawling.

Source: Manually translated 248 high-quality samples from HumanEval (function-level) and ClassEval (class-level) from Python to Cangjie.
Zero Contamination: Because the data was manually translated by experts after the Cangjie release, pre-training leakage is ruled out.
Construction Principles:
- Type Adaptation: Mapped Python types to Cangjie equivalents (e.g., int → Int64, str → String).
- Algorithm Fidelity: Preserved the original algorithmic logic of Python solutions.
- Dependency Management: Excluded tasks requiring third-party libraries unavailable in Cangjie (e.g., sqlite3, PIL).
Tasks: The benchmark supports two tasks:
1. Text-to-Code: Generating Cangjie code from natural language descriptions.
2. Code-to-Code: Translating existing Python code into Cangjie.
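
The construction principles above can be pictured as a small lookup-and-filter step. The following is a minimal sketch, not the authors' tooling: only the int → Int64 and str → String pairs and the sqlite3/PIL exclusions come from the text, while the other mapping entries and the names `PY_TO_CANGJIE`, `adapt_type`, and `is_translatable` are assumptions made for illustration.

```python
from typing import Iterable

# Hypothetical lookup table for the "Type Adaptation" principle.
# Only int -> Int64 and str -> String are taken from the text.
PY_TO_CANGJIE = {
    "int": "Int64",
    "str": "String",
    "float": "Float64",  # assumed mapping, for illustration
    "bool": "Bool",      # assumed mapping, for illustration
}

# Third-party or library modules the text names as unavailable in Cangjie.
UNAVAILABLE_DEPS = {"sqlite3", "PIL"}


def adapt_type(py_type: str) -> str:
    """Return the Cangjie type name for a simple Python type name."""
    if py_type not in PY_TO_CANGJIE:
        raise ValueError(f"no mapping defined for {py_type!r}")
    return PY_TO_CANGJIE[py_type]


def is_translatable(imports: Iterable[str]) -> bool:
    """Keep a task only if it imports none of the unavailable modules."""
    return not (set(imports) & UNAVAILABLE_DEPS)
```

In this sketch a task that imports sqlite3 would be dropped before translation, while a task that only uses built-in types would have its signature rewritten with `adapt_type`.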

B. Evaluation Framework

The authors evaluated diverse LLMs (including GPT-5, DeepSeek-V3, Qwen3, Kimi-K2) under four distinct paradigms to determine the most effective approach for generalization without parameter updates (fine-tuning):

Direct Generation: Zero-shot prompting with only the problem description.
Syntax-Constrained Generation: Augmenting prompts with concise, expert-curated grammar rules (20 categories, 2,146 tokens in total) to guide the model via in-context learning.
Retrieval-Augmented Generation (RAG):
- RAG (Docs): Retrieving official documentation via query transformation.
- RAG (Code): Retrieving few-shot code snippets from a crawled corpus.
Agent: A CLI-based agent that autonomously queries documentation and iteratively refines code, simulating a human developer's research workflow.
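
The difference between Direct and Syntax-Constrained Generation comes down to what is placed in the prompt. A minimal sketch follows, assuming a plain-text prompt layout; the function `build_prompt`, the prompt wording, and the placeholder rule strings are hypothetical and do not reproduce the paper's actual prompts or grammar rules.

```python
from typing import Optional, Sequence


def build_prompt(task: str, grammar_rules: Optional[Sequence[str]] = None) -> str:
    """Build a Direct prompt (no rules) or a Syntax-Constrained prompt (with rules)."""
    parts = []
    if grammar_rules:
        # Syntax-Constrained: prepend the grammar rules as in-context guidance.
        parts.append("Cangjie grammar rules:")
        parts.extend(f"- {rule}" for rule in grammar_rules)
        parts.append("")  # blank line between the rules and the task
    parts.append("Write a Cangjie solution for the following task:")
    parts.append(task)
    return "\n".join(parts)


# Placeholder rule text; the benchmark itself uses 20 expert-curated categories.
rules = ["<rule category 1>", "<rule category 2>"]
task = "Return the sum of a list of integers."

direct = build_prompt(task)              # Direct Generation
constrained = build_prompt(task, rules)  # Syntax-Constrained Generation
```

The task text is identical in both prompts; only the fixed block of rules differs, which is why the extra cost of this paradigm is a constant number of input tokens per problem.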

Metrics: Pass@1 (functional correctness), Compile Rate (syntactic validity), and Token Usage (computational cost).
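
As a rough illustration of how these three metrics relate, the sketch below aggregates per-problem outcomes. It assumes one generated sample per problem, so Pass@1 is simply the share of problems whose single sample passes all tests; the `Result` record and `summarize` function are hypothetical names, and the four sample records are made-up values rather than benchmark data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    compiled: bool  # the generated code compiled
    passed: bool    # it compiled and passed every test case
    tokens: int     # total tokens consumed for this problem


def summarize(results: List[Result]) -> Dict[str, float]:
    """Aggregate per-problem outcomes into the three reported metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "pass@1": 100.0 * sum(r.passed for r in results) / n,
        "compile_rate": 100.0 * sum(r.compiled for r in results) / n,
        "avg_tokens": sum(r.tokens for r in results) / n,
    }


# Made-up outcomes for four problems, for illustration only.
demo = [
    Result(compiled=True, passed=True, tokens=100),
    Result(compiled=True, passed=False, tokens=200),
    Result(compiled=False, passed=False, tokens=300),
    Result(compiled=False, passed=False, tokens=400),
]
summary = summarize(demo)
```

Since a program must compile before it can pass, Compile Rate is always at least Pass@1; the gap between the two is the share of programs that compile but produce wrong results.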

3. Key Contributions

First Cangjie Benchmark: Introduced CANGJIEBENCH, the first contamination-free benchmark for a low-resource general-purpose language, covering both generation and translation.
Novel Research Perspective: Treated Cangjie as a general-purpose language to strictly evaluate syntax learning capabilities independent of domain knowledge, distinguishing it from DSL-focused benchmarks.
Comprehensive Paradigm Evaluation: Systematically compared Direct, Syntax-Constrained, RAG, and Agent approaches, providing baselines for how LLMs adapt to unseen languages.
Discovery of Negative Transfer: Identified that Code-to-Code translation often underperforms Text-to-Code generation due to models overfitting to source language (Python) patterns.

4. Key Results

A. Performance Trends

Direct Generation Failure: Models performed poorly in zero-shot settings (Pass@1 < 5% for most models), with Compile Rates nearly identical to Pass@1. Because almost every program that compiled also passed its tests, the failures are concentrated at the compilation stage, which indicates that models lack syntactic knowledge of Cangjie rather than logical reasoning ability.
Syntax-Constrained Superiority: Injecting grammar rules yielded the best trade-off between accuracy and cost. For example, GPT-5's Pass@1 rose from 4.3% to 53.8% on Text-to-Code. This suggests LLMs already possess the underlying algorithmic logic and are missing only knowledge of the surface-level syntax.
Agent Performance: Agent-based methods (specifically GPT-5 with Codex CLI) achieved State-of-the-Art (SOTA) accuracy (77.6% Pass@1) by iteratively researching documentation. However, this came at a massive computational cost (hundreds of thousands of tokens).
RAG Limitations: RAG methods underperformed compared to Syntax-Constrained generation. RAG (Code) failed because models struggled to generalize complex grammar from isolated snippets, and RAG (Docs) failed due to poor query generation by models lacking language knowledge.
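
The iterative behaviour behind the Agent result can be pictured as a generate, compile, look-up, retry loop. The sketch below is only an illustration of that idea and does not describe how Codex CLI works internally; `refine_loop` and the three callables it receives (`generate`, `compile_check`, `lookup_docs`) are hypothetical stand-ins for a model call, a compiler invocation, and a documentation search.

```python
from typing import Callable, Optional, Tuple


def refine_loop(
    task: str,
    generate: Callable[[str, str], str],
    compile_check: Callable[[str], Optional[str]],
    lookup_docs: Callable[[str], str],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Generate code; on each compile error, consult the docs and retry.

    Returns (code, rounds_used); code is None if no attempt compiled.
    """
    context = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(task, context)
        error = compile_check(code)  # None means the code compiled
        if error is None:
            return code, round_no
        # Feed the error and the matching documentation into the next attempt.
        context += f"\nCompiler error: {error}\nDocs: {lookup_docs(error)}"
    return None, max_rounds
```

Note that `context` grows on every failed round, so each retry re-reads everything gathered so far. That accumulation is one plausible reason why a loop of this shape spends most of its tokens on input.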

B. The "Negative Transfer" Phenomenon

A critical finding is that Code-to-Code translation often performs worse than Text-to-Code generation.

Observation: Under Syntax-Constrained settings, GPT-5 dropped from 53.8% (Text-to-Code) to 38.1% (Code-to-Code).
Cause: Models tend to overfit to the source language's (Python) syntax patterns when translating line-by-line, failing to adapt to the target language's static typing and structural constraints. Text-to-Code allows the model to generate the target structure directly, avoiding this interference.
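
To put the size of this gap in perspective, the short calculation below expresses the GPT-5 drop both in percentage points and relative to the Text-to-Code score. It uses only the two Pass@1 figures quoted in the observation; the variable names are illustrative.

```python
# Pass@1 for GPT-5 under Syntax-Constrained prompting, as quoted in the text.
text_to_code = 53.8
code_to_code = 38.1

# Absolute drop in percentage points.
absolute_drop = round(text_to_code - code_to_code, 1)

# Drop relative to the Text-to-Code score, in percent.
relative_drop = round(100 * (text_to_code - code_to_code) / text_to_code, 1)

print(f"absolute drop: {absolute_drop} points")
print(f"relative drop: {relative_drop}%")
```

The drop is 15.7 percentage points, which is roughly 29% of the Text-to-Code score.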

C. Cost-Efficiency Analysis

Syntax-Constrained is the most efficient, offering high accuracy with minimal token overhead.
Agent methods are highly inefficient; input tokens (reading context) account for ~99% of total usage, making them impractical for many real-world applications despite their high accuracy.
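
The input-token share mentioned above is a simple ratio. The sketch below shows the calculation with made-up token counts chosen only to land on a 99% share; they are not measurements from the paper, and `input_share` is a hypothetical helper name.

```python
def input_share(input_tokens: int, output_tokens: int) -> float:
    """Percentage of total token usage spent on input (context reading)."""
    total = input_tokens + output_tokens
    if total == 0:
        raise ValueError("no tokens recorded")
    return 100.0 * input_tokens / total


# Hypothetical counts for one agent run, for illustration only.
share = input_share(input_tokens=396_000, output_tokens=4_000)
print(f"input share: {share:.1f}%")
```

With a share this high, nearly all of the cost goes to re-reading documentation and earlier attempts rather than to writing the final code.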

5. Significance and Future Directions

Generalization Boundaries: The study shows that LLMs can generalize to an entirely new general-purpose language via in-context learning (syntax rules) without fine-tuning, provided the model's underlying reasoning ability is sound.
Practical Implications: For emerging languages, providing structured grammar rules is more effective than retrieving noisy code snippets.
Future Work:
- Developing automated methods to extract and inject minimal grammar rules.
- Exploring semantics-aligned translation (summarizing source logic before generating target code) to mitigate negative transfer.
- Expanding benchmarks to repository-level tasks involving multi-file dependencies, on which models currently achieve near-zero success rates.
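
The semantics-aligned translation idea listed above could take the form of a two-stage pipeline in which the model never sees the Python source while writing Cangjie. This is a speculative sketch of that future direction, not something the paper implements; `translate_via_summary`, the prompt wording, and the `llm` callable (any function mapping a prompt string to a response string) are assumptions.

```python
from typing import Callable


def translate_via_summary(python_code: str, llm: Callable[[str], str]) -> str:
    """Translate Python to Cangjie through a natural-language summary."""
    # Stage 1: describe what the code does, without quoting any of it.
    summary = llm(
        "Describe in plain English what this Python code does, "
        "without quoting any code:\n" + python_code
    )
    # Stage 2: generate Cangjie from the description only, so that
    # Python surface syntax cannot leak into the second prompt.
    return llm("Write Cangjie code that implements this description:\n" + summary)
```

The design choice here is to turn a Code-to-Code problem into a Text-to-Code problem, the setting in which the models scored higher; whether the summary preserves enough detail is an open question.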

In conclusion, CANGJIEBENCH establishes a rigorous standard for evaluating LLMs on low-resource general-purpose languages, revealing that while models struggle with syntax, they can rapidly adapt when guided by explicit grammatical constraints, though they face significant challenges in cross-lingual translation due to source-language interference.