Imagine you are a librarian in a massive, chaotic library containing millions of books written in a secret code. Your job is to find the exact book a patron asks for, just by hearing a description of what they need.
For a long time, librarians (the AI models) have been getting really good at this, but mostly because they've been cheating. They've learned to look at the titles and spine colors of the books (the variable names and keywords) rather than actually reading the story inside to understand what it's about.
This paper introduces a new, much harder test called CLARC to see if these librarians are actually smart or just really good at guessing based on titles.
Here is the breakdown of what the researchers did, using some everyday analogies:
1. The Problem: The "Title-Dependent" Librarians
Most previous tests for code-searching AI used high-level languages (like Python) and short, self-contained questions. The AI models learned that if a user asks for a "calculator," they should look for code with the word "calculator" or "add" in it.
- The Flaw: If you change the title of the book to "The Math Machine" but keep the story exactly the same, the old librarians get confused and can't find it. They rely on lexical cues (the specific words used) rather than semantics (the actual meaning).
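The "calculator" example can be made concrete in a few lines of code. Below is a deliberately crude sketch of lexical matching, with invented snippets and query (real retrievers use learned embeddings, but the failure mode is the same):

```python
import re

# A toy keyword-overlap "librarian": it scores a book (code) purely by
# how many query words appear in its text. Snippets and query are made
# up for illustration, not taken from the paper.

def lexical_score(query, code):
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    code_words = set(re.findall(r"[a-z]+", code.lower()))
    return len(query_words & code_words)

query = "a calculator that can add numbers"

original = "int calculator_add(int a, int b) { return a + b; }"
renamed = "int math_machine(int x, int y) { return x + y; }"  # same logic, new "title"

print(lexical_score(query, original))  # 3 -- matches on the "title"
print(lexical_score(query, renamed))   # 0 -- same story, but the scanner is lost
```

Rename the function and the score drops to zero, even though the two snippets do exactly the same thing.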
2. The Solution: The "CLARC" Challenge
The researchers built a new benchmark called CLARC (C/C++ Language Retrieval with Anonymized Code). Think of this as a "Stress Test" for librarians. They took real-world code (like the blueprints for actual software) and created three specific "traps" to see if the AI could still find the right book.
Trap A: The "Blindfold" (Identifier Anonymization)
Imagine taking a book and replacing every character's name with "Person A," "Person B," and "Object C."
- The Test: The AI has to find a story about "a hero saving a princess" even if the text now says "Person A saves Person B."
- The Result: The models' performance collapsed. They couldn't find the right book because the familiar names were gone, which proved they weren't reading the story; they were just matching names.
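In code terms, the blindfold looks something like this. A minimal sketch of identifier anonymization, using a simple regex-based renamer (the benchmark itself presumably uses proper parsing, and the snippet below is an invented example):

```python
import re

# Rough sketch of the "blindfold": rename every identifier in a C
# snippet to VAR0, VAR1, ... while leaving language keywords alone.

C_KEYWORDS = {"int", "return", "if", "else", "for", "while", "void", "char"}

def anonymize(code):
    names = {}
    def rename(match):
        word = match.group(0)
        if word in C_KEYWORDS:
            return word
        if word not in names:
            names[word] = f"VAR{len(names)}"  # "Person A", "Person B", ...
        return names[word]
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", rename, code)

src = "int compute_average(int total, int count) { return total / count; }"
print(anonymize(src))
# Every meaningful name is gone; only the structure and logic survive.
```

After this pass the logic is untouched, but a name-matching retriever has nothing left to grab onto.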
Trap B: The "Translation" (Assembly & WebAssembly)
Imagine translating a complex novel into a language made entirely of numbers and symbols (like Morse code or binary), where the original words are completely gone.
- The Test: The AI has to find a story about "cooking a cake" when the text is now just a list of chemical reactions and oven temperatures.
- The Result: The AI got lost. It showed that these models are terrible at understanding code when it's stripped down to its raw, low-level instructions.
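To see why the translation is so brutal, compare the words available for matching before and after compilation. The assembly below is roughly what an x86-64 compiler emits for a two-integer add, with the symbol name stripped as it would be in a stripped binary (a hand-written approximation, not the benchmark's actual data):

```python
import re

# The "translation" trap, sketched: the same function as C source and
# as (approximate) stripped x86-64 assembly.

source = "int add_numbers(int a, int b) { return a + b; }"
assembly = "func_0x401000: lea eax, [rdi + rsi]\n ret"

def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

query = words("add two numbers together")

print(query & words(source))    # the function name still shares query words
print(query & words(assembly))  # empty set -- nothing left to match on
```

The logic ("add two inputs") is still fully present in the assembly, but not a single English word survives, so a lexical matcher has zero signal.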
Trap C: The "Dependency" Puzzle
Some code is like a solo act; other code is like a band where the main singer needs a drummer and a guitarist to make sense.
- The Test: The researchers checked whether the AI could still find the main song when it also needed the drummer's and guitarist's sheet music to make sense of it.
- The Result: The AI struggled when the code was complex and relied on other helper functions, showing it has trouble seeing the "big picture."
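In code, the band analogy looks like this (a hypothetical example, not taken from the benchmark):

```python
# The dependency puzzle: "process" alone says almost nothing about what
# it does -- its meaning lives in the helpers it calls. All names and
# behavior here are invented for illustration.

def scale(value):      # helper 1: the "drummer"
    return value * 2

def shift(value):      # helper 2: the "guitarist"
    return value + 10

def process(values):   # the "main singer"
    # Read in isolation, this is just "apply two functions to a list".
    # Only with the helpers' definitions does it become
    # "double each value, then add ten".
    return [shift(scale(v)) for v in values]

print(process([1, 2, 3]))  # [12, 14, 16]
```

A retriever that only reads `process` can't tell whether it doubles, encrypts, or deletes the input; it has to follow the call graph, and that's exactly where the models struggled.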
3. The Findings: "They Are Cheating!"
The researchers tested six of the smartest AI models available today.
- In normal conditions: The models were great, finding the right code 90%+ of the time.
- In the "Blindfold" and "Translation" traps: Their performance plummeted. Some dropped to near-zero accuracy.
The Big Conclusion: The current state-of-the-art AI models are like students who memorize the answers to a test but don't understand the subject. If you change the wording of the question (anonymize the code) or ask the question in a different language (compile to Assembly), they fail. They rely on surface-level patterns (the words used) rather than deep understanding (the logic of the code).
4. Why This Matters
The paper isn't just saying "AI is dumb." It's saying, "We need to build better AI."
- Security: If hackers obfuscate code (hide the names) to hide malware, current AI tools might not be able to detect it because they rely on the names.
- Real-World Use: Real software is complex. If we want AI to help developers, it needs to understand the logic, not just the labels.
Summary Analogy
Think of the current AI models as a super-fast barcode scanner.
- If you scan a book with a barcode, it finds it instantly.
- But if you take the barcode off (anonymize the code) or translate the book into Braille (Assembly), the scanner stops working because it doesn't actually "know" what the book is about.
CLARC is the new test that forces the scanner to learn how to read the book itself, not just scan the barcode. The paper shows that right now, our scanners are still just scanning barcodes, and we have a long way to go before they can truly read.