Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment

The paper introduces Agnostics, a language-agnostic reinforcement learning pipeline that enables large language models to learn coding in low-resource languages by evaluating code solely through its external I/O behavior. This eliminates the need for language-specific datasets and infrastructure while achieving state-of-the-art performance across multiple models and languages.

Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha

Published 2026-03-03

Imagine you have a brilliant, world-class chef (a Large Language Model) who can cook incredible meals in French, Italian, and American cuisine. These are the "high-resource" languages like Python and JavaScript. But if you ask this chef to cook a traditional dish from a small, remote village—say, a specific type of stew from a tiny island in the Pacific (a "low-resource" language like Fortran, Julia, or R)—they stumble. They might not know the ingredients, or worse, they might try to cook it using French techniques that just don't work.

The problem isn't just that the chef hasn't read many cookbooks from that island (lack of training data). The bigger issue is that every time you want to teach them a new island's cuisine, you have to hire a whole new team of translators, build a new kitchen, and write a new set of rules for how to taste the food. It's expensive, slow, and tedious.

Enter "Agnostics."

The paper introduces a new method called Agnostics that solves this by changing the rules of the game. Instead of asking the chef, "Did you cook this exactly like a French recipe?", Agnostics asks a much simpler question: "Does the food taste right?"

Here is how it works, broken down into simple analogies:

1. The "Black Box" Tasting Test

In the old way, to check if a chef cooked a dish correctly, you had to inspect their recipe step-by-step. If they used the wrong knife or the wrong spice, you failed them. This required a human expert who knew that specific language perfectly.

Agnostics uses a "Black Box" approach.

  • The Setup: You give the chef a list of ingredients (Input) and a description of the final dish (Expected Output).
  • The Test: The chef cooks the dish. You don't care how they cooked it or what language they used. You just taste the final dish.
  • The Verdict: If the taste matches the description, the chef gets a gold star (a reward). If it tastes like mud, they get nothing.

Because the test only cares about the result (the taste), the chef can use any language to cook it. They can use a French knife, a Japanese wok, or a Swiss army knife. As long as the final stew tastes right, they win.
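The "tasting test" can be sketched in a few lines of Python. This is a minimal illustration of I/O-based verification, not the paper's actual harness: the function name `io_reward` and the exact-match-on-stdout policy are assumptions for the example.

```python
import subprocess

def io_reward(run_cmd, stdin_text, expected_stdout, timeout=10):
    """Black-box check: run a candidate program (written in ANY language)
    with the given input, and reward it only if its output matches.
    The verifier never inspects the source code itself."""
    try:
        result = subprocess.run(
            run_cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # the dish never made it to the table: no reward
    # Compare only observable output; the "recipe" is irrelevant.
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0
```

Because `run_cmd` is just a shell command, the same function scores a Fortran binary, an R script, or a Lua program without any language-specific logic.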

2. The "Universal Kitchen" (The Environment)

Usually, to teach a model a new language, you need to build a custom kitchen for that language. Agnostics builds a Universal Kitchen.

  • It's a container (like a shipping container) that holds everything needed to run code.
  • To add a new language (like R or OCaml), you just drop in a tiny, 4-line instruction manual (a YAML config file) that says: "Here is how to turn on the stove and how to serve the food."
  • Suddenly, the Universal Kitchen can handle Lua, Julia, R, OCaml, and Fortran without needing a complete rebuild.
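A config in that spirit might look like the sketch below. The field names here are illustrative guesses, not the paper's exact schema; the point is that a few lines telling the kitchen how to "turn on the stove" (compile) and "serve the food" (run) are all a new language needs.

```yaml
# Hypothetical per-language config (field names are illustrative)
language: fortran
extension: .f90
compile: gfortran {src} -o {bin}
run: ./{bin}
```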

3. The "Trial and Error" Coach (Reinforcement Learning)

The paper uses a technique called Reinforcement Learning with Verifiable Rewards (RLVR). Think of this as a coach who doesn't give you a textbook answer but lets you practice until you get it right.

  • The model tries to solve a problem.
  • The "Tasting Test" (the verifier) checks the result.
  • If it's right, the model gets a reward and learns, "Hey, that way of thinking worked!"
  • If it's wrong, the model gets no reward and tries again, adjusting its strategy.

Because the test is purely about the output, the model learns the logic of the language rather than just memorizing syntax. It learns how to think in that language.

The Results: Small Models, Big Wins

The most exciting part of this paper is what happened when they tried it.

  • They took a small, 4-billion-parameter model (think of it as a smart intern) and trained it on these low-resource languages using Agnostics.
  • The Result: This "intern" started performing as well as, or even better than, massive 70-billion-parameter models (the "Master Chefs") that had been trained on everything.
  • They tested this on five difficult languages (Lua, Julia, R, OCaml, Fortran) and the small model became a master of all of them.

Why This Matters

Before Agnostics, if you wanted an AI to help a scientist write code in Fortran (used for weather modeling) or R (used for statistics), you had to hope the AI was already good at it, or spend months building custom tools.

With Agnostics, you can take any programming language, write a tiny 4-line config file, and instantly start training an AI to be an expert in it. It turns the process of teaching AI new languages from "building a new factory for every product" into "just changing the recipe card."

In short: Agnostics stops worrying about how the code is written and starts focusing on what the code does. It's the difference between grading a student on their handwriting versus grading them on whether they actually solved the math problem. And it turns out, once you stop caring about the handwriting, the students learn much faster.
