Imagine you are trying to give a very specific, complicated set of instructions to a brilliant but slightly literal-minded assistant. You say, "Write me a story about a cat, but make sure it's exactly 300 words, use no commas, and highlight three sections."
Sometimes, the assistant gets confused. They might write a great story but forget the "no commas" rule, or they might highlight the wrong parts. They understand the spirit of your request, but they struggle with the structure of it.
This paper is about teaching Large Language Models (the "assistants") a new way to think before they speak. Instead of just listening to your natural language request and immediately trying to answer, the paper teaches them to first translate your request into pseudo-code.
Here is the breakdown of their idea using some simple analogies:
1. The Problem: The "Overwhelmed Chef"
Think of a Large Language Model (LLM) like a world-class chef who has tasted every dish in history. However, if you give them a complex order like, "Make a lasagna, but layer it with chocolate instead of cheese, cut it into triangles, and serve it on a Tuesday," they might get confused. They know how to make lasagna, and they know what chocolate is, but combining all those specific constraints at once is hard. They might forget the "triangles" part or the "Tuesday" part.
2. The Old Solution: "Inference-Time Prompting" (The Cheat Sheet)
Previously, researchers tried to fix this by handing the chef a "cheat sheet" (few-shot prompting) every time you ordered. They would say, "Hey chef, remember: when I ask for a weird dish, first write down a recipe in code before you cook."
- The Flaw: This is tedious. You have to write that cheat sheet every single time. Also, if the chef forgets to look at the cheat sheet, they mess up. It's like trying to teach a dog to sit by holding a treat in front of its nose every time—it works in the moment, but the dog doesn't actually learn the behavior.
3. The New Solution: "Training-Time Pseudo-Code" (The Internal Monologue)
The authors of this paper decided to change the training, not the prompting. They taught the chefs (the models) a new habit during their "cooking school" (training phase).
They said: "From now on, whenever you get an order, you must first write a recipe in a structured, code-like language before you start cooking."
- The Analogy: Imagine the chef is forced to write a step-by-step flowchart on a whiteboard before touching a knife.
- Natural Language Request: "Write me a story about a cat, but make sure it's exactly 300 words, use no commas, and highlight three sections."
- The Chef's Internal Pseudo-Code:
  1. Define topic: "Cat"
  2. Constraint: Word count = 300
  3. Constraint: No commas allowed
  4. Constraint: Highlight 3 sections
  5. Execute: Write story
- The Result: Because the chef had to write the constraints down in a rigid, logical format first, they are much less likely to forget them when they actually write the story.
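To make the "recipe" idea concrete, here is a minimal Python sketch of how a natural-language request becomes an explicit, structured plan. The class and field names are illustrative inventions for this article, not anything from the paper:

```python
# Illustrative only: a natural-language request rewritten as an
# explicit, code-like plan before any text is generated.
from dataclasses import dataclass, field

@dataclass
class WritingPlan:
    topic: str
    constraints: list = field(default_factory=list)

    def describe(self):
        # Render the plan as the numbered "recipe" the model writes first.
        steps = [f"1. Define topic: {self.topic!r}"]
        steps += [f"{i + 2}. Constraint: {c}" for i, c in enumerate(self.constraints)]
        steps.append(f"{len(self.constraints) + 2}. Execute: write the text")
        return "\n".join(steps)

plan = WritingPlan(
    topic="Cat",
    constraints=["word count == 300", "no commas allowed", "highlight 3 sections"],
)
print(plan.describe())
```

The point is that every constraint now has its own explicit slot, so none of them can silently vanish into a wall of prose.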
4. How They Did It (The "Repair Shop")
The researchers didn't just ask the models to guess the code. They built a pipeline:
- Generate: They used a super-smart model to turn human instructions into this "pseudo-code recipe."
- Evaluate: They checked if the recipe actually worked. Did following the recipe produce the right answer?
- Repair: If the recipe was buggy (like a cooking instruction that said "add salt" but forgot to say when), they fixed it. They did this automatically, creating a massive library of "Instruction + Pseudo-Code Recipe + Final Answer" pairs.
Then, they trained six different models on this new library.
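The generate/evaluate/repair loop above can be sketched in a few lines of Python. This is a hedged sketch of the general pattern, not the paper's actual pipeline: the function names, the retry budget, and the toy stand-ins are all my assumptions.

```python
# Hypothetical sketch of the generate -> evaluate -> repair loop used to
# build the training library. All function names are illustrative.
def build_training_example(instruction, generate, evaluate, repair, max_repairs=2):
    """Return an (instruction, pseudo_code, answer) triple, or None if unfixable."""
    pseudo_code = generate(instruction)  # strong model writes the "recipe"
    for _ in range(max_repairs + 1):
        answer, ok = evaluate(instruction, pseudo_code)  # does the recipe work?
        if ok:
            return (instruction, pseudo_code, answer)    # keep the good triple
        pseudo_code = repair(instruction, pseudo_code)   # fix the buggy recipe
    return None  # discard examples that never pass evaluation

# Toy stand-ins so the loop is runnable end to end:
gen = lambda ins: "PLAN: " + ins
ev = lambda ins, pc: (pc.lower(), pc.startswith("PLAN:"))
rep = lambda ins, pc: "PLAN: " + pc
print(build_training_example("say hi", gen, ev, rep))
```

Running this pattern over a large pool of instructions is what produces the library of "Instruction + Pseudo-Code Recipe + Final Answer" pairs described above.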
5. The Results: "The Organized Thinker"
When they tested these new models, the results were impressive:
- Better at Following Rules: The models became much better at following complex, multi-part instructions (like the "no commas" or "highlight 3 sections" rules). They improved by 8% to 21% on instruction-following tests.
- Didn't Lose Smarts: A common fear is that teaching a model to think in code might make it worse at other things, like math or common sense. But the paper found the opposite: the models stayed just as good at math and reasoning, and in some cases, got even better.
- No Extra Work for You: The best part? When you talk to these new models, you don't have to write any code. You just ask your question in normal English. The model internally translates it to pseudo-code, solves it, and gives you the answer. It's a "drop-in replacement."
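Rules like "no commas" or "highlight 3 sections" are attractive test cases precisely because a program can verify them. Here is a minimal checker in that spirit; the specific rules mirror this article's running example, and the function is my own sketch, not any benchmark's actual code:

```python
import re

def check_story(text, exact_words=300):
    """Verify the running example's rules: word count, no commas, 3 highlights."""
    return {
        "word_count": len(text.split()) == exact_words,
        "no_commas": "," not in text,
        # Count markdown-style *highlighted* sections.
        "three_highlights": len(re.findall(r"\*[^*]+\*", text)) == 3,
    }

story = " ".join(["word"] * 297) + " *one* *two* *three*"
print(check_story(story))
```

Checkers like this are what let the "8% to 21%" improvements be measured objectively rather than judged by eye.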
The Big Picture
Think of this like teaching a student to outline before writing an essay.
- Before: The student just started writing immediately. They often forgot the prompt's requirements.
- After: The student is trained to stop, write a structured outline (the pseudo-code), check their constraints, and then write the essay.
The paper proves that forcing a model to "think in code" (even if it's just a simplified, human-readable pseudo-code) acts as a powerful organizer for its brain, helping it handle complex, tricky instructions much more reliably than before.