Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning

The paper introduces NEMOTRON-CROSSTHINK, a framework that extends Reinforcement Learning beyond mathematical reasoning by integrating multi-domain, multi-format data with verifiable reward structures, resulting in significant accuracy gains and improved token efficiency across diverse reasoning benchmarks.

Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturina, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Published 2026-03-17

Imagine you are trying to teach a brilliant but very narrow-minded student how to solve problems.

The Old Way (Math-Only Training):
Previously, researchers taught these AI students almost exclusively Math. Math is great for training because it has clear right and wrong answers. If the student answers 2 + 2 = 4, they get a gold star. If they answer 5, they get a red X. It's easy to grade.
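This "easy to grade" property is what researchers call a verifiable reward. A minimal sketch (illustrative only, not the paper's actual code) of such a reward: the model earns 1.0 only when its final answer matches the reference exactly.

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary reward: 1.0 for an exact (normalized) match, else 0.0."""
    def normalize(s: str) -> str:
        # Strip whitespace and lowercase so "4" and " 4 " both count.
        return s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0
```

With a grader this simple, training can run at scale with no human in the loop: the reinforcement-learning signal is just whether the final answer checks out.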

But here's the problem: If you only teach a student math, they become a math genius who struggles to write a poem, understand a legal contract, or figure out why a friend is upset. They lack "general common sense" because they've never practiced those types of thinking.

The New Solution: NEMOTRON-CROSSTHINK
The researchers at NVIDIA and CMU came up with a new training camp called NEMOTRON-CROSSTHINK. Instead of just doing math drills, they threw the student into a "multiverse" of different challenges.

Here is how they did it, using simple analogies:

1. The "Gym" with Mixed Equipment

Imagine a gym.

  • The Old Gym: Had only weightlifting machines. You got strong, but only in your arms.
  • The NEMOTRON Gym: Has weightlifting, but also yoga, swimming, rock climbing, and chess.
    The AI is trained on Math (weightlifting) plus General Reasoning (yoga, law, science, history). By mixing these, the AI learns to be strong in everything, not just numbers.
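In practice, "mixing equipment" means blending training examples from several domains according to sampling weights. A toy sketch of that idea (the domain names and weights here are made up for illustration, not the paper's actual data mix):

```python
import random

def blend_batch(sources: dict, weights: dict, batch_size: int, seed: int = 0):
    """Draw a training batch by sampling domains according to `weights`,
    then picking a random example from the chosen domain's pool."""
    rng = random.Random(seed)
    domains = list(sources)
    probs = [weights[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(domains, weights=probs, k=1)[0]
        batch.append(rng.choice(sources[domain]))
    return batch
```

Tuning those weights (how much "weightlifting" versus how much "yoga") is itself one of the knobs the paper studies.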

2. The "Answer Sheet" Problem (Templates)

In the real world, some questions are multiple-choice (like a quiz), and some are open-ended (like an essay).

  • The Issue: If you ask an AI an open-ended question, it might ramble. If you ask a multiple-choice question, it might just guess. This makes it hard for the teacher (the computer) to know if the AI is actually thinking or just lucky.
  • The Fix: The researchers put strict templates on the answers.
    • Analogy: Imagine telling the student, "You must write your answer in a specific box, and you can only use 10 words."
    • This forces the AI to be concise and precise. It stops the AI from "hallucinating" (making things up) or guessing randomly. It turns a messy essay into a clean, verifiable answer that the computer can easily grade.
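The template idea above can be sketched in a few lines. This is a hypothetical illustration (the prompt wording and the strict answer check are assumptions, not the paper's exact templates): the question is wrapped so that only a bare option letter counts as an answer, which makes grading trivial and rambling impossible.

```python
import re

MCQ_TEMPLATE = (
    "{question}\n"
    "Options: {options}\n"
    "Answer with only the letter of the correct option."
)

def format_mcq(question: str, options: list) -> str:
    """Wrap a question in a strict multiple-choice template."""
    letters = "ABCD"
    opts = " ".join(f"({l}) {o}" for l, o in zip(letters, options))
    return MCQ_TEMPLATE.format(question=question, options=opts)

def extract_answer(response: str):
    """Accept only a bare letter like 'B' or '(B)'; reject everything else."""
    match = re.fullmatch(r"\(?([A-D])\)?", response.strip())
    return match.group(1) if match else None
```

A rambling response like "I think the answer might be B because..." fails extraction outright, so the model only gets credit when it commits to a clean, checkable answer.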

3. The "Hard Mode" Filter

Not all practice questions are created equal. Some are too easy (a 5-year-old could answer them), and some are too hard.

  • The Strategy: The researchers used a "filter." They asked a smaller, weaker AI to try the questions first.
    • If the weak AI got it right? Discard it. It's too easy; it won't help the big AI learn.
    • If the weak AI got it wrong? Keep it. This is the "Goldilocks" zone—challenging enough to force the big AI to stretch its brain.
  • Analogy: It's like a coach telling a pro athlete, "Don't practice lifting 5 lbs; that's easy. Lift 200 lbs. That's where you get stronger."
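The filter itself is simple to sketch. In this illustrative version (`probe_model` is a stand-in for any smaller model's answer function, not an API from the paper), a question survives only if the weak model gets it wrong:

```python
def filter_hard_questions(questions: list, probe_model) -> list:
    """Keep only questions the weaker probe model answers incorrectly.

    `questions` is a list of dicts with 'prompt' and 'answer' keys;
    `probe_model` maps a prompt string to the probe's answer string.
    """
    kept = []
    for q in questions:
        predicted = probe_model(q["prompt"])
        if predicted != q["answer"]:  # probe failed -> question is hard enough
            kept.append(q)
    return kept
```

Everything the probe model aces is discarded as too easy, leaving a training set concentrated in the "Goldilocks" zone.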

4. The Result: The "Smart & Efficient" Thinker

When they tested this new AI, the results were amazing:

  • Smarter: It got much better at math (up 30%!) and also got much better at non-math stuff like law, science, and general knowledge (up 12-15%).
  • Faster & Cheaper: This is the coolest part. The AI learned to think more efficiently.
    • Analogy: Imagine two people solving a puzzle. One talks out loud for 10 minutes, trying every wrong piece. The other looks at the puzzle, thinks for a second, and places the right piece.
    • The NEMOTRON AI did the latter. It used 28% fewer words (tokens) to get the right answer. It didn't waste time rambling; it went straight to the point.

Why This Matters

Before this, AI researchers were stuck in a loop: "We can only train AI on Math because it's the only thing we can grade easily."

NEMOTRON-CROSSTHINK broke that loop. It showed that if you organize the data correctly (using templates and filters), you can teach AI to be a generalist—a thinker that is just as good at writing a legal brief as it is at solving a calculus problem, all while using less computer power.

In a nutshell: They took a math genius, forced it to study law, history, and science, taught it to answer concisely, and filtered out the easy stuff. The result? A super-smart, super-efficient AI that can handle almost any problem you throw at it.
