The Big Problem: The "Cheat Sheet" vs. The "Real Understanding" Gap
Imagine you are training a student (an AI) to solve math problems. Currently, these students are incredibly good at pattern matching.
If you show them a problem that looks like a "pizza slice" problem, they instantly recall the "pizza formula" they memorized and apply it. They get the right answer, but they don't actually understand why the formula works or what a "pizza" (or a mathematical concept) really is.
The researchers found a funny flaw in these students:
- The Test: Ask the student to define "Linear Independence" (a math concept). They recite the textbook definition perfectly.
- The Trap: Give them a problem that requires using that concept, but change the wording slightly so the "pizza formula" doesn't fit.
- The Result: The student fails. They can't connect the definition they just recited to the actual problem. They are stuck using "cheat codes" (surface patterns) instead of genuine understanding.
This is called the Definition–Application Gap. The AI knows the words, but it doesn't know how to use them.
The Solution: CORE (Concept-Oriented Reinforcement)
The authors created a new training method called CORE. Think of CORE not as teaching the student more facts, but as forcing them to stop and think about the tools they are using before they start building.
Here is how CORE works, broken down into three simple steps:
1. The "Toolbox" (Data Curation)
Instead of just giving the AI thousands of random math problems, the researchers went to a classic, high-quality math textbook. They created a special "Toolbox" where every problem is explicitly linked to the specific concept (the tool) needed to solve it.
- Analogy: Instead of just throwing the student into a kitchen and saying "Make dinner," they give them a recipe card that says: "This dish requires the Knife (Concept A) and the Pan (Concept B)."
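The "recipe card" idea above can be sketched in code. This is a minimal illustration of what a concept-linked Toolbox entry *might* look like; the field names and schema here are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical sketch of a concept-linked "Toolbox" entry.
# Field names (question, answer, concepts) are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ToolboxProblem:
    question: str                                  # the math problem text
    answer: str                                    # the reference answer
    concepts: list = field(default_factory=list)   # the "tools" required

# Example: a problem explicitly tagged with the concept it requires.
toolbox = [
    ToolboxProblem(
        question="Do the vectors (1, 2) and (2, 4) span R^2?",
        answer="No; (2, 4) = 2 * (1, 2), so they are linearly dependent.",
        concepts=["Linear Independence"],
    ),
]

# Training code can then look up which "tool" a problem needs.
print(toolbox[0].concepts)  # ['Linear Independence']
```

The key design point is the explicit `concepts` field: every problem carries a pointer to the tool needed to solve it, instead of leaving the model to guess from surface patterns.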
2. The "Intervention" (The Training Magic)
This is the heart of the paper. CORE doesn't just tell a struggling AI "Wrong, try again." It comes in three variants, each intervening in a cleverer way:
- CORE-Base (The Direct Lesson): The AI is trained directly on these "Toolbox" problems. It learns to associate the problem type with the specific concept needed.
- CORE-CR (The "Hint" Intervention): Imagine the AI is stuck. CORE says, "Okay, you failed. Here is the specific concept you needed (e.g., 'Remember the Rational Root Theorem'). Now, try solving it again using that hint."
- If the AI gets it right with the hint, CORE replaces the "failed attempt" with the "successful hint-based attempt" in its memory. It teaches the AI: "When you see this, grab this tool first."
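The CORE-CR loop above can be sketched as pseudocode-style Python. Everything here is a hypothetical stand-in: `model_solve`, `is_correct`, and the replay buffer are illustrative names, and the fake model is rigged so the hinted retry succeeds, just to show the replacement logic.

```python
# Hedged sketch of the CORE-CR "hint intervention" loop.
# `model_solve` and `is_correct` are stand-ins, not the paper's API.

def model_solve(prompt):
    # Placeholder for the model generating a solution. Here we fake it:
    # this toy "model" only succeeds when the hint is present.
    return "correct solution" if "Hint:" in prompt else "wrong solution"

def is_correct(solution, answer):
    return solution == answer

def core_cr_step(question, answer, concept_hint, replay_buffer):
    attempt = model_solve(question)
    if is_correct(attempt, answer):
        replay_buffer.append((question, attempt))   # keep the success as-is
        return
    # Failed: retry with the needed concept injected as a hint.
    hinted = model_solve(f"Hint: use {concept_hint}.\n{question}")
    if is_correct(hinted, answer):
        # Replace the failed trace with the hint-guided success, stored
        # against the ORIGINAL hint-free question, so the model learns
        # to reach for the right tool without being told.
        replay_buffer.append((question, hinted))

buffer = []
core_cr_step("Find the rational roots of x^3 - x - 6.", "correct solution",
             "the Rational Root Theorem", buffer)
print(buffer)
```

Note the detail in the last comment: the successful, hint-guided solution is paired with the *unhinted* question, which is what pushes the model to internalize the concept rather than wait for the hint.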
- CORE-KL (The "Ghost" Guidance): This is a bit more subtle. The AI tries to solve the problem on its own. Simultaneously, a "ghost" version of the AI (one that has the concept hint) solves it perfectly. CORE forces the real AI to mimic the thought process of the ghost, even though the real AI didn't have the hint. It's like a dance instructor guiding your hands so you learn the rhythm, even if you can't see the music sheet yet.
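The "ghost guidance" in CORE-KL suggests a KL-divergence loss between two versions of the model: one that sees the concept hint and one that doesn't. The toy numbers below are invented for illustration, and this is only a plausible sketch of such a loss, not the paper's actual training objective.

```python
# Illustrative sketch of a KL-style guidance loss: push the hint-free
# "student" distribution toward the hinted "ghost" distribution.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q): how far the student's distribution q is from the
    # ghost's distribution p. Zero when they match exactly.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy next-token logits over a 3-word vocabulary (made-up numbers).
ghost_logits = [2.0, 0.5, -1.0]    # same model, WITH the concept hint
student_logits = [0.2, 0.3, 0.1]   # same model, WITHOUT the hint

loss = kl(softmax(ghost_logits), softmax(student_logits))
print(loss > 0)  # positive: the student hasn't matched the ghost yet
```

Minimizing this loss nudges the hint-free model to produce the same token-by-token "thought process" as its hinted twin, which is the dance-instructor effect described above.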
3. The "No Cheating" Rule (Evaluation)
The most important part of the test is that during the final exam, the AI is NOT allowed to see the concept hints.
- If the AI gets the answer right, it proves it has truly internalized the concept. It's no longer relying on the cheat sheet; it has learned the skill.
Why This Matters (The Results)
The researchers tested this on several different AI models (like Qwen, Llama, and DeepSeek). Here is what happened:
- Before CORE: The AI was like a parrot. It could repeat definitions and solve standard problems, but if you changed the wording slightly, it got confused.
- After CORE: The AI became more like a mechanic. It didn't just memorize how to fix a specific car model; it understood how engines work.
- It solved harder problems it had never seen before.
- It was less likely to get tricked by "distractors" (fake clues in the question).
- It improved its ability to pick the right "tool" for the job.
The Takeaway
Think of current AI math skills as rote memorization. You can memorize the steps to solve a specific puzzle, but if the puzzle changes shape, you are lost.
CORE changes the training so the AI learns principles. It forces the AI to pause, identify the right mathematical concept (the "tool"), and apply it deliberately. It bridges the gap between "I know the definition" and "I know how to use it," turning a pattern-matching machine into a genuine reasoning engine.
And the best part? They didn't need to rebuild the AI's brain (architecture). They just changed how they taught it.