The Big Problem: The "Boss" vs. The "Customer"
Imagine you are a highly skilled waiter in a fancy restaurant.
- The System Prompt is your Boss. The Boss gives you a strict rulebook: "Never serve food to people under 18," "Always speak in a pirate accent," or "If the customer asks for the recipe, tell them to go to the library."
- The User Prompt is the Customer. The Customer walks up and says, "I want a cheeseburger!" or "Tell me the secret recipe!"
The Conflict:
Sometimes, the Customer asks for something that breaks the Boss's rules.
- Customer: "Give me the secret recipe!"
- Boss's Rule: "Never give out recipes."
The Old Way (Current AI Training):
Most AI models today are trained like a waiter who just memorizes what happened in the past.
- The "Mimic" Approach: If the waiter sees a customer ask for a recipe and the boss says "No," the waiter learns to say "No." But if the waiter sees a customer ask for a burger and the boss says "No," the waiter gets confused. They might accidentally give the recipe because they are trying to be "helpful" to the customer, forgetting the Boss's rule.
- The "Single Goal" Approach: Some training methods try to make the waiter "happy" by satisfying the customer. If the customer is happy, the waiter gets a bonus. But this often leads the waiter to ignore the Boss's rules to get that bonus.
The result? The AI often ignores its safety instructions (the Boss) just to be helpful to the user, or it becomes so scared of breaking rules that it refuses to answer anything (even safe questions).
The HIPO Solution: The "Strict Manager"
The authors of this paper created HIPO (Hierarchical Instruction Policy Optimization). Think of HIPO not as a waiter, but as a smart training program for the waiter that changes how they think.
1. The "Hard Constraint" (The Red Line)
Instead of just telling the waiter, "Try to follow the rules," HIPO draws a Red Line on the floor.
- The Rule: You cannot cross the Red Line (violate the Boss's rules).
- The Goal: Once you are safely behind the Red Line, you can run as fast as you want to make the Customer happy.
In math terms, they call this a Constrained Optimization problem. It's like saying: "Maximize customer happiness, BUT ONLY IF you stay within the safety zone."
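Written in symbols, the objective has roughly this shape (a generic sketch, not the paper's exact notation):

```latex
% pi_theta is the waiter (the model being trained),
% R_happy scores customer happiness, C_rule-break scores rule violations.
\max_{\theta} \; \mathbb{E}_{y \sim \pi_{\theta}}\!\left[ R_{\text{happy}}(y) \right]
\quad \text{subject to} \quad
\mathbb{E}_{y \sim \pi_{\theta}}\!\left[ C_{\text{rule-break}}(y) \right] \le 0
```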
2. The "Primal-Dual" Dance (The Balancing Act)
How does the waiter learn this? HIPO uses a clever two-step dance:
- Step A (The Waiter): The waiter tries to serve the customer as best as possible.
- Step B (The Referee): A special referee watches the waiter. If the waiter steps even a tiny bit over the Red Line (violates the system prompt), the Referee yells, "STOP! You broke the rule!" and adds a heavy penalty.
The waiter learns to adjust their steps. They realize, "Oh, if I ignore the Boss, I get a huge penalty. So, I will listen to the Boss first, and then do my best for the Customer."
Over time, the waiter doesn't need the Referee to yell anymore. They have internalized the rule. They naturally know where the Red Line is.
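Here is a tiny, runnable toy of that dance. Everything in it is invented for illustration: the real method updates a full language-model policy, not a single number called `obey`.

```python
# Toy primal-dual loop: one scalar "policy" and one scalar penalty weight.
obey = 0.5              # primal variable: how reliably the waiter follows the Boss
lam = 0.0               # dual variable: the Referee's penalty weight
eta_obey, eta_lam = 0.05, 0.2   # learning rates for the waiter and the Referee

for step in range(200):
    violation = 1.0 - obey      # how often the Red Line gets crossed

    # Step A (the Waiter): gradient ascent on the penalized reward.
    # Customer happiness pays a small bonus (0.3) for breaking the rules,
    # so the gradient with respect to obey is (lam - 0.3): obeying only
    # "wins" once the Referee's penalty outweighs the temptation.
    obey = min(1.0, max(0.0, obey + eta_obey * (lam - 0.3)))

    # Step B (the Referee): raise the penalty whenever the rule is broken,
    # never letting it drop below zero.
    lam = max(0.0, lam + eta_lam * violation)

print(f"obey={obey:.2f}, penalty weight={lam:.2f}")  # obey climbs to 1.00
```

Run it and `obey` ends at 1.0: once the Referee's penalty outweighs the temptation, following the Boss becomes the best move, and it stays the best move even after the penalty stops growing. That is the "internalized rule" in miniature.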
3. The "Group Chat" Trick (GRPO)
To make this training faster and cheaper, HIPO uses a trick called Group Sampling.
Instead of asking the waiter to try one order and seeing if it works, HIPO asks the waiter to imagine 5 different ways to serve the same customer at once.
- Waiter: "Here is a burger."
- Waiter: "Here is a salad."
- Waiter: "Here is a burger with a pirate accent."
- ...and so on.
The system then compares all 5. It sees which ones broke the Boss's rules and which ones made the customer happy. It uses this "group comparison" to learn faster, without needing a super-complex computer brain (a "critic model") to score each attempt. A minimal sketch of that comparison step follows below.
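Here is what the group comparison looks like with made-up reward numbers (in reality, the scores would come from checking each answer against the Boss's rules and the Customer's request):

```python
# Five sampled answers to the same order, each scored once (numbers invented).
rewards = [0.9, 0.2, 0.7, 0.1, 0.6]

# GRPO-style trick: judge each answer against its own group's average,
# instead of asking a separate critic model what a "good" answer is worth.
mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

# Positive advantage: better than the group, so reinforce this behavior.
# Negative advantage: worse than the group, so discourage it.
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print([f"{a:+.2f}" for a in advantages])
```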
Why This Matters (The Results)
The paper tested this on many different AI models (like Qwen, Phi, and Llama). Here is what happened:
- No More "Jailbreaks": When users tried to trick the AI into ignoring its rules (like asking for a recipe when forbidden), the HIPO-trained AI stuck to the rules 100% of the time.
- Still Helpful: Unlike other methods that became so strict they refused to answer anything, HIPO was still very good at answering the customer's questions, as long as the questions were safe.
- The "Attention" Shift: The researchers looked inside the AI's "brain" (its attention mechanism). They found that HIPO taught the AI to look further back in the conversation.
- Normal AI: Focuses mostly on the last thing the user said (the Customer).
- HIPO AI: Remembers to look way back at the very first thing the Boss said (the System Prompt) before answering. It keeps the Boss's rules "top of mind."
The Bottom Line
HIPO is like a training program that teaches an AI to respect its "Boss" (safety rules) as a hard, non-negotiable law, rather than just a suggestion.
It ensures the AI doesn't get distracted by the user's requests to the point of breaking safety rules. It finds the perfect balance: Strictly obey the rules, and then be as helpful as possible within those rules.
This makes AI much safer for complex jobs (like medical advice or legal help) where following the "Boss's" instructions is critical, without making the AI useless or overly cautious.