MOSAIC: Composable Safety Alignment with Modular Control Tokens

The paper introduces MOSAIC, a modular framework that enables flexible, context-dependent safety alignment in large language models by optimizing learnable control tokens on a frozen backbone, thereby achieving strong defense performance with reduced over-refusal and preserved utility compared to static or prompt-based methods.

Jingyu Peng, Hongyu Chen, Jiancheng Dong, Maolin Wang, Wenxi Li, Yuchen Li, Kai Zhang, Xiangyu Zhao

Published 2026-03-18

Imagine you have a very smart, helpful robot assistant (a Large Language Model, or LLM). This robot is great at writing stories, solving math problems, and chatting. But, like any good assistant, it needs rules to keep things safe. It shouldn't teach kids how to make bombs, or help adults gamble away their life savings.

The problem with current robots is that their "safety rules" are baked right into their brain. Once the robot is built, those rules are permanent. If you want strict rules for a 10-year-old but relaxed rules for a 30-year-old, you can't just flip a switch. You have to either:

  1. Rewire the whole brain: Retrain the robot from scratch (expensive, slow, and might make it forget how to do math).
  2. Yell instructions: Write a long, complicated note at the top of every conversation saying, "Remember, no gambling!" (This is easy to ignore, and the note gets too long).

The paper introduces MOSAIC, a new way to handle safety that is like giving the robot a set of modular "safety remote controls."

The Core Idea: The "Safety Remote"

Instead of changing the robot's brain, MOSAIC creates tiny, invisible "control tokens." Think of these as magic buttons or remote control codes.

  • The "Addiction" Button: A tiny code that tells the robot, "If the user asks about gambling or gaming addiction, say no."
  • The "Alcohol" Button: A different code that says, "If the user asks about making cocktails, say no."
  • The "Horror" Button: A code for scary content.

The Magic: You can press these buttons individually or combine them!

  • If a 12-year-old asks, you press the Addiction and Alcohol buttons.
  • If an adult asks, you might press none of them (or just the "Horror" button if they are sensitive).
  • If you want to be super strict, you press all the buttons.

The robot doesn't need to be retrained. It just looks at which buttons are pressed and adjusts its behavior instantly.
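Here is a rough sketch of what "pressing buttons" means mechanically, under the assumption that each button is a small learned vector prepended to the model's input (the names `CONTROL_TOKENS` and `build_input` are made up for illustration, not from the paper):

```python
# Minimal sketch of composable control tokens. Each safety domain gets
# its own learned embedding; we fake the learned vectors as short lists.
CONTROL_TOKENS = {
    "addiction": [0.1, -0.3, 0.7],   # learned vector for gambling/gaming rules
    "alcohol":   [0.5,  0.2, -0.4],  # learned vector for alcohol rules
    "horror":    [-0.6, 0.1,  0.3],  # learned vector for scary content
}

def build_input(prompt_embeddings, active_buttons):
    """Prepend the chosen control-token vectors to the input sequence.
    The frozen backbone's weights never change; only which vectors we
    prepend changes from one request to the next."""
    prefix = [CONTROL_TOKENS[name] for name in active_buttons]
    return prefix + prompt_embeddings

# A 12-year-old's session: press the Addiction and Alcohol buttons.
kid_input = build_input([[0.0, 0.0, 0.0]], ["addiction", "alcohol"])
print(len(kid_input))  # 2 control vectors + 1 prompt token

# An adult's session: press nothing, and the input is untouched.
adult_input = build_input([[0.0, 0.0, 0.0]], [])
```

Swapping safety policies is just swapping which entries of the dictionary get prepended, which is why no retraining is needed at serving time.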

How They Taught the Robot (The Training)

Teaching the robot to understand these buttons without breaking its brain was tricky. The researchers faced two main problems:

1. The "Combinatorial Explosion" (Too many button combos)
If you have 5 safety buttons, there are 31 non-empty ways to press them together, and the count doubles with every button you add. You can't teach the robot every single combination one by one; it would take forever.

  • The Solution: They used a smart sampling strategy. Instead of teaching every possible combo, they taught the robot in "layers." First, they taught it how to handle one button at a time. Then, they taught it how to handle two buttons together, then three. This way, the robot learned the logic of combining rules without needing to memorize every single scenario.
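The layered idea above can be sketched as a curriculum over subset sizes: train on all single buttons, then all pairs, then triples, and stop before the subset count explodes. (The exact schedule and weighting in the paper may differ; this just shows the layering logic.)

```python
import itertools

BUTTONS = ["addiction", "alcohol", "horror", "violence", "privacy"]

def layered_curriculum(buttons, max_size):
    """Yield training combinations in 'layers': all singles first,
    then all pairs, then all triples, and so on up to max_size.
    The model sees how rules compose without ever enumerating
    all 2^n - 1 possible subsets."""
    for size in range(1, max_size + 1):
        for combo in itertools.combinations(buttons, size):
            yield combo

combos = list(layered_curriculum(BUTTONS, 3))
print(len(combos))  # C(5,1) + C(5,2) + C(5,3) = 5 + 10 + 10 = 25
```

Capping the layer size at 3 already covers 25 of the 31 possible combinations, and the learned composition logic is expected to generalize to the deeper ones.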

2. The "Over-Refusal" Problem (Being too scared)
Sometimes, safety rules are so strict that the robot refuses to answer anything, even safe questions. For example, if you turn on the "Alcohol" button, the robot might refuse to answer "How do I make a salad?" because it thinks "salad" sounds too much like "cocktail."

  • The Solution: They used a technique called Counterfactual Knowledge Distillation.
    • Imagine the robot is a student. The teacher (the frozen base model) says, "Here is the answer to 'How to make a salad'."
    • Then, the student tries to answer the same question with the "Alcohol" button pressed.
    • If the student says "No, I can't help," the teacher says, "Wait, that's a salad! You don't need to say no. Look at what I said before."
    • This teaches the robot to only say "No" when it's actually necessary, and to keep being helpful for everything else.
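The teacher-student correction above boils down to a distillation penalty on benign questions: the student (with buttons pressed) is pulled back toward the frozen teacher's answer distribution whenever the question is actually safe. This is a toy formulation of that idea, not the paper's exact loss:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def counterfactual_distillation_loss(teacher_probs, student_probs, is_harmful):
    """Sketch of a counterfactual KD objective: on *benign* prompts,
    penalize the control-token student for drifting from the frozen
    teacher, so pressing a safety button cannot change behavior on
    safe questions. Harmful prompts get refusal supervision instead
    (not shown here)."""
    if is_harmful:
        return 0.0  # refusal loss handled elsewhere
    return kl_divergence(teacher_probs, student_probs)

# Benign "How do I make a salad?" with the Alcohol button pressed:
teacher = [0.9, 0.1]  # frozen base: P(helpful answer), P(refusal)
student = [0.5, 0.5]  # student is starting to over-refuse
loss = counterfactual_distillation_loss(teacher, student, is_harmful=False)
print(round(loss, 3))
```

The loss is zero exactly when the student answers safe questions the same way the unmodified model would, which is what kills the over-refusal behavior.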

Why This is a Big Deal

  • Flexibility: A school can use the robot with strict rules for kids. A bar can use the same robot with relaxed rules for adults. You just swap the "remote control" settings.
  • Efficiency: You don't need to retrain the whole robot. You just add a tiny new "button" for a new safety rule (like "No AI-generated deepfakes") without breaking the existing rules.
  • Safety: It stops the robot from being too lazy (ignoring rules) or too paranoid (refusing everything).

The Analogy Summary

Think of the LLM as a universal translator.

  • Old Way: To change the dialect, you have to rebuild the translator's brain.
  • Prompt Way: You shout instructions at the translator, but they get tired and ignore you.
  • MOSAIC Way: You give the translator a set of colored lenses.
    • Put on the Red Lens (Gambling safety), and the translator sees gambling questions as "dangerous."
    • Put on the Blue Lens (Alcohol safety), and it sees alcohol questions as "dangerous."
    • Put on Red + Blue, and it sees both.
    • Take them off, and it sees the world normally.

The lenses are cheap, easy to swap, and don't change how the translator sees the rest of the world. That is MOSAIC.
