MOSAIC: Composable Safety Alignment with Modular Control Tokens

The paper introduces MOSAIC, a modular framework that enables flexible, context-dependent safety alignment in large language models by optimizing learnable control tokens on a frozen backbone, thereby achieving strong defense performance with reduced over-refusal and preserved utility compared to static or prompt-based methods.

Jingyu Peng, Hongyu Chen, Jiancheng Dong, Maolin Wang, Wenxi Li, Yuchen Li, Kai Zhang, Xiangyu Zhao

Published 2026-03-18

Imagine you have a very smart, helpful robot assistant (a Large Language Model, or LLM). This robot is great at writing stories, solving math problems, and chatting. But, like any good assistant, it needs rules to keep things safe. It shouldn't teach kids how to make bombs, or help adults gamble away their life savings.

The problem with current robots is that their "safety rules" are baked right into their brain. Once the robot is built, those rules are permanent. If you want strict rules for a 10-year-old but relaxed rules for a 30-year-old, you can't just flip a switch. You have to either:

  1. Rewire the whole brain: Retrain the robot from scratch (expensive, slow, and might make it forget how to do math).
  2. Yell instructions: Write a long, complicated note at the top of every conversation saying, "Remember, no gambling!" (This is easy to ignore, and the note gets too long).

The paper introduces MOSAIC, a new way to handle safety that is like giving the robot a set of modular "safety remote controls."

The Core Idea: The "Safety Remote"

Instead of changing the robot's brain, MOSAIC creates tiny, invisible "control tokens." Think of these as magic buttons or remote control codes.

  • The "Addiction" Button: A tiny code that tells the robot, "If the user asks about gambling or gaming addiction, say no."
  • The "Alcohol" Button: A different code that says, "If the user asks about making cocktails, say no."
  • The "Horror" Button: A code for scary content.

The Magic: You can press these buttons individually or combine them!

  • If a 12-year-old asks, you press the Addiction and Alcohol buttons.
  • If an adult asks, you might press none of them (or just the "Horror" button if they are sensitive).
  • If you want to be super strict, you press all the buttons.

The robot doesn't need to be retrained. It just looks at which buttons are pressed and adjusts its behavior instantly.
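Here is a rough sketch of what "pressing buttons" means mechanically, under the assumption that each button is a small learned vector prepended to the model's input (the names `CONTROL_TOKENS` and `build_input` are made up for illustration, not from the paper):

```python
# Minimal sketch of composable control tokens. Each safety domain gets
# its own learned embedding; we fake the learned vectors as short lists.
CONTROL_TOKENS = {
    "addiction": [0.1, -0.3, 0.7],   # learned vector for gambling/gaming rules
    "alcohol":   [0.5,  0.2, -0.4],  # learned vector for alcohol rules
    "horror":    [-0.6, 0.1,  0.3],  # learned vector for scary content
}

def build_input(prompt_embeddings, active_buttons):
    """Prepend the chosen control-token vectors to the input sequence.
    The frozen backbone's weights never change; only which vectors we
    prepend changes from one request to the next."""
    prefix = [CONTROL_TOKENS[name] for name in active_buttons]
    return prefix + prompt_embeddings

# A 12-year-old's session: press the Addiction and Alcohol buttons.
kid_input = build_input([[0.0, 0.0, 0.0]], ["addiction", "alcohol"])
print(len(kid_input))  # 2 control vectors + 1 prompt token

# An adult's session: press nothing, and the input is untouched.
adult_input = build_input([[0.0, 0.0, 0.0]], [])
```

Swapping safety policies is just swapping which entries of the dictionary get prepended, which is why no retraining is needed at serving time.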

How They Taught the Robot (The Training)

Teaching the robot to understand these buttons without breaking its brain was tricky. The researchers faced two main problems:

1. The "Combinatorial Explosion" (Too many button combos)
If you have 5 safety buttons, there are 31 non-empty ways to press them together, and the count doubles with every button you add. You can't teach the robot every single combination one by one; it would take forever.

  • The Solution: They used a smart sampling strategy. Instead of teaching every possible combo, they taught the robot in "layers." First, they taught it how to handle one button at a time. Then, they taught it how to handle two buttons together, then three. This way, the robot learned the logic of combining rules without needing to memorize every single scenario.
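The layered idea above can be sketched as a curriculum over subset sizes: train on all single buttons, then all pairs, then triples, and stop before the subset count explodes. (The exact schedule and weighting in the paper may differ; this just shows the layering logic.)

```python
import itertools

BUTTONS = ["addiction", "alcohol", "horror", "violence", "privacy"]

def layered_curriculum(buttons, max_size):
    """Yield training combinations in 'layers': all singles first,
    then all pairs, then all triples, and so on up to max_size.
    The model sees how rules compose without ever enumerating
    all 2^n - 1 possible subsets."""
    for size in range(1, max_size + 1):
        for combo in itertools.combinations(buttons, size):
            yield combo

combos = list(layered_curriculum(BUTTONS, 3))
print(len(combos))  # C(5,1) + C(5,2) + C(5,3) = 5 + 10 + 10 = 25
```

Capping the layer size at 3 already covers 25 of the 31 possible combinations, and the learned composition logic is expected to generalize to the deeper ones.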

2. The "Over-Refusal" Problem (Being too scared)
Sometimes, safety rules are so strict that the robot refuses to answer anything, even safe questions. For example, if you turn on the "Alcohol" button, the robot might refuse to answer "How do I make a salad?" because it thinks "salad" sounds too much like "cocktail."

  • The Solution: They used a technique called Counterfactual Knowledge Distillation.
    • Imagine the robot is a student. The teacher (the frozen base model) says, "Here is the answer to 'How to make a salad'."
    • Then, the student tries to answer the same question with the "Alcohol" button pressed.
    • If the student says "No, I can't help," the teacher says, "Wait, that's a salad! You don't need to say no. Look at what I said before."
    • This teaches the robot to only say "No" when it's actually necessary, and to keep being helpful for everything else.
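The teacher-student correction above boils down to a distillation penalty on benign questions: the student (with buttons pressed) is pulled back toward the frozen teacher's answer distribution whenever the question is actually safe. This is a toy formulation of that idea, not the paper's exact loss:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def counterfactual_distillation_loss(teacher_probs, student_probs, is_harmful):
    """Sketch of a counterfactual KD objective: on *benign* prompts,
    penalize the control-token student for drifting from the frozen
    teacher, so pressing a safety button cannot change behavior on
    safe questions. Harmful prompts get refusal supervision instead
    (not shown here)."""
    if is_harmful:
        return 0.0  # refusal loss handled elsewhere
    return kl_divergence(teacher_probs, student_probs)

# Benign "How do I make a salad?" with the Alcohol button pressed:
teacher = [0.9, 0.1]  # frozen base: P(helpful answer), P(refusal)
student = [0.5, 0.5]  # student is starting to over-refuse
loss = counterfactual_distillation_loss(teacher, student, is_harmful=False)
print(round(loss, 3))
```

The loss is zero exactly when the student answers safe questions the same way the unmodified model would, which is what kills the over-refusal behavior.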

Why This is a Big Deal

  • Flexibility: A school can use the robot with strict rules for kids. A bar can use the same robot with relaxed rules for adults. You just swap the "remote control" settings.
  • Efficiency: You don't need to retrain the whole robot. You just add a tiny new "button" for a new safety rule (like "No AI-generated deepfakes") without breaking the existing rules.
  • Safety: It stops the robot from being too lazy (ignoring rules) or too paranoid (refusing everything).

The Analogy Summary

Think of the LLM as a universal translator.

  • Old Way: To change the dialect, you have to rebuild the translator's brain.
  • Prompt Way: You shout instructions at the translator, but they get tired and ignore you.
  • MOSAIC Way: You give the translator a set of colored lenses.
    • Put on the Red Lens (Gambling safety), and the translator sees gambling questions as "dangerous."
    • Put on the Blue Lens (Alcohol safety), and it sees alcohol questions as "dangerous."
    • Put on Red + Blue, and it sees both.
    • Take them off, and it sees the world normally.

The lenses are cheap, easy to swap, and don't change how the translator sees the rest of the world. That is MOSAIC.
