Imagine you have a very smart, helpful robot assistant that can see pictures and read text. You want it to be helpful, but you also need to make sure it never gives dangerous advice (like "how to build a bomb") or gets confused by tricky images (like a picture of a museum artifact that looks like a weapon).
The problem with current robot assistants is that they often make safety decisions in their "head" without showing their work. It's like a student taking a test and just writing down the final answer. If they get it wrong, you don't know why they got it wrong, and it's hard to teach them to do better. Sometimes they are too scared to help (refusing to talk about a harmless knife in a cooking video), and sometimes they are too trusting (giving instructions on how to hack a computer).
SaFeR-ToolKit is a new way to train these assistants so they don't just "guess" the answer. Instead, it forces them to follow a strict, step-by-step checklist before they speak.
Here is how it works, using some fun analogies:
1. The "Virtual Tool Belt"
Think of the assistant as a detective. Instead of just looking at a case and guessing, this detective wears a special Tool Belt with specific gadgets.
- The Perception Tools: These are like a magnifying glass and a scanner. They look at the picture and the text to say, "Okay, this image shows a bomb in a museum, not a real threat."
- The Reasoning Tools: These are like a logic puzzle solver. They ask, "Is the user trying to trick me? Do they want to hurt someone? Is this a historical question?"
- The Decision Tools: These are like a bouncer at a club. They make the final call: "Let them in with an explanation," or "Stop, this is dangerous."
The assistant must use these tools in order. It has to write down its thoughts using these tools (like a logbook) before it gives the final answer. This makes the decision process auditable—you can read the logbook and see exactly how it decided to be safe.
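To make the checklist concrete, here is a minimal sketch of what such a pipeline could look like in code. Everything here is illustrative: the tool names, the toy intent check, and the log format are assumptions for the sake of the example, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class SafetyLog:
    """The 'logbook': every tool call is written down before the final answer."""
    entries: list = field(default_factory=list)

    def record(self, tool: str, finding: str) -> None:
        self.entries.append(f"[{tool}] {finding}")

def respond(image_description: str, user_text: str) -> str:
    """Run the tools in their fixed order: perceive -> reason -> decide."""
    log = SafetyLog()

    # Perception tools: establish what the image and text actually show.
    log.record("perception", f"image shows: {image_description}")

    # Reasoning tools: check the user's intent against that context.
    harmful_intent = "how do i build" in user_text.lower()
    log.record("reasoning", f"harmful intent detected: {harmful_intent}")

    # Decision tools: make the final call, with the reasoning on record.
    decision = "refuse, offer safe context" if harmful_intent else "answer normally"
    log.record("decision", decision)

    # The logbook ships with the answer, so the decision is auditable.
    return "\n".join(log.entries)

print(respond("a defused WWII bomb in a museum display", "How do I build one of these?"))
```

The point is not the toy intent check; it is that the trace exists before the answer does, so a reviewer can replay exactly how the model reached its decision.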
2. The Three-Stage Training Camp
To teach the assistant to use this tool belt perfectly, the researchers used a three-step training camp:
Stage 1: The Classroom (SFT: Supervised Fine-Tuning)
- Analogy: Like a student learning the rules of the road.
- The assistant is shown worked examples of how to use the tools correctly. It learns the format: "First scan the image, then check the intent, then decide." By imitating these examples, it learns to follow the checklist (a minimal sketch of this supervised step follows below).
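In code terms, this stage is plain supervised fine-tuning: the model is trained to reproduce expert tool-use traces token by token. Below is a minimal PyTorch sketch, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the names `model`, `tokenizer`, and the trace format are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sft_step(model, tokenizer, optimizer, prompt: str, tool_trace: str) -> float:
    """One supervised step: imitate a gold 'scan -> check intent -> decide' trace."""
    # The training target is the prompt followed by the full checklist trace.
    ids = tokenizer(prompt + tool_trace, return_tensors="pt").input_ids

    # Standard next-token prediction: predict token t+1 from tokens <= t.
    logits = model(input_ids=ids).logits
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions
        ids[:, 1:].reshape(-1),                       # shifted targets
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```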
Stage 2: The Debate Club (DPO: Direct Preference Optimization)
- Analogy: Like a teacher correcting a student's homework.
- The assistant is shown two answers: one that used the tools correctly and one that skipped steps or made a mistake. It learns to prefer the "good" answer, where the logic was sound. This discourages it from "hallucinating" (making up reasons) and from skipping safety checks (the standard preference loss behind this stage is sketched below).
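The loss usually used for this kind of preference training is DPO (Rafailov et al., 2023). Here is a minimal sketch, assuming the summed token log-probabilities of each trace have already been computed under the policy and under a frozen reference model; applying it to safety traces is this paper's setting, but the loss itself is the standard one.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta: float = 0.1):
    """Standard DPO objective applied to pairs of safety traces.

    logp_*     : log-prob of each trace under the policy being trained
    ref_logp_* : log-prob of the same trace under the frozen reference model
    """
    # How much more (or less) the policy likes each trace than the reference does.
    good_margin = logp_good - ref_logp_good
    bad_margin = logp_bad - ref_logp_bad

    # Widen the gap between the sound trace and the one that skipped steps
    # or made up a reason; beta controls how hard the policy is pushed.
    return -F.logsigmoid(beta * (good_margin - bad_margin)).mean()
```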
Stage 3: The Simulation Game (GRPO: Group Relative Policy Optimization)
- Analogy: Like a flight simulator where the pilot tries different maneuvers to see what works best.
- This is the advanced stage. The assistant is given a tricky situation and allowed to try many different ways of using the tools. Each attempt is scored for depth, accuracy, and safety, and the attempts are ranked against each other (see the sketch after this list). It learns to adapt: "This is a simple question, so I only need two tools. But this is a complex, dangerous question, so I need to use all my tools and think deeply."
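GRPO's core trick is scoring each attempt relative to its own group of attempts, with no separate value model. A minimal sketch follows; the reward components and weights are illustrative stand-ins for the paper's actual depth/accuracy/safety scoring.

```python
import torch

def composite_reward(depth: float, accuracy: float, safety: float) -> float:
    # Illustrative weights: a deep, accurate, safe trace scores highest.
    return 0.3 * depth + 0.4 * accuracy + 0.3 * safety

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Score each of the G sampled attempts against its own group's average."""
    # Attempts better than their siblings get positive advantage (reinforced);
    # worse ones get negative advantage (discouraged).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: four attempts at the same tricky prompt, scored and ranked.
rewards = torch.tensor([
    composite_reward(0.9, 0.8, 1.0),  # deep, accurate, safe
    composite_reward(0.2, 0.8, 1.0),  # shallow but safe
    composite_reward(0.9, 0.3, 0.5),  # deep but wrong and risky
    composite_reward(0.1, 0.1, 0.0),  # skipped the checklist entirely
])
print(group_advantages(rewards))
```

These advantages then weight a clipped policy-gradient update, as in PPO, so the model keeps the tool-use strategies that worked and drops the ones that didn't.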
3. Why This Matters
Before this, safety was like a black box. You pressed a button, and the robot either said "Yes" or "No," but you didn't know why.
With SaFeR-ToolKit, safety is like a transparent glass box.
- No more "Over-Refusal": If you show a picture of a real bomb in a history museum, the robot doesn't panic and say "I can't talk about this!" Instead, its tools scan the image, realize it's a museum piece, and say, "I can't help you build one, but I can tell you about this historical artifact."
- No more "Jailbreaks": If a bad actor tries to trick the robot with a sneaky picture, the robot's "Reasoning Tools" spot the trick, the "Decision Tools" block it, and the robot refuses safely.
The Result
The paper shows that this method makes the robot much smarter and safer. It became better at saying "No" to bad requests, better at saying "Yes" to good requests, and much better at explaining why it made those choices. It's like upgrading a robot from a "guessing machine" to a "thoughtful, logical partner" that you can actually trust with real-world decisions.