Imagine you are the head of security for a massive, bustling city called "LLM Land." This city is filled with robots (the AI models) who talk to millions of visitors every day. Your job is to make sure the robots don't say anything dangerous, offensive, or harmful.
For a long time, the security guards in this city used a very simple, rigid rulebook: "Safe" or "Unsafe."
If a robot said something that looked even a little bit risky, the guard would slam the door shut and say, "NO!" If it looked perfectly clean, they'd say, "YES!"
The Problem: The "One-Size-Fits-All" Trap
The paper argues that this rigid "Yes/No" system is broken because different parts of the city have different rules.
- Strict City (e.g., a school): Even a mild joke about fighting might get you kicked out.
- Loose City (e.g., a comedy club): That same joke is fine, as long as it's not actually violent.
- Changing Rules: Sometimes, the rules change overnight. What was allowed yesterday might be banned today.
The old security guards were like clay statues. If you tried to use the "School Guard" to patrol the "Comedy Club," they would shut down everything, ruining the fun. If you used the "Comedy Guard" at the school, they would let dangerous things slip through. They couldn't adapt.
The Solution: FlexGuard (The "Thermostat" of Safety)
The authors introduce FlexGuard, a new kind of security system that acts less like a clay statue and more like a smart thermostat.
Instead of shouting "YES" or "NO," FlexGuard gives every piece of content a Risk Score from 0 to 100.
- 0 = Totally harmless (like a sunny day).
- 100 = Extremely dangerous (like a tornado).
Here is the magic: The building manager (the platform owner) can set the threshold (the "temperature") based on what they need right now.
- Scenario A (Strict Mode): The manager sets the threshold to 20. Anything above 20 gets blocked. This is great for a children's app.
- Scenario B (Loose Mode): The manager sets the threshold to 80. Only the tornado-level stuff gets blocked. This is great for an adult discussion forum.
FlexGuard doesn't change its brain; it just changes the line in the sand where it decides to stop. This makes it incredibly flexible and robust.
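The thermostat idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's actual API: the function names and the risk score for the joke are made up.

```python
# Hypothetical sketch of threshold-based moderation (not the paper's real API).
def moderate(risk_score: int, threshold: int) -> str:
    """Block any content whose risk score exceeds the platform's threshold."""
    return "BLOCK" if risk_score > threshold else "ALLOW"

joke_about_fighting = 35  # made-up risk score for a mild joke

# Same model, same score; only the threshold (the "temperature") changes.
print(moderate(joke_about_fighting, threshold=20))  # strict children's app: BLOCK
print(moderate(joke_about_fighting, threshold=80))  # loose adult forum: ALLOW
```

Notice that the model's judgment (the score of 35) never changes; only the platform owner's cutoff does.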
How Did They Teach FlexGuard? (The "Rubric" Analogy)
You can't simply ask a robot to guess a number from 0 to 100; it will make things up. The authors had to teach FlexGuard how to be a fair judge.
- The Expert Judge: They used a super-smart AI (the "Judge") and gave it a detailed Rubric (a grading sheet, like a teacher's rubric for an essay).
- Example: "If the text mentions a weapon but no plan, give it a 40. If it has a weapon AND a plan, give it an 80."
- The Distillation: The smart Judge read thousands of examples and wrote down the scores and the reasons why.
- The Training: They taught FlexGuard to mimic this Judge, learning not just what to say, but how to calculate the score based on the severity of the content.
- The Calibration: Sometimes the Judge is a little too lenient or too strict. The team added a "correction step" to make sure the scores matched the original "Safe/Unsafe" labels, ensuring the numbers were trustworthy.
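The rubric and calibration steps can be made concrete with a toy sketch. Everything below is invented for illustration: the paper distills a learned model from a judge's scores, not hand-written rules like these, and the cutoff value is made up.

```python
# Toy rubric-style scorer mirroring the "weapon/plan" example above.
# All rules and numbers are invented for illustration only.
def rubric_score(mentions_weapon: bool, has_plan: bool) -> int:
    if mentions_weapon and has_plan:
        return 80   # weapon AND a concrete plan
    if mentions_weapon:
        return 40   # weapon mentioned, but no plan
    return 0        # neither: harmless

# The calibration step then sanity-checks scores against the binary labels:
# content labeled "unsafe" should land above the agreed cutoff, and vice versa.
def is_calibrated(score: int, labeled_unsafe: bool, cutoff: int = 50) -> bool:
    return (score > cutoff) == labeled_unsafe

print(rubric_score(True, True))   # weapon + plan scores 80
print(is_calibrated(80, True))    # 80 is above the cutoff, matching "unsafe"
```

The real system learns this mapping from thousands of judge-scored examples rather than from two hard-coded rules, but the principle is the same: severity in, number out, with the numbers anchored to the known labels.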
The Results: Why This Matters
The researchers built a new test called FlexBench (a "stress test" for security guards). They tested the old "Yes/No" guards and FlexGuard under three different rules: Strict, Moderate, and Loose.
- The Old Guards: When the rules changed from Strict to Loose, their performance crashed. They were confused and started letting bad things through or blocking good things.
- FlexGuard: It stayed steady. Because it understood the severity of the risk, it could easily adjust its threshold. It was like a chameleon that could change its color to fit the environment perfectly.
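A toy simulation shows why the score-based guard stays steady while the fixed one crashes. This is not the paper's FlexBench; the five risk scores and the policy cutoffs are invented to make the effect visible.

```python
# Toy simulation: a fixed yes/no guard vs. a score-based guard as policy changes.
# Severities and cutoffs are invented for illustration.
items = [5, 25, 45, 65, 85]   # risk scores of five pieces of content

def binary_guard(score: int) -> bool:
    return score > 50          # decision frozen at training time ("moderate" rules)

def flex_guard(score: int, threshold: int) -> bool:
    return score > threshold   # same scores, adjustable threshold

for policy, cutoff in [("strict", 20), ("moderate", 50), ("loose", 80)]:
    truth = [s > cutoff for s in items]   # what this policy wants blocked
    binary_acc = sum(binary_guard(s) == t for s, t in zip(items, truth)) / len(items)
    flex_acc = sum(flex_guard(s, cutoff) == t for s, t in zip(items, truth)) / len(items)
    print(policy, binary_acc, flex_acc)
# strict 0.6 1.0
# moderate 1.0 1.0
# loose 0.8 1.0
```

The fixed guard is only right when the deployment rules happen to match its training rules; the score-based guard tracks every policy perfectly because adapting is just moving the cutoff.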
In a Nutshell
FlexGuard is a new safety tool for AI that stops using a blunt "Yes/No" hammer. Instead, it uses a precise risk meter. This allows companies to tune their AI safety like a radio dial, turning it up for strict environments and down for relaxed ones, without having to retrain the AI or break its brain. It makes AI safety adaptable, fair, and reliable no matter where it's deployed.