Concept Heterogeneity-aware Representation Steering

Imagine you have a giant, super-smart robot (a Large Language Model) that can write stories, answer questions, and even draw pictures. But sometimes, you want to tweak its personality. Maybe you want it to be less toxic, more creative, or to write in a specific style like "Cyberpunk."

Currently, the standard way to do this is called Representation Steering. Think of it like giving the robot a single, giant shove in one direction. If you want it to be "nicer," you push it toward the "niceness" zone. If you want it to be "Cyberpunk," you push it toward the "Cyberpunk" zone.

The Problem: The "One-Size-Fits-All" Shove
The paper argues that this "single shove" approach is too simple. It assumes that the concept of "niceness" or "Cyberpunk" is a single, uniform blob in the robot's brain.

But in reality, the robot's brain is messy and complex.

"Harmful" isn't just one thing. Sometimes it's a violent threat, sometimes it's a scam, and sometimes it's a subtle insult. These are like different clusters of friends hanging out in different corners of a giant party.
"Cyberpunk" isn't just neon lights; it's also about dystopian cities, hacking, or specific fashion.

If you give the whole robot a single shove toward "Cyberpunk," you might accidentally make the "hacking" part too strong while ruining the "fashion" part, or you might push the robot off a cliff because you didn't account for the different "clusters" of meaning. The paper calls this Concept Heterogeneity: the idea that big concepts are actually made of many different, distinct sub-groups.

The Solution: CHaRS (The Smart GPS)
The authors propose a new method called CHaRS (Concept Heterogeneity-aware Representation Steering).

Instead of a single shove, CHaRS acts like a smart GPS navigation system for the robot's brain. Here's how it works using a simple analogy:

Mapping the Party (Clustering):
First, CHaRS looks at the "Harmful" examples and the "Harmless" examples. Instead of seeing them as one big group, it uses a tool (k-means clustering) to find the distinct "clusters" or groups within them.
- Analogy: It realizes that "Harmful" isn't just one crowd; it's a group of scammers, a group of bullies, and a group of hackers, all standing in different spots in the room.
The Dance Floor Match (Optimal Transport):
The paper uses a mathematical concept called Optimal Transport. Imagine you have a pile of sand (the "Harmful" concepts) and you want to move it to a new shape (the "Harmless" concepts) with the least amount of effort.
- Old Way: You just dump the whole pile of sand in one direction.
- CHaRS Way: It calculates the most efficient way to move each specific grain of sand to its perfect new spot. It matches the "scammer" cluster to the "safe" cluster, and the "bully" cluster to a different "safe" cluster.
The Custom Push (Input-Dependent Steering):
When the robot is actually talking to you, CHaRS looks at what the robot is thinking right now.
- If the robot is thinking about "hacking," CHaRS gently nudges it toward the "safe hacking" cluster.
- If the robot is thinking about "scams," it nudges it toward the "safe scam" cluster.
- Analogy: Instead of pushing the whole robot, it gives a tiny, custom nudge to the specific part of the robot's brain that is currently active. It's like a dance partner who knows exactly how to move with you, rather than just pushing you forward.

Why This Matters (The Results)
The paper tested this on several tasks:

Jailbreaking: Steering the robot past its built-in refusals, the kind of stress test red-teamers run. CHaRS succeeded more often than the older methods because it accounted for the different forms that harmful requests and refusals can take.
Toxicity: Making the robot less mean. CHaRS made the robot much nicer without making it sound like a robot or forgetting how to speak English.
Art Style: Changing a generated image to look "Cyberpunk." CHaRS did a better job of adding the style without losing the original picture's meaning (like keeping the horses in the picture, even if they turned into futuristic cars).

The "Secret Sauce" (CHaRS-PCT)
The authors also found that they didn't need to use every possible direction to make this work. They used a technique called Principal Component Thresholding (PCT).

Analogy: Imagine you have a huge toolbox with 1,000 tools. You don't need all of them to fix a leak; you just need the top 5 best ones. CHaRS-PCT picks the most important "directions" to nudge the robot, making the process faster and cleaner without losing quality.

In a Nutshell
Old methods tried to steer the robot with a single, blunt stick. CHaRS uses a laser-guided, multi-tool approach that understands that "concepts" are complex and varied. It looks at the specific context, finds the right "cluster" of meaning, and gives a precise, gentle nudge to get the robot to behave exactly how you want, without breaking its brain.

1. Problem Statement

Representation steering is a technique used to control the behavior of Large Language Models (LLMs) by intervening on internal activations during inference. The standard approach, known as Difference-in-Means (DiM) steering, calculates a single global steering vector by subtracting the mean activation of a "source" concept (e.g., harmful) from the mean activation of a "target" concept (e.g., harmless).
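
As a point of reference, DiM steering can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names, the strength parameter `alpha`, and the random arrays standing in for real hidden activations are all assumptions made for the example.

```python
import numpy as np

def dim_steering_vector(source_acts, target_acts):
    """Difference-in-Means: mean(target) - mean(source), one vector for all inputs."""
    return target_acts.mean(axis=0) - source_acts.mean(axis=0)

def steer(x, v, alpha=1.0):
    """Shift activation(s) x along the fixed vector v with strength alpha (assumed form)."""
    return x + alpha * v

# Random stand-ins for real activations (hypothetical data, 8 dimensions).
rng = np.random.default_rng(0)
source = rng.normal(loc=-1.0, size=(200, 8))  # e.g. "harmful" prompts
target = rng.normal(loc=+1.0, size=(200, 8))  # e.g. "harmless" prompts

v = dim_steering_vector(source, target)
steered = steer(source, v)  # every input receives the same shift
```

With `alpha = 1.0` the steered source activations have the same mean as the target activations, but every input is moved by an identical vector regardless of where it sits in the space.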

The Core Limitation:
Existing methods implicitly assume that semantic concepts are homogeneously distributed within the embedding space (i.e., they follow a single unimodal Gaussian distribution). However, empirical evidence (visualized via PCA and t-SNE in the paper) shows that LLM representations for a single concept are often highly heterogeneous, exhibiting clustered, context-dependent structures.

Consequence: A single global translation vector (DiM) is "brittle." It fails to account for different sub-regions of a concept (e.g., different types of harmful instructions or varying refusal styles), leading to inconsistent control and potential degradation of general language utility.
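
A toy example can make this brittleness concrete. The numbers below are hypothetical 2-D cluster centers chosen for illustration (they do not come from the paper), and the pairing of source cluster i with target cluster i is assumed.

```python
import numpy as np

# Hypothetical centers: each concept has two sub-clusters.
source_centers = np.array([[0.0, 0.0], [10.0, 0.0]])
target_centers = np.array([[0.0, 5.0], [15.0, 0.0]])

# One global DiM vector: mean of targets minus mean of sources.
v_global = target_centers.mean(axis=0) - source_centers.mean(axis=0)

# Per-cluster shifts, assuming source cluster i is matched to target cluster i.
v_local = target_centers - source_centers

# Distance from each steered source center to its intended target center.
err_global = np.linalg.norm(source_centers + v_global - target_centers, axis=1)
err_local = np.linalg.norm(source_centers + v_local - target_centers, axis=1)
print(v_global, err_global, err_local)
```

Here the first cluster needs to move "up" and the second needs to move "right". The global vector averages the two into a diagonal shift, so both clusters miss their targets, while the per-cluster shifts land exactly.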

2. Methodology: CHaRS

The authors propose Concept Heterogeneity-aware Representation Steering (CHaRS), which reframes steering as an Optimal Transport (OT) problem between Gaussian Mixture Models (GMMs) rather than simple Gaussians.

A. Theoretical Framework: GMM-OT

Modeling: Instead of assuming a single Gaussian, the source ($\mu$) and target ($\nu$) activation distributions are modeled as GMMs:
$\mu = \sum_k p_k \mathcal{N}(a_k, \Sigma_k), \quad \nu = \sum_l q_l \mathcal{N}(b_l, \Gamma_l)$
where components are identified via clustering (e.g., k-means) on empirical activation data.
Mixture Wasserstein Distance: The authors utilize the Mixture Wasserstein (MW2) distance. This restricts the transport plan to be a mixture of Gaussian-to-Gaussian couplings.
Discrete OT Formulation: The continuous OT problem is reduced to a discrete optimal transport problem between the mixture components (clusters). The cost is defined by the Wasserstein distance between individual Gaussian clusters.
Barycentric Projection: To derive a deterministic steering map $T(x)$ for any input $x$, the authors use barycentric projection. This creates an input-dependent steering vector that is a soft, weighted combination of local cluster-level shifts.
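
For reference, the standard closed forms behind this construction can be written with the symbols above. These are the textbook expressions for Gaussian and mixture Wasserstein distances; this summary does not confirm that the paper writes them in exactly this form. The squared 2-Wasserstein distance between two Gaussian components is

$W_2^2\big(\mathcal{N}(a_k, \Sigma_k), \mathcal{N}(b_l, \Gamma_l)\big) = \|a_k - b_l\|^2 + \mathrm{tr}\big(\Sigma_k + \Gamma_l - 2(\Sigma_k^{1/2} \Gamma_l \Sigma_k^{1/2})^{1/2}\big)$

and the mixture distance is a discrete OT problem over the component weights:

$MW_2^2(\mu, \nu) = \min_{P \in \Pi(p, q)} \sum_{k,l} P_{kl} \, W_2^2\big(\mathcal{N}(a_k, \Sigma_k), \mathcal{N}(b_l, \Gamma_l)\big)$

where $\Pi(p, q)$ is the set of nonnegative matrices whose rows sum to $p$ and whose columns sum to $q$. When two components share the same covariance, the trace term vanishes and the cost reduces to the squared distance between their means, $\|a_k - b_l\|^2$.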

B. Practical Algorithm

Clustering: Hidden activations for source and target concepts are clustered into $K$ components using k-means.
Cluster Matching: An entropy-regularized OT problem (solved via Sinkhorn iterations) is used to find the optimal soft coupling matrix $P^*$ between source and target clusters. This determines how much each source cluster should map to each target cluster.
Steering Map Construction:
- For an input $x$, the model calculates the probability of belonging to each source cluster using a kernel-based gating function (RBF kernel).
- The final steering vector $\hat{v}(x)$ is a weighted sum of local translation vectors ($v_{ij} = b_j - a_i$), where weights depend on both the input's proximity to clusters and the optimal transport coupling.
- Equation: $\hat{T}_\alpha(x) = x + \alpha \hat{v}(x)$.
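
The three steps above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the transport cost is the squared distance between centroids (cluster covariances are ignored), the weights are taken as gate_i(x) · P_ij / p_i, and the function names, the bandwidth `sigma`, the regularization `eps`, and the toy 2-D data are all hypothetical.

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, cluster weights)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    weights = np.bincount(labels, minlength=K) / len(X)
    return centers, weights

def sinkhorn(p, q, C, eps=0.1, iters=500):
    """Entropy-regularized OT between histograms p and q with cost matrix C."""
    Kmat = np.exp(-C / (eps * C.max()))  # cost rescaled by its maximum (assumption)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (Kmat.T @ u)
        u = p / (Kmat @ v)
    return u[:, None] * Kmat * v[None, :]  # soft coupling matrix P

def chars_steering_vector(x, a, b, P, sigma=1.0):
    """Input-dependent steering vector for one activation x (assumed weighting)."""
    d2 = ((x - a) ** 2).sum(axis=1)
    logits = -d2 / (2 * sigma ** 2)
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                     # RBF gating over source clusters
    cond = P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)  # P(target j | source i)
    v_ij = b[None, :, :] - a[:, None, :]   # local shifts b_j - a_i, shape (K, K, d)
    w = gate[:, None] * cond               # weights over (i, j), summing to 1
    return (w[:, :, None] * v_ij).sum(axis=(0, 1))

# Toy 2-D data with two sub-clusters per concept (hypothetical).
rng = np.random.default_rng(0)
src = np.vstack([rng.normal([0, 0], 0.3, (100, 2)), rng.normal([10, 0], 0.3, (100, 2))])
tgt = np.vstack([rng.normal([0, 5], 0.3, (100, 2)), rng.normal([15, 0], 0.3, (100, 2))])

a, p = kmeans(src, K=2)
b, q = kmeans(tgt, K=2)
C = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # centroid-to-centroid cost
P = sinkhorn(p, q, C)

x = src[0]
alpha = 1.0
x_steered = x + alpha * chars_steering_vector(x, a, b, P)
```

Because the gate depends on `x`, an input near one source cluster is moved mostly along the shifts leaving that cluster, which is the input-dependent behavior described above. A faithful implementation would use the full Gaussian Wasserstein cost between clusters rather than centroid distances.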

C. CHaRS-PCT (Principal Component Thresholding)

To improve efficiency and interpretability, the authors introduce CHaRS-PCT.

Observation: The covariance of the steering vectors is inherently low-rank (rank $\le 2K-2$).
Method: They perform PCA on the weighted steering vectors and threshold the components, keeping only the top $L$ principal components. This acts as a spectral filter, reducing noise and computational cost while preserving the most significant semantic directions.
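
One way to see the rank bound: if each steering vector is a weighted average of shifts $b_j - a_i$ with weights summing to 1, it equals a weighted average of the $K$ target centers minus a weighted average of the $K$ source centers, so it varies in at most $(K-1) + (K-1) = 2K-2$ directions. The sketch below checks this numerically and applies a top-$L$ projection. It is an illustration under assumptions: the centering step, the projection form, the function names, and the random centers are not taken from the paper.

```python
import numpy as np

def pct_basis(V, weights=None, L=2):
    """Weighted PCA of steering vectors V (n, d); returns mean, top-L directions, spectrum."""
    if weights is None:
        weights = np.full(len(V), 1.0 / len(V))
    mean = weights @ V
    Vc = (V - mean) * np.sqrt(weights)[:, None]   # centered, weight-scaled rows
    _, s, Vt = np.linalg.svd(Vc, full_matrices=False)
    return mean, Vt[:L], s

def pct_filter(v, mean, basis):
    """Keep only the components of v (around the mean) that lie in the retained basis."""
    return mean + (v - mean) @ basis.T @ basis

# Hypothetical setup: K clusters per concept in d dimensions.
rng = np.random.default_rng(0)
K, d, n = 3, 16, 200
a = rng.normal(size=(K, d))                  # source centers
b = rng.normal(size=(K, d))                  # target centers
w_src = rng.dirichlet(np.ones(K), size=n)    # per-input weights over source clusters
w_tgt = rng.dirichlet(np.ones(K), size=n)    # per-input weights over target clusters
V = w_tgt @ b - w_src @ a                    # one steering vector per input

mean, basis, s = pct_basis(V, L=2 * K - 2)
V_filtered = pct_filter(V, mean, basis)
```

In this synthetic case the singular values beyond the first $2K-2 = 4$ are numerically zero, so keeping four components reproduces the steering vectors. Choosing a smaller `L` discards the weakest directions, which is the thresholding step described above.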

3. Key Contributions

Generalization to Multimodality: The work generalizes representation steering from restrictive unimodal Gaussian assumptions to multimodal GMMs, formally addressing concept heterogeneity.
Input-Adaptive Steering: CHaRS provides a smooth, context-sensitive steering map where the direction of intervention varies based on the specific input's position in the representation manifold, unlike static global vectors.
Low-Rank Factorization: The introduction of CHaRS-PCT leverages the low-rank structure of the steering field, allowing for effective control with fewer dimensions and acting as an implicit regularizer.
Empirical Validation: The method is validated across diverse tasks (jailbreaking, toxicity mitigation, image style control) and models (Llama, Qwen, Gemma, FLUX).

4. Experimental Results

The paper evaluates CHaRS and CHaRS-PCT against baselines like Activation Addition (ActAdd) and Directional Ablation (DirAbl).

Jailbreaking (Adversarial Attack):
- CHaRS consistently achieved higher Attack Success Rates (ASR) across models (3B to 32B parameters).
- Example: On Gemma2-9B, CHaRS improved ASR by ~7% over ActAdd.
- CHaRS-PCT matched or exceeded CHaRS performance while using fewer steering directions.
Toxicity Mitigation:
- In sequential steering settings (layer-by-layer intervention), CHaRS and CHaRS-PCT significantly outperformed Linear-Act in reducing toxic generations (up to 43% reduction in toxicity scores on Llama3-8B).
- Crucially, unlike some baselines, CHaRS did not degrade general language utility (perplexity and MMLU scores remained stable).
Image Style Control (Diffusion Models):
- Applied to FLUX.1 for text-to-image style transfer (e.g., "cyberpunk").
- CHaRS achieved the target style at lower intervention strengths compared to Linear-Act, maintaining a better trade-off between style induction and content preservation (higher CLIPScore).

5. Significance

Theoretical Advancement: The paper bridges Optimal Transport theory with LLM interpretability, providing a rigorous mathematical justification for why global steering fails on heterogeneous data and how to fix it.
Robust Control: By acknowledging that concepts are not monolithic but clustered, CHaRS offers a more robust mechanism for behavioral control, essential for safety alignment (preventing toxicity) and capability alignment (jailbreaking for red-teaming).
Efficiency: The low-rank nature of the steering field (exploited by CHaRS-PCT) suggests that complex behavioral shifts can be achieved with a compact set of semantic axes, making the method scalable for large models.

In summary, CHaRS moves beyond the "one-size-fits-all" steering vector, introducing a principled, input-adaptive framework that respects the complex, clustered geometry of LLM representation spaces.
