PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

Imagine you are trying to sneak a note past a very strict security guard (the AI model) who only tells you "Yes" or "No" when you ask whether your note is allowed through. You can't see the guard's internal checklist, you can't ask for hints, and you can only ask a limited number of questions before you run out of energy (the "query budget").

This is the challenge of Hard-Label Text Attacks.

The paper introduces a new method called PivotAttack. Instead of trying to brute-force your way through, PivotAttack uses a clever, efficient strategy to trick the guard. Here is how it works, explained with simple analogies:

1. The Old Way: "The Blind Search" (Outside-In)

Imagine you are trying to find the exit in a giant, dark maze. Older attack methods start at the very outside wall of the maze. They poke holes in the wall more or less at random, hoping to find a gap that leads to the exit.

The Problem: They waste a lot of energy poking the wrong walls. They might wander far away from where they started, making the note look weird and suspicious, and they run out of energy before finding the exit.

2. The New Way: "The Pivot Strategy" (Inside-Out)

PivotAttack changes the game. Instead of starting at the wall, it starts right where you are standing (the original text) and asks: "What are the specific things holding this sentence together?"

Think of a sentence like a house.

Most words are just furniture (chairs, lamps). If you move a lamp, the house still stands.
But some words are load-bearing walls. If you remove or change a load-bearing wall, the whole house collapses.

PivotAttack's goal is to find those load-bearing walls (which the paper calls Pivot Sets).

3. How It Finds the "Load-Bearing Walls"

The attacker doesn't know in advance which words are the walls. So, PivotAttack uses a smart guessing game called a Multi-Armed Bandit.

The Analogy: Imagine a row of slot machines (arms). You have a limited amount of coins (queries). You want to find the machine that pays out the most.
The Process: PivotAttack tries swapping out different words (pulling the levers). It quickly learns that swapping the word "happy" doesn't change the AI's mind (it's just furniture). But swapping "love" to "hate" makes the AI flip its decision (that was a load-bearing wall!).
The Result: It stops wasting coins on furniture and focuses entirely on the walls.

4. The "Collapse"

Once PivotAttack identifies the Pivot Set (the critical words), it doesn't just change one word randomly. It strategically swaps those specific "load-bearing" words with synonyms.

The Metaphor: Instead of trying to push the whole house over, it simply pulls out the one specific brick that holds the roof up. Crash! The house (the AI's prediction) collapses, and the label flips, but the rest of the house (the meaning of the sentence) looks almost exactly the same.

Why Is This a Big Deal?

Efficiency: Because it stops wasting time on "furniture," it finds the exit (the attack) much faster. It uses far fewer questions (queries) than other methods.
Stealth: Since it only changes the most critical words and leaves the rest alone, the sentence still sounds natural to a human. It doesn't look like gibberish.
Beating the Big Guys: The paper tested this on both old-school AI models and the newest, super-smart Large Language Models (like Qwen and Gemma). Even these powerful models have "load-bearing walls," and in the paper's experiments PivotAttack found them more reliably than the earlier methods it was compared against.

Summary

PivotAttack is like a master locksmith. Instead of trying every key in the keyring (the old way), it listens to the lock, finds the one specific pin that is holding the mechanism, and turns just that pin to open the door. It's smarter, faster, and harder to detect.

Here is a detailed technical summary of the paper "PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words".

1. Problem Statement

The paper addresses the challenge of hard-label black-box text adversarial attacks. In this setting:

Constraints: The attacker has no access to the model's gradients, internal states, or confidence scores. They can only query the model and receive a discrete class label (e.g., "Positive" or "Negative").
Goal: Generate an adversarial example $X'$ that causes the model to misclassify the input ( $f(X') \neq f(X)$ ) while preserving the semantic meaning of the original text.
Challenge: The search space is vast and discrete. Existing methods suffer from two main inefficiencies:
1. "Outside-in" Strategy: Many methods start with heavily perturbed text far from the original semantics and iteratively refine it toward the decision boundary. This traverses a massive search space, consuming excessive query budgets and degrading text quality.
2. Independent Token Scoring: Methods often score words independently (ignoring inter-word dependencies), leading to the selection of functional words rather than semantically critical "anchors," resulting in suboptimal perturbation sets.
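To make the hard-label setting concrete, here is a minimal Python sketch of the attacker's view: a wrapper that returns only a class label and counts every query against a budget. The names `HardLabelOracle`, `BudgetExceeded`, and `toy_sentiment` are hypothetical and chosen for illustration; the toy keyword classifier merely stands in for a real victim model.

```python
from typing import Callable, List


class BudgetExceeded(RuntimeError):
    """Raised when the attacker has used up all allowed queries."""


class HardLabelOracle:
    """Exposes only the predicted label of a classifier and counts queries."""

    def __init__(self, predict: Callable[[List[str]], str], budget: int):
        self._predict = predict
        self.budget = budget
        self.queries = 0

    def __call__(self, tokens: List[str]) -> str:
        if self.queries >= self.budget:
            raise BudgetExceeded(f"query budget of {self.budget} exhausted")
        self.queries += 1
        # No scores, gradients, or internal states leak out: just the label.
        return self._predict(tokens)


def toy_sentiment(tokens: List[str]) -> str:
    """Hypothetical stand-in victim: a keyword rule, not a real model."""
    return "Negative" if "hate" in tokens else "Positive"


oracle = HardLabelOracle(toy_sentiment, budget=100)
original_label = oracle("i love this film".split())
candidate_label = oracle("i hate this film".split())

# The attack goal f(X') != f(X) is checked purely on labels.
is_adversarial = candidate_label != original_label
```

Every call to `oracle` costs one query, so any strategy that explores many candidate texts runs into the budget quickly.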

2. Methodology: PivotAttack

PivotAttack introduces a novel "inside-out" strategy. Instead of searching from the outside in, it starts with the original text and identifies a Pivot Set—a minimal group of tokens that act as "load-bearing walls" for the model's prediction. Perturbing these specific tokens causes a disproportionate collapse in model confidence, efficiently driving the instance across the decision boundary.

The framework operates in two main stages:

A. Pivot Set Identification (Multi-Armed Bandit)

The core innovation is formulating the identification of the Pivot Set as a Pure-Exploration Multi-Armed Bandit (MAB) problem.

Objective: Find a set $S$ such that if non-pivot words are perturbed, the model's prediction remains unchanged with high probability (Retention Precision $p_S$ ).
Algorithm: The authors employ the KL-LUCB (Kullback-Leibler Lower-Upper Confidence Bound) algorithm.
- Arms: Each candidate token (or token combination) is an "arm."
- Reward: The retention precision (probability that the label remains unchanged when non-pivots are masked/changed).
- Process: The algorithm iteratively pulls arms (queries the model with masked variants) to tighten confidence bounds. It distinguishes true semantic anchors from statistical noise by prioritizing sets with high retention precision.
Optimization: The search minimizes the size of the Pivot Set ( $|S|$ ) to ensure stealthiness, subject to the constraint that the retention precision exceeds a threshold $\tau$ .
Pruning: A "Non-Actionable Attack Culling" step uses KL-divergence bounds to discard samples where the label is unlikely to flip, saving query budget.
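The confidence bounds at the heart of KL-LUCB can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exploration level `log(1 / delta)` is a simplification (the published KL-LUCB algorithm uses a level that also grows with the round number and the number of arms), and the function names and the three-way `decide` rule are hypothetical. Here `successes` is the number of masked variants, out of `n` queries, for which the label stayed unchanged.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_upper_bound(p_hat: float, n: int, level: float, iters: int = 50) -> float:
    """Largest q >= p_hat with n * KL(p_hat, q) <= level (bisection)."""
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * bernoulli_kl(p_hat, mid) > level:
            hi = mid
        else:
            lo = mid
    return lo


def kl_lower_bound(p_hat: float, n: int, level: float, iters: int = 50) -> float:
    """Smallest q <= p_hat with n * KL(p_hat, q) <= level (bisection)."""
    lo, hi = 0.0, p_hat
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * bernoulli_kl(p_hat, mid) > level:
            lo = mid
        else:
            hi = mid
    return hi


def decide(successes: int, n: int, tau: float, delta: float = 0.05) -> str:
    """Compare the bounds on retention precision with the threshold tau."""
    level = math.log(1 / delta)  # assumed simplified exploration level
    p_hat = successes / n
    if kl_lower_bound(p_hat, n, level) >= tau:
        return "accept"       # confidently above the threshold
    if kl_upper_bound(p_hat, n, level) < tau:
        return "reject"       # confidently below: stop spending queries
    return "sample more"      # bounds still straddle tau
```

The "reject" branch mirrors the spirit of the culling step: once the upper bound falls below the threshold, further queries on that candidate are unlikely to pay off.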

B. Perturbation Execution

Once the Pivot Set is identified:

Substitution Generation: For each pivot token, the system retrieves its $M$ nearest neighbors in a counter-fitted embedding space (one tuned so that synonyms sit close together and antonyms are pushed apart).
Selection: It selects the substitution that maximizes cosine similarity to the original sentence embedding to minimize semantic drift.
Dynamic Constraints: A dynamic perturbation rate threshold is applied based on the remaining query budget to balance stealth and the need for further modifications.
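The generation and selection steps can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the two-dimensional toy vectors in `EMB` stand in for counter-fitted embeddings, and `sentence_vector` uses a simple mean of word vectors because the sentence encoder is not specified in this summary. The dynamic perturbation threshold is left out for the same reason.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word: str, emb: Dict[str, List[float]], m: int) -> List[str]:
    """The m vocabulary words closest to `word` in the embedding space."""
    others = [w for w in emb if w != word]
    return sorted(others, key=lambda w: cosine(emb[word], emb[w]), reverse=True)[:m]


def sentence_vector(tokens: List[str], emb: Dict[str, List[float]]) -> List[float]:
    """Assumed sentence encoder: mean of the known word vectors."""
    dims = len(next(iter(emb.values())))
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dims
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dims)]


def best_substitution(tokens: List[str], pivot_index: int,
                      emb: Dict[str, List[float]], m: int) -> Optional[str]:
    """Pick the neighbor whose swapped-in sentence stays closest to the original."""
    original = sentence_vector(tokens, emb)
    best_word, best_sim = None, -1.0
    for cand in nearest_neighbors(tokens[pivot_index], emb, m):
        trial = tokens[:pivot_index] + [cand] + tokens[pivot_index + 1:]
        sim = cosine(original, sentence_vector(trial, emb))
        if sim > best_sim:
            best_word, best_sim = cand, sim
    return best_word


# Toy vectors, invented for this example only.
EMB = {
    "great": [1.0, 0.1],
    "good": [0.9, 0.2],
    "fine": [0.7, 0.5],
    "movie": [0.1, 1.0],
    "table": [-0.5, 0.3],
}
choice = best_substitution(["great", "movie"], pivot_index=0, emb=EMB, m=2)
```

In a full attack, each chosen substitution would still need to be sent to the victim model to check whether the label actually flips.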

3. Key Contributions

Paradigm Shift ("Inside-Out"): Moves away from boundary approximation (outside-in) to "breaking load-bearing walls" (inside-out), significantly reducing the search space.
Inter-word Dependency Modeling: Unlike methods that rank tokens in isolation, PivotAttack uses the MAB framework to explicitly capture combinatorial effects and inter-word interactions, identifying multi-word semantic anchors.
Interpretability: The MAB process generates human-readable intermediate outputs (identifying specific pivot words), making the attack behavior traceable and interpretable.
Effectiveness against LLMs: Demonstrates strong attack performance against both zero-shot and fine-tuned Large Language Models (LLMs), exposing vulnerabilities in models often assumed to be robust.

4. Experimental Results

The authors evaluated PivotAttack across five text classification datasets (Yelp, Yahoo, MR, Amazon, SST-2) and two textual entailment datasets (SNLI, MultiNLI) against various victim models (WordCNN, WordLSTM, BERT, DistilBERT, ALBERT, Qwen2.5, Gemma 3).

Attack Success Rate (ASR) & Query Efficiency:
- Under a strict 100-query budget, PivotAttack consistently outperformed state-of-the-art baselines (e.g., HyGloadAttack, TextHacker, LimeAttack, VIWHard).
- Example: On Qwen2.5 (Zero-shot), PivotAttack achieved 93.5% ASR with only 1.1% perturbation, whereas TextHacker required 4.0% perturbation for a lower success rate.
- On fine-tuned Qwen2.5, PivotAttack remained the top performer on 4 out of 5 datasets, indicating that it stays effective against more robust models.
Perturbation Quality: PivotAttack achieved significantly lower perturbation rates (fewer words changed) while maintaining high semantic similarity, indicating better stealth.
Ablation Studies:
- Removing the Pivot Set identification (randomizing the set) caused the largest drop in ASR, confirming the importance of targeting specific anchors.
- The KL-LUCB component was shown to be critical for refining the Pivot Set selection.
Human Evaluation: In a study with 10 participants, PivotAttack's selected pivot words were judged as more "reasonable" and predictive of the model's decision compared to LimeAttack, which often highlighted trivial functional words.
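For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute them. It is an illustration under assumptions: the `AttackRecord` type is hypothetical, perturbation rate is taken as the fraction of changed word positions, and it is averaged over successful attacks only, which may differ from the paper's exact protocol. The records are made-up examples, not results from the paper.

```python
from typing import List, NamedTuple, Tuple


class AttackRecord(NamedTuple):
    original: List[str]
    adversarial: List[str]
    succeeded: bool  # True if the victim's label flipped within budget


def perturbation_rate(original: List[str], adversarial: List[str]) -> float:
    """Fraction of word positions changed (word-level substitution only)."""
    if len(original) != len(adversarial):
        raise ValueError("substitution attacks keep the token count fixed")
    changed = sum(o != a for o, a in zip(original, adversarial))
    return changed / len(original)


def summarize(records: List[AttackRecord]) -> Tuple[float, float]:
    """Return (attack success rate, mean perturbation rate of successes)."""
    wins = [r for r in records if r.succeeded]
    asr = len(wins) / len(records)
    if not wins:
        return asr, 0.0
    mean_pert = sum(perturbation_rate(r.original, r.adversarial) for r in wins) / len(wins)
    return asr, mean_pert


# Made-up records for illustration only.
records = [
    AttackRecord("a truly great film".split(), "a truly good film".split(), True),
    AttackRecord("not worth the time".split(), "not worth the time".split(), False),
]
asr, mean_pert = summarize(records)
```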

5. Significance and Limitations

Significance: PivotAttack reframes how hard-label attacks are approached, providing evidence that identifying and targeting "semantic anchors" is more efficient than random or gradient-approximated searches. It highlights a critical vulnerability in modern LLMs, suggesting that even fine-tuned models can be fooled by minimal, targeted edits to specific pivot words.
Limitations: The KL-LUCB component is relatively query-intensive. To manage the budget, the current implementation uses a greedy search strategy rather than more advanced (but costlier) strategies like beam search. Future work aims to reduce the query cost of the MAB component.
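To see why the search is query-hungry even when greedy, consider the rough sketch below. It is not the authors' algorithm: it replaces the KL-LUCB bounds with a plain Monte Carlo estimate, masks each non-pivot token with probability 0.5, and uses a hypothetical `[MASK]` token and toy classifier. Each greedy step re-estimates retention precision for every remaining token, so the query cost grows with both sentence length and the number of samples per estimate.

```python
import random
from typing import Callable, List, Set, Tuple

MASK = "[MASK]"  # hypothetical mask token


def retention_precision(tokens: List[str], pivots: Set[int],
                        predict: Callable[[List[str]], str],
                        samples: int, rng: random.Random) -> float:
    """Share of randomly masked variants (pivots kept) that keep the label."""
    base = predict(tokens)
    kept = 0
    for _ in range(samples):
        variant = [
            tok if i in pivots or rng.random() < 0.5 else MASK
            for i, tok in enumerate(tokens)
        ]
        kept += predict(variant) == base
    return kept / samples


def greedy_pivot_set(tokens: List[str], predict: Callable[[List[str]], str],
                     tau: float, samples: int = 30,
                     seed: int = 0) -> Tuple[Set[int], float]:
    """Grow the pivot set one token at a time until precision reaches tau."""
    rng = random.Random(seed)
    pivots: Set[int] = set()
    score = retention_precision(tokens, pivots, predict, samples, rng)
    while score < tau and len(pivots) < len(tokens):
        scored = [
            (retention_precision(tokens, pivots | {i}, predict, samples, rng), i)
            for i in range(len(tokens)) if i not in pivots
        ]
        score, best = max(scored)  # greedy: keep only the single best extension
        pivots.add(best)
    return pivots, score


def toy_model(tokens: List[str]) -> str:
    """Hypothetical stand-in victim: a keyword rule, not a real model."""
    return "Negative" if "hate" in tokens else "Positive"


pivots, score = greedy_pivot_set("i hate this film".split(), toy_model, tau=0.9)
```

A beam search would keep several candidate sets alive at each step instead of one, multiplying this cost by the beam width, which is consistent with the trade-off described above.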

In conclusion, PivotAttack offers an efficient and interpretable framework for hard-label text attacks, reporting state-of-the-art success rates and query efficiency across both traditional models and large language models.