Linear Model Extraction via Factual and Counterfactual Queries

This paper investigates the security of linear models against extraction attacks. The authors derive mathematical formulations for the models' classification regions and establish bounds on the number of factual, counterfactual, and robust counterfactual queries required to recover the model parameters, demonstrating that the choice of distance metric and the robustness requirement significantly affect the attack's efficiency.

Daan Otto, Jannis Kurtz, Dick den Hertog, Ilker Birbil

Published 2026-03-04

Imagine you have a secret recipe for a perfect cake. This recipe is a "black box" machine learning model. You can't see the ingredients (the parameters), but you can taste the cake (get a prediction) whenever you bring in a specific set of ingredients (data points).

This paper is about hacking that recipe. The authors ask: "If I can ask the baker questions, how many questions do I need to ask to figure out the exact recipe?"

They look at three types of questions you can ask the baker:

1. The Three Types of Questions

  • Factual Queries (The "Taste Test"):

    • The Question: "If I use 2 cups of flour and 1 egg, will the cake be 'Good' or 'Bad'?"
    • The Result: The baker just says "Good" or "Bad."
    • The Paper's Finding: If you just ask these, you can narrow down the recipe, but you need a lot of questions to get it perfect. It's like trying to guess a shape by only touching its edges.
  • Counterfactual Queries (The "What-If" Change):

    • The Question: "I used 2 cups of flour and it was 'Bad'. What is the smallest change I can make to the ingredients to make it 'Good'?"
    • The Result: The baker says, "Change the flour to 2.1 cups."
    • The Paper's Finding: This is a super-powerful question. It tells you exactly where the "line" is between Good and Bad.
    • The Twist: It depends on how you measure "change."
      • If you measure change with a smooth ruler (a "differentiable" norm), a single question is enough to steal the whole recipe!
      • If you measure change with a blocky, pixelated ruler (a "non-differentiable" norm like counting whole cups only), you need to ask many more questions (specifically, one for every ingredient you have) to figure it out.
  • Robust Counterfactual Queries (The "Safe" Change):

    • The Question: "I want to change the ingredients so the cake is 'Good', but I want to be sure that even if I accidentally add a tiny bit of extra salt or sugar (noise), it stays 'Good'."
    • The Result: The baker gives you a change that is "safe" from small mistakes.
    • The Paper's Finding: This is the safest way to protect the recipe. Because the baker has to give you a "buffer zone" of safety, it takes twice as many questions to steal the recipe compared to the normal "What-If" question.
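To make the "smooth ruler" finding concrete, here is a minimal numerical sketch (not the paper's algorithm; the hidden model `w`, `b` and the query point are made up). For a linear classifier, the closest counterfactual under the Euclidean norm is the orthogonal projection onto the decision boundary, so the step it reveals points straight along the hidden weight vector:

```python
import numpy as np

# Hypothetical "secret" linear model: predict Good iff w @ x + b >= 0.
rng = np.random.default_rng(0)
w = rng.normal(size=3)          # hidden weights (the recipe)
b = rng.normal()                # hidden bias

def counterfactual_l2(x):
    """Closest point on the decision boundary under the Euclidean norm."""
    return x - ((w @ x + b) / (w @ w)) * w

x = rng.normal(size=3)          # any starting query point
cf = counterfactual_l2(x)

# The step (cf - x) is parallel to w: one query reveals the model's
# orientation (w up to scale), and because cf lies exactly on the
# boundary, the bias b is pinned down too.
print(np.allclose(np.cross(cf - x, w), 0))   # step is parallel to w
print(np.isclose(w @ cf + b, 0))             # cf sits on the boundary
```

Under a non-differentiable "blocky" norm such as the L1 distance, the optimal step instead typically changes a single coordinate, so each query pins down only one component of the recipe; hence the paper's roughly one-query-per-ingredient count.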

2. The Big Analogy: The Invisible Wall

Imagine the machine learning model is a giant, invisible wall dividing a field into two sides: Side A (Yes) and Side B (No).

  • Factual Queries are like throwing darts at the field. If you hit Side A, you know that spot is A. If you hit Side B, you know that spot is B. You can draw a rough map of where the wall might be, but you don't know the exact angle.
  • Counterfactual Queries are like asking, "I'm standing on Side A. What is the shortest step I can take to cross the wall?"
    • If the ground is smooth (smooth math), that shortest step points directly at the wall's angle. You know the wall's orientation immediately.
    • If the ground is made of stairs (blocky math), that shortest step might just be "go up one stair." It doesn't tell you the exact angle of the wall, so you have to try stepping in different directions until you map the whole wall.
  • Robust Counterfactual Queries are like asking, "I want to cross the wall, but I need to be sure I don't slip back if I take a clumsy step." The model has to push you further past the wall to be safe. This extra distance hides the wall's exact location, making it harder for the hacker to figure out the recipe.
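The dart-throwing picture can be sharpened with a sketch. With factual (label-only) queries, each answer only halves your uncertainty along one line, so finding even a single point on the wall takes many queries. A minimal illustration, with a made-up hidden model, of bisecting between a Good point and a Bad point:

```python
import numpy as np

# Hypothetical hidden model: Good iff w @ x + b >= 0.
rng = np.random.default_rng(1)
w = rng.normal(size=2)
b = 0.5

def oracle(x):
    """Factual query: the baker only says Good (True) or Bad (False)."""
    return w @ x + b >= 0

def boundary_point(x_good, x_bad, tol=1e-9):
    """Bisect the segment between a Good and a Bad point.
    Each factual query halves the remaining uncertainty, converging
    to one point on the invisible wall."""
    lo, hi = x_bad, x_good          # invariant: oracle(hi) True, oracle(lo) False
    while np.linalg.norm(hi - lo) > tol:
        mid = (lo + hi) / 2
        if oracle(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

p = boundary_point(10 * w, -10 * w)   # points on opposite sides of the wall
print(abs(w @ p + b) < 1e-6)          # p lies (almost) on the wall
```

Even after the dozens of queries this bisection costs, one boundary point says nothing about the wall's angle; you need to repeat the process in other directions, which is why label-only extraction is so query-hungry.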

3. Why Does This Matter?

This paper is a warning and a guide for security:

  1. Privacy Risk: If a company gives you "Counterfactual Explanations" (e.g., "Your loan was denied, but if your income was $500 higher, it would be approved"), they might be accidentally giving away their secret algorithm.
  2. The "Smoothness" Trap: If the system uses smooth math to calculate these explanations, a hacker can steal the entire model with just one question.
  3. The Defense: To protect the model, companies should use "blocky" math (non-differentiable norms) for these explanations. This forces hackers to ask many more questions, making the theft much harder.
  4. Robustness is a Shield: If the explanations are "Robust" (accounting for small errors), it adds an extra layer of protection, doubling the effort needed to steal the model.
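The "buffer zone" defense can be sketched under the same assumptions as before (a made-up linear model, Euclidean distance, and a hypothetical robustness radius `eps`): the robust counterfactual is the boundary projection pushed an extra `eps` into the Good side, so the point the attacker receives no longer lies on the boundary itself.

```python
import numpy as np

# Hypothetical hidden model: Good iff w @ x + b >= 0.
rng = np.random.default_rng(2)
w = rng.normal(size=3)
b = -1.0
eps = 0.1                            # hypothetical robustness buffer

def robust_counterfactual(x):
    """Closest Good point that stays Good under any Euclidean
    perturbation of size <= eps: project onto the boundary, then
    push an extra eps along the normal into the Good side."""
    on_boundary = x - ((w @ x + b) / (w @ w)) * w
    return on_boundary + (eps / np.linalg.norm(w)) * w

x = rng.normal(size=3)
cf = robust_counterfactual(x)

# The returned point lands eps *inside* the Good region, not on the
# wall, so a single query reveals the direction of w but only a
# shifted copy of the boundary.
signed_distance = (w @ cf + b) / np.linalg.norm(w)
print(np.isclose(signed_distance, eps))
```

The attacker still learns the wall's orientation, but the exact offset is hidden behind the buffer, which is consistent with the paper's finding that robust counterfactuals roughly double the number of queries needed.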

The Takeaway

In the world of AI, explanations are a double-edged sword. They help humans understand the AI, but they can also help hackers steal the AI's brain.

  • Smooth explanations = Easy to steal.
  • Blocky or Robust explanations = Harder to steal.

The authors have mapped out exactly how many "questions" a hacker needs to ask to steal a linear model, proving that the way you design your explanations directly impacts how secure your AI really is.
