Finite-Sample Decision Instability in Threshold-Based Process Capability Approval

This study shows that process capability decisions based on fixed thresholds (e.g., C_{pk} \geq 1.33) and moderate sample sizes carry inherent instability and boundary risk: the probability of acceptance converges to 0.5 when the true capability equals the threshold. The finding is supported by asymptotic theory, simulations, and empirical data from 880 manufacturing dimensions.

Fei Jiang, Lei Yang

Published Fri, 13 Ma

Imagine you are a quality inspector at a factory. Your job is to decide if a new batch of parts is good enough to ship to a customer. You have a strict rule: "If the quality score is 1.33 or higher, we ship it. If it's lower, we reject it."

This sounds simple and fair, right? But this paper reveals a hidden trap in that simple rule, especially when you only have a small amount of data to work with.

Here is the story of Decision Instability, explained through a few everyday analogies.

1. The "Fuzzy Ruler" Problem

Imagine you are trying to measure a piece of wood to see if it is exactly 10 inches long. You have a ruler, but it's a bit wobbly, and you can only take a few measurements (maybe 30 or 50).

  • The Reality: The wood might actually be exactly 10 inches long.
  • The Measurement: Because your ruler is wobbly and you only measured it a few times, your average measurement might come out as 10.05 inches (Pass!) or 9.95 inches (Fail!).

In the world of manufacturing, the "quality score" (called C_{pk}) is like that measurement. It's not a fixed number; it's an estimate based on a small sample. Because it's an estimate, it has "noise" or "jitter."
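The "jitter" is easy to see in a quick simulation. The sketch below uses the standard C_{pk} formula, min(USL − mean, mean − LSL) / (3 × standard deviation), with made-up spec limits and a process whose true C_{pk} is exactly 1.33; none of these numbers come from the paper:

```python
import random
import statistics

def cpk_hat(data, lsl, usl):
    """Estimate Cpk from a sample: distance from the sample mean to the
    nearer spec limit, in units of three sample standard deviations."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return min(usl - m, m - lsl) / (3 * s)

random.seed(0)
lsl, usl = 9.0, 11.0                      # illustrative spec limits
mu = 10.0                                 # process centered between them
sigma = (usl - mu) / (3 * 1.33)           # chosen so the true Cpk is 1.33

# Re-run the same "30-part inspection" 2000 times and watch the estimate jump around
estimates = [
    cpk_hat([random.gauss(mu, sigma) for _ in range(30)], lsl, usl)
    for _ in range(2000)
]
print(min(estimates), max(estimates))     # a wide scatter around 1.33
```

Each run draws from the exact same process, yet some inspections would pass it and others would fail it.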

2. The "Coin Flip" at the Edge

The paper's biggest discovery is what happens when a part is right on the edge of the rule.

Let's say the rule is 1.33.

  • If the true quality of the part is exactly 1.33, what happens?
  • Because of the "wobble" in your measurements, sometimes your calculation will say 1.34 (Pass), and sometimes it will say 1.32 (Fail).

The paper proves mathematically that if the true quality is exactly on the line, your decision becomes a 50/50 coin flip.

  • 50% of the time: you approve the part.
  • 50% of the time: you reject that very same part, even though its true quality meets the rule exactly.

The Analogy: Imagine a tightrope walker standing exactly in the middle of a rope. If the wind blows even a tiny bit (which is like the random noise in your data), they will fall to the left or the right with equal probability. The "decision" of which side they land on is completely unstable, even though the walker is perfectly balanced.
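The boundary coin flip shows up in a minimal Monte Carlo sketch. The spec limits, sample size, and off-center process mean below are all invented for illustration, not taken from the paper:

```python
import random
import statistics

def cpk_hat(data, lsl, usl):
    """Sample Cpk estimate (hypothetical spec limits)."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return min(usl - m, m - lsl) / (3 * s)

random.seed(1)
lsl, usl, n, threshold = 9.0, 11.0, 30, 1.33
mu = 10.5                                # process runs nearer the upper limit
sigma = (usl - mu) / (3 * threshold)     # so the true Cpk sits exactly on the line

reps = 5000
passes = sum(
    cpk_hat([random.gauss(mu, sigma) for _ in range(n)], lsl, usl) >= threshold
    for _ in range(reps)
)
print(f"acceptance rate at the boundary: {passes / reps:.3f}")  # hovers near one-half
```

With only 30 parts per inspection, the pass/fail verdict for a truly borderline process is close to a fair coin.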

3. The "Danger Zone" (The Ridge)

The authors found that this instability isn't just for parts that are exactly 1.33. It happens for anything close to 1.33.

They call this the "Instability Ridge."

  • If a part is way above 1.33 (say, 1.80), you are almost 100% sure to pass it.
  • If a part is way below 1.33 (say, 0.80), you are almost 100% sure to reject it.
  • But if a part is in the "Danger Zone" (between 1.25 and 1.40), your decision is shaky.

The paper looked at 880 real-world factory dimensions and found that about 11% of them were sitting right in this Danger Zone. This means that for a huge chunk of real products, the decision to ship or scrap them is essentially a gamble based on how the random numbers happened to land that day.
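The whole ridge can be traced by sweeping the true capability and estimating the acceptance probability at each point. This sketch again uses invented spec limits and n = 30, and a centered process for simplicity:

```python
import random
import statistics

def accept_prob(true_cpk, n=30, threshold=1.33, reps=2000, seed=2):
    """Monte Carlo acceptance probability for a centered process with the
    given true Cpk (illustrative spec limits 9..11)."""
    rng = random.Random(seed)
    lsl, usl, mu = 9.0, 11.0, 10.0
    sigma = (usl - mu) / (3 * true_cpk)   # back out sigma from the true Cpk
    passes = 0
    for _ in range(reps):
        data = [rng.gauss(mu, sigma) for _ in range(n)]
        m, s = statistics.mean(data), statistics.stdev(data)
        if min(usl - m, m - lsl) / (3 * s) >= threshold:
            passes += 1
    return passes / reps

# Far below: near 0.  Far above: near 1.  Inside the ridge: well away from either.
for c in (0.80, 1.25, 1.33, 1.40, 1.80):
    print(c, accept_prob(c))
```

Values deep in the "Danger Zone" land nowhere near 0 or 1, which is exactly the instability the paper describes.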

4. Why "More Data" Doesn't Fully Fix It

You might think, "If I measure 1,000 parts instead of 30, the wobble goes away, right?"

Yes, the wobble gets smaller, but the "Danger Zone" just gets narrower. It never disappears completely unless the part is far away from the line.

  • With a small sample (30 parts), the danger zone is wide.
  • With a huge sample (1,000 parts), the danger zone is a thin sliver.

But in real life, factories often don't have time or money to measure thousands of parts. They measure 30 or 50. In this "small sample" world, the danger zone is wide enough to catch many products.
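The shrinking-but-never-vanishing zone can be quantified with a common large-sample approximation for the standard error of the C_{pk} estimate (treat the exact constants as an assumption of this sketch, not the paper's formula):

```python
import math

def cpk_se(cpk, n):
    """Approximate standard error of the Cpk estimate for a sample of size n
    (a standard large-sample approximation)."""
    return math.sqrt(1 / (9 * n) + cpk**2 / (2 * (n - 1)))

# Rough "danger zone": threshold plus or minus two standard errors
threshold = 1.33
for n in (30, 100, 1000):
    half_width = 2 * cpk_se(threshold, n)
    print(n, round(threshold - half_width, 2), round(threshold + half_width, 2))
```

The zone narrows roughly like 1/sqrt(n): a thousand parts squeezes it to a sliver, but 30 parts leaves it wide.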

5. The Solution: The "Safety Buffer"

So, what should factories do? The paper suggests we stop treating the rule as a hard, sharp line.

Instead of saying "Pass if \ge 1.33," we should add a Safety Buffer (or a "Guard Band").

  • Old Rule: Pass if Score \ge 1.33.
  • New Smart Rule: Pass if Score \ge 1.62 (or whatever the math says is safe).

The Metaphor: Imagine a parking spot.

  • Old Way: "If your bumper is past the line, you're parked." (If you are exactly on the line, you might get a ticket or not, depending on the officer's mood).
  • New Way: "You must park at least 6 inches inside the line to be considered parked."

By moving the goalpost further away from the edge, you ensure that even with the "wobble" of your measurements, you are still safely inside the "Pass" zone.
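As a sketch of how such a buffer might be computed (one plausible construction, not necessarily how the paper arrives at a number like 1.62): require the estimate to clear the nominal threshold by a multiple of its approximate standard error.

```python
import math

def guarded_threshold(threshold, n, z=1.645):
    """One simple guard-band rule (an illustrative construction, not
    necessarily the paper's): demand the Cpk estimate beat the nominal
    threshold by z approximate standard errors of the estimator."""
    se = math.sqrt(1 / (9 * n) + threshold**2 / (2 * (n - 1)))
    return threshold + z * se

print(round(guarded_threshold(1.33, 30), 2))    # ~1.63 for n = 30
print(round(guarded_threshold(1.33, 1000), 2))  # shrinks toward 1.33 as n grows
```

Notice the buffer adapts to sample size: small inspections demand a bigger safety margin, large ones barely move the goalpost.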

Summary

  • The Problem: Using a fixed number (like 1.33) to make a Pass/Fail decision is risky when you only have a small sample of data.
  • The Surprise: If a product is right on the edge, the decision is a coin flip. You might reject a good product or accept a bad one purely by chance.
  • The Reality: Many real-world products sit right on this edge, making their approval status unstable.
  • The Fix: We need to add a "safety margin" to our rules. We shouldn't just look at the number; we need to account for the fact that our measurement is a bit fuzzy.

In short: Don't trust a single number on the edge. If you are standing on the line, you aren't really "safe" until you take a step back.