Central subspace data depth

Imagine you are standing in a crowded room full of people. In traditional statistics, if you wanted to find the "center" of this crowd, you would look for the single person standing right in the middle. Everyone else is ranked by how far they are from that one person. This is called Data Depth. It's like a map where the person in the middle is the "deepest" point, and the people on the edges are "shallow."

But what if the crowd isn't a blob? What if they are all standing in a long, straight line, like a queue for a rollercoaster?

If you try to find the "center" of that line using the old method, you might pick a spot in the middle of the line. But that doesn't feel right. The "center" of a line isn't a single dot; it's the line itself. The people standing right on the line are the most "central," and the people wandering off into the crowd are the outliers.

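This paper introduces a statistical tool called Central Subspace Data Depth, which lets the "center" of a dataset be a line, a plane, or another flat subspace instead of a single point. Here is the breakdown in simple terms: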

1. The Problem: The Wrong Shape

The authors argue that for many real-world problems, data doesn't form a ball or a cloud; it forms a line, a plane, or a sheet.

The Old Way: Tries to find a single "center point." If the data lie along a line, one point cannot describe that structure, so the method may miss it.
The New Way: Recognizes that the "center" can be a whole subspace (a line, a flat surface, etc.). It asks, "What is the best line that runs through the middle of this data?"

2. The Analogy: The "Best Fit" Line vs. The "Best Fit" Dot

Think of a scatter of raindrops on a window.

Traditional Depth: You try to find the one specific drop that is the "heart" of the storm. You measure how far every other drop is from that single heart.
Central Subspace Depth: You realize the drops are flowing down the glass in a specific direction. Instead of finding one heart, you find the main current (the line). You measure how close every drop is to that current. The drops on the current are the "deepest" (most central). The drops far away from the current are the "shallow" ones (outliers).

3. Why Does This Matter? (The Fraud Detective)

The paper uses a real-world example to show why this is powerful: Customs Fraud Detection.

Imagine the European Union is checking imports. They look at two numbers for every shipment:

Weight (How heavy is it?)
Declared Value (How much money is it worth?)

Usually, heavy things are expensive, and light things are cheap. If you plot this data, most honest shipments form a straight line.

The Fraud: A smuggler might declare a very heavy shipment as having a very low value to avoid taxes. On the graph, this point would be far away from the "main line" of honest trade.

The Old Method: Might look for the "average" point in the whole cloud. It might miss the fraud because the fraudster is just a little bit off the average, but still within the general "cloud."
The New Method: Finds the "Main Line" of honest trade first. Then, it measures how far away every shipment is from that line.

If a shipment is far from the line, it's a red flag.
The paper shows this method is much better at spotting these "red flags" (fraud) because it understands that the "normal" behavior is a line, not a dot.
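To make the contrast concrete, here is a minimal sketch with made-up numbers. The shipments, the suspect point, and the assumption that the honest line is already known (value = weight on a log scale) are all hypothetical and are not taken from the paper's data.

```python
import math

# Hypothetical (log-weight, log-value) pairs lying near the line value = weight.
honest = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.1), (4.0, 3.9), (5.0, 5.0)]
# Hypothetical suspect: mid-range weight, but a declared value that is far too low.
suspect = (3.5, 2.0)

def distance_to_point(p, center):
    # Euclidean distance from p to a single central point.
    return math.hypot(p[0] - center[0], p[1] - center[1])

def distance_to_line(p, slope=1.0, intercept=0.0):
    # Perpendicular distance from p to the line value = slope * weight + intercept.
    # The slope and intercept are assumed known here purely for illustration.
    return abs(slope * p[0] - p[1] + intercept) / math.hypot(slope, 1.0)

# Point-based view: use the average honest shipment as the center.
center = (sum(w for w, _ in honest) / len(honest),
          sum(v for _, v in honest) / len(honest))

point_scores = [distance_to_point(p, center) for p in honest]
line_scores = [distance_to_line(p) for p in honest]

suspect_point = distance_to_point(suspect, center)
suspect_line = distance_to_line(suspect)

print("closer to the center than some honest shipments:", suspect_point < max(point_scores))
print("farther from the line than every honest shipment:", suspect_line > max(line_scores))
```

In this toy data the suspect sits closer to the average point than the honest shipments at the two ends of the line, so a point-based ranking does not single it out, while its distance from the line is far larger than that of any honest shipment.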

4. How It Works (The "Dispersion" Meter)

The authors created a mathematical tool to find this "best line."

They imagine sliding a ruler (a line) through the data in every possible direction.
For each direction, they measure the dispersion (how spread out the data is perpendicular to that line).
They pick the line where the data is least spread out. This is the "Central Subspace."
Once they have this line, they rank every data point based on how close it is to the line.
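The four steps above can be sketched in two dimensions as follows. This is an illustration only: the median absolute deviation is used as a stand-in for the paper's depth-based dispersion measure, the direction search is a plain grid over angles, and all function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics

def perpendicular_offsets(points, angle):
    # Signed offset of each point along the unit normal of a line with this angle.
    nx, ny = -math.sin(angle), math.cos(angle)
    return [x * nx + y * ny for x, y in points]

def spread(values):
    # Median absolute deviation: a stand-in dispersion measure (not the paper's).
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def best_line(points, n_angles=180):
    # Try n_angles directions in [0, pi) and keep the one with the least
    # perpendicular spread; the line is placed at the median offset.
    angles = [math.pi * k / n_angles for k in range(n_angles)]
    angle = min(angles, key=lambda a: spread(perpendicular_offsets(points, a)))
    offset = statistics.median(perpendicular_offsets(points, angle))
    return angle, offset

def distance_ranks(points, angle, offset):
    # Rank 0 = closest to the line (most central); larger rank = more outlying.
    dists = [abs(o - offset) for o in perpendicular_offsets(points, angle)]
    order = sorted(range(len(points)), key=dists.__getitem__)
    ranks = [0] * len(points)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

rng = random.Random(0)
# Synthetic data along a 45-degree line, plus one point placed far off the line.
points = [(t + rng.gauss(0, 0.1), t + rng.gauss(0, 0.1))
          for t in (rng.uniform(-5, 5) for _ in range(200))]
points.append((0.0, 3.0))

angle, offset = best_line(points)
ranks = distance_ranks(points, angle, offset)
print("estimated angle (degrees):", round(math.degrees(angle), 1))
print("rank of the off-line point:", ranks[-1], "of", len(points) - 1)
```

A grid search is only practical in low dimensions; it is used here because it mirrors the "sliding ruler" picture directly.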

5. The Result: A Better Map

By using this new method, statisticians can:

See the structure: They can tell if data is a ball, a line, or a flat sheet.
Find outliers better: They can spot the "weird" data points that don't fit the pattern (like the fraudsters) much more accurately.
Reduce complexity: They can simplify complex 3D or 4D data down to a simple 1D line without losing the important story.

Summary

Think of this paper as upgrading the GPS for data.

Old GPS: "You are 5 miles from the center of the city." (Good for a round city).
New GPS: "You are 5 miles from the main highway." (Perfect for a city built along a river or a road).

The authors have given statisticians a new tool to find the "highway" hidden inside messy data, making it much easier to spot the cars that are driving off-road.

Here is a detailed technical summary of the paper "Central subspace data depth" by Francisci and Agostinelli.

1. Problem Statement

Traditional statistical data depth functions (e.g., halfspace depth, simplicial depth) provide a "center-outward" ordering of multivariate data based on a single central point (dimension $p=0$). While effective for symmetric distributions centered at a point, many real-world datasets exhibit a linear or subspace structure rather than a point-symmetric structure.

In such cases, forcing a point-based depth measure fails to capture the intrinsic geometry of the data. For example, in customs fraud detection, import weights and declared values often lie along a linear trend. A point-based depth would identify a single "deepest" point, potentially missing the fact that the entire line represents the "center" of the distribution. The authors address the need for a framework where the "center" is a subspace of dimension $p$ ($0 \le p \le m-1$), allowing for a center-outward ordering relative to that subspace.

2. Methodology

The proposed methodology introduces Central Subspace Data Depth (CSDD) and a corresponding Dispersion Measure to identify the optimal subspace.

A. Symmetry with Respect to a Subspace

The authors extend standard symmetry definitions (halfspace, angular, central, elliptical) to subspaces.

Definition: A random variable $X$ is symmetric with respect to a subspace $S_p$ if the projection of $X$ onto the orthogonal complement subspace $S_q$ (where $q = m-p$) is symmetric in the usual point-sense.
This allows the definition of a "center" as a subspace $S_p$ rather than a point $\mu$.

B. Deep Immersion and Dispersion Measure

To find the optimal central subspace, the authors utilize a dispersion measure based on data depth.

Dispersion Measure ($\sigma$): Defined as the integral of the data depth over the space: $\sigma(F) = \int_{\mathbb{R}^m} d(x, F)\, dx$.
Deep Immersion: A random variable $X$ is "deeply immersed" in a subspace $S_q$ if the dispersion of the projected variable $B_q X$ (where $B_q$ projects onto $S_q$) is minimal among all $q$-dimensional subspaces.
Central Subspace: The orthogonal complement $S_p$ to the minimizing subspace $S_q$ is defined as the Central Subspace. The data is most concentrated (least dispersed) in the directions orthogonal to $S_p$ .
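For a one-dimensional projection ($q = 1$), the dispersion integral can be evaluated exactly on a sample. The sketch below assumes the empirical halfspace (Tukey) depth as the depth function $d$, which is one of the standard choices mentioned in this summary but is my choice for illustration; the function name is hypothetical. Between consecutive order statistics the empirical depth is constant, so the integral reduces to a finite sum.

```python
def depth_dispersion_1d(sample):
    # Integral over the real line of the empirical univariate halfspace depth
    #   d(x) = min(#{x_i <= x}, #{x_i >= x}) / n.
    # On the gap between the k-th and (k+1)-th order statistics the depth
    # equals min(k, n - k) / n, and it is zero outside the sample range.
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for k in range(1, n):
        gap = xs[k] - xs[k - 1]
        total += min(k, n - k) / n * gap
    return total

base = [0, 1, 2, 3]
print(depth_dispersion_1d(base))                       # 1.0
print(depth_dispersion_1d([2 * v for v in base]))      # 2.0 (doubling the scale doubles the dispersion)
print(depth_dispersion_1d([v + 10 for v in base]))     # 1.0 (shifting leaves it unchanged)
```

The scale and shift behaviour shown in the last two lines is what makes the quantity usable as a dispersion measure: a tightly concentrated projection yields a small value, a spread-out one a large value.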

C. Central Subspace Data Depth (CSDD)

Once the optimal subspace $S_p$ is identified, the CSDD is defined for any affine subspace $S_{B_q}(y)$ (a translate of $S_p$ at offset $y$) as:
$d_S(S_{B_q}(y), F) = d(y, F_{B_q})$
where $d$ is a standard data depth function (e.g., halfspace depth) applied to the projected distribution $F_{B_q}$.
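As a rough illustration of this definition, the sketch below handles the simplest case: two-dimensional data with a central line ($p = 1$, $q = 1$), so that $B_q$ is a single unit normal vector and $d$ is the univariate halfspace depth. The function names, the sample data, and the assumption that the normal direction is already known are all hypothetical.

```python
import math

def halfspace_depth_1d(y, sample):
    # Empirical univariate halfspace depth of y within the sample.
    n = len(sample)
    below = sum(1 for v in sample if v <= y)
    above = sum(1 for v in sample if v >= y)
    return min(below, above) / n

def csdd_line_depth(point, normal, data):
    # Depth of the line through `point` parallel to the central line:
    # project everything onto the unit normal (the role of B_q for q = 1)
    # and evaluate the ordinary 1D depth of the projected offset y.
    norm = math.hypot(normal[0], normal[1])
    b = (normal[0] / norm, normal[1] / norm)

    def project(p):
        return b[0] * p[0] + b[1] * p[1]

    return halfspace_depth_1d(project(point), [project(p) for p in data])

# Hypothetical data scattered around the line y = x, whose normal is (-1, 1).
data = [(0.0, 0.0), (1.0, 1.2), (2.0, 1.8), (3.0, 3.1), (4.0, 3.9)]
normal = (-1.0, 1.0)

on_line_near = csdd_line_depth((0.0, 0.0), normal, data)
on_line_far = csdd_line_depth((10.0, 10.0), normal, data)
off_line = csdd_line_depth((10.0, 13.0), normal, data)
print(on_line_near, on_line_far, off_line)
```

The points (0, 0) and (10, 10) receive the same depth because the depth belongs to the whole line through them, not to a position along it; a point-based depth would rank (10, 10) as far from the center.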

Properties: The authors prove that CSDD satisfies invariance (location, scale, rotation, reflection), maximality at the central subspace, monotonicity, and convergence to zero at infinity.

D. Selection of Optimal Dimension ($p$)

The paper proposes a recursive algorithm to determine the optimal dimension $p^*$:

Start with $p=1$ (assuming a 1D linear structure).
Compute the optimal projection direction $B_p$ that minimizes dispersion.
Project the data onto the orthogonal complement and test for spherical symmetry (uniformity of directions) using a Rayleigh test.
If the null hypothesis of spherical symmetry is not rejected, stop; the current $p$ is optimal. If it is rejected, increment $p$ and repeat.
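The uniformity check in the third step can be illustrated with the classical Rayleigh test for directions on the unit circle. The summary does not say which variant of the test the paper uses, so the sketch below is an assumption: it covers only two-dimensional residual vectors ($q = 2$), uses the first-order approximation $p \approx \exp(-n \bar{R}^2)$ for the p-value, and the function name is hypothetical.

```python
import math

def rayleigh_test_2d(vectors):
    # Reduce each nonzero vector to its direction on the unit circle.
    dirs = []
    for x, y in vectors:
        r = math.hypot(x, y)
        if r > 0:
            dirs.append((x / r, y / r))
    n = len(dirs)
    # Squared length of the mean direction; near 0 when directions are uniform.
    mean_x = sum(d[0] for d in dirs) / n
    mean_y = sum(d[1] for d in dirs) / n
    r_bar_sq = mean_x ** 2 + mean_y ** 2
    # 2 * n * R^2 is approximately chi-square with 2 degrees of freedom under
    # uniformity, whose upper tail probability is exp(-statistic / 2).
    statistic = 2 * n * r_bar_sq
    p_value = math.exp(-statistic / 2)
    return statistic, p_value

n = 100
# Directions spread evenly around the circle: no evidence against uniformity.
even = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]
# Directions confined to a quarter of the circle: uniformity is clearly violated.
skewed = [(math.cos(-math.pi / 4 + (math.pi / 2) * k / (n - 1)),
           math.sin(-math.pi / 4 + (math.pi / 2) * k / (n - 1))) for k in range(n)]

_, p_even = rayleigh_test_2d(even)
_, p_skewed = rayleigh_test_2d(skewed)
print("evenly spread p-value:", p_even)
print("concentrated p-value:", p_skewed)
```

In the recursive procedure, a large p-value would correspond to stopping at the current $p$, and a small one to incrementing $p$. Note that the Rayleigh test is sensitive to a preferred mean direction, so it would not flag every departure from spherical symmetry.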

3. Key Contributions

Generalization of Data Depth: The paper formalizes the concept of data depth where the "center" is a subspace, not just a point. This bridges the gap between traditional depth and linear dimension reduction techniques.
Theoretical Framework:
- Established definitions for symmetry with respect to subspaces (Weak, Mild, Strong).
- Proved that for elliptically symmetric distributions, the minimization of the depth-based dispersion measure is equivalent to Principal Component Analysis (PCA). Specifically, the central subspace corresponds to the space spanned by the first $p$ principal components.
- Demonstrated that for non-elliptical distributions (e.g., mixtures of normals), the depth-based approach can differ from PCA, potentially capturing non-linear or non-Gaussian structures better.
Asymptotic Properties: The authors provide rigorous proofs for the existence, uniqueness, and almost sure convergence of the sample central subspace to the population central subspace under conditions of polynomial decay and finite moments.
Dimension Reduction Tool: The method serves as a robust, non-parametric dimension reduction technique that does not rely on the existence of a covariance matrix (unlike PCA), making it suitable for heavy-tailed distributions.
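The PCA equivalence stated above can be made plausible with a short calculation for the simplest case $q = 1$. This is a sketch under stated assumptions, not the paper's proof: $X$ is elliptically symmetric with location $\mu$ and scatter matrix $\Sigma$, the depth $d$ is affine invariant, $\sigma(F_Z)$ is positive and finite, and the smallest eigenvalue of $\Sigma$ is simple. The symbols $b$, $Z$, and $v_1, \dots, v_m$ are notation introduced here. For a unit vector $b$, take $B_q = b^\top$; the projection of an elliptical vector is a location-scale transform of a fixed symmetric variable $Z$, and the change of variables $x = b^\top \mu + \sqrt{b^\top \Sigma b}\, u$ in the dispersion integral gives:

$$
b^\top X \overset{d}{=} b^\top \mu + \sqrt{b^\top \Sigma b}\; Z
\quad\Longrightarrow\quad
\sigma(F_{B_q}) = \int_{\mathbb{R}} d(x, F_{B_q})\, dx = \sqrt{b^\top \Sigma b}\;\sigma(F_Z).
$$

$$
\arg\min_{\|b\| = 1} \sigma(F_{B_q}) = \arg\min_{\|b\| = 1} b^\top \Sigma b = \pm v_m,
\qquad
S_{m-1} = \operatorname{span}(v_1, \dots, v_{m-1}),
$$

where $v_1, \dots, v_m$ are the unit eigenvectors of $\Sigma$ ordered by decreasing eigenvalue. The minimizing direction is the last principal direction, so its orthogonal complement, the central subspace of dimension $p = m - 1$, is spanned by the first $m - 1$ principal directions.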

4. Results and Applications

Simulation Studies

Scenario Testing: The method was tested on multivariate normal and uniform distributions with varying dimensions.
Performance: The recursive uniformity test successfully identified the correct subspace dimensions ($p^*$) in simulations. For example, in a 3D dataset with one dominant linear direction, the algorithm correctly identified $p=1$.
Comparison with PCA: In elliptical cases, CSDD aligned with PCA. In mixture distributions (square-shaped clusters), CSDD identified the diagonal directions (minimizing dispersion) while PCA would identify the sides (maximizing variance), highlighting the method's ability to detect different structural features depending on the dispersion metric used.

Real-World Application: Customs Fraud Detection

The method was applied to European Union import data (POD datasets: Product, Origin, Destination), specifically looking for misdeclarations (undervaluation).

Data Structure: The data (log-weight vs. log-value) exhibited a strong linear structure.
Comparison:
- Standard Depth (Point-based): Identified a single central point. Outliers were defined as points far from this single point.
- CSDD (Subspace-based): Identified the optimal line (central subspace). Outliers were defined as points far from this line.
Outcome: The CSDD approach provided a superior ordering of the data. It successfully identified specific clusters of potential fraud (points with high quantiles relative to the central line) that were obscured or misclassified by the point-based depth. The method effectively separated the "fair trade" linear trend from anomalous declarations.

5. Significance

Robustness: By relying on data depth rather than covariance matrices, the method is robust to outliers and heavy-tailed distributions (e.g., Pareto, Stable distributions) where PCA fails or is unstable.
Interpretability: In applications like fraud detection or economics, data often follows linear trends (e.g., price vs. weight). CSDD provides a statistically rigorous way to define "normal" behavior as a line or plane, rather than an arbitrary point, leading to more accurate outlier detection.
Theoretical Unification: The paper unifies concepts of symmetry, dispersion, and dimension reduction, showing that PCA is a special case of a more general depth-based minimization problem.
Practical Utility: The proposed framework offers a new tool for exploratory data analysis, particularly for high-dimensional datasets where the underlying structure is linear or subspace-based but non-Gaussian.
