Limiting Spectral Distribution of moderately large Kendall's correlation matrix and its application

This paper establishes the limiting spectral distribution of Kendall's correlation matrices in moderately high-dimensional settings with independent but non-identically distributed observations, demonstrating how distributional heterogeneity affects the spectrum and proposing a graphical tool to avoid spurious dependence detection.

Raunak Shevade, Monika Bhattacharjee

Published Tue, 10 Ma

Imagine you are a detective trying to figure out if a group of people are secretly talking to each other. You have a large room with n people (the sample size) and you are tracking p different topics they might be discussing (the variables).

In the world of statistics, this is called a correlation matrix. It's a giant grid that tells you how strongly every topic moves together with every other topic. Usually, statisticians assume everyone in the room behaves exactly the same way (the observations are "identically distributed") and that the number of topics is roughly comparable to the number of people.

But in the real world, things are messier. Some people talk about sports, others about politics. Some topics are continuous (like temperature), while others are discrete (like "yes" or "no"). And often, you have far more people than topics (n is much larger than p, even though both can be large). This is the "moderately high-dimensional" regime.

This paper by Raunak Shevade and Monika Bhattacharjee is like a new, more robust detective manual for these messy situations. Here is the breakdown in simple terms:

1. The Old Tools vs. The New Tool

The Old Way (The "Perfect World" Assumption):
Previous methods for analyzing these grids assumed everyone was identical and the data was smooth (continuous). If you tried to use these tools on messy, real-world data (like survey answers that are just "Yes/No" or data with extreme outliers), the tools would break. They would start seeing patterns that aren't there, leading to false alarms (thinking people are talking when they aren't).

The New Way (Kendall's Correlation):
The authors focus on Kendall's correlation. Instead of measuring exact values (like "how much did the temperature rise?"), this method looks at rankings and directions.

  • Analogy: Imagine two people, Alice and Bob.
    • Old Method: "Alice's temperature went up by 5 degrees, Bob's went up by 2."
    • Kendall's Method: "Did Alice's temperature go up? Yes. Did Bob's? Yes. Did they move in the same direction? Yes."

This makes the method robust. It doesn't care if the data is weird, heavy-tailed, or full of zeros. It just cares about the direction of change.
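The sign-counting idea can be sketched in a few lines of code. This is our own illustration of the classical tau-a statistic (ties count as zero), not code from the paper:

```python
# Minimal sketch of Kendall's tau-a: count whether pairs of observations
# move in the same direction, ignoring the magnitude of the change.

def kendall_tau(x, y):
    """Tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    assert n == len(y) and n >= 2
    total = n * (n - 1) // 2
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            # sign of the change: +1, -1, or 0 for a tie
            dx = (x[i] > x[j]) - (x[i] < x[j])
            dy = (y[i] > y[j]) - (y[i] < y[j])
            s += dx * dy  # +1 same direction, -1 opposite, 0 if tied
    return s / total

# Perfectly concordant rankings give tau = 1, fully reversed give -1
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```

Note that only the ordering of the values matters: replacing the temperatures with their ranks leaves tau unchanged, which is exactly why the method shrugs off outliers.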

2. The "Spectral Distribution" (The Shape of the Noise)

When you have a huge grid of numbers, you can look at its "eigenvalues." Think of these as the vibrations of a drum. If you hit a drum, it vibrates in specific patterns.

  • The Goal: The authors wanted to know: If we have a massive grid of random, unconnected data, what does the "shape" of these vibrations look like?
  • The Result: They proved that even when the data is messy (different people, different distributions), if you arrange the data correctly, the vibrations settle into a predictable, smooth shape.
  • The Twist: In the past, this shape was always a perfect semicircle (Wigner's famous semicircle law). But the authors discovered that when data is heterogeneous (mixed up), the shape changes! It might look like a distorted semicircle or a completely different blob.
    • Metaphor: If everyone in the room is wearing the same uniform, the crowd moves in a perfect wave (Semicircle). If everyone is wearing different clothes and moving at different speeds, the wave gets messy and changes shape. The authors figured out exactly how to predict that new, messy shape.
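The "shape of the vibrations" can be made concrete with a small numerical sketch. This is our own illustration, and the √(n/p)-style scaling is an assumption for display purposes, not necessarily the paper's exact normalization: generate independent but heterogeneous columns, form Kendall's matrix, and inspect its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 30   # far more observations than variables: the moderate regime

# Independent but NON-identical columns: heavy-tailed and Yes/No mixed together
X = np.empty((n, p))
X[:, : p // 2] = rng.standard_cauchy((n, p // 2))
X[:, p // 2 :] = rng.integers(0, 2, (n, p - p // 2))

# Kendall's tau-a matrix: average sign-agreement over all ordered pairs
D = np.sign(X[:, None, :] - X[None, :, :]).reshape(n * n, p)
K = D.T @ D / (n * (n - 1))

# Center (drop the diagonal), scale, then look at the spectrum
M = np.sqrt(n / p) * (K - np.diag(np.diag(K)))
lam = np.linalg.eigvalsh(M)
print("eigenvalue range:", lam.min(), lam.max())
# With independent data the eigenvalues settle into a bounded, predictable bulk;
# a histogram of `lam` is the empirical version of the limiting shape.
```

Plotting a histogram of `lam` for growing n and p is how one would see the bulk converge to the limiting curve the paper derives.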

3. The "Centering" Trick

One of the biggest headaches in statistics is the diagonal of the matrix (the line from top-left to bottom-right). This represents how a variable correlates with itself.

  • In perfect data, this is always 1.
  • In messy data (like "Yes/No" surveys), this number can vary wildly.

The authors realized that if you don't fix this, the whole analysis gets skewed. They proposed subtracting the diagonal (centering) and scaling the matrix.
  • Analogy: Imagine trying to measure the height of a crowd. If some people are standing on stilts (the diagonal entries) and others are on the ground, your average is wrong. The authors say, "Let's cut off the stilts first, then measure the crowd." This simple step allowed them to handle data that previous methods couldn't.
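The stilts are easy to see in code. In this sketch (ours, not the paper's), tau-a of a Yes/No variable with itself falls below 1, because tied pairs contribute nothing to the sign count:

```python
import numpy as np

def tau_a(x, y):
    """Kendall's tau-a: tied pairs contribute zero to the sign count."""
    n = len(x)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    return float((dx * dy).sum() / (n * (n - 1)))

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 4)).astype(float)   # four Yes/No variables

K = np.array([[tau_a(X[:, a], X[:, b]) for b in range(4)] for a in range(4)])
print(np.round(np.diag(K), 3))   # self-correlations sit BELOW 1, and vary

K_centered = K - np.diag(np.diag(K))   # "cut off the stilts" before measuring
```

For a binary column with k ones out of n observations, the self-correlation is 2k(n - k) / (n(n - 1)), which is near 0.5 rather than 1 when the answers are split evenly. That is the variability the centering step removes.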

4. The Real-World Application: Catching False Friends

The most exciting part of the paper is the application.
The authors created a "graphical tool" (a visual test) to check if variables are truly independent.

  • The Problem: If you ignore the fact that your data is messy (heterogeneous), your old tools will scream, "THEY ARE CONNECTED!" when they are actually just random noise. This is a spurious detection.
  • The Solution: By using their new, corrected shape (the new spectral distribution), you can draw a line. If the data's vibration pattern falls outside the line, you know there is a real connection. If it falls inside, it's just noise.
  • The Proof: They ran simulations where they knew the data was random. The old tools failed (they thought there was a connection 70-80% of the time!). The new tool worked perfectly, only flagging connections when they actually existed.
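Here is a rough sketch of the idea behind such a test. This is our own toy version with an assumed scaling; the paper's actual tool compares the whole empirical spectrum to its theoretical limit. We plant one real dependence among otherwise independent, heterogeneous variables and watch the top eigenvalue escape the noise bulk:

```python
import numpy as np

rng = np.random.default_rng(2)

def kendall_matrix(X):
    """Kendall's tau-a matrix via sign comparisons over all ordered pairs."""
    n = X.shape[0]
    D = np.sign(X[:, None, :] - X[None, :, :]).reshape(n * n, -1)
    return D.T @ D / (n * (n - 1))

n, p = 300, 20
# Case A: truly independent but heterogeneous columns (binary and heavy-tailed)
A = np.column_stack([rng.standard_cauchy(n) if j % 2 else rng.integers(0, 2, n)
                     for j in range(p)])
# Case B: same data, but one strong real dependence planted between two
# continuous columns
B = A.copy()
B[:, 3] = B[:, 1] + 0.01 * rng.standard_normal(n)

results = {}
for name, X in [("independent", A), ("dependent", B)]:
    K = kendall_matrix(X)
    M = np.sqrt(n / p) * (K - np.diag(np.diag(K)))   # centering trick + assumed scaling
    results[name] = np.linalg.eigvalsh(M).max()
    print(name, "largest eigenvalue:", round(results[name], 2))
# A real dependence pushes the top eigenvalue far outside the noise bulk.
```

In the paper's graphical tool the comparison is against the corrected limiting spectral distribution itself, not a single eigenvalue; this toy version only conveys the "inside the line vs. outside the line" intuition.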

Summary

This paper is a breakthrough because it keeps statisticians from reading patterns into data where none exist.

  1. It accepts reality: It works with messy, mixed, and discrete data, not just "perfect" data.
  2. It fixes the math: It shows that when data is messy, the "shape" of randomness changes, and we need new formulas to describe it.
  3. It prevents false alarms: It gives researchers a better way to tell the difference between real relationships and random noise in high-dimensional data.

In short: If you are analyzing complex, real-world data, don't use the old "perfect world" rules. Use this new, robust map to avoid getting lost in false patterns.