On topological and algebraic structures of categorical random variables

This paper establishes a metric for categorical random variables based on entropy and symmetric uncertainty, demonstrating that the resulting quotient space possesses a natural commutative monoid structure compatible with the induced topology.

Inocencio Ortiz, Santiago Gómez-Guerrero, Christian E. Schaerer

Published 2026-03-05

Imagine you are a detective trying to solve a mystery, but instead of looking for fingerprints, you are looking for patterns in data. Specifically, you are dealing with "categorical" data—things that aren't numbers, but labels like "Red, Blue, Green," "Yes, No," or "High, Medium, Low."

For a long time, statisticians have had a great tool for measuring how much two numbers (like height and weight) are related. But measuring how much two labels are related has been messy. This paper introduces a new, super-smart ruler and a new way to "mix" these labels together, giving them a mathematical home.

Here is the breakdown of what the authors did, using simple analogies.

1. The Problem: How Similar Are Two Labels?

Imagine you have a dataset of students. You have a column for "Favorite Color" and a column for "Favorite Sport."

  • If everyone who likes Blue also likes Soccer, those two columns are "best friends."
  • If "Favorite Color" has nothing to do with "Favorite Sport," they are strangers.

The authors use a tool called Symmetric Uncertainty (SU), built from entropy (information theory's measure of surprise). For two variables X and Y, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), where I is mutual information and H is entropy. Think of SU as a "Friendship Score" that goes from 0 to 1.

  • 0 means they are total strangers (no relationship).
  • 1 means they are identical twins (knowing one tells you everything about the other).

The paper starts by proving that this "Friendship Score" has the properties you would want from a similarity measure: it is symmetric, always stays between 0 and 1, and hits those extremes exactly when the variables are independent or completely determine each other.
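To make the "Friendship Score" concrete, here is a minimal Python sketch using the standard definition SU = 2·I(X; Y) / (H(X) + H(Y)), computed from empirical frequencies. The sample data (colors, sports, hobbies) is invented for illustration and is not from the paper:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), the 'Friendship Score' in [0, 1].

    Uses I(X; Y) = H(X) + H(Y) - H(X, Y). Returns 0.0 when both variables
    are constant (there is no uncertainty to share).
    """
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(xs, ys)))       # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2 * mutual_info / (hx + hy)

color = ["Blue", "Blue", "Red", "Red"]
sport = ["Soccer", "Soccer", "Tennis", "Tennis"]  # perfectly tied to color
hobby = ["Chess", "Art", "Chess", "Art"]          # independent of color

print(symmetric_uncertainty(color, sport))  # 1.0 (identical twins)
print(symmetric_uncertainty(color, hobby))  # 0.0 (total strangers)
```

Note that the score never goes below 0 or above 1, which is exactly what makes it usable as the basis for a distance in the next step.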

2. The Ruler: Turning Friendship into Distance

Usually, we measure how far apart things are (distance). But SU measures how close they are (similarity).
The authors realized: "Hey, if we know how close they are, we can easily figure out how far apart they are!"

They created a simple formula: Distance = 1 - Friendship Score.

  • If the Friendship Score is 1 (identical), the Distance is 0.
  • If the Friendship Score is 0 (strangers), the Distance is 1.

Why is this cool?
This turns the messy world of labels into a map. You can now draw a map where "Red" is close to "Blue" if they behave similarly, and far away from "Green." This map has a topology (a shape), meaning you can talk about "neighborhoods" of similar variables. It's no longer just a list; it's a landscape you can navigate.
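The flip from similarity to distance is a one-line change on top of the SU computation. This sketch (with made-up sample data) shows the two extreme cases from the bullets above:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su_distance(xs, ys):
    """d(X, Y) = 1 - SU(X, Y): 0 for indiscernible variables, 1 for strangers."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # two constants carry the same (empty) information
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 1 - 2 * mutual_info / (hx + hy)

a = ["Yes", "Yes", "No", "No"]
b = ["Hi", "Hi", "Lo", "Lo"]    # same pattern as a, different labels
c = ["Yes", "No", "Yes", "No"]  # independent of a

print(su_distance(a, b))  # 0.0: identical twins sit at distance zero
print(su_distance(a, c))  # 1.0: strangers sit at distance one
```

Notice that `a` and `b` use completely different labels yet land at distance 0, which is precisely the "indiscernible twins" situation the next section tackles.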

3. The "Indiscernible" Twins

The paper also deals with a tricky problem: What if two variables are the same, just with different names?

  • Variable A: {1, 2, 3}
  • Variable B: {Apple, Banana, Cherry}

If 1 always matches Apple, 2 matches Banana, and 3 matches Cherry, they are effectively the same variable. The authors call these "Indiscernible" (indistinguishable).

They created a special "Grouping Room" (a quotient space). In this room, you don't care about the specific names (1 vs. Apple); you only care about the pattern. If two variables follow the same pattern, they get the same ID card in this room. This makes the math much cleaner.
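One simple way to picture the "same pattern, same ID card" idea in code: replace every label by the position where it first appears. This is an illustrative fingerprint, not the paper's formal construction (the paper identifies variables whose SU-distance is zero), but it captures the intuition:

```python
def pattern_id(xs):
    """Canonical fingerprint of a variable: replace each label by the index
    of its first appearance. Variables with the same pattern get the same ID."""
    first_seen = {}
    return tuple(first_seen.setdefault(x, len(first_seen)) for x in xs)

a = [1, 2, 3, 1]
b = ["Apple", "Banana", "Cherry", "Apple"]
c = ["Apple", "Banana", "Banana", "Apple"]

print(pattern_id(a))                   # (0, 1, 2, 0)
print(pattern_id(b))                   # (0, 1, 2, 0) -- same pattern, same ID card
print(pattern_id(a) == pattern_id(b))  # True: indiscernible
print(pattern_id(a) == pattern_id(c))  # False: genuinely different patterns
```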

4. The Magic Mixer: The "Joint" Operation

This is the most creative part. The authors asked: "Can we mix two variables together to make a new one?"

Imagine you have:

  • Variable A: Income (Low, Medium, High)
  • Variable B: House Owner (Yes, No)

They define a "Joint" operation (let's call it A * B).

  • If you take a person with Medium income who Owns a house, the new variable becomes the pair (Medium, Yes).
  • The new variable is a list of all possible combinations: (Low, Yes), (Low, No), (Medium, Yes), etc.

The Big Discovery:
The authors proved that if you do this mixing operation in their special "Grouping Room," you get a Commutative Monoid.

  • Commutative: It doesn't matter if you mix A then B, or B then A. The result is the same (like mixing paint: Red + Blue is the same as Blue + Red).
  • Monoid: The mixing is well-behaved: grouping doesn't matter ((A * B) * C gives the same result as A * (B * C)), and there's a "neutral" element (a variable that tells you nothing, like a blank canvas) that leaves anything you mix it with unchanged.
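The "Joint" operation is just pairing up records, and the monoid laws hold once you look at patterns rather than raw labels. A small sketch (sample data and the `pattern_id` fingerprint are illustrative, not from the paper):

```python
def joint(xs, ys):
    """The 'mix' A * B: each record becomes the pair of its two labels."""
    return list(zip(xs, ys))

def pattern_id(xs):
    """Fingerprint for the quotient: labels replaced by first-appearance index."""
    seen = {}
    return tuple(seen.setdefault(x, len(seen)) for x in xs)

income = ["Low", "Medium", "Medium", "High"]
owner  = ["No",  "Yes",    "Yes",    "Yes"]

ab = joint(income, owner)
ba = joint(owner, income)
print(ab[1])  # ('Medium', 'Yes')

# A * B and B * A are not literally equal as lists of pairs...
print(ab == ba)  # False
# ...but in the 'Grouping Room' they are the same variable: same pattern.
print(pattern_id(ab) == pattern_id(ba))  # True: commutative up to indiscernibility

# The neutral element: mixing in a constant variable adds no information.
const = ["*"] * len(income)
print(pattern_id(joint(income, const)) == pattern_id(income))  # True
```

This is why the quotient space of section 3 matters: only after identifying indiscernible variables do "mix A then B" and "mix B then A" become literally the same element.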

5. The Harmony: Shape and Math Working Together

The final, and perhaps most important, part of the paper is showing that the Map (Topology) and the Mixer (Algebra) get along perfectly.

  • The Metaphor: Imagine you have a smooth, stretchy rubber sheet (the map). You have a rule that says "If you mix two points, the result lands somewhere on the sheet."
  • The authors proved that if you nudge the two points you are mixing just a little, the result of the mix also moves just a little.
  • In math terms, the operation is continuous. The "mixing" doesn't tear the map or create jumps. The shape of the data and the rules for combining it are perfectly synchronized.
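Continuity can be checked numerically: nudge one variable by a single relabeled sample, and both the input distance and the distance between the mixed results stay small. A sketch under the same empirical-SU assumptions as before (the data is synthetic, and the exact distances depend on the sample size chosen here):

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su_distance(xs, ys):
    """d = 1 - SU, the entropy-based distance."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(xs, ys)))
    return 1 - 2 * mi / (hx + hy)

def joint(xs, ys):
    return list(zip(xs, ys))

x  = ["A"] * 100 + ["B"] * 100  # original variable
x2 = ["A"] * 101 + ["B"] * 99   # x nudged by one relabeled sample
z  = ["P", "Q"] * 100           # a second variable to mix with

print(su_distance(x, x2))                       # small: the inputs are close...
print(su_distance(joint(x, z), joint(x2, z)))   # ...and the mixed outputs stay close
```

Both printed distances come out well under 0.1 here: the nudge doesn't "tear" the map, which is the informal content of the continuity result.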

Why Should You Care?

Before this paper, if you wanted to use "correlation" for non-numerical data (like survey answers, medical diagnoses, or product categories), you had to use rough approximations.

This paper gives statisticians and data scientists:

  1. A Ruler: A precise way to measure how similar two categories are.
  2. A Map: A way to visualize these relationships.
  3. A Mixer: A formal way to combine variables without breaking the math.

The Bottom Line:
The authors have taken the chaotic world of "Yes/No/Maybe" data and given it a structured, mathematical home. They showed that you can treat these labels just like numbers when it comes to measuring relationships, opening the door for more accurate and intuitive data analysis in fields like medicine, marketing, and social science.