On topological and algebraic structures of categorical random variables

This paper establishes a metric for categorical random variables based on entropy and symmetric uncertainty, demonstrating that the resulting quotient space possesses a natural commutative monoid structure compatible with the induced topology.

Inocencio Ortiz, Santiago Gómez-Guerrero, Christian E. Schaerer

Published 2026-03-05

Imagine you are a detective trying to solve a mystery, but instead of looking for fingerprints, you are looking for patterns in data. Specifically, you are dealing with "categorical" data—things that aren't numbers, but labels like "Red, Blue, Green," "Yes, No," or "High, Medium, Low."

For a long time, statisticians have had a great tool for measuring how much two numbers (like height and weight) are related. But measuring how much two labels are related has been messy. This paper introduces a new, super-smart ruler and a new way to "mix" these labels together, giving them a mathematical home.

Here is the breakdown of what the authors did, using simple analogies.

1. The Problem: How Similar Are Two Labels?

Imagine you have a dataset of students. You have a column for "Favorite Color" and a column for "Favorite Sport."

  • If everyone who likes Blue also likes Soccer, those two columns are "best friends."
  • If "Favorite Color" has nothing to do with "Favorite Sport," they are strangers.

The authors use a tool called Symmetric Uncertainty (SU), built from entropy (information theory's measure of surprise). For two variables X and Y, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), where I is mutual information and H is entropy. Think of SU as a "Friendship Score" that goes from 0 to 1.

  • 0 means they are total strangers (no relationship).
  • 1 means they are identical twins (knowing one tells you everything about the other).

The paper starts by proving that this "Friendship Score" has the properties you would want from a similarity measure: it is symmetric, always stays between 0 and 1, and hits those extremes exactly when the variables are independent or completely determine each other.
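To make the "Friendship Score" concrete, here is a minimal Python sketch using the standard definition SU = 2·I(X; Y) / (H(X) + H(Y)), computed from empirical frequencies. The sample data (colors, sports, hobbies) is invented for illustration and is not from the paper:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), the 'Friendship Score' in [0, 1].

    Uses I(X; Y) = H(X) + H(Y) - H(X, Y). Returns 0.0 when both variables
    are constant (there is no uncertainty to share).
    """
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(xs, ys)))       # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2 * mutual_info / (hx + hy)

color = ["Blue", "Blue", "Red", "Red"]
sport = ["Soccer", "Soccer", "Tennis", "Tennis"]  # perfectly tied to color
hobby = ["Chess", "Art", "Chess", "Art"]          # independent of color

print(symmetric_uncertainty(color, sport))  # 1.0 (identical twins)
print(symmetric_uncertainty(color, hobby))  # 0.0 (total strangers)
```

Note that the score never goes below 0 or above 1, which is exactly what makes it usable as the basis for a distance in the next step.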

2. The Ruler: Turning Friendship into Distance

Usually, we measure how far apart things are (distance). But SU measures how close they are (similarity).
The authors realized: "Hey, if we know how close they are, we can easily figure out how far apart they are!"

They created a simple formula: Distance = 1 - Friendship Score.

  • If the Friendship Score is 1 (identical), the Distance is 0.
  • If the Friendship Score is 0 (strangers), the Distance is 1.

Why is this cool?
This turns the messy world of labels into a map. You can now draw a map where "Red" is close to "Blue" if they behave similarly, and far away from "Green." This map has a topology (a shape), meaning you can talk about "neighborhoods" of similar variables. It's no longer just a list; it's a landscape you can navigate.
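The flip from similarity to distance is a one-line change on top of the SU computation. This sketch (with made-up sample data) shows the two extreme cases from the bullets above:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su_distance(xs, ys):
    """d(X, Y) = 1 - SU(X, Y): 0 for indiscernible variables, 1 for strangers."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # two constants carry the same (empty) information
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 1 - 2 * mutual_info / (hx + hy)

a = ["Yes", "Yes", "No", "No"]
b = ["Hi", "Hi", "Lo", "Lo"]    # same pattern as a, different labels
c = ["Yes", "No", "Yes", "No"]  # independent of a

print(su_distance(a, b))  # 0.0: identical twins sit at distance zero
print(su_distance(a, c))  # 1.0: strangers sit at distance one
```

Notice that `a` and `b` use completely different labels yet land at distance 0, which is precisely the "indiscernible twins" situation the next section tackles.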

3. The "Indiscernible" Twins

The paper also deals with a tricky problem: What if two variables are the same, just with different names?

  • Variable A: {1, 2, 3}
  • Variable B: {Apple, Banana, Cherry}

If 1 always matches Apple, 2 matches Banana, and 3 matches Cherry, they are effectively the same variable. The authors call these "Indiscernible" (indistinguishable).

They created a special "Grouping Room" (a quotient space). In this room, you don't care about the specific names (1 vs. Apple); you only care about the pattern. If two variables follow the same pattern, they get the same ID card in this room. This makes the math much cleaner.
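One simple way to picture the "same pattern, same ID card" idea in code: replace every label by the position where it first appears. This is an illustrative fingerprint, not the paper's formal construction (the paper identifies variables whose SU-distance is zero), but it captures the intuition:

```python
def pattern_id(xs):
    """Canonical fingerprint of a variable: replace each label by the index
    of its first appearance. Variables with the same pattern get the same ID."""
    first_seen = {}
    return tuple(first_seen.setdefault(x, len(first_seen)) for x in xs)

a = [1, 2, 3, 1]
b = ["Apple", "Banana", "Cherry", "Apple"]
c = ["Apple", "Banana", "Banana", "Apple"]

print(pattern_id(a))                   # (0, 1, 2, 0)
print(pattern_id(b))                   # (0, 1, 2, 0) -- same pattern, same ID card
print(pattern_id(a) == pattern_id(b))  # True: indiscernible
print(pattern_id(a) == pattern_id(c))  # False: genuinely different patterns
```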

4. The Magic Mixer: The "Joint" Operation

This is the most creative part. The authors asked: "Can we mix two variables together to make a new one?"

Imagine you have:

  • Variable A: Income (Low, Medium, High)
  • Variable B: House Owner (Yes, No)

They define a "Joint" operation (let's call it A * B).

  • If you take a person with Medium income who Owns a house, the new variable becomes the pair (Medium, Yes).
  • The new variable is a list of all possible combinations: (Low, Yes), (Low, No), (Medium, Yes), etc.

The Big Discovery:
The authors proved that if you do this mixing operation in their special "Grouping Room," you get a Commutative Monoid.

  • Commutative: It doesn't matter if you mix A then B, or B then A. The result is the same (like mixing paint: Red + Blue is the same as Blue + Red).
  • Monoid: The mixing is well-behaved: grouping doesn't matter ((A * B) * C gives the same result as A * (B * C)), and there's a "neutral" element (a variable that tells you nothing, like a blank canvas) that leaves anything you mix it with unchanged.
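The "Joint" operation is just pairing up records, and the monoid laws hold once you look at patterns rather than raw labels. A small sketch (sample data and the `pattern_id` fingerprint are illustrative, not from the paper):

```python
def joint(xs, ys):
    """The 'mix' A * B: each record becomes the pair of its two labels."""
    return list(zip(xs, ys))

def pattern_id(xs):
    """Fingerprint for the quotient: labels replaced by first-appearance index."""
    seen = {}
    return tuple(seen.setdefault(x, len(seen)) for x in xs)

income = ["Low", "Medium", "Medium", "High"]
owner  = ["No",  "Yes",    "Yes",    "Yes"]

ab = joint(income, owner)
ba = joint(owner, income)
print(ab[1])  # ('Medium', 'Yes')

# A * B and B * A are not literally equal as lists of pairs...
print(ab == ba)  # False
# ...but in the 'Grouping Room' they are the same variable: same pattern.
print(pattern_id(ab) == pattern_id(ba))  # True: commutative up to indiscernibility

# The neutral element: mixing in a constant variable adds no information.
const = ["*"] * len(income)
print(pattern_id(joint(income, const)) == pattern_id(income))  # True
```

This is why the quotient space of section 3 matters: only after identifying indiscernible variables do "mix A then B" and "mix B then A" become literally the same element.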

5. The Harmony: Shape and Math Working Together

The final, and perhaps most important, part of the paper is showing that the Map (Topology) and the Mixer (Algebra) get along perfectly.

  • The Metaphor: Imagine you have a smooth, stretchy rubber sheet (the map). You have a rule that says "If you mix two points, the result lands somewhere on the sheet."
  • The authors proved that if you nudge the two points you are mixing just a little, the result of the mix also moves just a little.
  • In math terms, the operation is continuous. The "mixing" doesn't tear the map or create jumps. The shape of the data and the rules for combining it are perfectly synchronized.
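Continuity can be checked numerically: nudge one variable by a single relabeled sample, and both the input distance and the distance between the mixed results stay small. A sketch under the same empirical-SU assumptions as before (the data is synthetic, and the exact distances depend on the sample size chosen here):

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su_distance(xs, ys):
    """d = 1 - SU, the entropy-based distance."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(xs, ys)))
    return 1 - 2 * mi / (hx + hy)

def joint(xs, ys):
    return list(zip(xs, ys))

x  = ["A"] * 100 + ["B"] * 100  # original variable
x2 = ["A"] * 101 + ["B"] * 99   # x nudged by one relabeled sample
z  = ["P", "Q"] * 100           # a second variable to mix with

print(su_distance(x, x2))                       # small: the inputs are close...
print(su_distance(joint(x, z), joint(x2, z)))   # ...and the mixed outputs stay close
```

Both printed distances come out well under 0.1 here: the nudge doesn't "tear" the map, which is the informal content of the continuity result.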

Why Should You Care?

Before this paper, if you wanted to use "correlation" for non-numerical data (like survey answers, medical diagnoses, or product categories), you had to use rough approximations.

This paper gives statisticians and data scientists:

  1. A Ruler: A precise way to measure how similar two categories are.
  2. A Map: A way to visualize these relationships.
  3. A Mixer: A formal way to combine variables without breaking the math.

The Bottom Line:
The authors have taken the chaotic world of "Yes/No/Maybe" data and given it a structured, mathematical home. They showed that you can treat these labels just like numbers when it comes to measuring relationships, opening the door for more accurate and intuitive data analysis in fields like medicine, marketing, and social science.