Learning Unified Distance Metric for Heterogeneous Attribute Data Clustering

This paper proposes a novel, parameter-free Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm that unifies numerical and categorical attributes into homogeneous spaces with learnable metrics to effectively adapt to various clustering tasks while guaranteeing convergence.

Yiqun Zhang, Mingjie Zhao, Yizhou Chen, Yang Lu, Yiu-ming Cheung

Published 2026-03-06

Imagine you are a detective trying to solve a mystery by grouping suspects into teams based on their descriptions. You have two very different types of clues:

  1. The "Numbers" Clues: Things like "Height" or "Income." These are easy. If someone is 6 feet tall and another is 5 feet, you know exactly how far apart they are. It's a straight line on a ruler.
  2. The "Labels" Clues: Things like "Job Title" (Doctor vs. Lawyer) or "Favorite Color." These are tricky. Is a "Doctor" closer to a "Lawyer" or a "Nurse"? There is no ruler for this. A "Red" shirt isn't "halfway" between a "Blue" and a "Green" shirt.

The Problem:
Most old-school detective tools (clustering algorithms) are bad at mixing these clues. They either try to force the "Labels" into a fake ruler (which loses the true meaning) or they treat the "Numbers" and "Labels" as completely separate worlds that never talk to each other. This makes it hard to find the right teams.

The Solution: The "Universal Translator" (HARR)
This paper introduces a new method called HARR (Heterogeneous Attribute Reconstruction and Representation). Think of it as a Universal Translator that turns all your messy clues into a single, easy-to-understand language.

Here is how it works, using simple analogies:

1. The "Shadow Projection" Trick

Imagine you have a complex 3D sculpture (a categorical attribute like "Job Title"). You can't measure it with a simple ruler.

  • The Old Way: You take a photo of it and say, "It's just a blob." (This is like One-Hot Encoding, which treats every job as equally different from every other job).
  • The HARR Way: Instead of looking at the whole sculpture at once, you shine a light on it from every possible angle and look at the shadows it casts on a flat wall.
    • You compare "Doctor" vs. "Lawyer" and see how their shadows differ.
    • You compare "Doctor" vs. "Nurse" and see their shadows.
    • By looking at all these 2D shadows (projections), you can map the complex 3D shape onto a simple, straight line (a ruler) without losing the details.

Now, "Job Title" isn't a fuzzy concept anymore; it's a set of numbers on a ruler, just like "Income." Suddenly, the computer can compare a Doctor's income to their job title on the same scale!
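The paper's exact projection-learning procedure isn't spelled out here, but the contrast with one-hot encoding can be sketched in a few lines of Python. The category positions below are illustrative stand-ins for values a method like HARR would learn from the data, not numbers from the paper:

```python
import numpy as np

# One-hot encoding: every pair of distinct categories is equally far apart.
categories = ["Doctor", "Lawyer", "Nurse"]
one_hot = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}

d_doc_law = np.linalg.norm(one_hot["Doctor"] - one_hot["Lawyer"])
d_doc_nur = np.linalg.norm(one_hot["Doctor"] - one_hot["Nurse"])
assert d_doc_law == d_doc_nur  # one-hot cannot tell these apart

# A learned projection instead places each category on a numeric axis
# (a "ruler"), so related categories can sit closer together.
# These positions are hand-picked for illustration only.
projection = {"Doctor": 0.9, "Nurse": 0.7, "Lawyer": 0.1}
print(abs(projection["Doctor"] - projection["Nurse"]))   # Doctor is near Nurse
print(abs(projection["Doctor"] - projection["Lawyer"]))  # and far from Lawyer
```

Once categories live on such an axis, they can be mixed with numeric attributes like "Income" in one distance computation.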

2. The "Smart Weight" System

Once everything is on the same ruler, the computer needs to decide which clues matter most.

  • The Old Way: You might guess that "Income" is important and "Job Title" is not, or you might give them equal weight.
  • The HARR Way: The system acts like a smart coach. It tries grouping the suspects. If "Income" helps separate the teams well, the coach says, "Great! Give Income a bigger megaphone!" If "Job Title" is confusing the teams, the coach says, "Quiet down, Job Title."
  • It does this automatically while it searches for the best groups, constantly adjusting the volume (weights) of each clue until the teams are perfectly formed.
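The coach loop above can be sketched as a weighted k-means-style iteration. The inverse-dispersion weight update below is a common illustrative rule, not necessarily the paper's exact formula; the toy data and all names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2 attributes already on a common numeric scale.
# Attribute 0 separates two groups cleanly; attribute 1 is pure noise.
X = np.vstack([
    np.column_stack([rng.normal(0, 0.1, 50), rng.normal(0, 1, 50)]),
    np.column_stack([rng.normal(3, 0.1, 50), rng.normal(0, 1, 50)]),
])

k = 2
w = np.ones(X.shape[1]) / X.shape[1]   # start with equal "volume" per clue
centers = X[[0, -1]].copy()            # one seed from each group

for _ in range(20):
    # Assign each point to its nearest center under the weighted metric.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2 * w).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Re-weight: attributes with low within-cluster scatter get a
    # "bigger megaphone" (illustrative inverse-dispersion rule).
    scatter = np.array([
        sum(((X[labels == j, a] - centers[j, a]) ** 2).sum() for j in range(k))
        for a in range(X.shape[1])
    ])
    w = 1.0 / (scatter + 1e-9)
    w /= w.sum()

print(w)  # the informative attribute ends up carrying most of the weight
```

Assignment and re-weighting alternate, so the metric and the grouping improve each other, which is the "adjusting the volume while searching" behavior described above.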

3. Two Coaches (HARR-V and HARR-M)

The paper proposes two slightly different versions of this coach:

  • HARR-V (The Generalist): Gives one overall volume setting for each clue across the whole room. "Income is loud for everyone."
  • HARR-M (The Specialist): Gives specific volume settings for each clue per team. "Income is loud for Team A, but Job Title is loud for Team B." This is like having a coach who knows that for one group of suspects, money matters most, but for another group, their profession is the key.
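The difference between the two coaches is just the shape of the weights: one shared vector versus one weight row per cluster. A minimal sketch, with illustrative shapes and values (not taken from the paper):

```python
import numpy as np

# One object and two cluster centers; attributes share a common scale.
x = np.array([1.0, 0.5, 2.0])
centers = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 0.0]])

# HARR-V style: a single weight vector shared by every cluster
# ("Income is loud for everyone").
w_v = np.array([0.6, 0.2, 0.2])
d_v = ((x - centers) ** 2 * w_v).sum(axis=1)

# HARR-M style: a k x n_attrs weight matrix, one row per cluster
# ("Income is loud for Team A, Job Title is loud for Team B").
w_m = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.8, 0.1]])
d_m = ((x - centers) ** 2 * w_m).sum(axis=1)

print(d_v, d_m)  # same object, different cluster-wise distances
```

HARR-M has more knobs to learn but can capture groups that care about different clues; HARR-V is the simpler, one-size-fits-all setting.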

Why is this a Big Deal?

  • No More Guessing: You don't need to manually tune knobs or guess how to measure "Job Title." The system figures it out.
  • It's Fast: Even though it does a lot of math (projecting shadows), the learning process is guaranteed to converge, and it finishes the job quickly.
  • It Works Everywhere: Whether you are grouping customers for marketing, patients for medical diagnosis, or students for school projects, this method handles the mix of numbers and words better than previous tools.

In a Nutshell:
This paper teaches computers how to stop treating "Numbers" and "Words" as enemies. By turning words into "shadows" on a ruler and letting the computer learn which clues are most important for the specific group it's building, it finds hidden patterns that other methods miss. It's like finally having a detective who can understand both the alibi (numbers) and the motive (words) perfectly.
