Rethinking the Harmonic Loss via Non-Euclidean Distance Layers

Imagine you are teaching a robot to recognize different animals. For decades, the standard way to do this has been like giving the robot a multiple-choice test.

The robot guesses an answer (e.g., "Cat"), and the teacher (the computer program) says, "Wrong! You were 99% sure, but you were wrong. Try harder next time!" This method, called Cross-Entropy, works well, but it has two big problems:

It's a black box: The robot learns abstract numbers that don't really mean anything to us. We don't know why it thinks a picture is a cat.
It gets greedy: The robot tries to become so confident that it starts memorizing the test questions instead of learning the actual concept of a "cat." Real understanding can then arrive only after a very long, inefficient training session, a delayed "aha" moment known as "grokking."

The New Idea: The "Distance" Teacher

Recently, researchers proposed a new way to teach: Harmonic Loss. Instead of a multiple-choice test, imagine the teacher draws a bullseye on the floor for every animal.

There is a "Cat Bullseye," a "Dog Bullseye," and a "Bird Bullseye."
The robot's job isn't to guess a letter; it's to walk the picture to the correct bullseye.
The teacher measures the distance between the picture and the bullseye. If the picture is close to the Cat Bullseye, the robot gets a reward.

This is much more intuitive! The robot learns that "Cat" is a specific place in space. This makes the robot's brain more transparent (we can see where the "Cat" center is) and removes the pressure to keep inflating its confidence, since the goal is simply to land on the bullseye.

The Paper's Big Discovery: "Not All Rulers Are the Same"

The authors of this paper asked a brilliant question: "If we are measuring distance, does it matter how we measure it?"

Until now, Harmonic Loss has only been used with the Euclidean distance. Think of this as a straight-line ruler. If you want to go from point A to point B, you measure the straight line through the air. It's the standard way we think about distance in geometry.

But the authors realized that in the messy, high-dimensional world of AI, a straight line isn't always the best way to measure similarity. They decided to try 10 different types of "rulers" (mathematical distance metrics) to see which one taught the robot best.

Here are the analogies for the different "rulers" they tested:

Euclidean (The Straight Line): The standard ruler. Good, but maybe too rigid.
Cosine (The Compass): This ruler doesn't care how far you are from the center; it only cares about the direction you are facing. Imagine two people walking away from a campfire. One walks 1 mile, the other 10 miles. If they are walking in the same direction, the Compass says, "You are very similar!" This turned out to be the superstar of the experiment.
Manhattan (The Taxi Driver): Imagine you are in a city with a grid of streets. You can't walk through buildings; you have to walk down the street and turn corners. This measures distance by adding up the steps North/South and East/West. It is less swayed by one unusually large difference than the straight-line ruler is.
Bray-Curtis (The Ecologist): Used by scientists to compare species in a forest. It looks at the proportion of things. It's very good at spotting subtle differences in composition.
Mahalanobis (The Shape-Shifter): This ruler is smart. It knows that some features are more important than others. If "fur color" varies wildly but "number of legs" is always 4, this ruler stretches the space so that "legs" matters more. It's powerful but computationally expensive (it takes a lot of brainpower to calculate).
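
To make the ruler analogies concrete, here is a small Python sketch of four of these distances applied to a scaled-down version of the campfire example. The function names and the example vectors are illustrative choices, not taken from the paper.

```python
import math

def euclidean(a, b):
    # Straight-line ruler: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Taxi-driver ruler: sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # Compass ruler: 1 minus the cosine of the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def bray_curtis(a, b):
    # Ecologist ruler: summed absolute differences over summed absolute totals.
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(abs(x + y) for x, y in zip(a, b))

# Two walkers heading in the same direction from a campfire at the origin.
near = [3.0, 4.0]   # 5 units from the fire
far = [6.0, 8.0]    # 10 units from the fire, same heading

print(euclidean(near, far))        # 5.0
print(manhattan(near, far))        # 7.0
print(cosine_distance(near, far))  # 0.0 (same direction)
print(bray_curtis(near, far))      # 7 / 21, about 0.333
```

The Compass reports zero distance because the two walkers share a heading, while the other rulers still see them as far apart.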

What Did They Find?

The researchers tested these rulers on two types of AI: Vision (looking at pictures of cats, dogs, and signs) and Language (teaching AI to write text like GPT).

1. The "Compass" Wins (Cosine Distance)

The Cosine ruler was the clear winner.

For Vision: It helped the robot learn faster, make fewer mistakes, and use less energy (lower carbon footprint). It was like giving the robot a better map.
For Language: It made the AI's writing more stable and less likely to go off the rails. It helped the AI understand the "direction" of a sentence rather than just the raw numbers.

2. The "Ecologist" and Chebyshev Rulers Are Great for Clarity

While the Compass was the best all-rounder, the Bray-Curtis ruler and the Chebyshev ruler (which looks only at the single largest difference between two points) made the robot's internal brain very easy to read. They organized the data so neatly that humans could easily see how the robot was thinking. It's like organizing a messy closet so perfectly that you can find anything instantly.

3. The "Shape-Shifter" is Powerful but Heavy

The Mahalanobis ruler created very sharp, clear categories, but it was slow and used a lot of electricity. It's like using a diamond-tipped laser to cut a piece of paper: it works perfectly, but it's overkill and expensive.

Why Should You Care? (The "Green" Angle)

This paper isn't just about making AI smarter; it's about making it greener.

Training AI today is incredibly energy-hungry. It burns massive amounts of electricity, contributing to carbon emissions.

The authors found that by switching from the standard multiple-choice teacher (Cross-Entropy) to the "Compass" (Cosine) ruler, they could often train the AI with less energy, mainly because it reaches the target accuracy in fewer steps.
In some cases, they reduced the carbon footprint by 30-40%.

The Takeaway

Think of AI training like building a house.

Old Way: We used a standard hammer (Cross-Entropy) that worked, but sometimes we hit our thumb, and the house was hard to understand.
New Way (Harmonic Loss): We realized we should measure distances to build the foundation.
This Paper's Contribution: We tested 10 different types of tape measures. We found that the Compass (Cosine) is the best tool for most jobs: it builds a stronger house, faster, and with less waste.

In short: We don't need to reinvent the wheel to make AI better. We just need to pick the right ruler to measure it. By choosing the right "distance," we can build AI that is smarter, easier to understand, and kinder to the planet.

1. Problem Statement

Deep learning classification has long relied on Cross-Entropy (CE) loss. However, CE suffers from several critical limitations:

Interpretability: Learned weight vectors act as abstract parameters rather than intuitive class prototypes, making model decisions opaque.
Training Dynamics: CE encourages unbounded weight growth to increase confidence, leading to phenomena like grokking (delayed generalization where test performance lags behind training performance for extended periods).
Sustainability: The pursuit of high confidence often leads to inefficient training dynamics and higher computational costs (carbon footprint).

While Harmonic Loss was previously proposed to address these issues by using Euclidean distance to measure the gap between sample representations and class prototypes, its scope was limited. It only explored Euclidean distance and lacked systematic evaluation regarding computational efficiency, sustainability, or the impact of alternative geometric metrics.

2. Methodology

The authors propose a generalized framework that replaces the Euclidean distance in Harmonic Loss with a broad spectrum of Non-Euclidean distance metrics.

Core Framework

The Harmonic Loss formulation replaces the standard inner-product logits and Softmax with a distance-based probability distribution:
$p_W(y_k|x) = \frac{d_k^{-n}}{\sum_{j=1}^K d_j^{-n}}$
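where $d_k$ is the distance between the sample representation $h$ and the class prototype $w_k$, and $n > 0$ is the harmonic exponent, a hyperparameter that controls how sharply probability concentrates on the nearest prototype. The paper systematically substitutes the standard Euclidean distance ($L_2$) with various metrics: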

$L_p$ Norms: Manhattan ($L_1$), Chebyshev ($L_\infty$), and Minkowski (general $L_p$).
Angular/Similarity: Cosine distance.
Specialized Metrics: Bray-Curtis (ecological similarity), Canberra, Hamming (for binary embeddings), and Mahalanobis (incorporating feature covariance).
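
The formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `harmonic_probs` and `harmonic_loss`, the default exponent `n=2.0`, the small `eps` guard, and the toy prototypes are all choices made here, and the loss is taken to be the negative log-probability of the true class. The distance is passed in as a function, which mirrors how the framework swaps one metric for another.

```python
import math

def euclidean(h, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, w)))

def cosine_distance(h, w):
    dot = sum(a * b for a, b in zip(h, w))
    norm_h = math.sqrt(sum(a * a for a in h))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_h * norm_w)

def harmonic_probs(h, prototypes, distance, n=2.0, eps=1e-12):
    # p_k = d_k^{-n} / sum_j d_j^{-n}
    # eps (an assumption of this sketch) avoids division by zero
    # when h coincides exactly with a prototype.
    inverse_powers = [(distance(h, w) + eps) ** (-n) for w in prototypes]
    total = sum(inverse_powers)
    return [value / total for value in inverse_powers]

def harmonic_loss(h, prototypes, label, distance, n=2.0):
    # Assumed loss: negative log-probability of the true class.
    return -math.log(harmonic_probs(h, prototypes, distance, n)[label])

prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy class prototypes w_k
h = [0.9, 0.2]                                      # toy sample representation

p_euc = harmonic_probs(h, prototypes, euclidean)
p_cos = harmonic_probs(h, prototypes, cosine_distance)
print(p_euc)
print(p_cos)
print(harmonic_loss(h, prototypes, 0, euclidean))
```

Both metrics assign the highest probability to the first prototype here, since it is the closest one under either ruler; only the shape of the distribution differs.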

Experimental Setup

The study evaluates these "distance-tailored" losses across two heterogeneous domains:

Vision Tasks: Image classification on MNIST, CIFAR-10, CIFAR-100, Marathi Sign Language, and TinyImageNet, using MLP, CNN, ResNet-50, and PVT (Pyramid Vision Transformer) backbones.
Language Tasks: Language modeling (Next-token prediction and Masked LM) on OpenWebText using Transformer families: GPT-2, BERT, and Qwen2.

Evaluation Metrics

The authors introduce a three-way evaluation framework:

Model Performance: Accuracy, F1-score, Perplexity, and Gradient Stability.
Interpretability: Measured via Principal Component Analysis (PCA) to assess variance concentration (PC2 EV) and intrinsic dimensionality (PCA@90%).
Sustainability: Carbon emissions (gCO2eq), energy consumption, and training duration tracked via CodeCarbon.
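
As a rough illustration of the interpretability metric, PCA@90% can be computed as the number of leading principal components needed to reach 90% cumulative explained variance. The sketch below assumes NumPy, uses a hypothetical helper name `pca_at`, and runs on synthetic data; the paper's exact procedure may differ.

```python
import numpy as np

def explained_variance_ratio(features):
    # Centre the data, then use singular values to get each component's variance share.
    centred = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def pca_at(features, threshold=0.90):
    # Smallest number of leading components whose cumulative share reaches the threshold.
    cumulative = np.cumsum(explained_variance_ratio(features))
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)

# Synthetic 10-dimensional features that vary mostly along two hidden directions.
latent = rng.normal(size=(500, 2)) * np.array([5.0, 3.0])
mixing = rng.normal(size=(2, 10))
features = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# Points lying exactly on a line need a single component.
line = np.outer(np.arange(10.0), np.array([1.0, 2.0, 3.0]))

print(pca_at(features))  # at most 2 for this construction
print(pca_at(line))      # 1
```

A lower value means the representation is concentrated in fewer directions, which is the sense in which the summary calls a feature space "compact."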

3. Key Contributions

First Comprehensive Study: This is the first work to extend Harmonic Loss beyond Euclidean distance, benchmarking a wide spectrum of metrics across both vision and NLP tasks.
Green AI Integration: The paper provides a controlled assessment of the carbon footprint and resource usage of different loss functions, moving beyond pure accuracy metrics.
Theoretical Insights: The authors provide theoretical proofs regarding scale invariance and finite convergence points for 1-homogeneous distances, and establish margin-style PAC-Bayes generalization bounds for these distance-based layers.
Plug-and-Play Implementation: They demonstrate that these losses can be implemented as drop-in replacements for the classification head without modifying the backbone architecture.
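
The scale-invariance claim can be illustrated with a short calculation; this is a sketch of the idea, not the authors' proof. Assume the distance is 1-homogeneous, meaning $d(\alpha h, \alpha w) = |\alpha| \, d(h, w)$ for any scalar $\alpha \neq 0$ (this holds for every $L_p$ distance). Rescaling the representation and all prototypes by $\alpha$ turns each $d_k$ into $|\alpha| d_k$, and the common factor cancels:
$\frac{(|\alpha| d_k)^{-n}}{\sum_{j=1}^K (|\alpha| d_j)^{-n}} = \frac{|\alpha|^{-n} d_k^{-n}}{|\alpha|^{-n} \sum_{j=1}^K d_j^{-n}} = \frac{d_k^{-n}}{\sum_{j=1}^K d_j^{-n}} = p_W(y_k|x)$
The predicted probabilities are therefore unchanged by a global rescaling, so under this assumption the model gains no extra confidence merely by inflating the scale of its representations and prototypes.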

4. Key Results

A. Model Performance

Cosine Distance: Emerges as the most robust "all-rounder." It consistently achieves competitive or superior accuracy compared to Cross-Entropy and Euclidean Harmonic Loss across both vision and language tasks. It also significantly improves gradient stability and reduces grokking (delayed generalization).
Bray-Curtis & Chebyshev: Often match or exceed Euclidean performance while offering distinct geometric advantages.
Mahalanobis: Can yield very sharp class separation but often suffers from instability and slower convergence on complex datasets, and it adds covariance estimation overhead.

B. Interpretability

Structured Representations: Non-Euclidean losses, particularly Bray-Curtis, Chebyshev, and Cosine, induce highly structured latent spaces.
PCA Findings: Models trained with these losses show significantly higher variance explained by the top principal components (PC2 EV) and require fewer dimensions to explain 90% of the variance (lower PCA@90%). This indicates that class prototypes are more distinct and the feature space is more compact compared to Cross-Entropy.
Grokking Elimination: In synthetic modulo-addition tasks, Harmonic Losses (regardless of the specific distance metric) eliminate the grokking phenomenon entirely, achieving simultaneous train-test convergence and forming perfect geometric structures (e.g., circular manifolds).

C. Sustainability (Green AI)

Vision Tasks: On CNNs and ResNet-50, Cosine and Bray-Curtis losses often result in lower carbon emissions than Cross-Entropy. This is attributed to faster convergence (fewer steps to target accuracy) rather than per-step FLOP reduction.
Language Tasks: For LLMs, the classifier head is lightweight, so emissions are driven by convergence speed. Cosine-based losses generally match or slightly improve upon the emissions of Cross-Entropy by stabilizing training.
Costly Metrics: Mahalanobis distance incurs higher emissions due to the computational cost of estimating and inverting covariance matrices.
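
To see where the extra cost comes from, here is a minimal NumPy sketch of a Mahalanobis distance. It assumes the covariance is estimated from a batch of features and regularised with a small ridge term `ridge` before inversion; both choices, and the function name, are illustrative rather than the paper's procedure.

```python
import numpy as np

def mahalanobis(h, w, inv_cov):
    # sqrt((h - w)^T Sigma^{-1} (h - w))
    diff = h - w
    return float(np.sqrt(diff @ inv_cov @ diff))

rng = np.random.default_rng(0)

# Feature 0 varies widely, feature 1 barely varies.
features = rng.normal(size=(1000, 2)) * np.array([10.0, 0.1])

ridge = 1e-6  # assumed regulariser that keeps the matrix invertible
cov = np.cov(features, rowvar=False) + ridge * np.eye(2)  # estimation: O(N d^2)
inv_cov = np.linalg.inv(cov)                              # inversion: O(d^3)

origin = np.zeros(2)
step_noisy = np.array([1.0, 0.0])   # unit step along the high-variance feature
step_stable = np.array([0.0, 1.0])  # unit step along the low-variance feature

d_noisy = mahalanobis(step_noisy, origin, inv_cov)
d_stable = mahalanobis(step_stable, origin, inv_cov)
print(d_noisy)   # small, roughly 0.1
print(d_stable)  # large, roughly 10
```

The same unit step counts for far more along the stable feature than along the noisy one. The estimation and inversion steps scale with the square and cube of the feature dimension respectively, which is the overhead the simpler metrics avoid.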

5. Significance and Conclusion

This paper fundamentally shifts the perspective on classification loss functions from purely probabilistic (Cross-Entropy) to geometric (Distance-based).

Practical Impact: It offers practitioners a "toolbox" of distance metrics. Cosine is recommended as a superior default for most tasks due to its balance of accuracy, interpretability, and sustainability. Bray-Curtis is a strong alternative for interpretability-focused applications.
Theoretical Validation: The study confirms that the benefits of Harmonic Loss (mitigating grokking, improving interpretability) are not unique to Euclidean geometry but are inherent to the distance-based formulation itself.
Sustainability: It establishes that optimizing for geometric properties can simultaneously reduce the environmental cost of training deep learning models, challenging the "Red AI" paradigm where performance gains come at the cost of massive energy consumption.

In summary, the authors demonstrate that geometry matters. By carefully selecting the distance metric in the final layer, one can achieve models that are not only more accurate and stable but also more transparent and environmentally sustainable.