K-Means as a Radial Basis Function Network: A Variational and Gradient-Based Equivalence

This paper establishes a rigorous variational and gradient-based equivalence between K-Means and differentiable Radial Basis Function networks, proving that the latter converges to the former as temperature vanishes and proposing Entmax-1.5 to ensure stable training for end-to-end differentiable clustering.

Felipe de Jesus Felix Arredondo, Alejandro Ucan-Puc, Carlos Astengo Noguez

Published 2026-03-06

Here is an explanation of the paper "K-Means as a Radial Basis Function Network" using simple language, analogies, and metaphors.

The Big Idea: Bridging Two Worlds

Imagine you have two very different tools for organizing a messy room:

  1. The "Hard" Organizer (K-Means): This tool is like a strict librarian. It looks at a book and immediately slaps a label on it: "This goes in the History bin." It's fast and efficient, but once the label is on, it can't be changed. If you try to teach a robot to learn while it's labeling, the robot gets confused because the "labeling" step is a sudden, jerky jump, not a smooth slide.
  2. The "Soft" Organizer (RBF Networks): This tool is like a gentle artist. Instead of a hard label, it paints a soft, fuzzy cloud around the book. The book is 90% History, 10% Biography. It's smooth, flexible, and a robot can easily learn from it because the changes are gradual.

The Problem: The strict librarian (K-Means) is great at finding groups, but it breaks the robot's brain (the neural network) because it's not "differentiable" (you can't calculate a smooth slope for it). The gentle artist (RBF) is great for robots, but it's often seen as just a "soft approximation" of the librarian, not the real deal.

The Paper's Solution: The authors prove that these two tools are actually the same person wearing different hats.

They show that if you take the "Soft" Organizer and turn a specific dial (called temperature, σ) all the way down to zero, the soft clouds snap into hard labels. The "Soft" Organizer becomes the "Hard" Organizer.


The Core Concepts Explained

1. The Temperature Dial (σ)

Imagine you are trying to decide which of three friends to sit with at lunch.

  • High Temperature (Hot Day): You are indecisive. You split your attention: 40% with Friend A, 30% with Friend B, and 30% with Friend C. You are "spread out." This is the Soft state.
  • Low Temperature (Freezing Day): You are desperate for warmth. You immediately run to the friend who is closest to you and sit only with them. You ignore the others completely. This is the Hard (K-Means) state.

The paper proves that as you slowly turn the dial from "Hot" to "Freezing," the soft, fuzzy decision smoothly transforms into the hard, binary decision without breaking anything.
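The temperature dial is easy to see in code. Here is a minimal NumPy sketch (my own illustration, not the paper's implementation): soft assignments are a softmax over negative squared distances, and shrinking `sigma` collapses them toward a one-hot K-Means label.

```python
import numpy as np

def soft_assign(x, centers, sigma):
    """Soft cluster assignment: a softmax over negative squared distances.
    As sigma -> 0 this collapses to a hard (one-hot) K-Means label."""
    d2 = np.sum((centers - x) ** 2, axis=1)  # squared distance to each center
    logits = -d2 / sigma                     # the temperature dial
    logits -= logits.max()                   # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

centers = np.array([[0.0], [1.0], [3.0]])
x = np.array([0.4])                          # closest to the first center

hot = soft_assign(x, centers, sigma=5.0)     # indecisive: weight spread across centers
cold = soft_assign(x, centers, sigma=0.01)   # decisive: snaps to the nearest center
```

On a "hot day" the weights are spread across all three centers; on a "freezing day" the first entry is essentially 1 and the rest essentially 0.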

2. The "Gamma-Convergence" (The Magic Bridge)

In math, proving that two things become the same as a limit approaches zero is hard. The authors used a concept called Γ-convergence.

  • Analogy: Imagine a hill with a valley at the bottom.
    • The Soft version is a wide, gentle bowl.
    • The Hard version is a sharp, deep V-shape.
    • The paper proves that as you shrink the bowl, it doesn't just get smaller; it morphs perfectly into the V-shape. The bottom of the bowl (the best solution) lands exactly on the bottom of the V.
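In symbols, one standard way to write the two energies (my notation, which may differ from the paper's): the smooth "bowl" is a log-sum-exp soft-min of squared distances, and as σ → 0 it recovers the classic sharp K-Means objective.

```latex
% Soft (log-sum-exp) energy: a smooth stand-in for the K-Means cost
F_\sigma(c_1,\dots,c_K) \;=\; -\,\sigma \sum_{i=1}^{n} \log \sum_{k=1}^{K}
  \exp\!\Big(-\tfrac{\|x_i - c_k\|^2}{\sigma}\Big)
\;\xrightarrow[\;\sigma \to 0\;]{}\;
\sum_{i=1}^{n} \min_{k} \|x_i - c_k\|^2 \;=\; F_0(c_1,\dots,c_K).
```

The wide bowl is F_σ, the sharp V is F_0, and Γ-convergence is the guarantee that the bottom of the bowl lands on the bottom of the V (minimizers converge to minimizers, not just values).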

3. The Gradient Problem (The "Stuck" Robot)

Neural networks learn by sliding down a hill (gradient descent).

  • The Issue: If you use the "Hard" K-Means, the hill has a cliff. If the robot is on the edge, it doesn't know which way to slide because the ground is flat until it suddenly drops. The robot gets stuck.
  • The Fix: By using the "Soft" version (RBF), the hill is smooth. The robot can slide down easily. The paper shows that if you slide down this smooth hill and then turn the temperature dial to zero, the robot ends up in the exact same spot as if it had tried to jump off the cliff.
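The cliff versus the smooth slope can be checked numerically. In this sketch (my own toy setup, with made-up numbers), the hard K-Means loss has a gradient that jumps abruptly as a center crosses the decision boundary, while the soft log-sum-exp loss has a well-defined gradient right at the boundary:

```python
import numpy as np

def hard_loss(c1, x=0.5, c2=1.0):
    # K-Means cost: the point pays only the distance to its nearest center
    return min((x - c1) ** 2, (x - c2) ** 2)

def soft_loss(c1, sigma=0.5, x=0.5, c2=1.0):
    # Smooth "soft-min": -sigma * log-sum-exp of the scaled distances
    d = np.array([(x - c1) ** 2, (x - c2) ** 2])
    return -sigma * np.log(np.sum(np.exp(-d / sigma)))

def num_grad(f, c1, eps=1e-5):
    # Central finite-difference derivative with respect to the center c1
    return (f(c1 + eps) - f(c1 - eps)) / (2 * eps)

# c1 = 0 is the decision boundary: the point x = 0.5 is equidistant from both centers
g_hard_left = num_grad(hard_loss, -0.01)   # point belongs to c2: gradient is 0 (flat ground)
g_hard_right = num_grad(hard_loss, +0.01)  # point belongs to c1: gradient jumps to about -1
g_soft = num_grad(soft_loss, 0.0)          # smooth everywhere, even at the boundary
```

The hard gradient is 0 on one side of the boundary and about -0.98 on the other (the cliff); the soft gradient interpolates between them, so gradient descent never gets stuck on flat ground.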

4. The "Entmax-1.5" Solution (The Safety Valve)

There was a catch. When the temperature gets too low (very close to zero), the math gets unstable. The numbers get so huge or so tiny that computers crash (like trying to divide by zero).

  • The Fix: The authors swapped the standard "Softmax" math for a new tool called Entmax-1.5.
  • Analogy: Think of Softmax as a balloon that inflates until it pops. Entmax-1.5 is a balloon that has a safety valve. It still snaps to a hard decision (0% or 100%), but it does so without the math exploding. It allows the computer to handle the "Freezing Day" scenario without crashing.
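Below is a small NumPy sketch of the Entmax-1.5 mapping (the bisection search for the threshold τ is my own simple implementation, not the authors' code). Unlike softmax, it can output exact zeros, so assignments can go fully hard without any exponentials overflowing:

```python
import numpy as np

def entmax15(z, iters=60):
    """Entmax-1.5: p_i = max(z_i / 2 - tau, 0) ** 2, with the threshold tau
    found by bisection so that the probabilities sum to 1.  Low-scoring
    entries get *exactly* zero -- the 'safety valve' behavior."""
    z = np.asarray(z, dtype=float)
    lo, hi = z.max() / 2 - 1.0, z.max() / 2   # tau is always bracketed here
    for _ in range(iters):
        tau = (lo + hi) / 2
        total = np.sum(np.maximum(z / 2 - tau, 0.0) ** 2)
        lo, hi = (tau, hi) if total > 1.0 else (lo, tau)
    return np.maximum(z / 2 - (lo + hi) / 2, 0.0) ** 2

p_soft = entmax15([2.0, 1.0, -2.0])                   # weakest score gets exactly 0
p_hard = entmax15(np.array([2.0, 1.0, -2.0]) / 0.01)  # low temperature: one-hot
```

Dividing the scores by a temperature is my way of mimicking the σ → 0 limit: at low temperature the output is a clean one-hot vector, with no division-by-zero-style blowup along the way.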

Why Does This Matter? (The Real-World Impact)

Before this paper, if you wanted to use K-Means inside a modern AI (like a self-driving car or a chatbot), you had to do it in two separate steps:

  1. Run K-Means to find groups.
  2. Feed those groups into the AI.

This is like baking a cake, taking it out of the oven, and then trying to frost it with a different machine that doesn't talk to the oven.

With this new method:
You can bake and frost the cake in one continuous motion. The AI can now learn how to group things and learn how to recognize patterns at the same time.

  • Joint Optimization: The AI doesn't just "guess" the groups; it adjusts the groups while it learns to see the data better.
  • End-to-End: You can plug this clustering tool directly into a deep neural network, and the whole system learns together, making the AI smarter and more efficient.
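As a toy end-to-end illustration (entirely my own, with synthetic 1-D data standing in for features a network might produce): because the assignments are soft, plain gradient descent on the clustering loss moves badly initialized centers onto the true cluster means, with no separate K-Means step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two obvious 1-D clusters around 0 and 10
x = np.concatenate([rng.normal(0.0, 0.3, 50), rng.normal(10.0, 0.3, 50)])
c = np.array([3.0, 6.0])                   # badly initialized centers
sigma = 1.0                                # temperature of the soft assignment

for _ in range(200):
    d2 = (x[:, None] - c[None, :]) ** 2    # squared distances, shape (100, 2)
    logits = -d2 / sigma
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)      # soft responsibilities per point
    grad = -2.0 * np.sum(p * (x[:, None] - c[None, :]), axis=0)
    c -= 0.005 * grad                      # one plain gradient-descent step
# c is now close to the true cluster means (about 0 and 10)
```

In a real pipeline the same gradient would also flow backward into the network that produced `x`, which is exactly the "bake and frost in one motion" picture: the features and the clusters improve together.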

Summary in One Sentence

The authors proved that the rigid, old-school K-Means algorithm is actually just a "frozen" version of a smooth, modern neural network, and by using a special mathematical trick (Entmax-1.5), we can now melt them together so AI can learn to group data and recognize patterns all at once.