Imagine you have a giant, multi-dimensional puzzle made of numbers (a tensor). This puzzle represents complex data, like traffic patterns across a city over time, or the ingredients in millions of recipes. Your goal is to break this giant puzzle down into simpler, understandable pieces (a decomposition) so you can see the hidden patterns.
However, there's a catch: the pieces must be non-negative (you can't have negative ingredients or negative traffic). Also, the data is "noisy," so you need a flexible ruler to measure how well your pieces fit together. This ruler is called the β-divergence.
This paper is about building a faster, smarter way to solve this puzzle without getting bogged down in messy paperwork.
The Old Way: The "Unfolding" Nightmare
Traditionally, to solve these puzzles, computers would take the 3D (or 4D, or 5D) puzzle and unfold it into a giant 2D sheet of paper (a matrix).
- The Analogy: Imagine trying to organize a 3D stack of books by taking every single book out, laying them all flat on the floor in one giant row, sorting them, and then trying to stack them back up.
- The Problem: This "unfolding" creates massive piles of paper (memory usage) and requires a lot of moving around (computation time). It's slow and clumsy, especially when the puzzle is huge.
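To see what that flattening actually costs, here's a tiny NumPy sketch (illustrative only, not the paper's code) of a mode-1 unfolding. Moving an axis to the front makes the data non-contiguous, so the reshape forces a full copy:

```python
import numpy as np

# A tiny 3-way tensor, e.g. (hours, days, locations).
X = np.arange(24).reshape(2, 3, 4)

# Mode-1 "unfolding": move mode 1 to the front, then flatten the other
# two modes into one long axis. The moved axis is no longer contiguous,
# so the reshape copies all the data -- this is the memory traffic the
# unfolding-free approach avoids.
X1 = np.moveaxis(X, 1, 0).reshape(X.shape[1], -1)

print(X1.shape)  # (3, 8)
```

On a toy tensor the copy is harmless; on a tensor with billions of entries, doing it at every iteration is exactly the "massive piles of paper" problem.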
The New Way: "Unfolding-Free" Updates
The authors, led by Valentin Leplat, propose a new method that keeps the puzzle in its natural 3D (or multi-dimensional) shape the whole time.
- The Analogy: Instead of laying the books flat, you just reach into the stack, grab the specific books you need, rearrange them, and put them back, all while keeping the stack intact.
- The Tool: They use a technique called Tensor Contraction (specifically using einsum, a fancy way of saying "multiply and sum specific parts"). It's like having a magic wand that instantly calculates the necessary numbers without ever flattening the data.
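A minimal NumPy sketch of that wand, assuming a rank-R CP-style model with factor matrices A, B, C (the names and sizes are ours, not the paper's): einsum builds the model directly in 3-D, and the result agrees with the classic unfold-then-multiply route via a Khatri-Rao product.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# Reconstruct the rank-R model directly in 3-D: no unfolding needed.
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)

# The classic route: mode-0 unfolding equals A times the transpose of the
# column-wise Khatri-Rao product of B and C.
KR = np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)
X_unfold = A @ KR.T

assert np.allclose(X_hat.reshape(I, -1), X_unfold)  # same numbers, no flattening step
```

The einsum path never materializes the J*K-by-R intermediate matrix at full data scale, which is where the memory savings come from.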
The Secret Sauce: "Joint Majorization"
The paper introduces two main improvements, but the second one is the real game-changer.
1. The "Block-by-Block" Update (The Standard Approach)
Imagine you are painting a mural. You paint one section, then stop to mix new paint, then paint the next section, then stop to mix paint again.
- The Paper's Version: They figured out how to paint each section using only the "magic wand" (contractions) without flattening the wall. This is already faster than the old way.
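As an illustration of one such block update, here is the classic multiplicative update for a single factor under the KL divergence (β = 1), written entirely with contractions. This is a textbook-style sketch on a small random problem, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
X = rng.random((I, J, K)) + 0.1                      # nonnegative data tensor
A, B, C = (rng.random((d, R)) + 0.1 for d in (I, J, K))

def kl_update_A(X, A, B, C, eps=1e-12):
    """One multiplicative KL (beta = 1) update of factor A, unfolding-free:
    every quantity is computed by tensor contractions (einsum)."""
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)      # current model
    num = np.einsum('ijk,jr,kr->ir', X / (X_hat + eps), B, C)
    den = B.sum(axis=0) * C.sum(axis=0)              # shape (R,), broadcasts over rows
    return A * num / (den + eps)

def kl_div(X, X_hat, eps=1e-12):
    return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)

before = kl_div(X, np.einsum('ir,jr,kr->ijk', A, B, C))
A = kl_update_A(X, A, B, C)
after = kl_div(X, np.einsum('ir,jr,kr->ijk', A, B, C))
assert after <= before + 1e-9   # majorize-minimize guarantee: fit never gets worse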
2. The "Joint Majorization" Strategy (The Innovation)
This is the paper's biggest contribution.
- The Analogy: Imagine you are painting that mural again. Instead of stopping to mix new paint after every single brushstroke, you set up a master palette at the start of the hour. You decide, "Okay, for the next 10 minutes, I'm going to use this specific mix of colors."
- How it works:
- The computer picks a "reference point" (a snapshot of the current solution).
- It builds a Master Surrogate (a simplified, safe approximation of the problem) based on that snapshot.
- It then performs a quick "inner loop" of updates. It tweaks the puzzle pieces rapidly, reusing the cached (pre-calculated) numbers from that Master Surrogate.
- It doesn't rebuild the expensive Master Surrogate until the inner loop is done.
- The Benefit: You save a massive amount of time because you aren't constantly recalculating the "mixing instructions." You just reuse the cached instructions for a few quick steps.
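The caching idea can be sketched roughly as follows. This is an illustrative simultaneous-update loop for the KL case, not the paper's exact algorithm or its convergence safeguards: the expensive reconstruction and ratio tensor (the "master palette") are built once per outer step, then every factor update reuses the same cache and the same reference snapshot:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, R = 4, 5, 6, 3
X = rng.random((I, J, K)) + 0.1
A, B, C = (rng.random((d, R)) + 0.1 for d in (I, J, K))
eps = 1e-12

losses = []
for outer in range(5):
    # --- build the "master surrogate" at the reference point (expensive) ---
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)   # full model reconstruction
    ratio = X / (X_hat + eps)                     # cached once per outer step
    A0, B0, C0 = A, B, C                          # snapshot = reference point
    # --- cheap updates: all three factors reuse the SAME cached ratio ---
    A = A0 * np.einsum('ijk,jr,kr->ir', ratio, B0, C0) / (B0.sum(0) * C0.sum(0) + eps)
    B = B0 * np.einsum('ijk,ir,kr->jr', ratio, A0, C0) / (A0.sum(0) * C0.sum(0) + eps)
    C = C0 * np.einsum('ijk,ir,jr->kr', ratio, A0, B0) / (A0.sum(0) * B0.sum(0) + eps)
    # KL loss at the reference point, for monitoring
    losses.append(np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat))
```

In the paper's version the inner loop takes several updates against the same cached surrogate before it is rebuilt; the sketch above shows a single joint pass to keep the caching structure visible.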
Why Does This Matter?
The authors tested this on synthetic data and a real-world dataset of Uber pickup locations (a 5-dimensional puzzle whose dimensions include time, day, and location).
- Speed: Their method was significantly faster than the old "unfolding" methods.
- Efficiency: It used less computer memory, meaning it could handle bigger puzzles without crashing.
- Versatility: It works for different types of "rulers" (β-divergences), making it useful for everything from counting data (like Uber rides) to measuring sound frequencies.
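For the curious, the family of rulers looks like this in code, using the standard textbook formulas (assuming the common convention where β = 1 is generalized Kullback-Leibler and β = 0 is Itakura-Saito):

```python
import numpy as np

def beta_div(x, y, beta, eps=1e-12):
    """Elementwise beta-divergence d_beta(x | y), summed over all entries."""
    x, y = np.asarray(x, float) + eps, np.asarray(y, float) + eps
    if beta == 1:     # generalized Kullback-Leibler: good for count data
        d = x * np.log(x / y) - x + y
    elif beta == 0:   # Itakura-Saito: good for audio power spectra
        d = x / y - np.log(x / y) - 1
    else:             # general case; beta = 2 gives half the squared error
        d = (x**beta + (beta - 1) * y**beta
             - beta * x * y**(beta - 1)) / (beta * (beta - 1))
    return d.sum()

# Every beta-divergence is zero exactly when the model matches the data.
x = np.array([1.0, 2.0, 3.0])
for beta in (0, 1, 1.5, 2):
    assert np.isclose(beta_div(x, x, beta), 0.0, atol=1e-6)
```

Choosing β is choosing the noise model: β = 2 assumes roughly Gaussian noise, β = 1 suits counts, and β = 0 suits multiplicative noise such as audio spectra.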
The Bottom Line
This paper is like upgrading from a manual, paper-and-pencil calculator to a high-speed digital processor.
- Old Method: Flatten the data, do the math, fold it back up, and repeat every iteration. (Slow, messy.)
- New Method: Keep the data 3D, use smart shortcuts (contractions), and reuse your calculations (joint majorization) to solve the puzzle in record time.
They proved mathematically that each step never makes the fit worse, and that the method converges to a solution no small tweak can improve (a stationary point). For anyone working with massive, multi-dimensional data, this is a huge win for speed and efficiency.