Generalizing Linear Autoencoder Recommenders with Decoupled Expected Quadratic Loss

Imagine you are running a massive library (a recommendation system) trying to guess what book a visitor will love next. You have a giant ledger (the data) showing which books people have checked out in the past.

For years, the smartest librarians used complex, deep-learning "super-brains" to make these guesses. But recently, a simpler approach called Linear Autoencoders (LAEs) started winning. It's like using a very sharp, simple calculator instead of a supercomputer. It works surprisingly well because it focuses on the most obvious patterns: "People who liked Book A also liked Book B."

One specific version of this calculator, called EDLAE, became a champion. It works by playing a game of "hide and seek" with the data. It hides some of the books people checked out (drops them) and tries to guess them back using the remaining books. To make the calculator better at this, it was taught to care more about the hidden books than the ones it could see.

However, the original creators of EDLAE made a strict rule: "You must care at least as much about the hidden books as the visible ones." In the version they solved exactly, a dial called $b$, which controls how much the visible books count, was set to zero, meaning the calculator ignored the visible books entirely when learning.

This paper says: "Wait a minute. What if we turn that dial up a little bit?"

Here is the breakdown of what the authors did, using simple analogies:

1. The New Rulebook (DEQL)

The authors realized that the strict rule (ignoring visible books) wasn't always the best strategy. They created a new, more flexible framework called DEQL (Decoupled Expected Quadratic Loss).

The Old Way: Imagine a student studying for a test. The teacher says, "Only study the questions you got wrong; ignore the ones you got right."
The New Way (DEQL): The teacher says, "Study the questions you got wrong heavily, but don't completely ignore the ones you got right. Maybe looking at the easy ones helps you understand the hard ones better."

They showed mathematically that if you turn this dial up (letting $b > 0$), an exact solution still exists, and their experiments found that it can be even better than the original "champion" model.

2. The "Too Hard to Solve" Problem

There was a catch. The original EDLAE was easy to solve because the math was simple. But when they tried to use the new, more flexible rules ($b > 0$), the math became incredibly complicated.

The Analogy: Solving the old EDLAE was like solving a Sudoku puzzle. Solving the new DEQL with $b > 0$ was like trying to solve a million Sudokus at the same time. If you have a library with 100,000 books, the computer would take years to crunch the numbers. It was theoretically possible but practically impossible.

3. The Magic Shortcut (Miller's Theorem)

The authors didn't just give up; they found a "cheat code." They used a mathematical trick called Miller's Matrix Inverse Theorem.

The Analogy: Imagine you need to calculate the weight of a giant stack of bricks. The old way was to weigh every single brick one by one (taking forever). The authors found a way to weigh the whole stack by weighing just the bottom layer and doing a quick mental math trick to figure out the rest.
The Result: They turned a task that took years into a task that takes minutes. This made their new, better model actually usable on real-world data.

4. The Surprise Discovery

When they tested this new, faster, and more flexible model on real data (like Amazon books, movie ratings, and music), they found something surprising:

The Dial Matters: Sometimes, the best setting was not to ignore the visible books. On some datasets, the model performed best when it cared more about the books it could see than the ones it had to guess!
Breaking the Rules: The original authors assumed you had to care at least as much about the hidden items ($a \ge b$). The new paper showed that sometimes, caring more about the visible items ($b > a$) actually leads to better recommendations. It's like realizing that sometimes, reviewing your strengths is more helpful than obsessing over your weaknesses.

The Bottom Line

This paper is like taking a very good, simple calculator, realizing it was being used with a restrictive setting, and then inventing a new way to calculate that allows the calculator to use all its settings.

They generalized the math so it works for more situations.
They built a speed-boost so the math doesn't take forever.
They showed that the old "best" way wasn't actually the best, and that being flexible can lead to better recommendations for users.

In short: Don't just ignore the easy stuff; sometimes, looking at everything helps you guess the hard stuff better.

1. Problem Statement

Linear Autoencoders (LAEs), such as EASE and EDLAE, have proven highly effective for collaborative filtering due to their simplicity, computational efficiency, and strong performance on sparse data. However, the state-of-the-art model, EDLAE (Steck, 2020), relies on a specific training objective involving dropout and emphasis weighting.

Limitation 1 (Theoretical Gap): EDLAE provides a closed-form solution only when the emphasis weight on non-dropped items is zero ($b = 0$). The behavior and existence of solutions for the broader range $b > 0$ (including cases where $b > a$, where $a$ is the weight on dropped items) were previously unexplored.
Limitation 2 (Computational Complexity): Extending the solution to $b > 0$ naively requires computing $n$ distinct $n \times n$ matrix inverses (where $n$ is the number of items), each costing $O(n^3)$. This leads to a prohibitive $O(n^4)$ time complexity, making the approach impractical for large-scale recommendation systems.
Limitation 3 (Sub-optimality): The original EDLAE assumes $a \ge b$ to prioritize reconstructing dropped items. It is unclear if this constraint is optimal or if relaxing it (e.g., $b > a$) could yield better generalization.
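
For reference, the $b = 0$ baseline that these limitations refer to can be sketched in a few lines of NumPy. This is a minimal sketch that assumes the commonly cited EDLAE form, in which dropout with probability $p$ acts like an item-specific L2 penalty proportional to the diagonal of $X^\top X$. The function name `edlae_b0` and the values of `p` and `lam` are illustrative choices, not taken from the paper.

```python
import numpy as np

def edlae_b0(X, p=0.3, lam=1.0):
    """Closed-form EDLAE-style weights for b = 0 (zero-diagonal choice).

    Assumed form: C = (X^T X + Lam)^-1 with
    Lam = p / (1 - p) * diag(X^T X) + lam * I, then W[j, i] = -C[j, i] / C[i, i].
    """
    G = X.T @ X                                   # item-item Gram matrix (n x n)
    n = G.shape[0]
    Lam = (p / (1.0 - p)) * np.diag(np.diag(G)) + lam * np.eye(n)
    C = np.linalg.inv(G + Lam)
    W = -C / np.diag(C)                           # divide column i by C[i, i]
    np.fill_diagonal(W, 0.0)                      # enforce the zero diagonal
    return W

rng = np.random.default_rng(0)
X = (rng.random((50, 8)) < 0.3).astype(float)     # toy binary user-item matrix
W = edlae_b0(X)
scores = X @ W                                    # predicted scores per user and item
```

A single $n \times n$ inverse serves all $n$ columns here, which is why the $b = 0$ case is cheap compared with the naive $b > 0$ extension.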

2. Methodology

A. Decoupled Expected Quadratic Loss (DEQL)

The authors generalize the EDLAE objective into a Decoupled Expected Quadratic Loss (DEQL).

Formulation: They reformulate the EDLAE objective as an expectation over a multivariate Bernoulli distribution of dropout masks. By decoupling the squared Frobenius norm over the columns of the weight matrix $W$, the problem is transformed into $n$ independent expected quadratic loss problems.
General Solution: They derive a general closed-form solution for DEQL (Equation 8). This framework subsumes EDLAE as a special case but allows for a much broader hyperparameter space ($b \ge 0$).
Key Theoretical Insights:
- Case $b=0$: The solution is not unique. While off-diagonal entries are fixed, diagonal entries can be arbitrary. The original EDLAE solution corresponds to the specific choice of zero diagonals.
- Case $b>0$: A unique closed-form solution always exists, even in the previously unexplored region where $b > a$.
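
The column-wise decoupling can be made concrete with a small sketch. It is a derivation under stated assumptions, not a transcription of the paper's Equation 8: dropout is taken to be i.i.d. Bernoulli with drop probability `p`, and the weights `a` (item $i$ dropped) and `b` (item $i$ kept) multiply the squared error directly. Under those assumptions, taking the expectation of the loss for column $i$ gives normal equations $H^{(i)} w_i = v^{(i)}$ whose entries depend only on $G = X^\top X$. The names `deql_system` and `deql_naive` are hypothetical.

```python
import numpy as np

def deql_system(G, i, a, b, p):
    """Return (H, v) such that H @ w = v are the normal equations for column i.

    Assumes i.i.d. Bernoulli dropout (drop probability p) and squared-error
    weights a (item i dropped) and b (item i kept).
    """
    q = 1.0 - p                                   # keep probability
    c = p * a + q * b                             # expected weight on the error
    # Base matrix H_0: shared by all columns.
    H = c * (q * q * G + q * p * np.diag(np.diag(G)))
    v = c * q * G[:, i]
    # Row and column i only contribute when item i is kept (weight b).
    H[i, :] = b * q * q * G[i, :]
    H[:, i] = b * q * q * G[:, i]
    H[i, i] = b * q * G[i, i]
    v[i] = b * q * G[i, i]
    return H, v

def deql_naive(X, a, b, p):
    """Solve the n columns independently: one O(n^3) solve per column."""
    G = X.T @ X
    n = G.shape[0]
    W = np.empty((n, n))
    for i in range(n):
        H, v = deql_system(G, i, a, b, p)
        W[:, i] = np.linalg.solve(H, v)
    return W

rng = np.random.default_rng(0)
X = (rng.random((30, 5)) < 0.4).astype(float)     # toy binary user-item matrix
W = deql_naive(X, a=1.0, b=2.0, p=0.3)            # b > a still gives a solution
```

In this sketch, setting `b = 0` zeroes row and column $i$ of $H^{(i)}$ and entry $i$ of $v^{(i)}$, so the diagonal weight is left undetermined, which mirrors the non-uniqueness noted for the $b = 0$ case. For `b > 0` the matrix is positive definite as long as every item has at least one interaction.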

B. Efficient Algorithm via Miller's Theorem

To address the $O(n^4)$ complexity of computing the solution for $b > 0$:

Matrix Decomposition: The authors observe that the matrix to be inverted, $H^{(i)}$, differs from a base matrix $H_0$ only by low-rank updates (specifically, rank-1 updates related to the $i$-th row and column).
Miller's Matrix Inverse Theorem: They apply Miller's theorem (1981) to compute the inverse of $H^{(i)}$ iteratively from $H_0^{-1}$ using rank-1 updates.
Complexity Reduction: This approach reduces the total computational complexity from $O(n^4)$ to $O(n^3)$ (specifically $O(\max(m, n)\, n^2)$, where $m$ is the number of users), making the $b > 0$ solutions computationally tractable for large-scale datasets.
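
The idea of reusing one inverse for all columns can be sketched as follows. Two caveats apply. First, the structure of $H_0$ and $H^{(i)}$ used here is an assumption (i.i.d. Bernoulli dropout with drop probability `p`, squared-error weights `a` and `b`), not the paper's exact definition. Second, the sketch applies the Woodbury identity to the combined rank-2 change in row and column $i$, which is a swapped-in equivalent of chaining rank-1 updates and avoids requiring the intermediate matrix to be invertible. The name `deql_fast` is hypothetical.

```python
import numpy as np

def deql_fast(X, a, b, p):
    """Solve all n columns in O(n^3) total, reusing a single inverse of H_0."""
    q = 1.0 - p
    c = p * a + q * b
    G = X.T @ X                                       # O(m n^2)
    n = G.shape[0]
    H0 = c * (q * q * G + q * p * np.diag(np.diag(G)))
    P = np.linalg.inv(H0)                             # the only O(n^3) inverse
    W = np.empty((n, n))
    for i in range(n):
        # Column i of H^(i) and the right-hand side v^(i).
        h = b * q * q * G[:, i]
        h[i] = b * q * G[i, i]
        v = c * q * G[:, i]
        v[i] = b * q * G[i, i]
        d = h - H0[:, i]                              # change in row/column i
        # H^(i) = H0 + U S U^T with U = [e_i, d] and S = [[-d_i, 1], [1, 0]].
        U = np.zeros((n, 2))
        U[i, 0] = 1.0
        U[:, 1] = d
        S_inv = np.array([[0.0, 1.0], [1.0, d[i]]])   # inverse of S
        PU = P @ U                                    # O(n^2)
        Pv = P @ v                                    # O(n^2)
        K = S_inv + U.T @ PU                          # 2 x 2 system
        W[:, i] = Pv - PU @ np.linalg.solve(K, U.T @ Pv)
    return W

rng = np.random.default_rng(0)
X = (rng.random((40, 6)) < 0.4).astype(float)         # toy binary user-item matrix
W = deql_fast(X, a=1.0, b=3.0, p=0.25)
```

Each column costs $O(n^2)$ once $H_0^{-1}$ is available, so the loop adds $O(n^3)$ in total on top of the $O(m n^2)$ needed to form $X^\top X$.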

C. Regularization and Constraints

The framework naturally incorporates:

L2 Regularization: Added to the objective to prevent overfitting.
Zero-Diagonal Constraint: The authors show that while the original EDLAE enforces a zero diagonal, the optimal solution for $b > 0$ often benefits from small, non-zero diagonal entries. The algorithm can compute solutions with or without this constraint.
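
Both options can be illustrated on a single column. The sketch below assumes the per-column problem has the quadratic form $w^\top H w - 2 v^\top w$ and that the L2 penalty simply adds `lam` to the diagonal of $H$; how the paper scales its penalty is not stated here, so treat that as an assumption. The function name `solve_column` is hypothetical.

```python
import numpy as np

def solve_column(H, v, i, lam=0.0, zero_diag=False):
    """Minimize w^T H w - 2 v^T w + lam * ||w||^2, optionally forcing w[i] = 0."""
    n = H.shape[0]
    Hr = H + lam * np.eye(n)                  # ridge term on the diagonal
    if not zero_diag:
        return np.linalg.solve(Hr, v)
    keep = np.arange(n) != i                  # drop row/column i from the system
    w = np.zeros(n)
    w[keep] = np.linalg.solve(Hr[keep][:, keep], v[keep])
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + np.eye(5)                       # toy symmetric positive definite matrix
v = rng.standard_normal(5)
w_free = solve_column(H, v, i=2, lam=0.5)                  # diagonal entry is free
w_zero = solve_column(H, v, i=2, lam=0.5, zero_diag=True)  # diagonal entry fixed to 0
```

Because the constrained problem searches a subset of the unconstrained one, its training objective can never be lower; whether the free diagonal entry helps on held-out data is an empirical question.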

3. Key Contributions

Theoretical Generalization: Introduced DEQL, a unified loss function that generalizes EDLAE and proves the existence and uniqueness of closed-form solutions for the entire range $b \ge 0$.
Algorithmic Innovation: Developed a fast algorithm based on Miller's theorem that reduces the complexity of solving EDLAE for $b > 0$ from $O(n^4)$ to $O(n^3)$, enabling practical application.
Empirical Discovery: Demonstrated that the original EDLAE constraint ($a \ge b$) is not universally optimal. They found that on certain datasets (e.g., AmazonBooks, Yelp2018), the optimal performance occurs when $b > a$, challenging the intuition that dropped items should always be weighted more heavily.
Performance Gains: Showed that DEQL solutions with $b > 0$ and L2 regularization outperform the $b=0$ EDLAE baseline and other state-of-the-art linear models on most datasets, and outperform deep learning models in some settings.

4. Experimental Results

Datasets: Evaluated on 9 benchmark datasets (Games, Beauty, ML-20M, Netflix, AmazonBooks, etc.) under both strong (user-split) and weak (interaction-split) generalization settings.
Performance:
- vs. LAE Baselines: DEQL(L2) and DEQL(L2+zero-diag) outperformed EASE, EDLAE, DLAE, and ELSA across most datasets.
- vs. Deep Learning: In weak generalization settings, DEQL(L2) achieved the best performance on AmazonBooks, outperforming deep models like LightGCN and SimpleX by significant margins (up to 27-34% improvement in Recall@20 and NDCG@20).
Sensitivity Analysis:
- Performance generally improves as $b$ increases from 0, peaks, and then declines.
- Crucial Finding: On datasets with high item-user ratios (sparse data), the optimal $b/a$ ratio often exceeds 1 ($b > a$). This suggests that for very sparse data, emphasizing the reconstruction of remaining items (self-association) is more beneficial than emphasizing dropped items (cross-item association).
Efficiency: While DEQL requires more CPU memory (loading the full $n \times n$ matrix) than deep learning models, it trains significantly faster (minutes vs. hours) due to the closed-form nature, avoiding iterative gradient descent.
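
For readers unfamiliar with the two metrics reported above, here is a minimal per-user sketch of Recall@k and NDCG@k with binary relevance. Conventions vary between papers (some divide recall by $\min(k, \text{number of relevant items})$, for example), and the exact evaluation protocol used by the authors is not given here, so this is one common definition rather than theirs. The function name `recall_ndcg_at_k` is hypothetical.

```python
import numpy as np

def recall_ndcg_at_k(scores, held_out, k=20):
    """Recall@k and NDCG@k for one user with binary relevance.

    scores:   1-D array of predicted scores (training items already masked out).
    held_out: 1-D boolean array marking the user's held-out relevant items.
    """
    k = min(k, scores.shape[0])
    n_rel = int(held_out.sum())
    if n_rel == 0:
        return 0.0, 0.0
    top = np.argsort(-scores, kind="stable")[:k]      # indices of the top-k items
    hits = held_out[top].astype(float)
    recall = hits.sum() / n_rel
    discounts = 1.0 / np.log2(np.arange(2, k + 2))    # 1 / log2(rank + 1)
    dcg = (hits * discounts).sum()
    idcg = discounts[:min(n_rel, k)].sum()            # DCG of an ideal ranking
    return recall, dcg / idcg

scores = np.array([0.9, 0.1, 0.8, 0.3])
held_out = np.array([True, False, False, True])
recall, ndcg = recall_ndcg_at_k(scores, held_out, k=2)
```

Dataset-level numbers are then typically the average of these per-user values over all test users.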

5. Significance

Redefining Linear Recommenders: The paper challenges the assumption that deep learning is necessary for top-tier recommendation performance, showing that optimized linear models can surpass complex neural networks, especially in sparse data regimes.
Theoretical Rigor: It provides the first rigorous derivation of closed-form solutions for EDLAE beyond the $b=0$ case, expanding the theoretical understanding of linear autoencoders.
Practical Impact: The proposed algorithm makes high-performance linear models scalable to large item catalogs (large $n$), offering a fast, reproducible, and interpretable alternative to deep learning.
Interpretability: By relaxing the strict zero-diagonal constraint and finding optimal $b$ values, the work highlights that "identity" mappings (self-reconstruction) play a non-trivial role in recommendation, offering new insights into how linear models capture item relationships.

In summary, this work bridges the gap between theoretical optimization and empirical performance in linear recommendation systems, proving that a generalized, efficiently computable loss function (DEQL) can unlock superior models that were previously inaccessible due to computational or theoretical limitations.