From Latent to Observable Position-Based Click Models… — Plain-Language Explanation

Original authors: Santiago de Leon-Martinez, Robert Moro, Branislav Kveton, Maria Bielikova

Published 2026-06-17

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are walking through a massive, digital library. Instead of a single long list of books on a shelf, the library is organized into carousels—rows of books that you can swipe left and right, like turning pages in a magazine. This is how modern apps like Netflix or Spotify show you movies and songs.

The paper tries to solve a mystery: how do we figure out what a user actually looks at, as opposed to what they click on?

In the world of computer science, the tool for this job is called a "Click Model." For a long time, these models were like old, single-lane roads. They assumed users just looked at a list from top to bottom. But carousels are complex; users might scan the top row, swipe sideways through the second, glance at the third, and then go back to the first. The old models didn't understand this dance.

Here is what the researchers did, broken down simply:

1. The Problem: The "Guessing Game"

In the past, when a computer tried to learn what a user liked, it only had one clue: The Click.

If you clicked a movie, the computer knew you liked it.
If you didn't click, the computer had to guess: "Did they not see it? Did they see it and hate it? Or did they see it and just forget to click?"

Because the computer couldn't see the user's eyes, it had to guess (or "infer") where the user looked. This is like trying to guess what a person is reading in a crowded room just by seeing which book they pick up, without ever seeing their face.

2. The New Idea: "The Eye-Tracker Glasses"

The researchers decided to stop guessing. They used a dataset where they actually had eye-tracking data. Imagine giving the users special glasses that record exactly where their eyes stop and stare (called "fixations").

They built three new types of "Click Models" (mathematical recipes) to understand carousel behavior:

The Standard Model (CPBM): A smart guesser that learns where people look based on clicks.
The Row-Column Model (RCPBM): A model that understands that people look at the top of the screen more than the bottom, and the left more than the right, treating rows and columns separately.
The "Eye-Open" Model (OEPBM): This is the star of the show. It's the first position-based click model that doesn't have to guess. It uses the actual eye-tracking data as a direct signal. It knows, "Ah, the user's eyes stopped on this item, so they examined it," regardless of whether they clicked it.

3. The Race: Old Math vs. New Math

The researchers also tested how to teach these models.

The Old Way (EM/MLE): Like a student slowly working through a textbook, checking one answer at a time to get it right.
The New Way (Gradient Ascent): Like a hiker feeling the slope of a mountain and taking smart steps uphill to reach the peak (the best answer) efficiently.

The Result: The "hiker" (Gradient Ascent) consistently found answers that were as good as or better than those of the "textbook student" (the old methods).

4. The Big Discovery: Clicks Lie

Here is the most important lesson from the paper, explained with a metaphor:

Imagine a teacher grading a student.

The Click Score: The student gets an 'A' if they get the right answer.
The Behavior Score: The teacher also checks if the student actually read the question before answering.

The researchers found that if you only grade the Click Score, the student might get an 'A' by guessing or luck, even if they didn't actually read the question. The model might look perfect at predicting clicks, but it has a terrible understanding of how humans actually browse.

The OEPBM (the Eye-Open model) did best on both scores. It predicted clicks well and it matched the real eye-tracking patterns most closely (like the "F-pattern," where people scan the top and left side first).

5. The Conclusion

The paper concludes with a simple truth: In complex interfaces like carousels, clicks alone are not enough.

If you want to truly understand how people browse, you can't just watch their fingers (clicks); you need to watch their eyes (or use other signals). The best model they built (OEPBM) uses this extra information to create a realistic picture of user behavior, proving that sometimes, you need more than just a "click" to understand a human.

In short: They built a new, smarter way to model what people look at in swipeable lists, found that gradient-based fitting works as well as or better than the classic methods, and showed that without extra signals such as eye tracking, a model can fit clicks well and still be wrong about what users actually see.

Technical Summary: From Latent to Observable Position-Based Click Models in Carousel Interfaces

Problem Statement
Recommender systems increasingly rely on complex, multi-list interfaces, specifically carousel interfaces (e.g., Netflix, Spotify), which allow users to browse horizontally within topic-specific rows and vertically across rows. However, existing click models are predominantly designed for single ranked lists and rely on assumptions (such as sequential browsing) that do not align with the observed browsing behaviors in carousels. Furthermore, traditional models often treat "examination" (whether a user looked at an item) as a latent variable, inferred solely from click data using methods like Expectation-Maximization (EM) or Maximum Likelihood Estimation (MLE). This paper addresses the gap in modeling realistic user examination and browsing patterns in carousel interfaces, questioning whether current optimization methods and model structures can accurately capture these behaviors without additional behavioral signals.

Methodology
The authors propose a shift from latent to observable examination modeling and evaluate three novel Position-Based Models (PBMs) adapted for carousels, alongside re-implementations of existing cascade-based baselines.

Model Formulations:
- Carousel Position-Based Model (CPBM): Extends the standard PBM to carousel coordinates $(i, j)$, learning an independent examination probability $w_{i,j}$ for each position.
- Row-Column PBM (RCPBM): Decomposes examination into row ($w_i$) and column ($w_j$) probabilities to capture dependencies within rows and columns while reducing the number of parameters.
- Observed Examination PBM (OEPBM): The core novelty. This model replaces the latent examination variable with an observable signal derived from eye-tracking data. A position is considered examined if the user's gaze fixates on it. This allows the model to learn examination probabilities directly from observation rather than inference.
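
As a rough illustration of how the first two formulations differ, the sketch below computes click probabilities under each. It assumes the standard position-based factorization (click probability equals examination probability times item attraction) and, for RCPBM, that the row and column terms combine as a product. The summary above does not state the exact combination rule, so that product is an assumption, and all function and variable names are hypothetical.

```python
# Illustrative sketch only: names, toy values, and the RCPBM product form are assumptions.

ROWS, COLS = 10, 15  # grid size reported for the RecGaze interface


def cpbm_click_prob(w, attraction, i, j):
    """CPBM: P(click) = P(examine position (i, j)) * P(item is attractive)."""
    return w[i][j] * attraction


def rcpbm_click_prob(w_row, w_col, attraction, i, j):
    """RCPBM: examination is built from a row term and a column term (assumed product)."""
    return w_row[i] * w_col[j] * attraction


# Toy parameters: examination decays down the rows and across the columns.
w_row = [0.9 ** i for i in range(ROWS)]
w_col = [0.8 ** j for j in range(COLS)]
# A CPBM table that happens to equal the row-column product, for comparison.
w = [[w_row[i] * w_col[j] for j in range(COLS)] for i in range(ROWS)]

cpbm_params = ROWS * COLS   # one w_{i,j} per position
rcpbm_params = ROWS + COLS  # one w_i per row plus one w_j per column

print(cpbm_params, rcpbm_params)
print(cpbm_click_prob(w, 0.5, 2, 3), rcpbm_click_prob(w_row, w_col, 0.5, 2, 3))
```

The parameter counts show the trade-off named in the list: a full table can represent any examination pattern over the grid, while the row-column form can only represent patterns that decompose by row and column.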

Optimization Strategies:
The paper implements a general framework supporting multiple optimization techniques to fit these models:
- Maximum Likelihood Estimation (MLE): Used for models with fully observable variables (TCM, CCM, OEPBM).
- Expectation-Maximization (EM): Used for models with latent examination variables (CPBM, RCPBM).
- Gradient Ascent (GA): A gradient-based optimization approach applied to the log-likelihood of both click-only and click-plus-examination data.
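
To make the gradient-based option concrete, here is a minimal sketch of gradient ascent on the click log-likelihood of a position-based model. It is not the authors' implementation: the sigmoid parameterization, the learning rate, the step count, and the synthetic data are all assumptions chosen to keep the example small, and every name is hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_likelihood(data, theta_w, theta_a):
    """Click log-likelihood with P(click) = sigmoid(theta_w[pos]) * sigmoid(theta_a[item])."""
    ll = 0.0
    for pos, item, click in data:
        p = sigmoid(theta_w[pos]) * sigmoid(theta_a[item])
        ll += math.log(p) if click else math.log(1.0 - p)
    return ll


def gradient_ascent(data, n_pos, n_items, lr=0.5, steps=200):
    """Full-batch gradient ascent on the mean click log-likelihood."""
    theta_w = [0.0] * n_pos    # examination logits, one per position
    theta_a = [0.0] * n_items  # attraction logits, one per item
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0] * n_pos
        grad_a = [0.0] * n_items
        for pos, item, click in data:
            w = sigmoid(theta_w[pos])
            a = sigmoid(theta_a[item])
            p = w * a
            # d(log-lik)/dp times p: 1 for a click, -p / (1 - p) for a non-click.
            scale = 1.0 if click else -p / (1.0 - p)
            grad_w[pos] += scale * (1.0 - w)
            grad_a[item] += scale * (1.0 - a)
        # Step uphill: add the gradient, since the objective is being maximized.
        theta_w = [t + lr * g / n for t, g in zip(theta_w, grad_w)]
        theta_a = [t + lr * g / n for t, g in zip(theta_a, grad_a)]
    return theta_w, theta_a


# Synthetic click log from made-up "true" parameters (toy data, not RecGaze).
random.seed(0)
true_w = [0.9, 0.6, 0.3]
true_a = [0.7, 0.4]
data = []
for _ in range(500):
    pos, item = random.randrange(3), random.randrange(2)
    data.append((pos, item, random.random() < true_w[pos] * true_a[item]))

ll_before = log_likelihood(data, [0.0] * 3, [0.0] * 2)
theta_w, theta_a = gradient_ascent(data, n_pos=3, n_items=2)
ll_after = log_likelihood(data, theta_w, theta_a)
print(ll_before, ll_after)
```

Because only the product of examination and attraction enters the click likelihood, this procedure can fit clicks well without recovering the individual factors, which is the limitation the paper examines.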

Experimental Setup:
- Dataset: The RecGaze dataset, containing eye-tracking and click data from 87 users interacting with a Netflix-mimicking carousel interface (10 rows, 15 items per row).
- Evaluation: Models were evaluated on two metrics:
  - Click Log-Likelihood (LL): Measures prediction accuracy of clicks/non-clicks.
  - Observed Examination Log-Likelihood (OELL): Measures the joint accuracy of predicting clicks and the actual eye-tracking examination events. This metric penalizes models that fit clicks but fail to model realistic browsing patterns.
- Scenarios: Experiments included a "Standard" setup (learning both attraction and examination) and a "Fixed Attraction" setup (fixing item attractiveness to the observed CTR to isolate examination learning).
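
The two metrics can be sketched as follows. This is one plausible reading of the descriptions above, not the paper's exact definitions: it assumes a strict position-based model in which a click requires examination, and it clips probabilities with a small constant so the logarithm stays finite. The record layout and all names are hypothetical.

```python
import math

EPS = 1e-9  # assumed clipping constant to keep log() finite


def click_ll(records, w, a):
    """Click log-likelihood: examination is marginalized out, so P(click) = w * a."""
    total = 0.0
    for pos, item, examined, clicked in records:
        p = min(max(w[pos] * a[item], EPS), 1.0 - EPS)
        total += math.log(p) if clicked else math.log(1.0 - p)
    return total


def observed_exam_ll(records, w, a):
    """Joint log-likelihood of the observed (examined, clicked) pair for each record."""
    total = 0.0
    for pos, item, examined, clicked in records:
        if examined and clicked:
            p = w[pos] * a[item]
        elif examined:
            p = w[pos] * (1.0 - a[item])
        elif clicked:
            p = 0.0  # a click without a fixation is impossible under a strict PBM
        else:
            p = 1.0 - w[pos]
        total += math.log(min(max(p, EPS), 1.0))
    return total


# Toy records: (position, item, examined, clicked).
records = [
    (0, 0, True, True),
    (0, 1, True, False),
    (1, 0, False, False),
]
w = [0.8, 0.5]   # examination probability per position
a = [0.5, 0.25]  # attraction probability per item

ll = click_ll(records, w, a)
oell = observed_exam_ll(records, w, a)
print(ll, oell)
```

The second function scores the examination events themselves, so a model with implausible examination probabilities is penalized even when its click predictions are accurate.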

Key Contributions

- Novel Models: The proposal of three carousel-specific PBM variants, most notably the OEPBM, which is the first position-based click model to incorporate observed examination signals from eye tracking, eliminating the need to infer examination as a latent variable.
- Optimization Analysis: A comprehensive implementation and comparison of gradient-based optimization (GA) against classic EM and MLE methods for carousel click models.
- Behavioral Alignment: An empirical demonstration that optimizing solely for click likelihood does not guarantee the modeling of realistic user examination patterns. The study highlights the necessity of incorporating additional behavioral signals (like gaze) to achieve realistic user modeling in complex interfaces.
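
One consequence of the first contribution is worth spelling out: once examination is observed, a position-based model has no hidden variables, and its maximum-likelihood estimates reduce to simple ratios of counts. The sketch below shows that counting estimator under the assumption of a strict position-based model (clicks happen only on examined items). It is an illustration, not the authors' code, and the record layout and names are hypothetical.

```python
from collections import defaultdict


def fit_observed_pbm(records):
    """Closed-form MLE from (position, item, examined, clicked) records."""
    shown = defaultdict(int)      # impressions per position
    fixated = defaultdict(int)    # impressions with a gaze fixation, per position
    exam_item = defaultdict(int)  # examined impressions per item
    clicks = defaultdict(int)     # clicks among examined impressions, per item
    for pos, item, examined, clicked in records:
        shown[pos] += 1
        if examined:
            fixated[pos] += 1
            exam_item[item] += 1
            if clicked:
                clicks[item] += 1
        # Clicks on unexamined items carry no information in this sketch and are ignored.
    w = {pos: fixated[pos] / shown[pos] for pos in shown}
    a = {item: clicks[item] / exam_item[item] for item in exam_item}
    return w, a


# Toy data: positions are (row, column) pairs, items are labels.
records = [
    ((0, 0), "A", True, True),
    ((0, 0), "B", True, False),
    ((0, 1), "A", True, False),
    ((0, 1), "B", False, False),
]
w_hat, a_hat = fit_observed_pbm(records)
print(w_hat, a_hat)
```

Examination is estimated from gaze alone and attraction only from impressions that were actually seen, so an unseen item is never counted as an unattractive one.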

Results

- Optimization Performance: Gradient Ascent (GA) consistently achieved better or equal click likelihoods compared to EM and MLE. Specifically, GA initialized with MLE solutions (or CTR) often outperformed classic methods, suggesting GA is a robust and efficient approach for fitting carousel click models.
- Model Performance:
  - In terms of Click Likelihood, the PBM variants (CPBM, RCPBM, OEPBM) generally outperformed the existing cascade-based baselines (TCM, CCM).
  - In terms of Examination Realism (OELL), the OEPBM achieved the strongest performance. It produced examination patterns that most closely aligned with actual user eye-tracking data, capturing the "F-pattern" on initial rows and a "mirrored F-pattern" on swiped rows.
  - Latent vs. Observable: Models with latent examination (CPBM, RCPBM) struggled to match the OELL performance of OEPBM unless initialized with gaze data. This indicates that inferring examination from clicks alone is insufficient for capturing realistic browsing behavior.
- The "Click-Only" Limitation: The results revealed a fundamental limitation: a model can achieve a high click likelihood while modeling unrealistic examination patterns. Optimizing for clicks alone does not guarantee a correct understanding of user browsing behavior.
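
A small worked example shows why the click-only limitation arises in a position-based model. Two parameter settings with the same product of examination and attraction predict clicks identically, yet only one of them agrees with where users looked. The counts and parameter values below are invented for illustration, and the joint likelihood assumes a strict position-based model; none of this is taken from the paper's data.

```python
import math


def click_ll(n_clicks, n_total, w, a):
    """Click-only log-likelihood: depends on w and a only through their product."""
    p = w * a
    return n_clicks * math.log(p) + (n_total - n_clicks) * math.log(1.0 - p)


def joint_ll(n_exam_click, n_exam_noclick, n_unseen, w, a):
    """Log-likelihood of clicks together with observed examination."""
    return (n_exam_click * math.log(w * a)
            + n_exam_noclick * math.log(w * (1.0 - a))
            + n_unseen * math.log(1.0 - w))


# Invented counts for one position: 100 impressions, 80 fixated, 20 of those clicked.
n_exam_click, n_exam_noclick, n_unseen = 20, 60, 20
n_total = n_exam_click + n_exam_noclick + n_unseen

# Model A matches the gaze data; model B swaps examination and attraction.
model_a = {"w": 0.8, "a": 0.25}
model_b = {"w": 0.25, "a": 0.8}

click_a = click_ll(n_exam_click, n_total, **model_a)
click_b = click_ll(n_exam_click, n_total, **model_b)
joint_a = joint_ll(n_exam_click, n_exam_noclick, n_unseen, **model_a)
joint_b = joint_ll(n_exam_click, n_exam_noclick, n_unseen, **model_b)

print(click_a, click_b)  # the click-only scores cannot separate the two models
print(joint_a, joint_b)  # the joint score favors the model that matches the gaze data
```

Click data alone cannot choose between the two settings, which is why a metric that also scores examination is needed to tell them apart.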

Significance and Claims
The paper claims that while gradient-based optimization offers superior performance in fitting carousel click models compared to classic methods, the structural design of the model is equally critical. The primary significance lies in demonstrating that click-only models are fundamentally limited in complex interfaces like carousels.

The authors argue that to move beyond "click fit" and achieve realistic user modeling, click models must incorporate additional behavioral signals, such as eye-tracking data. The OEPBM serves as proof that transforming latent examination into an observable variable significantly improves the alignment between the model's internal representation of user behavior and actual user actions. The work advocates for "interface-aware" recommender systems that leverage richer feedback loops beyond simple clicks to better understand and predict user intent in modern, multi-list UIs.