Least trimmed squares regression with missing values and cellwise outliers

This paper proposes a new least trimmed squares regression method that simultaneously handles missing values, resists both casewise and cellwise outliers, accommodates skewed distributions, and enables robust out-of-sample predictions.

Jakob Raymaekers, Peter J. Rousseeuw

Published 2026-03-06

Imagine you are trying to bake the perfect cake based on a recipe that uses 20 different ingredients. You have data from 3,000 different bakeries. Most of them follow the recipe perfectly, but some have made mistakes.

In the world of statistics, this is called regression: the tool we use to find the "recipe" (the relationship between the ingredients and the final cake) that fits the majority of the data.

However, real-world data is messy. This paper introduces a new, super-smart way to handle two specific types of mess:

  1. The "Bad Batch" (Casewise Outliers): An entire bakery messed up the whole cake (maybe they forgot the oven).
  2. The "Sour Spoon" (Cellwise Outliers): A bakery got the recipe mostly right, but one specific ingredient was measured wrong (e.g., they used salt instead of sugar).
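The distinction is easiest to see in a small data matrix. Here is a toy illustration (not the paper's actual detector) that plants one casewise and one cellwise outlier and flags suspicious cells with a simple robust z-score based on each column's median and MAD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(10, 4))  # 10 bakeries x 4 ingredients

X[0, :] = 100.0   # casewise outlier: every cell of bakery 0 is wrong
X[3, 2] = 500.0   # cellwise outlier: bakery 3 botched one ingredient

# Flag cells with a robust per-column z-score (median/MAD), so the
# outliers themselves cannot distort the location and scale estimates.
med = np.median(X, axis=0)
mad = 1.4826 * np.median(np.abs(X - med), axis=0)
flags = np.abs(X - med) / mad > 3.5

print(flags[0])   # the "bad batch": every cell of row 0 is flagged
print(flags[3])   # the "sour spoon": column 2 of row 3 is flagged
```

Note how row 0 lights up entirely while row 3 keeps its good cells; that asymmetry is exactly what casewise-only methods miss.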

The Problem with Old Methods

Traditional methods (like Ordinary Least Squares) are like a chef who tastes the whole batch and tries to average it out. If one bakery put in a cup of salt instead of sugar, the chef might think, "Oh, maybe the recipe actually needs salt!" and ruin the recipe for everyone else.

Even newer "robust" methods are good at spotting the "Bad Batches" (entirely wrong bakeries), but they often get confused by the "Sour Spoons." If one ingredient is wrong, they might throw away the whole bakery's data, or worse, they might not know how to predict the cake for a new bakery that has a wrong ingredient.

The New Solution: "CellLTS"

The authors, Jakob Raymaekers and Peter J. Rousseeuw, propose a new method called CellLTS. Think of it as a two-step detective process that acts like a very careful, paranoid chef.

Step 1: The "Ingredient Scrub" (Cleaning the Predictors)

Before even looking at the final cake (the result), the method looks at the ingredients (the predictors) first.

  • The Symmetrization Trick: Imagine you have a list of heights. Some are very tall, some very short. To make the math easier and fairer, the method creates a "mirror world." It pairs every person with every other person and looks at the difference between them. This turns a lopsided, messy list into a neat, symmetrical bell curve. It's like taking a jagged rock and grinding it down until it's a smooth, perfect sphere.
  • The "CellMCD" Detective: Now, using this smooth data, the method scans every single ingredient. If a bakery says they used 500 pounds of flour (when the average is 5), the system flags that specific cell as "suspicious."
  • The Repair: Instead of throwing the whole bakery away, the system says, "Okay, the flour measurement is wrong. Based on the other ingredients they used (like sugar and eggs), what should the flour amount have been?" It fills in the missing or wrong data with a "best guess" (imputation).
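The symmetrization trick is simple enough to demonstrate directly: take all pairwise differences of a skewed sample, and the result is symmetric by construction, since every difference d is paired with its mirror image -d. A quick numpy illustration of just that idea:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)   # a heavily right-skewed sample

# The "mirror world": all pairwise differences x_i - x_j.  Every
# difference d is paired with -d, so the result is symmetric around
# zero no matter how lopsided x was.
diffs = (x[:, None] - x[None, :]).ravel()

def skewness(v):
    v = v - v.mean()
    return (v**3).mean() / (v**2).mean() ** 1.5

print(skewness(x))       # clearly positive: the exponential is skewed
print(skewness(diffs))   # essentially zero: the mirror world is symmetric
```

This is why the trick helps: methods that assume symmetry can now be applied safely, even when the original ingredient measurements were lopsided.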

Step 2: The "Cake Tasting" (Robust Regression)

Now that the ingredient lists are cleaned and fixed, the method looks at the final cakes.

  • The "Trimmed" Taste Test: It uses a technique called Least Trimmed Squares (LTS). Imagine tasting 100 cakes. Instead of averaging all 100 (which would be ruined by 5 burnt ones), the chef ignores the 25 worst-tasting cakes and calculates the average of the best 75. This finds the true "recipe" without being swayed by the disasters.
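The trimmed taste test can be sketched in a few lines. Below is a simplified FAST-LTS-style loop (a generic LTS sketch, not the paper's full cellwise procedure): fit on a small random subset, then repeatedly refit on the h cases with the smallest squared residuals ("concentration" steps), keeping the best fit found:

```python
import numpy as np

def lts_fit(X, y, h, n_starts=20, n_csteps=10, seed=0):
    """Simplified FAST-LTS sketch: random starts + concentration steps."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])        # add an intercept column
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=Xc.shape[1] + 1, replace=False)
        beta = np.linalg.lstsq(Xc[subset], y[subset], rcond=None)[0]
        for _ in range(n_csteps):                # C-step: keep the h best fits
            r2 = (y - Xc @ beta) ** 2
            subset = np.argsort(r2)[:h]
            beta = np.linalg.lstsq(Xc[subset], y[subset], rcond=None)[0]
        obj = np.sort((y - Xc @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

# 100 cakes: the true recipe is y = 1 + 2x, but 20 cakes are disasters.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 1 + 2 * x + 0.1 * rng.normal(size=100)
y[:20] += 15                                     # 20 ruined cakes

beta = lts_fit(x[:, None], y, h=75)
print(beta)   # close to [1, 2] despite the 20 bad cases
```

Ordinary least squares on the same data would drag the intercept upward to appease the 20 ruined cakes; the trimmed fit simply leaves them out of the objective.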

Why This is a Game-Changer: The "New Bakery" Test

The coolest part of this paper is how it handles Out-of-Sample Prediction.

Imagine a brand new bakery comes in. They have a list of ingredients, but one is missing, and one looks suspiciously high.

  • Old Methods: Would say, "I can't use this data," or "I'll just plug these numbers into my formula," which would give a terrible prediction because the input was broken.
  • CellLTS: Says, "Hold on. That ingredient looks weird. Let me check my 'Sour Spoon' database. It looks like a measurement error. I'll fix that number first, fill in the missing one, and then predict the cake."

It treats the new data with the same care as the old data, cleaning it before making a guess.
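The clean-then-predict idea can be sketched in miniature. In this toy version (the real CellLTS imputes conditionally on the trustworthy cells via its cellMCD machinery; column medians stand in here), a suspicious cell is flagged with a robust z-score against the training data, and missing or flagged cells are repaired before the fitted recipe is applied:

```python
import numpy as np

def clean_then_predict(x_new, X_train, beta):
    """Toy 'clean first, predict second': flag cells far from the training
    data (robust z-score), fill missing and flagged cells with the training
    medians, then apply the fitted coefficients.  The real CellLTS imputes
    conditionally on the trustworthy cells instead of using plain medians."""
    med = np.median(X_train, axis=0)
    mad = 1.4826 * np.median(np.abs(X_train - med), axis=0)
    x = np.where(np.isnan(x_new), med, x_new)           # fill missing cells
    x = np.where(np.abs(x - med) / mad > 3.5, med, x)   # repair weird cells
    return beta[0] + x @ beta[1:]

X_train = np.random.default_rng(3).normal(5.0, 1.0, size=(200, 3))
beta = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical fitted intercept + slopes

x_new = np.array([np.nan, 400.0, 5.2])   # one missing value, one obvious typo
print(clean_then_predict(x_new, X_train, beta))
```

Plugging `x_new` straight into the formula would let the typo (400.0) swing the prediction wildly; repairing it first keeps the prediction in a sensible range.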

The Real-World Test: Cancer Rates

The authors tested this on real data about cancer death rates across US counties.

  • The Mess: They found counties with impossible data, like a median age of 400 years (clearly a typo) or a cancer incidence rate that was way too high.
  • The Result: The old methods got confused by these typos and gave weird predictions. CellLTS spotted the "400-year-old" error, fixed it, and gave a much more accurate prediction of cancer rates based on income, education, and age.

The Bottom Line

This paper gives statisticians a new, powerful tool that doesn't just ignore bad data; it fixes it.

  • It handles missing values (like a missing ingredient).
  • It spots single bad numbers (a sour spoon) without throwing away the whole recipe.
  • It works even when the data is skewed (like a list of billionaires and regular people).
  • It can predict the future for new, messy data by cleaning it first.

In short, it's a robust, self-correcting system that ensures your statistical "recipe" stays delicious, even when the kitchen is a bit of a disaster.