Imagine you are trying to find the "perfect" spot in a massive, foggy city to set up a new coffee shop. You have a map (your data) and a hunch about what makes a good location (your model). But the city has billions of potential spots (data points).
To make the best decision, you need to check the foot traffic, rent prices, and competition at every single spot on the map. Doing this one by one would take a lifetime. This is the problem statisticians face when running the Metropolis-Hastings (MH) algorithm on Big Data. MH is a smart way to explore a map: take a random step, then decide whether to keep the new spot based on how much better it is than the old one. The catch is that, traditionally, the accept-or-reject decision involves the likelihood of every single data point, so every step means a full pass over the entire city. It's too slow.
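To make the bottleneck concrete, here is a minimal sketch of a classic random-walk MH loop on a toy one-parameter Gaussian model (synthetic data; all names are illustrative, not from the paper's code). Notice that `log_post` touches every data point, and it is called at every single step:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100_000)  # n data points: the "spots on the map"

def log_post(theta):
    # Full-data log-posterior (flat prior): a pass over ALL n points -- the bottleneck.
    return -0.5 * np.sum((data - theta) ** 2)

theta, samples = 0.0, []
for _ in range(2_000):
    prop = theta + rng.normal(0.0, 0.01)          # take a random step
    log_alpha = log_post(prop) - log_post(theta)  # how much better is the new spot?
    if np.log(rng.random()) < log_alpha:          # accept with probability min(1, ratio)
        theta = prop
    samples.append(theta)
```

Each of the 2,000 iterations costs two full passes over the 100,000 points; with billions of points this loop becomes infeasible, which is exactly the problem the paper attacks.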
The Old Ways: "Divide and Conquer" vs. "Guessing"
Before this paper, people tried two main workarounds:
- The "Divide and Conquer" Team: They split the city into 100 small neighborhoods, sent 100 friends to check them, and tried to stitch the results together. The problem? If the city isn't perfectly uniform (which it rarely is), stitching the maps together creates a blurry, inaccurate picture.
- The "Subsampling" Gamblers: They decided to just check a random handful of spots (a subsample) instead of the whole city to save time.
  - The Problem: If you only check 5 spots out of a million, your guess might be wildly wrong. To fix this, previous methods used "control variates" (a fancy way of saying "smart guesses based on past experience") to correct the error. But their "smart guesses" were often too loose, meaning they still had to check way too many spots to be safe. It was like trying to guess the temperature of a whole ocean by sticking a thermometer in one cup of water and hoping your math was perfect.
The New Solution: MH-SS (Metropolis-Hastings with Scalable Subsampling)
The authors of this paper, Estevão Prado, Christopher Nemeth, and Chris Sherlock, have invented a new way to play this game. They call it MH-SS.
Think of MH-SS as having a super-smart GPS and a magic magnifying glass.
1. The "Smart GPS" (Control Variates)
Instead of guessing randomly, the algorithm first finds the "center of gravity" of the city (the posterior mode, or the most likely good spot). It then uses a Taylor expansion (a mathematical shortcut) to predict what the foot traffic would be at a new spot based on how far it is from the center.
- The Analogy: Imagine you know the traffic is heavy at the city center. If you propose a new spot 1 mile away, you don't need to check every street. You can mathematically estimate the traffic drop-off with high precision. This estimate is your "Control Variate."
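The "Smart GPS" can be sketched in a few lines. This is a toy Gaussian model (names and setup are illustrative, not the authors' code): find the mode once, precompute the value, gradient, and Hessian of the log-likelihood there, and from then on predict the full-data log-likelihood at any new spot without touching the data:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(1.0, 1.0, size=100_000)

def log_lik_i(theta):
    # Per-observation log-likelihood terms (Gaussian model, unit variance).
    return -0.5 * (data - theta) ** 2

# The "center of gravity": the posterior mode (here just the sample mean).
mode = data.mean()

# One-off precomputation at the mode, done before sampling starts.
f0 = log_lik_i(mode).sum()   # value at the mode
g0 = (data - mode).sum()     # gradient at the mode (~0 at the exact mode)
h0 = -float(len(data))       # Hessian (constant for this toy model)

def taylor_log_lik(theta):
    # Second-order Taylor prediction of the FULL-data log-likelihood:
    # O(1) at sampling time, no pass over the data.
    d = theta - mode
    return f0 + g0 * d + 0.5 * h0 * d * d

exact  = log_lik_i(1.02).sum()   # expensive truth, for comparison
approx = taylor_log_lik(1.02)    # cheap control-variate prediction
```

For this Gaussian toy the log-likelihood is exactly quadratic, so the prediction is exact; for real models it is merely very accurate near the mode, and the gap is what the subsample has to correct.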
2. The "Magic Magnifying Glass" (Tighter Bounds)
Here is the breakthrough. Previous methods were like using a magnifying glass that was slightly out of focus. To be safe, they had to look at a huge area to make sure they didn't miss anything.
The authors figured out how to make the magnifying glass crystal clear. They derived new, much tighter mathematical "bounds."
- The Result: Because their estimate is so accurate, they only need to check a tiny, tiny number of actual data points to confirm if their guess was right.
- The Metaphor: If the old method needed to check 10,000 spots to be 99% sure, the new MH-SS method might only need to check 50 spots to get the same certainty.
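Putting the two pieces together, a sketch of the subsampled estimate of the acceptance ratio (again a toy Gaussian model with illustrative names, not the paper's implementation): the Taylor part covers the whole city in closed form, and only a small subsample of real data points is read to correct it:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(1.0, 1.0, size=100_000)
n, mode = len(data), data.mean()

# One-off precomputation at the mode (value, gradient, Hessian).
f0 = -0.5 * np.sum((data - mode) ** 2)
g0 = np.sum(data - mode)
h0 = -float(n)

def taylor_total(theta):
    # Closed-form full-data prediction: O(1), no data access at sampling time.
    d = theta - mode
    return f0 + g0 * d + 0.5 * h0 * d * d

def subsampled_log_ratio(theta_new, theta_old, m=50):
    # Only m real data points ("spots") are ever read.
    x = data[rng.choice(n, size=m, replace=False)]

    def resid(theta):
        # Per-point gap between the true term and its Taylor prediction.
        d = theta - mode
        per_point_pred = -0.5 * (x - mode) ** 2 + (x - mode) * d - 0.5 * d * d
        return -0.5 * (x - theta) ** 2 - per_point_pred

    taylor_part = taylor_total(theta_new) - taylor_total(theta_old)
    correction = (n / m) * (resid(theta_new) - resid(theta_old)).sum()
    return taylor_part + correction
```

Because the Taylor prediction is so tight near the mode, the residuals being scaled up by `n / m` are tiny, which is why a subsample of 50 out of 100,000 can stand in for the whole dataset.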
3. The "Delayed Acceptance" (The Two-Stage Filter)
The algorithm uses a clever two-step filter to save even more time:
- Stage 1 (The Quick Glance): It uses the "Smart GPS" estimate to see if the new spot looks promising. If the math says "No way, that's a bad spot," it rejects it immediately without checking any real data.
- Stage 2 (The Deep Dive): If the spot looks promising, it only then pulls out the "Magic Magnifying Glass" and checks the tiny subsample of real data to make the final decision.
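The two-stage filter can be sketched as a delayed-acceptance step (toy Gaussian model, illustrative names; for clarity Stage 2 below evaluates the full posterior, where MH-SS would plug in the cheap subsample estimate instead). The Stage 2 ratio is divided by the Stage 1 ratio, which is what keeps the overall chain exact:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(1.0, 1.0, size=100_000)
mode, n = data.mean(), len(data)

def full_log_post(theta):
    # Expensive: a pass over every data point.
    return -0.5 * np.sum((data - theta) ** 2)

def surrogate(theta):
    # Cheap "Smart GPS" stand-in: quadratic around the mode, O(1) to evaluate.
    return -0.5 * n * (theta - mode) ** 2

def da_step(theta):
    prop = theta + rng.normal(0.0, 0.005)
    # Stage 1 (quick glance): screen with the surrogate only.
    # Unpromising spots are rejected here without touching any real data.
    log_a1 = surrogate(prop) - surrogate(theta)
    if np.log(rng.random()) >= min(0.0, log_a1):
        return theta
    # Stage 2 (deep dive): the expensive check, CORRECTED by the Stage 1
    # ratio so the two stages combined target the true posterior exactly.
    log_a2 = (full_log_post(prop) - full_log_post(theta)) - log_a1
    if np.log(rng.random()) < min(0.0, log_a2):
        return prop
    return theta

theta = mode
for _ in range(500):
    theta = da_step(theta)
```

The design payoff: the better the surrogate, the closer `log_a2` is to zero, so Stage 2 almost never vetoes Stage 1, and nearly all rejections happen at the free Stage 1 glance.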
Why This Matters
In the real world, this is like upgrading from sneakers to a jetpack.
- Speed: The authors tested this on datasets with millions of data points (like the UK road accident database or high-energy physics data). Their method was orders of magnitude faster than the previous best methods.
- Accuracy: Unlike the "Divide and Conquer" methods, this method is exact: the chain targets the true posterior distribution, with no approximation error baked in. You get the right answer, just much faster.
- Efficiency: It requires fewer "steps" (iterations) to find the best solution because it doesn't get stuck checking useless data.
The Bottom Line
Imagine you are trying to find the best route through a maze with a billion walls.
- Old MH: You walk to every single wall to check if it's a dead end. (Takes forever).
- Old Subsampling: You peek at a few walls, guess the rest, and hope you didn't hit a dead end. (Fast, but risky).
- MH-SS (This Paper): You have a map that predicts the maze structure so well that you only need to peek at three walls to know for sure if the path is clear.
This paper gives statisticians a powerful new tool to analyze massive datasets (Big Data) without sacrificing accuracy or waiting years for the computer to finish the job. It turns a "prohibitively expensive" task into a routine calculation.