Estimation of relative risk, odds ratio and their logarithms with guaranteed accuracy and controlled sample size ratio

This paper proposes two-stage sequential estimators for relative risk, odds ratio, and their logarithms that guarantee a target mean-square error for any population parameters while maintaining a controlled sample size ratio and achieving high efficiency near the Cramér-Rao bound.

Luis Mendo

Published 2026-03-06

Imagine you are a detective trying to solve a mystery involving two different groups of people. Let's call them Team A and Team B.

Your goal is to figure out how much more likely a specific event (like catching a cold, or buying a product) is to happen in Team A compared to Team B. In statistics, we call this the Relative Risk. You might also want to know the "Odds Ratio," which is a slightly different way of measuring that same comparison, or the "Log" versions, which are just mathematical tweaks to make the numbers easier to handle.
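To make these quantities concrete, here is a tiny Python sketch that computes relative risk, odds ratio, and their logarithms from two probabilities. The values p1 = 0.30 and p2 = 0.10 are purely illustrative, not from the paper:

```python
import math

def relative_risk(p1, p2):
    # Relative risk: how many times more likely the event is in group 1
    return p1 / p2

def odds_ratio(p1, p2):
    # Odds ratio: ratio of the odds p/(1-p) in each group
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

p1, p2 = 0.30, 0.10          # hypothetical true probabilities
rr = relative_risk(p1, p2)   # 3.0: the event is 3x as likely in group 1
orr = odds_ratio(p1, p2)     # ≈ 3.86: odds ratio exaggerates RR when p is not small
log_rr = math.log(rr)        # the "log" versions the paper also covers
log_or = math.log(orr)
```

Note that the odds ratio is close to the relative risk only when both probabilities are small; for common events the two measures diverge, which is one reason both are studied.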

The problem? You don't know the true probabilities. The event happens with some probability p₁ in Team A and some probability p₂ in Team B, but these numbers are hidden from you.

The Old Way vs. The New Way

The Old Way (Fixed Sample Size):
Imagine you decide to interview exactly 100 people from Team A and 100 from Team B, no matter what.

  • The Flaw: If the event is very rare (like winning the lottery), interviewing 100 people might not give you enough "successes" to make a good guess. Your estimate would be shaky. If the event is super common, you might have interviewed way more people than you needed, wasting time and money.
  • The Risk: You can't guarantee your answer will be accurate enough for every situation.

The New Way (This Paper's Solution):
The author, Luis Mendo, proposes a smart, two-step detective strategy that adapts as it goes. Think of it like a video game where you level up your equipment based on how the previous level went.

Step 1: The "Scout" Mission (First Stage)

You send out a small team of scouts to both Team A and Team B. You don't stop until you find a specific number of "successes" (e.g., 5 people who got sick).

  • Why? This gives you a rough, quick estimate of how common the event is in both groups.
  • The Magic: Because you stop based on finding successes (not a fixed number of people), you get a reliable "feel" for the rarity of the event, even if it's very rare.
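Statisticians call this stopping rule inverse binomial (negative binomial) sampling: you fix the number of successes r, not the number of trials. A minimal Python sketch of the first stage (the naive estimate r/n is used here for illustration; the paper may use a refined estimator):

```python
import random

def first_stage(p_true, r, rng):
    """Draw Bernoulli(p_true) trials until r successes are observed.
    Returns the number of trials n it took (inverse binomial sampling)."""
    n = 0
    successes = 0
    while successes < r:
        n += 1
        if rng.random() < p_true:
            successes += 1
    return n

rng = random.Random(0)
r = 5                             # target number of successes (hypothetical choice)
n1 = first_stage(0.02, r, rng)    # rare event: many trials needed
n2 = first_stage(0.50, r, rng)    # common event: few trials needed
phat1 = r / n1                    # rough first-stage probability estimates
phat2 = r / n2
```

Because n grows automatically when the event is rare, the relative accuracy of the first-stage estimate is roughly the same whether the event is common or rare, which is exactly the "feel for rarity" described above.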

Step 2: The "Main Mission" (Second Stage)

Now, you look at what the scouts found.

  • If the event is rare: You know you need to interview many more people to get a good answer.
  • If the event is common: You know you don't need to interview as many.
  • The Adjustment: The paper provides a mathematical formula to calculate exactly how many more people you need to interview in each group to hit a specific accuracy target (let's say, "I want my answer to be within 5% of the truth").

Crucially, this formula also lets you control the ratio of your teams. Maybe you have twice as many people available in Team A as Team B. The math ensures you use that extra manpower efficiently so you don't waste resources, while still hitting your accuracy goal.
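The paper's exact second-stage formula is not reproduced here, but the idea can be sketched with a standard textbook approximation: by the delta method, the variance of the estimated log relative risk is roughly (1−p₁)/(n₁p₁) + (1−p₂)/(n₂p₂), so given first-stage estimates and a fixed ratio n₁ = c·n₂, one can solve for the sizes that push this below a target mean-square error. This is an illustrative approximation, not the author's formula:

```python
import math

def second_stage_sizes(phat1, phat2, target_mse, c):
    """Choose n2 (with n1 = c * n2) so that the delta-method variance of
    the estimated log relative risk,
        (1 - p1)/(n1 * p1) + (1 - p2)/(n2 * p2),
    falls below target_mse. A textbook sketch, NOT the paper's formula."""
    v = (1 - phat1) / (c * phat1) + (1 - phat2) / phat2
    n2 = math.ceil(v / target_mse)
    n1 = math.ceil(c * n2)
    return n1, n2

# Hypothetical first-stage estimates, with Team A twice the size of Team B:
n1, n2 = second_stage_sizes(0.05, 0.20, target_mse=0.01, c=2.0)
```

The rarer group (p ≈ 0.05) dominates the variance, so it is the main driver of how many more interviews are needed; the ratio constraint c then splits the work between the teams.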

Two Ways to Gather Data

The paper also covers two ways to collect your data:

  1. Element Sampling (One by One): You interview people individually. As soon as you need one more person from Team A, you grab one. This is the most efficient way but requires you to be able to stop and start at any moment.
  2. Group Sampling (Batches): Imagine you can only interview people in pre-packaged groups (e.g., a bus of 10 people from Team A and a bus of 10 from Team B arrive together).
    • The Challenge: You might need 12 people from Team A, but the bus only brings 10. You take the bus, interview the 10, and then have to wait for the next bus to get the remaining 2.
    • The Paper's Fix: The author shows how to handle these "leftover" people. At the end of the process, you might end up with a few extra people you don't need (who you politely send home), but the math guarantees that your final answer is still just as accurate as if you had interviewed them one by one.
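The batch bookkeeping itself is simple arithmetic, sketched below (`batch_sample` is a hypothetical helper, not from the paper; the paper's contribution is proving that discarding the surplus does not break the accuracy guarantee):

```python
def batch_sample(needed, batch_size):
    """When data arrives only in fixed-size batches, take whole batches
    until at least `needed` observations are covered; the surplus is
    discarded. Returns (batches_taken, surplus)."""
    batches = -(-needed // batch_size)   # ceiling division
    surplus = batches * batch_size - needed
    return batches, surplus

batches, leftover = batch_sample(needed=12, batch_size=10)
# → 2 batches of 10 cover the 12 needed; 8 people are "sent home"
```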

Why is this a Big Deal?

In the world of statistics, there is a "Gold Standard" called the Cramér-Rao Bound. It is a theoretical floor: no estimator can achieve a lower variance than this bound allows, so it acts like a speed limit on accuracy.

  • Efficiency: The author proves that his method is very efficient. It gets close to that speed limit.
  • Guaranteed Accuracy: Unlike other methods that say "on average, this is accurate," this method says, "No matter what the hidden probabilities are, I guarantee your error will be below this specific number."
  • Versatility: It works for Relative Risk, Odds Ratios, and their logarithmic versions (which are used in machine learning and medical studies).

The Analogy of the "Smart Bucket"

Imagine you are trying to fill two buckets with water to a specific level (your accuracy target).

  • Bucket A has a leak (low probability of success).
  • Bucket B has a tiny hole (high probability of success).

The Old Method: You pour water from a hose for exactly 10 minutes into both. Bucket A might be empty, and Bucket B might overflow. You failed.

This Paper's Method:

  1. Scout: You pour a little water to see how fast the buckets fill.
  2. Adjust: You realize Bucket A is leaking fast, so you turn the hose up high. You realize Bucket B is holding water well, so you turn the hose down.
  3. Stop: You stop exactly when both buckets reach the perfect level.
  4. Group Constraint: If you can only pour water in 5-gallon jugs (Group Sampling), you might pour a little extra into Bucket B, but the math ensures you still hit the target level with minimal waste.

Summary

This paper gives statisticians and data scientists a universal toolkit to compare two groups. Whether you are testing a new vaccine, analyzing marketing data, or training an AI, this method ensures you get the most accurate answer possible without wasting time or money, and it works even when you are forced to collect data in batches. It turns a guessing game into a precise science.