Nominal Alpha | Observed | Nominal Alpha | Observed |
---|---|---|---|
0.01 | 0.000 | 0.11 | 0.030 |
0.02 | 0.003 | 0.12 | 0.032 |
0.03 | 0.007 | 0.13 | 0.037 |
0.04 | 0.009 | 0.14 | 0.040 |
0.05 | 0.011 | 0.15 | 0.043 |
0.06 | 0.014 | 0.16 | 0.043 |
0.07 | 0.016 | 0.17 | 0.045 |
0.08 | 0.020 | 0.18 | 0.045 |
0.09 | 0.022 | 0.19 | 0.047 |
0.10 | 0.025 | 0.20 | 0.051 |

# Notes on Wald’s SPRT

Wald’s sequential probability ratio test is a procedure that evaluates evidence as observations arrive one at a time. The procedure terminates as soon as the evidence is sufficient to decide between \(H_0\) and \(H_1\), and can result in runtime savings over the traditional “fixed horizon” approaches common to A/B testing.

Wald’s Sequential Probability Ratio Test (SPRT) is a test for data that arrive sequentially rather than all at once. It allows us to stop and draw a conclusion after each observation comes in, while maintaining the overall type 1 error rate \(\alpha\) and type 2 error rate \(\beta\). It has found use in manufacturing and quality control, and more recently in online A/B testing.

According to Friedman and Friedman (1998), the US Navy during the Second World War had the problem of measuring the superiority of one munition over another. Firing munitions is expensive in terms of time and materials. The traditional approach (choosing a sample size that guarantees the error rates, waiting until the full sample is collected, then drawing a conclusion) seemed wasteful to experienced ordnance officers. These officers might notice after the first few thousand, or even few hundred, rounds that one kind of ordnance was clearly superior to the other, so why waste the rest of the rounds and their time?

This problem led Wald (1945) to develop the SPRT in 1943. Such was its value that it was classified as a military secret until 1945. The test compares two simple hypotheses about a parameter \(\theta\), \(H_0: \theta = \theta_0\) versus \(H_A: \theta = \theta_1\). Wald and Wolfowitz (1948) found that Wald’s test is optimal in the sense that no test can do better in terms of expected run time under \(H_0\) and under \(H_1\) without having a higher type 1 and/or type 2 error rate. This optimality condition at \(\theta_0\) and \(\theta_1\) is called Wald-Wolfowitz optimality, and is different from Kiefer-Wolfowitz optimality that I’ll describe later.

The SPRT relies on the likelihood ratio statistic \(\Lambda_n\) where

\[ \Lambda_n = \frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)} \]

and \(f_i\) represents the density or mass function with parameter value \(\theta_i\). Often we assume the observations are independent and identically distributed, in which case the statistic factors as

\[ \Lambda_n = \prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}~. \]

Let’s take a quick detour to discuss this statistic in the context of martingales.

Next, let’s look at a quick example of how to formulate the likelihood ratio for Gaussian variates.
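For \(\mathrm{N}(\theta, 1)\) observations the per-observation log likelihood ratio is \(\log\frac{f_1(x)}{f_0(x)} = (\theta_1-\theta_0)x - \frac{\theta_1^2-\theta_0^2}{2}\), so the statistic accumulates by simple addition. A minimal sketch (function and variable names are mine):

```python
import numpy as np

def gaussian_log_lr(xs, theta0, theta1):
    """Cumulative log Lambda_n for N(theta, 1) data, n = 1, 2, ...

    Each observation contributes (theta1 - theta0) * x - (theta1**2 - theta0**2) / 2.
    """
    xs = np.asarray(xs, dtype=float)
    increments = (theta1 - theta0) * xs - (theta1**2 - theta0**2) / 2
    return np.cumsum(increments)  # log Lambda_1, log Lambda_2, ...
```

On the log scale the test statistic is just a random walk with drift, which is what makes the martingale view natural.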

# Wald’s SPRT

In Wald (1945), Wald chooses constant boundaries \(A\) and \(B\) for the likelihood ratio statistic over \(n=1,2,\ldots\) and continues to sample at each time \(n\) while

\[ B < \Lambda_n < A~. \]

When \(\Lambda_n > A\) then we stop and declare for \(H_1\). When \(\Lambda_n < B\) we stop and declare for \(H_0\). We want to choose \(A\) and \(B\) so that we have some control of our type 1 error rate \(\alpha\) and type 2 error rate \(\beta\).

Consider some type 1 error \(\alpha\) under \(H_0\) and type 2 error \(\beta\) under \(H_1\). We proceed conditional on the experiment stopping at time \(n\) by defining two events

\[ R_0 = \{x : \Lambda_n \le B\} \quad\text{and}\quad R_1 = \{x : \Lambda_n \ge A\} \]

where \(R_0\) is the event that we stop the test and find for \(H_0\), and \(R_1\) is the event that we stop the test and find for \(H_1\).

The probability of detection \(P(\text{Detection})=1-\beta\) is found with respect to \(H_1\) when \(x \in R_1\), and the probability of a false alarm \(P(\text{False Alarm}) = \alpha\) is found with respect to \(H_0\) when \(x \in R_1\), so \[\begin{align*} 1-\beta &= \int_{R_1} f_1(x) ~\mathrm{d}x = \int_{R_1} \frac{f_1(x)}{f_0(x)} f_0(x) ~\mathrm{d}x \\ &= \int_{R_1} \Lambda_n f_0(x) ~\mathrm{d}x \ge A \int_{R_1} f_0(x)~\mathrm{d}x = A ~\alpha \end{align*}\] and we can choose \(A \ge \frac{1-\beta}{\alpha}\) to ensure our errors don’t exceed \(\alpha\) and \(\beta\).

Similarly, the probability of finding for \(H_0\) when \(H_0\) is true, \(1-P(\text{False Alarm}) = 1 - \alpha\), is found with respect to \(H_0\) when \(x \in R_0\). Also, the probability of a type 2 error \(1-P(\text{Detection}) = \beta\) is found with respect to \(H_1\) when \(x \in R_0\), so \[\begin{align*} 1-\alpha &= \int_{R_0} f_0(x) ~\mathrm{d}x = \int_{R_0} \frac{f_0(x)}{f_1(x)} f_1(x) ~\mathrm{d}x \\ &= \int_{R_0} (\Lambda_n)^{-1} f_1(x) ~\mathrm{d}x \ge B^{-1} \int_{R_0} f_1(x)~\mathrm{d}x = B^{-1} ~\beta \end{align*}\] and we can choose \(B \le \frac{\beta}{1-\alpha}\) to ensure our errors don’t exceed \(\alpha\) and \(\beta\).

Since we have bounds for \(A\) and \(B\) we can choose conservatively and decide as follows

\[ \text{At time}~n \left\{ \begin{array}{ll} \text{decide for}~H_0 & \text{when}~\Lambda_n < B = \frac{\beta}{1-\alpha},\\ \text{decide for}~H_1 & \text{when}~\Lambda_n > A = \frac{1-\beta}{\alpha},~\text{and}\\ \text{collect more data} & \text{otherwise}. \end{array} \right. \]
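The decision rule above amounts to a small loop over log likelihood-ratio increments. A minimal sketch (function and variable names are mine):

```python
import math

def sprt(log_lr_increments, alpha, beta):
    """Wald's SPRT over a stream of log likelihood-ratio increments.

    Returns a (decision, stopping_time) pair; 'undecided' means the
    stream ended before either boundary was crossed.
    """
    log_A = math.log((1 - beta) / alpha)  # log of upper boundary A
    log_B = math.log(beta / (1 - alpha))  # log of lower boundary B
    log_lambda, n = 0.0, 0
    for inc in log_lr_increments:
        n += 1
        log_lambda += inc
        if log_lambda >= log_A:
            return "H1", n  # stop and declare for H1
        if log_lambda <= log_B:
            return "H0", n  # stop and declare for H0
    return "undecided", n
```

For example, with \(\alpha=0.05\) and \(\beta=0.10\), constant increments of \(+1\) cross \(\log A = \log 18 \approx 2.89\) on the third observation.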

As long as the process is guaranteed to stop and the error rates at each stopping point are constant (or at least bounded), the overall test will have those error rates (or bounds). When the data are iid the stopping guarantee is given by Wald, but when they are not, it does not immediately follow that the process will stop. Wald and Wolfowitz (1948) show that the stopping time of the above procedure is optimal in the Wald-Wolfowitz sense.

Note that sequential tests are not generally required to have the same error rates at each stopping point, but for computational tractability we do so here. Other shapes of error boundaries may produce performance that is better for every \(\theta \in (\theta_0, \theta_1)\). Such performance is related to Kiefer-Wolfowitz optimality.

As a check I’ve simulated 100,000 runs of experiments where \(H_0: \theta = 0\) is true or \(H_A: \theta = 2\) is true and our observations are distributed \(\mathrm{N}(\theta,1)\). The empirical type 1 error rates (left three columns) and type 2 error rates (right three columns) for nominal \(\alpha\) and \(\beta\) are given in the table below.

| | \(\alpha=0.01\) | \(\alpha=0.05\) | \(\alpha=0.10\) | \(\alpha=0.01\) | \(\alpha=0.05\) | \(\alpha=0.10\) |
|---|---|---|---|---|---|---|
| \(\beta=0.01\) | 0.00324 | 0.01664 | 0.03178 | 0.00330 | 0.00367 | 0.00332 |
| \(\beta=0.05\) | 0.00338 | 0.01697 | 0.03202 | 0.01601 | 0.01631 | 0.01700 |
| \(\beta=0.10\) | 0.00357 | 0.01781 | 0.03333 | 0.03105 | 0.03326 | 0.03302 |

# Calculating \(p\)-Values

Since the \(p\)-value is the smallest \(\alpha\) at which we’d reject \(H_0\), we can use Ville’s inequality directly (or use Wald’s derivation with \(\beta = 0\)) and take

\[ p\text{-value at time}~k \le \min_{i=1,2,\ldots,k} \frac{1}{\Lambda_i}~. \]

The inequality comes from the fact that Ville’s inequality controls the supremum of \(\Lambda_k\) over all time, which is at least the maximum of \(\Lambda_i\) over any finite set of observations. Again, as long as the test is guaranteed to conclude, this *should* give us something useful as a \(p\)-value. I performed a simulation study of \(p\)-values under \(H_0: \theta = 0\) where \(X_i \sim \mathrm{N}(\theta, 1)\) against an alternative \(H_A: \theta = \theta_1\).
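A minimal sketch (names mine) of this running-minimum \(p\)-value computed from a path of \(\log \Lambda_i\) values:

```python
import numpy as np

def sequential_p_values(log_lr_path):
    """Running conservative p-values: p_k = min(1, min_{i<=k} 1/Lambda_i)."""
    log_lr_path = np.asarray(log_lr_path, dtype=float)
    # 1 / Lambda_i = exp(-log Lambda_i); take the running minimum over i <= k
    running_min = np.minimum.accumulate(np.exp(-log_lr_path))
    return np.minimum(running_min, 1.0)
```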

It’s interesting how conservative these \(p\)-values are, and that the amount of conservatism increases as the alternative grows larger.

# Calculating Confidence Intervals

We cannot construct a general closed-form confidence interval because the ratio of likelihoods is messy. Numerically, we can search for values of \(H_1: \theta = \theta_1\) for which the \(p\)-value is greater than or equal to our significance level \(\alpha\). But in certain circumstances we can construct a closed-form solution for the interval.

From Ville’s inequality or from Wald’s with \(\beta = 0\) we know

\[ P\left(\sup_k \Lambda_k \ge \frac{1}{\alpha} \right) \le \alpha \]

which we can restate as

\[ P\left(\Lambda_k \ge \frac{1}{\alpha} \quad\text{for some}\quad k\ge 1\right) \le \alpha~. \]

The complement event then has probability

\[ P\left(\Lambda_k < \frac{1}{\alpha} \quad\text{for every}\quad k\ge 1\right) \ge 1 - \alpha~. \]

The sequence, over times \(k=1,2,\ldots\), of sets of values of \(\theta_1\) satisfying \(\Lambda_k < \frac{1}{\alpha}\) thus forms a \(100(1-\alpha)\%\)-level confidence sequence.

# Expected Sample Sizes

For the iid case, Wald (1945) demonstrated that the random stopping time \(K\) satisfies, under \(f_i\) for \(i=0,1\),

\[ \mathbb{E}_i\left[\log \Lambda_K\right] = \mathbb{E}_i\left[K\right] \mathbb{E}_i\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right] \]

Also, from our error calculations (assuming the probability mass of passing the boundary is that of us hitting the boundary) \[\begin{align*} \mathbb{E}_0[\log \Lambda_K] &\approx \alpha \log A + (1-\alpha) \log B \quad\text{and}\\ \mathbb{E}_1[\log \Lambda_K] &\approx (1 - \beta) \log A + \beta \log B~. \end{align*}\] Solving for expected value \(\mathbb{E}_i[K]\) we get \[\begin{align*} \mathbb{E}_0[K] &\approx \frac{\alpha \log A + (1-\alpha) \log B}{\mathbb{E}_0\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right]} \quad\text{and}\\ \mathbb{E}_1[K] &\approx \frac{(1 - \beta) \log A + \beta \log B}{\mathbb{E}_1\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right]}~. \end{align*}\]
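These approximations are easy to evaluate. A sketch (names mine), where the denominators are the expected log likelihood-ratio increments under each hypothesis:

```python
import math

def expected_sample_sizes(alpha, beta, mean_llr_h0, mean_llr_h1):
    """Wald's approximations for E_0[K] and E_1[K].

    mean_llr_hi is E_i[log(f1(X)/f0(X))]: negative under H0, positive under H1.
    """
    log_A = math.log((1 - beta) / alpha)
    log_B = math.log(beta / (1 - alpha))
    e0 = (alpha * log_A + (1 - alpha) * log_B) / mean_llr_h0
    e1 = ((1 - beta) * log_A + beta * log_B) / mean_llr_h1
    return e0, e1

# For the N(theta, 1) simulation earlier (theta0 = 0, theta1 = 2) the
# expected increments are E_0[log LR] = -2 and E_1[log LR] = +2.
e0, e1 = expected_sample_sizes(0.05, 0.05, -2.0, 2.0)
```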

Let’s look at a specific example, a Bernoulli random variable.

# Bernoulli Example

Consider a situation where we observe binary events \(X_k \in \{0,1\}\) one at a time. Let \(S_n = \sum_{i=1}^{n} X_i\) represent the number of successes up to time \(n\). The likelihood ratio statistic is then

\[ \Lambda_n = \frac{p_1^{S_n}(1-p_1)^{n-S_n}}{p_0^{S_n}(1-p_0)^{n-S_n}}~. \]

It’s often better to work at log scale whenever likelihood ratios are involved, in which case

\[ \log \Lambda_n = S_n \log\left[\frac{p_1(1-p_0)}{p_0(1-p_1)}\right] + n \log\left(\frac{1-p_1}{1-p_0}\right) \]

If we let \(\alpha = 0.05\) and \(\beta = 0.10\) then we end up with bounds (on log scale) of

\[ \log A = 2.890372 \quad\text{and}\quad \log B = -2.251292 \]

Under \(H_0: p = 0.15\) and \(H_1: p = 0.1575\) then

\[ \log \Lambda_n = 0.05765285 \cdot S_n - 0.008862687 \cdot n \]

Note that if the success rate is \(0.15375 = (p_0 + p_1)/2\), the expected increment of \(\log \Lambda_n\) is approximately zero. This suggests that our test will take the longest at or near that value of \(p\).
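As a quick check, these constants can be reproduced in a few lines (a sketch using the \(\alpha\), \(\beta\), \(p_0\), and \(p_1\) from the text):

```python
import math

alpha, beta = 0.05, 0.10
p0, p1 = 0.15, 0.1575

log_A = math.log((1 - beta) / alpha)  # upper boundary on the log scale
log_B = math.log(beta / (1 - alpha))  # lower boundary on the log scale

# log Lambda_n = slope * S_n + drift * n
slope = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
drift = math.log((1 - p1) / (1 - p0))

print(round(log_A, 6), round(log_B, 6))  # 2.890372 -2.251292
print(round(slope, 8), round(drift, 9))  # 0.05765285 -0.008862687
```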

I ran a simulation of 100,000 runs of the SPRT under the cases \(p=p_0\), \(p=p_1\), and \(p=(p_0 + p_1)/2\) and found the following decision rates and expected run times:

True \(p\) | Choose \(H_0\) | Choose \(H_1\) | Exp. Runtime |
---|---|---|---|
\(p_0\) | 95.1% | 4.9% | 9,332.4 |
\(p_1\) | 9.9% | 90.1% | 11,000.3 |
\((p_0 + p_1)/2\) | 55.9% | 44.1% | 15,154.2 |

### Bernoulli Expected Stopping Time

We can work out the expected stopping time for this Bernoulli example since the Kullback-Leibler divergences are simple. Explicitly, \[\begin{align*} \mathbb{E}_0\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right] &= p_0 \log \frac{p_1}{p_0} + (1-p_0) \log \frac{1-p_1}{1-p_0},\quad\text{and} \\ \mathbb{E}_1\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right] &= p_1 \log \frac{p_1}{p_0} + (1-p_1) \log \frac{1-p_1}{1-p_0}~. \end{align*}\] The expected stopping time is thus \[\begin{align*} \mathbb{E}_0[K] &\approx \frac{\alpha \log \frac{1-\beta}{\alpha} + (1-\alpha) \log \frac{\beta}{1-\alpha}}{p_0 \log \frac{p_1}{p_0} + (1-p_0) \log \frac{1-p_1}{1-p_0}} \quad\text{and}\\ \mathbb{E}_1[K] &\approx \frac{(1 - \beta) \log \frac{1-\beta}{\alpha} + \beta \log \frac{\beta}{1-\alpha}}{p_1 \log \frac{p_1}{p_0} + (1-p_1) \log \frac{1-p_1}{1-p_0}}~. \end{align*}\]

For our example with \(\alpha = 0.05\), \(\beta=0.1\), \(p_0=0.15\), and \(p_1=0.1575\) we find the predicted values are very close to the empirical values:

True \(p\) | Analytical \(\mathbb{E}[K]\) | Empirical \(\mathbb{E}[K]\) |
---|---|---|
\(p_0\) | 9,285.8 | 9,332.4 |
\(p_1\) | 10,918.2 | 11,000.3 |
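The analytical column can be reproduced with a short function (a sketch; the name is mine):

```python
import math

def bernoulli_expected_k(alpha, beta, p0, p1):
    """Wald's approximate E_0[K] and E_1[K] for the Bernoulli SPRT."""
    log_A = math.log((1 - beta) / alpha)
    log_B = math.log(beta / (1 - alpha))

    def mean_llr(p):  # E_p[log(f1(X)/f0(X))] for a Bernoulli(p) observation
        return p * math.log(p1 / p0) + (1 - p) * math.log((1 - p1) / (1 - p0))

    e0 = (alpha * log_A + (1 - alpha) * log_B) / mean_llr(p0)
    e1 = ((1 - beta) * log_A + beta * log_B) / mean_llr(p1)
    return e0, e1

e0, e1 = bernoulli_expected_k(0.05, 0.10, 0.15, 0.1575)
print(round(e0, 1), round(e1, 1))  # roughly 9285.8 and 10918.2
```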

# Comparing Bernoulli Distributions

Let’s say we have a two-sample problem where we sample from a control \(X_i \sim \mathrm{Bernoulli}(p_x)\), \(i=1,\ldots,n_x\) and from a challenger \(Y_i \sim \mathrm{Bernoulli}(p_y)\), \(i=1,\ldots,n_y\), where these results come in one at a time in an arbitrary ordering of \(X_i\)’s and \(Y_i\)’s. We follow the approach used by Johari et al. (2022) for a different version of the SPRT and examine the statistic \(Z\),

\[ Z = \frac{\hat{p}_y - \hat{p}_x - \theta}{\sqrt{\frac{\hat{p}_x(1-\hat{p}_x)}{n_x}+\frac{\hat{p}_y(1-\hat{p}_y)}{n_y}}} \]

where \[\begin{align*} \hat{p} &= \frac{\sum_{i=1}^{n_x} X_i + \sum_{i=1}^{n_y} Y_i}{n_x + n_y}~,\\ \hat{p}_x &= \hat{p} - \frac{\theta}{2}~,\quad\text{and}\\ \hat{p}_y &= \hat{p} + \frac{\theta}{2}~. \end{align*}\]
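A sketch of this statistic computed from running counts (function and variable names are mine; the estimated variances are scaled by the sample sizes, as usual for a difference of proportions):

```python
import math

def two_sample_z(successes_x, n_x, successes_y, n_y, theta):
    """Z statistic for testing p_y - p_x = theta with pooled plug-in rates."""
    p_hat = (successes_x + successes_y) / (n_x + n_y)
    p_x = p_hat - theta / 2  # plug-in control rate under the hypothesis
    p_y = p_hat + theta / 2  # plug-in challenger rate under the hypothesis
    se = math.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
    diff = successes_y / n_y - successes_x / n_x
    return (diff - theta) / se
```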

Asymptotically, \(Z \sim \mathrm{N}(0,1)\), so the standard normal likelihood is used in this test. Johari et al. (2022) note that early on the sample size may be too small for the approximation to be appropriate, but that if \(\alpha\) is not too large this isn’t a problem. Another option would be to use a higher-order approximation to the binomial, or some bias-correction approach with better small-sample performance and negligible large-sample impact. (Note that Johari et al. (2022) go on to use a James-Stein estimator for effect sizes, which we do not use here.)

Another option is to use the binomial distribution directly.

### Johari’s Approximation

Using the plug-in estimation method of Johari et al. (2022), we perform a simulation similar to the last.

Again with \(p_0 = 0.15\), \(p_1 = 0.1575\), \(\alpha = 0.05\) and \(\beta = 0.10\), simulating 100,000 runs under various conditions, we find:

True \(p\) | Choose \(H_0\) | Choose \(H_1\) | Exp. Runtime |
---|---|---|---|
\(p_0\) | 95.0% | 4.9% | 36,374.8 |
\(p_1\) | 10.0% | 89.8% | 44,100.6 |
\((p_0 + p_1)/2\) | 55.1% | 42.7% | 58,784.7 |

# Problems with Wald’s SPRT

While Wald’s method is elegant and leverages some remarkably convenient simplifications, it is not (and there is no) universal solution to sequential testing. Some criticisms of Wald’s method:

- There is no bound on the stopping time: the test terminates with probability one, but a single run can take arbitrarily long, and no one will run a test for 1,000 years.
- The specific trade-offs implied by a constant probability of type 1 and type 2 error at each stopping time may not match your priorities.
- It can be less efficient than group sequential methods when only a few interim check-ins are needed.
- It requires fully specifying the likelihood under both hypotheses.
- It assumes there are no nuisance parameters, though there are techniques for handling them.
- It only tests a simple alternative, not a composite one. I think this criticism only has teeth when Wald-Wolfowitz optimality is poor in terms of Kiefer-Wolfowitz optimality, or when you want confidence intervals that behave sensibly.

Anyway, that’s it. That’s what I’ve got. I plan to write more about Lorden’s 2-SPRT as an approximate solution to the Kiefer-Wolfowitz problem and about the mixture SPRT due to Robbins (1970) that Optimizely uses, but at some other time (maybe never).

## References

Friedman, Milton, and Rose D. Friedman. 1998. *Two Lucky People: Memoirs*. University of Chicago Press. https://books.google.com/books?id=\_rY3wdaDZ5UC.

Johari, Ramesh, Pete Koomen, Leonid Pekelis, and David Walsh. 2022. “Always Valid Inference: Continuous Monitoring of A/B Tests.” *Operations Research* 70 (3): 1806–21.

Robbins, Herbert. 1970. “Statistical Methods Related to the Law of the Iterated Logarithm.” *The Annals of Mathematical Statistics* 41 (5): 1397–1409. https://doi.org/10.1214/aoms/1177696786.

“Ville’s Inequality.” *Wikipedia*. Wikimedia Foundation. https://en.wikipedia.org/wiki/Ville%27s_inequality.

Wald, Abraham. 1945. “Sequential Tests of Statistical Hypotheses.” *The Annals of Mathematical Statistics* 16 (2): 117–86. https://doi.org/10.1214/aoms/1177731118.

Wald, Abraham, and Jacob Wolfowitz. 1948. “Optimum Character of the Sequential Probability Ratio Test.” *The Annals of Mathematical Statistics* 19 (3): 326–39. https://doi.org/10.1214/aoms/1177730197.