Notes on Wald’s SPRT

Wald’s sequential probability ratio test is a procedure that allows one to make observations sequentially. The procedure terminates as soon as the evidence is sufficient to decide between \(H_0\) and \(H_1\), and can yield runtime savings over the traditional “fixed horizon” approaches common to A/B testing.

Published

May 20, 2023

Warning

This is in-progress and may be terribly wrong. In fact, it probably is wrong.

Wald’s Sequential Probability Ratio Test (SPRT) is a test for data that arrive sequentially rather than all at once. It allows us to stop and draw a conclusion after each observation comes in, while maintaining the overall type 1 error rate \(\alpha\) and type 2 error rate \(\beta\). It first found use in manufacturing and quality control, and more recently in online A/B testing problems.

According to Friedman and Friedman (1998), the US Navy during the Second World War faced the problem of measuring the superiority of one munition over another. Firing munitions is expensive in terms of time and materials. The traditional approach (choosing a sample size that guarantees the error rates, waiting until the full sample is collected, then drawing a conclusion) seemed wasteful to experienced ordnance officers. These officers might notice after the first few thousand or even a few hundred rounds that one kind of ordnance was clearly superior to the other, so why waste the rest of the rounds and their time?

This problem led Wald (1945) to develop the SPRT in 1943. Such was its value that it was classified as a military secret until 1945. The test compares two simple hypotheses about a parameter \(\theta\), \(H_0: \theta = \theta_0\) versus \(H_A: \theta = \theta_1\). Wald and Wolfowitz (1948) found that Wald’s test is optimal in the sense that no test can do better in terms of expected run time under \(H_0\) and under \(H_1\) without having a higher type 1 and/or type 2 error rate. This optimality condition at \(\theta_0\) and \(\theta_1\) is called Wald-Wolfowitz optimality, and is different from the Kiefer-Weiss optimality that I’ll describe later.

The SPRT relies on the likelihood ratio statistic \(\Lambda_n\) where

\[ \Lambda_n = \frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)} \]

and \(f_i\) represents the density or mass function with parameter value \(\theta_i\). Often we assume the observations are independent and identically distributed, in which case the statistic factors as

\[ \Lambda_n = \prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}~. \]

Let’s take a quick detour to discuss this statistic in the context of martingales.

The Likelihood Ratio Statistic is a Martingale

It’s interesting that the likelihood ratio statistic is a positive martingale: \[\begin{align*} \mathbb{E}_{f_0}\left[\Lambda_{k+1}|\Lambda_k,\ldots,\Lambda_1\right] &= \mathbb{E}_{f_0}\left[\frac{f_1(X_{k+1})}{f_0(X_{k+1})}\Lambda_{k}|\Lambda_k,\ldots,\Lambda_1\right] \\ &= \Lambda_k \mathbb{E}_{f_0}\left[\frac{f_1(X_{k+1})}{f_0(X_{k+1})}\right] \\ &= \Lambda_k \int \frac{f_1(x)}{f_0(x)} f_0(x) ~\mathrm{d}x \\ &= \Lambda_k \int f_1(x) ~\mathrm{d}x \\ &= \Lambda_k~. \end{align*}\]

Immediately, one could use Ville’s maximal inequality for super-martingales (“Ville’s Inequality” 2023). If \((S_n)\) is a non-negative super-martingale, then for any \(x > 0\) we have

\[ P\left(\sup_{n \in \mathbb{N}} S_n > x\right) \le \frac{\mathbb{E}[S_0]}{x}~. \]

Applying this with \(S_n = \Lambda_n\), where \(\mathbb{E}_{f_0}[\Lambda_0] = 1\), lets us choose an upper boundary \(x = \frac{1}{\alpha}\) to ensure the type 1 error rate is no larger than \(\alpha\). This is not Wald’s original argument, which we’ll review a little later.

Next, let’s look at a quick example of how to formulate the likelihood ratio for Gaussian variates.

Example: Gaussian Likelihood Ratio

Consider \(X_i \sim \mathrm{N}(\theta, 1)\) and let \(S_k = \sum_{i=1}^k X_i\). Note that \[\begin{align*} \Lambda_k &= \mathrm{exp}\left\{\frac{1}{2} \sum_{i=1}^k \biggl[(X_i - \theta_0)^2 - (X_i - \theta_1)^2\biggr]\right\} \\ &= \mathrm{exp}\left\{ (\theta_1 - \theta_0) S_k - \frac{k}{2}(\theta_1^2 - \theta_0^2) \right\}~. \end{align*}\]

Using Ville’s inequality above we can calculate some type 1 error rates since

\[ P\left(\sup_{k} \Lambda_k > \frac{1}{\alpha}\right) < \alpha \quad\text{so the event}\quad \min_{k=1,\ldots,n} \frac{1}{\Lambda_k} < \alpha \]

should happen at most a fraction \(\alpha\) of the time for our choice of \(n\).

We can simulate type 1 error rates when \(\theta_0 = 0\) and \(\theta_1 = 2\), looking out to time \(n=100000\). The nominal and observed type 1 error rates at various levels of \(\alpha\) are:

| Nominal \(\alpha\) | Observed | Nominal \(\alpha\) | Observed |
|---|---|---|---|
| 0.01 | 0.000 | 0.11 | 0.030 |
| 0.02 | 0.003 | 0.12 | 0.032 |
| 0.03 | 0.007 | 0.13 | 0.037 |
| 0.04 | 0.009 | 0.14 | 0.040 |
| 0.05 | 0.011 | 0.15 | 0.043 |
| 0.06 | 0.014 | 0.16 | 0.043 |
| 0.07 | 0.016 | 0.17 | 0.045 |
| 0.08 | 0.020 | 0.18 | 0.045 |
| 0.09 | 0.022 | 0.19 | 0.047 |
| 0.10 | 0.025 | 0.20 | 0.051 |

This is relatively conservative, perhaps partially due to the finite horizon of the simulation, but it obeys the inequality given above.

We can also base our test on the sample mean \(\bar{X}_k\) or the running sum \(S_k\). Our running sum \(S_k \sim \mathrm{N}(k\theta, k)\) so the likelihood ratio statistic has to be modified accordingly.

\[\begin{align*} \Lambda_k &= \mathrm{exp}\left\{\frac{1}{2k}\biggl[(S_k - k\theta_0)^2 - (S_k - k\theta_1)^2\biggr]\right\} \\ &= \mathrm{exp}\left\{ (\theta_1 - \theta_0) S_k - \frac{k}{2}(\theta_1^2 - \theta_0^2) \right\} \end{align*}\] which is the same expression we obtained by looking at each \(X_i\) individually.

This approach seems to be more numerically stable. If we repeat the simulation using the running sum \(S_k\) we find similar values, though the test is less conservative (while still conservative) at all but one of the \(\alpha\) levels:

| Nominal \(\alpha\) | Observed | Nominal \(\alpha\) | Observed |
|---|---|---|---|
| 0.01 | 0.002 | 0.11 | 0.037 |
| 0.02 | 0.009 | 0.12 | 0.042 |
| 0.03 | 0.013 | 0.13 | 0.046 |
| 0.04 | 0.013 | 0.14 | 0.049 |
| 0.05 | 0.017 | 0.15 | 0.050 |
| 0.06 | 0.017 | 0.16 | 0.056 |
| 0.07 | 0.018 | 0.17 | 0.056 |
| 0.08 | 0.026 | 0.18 | 0.061 |
| 0.09 | 0.029 | 0.19 | 0.062 |
| 0.10 | 0.033 | 0.20 | 0.065 |

This suggests that, for this problem at least, using the running sum is best.
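
For reference, here is a minimal sketch of this kind of type 1 error simulation, using the running-sum form of the log likelihood ratio. This is hypothetical helper code, not the exact script behind the tables, and the horizon and number of runs are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, theta1 = 0.0, 2.0
n, n_runs = 100_000, 1_000                # horizon and number of simulated runs
alphas = np.linspace(0.01, 0.20, 20)

crossings = np.zeros(len(alphas))
for _ in range(n_runs):
    x = rng.normal(theta0, 1.0, size=n)   # data generated under H0
    s = np.cumsum(x)                      # running sum S_k
    k = np.arange(1, n + 1)
    log_lr = (theta1 - theta0) * s - k / 2 * (theta1**2 - theta0**2)
    crossings += log_lr.max() > -np.log(alphas)   # did Lambda_k ever exceed 1/alpha?

print(np.column_stack([alphas, crossings / n_runs]))
```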

Wald’s SPRT

Wald (1945) chooses constant boundaries \(A\) and \(B\) for the likelihood ratio statistic and, at each time \(n=1,2,\ldots\), continues to sample while

\[ B < \Lambda_n < A~. \]

When \(\Lambda_n > A\) then we stop and declare for \(H_1\). When \(\Lambda_n < B\) we stop and declare for \(H_0\). We want to choose \(A\) and \(B\) so that we have some control of our type 1 error rate \(\alpha\) and type 2 error rate \(\beta\).

Consider some type 1 error \(\alpha\) under \(H_0\) and type 2 error \(\beta\) under \(H_1\). We proceed conditional on the experiment stopping at time \(n\) by defining two events

\[ R_0 = \{x : \Lambda_n \le B\} \quad\text{and}\quad R_1 = \{x : \Lambda_n \ge A\} \]

where \(R_0\) is the event that we stop the test and find for \(H_0\), and \(R_1\) is the event that we stop the test and find for \(H_1\).

The probability of detection \(P(\text{Detection})=1-\beta\) is found with respect to \(H_1\) when \(x \in R_1\), and the probability of a false alarm \(P(\text{False Alarm}) = \alpha\) is found with respect to \(H_0\) when \(x \in R_1\), so \[\begin{align*} 1-\beta &= \int_{R_1} f_1(x) ~\mathrm{d}x = \int_{R_1} \frac{f_1(x)}{f_0(x)} f_0(x) ~\mathrm{d}x \\ &= \int_{R_1} \Lambda_n f_0(x) ~\mathrm{d}x \ge A \int_{R_1} f_0(x)~\mathrm{d}x = A ~\alpha~, \end{align*}\] which rearranges to \(A \le \frac{1-\beta}{\alpha}\).

Similarly, the probability of finding for \(H_0\) when \(H_0\) is true, \(1-P(\text{False Alarm}) = 1 - \alpha\), is found with respect to \(H_0\) when \(x \in R_0\). Also, the probability of a type 2 error \(1-P(\text{Detection}) = \beta\) is found with respect to \(H_1\) when \(x \in R_0\), so \[\begin{align*} 1-\alpha &= \int_{R_0} f_0(x) ~\mathrm{d}x = \int_{R_0} \frac{f_0(x)}{f_1(x)} f_1(x) ~\mathrm{d}x \\ &= \int_{R_0} (\Lambda_n)^{-1} f_1(x) ~\mathrm{d}x \ge B^{-1} \int_{R_0} f_1(x)~\mathrm{d}x = B^{-1} ~\beta~, \end{align*}\] which rearranges to \(B \ge \frac{\beta}{1-\alpha}\).

Given these bounds, Wald takes the extreme choices \(A = \frac{1-\beta}{\alpha}\) and \(B = \frac{\beta}{1-\alpha}\). Ignoring the overshoot when \(\Lambda_n\) crosses a boundary, this approximately achieves the target error rates; Wald (1945) shows that the realized rates \(\alpha'\) and \(\beta'\) satisfy \(\alpha' + \beta' \le \alpha + \beta\), and in practice the overshoot makes the procedure conservative. We decide as follows

\[ \text{At time}~n \left\{ \begin{array}{ll} \text{decide for}~H_0 & \text{when}~\Lambda_n < B = \frac{\beta}{1-\alpha},\\ \text{decide for}~H_1 & \text{when}~\Lambda_n > A = \frac{1-\beta}{\alpha},~\text{and}\\ \text{collect more data} & \text{otherwise}. \end{array} \right. \]

As long as the process is guaranteed to stop and the error rates at each stopping point are constant (or at least bounded), the overall test will have those error rates (or bounds). When the data are iid the stopping guarantee is given by Wald, but when the data are not iid it does not immediately follow that the process will stop. Wald and Wolfowitz (1948) show that the stopping time of the above procedure is optimal in the Wald-Wolfowitz sense.

Note that sequential tests are not generally required to have the same error rates at each stopping point, but for computational tractability we do so here. Other shapes of error boundaries may produce performance that is better for every \(\theta \in (\theta_0, \theta_1)\). Such performance is related to Kiefer-Weiss optimality.
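
As a concrete reference, here is a minimal sketch of this decision rule for the Gaussian example on the log scale. It is a hypothetical helper, not the code used for the check below:

```python
import numpy as np

def sprt_gaussian(xs, theta0, theta1, alpha, beta):
    """Run Wald's SPRT on a stream of N(theta, 1) observations (log scale)."""
    log_A = np.log((1 - beta) / alpha)   # upper boundary: decide for H1
    log_B = np.log(beta / (1 - alpha))   # lower boundary: decide for H0
    log_lr = 0.0
    for k, x in enumerate(xs, start=1):
        # per-observation log likelihood ratio log f1(x)/f0(x)
        log_lr += (theta1 - theta0) * x - (theta1**2 - theta0**2) / 2
        if log_lr >= log_A:
            return "H1", k
        if log_lr <= log_B:
            return "H0", k
    return "undecided", len(xs)

rng = np.random.default_rng(1)
print(sprt_gaussian(rng.normal(0.0, 1.0, size=10_000), 0.0, 2.0, 0.05, 0.10))
```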

As a check I’ve simulated 100,000 runs of experiments where either \(H_0: \theta = 0\) or \(H_A: \theta = 2\) is true and our observations are distributed \(\mathrm{N}(\theta,1)\). The empirical type 1 and type 2 error rates for nominal \(\alpha\) and \(\beta\) are given in the table below.

| | Type 1 Error, \(\alpha=0.01\) | Type 1 Error, \(\alpha=0.05\) | Type 1 Error, \(\alpha=0.10\) | Type 2 Error, \(\alpha=0.01\) | Type 2 Error, \(\alpha=0.05\) | Type 2 Error, \(\alpha=0.10\) |
|---|---|---|---|---|---|---|
| \(\beta=0.01\) | 0.00324 | 0.01664 | 0.03178 | 0.00330 | 0.00367 | 0.00332 |
| \(\beta=0.05\) | 0.00338 | 0.01697 | 0.03202 | 0.01601 | 0.01631 | 0.01700 |
| \(\beta=0.10\) | 0.00357 | 0.01781 | 0.03333 | 0.03105 | 0.03326 | 0.03302 |

Calculating \(p\)-Values

Since the \(p\)-value is the smallest \(\alpha\) at which we’d reject \(H_0\), we can use Ville’s inequality directly (or use Wald’s derivation with \(\beta = 0\)) and take

\[ p\text{-value at time}~k \le \min_{i=1,2,\ldots,k} \frac{1}{\Lambda_i}~. \]

The inequality comes from the fact that Ville’s inequality is for the supremum, which must be larger than (or equal to) the maximum over any finite set of observations. Again, as long as the test is guaranteed to conclude, this should give us something useful as a \(p\)-value. I performed a simulation study of \(p\)-values under \(H_0: \theta = 0\), where \(X_i \sim \mathrm{N}(\theta, 1)\), against alternatives \(H_A: \theta = \theta_1\).
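
A minimal sketch of how one might compute this running-minimum \(p\)-value for the Gaussian example (a hypothetical helper, not the exact code behind the study):

```python
import numpy as np

def running_p_values(xs, theta0, theta1):
    """p_k = min(1, min_{i <= k} 1 / Lambda_i) for the Gaussian likelihood ratio."""
    s = np.cumsum(xs)
    k = np.arange(1, len(xs) + 1)
    log_lr = (theta1 - theta0) * s - k / 2 * (theta1**2 - theta0**2)
    return np.minimum(1.0, np.exp(-np.maximum.accumulate(log_lr)))

rng = np.random.default_rng(2)
xs = rng.normal(0.0, 1.0, size=1_000)      # data generated under H0: theta = 0
print(running_p_values(xs, theta0=0.0, theta1=2.0)[-1])
```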

It’s interesting how conservative these \(p\)-values are, and that the amount of conservatism increases as the alternative grows larger.

Calculating Confidence Intervals

We cannot construct a general closed-form confidence interval because the ratio of likelihoods is messy. Numerically, we can search for the values of \(H_1: \theta = \theta_1\) for which the \(p\)-value is greater than or equal to our significance level \(\alpha\). But in certain circumstances we can construct a closed-form solution for the interval.
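
For concreteness, here is a minimal sketch of that numerical search using the Gaussian likelihood ratio from the earlier example; the helper name and the grid of candidate values are arbitrary choices:

```python
import numpy as np

def confidence_set(xs, theta0, candidate_thetas, alpha=0.05):
    """Keep the theta_1 values whose running-minimum p-value is still >= alpha."""
    s = np.cumsum(xs)
    k = np.arange(1, len(xs) + 1)
    kept = []
    for theta1 in candidate_thetas:
        log_lr = (theta1 - theta0) * s - k / 2 * (theta1**2 - theta0**2)
        p_value = min(1.0, np.exp(-log_lr.max()))   # min_i 1/Lambda_i, capped at 1
        if p_value >= alpha:
            kept.append(theta1)
    return np.array(kept)

rng = np.random.default_rng(3)
xs = rng.normal(0.0, 1.0, size=100)
print(confidence_set(xs, theta0=0.0, candidate_thetas=np.linspace(-3, 3, 121)))
```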

From Ville’s inequality or from Wald’s with \(\beta = 0\) we know

\[ P\left(\sup_k \Lambda_k > \frac{1}{\alpha} \right) < \alpha \]

which we can restate as

\[ P\left(\Lambda_k > \frac{1}{\alpha} \quad\text{for some}\quad k\ge 1\right) < \alpha~. \]

The complement event then has probability

\[ P\left(\Lambda_k \le \frac{1}{\alpha} \quad\text{for every}\quad k\ge 1\right) \ge 1 - \alpha~. \]

The sequence, over times \(k=1,2,\ldots\), of intervals of \(\theta_1\) values satisfying \(\Lambda_k \le \frac{1}{\alpha}\) is thus a \(100(1-\alpha)\%\)-level confidence sequence.

Example: Gaussian CIs when \(\sigma^2=1\)

Consider independent and identically distributed samples

\[ X_i \sim \mathrm{N}(\theta,1) \]

for \(i=1,2,\ldots\), and let’s look at the likelihood ratio at time \(k\): \[\begin{align*} \Lambda_k &= \mathrm{exp}\left\{ \frac{1}{2} \sum_{i=1}^k \biggl[(X_i - \theta_0)^2 - (X_i - \theta_1)^2\biggr] \right\} \\ &= \mathrm{exp}\left\{ \frac{k}{2} (\theta_0^2 - \theta_1^2) + (\theta_1 - \theta_0) k \bar{X}_k \right\}~. \end{align*}\] Starting from \(1/\Lambda_k > \alpha\), taking the log of both sides and rearranging, we get

\[ \frac{k}{2} (\theta_0^2 - \theta_1^2) + (\theta_1 - \theta_0) k \bar{X}_k + \log \alpha < 0~. \]

The roots with respect to \(\theta_1\) of the left-hand side are

\[ \text{roots w.r.t.}~~\theta_1 = \bar{X}_k \pm \sqrt{\left(\bar{X}_k - \theta_0\right)^2 + \frac{2}{k} \log \alpha}~. \]

When \((\bar{X}_k - \theta_0)^2 + \frac{2}{k}\log \alpha < 0\) then all values of \(\theta_1\) satisfy the inequality.

Recall that the confidence sequence is over all \(k\), so taking the interval at time \(k\) to be the intersection of all intervals up to time \(k\) does not change the coverage of the sequence.
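
For concreteness, a small sketch evaluating the root formula above at a few times; the running means below are made-up values, not simulation output:

```python
import numpy as np

def ci_roots(xbar_k, k, theta0=0.0, alpha=0.05):
    """Roots of the quadratic above; None means every theta_1 satisfies the inequality."""
    disc = (xbar_k - theta0) ** 2 + (2.0 / k) * np.log(alpha)
    if disc < 0:
        return None
    return xbar_k - np.sqrt(disc), xbar_k + np.sqrt(disc)

# made-up running means, just to see the boundary behavior at k = 1, 10, 100
for k, xbar in [(1, 0.3), (10, 0.3), (100, 0.3)]:
    print(k, ci_roots(xbar, k))
```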

The following plot is of confidence intervals for \(\theta_0=0\) with \(\alpha = 0.05\) at times \(k=1\), \(k=10\), and \(k=100\).

These confidence intervals have strange properties. Note that the \((2/k) \log \alpha\) term gets larger (less negative) as \(k\) increases, causing the interval to swell towards \(\theta_0\).

Practitioners under the alternative hypothesis will not continue to run after the interval no longer covers \(\theta_0\), so we won’t see this weird swelling happen in practice. Further, such practitioners may choose to take the intersection of all the intervals up to time \(k\) for the \(k\)th interval in which case the width of such intervals will be non-decreasing.

Practitioners under the null hypothesis will see \(\bar{X}_k\) converge to \(\theta_0\), in which case the interval continues to shrink. Thus, the behavior of these intervals in practice is saner than the formula alone might suggest.

I ran a large (1M run) simulation with the true \(\theta = \theta_1\), where the alternative hypothesis is \(\theta_1 = 2\) and the null hypothesis is \(\theta_0 = 0\). Confidence regions at a 95% level were constructed, but they covered the true value \(\theta_1\) only 87.7% of the time. This lack of coverage suggests I did something wrong, but I can’t figure out what.

Repeating the study at a specified fixed stopping time gave us the same coverage. But if we look at the intersection of all confidence intervals up to a stopping time we get poorer-than-nominal coverage, and I’m not sure why that is the case.

Expected Sample Sizes

For the iid case, Wald (1945) demonstrated that the random stopping time \(K\) satisfies, under \(f_i\) for \(i=0,1\),

\[ \mathbb{E}_i\left[\log \Lambda_K\right] = \mathbb{E}_i\left[K\right] \mathbb{E}_i\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right]~. \]

Also, from our error calculations (assuming the probability mass when crossing a boundary is concentrated at the boundary itself, i.e. ignoring overshoot) \[\begin{align*} \mathbb{E}_0[\log \Lambda_K] &\approx \alpha \log A + (1-\alpha) \log B \quad\text{and}\\ \mathbb{E}_1[\log \Lambda_K] &\approx (1 - \beta) \log A + \beta \log B~. \end{align*}\] Solving for the expected stopping time \(\mathbb{E}_i[K]\) we get \[\begin{align*} \mathbb{E}_0[K] &\approx \frac{\alpha \log A + (1-\alpha) \log B}{\mathbb{E}_0\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right]} \quad\text{and}\\ \mathbb{E}_1[K] &\approx \frac{(1 - \beta) \log A + \beta \log B}{\mathbb{E}_1\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right]}~. \end{align*}\]

Let’s look at a specific example, a Bernoulli random variable.

Bernoulli Example

Consider a situation where we observe binary events \(X_k \in \{0,1\}\) one at a time. Let \(S_n = \sum_{i=1}^{n} X_i\) represent the number of successes up to time \(n\). The likelihood ratio statistic is then

\[ \Lambda_n = \frac{p_1^{S_n}(1-p_1)^{n-S_n}}{p_0^{S_n}(1-p_0)^{n-S_n}}~. \]

It’s often better to work at log scale whenever likelihood ratios are involved, in which case

\[ \log \Lambda_n = S_n \log\left[\frac{p_1(1-p_0)}{p_0(1-p_1)}\right] + n \log\left(\frac{1-p_1}{1-p_0}\right) \]

If we let \(\alpha = 0.05\) and \(\beta = 0.10\) then we end up with bounds (on log scale) of

\[ \log A = 2.890372 \quad\text{and}\quad \log B = -2.251292 \]

With \(H_0: p = 0.15\) and \(H_1: p = 0.1575\) we have

\[ \log \Lambda_n = 0.05765285 \cdot S_n - 0.008862687 \cdot n \]

Note that if the success rate is \(0.15375 = (p_0 + p_1)/2\), the expected value of \(\log \Lambda_n\) is approximately zero. This suggests that our test will take the longest at or near that value of \(p\).
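
A minimal sketch of a single run of this Bernoulli SPRT, using the constants above; the helper is hypothetical and the cap on the number of observations is an arbitrary safeguard:

```python
import numpy as np

def sprt_bernoulli(p_true, p0=0.15, p1=0.1575, alpha=0.05, beta=0.10,
                   max_n=1_000_000, rng=None):
    """One run of the Bernoulli SPRT; returns (decision, stopping time)."""
    if rng is None:
        rng = np.random.default_rng()
    log_A = np.log((1 - beta) / alpha)                 #  2.890372
    log_B = np.log(beta / (1 - alpha))                 # -2.251292
    slope = np.log(p1 * (1 - p0) / (p0 * (1 - p1)))    # coefficient on S_n
    drift = np.log((1 - p1) / (1 - p0))                # coefficient on n
    s_n = 0
    for n in range(1, max_n + 1):
        s_n += rng.random() < p_true                   # one Bernoulli observation
        log_lr = slope * s_n + drift * n
        if log_lr >= log_A:
            return "H1", n
        if log_lr <= log_B:
            return "H0", n
    return "undecided", max_n

print(sprt_bernoulli(0.15, rng=np.random.default_rng(4)))
```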

I did a simulation of 100,000 runs of the SPRT under the case \(p=p_0\), \(p=p_1\), and \(p=(p_0 + p_1)/2\) and found the following decision rates and expected run-times:

| True \(p\) | Choose \(H_0\) | Choose \(H_1\) | Exp. Runtime |
|---|---|---|---|
| \(p_0\) | 95.1% | 4.9% | 9,332.4 |
| \(p_1\) | 9.9% | 90.1% | 11,000.3 |
| \((p_0 + p_1)/2\) | 55.9% | 44.1% | 15,154.2 |

Bernoulli Expected Stopping Time

We can work out the expected stopping time for this Bernoulli example since the Kullback-Leibler divergences are simple. Explicitly, \[\begin{align*} \mathbb{E}_0\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right] &= p_0 \log \frac{p_1}{p_0} + (1-p_0) \log \frac{1-p_1}{1-p_0},\quad\text{and} \\ \mathbb{E}_1\left[\log \frac{f_1(X_i)}{f_0(X_i)}\right] &= p_1 \log \frac{p_1}{p_0} + (1-p_1) \log \frac{1-p_1}{1-p_0}~. \end{align*}\] The expected stopping time is thus \[\begin{align*} \mathbb{E}_0[K] &\approx \frac{\alpha \log \frac{1-\beta}{\alpha} + (1-\alpha) \log \frac{\beta}{1-\alpha}}{p_0 \log \frac{p_1}{p_0} + (1-p_0) \log \frac{1-p_1}{1-p_0}} \quad\text{and}\\ \mathbb{E}_1[K] &\approx \frac{(1 - \beta) \log \frac{1-\beta}{\alpha} + \beta \log \frac{\beta}{1-\alpha}}{p_1 \log \frac{p_1}{p_0} + (1-p_1) \log \frac{1-p_1}{1-p_0}}~. \end{align*}\]
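
Plugging the example’s numbers into these approximations is a quick calculation; a sketch:

```python
import numpy as np

alpha, beta = 0.05, 0.10
p0, p1 = 0.15, 0.1575

log_A = np.log((1 - beta) / alpha)
log_B = np.log(beta / (1 - alpha))

# E_i[log f1(X)/f0(X)] under H0 and H1
kl0 = p0 * np.log(p1 / p0) + (1 - p0) * np.log((1 - p1) / (1 - p0))
kl1 = p1 * np.log(p1 / p0) + (1 - p1) * np.log((1 - p1) / (1 - p0))

E0_K = (alpha * log_A + (1 - alpha) * log_B) / kl0
E1_K = ((1 - beta) * log_A + beta * log_B) / kl1
print(round(E0_K, 1), round(E1_K, 1))   # approximately 9285.8 and 10918.2
```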

For our example with \(\alpha = 0.05\), \(\beta=0.1\), \(p_0=0.15\), and \(p_1=0.1575\) we find the predicted value is very close to the empirical value:

| True \(p\) | Analytical \(\mathbb{E}[K]\) | Empirical \(\mathbb{E}[K]\) |
|---|---|---|
| \(p_0\) | 9,285.8 | 9,332.4 |
| \(p_1\) | 10,918.2 | 11,000.3 |

Comparing Bernoulli Distributions

Let’s say we have a two-sample problem where we sample from a control \(X_i \sim \mathrm{Bernoulli}(p_x)\), \(i=1,\ldots,n_x\), and from a challenger \(Y_i \sim \mathrm{Bernoulli}(p_y)\), \(i=1,\ldots,n_y\), and where these results arrive one at a time in an arbitrary ordering of \(X_i\)’s and \(Y_i\)’s. We follow the approach used by Johari et al. (2022) for a different version of the SPRT and examine the statistic \(Z\).

\[ Z = \frac{\hat{p}_y - \hat{p}_x - \theta}{\sqrt{\frac{\tilde{p}_x(1-\tilde{p}_x)}{n_x}+\frac{\tilde{p}_y(1-\tilde{p}_y)}{n_y}}} \]

where \(\hat{p}_x\) and \(\hat{p}_y\) are the raw sample proportions in each arm and the variance is estimated under the hypothesized lift \(\theta\) using \[\begin{align*} \hat{p} &= \frac{\sum_{i=1}^{n_x} X_i + \sum_{i=1}^{n_y} Y_i}{n_x + n_y}~,\\ \tilde{p}_x &= \hat{p} - \frac{\theta}{2}~,\quad\text{and}\\ \tilde{p}_y &= \hat{p} + \frac{\theta}{2}~. \end{align*}\]

Asymptotically, \(Z \sim \mathrm{N}(0,1)\), so the standard normal likelihood is used in this test. Johari et al. (2022) note that early on the sample size may be too small for the approximation to be appropriate, but if the value of \(\alpha\) is not too large then it’s not a problem. Another option would be to use a higher-order approximation to the binomial, or some bias-correction approach with better small-sample performance and a negligible large-sample impact. (Note that Johari et al. (2022) go on to use a James-Stein estimator for effect sizes, which we do not use here.)
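
A sketch of computing this statistic as written above; the helper is hypothetical and the arm sizes and rates below are just for illustration:

```python
import numpy as np

def z_statistic(xs, ys, theta):
    """Two-sample z statistic for a hypothesized lift theta = p_y - p_x."""
    n_x, n_y = len(xs), len(ys)
    p_hat = (np.sum(xs) + np.sum(ys)) / (n_x + n_y)   # pooled estimate
    p_x = p_hat - theta / 2.0                         # null-restricted estimates
    p_y = p_hat + theta / 2.0
    se = np.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
    return (np.mean(ys) - np.mean(xs) - theta) / se

rng = np.random.default_rng(5)
xs = rng.random(5_000) < 0.15      # control conversions
ys = rng.random(5_000) < 0.1575    # challenger conversions
print(z_statistic(xs, ys, theta=0.0075))
```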

Another option is to use the binomial distribution directly.

Johari’s Approximation

Using the plug-in estimation method of Johari et al. (2022), we perform a simulation similar to the previous one.

Again with \(p_0 = 0.15\), \(p_1 = 0.1575\), \(\alpha = 0.05\) and \(\beta = 0.10\), simulating 100,000 runs under various conditions, we find:

| True \(p\) | Choose \(H_0\) | Choose \(H_1\) | Exp. Runtime |
|---|---|---|---|
| \(p_0\) | 95.0% | 4.9% | 36,374.8 |
| \(p_1\) | 10.0% | 89.8% | 44,100.6 |
| \((p_0 + p_1)/2\) | 55.1% | 42.7% | 58,784.7 |

Problems with Wald’s SPRT

While Wald’s method is surprising and elegantly leverages a lot of lucky simplifications, it’s not (and there is not) a universal solution to sequential testing. Some criticisms of Wald’s method are:

  • There’s no guaranteed upper bound on the stopping time, which is unrealistic because no one will run a test for 1,000 years.
  • The specific trade-offs implied in the constant probability of type 1 and type 2 error at each stopping time may not vibe with you very well.
  • It can be less efficient than group sequential methods when only a few interim check-ins are needed.
  • “Requires knowing the likelihood” which I guess, I dunno, I read that and that’s true so I put it here.
  • Assumes no nuisance parameters - though there are some techniques for handling this IIRC.
  • “Only tests simple alternative not composite” - I think this only has teeth when Wald-Wolfowitz optimality is poor in terms of Kiefer-Weiss optimality, or when you want confidence intervals to not be weird.

Anyway, that’s it. That’s what I got. I plan to write more about Lorden’s 2-SPRT as an approximate solution to the Kiefer-Weiss problem and the mixture SPRT due to Robbins (1970) that Optimizely uses, but at some other time, maybe never.

References

Friedman, M., and R. D. Friedman. 1998. Two Lucky People: Memoirs. University of Chicago Press. https://books.google.com/books?id=_rY3wdaDZ5UC.
Johari, Ramesh, Pete Koomen, Leonid Pekelis, and David Walsh. 2022. “Always Valid Inference: Continuous Monitoring of A/B Tests.” Operations Research 70 (3): 1806–21.
Robbins, Herbert. 1970. “Statistical Methods Related to the Law of the Iterated Logarithm.” The Annals of Mathematical Statistics 41 (5): 1397–1409. https://doi.org/10.1214/aoms/1177696786.
“Ville’s Inequality.” 2023. Wikipedia. Wikimedia Foundation. https://en.wikipedia.org/wiki/Ville%27s_inequality.
Wald, A. 1945. “Sequential Tests of Statistical Hypotheses.” The Annals of Mathematical Statistics 16 (2): 117–86. https://doi.org/10.1214/aoms/1177731118.
Wald, A., and J. Wolfowitz. 1948. “Optimum Character of the Sequential Probability Ratio Test.” The Annals of Mathematical Statistics 19 (3): 326–39. https://doi.org/10.1214/aoms/1177730197.