# Notes on Robbins’ SPRT

The mixture sequential probability ratio test (mSPRT) extends Wald’s SPRT by integrating the likelihood ratio over a family of alternative values, improving worst-case performance while still guaranteeing a frequentist false alarm rate.

The mixture SPRT (mSPRT) of Robbins (1970) is based on the integrated likelihood ratio

\[ \Lambda_n(\theta_0) = \frac{ \int_{-\infty}^\infty \left[\prod_{i=1}^n f(x_i; \theta)\right] \pi_\tau(\theta) ~\mathrm{d}\theta }{ \prod_{i=1}^n f(x_i; \theta_0) } ~. \]

We can view it as an extension of Wald’s SPRT that reduces the worst-case performance loss by integrating the likelihood ratio over a family of alternative values weighted by the mixing density \(\pi_\tau(\theta)\).

We can use this to produce a test that still guarantees our type 1 error rate. The guarantee relies on \((\Lambda_n)_{n=1}^\infty\) being a nonnegative martingale under \(H_0\), which can be seen since \[\begin{align} \mathbb{E}[\Lambda_n \mid x_1, \ldots, x_{n-1}] &= \int_{-\infty}^\infty \frac{ \int_{-\infty}^\infty \left(\prod_{i=1}^n f(x_i; \theta)\right) \pi_\tau(\theta) ~\mathrm{d}\theta }{ \prod_{i=1}^n f(x_i; \theta_0) } f(x_n; \theta_0) ~\mathrm{d}x_n \\ &= \frac{ 1 }{ \prod_{i=1}^{n-1} f(x_i; \theta_0) } \int_{-\infty}^\infty \int_{-\infty}^\infty \frac{\prod_{i=1}^n f(x_i; \theta)}{f(x_n; \theta_0)} f(x_n; \theta_0) \pi_\tau(\theta) ~\mathrm{d}\theta~ \mathrm{d}x_n \\ &= \frac{ 1 }{ \prod_{i=1}^{n-1} f(x_i; \theta_0) } \int_{-\infty}^\infty \left(\prod_{i=1}^{n-1} f(x_i; \theta)\right) \left[\int_{-\infty}^\infty \frac{f(x_n;\theta)}{f(x_n;\theta_0)} f(x_n; \theta_0) ~\mathrm{d}x_n\right] \pi_\tau(\theta) ~\mathrm{d}\theta \\ &= \frac{ \int_{-\infty}^\infty \left(\prod_{i=1}^{n-1} f(x_i; \theta)\right) \pi_\tau(\theta)~\mathrm{d}\theta }{ \prod_{i=1}^{n-1} f(x_i; \theta_0) } = \Lambda_{n-1}~, \end{align}\] where the bracketed integral over \(x_n\) equals one and the exchange of the order of integration is justified by Fubini’s theorem.

Since \((\Lambda_n)\) is a nonnegative martingale with \(\mathbb{E}[\Lambda_1] = 1\), we can apply Ville’s inequality (“Ville’s Inequality” 2023) to get

\[ P_{H_0}\left(\sup_{n \ge 1} \Lambda_n \ge \frac{1}{\alpha}\right) \le \alpha~. \]

This will be the basis of our \(p\)-value and confidence interval approach.
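In code, this bound turns the running statistic into an always-valid \(p\)-value: at any time \(n\), reporting one over the running maximum of \(\Lambda_k\) (capped at 1) is valid at every stopping time. A minimal Python sketch (the helper name is mine):

```python
def always_valid_p(lambdas):
    """Running always-valid p-value: p_n = min(1, 1 / max_{k<=n} Lambda_k).

    By Ville's inequality, P_{H0}(p_n <= alpha for any n) <= alpha.
    """
    ps, best = [], 0.0
    for lam in lambdas:
        best = max(best, lam)  # running maximum of the martingale
        ps.append(min(1.0, 1.0 / best))
    return ps
```

Note the running maximum makes the \(p\)-value sequence non-increasing, so once we cross \(\alpha\) we stay below it.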

## Gaussian Example

Consider a specific Gaussian model where \(X_i \mid \mu \sim \mathrm{N}(\mu, \sigma^2)\) with the mixing distribution \(\mu \sim \mathrm{N}(\mu_0, \tau)\) (variance \(\tau\)), as in Pekelis, Walsh, and Johari (2015). Let \(f(x; \mu, \sigma^2)\) denote the normal density with mean \(\mu\) and variance \(\sigma^2\). We can decompose the product \[\begin{equation} \tag{1}\label{eq-a} \prod_{i=1}^n f(x_i; \mu, \sigma^2) = \mathrm{exp}\left\{-\frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{2 \sigma^2}\right\} \left[2 \pi \sigma^2\right]^{-\frac{n}{2}} ~\mathrm{exp}\left\{ -\frac{(\bar{X}_n - \mu)^2}{2\frac{\sigma^2}{n}} \right\}~. \end{equation}\] Note that most factors in \(\eqref{eq-a}\) do not involve \(\mu\), so they cancel in the ratio \(\Lambda_n\).

Then we may state the statistic \(\Lambda_n\) as

\[ \Lambda_n = \frac{ \int_{-\infty}^\infty \left[\prod_{i=1}^n f(x_i; \mu, \sigma^2)\right] \pi(\mu) ~\mathrm{d}\mu }{ \prod_{i=1}^n f(x_i; \mu_0, \sigma^2) } = \frac{ \int_{-\infty}^\infty \mathrm{exp}\left\{-\frac{(\bar{X}_n - \mu)^2}{2 \sigma^2/n}\right\}\pi(\mu) ~\mathrm{d}\mu }{ \mathrm{exp}\left\{-\frac{(\bar{X}_n - \mu_0)^2}{2 \sigma^2/n}\right\} }~. \]

Completing the square, the numerator evaluates to

\[ \int_{-\infty}^\infty \mathrm{exp}\left\{-\frac{(\bar{X}_n - \mu)^2}{2 \sigma^2/n}\right\}\pi(\mu) ~\mathrm{d}\mu = \sqrt{\frac{\sigma^2/n}{\tau + \sigma^2/n}}~\mathrm{exp}\left\{-\frac{(\bar{X}_n - \mu_0)^2}{2(\tau + \sigma^2/n)}\right\} \]

and thus

\[ \Lambda_n = \sqrt{\frac{\sigma^2/n}{\tau + \sigma^2/n}}~\mathrm{exp}\left\{ \frac{ \tau }{ 2\frac{\sigma^2}{n} \left(\tau + \frac{\sigma^2}{n}\right) } (\bar{X}_n - \mu_0)^2 \right\}~. \]
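This closed form can be checked against the defining ratio of integrals numerically. The following Python sketch (function names are mine) evaluates both, using a simple midpoint rule for the integral:

```python
import math

def mixture_lr(xbar, n, sigma2, tau, mu0):
    """Closed-form Lambda_n for the Gaussian mSPRT with a N(mu0, tau) mixture."""
    v = sigma2 / n
    return math.sqrt(v / (tau + v)) * math.exp(
        tau * (xbar - mu0) ** 2 / (2 * v * (tau + v)))

def mixture_lr_numeric(xbar, n, sigma2, tau, mu0, grid=100_000, width=20.0):
    """Evaluate the defining ratio of integrals by a midpoint rule.

    Integrates exp{-(xbar - mu)^2 / (2 v)} against the N(mu0, tau) density
    over mu0 +/- width prior standard deviations.
    """
    v = sigma2 / n
    lo, hi = mu0 - width * math.sqrt(tau), mu0 + width * math.sqrt(tau)
    step = (hi - lo) / grid
    num = 0.0
    for i in range(grid):
        mu = lo + (i + 0.5) * step
        prior = math.exp(-(mu - mu0) ** 2 / (2 * tau)) / math.sqrt(2 * math.pi * tau)
        num += math.exp(-(xbar - mu) ** 2 / (2 * v)) * prior * step
    den = math.exp(-(xbar - mu0) ** 2 / (2 * v))
    return num / den
```

Note that \(\Lambda_n\) depends on the data only through \(\bar{X}_n\), since every other factor of \(\eqref{eq-a}\) cancels between numerator and denominator.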

I ran a simulation to measure the empirical type 1 error rates using Gaussian random variables with \(\sigma^2=1\), \(\mu_0 = 0\), \(\mu_1 = 0.1\),

- \(\tau \in \{0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 1, 5\}\), and
- \(\alpha \in \{0.001, 0.01, 0.025, 0.05, 0.10\}\).

Below are the observed type 1 error rates given 10k simulated runs of 100k samples.

Above we see that the test is always conservative, but the degree of conservatism varies widely. An unfortunate choice of \(\tau\) can leave the realized type 1 error rate far below \(\alpha\), which severely reduces our power.
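The full experiment above (10k runs of 100k samples) is expensive; a scaled-down Monte Carlo sketch of the same check, in Python with illustrative sizes and a single \((\alpha, \tau)\) pair, looks like this:

```python
import math
import random

def msprt_rejects(alpha, tau, mu0=0.0, sigma2=1.0, horizon=1000, rng=random):
    """Simulate one stream X_i ~ N(mu0, sigma2) under the null; return True
    if the mixture likelihood ratio Lambda_n ever reaches the 1/alpha boundary."""
    total = 0.0
    for n in range(1, horizon + 1):
        total += rng.gauss(mu0, math.sqrt(sigma2))
        xbar, v = total / n, sigma2 / n
        lam = math.sqrt(v / (tau + v)) * math.exp(
            tau * (xbar - mu0) ** 2 / (2 * v * (tau + v)))
        if lam >= 1.0 / alpha:
            return True
    return False

random.seed(1)
runs = 300
rate = sum(msprt_rejects(alpha=0.05, tau=0.1) for _ in range(runs)) / runs
# Ville's inequality guarantees the true rejection rate is at most alpha.
```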

Next, let us construct a confidence interval. Solving the inequality \(\Lambda_n \ge 1/\alpha\) and letting \(V_n = \sigma^2/n\), we reject any hypothesized value outside of

\[ \bar{X}_n \pm \sqrt{ \log\left(\frac{\tau + V_n}{\alpha^2 V_n}\right) \frac{V_n^2}{\tau} + \log\left(\frac{\tau + V_n}{\alpha^2 V_n}\right) V_n } \]

which we can re-arrange into the form given in Pekelis, Walsh, and Johari (2015), where (taking \(\mu_0 = 0\)) they reject when

\[ \left|\bar{X}_n\right| > \sqrt{\left(2 \log \frac{1}{\alpha} - \log \frac{V_n}{V_n + \tau}\right)\left(\frac{V_n(V_n + \tau)}{\tau}\right)}~. \]
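As a sanity check, the two algebraic forms of the interval half-width can be compared numerically. A minimal Python sketch (function names are mine):

```python
import math

def radius_v1(alpha, tau, v):
    """Half-width as first derived: sqrt(L * v^2/tau + L * v),
    with L = log((tau + v) / (alpha^2 v))."""
    L = math.log((tau + v) / (alpha ** 2 * v))
    return math.sqrt(L * v * v / tau + L * v)

def radius_v2(alpha, tau, v):
    """The Pekelis, Walsh, and Johari (2015) rearrangement."""
    return math.sqrt((2 * math.log(1 / alpha) - math.log(v / (v + tau)))
                     * (v * (v + tau) / tau))
```

The two agree because \(\log\frac{\tau + V_n}{\alpha^2 V_n} = 2\log\frac{1}{\alpha} - \log\frac{V_n}{V_n + \tau}\) and \(\frac{V_n^2}{\tau} + V_n = \frac{V_n(V_n + \tau)}{\tau}\).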

A simulation to check coverage produced the following:

Here we find our empirical coverage meets or exceeds the nominal level (which is good). But the intervals can be very conservative depending on the choice of \(\tau\). For example, the 90% confidence interval in this simulation actually achieves roughly 95% coverage with \(\tau=0.0001\) or \(\tau=5\).

Note that the good values of \(\tau\) vary with effect size and variance, so our best strategy would be to choose a \(\tau\) that minimizes expected run time given an expected variance and effect size.

## Expected Run Time

Wald’s equation doesn’t apply here because we can’t express \(\log \Lambda_K\) as a sum of independent and identically distributed increments. If it *did* apply, it would give us

\[ \mathbb{E}[\log \Lambda_K] = \mathbb{E}[K] \mathbb{E}[\log \Lambda_1]~. \]

Robbins (1970) gives no general method for expected run time calculation.

We can utilize the expression for the confidence interval when testing \(H_0: \mu = \mu_0\) against an alternative \(\mu = \mu_1\) and solve the following for \(n\):

\[ \mu_1 - \mu_0 - \sqrt{\left(2 \log \frac{1}{\alpha} - \log \frac{V_n}{V_n + \tau}\right)\left(\frac{V_n(V_n + \tau)}{\tau}\right)} = 0~. \]
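Solving this numerically is straightforward. A Python sketch for the Gaussian setup above (the function name is mine): since the half-width is strictly decreasing in \(n\) for the \(\alpha\) values used here, we can bracket by doubling and then bisect for the smallest \(n\) at which the boundary falls below the true effect.

```python
import math

def crossing_n(mu1, mu0, sigma2, tau, alpha, n_max=10**7):
    """Smallest n at which the rejection boundary falls below |mu1 - mu0|."""
    delta = abs(mu1 - mu0)

    def half_width(n):
        v = sigma2 / n
        return math.sqrt(math.log((tau + v) / (alpha ** 2 * v))
                         * v * (v + tau) / tau)

    lo, hi = 1, 1
    while half_width(hi) > delta:  # bracket by doubling
        lo, hi = hi, hi * 2
        if hi > n_max:
            return None  # effect too small to detect within n_max
    while lo < hi:  # bisect for the first n below the boundary
        mid = (lo + hi) // 2
        if half_width(mid) > delta:
            lo = mid + 1
        else:
            hi = mid
    return lo
```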

This produces slightly conservative estimates of expected run-lengths, as can be seen in the following simulation. Note that the multiple points in each color represent levels \(\alpha \in \{0.001, 0.01, 0.025, 0.05, 0.1\}\).

The run-time estimate depends on the level \(\alpha\), but its degree of conservatism does not appear to.

Using the numerical search for the run-length, we can search for the value \(\tau\) that minimizes this expected run-length and produce a curve valid for our simple Gaussian example.

I fitted the following regression model that captures these values with very high fidelity:

```
Call:
lm(formula = opt.tau ~ 0 + mu1 + mu1:alpha + I(mu1^2) + mu1:I(alpha^2) +
    I(mu1^2):I(alpha^2), data = opt.tau)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0043198 -0.0015742 -0.0003756  0.0013820  0.0095305

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
mu1                  0.0221561  0.0010085   21.97   <2e-16 ***
I(mu1^2)             0.9019282  0.0009884  912.47   <2e-16 ***
mu1:alpha           -0.6553401  0.0115957  -56.52   <2e-16 ***
mu1:I(alpha^2)       2.8556012  0.0742924   38.44   <2e-16 ***
I(mu1^2):I(alpha^2) -1.8441438  0.0580722  -31.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.002264 on 283 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1
F-statistic: 3.836e+06 on 5 and 283 DF,  p-value: < 2.2e-16
```

Plotting the true and fitted values further shows the high fidelity of this model.

This looks like a pretty good fit, but the fitted surface admits zero and very small negative values near the left edge, so I use the maximum of the regression prediction and \(10^{-8}\) as the optimal value for \(\tau\).
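A Python transcription of this clamped predictor, using the coefficients from the `lm()` summary above (the function name is mine):

```python
def optimal_tau(mu1, alpha):
    """Predicted run-length-minimizing tau from the fitted regression,
    clamped below at 1e-8 to avoid the zero/negative region near the
    left edge. Coefficients come from the lm() summary above."""
    pred = (0.0221561 * mu1
            + 0.9019282 * mu1 ** 2
            - 0.6553401 * mu1 * alpha
            + 2.8556012 * mu1 * alpha ** 2
            - 1.8441438 * mu1 ** 2 * alpha ** 2)
    return max(pred, 1e-8)
```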

## Problems

One problem not addressed is that our estimates of effect sizes at the end of a procedure may be biased. The following simulation holds \(\mu_0 = 0\) and \(\sigma^2 = 1\) fixed and varies \(\mu_1 \in (0, 1]\) for levels \(\alpha \in \{0.01, 0.05, 0.1\}\), using the optimal \(\tau\) from above.

This is a lot of bias, though it is unlikely to be this bad in practice: we deliberately chose \(\tau\) to shorten the run-time, and sequential experiments inherently trade run length against the quality of the final estimate. Still, it’s concerning.

## References

Pekelis, Walsh, and Johari. 2015. Internet. Retrieved December 2020.

Robbins, Herbert. 1970. “Statistical Methods Related to the Law of the Iterated Logarithm.” *The Annals of Mathematical Statistics* 41 (5): 1397–1409. https://doi.org/10.1214/aoms/1177696786.

“Ville’s Inequality.” 2023. *Wikipedia*. Wikimedia Foundation. https://en.wikipedia.org/wiki/Ville%27s_inequality.