Alpha Spending for Group Sequential Trials

Group Sequential Designs
Response Adaptive Methods
A/B Testing

Group sequential designs offer analysts a lot of flexibility: a test can be monitored with multiple analyses, and it can be stopped early when the test is likely to end up futile or when there is already strong evidence that the null hypothesis isn't correct. These are my notes on these types of designs.

Published

January 6, 2023

A group sequential design is a test where we will collect data in \(K\) batches over time and do analysis after we’ve collected each new batch of data. We create a rejection region based on the upper, lower, or both boundaries for our normalized \(Z\)-scores, and we stop the experiment as soon as a \(Z\)-score passes one of these boundaries. Below is an example of a design where, if our \(Z\)-score crosses the top boundary we reject our null hypothesis, but if it crosses the lower boundary we stop the test and fail to reject our null hypothesis.

The above design shows the boundaries in black and an example sample path in green. It is a one-sided version of such a test. When our statistic exceeds the top boundary we reject our null hypothesis. The top boundary is defined by an \(\alpha\)-spending function that describes how much of our overall \(\alpha\) we wish to spend at each of the \(K\) possible analyses. Similarly, a \(\beta\)-spending function, which decides how much type 2 error we're willing to expend at each of the \(K\) possible analyses, is used to construct the lower boundary. If the lower boundary is crossed then we stop the test and fail to reject \(H_0\).

A two-sided version of such a test can be constructed, but this usually takes the form of an upper and lower boundary both causing us to reject \(H_0\), and only when the statistic stays within the boundaries for all \(K\) periods do we fail to reject \(H_0\).

Running Such Tests

The nice thing about such a test is that we can stop early whether we see positive or negative impacts. If we ran a two-sided version we'd run longest when no impact was detected (maximizing our opportunity cost). If we let our null hypothesis be that some improvement is possible, then we at least run longest when some improvement is seen, and stop early when no improvement or a very large improvement is seen, but the analysis of such a setup becomes nuanced. If we can construct designs like the one in the graph, that often seems preferable.

A problem emerges, though, in that the endpoints don't always match. Getting them to match requires that the sample sizes at each check-in be very carefully controlled, something that is often not reasonable in web testing. Variability in sample sizes can leave us having to make changes to our analysis in a way that may impact our error rates. These problems might not exist in an industrial or clinical setting. Nevertheless, let us proceed boldly down this rabbit hole!

Constructing Such Tests

We want to do group sequential analysis on these observations following the derivation in Jennison and Turnbull (2001), so in this post I’m going to follow their notation closely. Suppose we wanted to do \(K\) analyses over the sequential statistics \(\{Z_1, \ldots, Z_K\}\). We define the canonical joint distribution with information levels \(\{{\mathcal{I}}_1,\ldots,{\mathcal{I}}_K\}\) for the parameter \(\theta\) by

  1. \((Z_1, \ldots, Z_K)\) is multivariate normal,
  2. \(\mathbb{E}(Z_k) = \theta \sqrt{{\mathcal{I}}_k},\quad k=1,\ldots,K,\quad\) and,
  3. \(\mathrm{Cov}(Z_{k_1}, Z_{k_2}) = \sqrt{{\mathcal{I}}_{k_1}/{\mathcal{I}}_{k_2}}\) for \(k_1 \le k_2\).

The “information level” is just the reciprocal of the variance of our estimate of the parameter at that analysis.
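To make the canonical form concrete, here is a small simulation sketch (mine, not from Jennison and Turnbull; the information levels and \(\theta\) are just assumed values) that builds \((Z_1,\ldots,Z_K)\) from independent score increments and checks the three properties above by Monte Carlo.

## Sketch: simulate the canonical joint distribution from independent score
## increments. The information levels and theta below are arbitrary assumed values.
set.seed(1)
theta <- 0.1
info  <- c(50, 100, 150, 200, 250)   # information levels I_1, ..., I_K
incr  <- diff(c(0, info))            # information increments

sims <- t(replicate(50000, {
    S <- cumsum(rnorm(length(info), mean=theta * incr, sd=sqrt(incr)))  # score process
    S / sqrt(info)                                                      # Z_k = S_k / sqrt(I_k)
}))

colMeans(sims)            # approximately theta * sqrt(info)
round(cor(sims)[1, ], 3)  # approximately sqrt(info[1] / info)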

Normal Example

Suppose we were looking at independent normally distributed data \(X_i \sim \mathrm{N}(\mu, \sigma_i^2)\) with known variances, and our test is about \(\mu\). Using the inverse-variance-weighted estimate of \(\mu\), in this case

\[ Z_k = \hat{\mu}_k \sqrt{{\mathcal{I}}_k} \quad\text{where}\quad \hat{\mu}_k = \frac{1}{{\mathcal{I}}_k}\sum_{i=1}^k \frac{X_i}{\sigma_i^2} \quad\text{and}\quad {\mathcal{I}}_k = \sum_{i=1}^k \frac{1}{\sigma_i^2}~. \]

We then have \({\mathbb{E}}[Z_k] = \mu \sqrt{{\mathcal{I}}_k}\) so \(\theta = \mu\), and \({\mathcal{I}}_k\) is indeed the reciprocal of \(\mathrm{Var}(\hat{\mu}_k)\).
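As a quick illustration (a sketch of my own; the data and variances are made up), the running statistic can be computed cumulatively:

## Sketch: cumulative Z_k for normal data with known per-observation variances,
## using the inverse-variance-weighted estimate above. Values are made up.
x      <- c(0.3, -0.1, 0.5, 0.2)    # observations
sigma2 <- c(1.0, 1.0, 0.8, 1.2)     # known variances
I.k    <- cumsum(1 / sigma2)        # information after k observations
mu.hat <- cumsum(x / sigma2) / I.k  # weighted running estimate of mu
Z.k    <- mu.hat * sqrt(I.k)        # canonical statistics Z_1, ..., Z_K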

Binomial Example

Let’s say we’re doing experiments with binomially distributed success outcomes and want to look at the difference between a control group and a challenger group. Let \(X_k \sim \mathrm{Binom}(n_k, p)\) be the number of successes out of \(n_k\) trials and \(Y_k \sim \mathrm{Binom}(m_k, q)\) the number of successes after \(m_k\) trials. Then

\[ \hat{p}_k = \frac{\sum_{i=1}^k X_i}{\sum_{i=1}^k n_i}, \quad \hat{q}_k = \frac{\sum_{i=1}^k Y_i}{\sum_{i=1}^k m_i} \quad \text{and} \quad Z_k = \frac{\hat{p}_k - \hat{q}_k}{\sqrt{\frac{p (1 - p)}{\sum_{i=1}^k n_i} + \frac{q (1 - q)}{\sum_{i=1}^k m_i}}}~. \]

We can plug in the estimates for \(p\) and \(q\) as done in the “fixed horizon” two-sample test of proportions without causing much damage. Next, note the following gives the parameter \(\theta = p-q\) and the information \({\mathcal{I}}_k\):

\[ {\mathbb{E}}[Z_k] = (p - q) \sqrt{{\mathcal{I}}_k} \quad\text{where}\quad {\mathcal{I}}_k = \frac{1}{ \frac{p (1 - p)}{\sum_{i=1}^k n_i} + \frac{q (1 - q)}{\sum_{i=1}^k m_i} }~. \]
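In code, with cumulative counts per arm, this looks something like the following sketch (the counts are made up, and the estimates \(\hat{p}_k\) and \(\hat{q}_k\) are plugged in for \(p\) and \(q\) as described above):

## Sketch: the two-sample statistic and information from cumulative counts.
## The success/trial counts per period are made-up values.
x.k <- c(120, 118, 131); n.k <- c(1000, 1000, 1000)  # control: successes, trials
y.k <- c(133, 140, 129); m.k <- c(1000, 1000, 1000)  # challenger: successes, trials

p.hat <- cumsum(x.k) / cumsum(n.k)
q.hat <- cumsum(y.k) / cumsum(m.k)
I.k   <- 1 / (p.hat * (1 - p.hat) / cumsum(n.k) + q.hat * (1 - q.hat) / cumsum(m.k))
Z.k   <- (p.hat - q.hat) * sqrt(I.k)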

Other Models

Jennison and Turnbull (2001) give other models:

  • Normal linear model (regression model) on page 62.
  • Correlated observations and longitudinal studies on page 67.
  • “Other Parametric Models” on page 70.
  • Log-rank tests for survival data on page 77.
  • Logistic regression models on page 256.
  • It’s got a lot of stuff. It’s a classic.

Alpha Spending (and Beta Spending)

Our test will proceed as follows:

  • We will construct intervals \([a_k, b_k]\) for \(k=1,\ldots,K\).
  • At each analysis in order, we will look to see if \(Z_k \in [a_k,b_k]\).
    • If so, we continue to the next analysis.
    • If not, we stop the analysis and draw a conclusion (perhaps based on which side \(Z_k\) fell on).

For any group sequential trial we are concerned with a partition of the event space based on the time at which the statistic crosses its boundaries. Let \(0 < t_1 < \ldots < t_K = 1\) represent the times when we’ll do an analysis, and let \(t_0 = 0\). At analysis \(k=1,\ldots,K\) we will ask whether the statistic \(Z_k\) has met our stopping requirement by leaving the region \([a_k, b_k]\). It is thus natural to partition the event space this way, letting

\[\begin{align*} R_1 &= \bigl\{Z_1 \notin [a_1, b_1]\bigr\}~, \\ R_2 &= \bigl\{Z_1 \in [a_1, b_1], Z_2 \notin [a_2, b_2]\bigr\}~,\\ \vdots&\\ R_K &= \bigl\{Z_1 \in [a_1, b_1], \dots, Z_{K-1} \in [a_{K-1}, b_{K-1}], Z_K \notin [a_K, b_K]\bigr\}~,\quad\text{and}\\ R_\emptyset &= \bigl\{Z_1 \in [a_1, b_1], \dots, Z_K \in [a_K, b_K]\bigr\}~. \end{align*}\]

An \(\alpha\)-spending approach lets the experimenter choose a function \(\alpha(t)\) for \(t \in [0,1]\) where:

  • The function starts at zero, so \(\alpha(t_0) = \alpha(0) = 0\),
  • \(\alpha(t)\) is strictly increasing, so \(\alpha(t_i) < \alpha(t_{i+1})\) for \(i=0,\ldots,K-1\), and
  • The increase \(\alpha(t_k) - \alpha(t_{k-1})\) represents the amount of type 1 error we spend in constructing the interval \([a_k, b_k]\).

For the two-sided approach we often let \(a_k = -b_k\) and calculate symmetric intervals. However, we can also define a \(\beta\)-spending function \(\beta(t)\) that chooses how we spend our type 2 error at each analysis in much the same way, and use that with the \(\alpha\)-spending function to decide on \([a_k, b_k]\) for each analysis.

Let’s define the timing in terms of the information fraction

\[ t_k = {\mathcal{I}}_k/{\mathcal{I}}_K\quad\text{for}~~k=1,\ldots,K. \]

Our incremental type 1 and type 2 errors are thus \[\begin{align*} \pi_{1,1} &= \alpha(t_1) - \alpha(t_0), \\ \pi_{1,k} &= \alpha(t_k) - \alpha(t_{k-1}), \\ \pi_{2,1} &= \beta(t_1) - \beta(t_0),~~\text{and} \\ \pi_{2,k} &= \beta(t_k) - \beta(t_{k-1})~. \\ \end{align*}\]

Written in terms of probability statements (under \(\theta_0\) for \(H_0\) and \(\theta_1\) for \(H_A\)) we will have \[\begin{align*} \pi_{1,k} &= P_{\theta_0}(a_1 < Z_1 < b_1, \ldots, a_{k-1} < Z_{k-1} < b_{k-1}, Z_k \ge b_k)~,\quad\text{and}\\ \pi_{2,k} &= P_{\theta_1}(a_1 < Z_1 < b_1, \ldots, a_{k-1} < Z_{k-1} < b_{k-1}, Z_k < a_k)~. \end{align*}\]
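As a small sketch of the increments (using the Pocock-like spending shape that shows up again later in this post; the five equally spaced analysis times are just an assumption), the \(\pi_{1,k}\) are simple differences of the spending function:

## Sketch: incremental alpha spends from a spending function.
## The spending function shape and equally spaced times are assumptions.
alpha <- 0.05
spend.alpha <- function(t){alpha * log(1 + (exp(1) - 1) * t)}

t.k  <- seq(from=0, to=1, length.out=6)   # t_0, ..., t_5
pi.1 <- diff(spend.alpha(t.k))            # pi_{1,k} for k = 1, ..., 5
all.equal(sum(pi.1), alpha)               # the increments telescope to alpha(1)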

We need to make sure, via planning (covered later), that we get \(a_K=b_K\). If not, then \(P(R_\emptyset) > 0\) and either:

  • we reject \(H_0\) on \(R_\emptyset\), and our type 1 error will increase above \(\alpha(1)\), or
  • we fail to reject \(H_0\) on \(R_\emptyset\), and our type 2 error will increase above \(\beta(1)\).

Thus we will need to ensure \(P(R_\emptyset) = 0\).

Recursive Formulation of Error Spending

We can give the density of a sum of independent random variables \(X + Y\). If \(X\) has support \(\Omega_X\) and density \(f_X(\cdot)\), and \(Y\) has support \(\Omega_Y\) and density \(f_Y(\cdot)\), then

\[ f_{X+Y}(c) = \int_{\Omega_X} f_X(x)~ f_Y(c-x) ~\mathrm{d}x = \int_{\Omega_Y} f_Y(y)~ f_X(c-y) ~\mathrm{d}y~. \]

The probability that the sum falls below some value is then

\[ P(X+Y < c) = \int_{\Omega_X} f_X(x) ~F_Y(c-x)~\mathrm{d}x = \int_{\Omega_Y} f_Y(y) ~F_X(c-y)~\mathrm{d}y~. \]
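A quick numerical check of this identity in R (with distributions I chose arbitrarily, where the answer is also known in closed form):

## Sketch: check P(X + Y < c) for X ~ N(0,1), Y ~ N(1, 2^2), c = 1.5.
c0  <- 1.5
lhs <- integrate(function(x) dnorm(x) * pnorm(c0 - x, mean=1, sd=2), -Inf, Inf)$value
rhs <- pnorm(c0, mean=1, sd=sqrt(1 + 4))   # X + Y ~ N(1, 5) in this case
c(lhs, rhs)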

The other identity we’ll use is change of variables for integration.

The reason we can construct such designs is because of a recursive relationship that exists in the numerical evaluation of integrals that are important to our computations. First, let’s define our exceedance functions to see the problem: \[\begin{align*} \psi_k(a_1, b_1, \ldots, a_k, b_k; \theta) &= P_{\theta}(a_1 < Z_1 < b_1,\ldots,a_{k-1} < Z_{k-1} < b_{k-1}, Z_k > b_k)~,\quad\text{and}\\ \xi_k(a_1, b_1, \ldots, a_k, b_k; \theta) &= P_{\theta}(a_1 < Z_1 < b_1,\ldots,a_{k-1} < Z_{k-1} < b_{k-1}, Z_k < a_k)~. \end{align*}\] These would grow exponentially in complexity if we just tried to evaluate them directly. Instead, as was first noted by Armitage, McPherson, and Rowe (1969), we can rely on the following Markov property.

Note that \(f_1(z)\), the density for \(Z_1\), is

\[ f_1(z) = \phi\left(z - \theta \sqrt{{\mathcal{I}}_1}\right)~, \]

where \(\phi(\cdot)\) is the standard normal density. The conditional density of

\[ Z_k | Z_{k-1} = z_{k-1}, \ldots, Z_{1} = z_{1} \]

is only dependent on \(z_{k-1}\) and (for \(k=2,\ldots,K\)) is equal to

\[ f_k(z_k, z_{k-1}; \theta) = \frac{\sqrt{{\mathcal{I}}_k}}{\sqrt{\Delta_k}} \phi\left( \frac{ z_k\sqrt{{\mathcal{I}}_k} - z_{k-1} \sqrt{{\mathcal{I}}_{k-1}} - \theta \Delta_k }{ \sqrt{\Delta_k} } \right) \quad\text{where}\quad \Delta_k = {\mathcal{I}}_k - {\mathcal{I}}_{k-1}~. \]

Thus we can re-express our original exceedance functions (after some Fubini’s theorem) as: \[\begin{align*} \psi_k(\ldots) &= \int_{a_1}^{b_1} \ldots \int_{a_{k-1}}^{b_{k-1}} f_1(z_1; \theta) f_2(z_2, z_1; \theta) \ldots f_{k-1}(z_{k-1}, z_{k-2}; \theta) e_{k-1}(z_{k-1}, b_k; \theta) ~\mathrm{d}z_{k-1}\ldots\mathrm{d}z_1 \\ &\text{where}\quad e_{k-1}(z_{k-1}, b_k; \theta) = \Phi\left( \frac{ z_{k-1} \sqrt{{\mathcal{I}}_{k-1}} + \theta \Delta_k - b_k \sqrt{{\mathcal{I}}_k} }{ \sqrt{\Delta_k} } \right) \end{align*}\] and \(\Phi(\cdot)\) is the standard normal CDF. A note of motivation here. Because each \(Z_i\) is normalized we can efficiently use a fixed quadrature for each of these integrals, and reuse the evaluation of each of the \(K-1\) densities for different problems. Thus, assuming \(n\) abscissae on each dimension, instead of having \(n^K\) quadrature points overall we only have to worry about things on the order of \(nK\) points (which is much more manageable).

For completeness, our other exceedance function is \[\begin{align*} \xi_k(\ldots) &= \int_{a_1}^{b_1} \ldots \int_{a_{k-1}}^{b_{k-1}} f_1(z_1; \theta) f_2(z_2, z_1; \theta) \ldots f_{k-1}(z_{k-1}, z_{k-2}; \theta) e'_{k-1}(z_{k-1}, a_k; \theta) ~\mathrm{d}z_{k-1}\ldots\mathrm{d}z_1 \\ &\text{where}\quad e'_{k-1}(z_{k-1}, a_k; \theta) = \Phi\left( \frac{ a_k \sqrt{{\mathcal{I}}_k} - z_{k-1} \sqrt{{\mathcal{I}}_{k-1}} - \theta \Delta_k }{ \sqrt{\Delta_k} } \right)~. \end{align*}\]

Let’s get into the nuts and bolts of the recursion. Let \(g_1(z_1;\theta) = f_1(z_1;\theta)\) and

\[ g_k(z_k; \theta) = \int_{a_{k-1}}^{b_{k-1}} g_{k-1}(z_{k-1}; \theta) f_k(z_k, z_{k-1}; \theta)~\mathrm{d}z_{k-1} \quad\text{for}~k=2,\ldots,K~. \]

The function \(g_k(\cdot;\theta)\) is a subdensity: it does not integrate to one but to the probability that the experiment reaches analysis \(k\). It is the density of \(Z_k\) restricted to the event that all of the earlier statistics \(Z_1, \ldots, Z_{k-1}\) stayed within their boundaries.

Now we can re-express our exceedance functions thusly: \[\begin{align*} \psi_k(a_1, b_1, \ldots, a_k, b_k;\theta) &= \int_{b_k}^{\infty} g_k(z_k; \theta)~\mathrm{d}z_k \\ &= \int_{a_{k-1}}^{b_{k-1}} \int_{b_k}^{\infty} g_{k-1}(z_{k-1}; \theta) f_k(z_k, z_{k-1}; \theta)~\mathrm{d}z_k~\mathrm{d}z_{k-1} \\ &= \int_{a_{k-1}}^{b_{k-1}} g_{k-1}(z_{k-1}; \theta) e_{k-1}(z_{k-1}, b_k; \theta)~\mathrm{d}z_{k-1} \end{align*}\] and similarly for \(\xi(\cdot)\).

Our heuristic approach to integration will reuse fixed evaluations of \(g_k(\cdot)\) many times, and we’ll get to that later.
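Before building the full machinery, here is a small sanity-check sketch for \(K=2\) (my own; the boundaries, \(\theta\), and information levels are arbitrary, and the comparison uses the mvtnorm package): the recursive form of \(\psi_2\) should agree with the joint bivariate normal probability implied by the canonical distribution.

## Sketch: psi_2 via the recursion versus the joint normal probability.
## Boundaries, theta, and information levels below are arbitrary choices.
library(mvtnorm)

theta <- 0.2; I1 <- 50; I2 <- 100
a1 <- -1; b1 <- 2; b2 <- 2.2
Delta2 <- I2 - I1

f1 <- function(z1) dnorm(z1 - theta * sqrt(I1))   # density of Z_1
e1 <- function(z1) pnorm((z1 * sqrt(I1) + theta * Delta2 - b2 * sqrt(I2)) / sqrt(Delta2))

psi2.recursive <- integrate(function(z1) f1(z1) * e1(z1), a1, b1)$value

rho <- sqrt(I1 / I2)
psi2.joint <- pmvnorm(
    lower=c(a1, b2), upper=c(b1, Inf),
    mean=theta * sqrt(c(I1, I2)),
    sigma=matrix(c(1, rho, rho, 1), nrow=2)
)
c(psi2.recursive, psi2.joint)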

Planning and Analysis

I want to take a quick detour to talk about some issues of planning, estimation, confidence intervals, and \(p\)-values related to such an analysis.

Sample Size Note

Our sample size will necessarily change our information levels, and this in turn will change our calculated boundaries. When testing for the web we need to be careful that our increments of data are exchangeable. For example, if we see different customer behavior on weekends versus weekdays, then pretending every day is the same can cause our test to behave strangely. However, it is often less onerous to assume one week is about the same as another.

The complication this introduces is that:

  1. Our plan might not match our actual information amounts, so
  2. We might fall short or exceed our maximum information, and
  3. Our calculated final boundary may result in slightly conservative or liberal error rates.

If, in (3), \({\mathcal{I}}_K\) is too small then our boundary will have \(a_K < b_K\), \(P(R_\emptyset) > 0\), and our type 2 error will be larger than our nominal rate. Alternatively, if \({\mathcal{I}}_K\) is too big then \(a_K > b_K\), so it won’t be clear what to conclude in certain situations. Jennison and Turnbull (2001) suggest doing a bisection search to find the \({\mathcal{I}}_K\) for which \(a_K = b_K\).

An interesting relationship exists between \(g_k(\cdot; \theta)\) for different values of \(\theta\). As given in Jennison and Turnbull (2001) pg. 351, we can find the value of \(g_k(z_k; \theta_2)\) if we have \(g_k(z_k; \theta_1)\) by using the relationship

\[ g_k(z_k; \theta_2) = g_k(z_k; \theta_1) l(z_k, {\mathcal{I}}_k; \theta_2, \theta_1) \]

where

\[ l(z_k, {\mathcal{I}}_k; \theta_2, \theta_1) = \mathrm{exp}\left\{ (\theta_2 - \theta_1)z_k \sqrt{{\mathcal{I}}_k} - (\theta_2^2 - \theta_1^2) {\mathcal{I}}_k/2 \right\}~. \]

If \(\theta\) under the null hypothesis is zero then we might use this relation to come up with a next “best guess” by finding a \(\theta\) under the alternative where the endpoints meet, and then choosing the \({\mathcal{I}}_K\) that makes the lower boundary end close to the upper.
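In code this reweighting is a one-liner; here is a small sketch (function and argument names are mine) that converts a grid of \(g_k\) evaluations from one \(\theta\) to another:

## Sketch: reweight a grid of g_k evaluations from theta1 to theta2 using the
## likelihood ratio above. Function and argument names are mine.
reweight.gk <- function(g.z, z, Info, theta2, theta1){
    g.z * exp((theta2 - theta1) * z * sqrt(Info) - (theta2^2 - theta1^2) * Info / 2)
}
# e.g., g.alt <- reweight.gk(g.null, z, Info.k[k], theta1, theta0)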

Expected Stopping Time

Let \(p(k, z; \theta)\) be the subdensity of the terminal statistic \(Z_k\) when the test stops at analysis \(k\). For our design

\[ p(k,z;\theta) = \left\{ \begin{array}{ll} g_k(z; \theta) & \text{if}~z \notin [a_k, b_k]~, \\ 0 & \text{if}~z \in [a_k, b_k]~. \end{array} \right. \]

The probability the stopping time \(T = k\) given parameter \(\theta\) is then

\[ P_\theta(T=k) = \int_{z \notin [a_k, b_k]} p(k, z; \theta) ~\mathrm{d}z \]

and our expected value is

\[ \mathbb{E}[T] = \sum_{k=1}^K k P_\theta(T=k)~. \]

Estimation

Since group sequential methods are response adaptive methods, their estimators tend to be biased. Part of the reason is that extreme values get locked in when they cause the test to stop early, and the balance of positive and negative deviations that would hold at a fixed horizon no longer holds under the null or a point alternative hypothesis.

Jennison and Turnbull (2001) goes into a lot of detail here, so I’m just going to skim the important parts. Let \(T\) be the analysis at which the test terminated, so \(T \in \{1,\ldots,K\}\). We can estimate \(\theta\) with the maximum likelihood estimator \(\hat{\theta}\) where

\[ \hat{\theta} = Z_{T}/\sqrt{{\mathcal{I}}_{T}}~. \]

The sampling density of \(\hat{\theta}\), \(f_{\hat{\theta}}(y)\) is given by

\[ f_{\hat{\theta}}(y) = \sum_{k=1}^K p(k, y\sqrt{{\mathcal{I}}_k}; \theta) \sqrt{{\mathcal{I}}_k} \]

where

\[ p(k, z; \theta) = \left\{ \begin{array}{ll} g_k(z; \theta) & \text{if}~z \notin [a_k, b_k]~,\\ 0 & \text{otherwise.} \end{array} \right. \]

This is not a smooth density, and Jennison and Turnbull (2001) give some plots. If we can evaluate the following integral then we can find the expected bias:

\[ {\mathbb{E}}_\theta[\hat{\theta}] = \sum_{k=1}^K \int_{z \notin [a_k, b_k]} \frac{z}{\sqrt{{\mathcal{I}}_k}}p(k, z;\theta)~\mathrm{d}z~. \]

If one can efficiently evaluate this then one could numerically search for the value of \(\theta\) at which \({\mathbb{E}}_\theta[\hat{\theta}]\) matches the observed value of the estimator. Or, we can simply estimate the bias by plugging our current estimate of \(\theta\) in as the true value, and subtract that estimated bias from our estimator. This latter approach doesn’t require a search and showed lower mean squared error than other approaches (Jennison and Turnbull 2001, pg 179).
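As a sketch of those two options, suppose we have some function, call it expected.estimate(theta), that numerically evaluates \({\mathbb{E}}_\theta[\hat{\theta}]\) (for example, with the machinery developed below); the function name, theta.hat, and the bracketing interval here are all mine.

## Sketch: two ways to correct the MLE, given a hypothetical expected.estimate()
## that evaluates E_theta[theta.hat]. theta.hat is the observed MLE.
bias.fn <- function(theta) expected.estimate(theta) - theta

# (a) plug-in correction: estimate the bias at theta.hat and subtract it
theta.plugin <- theta.hat - bias.fn(theta.hat)

# (b) search for the theta whose expected estimate matches the observed one
theta.matched <- uniroot(
    function(theta) expected.estimate(theta) - theta.hat,
    lower=-1, upper=1   # assumed bracketing interval
)$root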

Confidence Intervals

Our approach thus far has been a one-sided test. You can invert the \(\alpha\)-boundary and center it around the point estimate to make a collection of one-sided intervals. Another option is to use a double wedge design, where the power boundary only kicks in after some time and exists as a wedge within the outer two-sided \(\alpha\)-envelope. Since that \(\alpha\)-boundary is two-sided we could invert it to get two-sided CIs as well. Neither of these options is great.

What if we calculated a second design with two \(\alpha\)-spending sides and just let the coverage be \(2\alpha(1)\) of our original design? Since the \(\beta\)-boundary stops the test and prevents a type 1 error, such an approach would yield conservative intervals. But it may be sufficient as a way of visualizing which values of the estimator are at least not wildly inconsistent with the data.

P-values

If you turn to Jennison and Turnbull (2001) pg 179 you get to participate in a deep dive into the different ambiguities of \(p\)-values given our multiple-stopping-point setup. If you can, it’s better to just find the \(\alpha\) whose boundary the observed statistic’s path just crosses, and report that \(\alpha\) as the \(p\)-value. You can do this to produce a set of \(p\)-values for the test, which can be reported during the analysis and not just at the end.

Numerical Integration

Jennison and Turnbull (2001) suggests using Simpson’s method for quadratically interpolating the evaluations at the abscissae and approximating the partial integrals. Consider a quadratic interpolant where for \(a < b < c\) and \(f_a = f(a)\), \(f_b = f(b)\), and \(f_c = f(c)\)

\[ q(x) = f_a\frac{(x-b)(x-c)}{(a-b)(a-c)} + f_b\frac{(x-a)(x-c)}{(b-a)(b-c)}+f_c\frac{(x-a)(x-b)}{(c-a)(c-b)} \]

then the definite integral is \[\begin{align*} \int_{l}^{u} q(x)~\mathrm{d}x &= \frac{1}{3}\left( \frac{f_a}{(a-b)(a-c)} + \frac{f_b}{(b-a)(b-c)} + \frac{f_c}{(c-a)(c-b)} \right)(u^3-l^3) \\ &\phantom{=}-\frac{1}{2}\left( \frac{(b + c) f_a}{(a-b)(a-c)} + \frac{(a + c) f_b}{(b-a)(b-c)} + \frac{(a + b) f_c}{(c-a)(c-b)} \right)(u^2 - l^2) \\ &\phantom{=}+ \left(\frac{bc f_a}{(a-b)(a-c)} + \frac{ac f_b}{(b-a)(b-c)} + \frac{ab f_c}{(c-a)(c-b)}\right)(u - l) ~. \end{align*}\]

In terms of code, you can copy (and maybe even inline) one of the implementations below.

simpson.region <- function(a, b, c, fa, fb, fc, l, u){
    (fa / (a - b) / (a - c) + fb / (b - a) / (b - c) + fc / (c - a) / (c - b)) * (-l ^ 3 + u ^ 3) / 3 + ((-b * fa - c * fa) / (a - b) / (a - c) + (-fb * a - c * fb) / (b - a) / (b - c) + (-fc * a - b * fc) / (c - a) / (c - b)) * (-l ^ 2 + u ^ 2) / 2 + b * c * fa / (a - b) / (a - c) * (u - l) + a * c * fb / (b - a) / (b - c) * (u - l) + a * b * fc / (c - a) / (c - b) * (u - l)
}
def simpson_region(a, b, c, fa, fb, fc, l, u):
    return (fa / (a - b) / (a - c) + fb / (b - a) / (b - c) + fc / (c - a) / (c - b)) * (-l ** 3 + u ** 3) / 3 + ((-b * fa - c * fa) / (a - b) / (a - c) + (-fb * a - c * fb) / (b - a) / (b - c) + (-fc * a - b * fc) / (c - a) / (c - b)) * (-l ** 2 + u ** 2) / 2 + b * c * fa / (a - b) / (a - c) * (u - l) + a * c * fb / (b - a) / (b - c) * (u - l) + a * b * fc / (c - a) / (c - b) * (u - l)
#include <math.h>  /* needed for pow */
double simpson_region(double a, double b, double c, double fa, double fb, double fc, double l, double u){
    return((fa / (a - b) / (a - c) + fb / (b - a) / (b - c) + fc / (c - a) / (c - b)) * (-pow(l, 0.3e1) + pow(u, 0.3e1)) / 0.3e1 + ((-b * fa - c * fa) / (a - b) / (a - c) + (-fb * a - c * fb) / (b - a) / (b - c) + (-fc * a - b * fc) / (c - a) / (c - b)) * (-l * l + u * u) / 0.2e1 + b * c * fa / (a - b) / (a - c) * (u - l) + a * c * fb / (b - a) / (b - c) * (u - l) + a * b * fc / (c - a) / (c - b) * (u - l));
}

Jennison and Turnbull (2001) then gives the following abscissae suggestion, which they found worked well in practice.

\[ x_i = \left\{ \begin{array}{ll} -3-4\log(r/i), & i=1,\ldots,r-1, \\ -3+3(i-r)/(2r), & i=r,\ldots,5r,\\ 3+4\log(r/(6r-i)), & i=5r+1,\ldots,6r-1~. \end{array} \right. \]

Constructing Designs in R

R has a package that’s good for calculating these designs. The gsDesign package allows us to generate actual Pocock and O’Brien-Fleming designs, as well as \(\alpha\)-spending designs. For \(k=5\) epochs we can generate the classic designs as follows:

gsDesign(k=5, test.type=2, sfu='Pocock')  # 2-sided Pocock Design
gsDesign(k=5, test.type=2, sfu='OF')      # 2-sided O'Brien-Fleming Design

The \(\alpha\)-spending approach requires a little more work with the gsDesign package. We need to define the spending functions in a way the package expects, and it’s a little verbose. For Pocock:

alpha.po <- function(alpha,  t,  param){
    checkScalar(alpha, "numeric", c(0, Inf), c(FALSE, FALSE))
    checkVector(t, "numeric", c(0, Inf), c(TRUE, FALSE))
    x <- list(
        name="Lan-DeMets Approximate Pocock", 
        param=param, 
        parname=character(0), 
        sf=alpha.po, 
        spend=alpha * log(1 + (exp(1) - 1)*t), 
        bound=NULL, 
        prob=NULL
    )  
    class(x) <- "spendfn"
    x
}
gsDesign(k=5, test.type=2, sfu=alpha.po)

For O’Brien-Fleming \(\alpha\)-spending designs:

alpha.ob <- function(alpha,  t,  param){
    checkScalar(alpha, "numeric", c(0, Inf), c(FALSE, FALSE))
    checkVector(t, "numeric", c(0, Inf), c(TRUE, FALSE))
    x <- list(
        name="Lan-DeMets Approximate OF", 
        param=param, 
        parname=character(0), 
        sf=alpha.ob, 
        spend=2 - 2 * pnorm(qnorm(alpha/2, lower.tail=FALSE) / sqrt(t)), 
        bound=NULL, 
        prob=NULL
    )  
    class(x) <- "spendfn"
    x
}
gsDesign(k=5, test.type=2, sfu=alpha.ob)

We can also construct the one-sided \(\alpha\)-spending with a \(\beta\)-spending lower bound as follows:

des <- gsDesign(
    k=K,
    test.type=3,
    sfu=alpha.spending.function, # Spending function, upper
    sfl=beta.spending.function,  # Spending function, lower
    alpha=alpha,
    beta=beta,
    delta=delta  # Effect size to plan for
)
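For concreteness, here is one way those placeholders might be filled in, using gsDesign’s built-in Lan-DeMets Pocock-like spending function for both bounds and the same \(\alpha\), \(\beta\), and effect size as the manual construction below. Treat it as an illustrative sketch rather than the exact call used to produce the numbers in this post.

## Illustrative call: one-sided alpha-spending with a beta-spending lower bound.
library(gsDesign)
des <- gsDesign(
    k=5,
    test.type=3,       # one-sided with a binding beta-spending lower bound
    sfu=sfLDPocock,    # upper (alpha) spending function
    sfl=sfLDPocock,    # lower (beta) spending function
    alpha=0.05,
    beta=0.1,
    delta=0.1          # effect size to plan for
)
des$upper$bound   # b_k
des$lower$bound   # a_k
des$n.I           # information levels at each analysis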

An R Implementation

I build a design manually with R code. First, I define my parameters.

alpha <- 0.05
beta <- 0.1
K <- 5
theta0 <- 0
theta1 <- 0.1
# The next is a magic number we know from gsDesign, otherwise we'd have to 
# recalculate the design with different information until the end points meet.
period.information <- 235.6147  # Amount of information per period

Next, I define my spending functions.

##
## Alpha and beta spending function
##
spend.alpha <- function(t){alpha * log(1 + (exp(1) - 1)*t)}
spend.beta <- function(t){beta * log(1 + (exp(1) - 1)*t)}

Next, I define a function to generate the abscissae, and then use it to generate them.

##
## Generate abscissae as described by Jennison and Turnbull
##
get.abscissae <- function(r){
    c(
        -3-4*log(r/(1:(r-1))),
        -3+3*(r:(5*r) - r)/(2*r),
        3+4*log(r/(6*r - (5*r+1):(6*r-1)))
    )
}

z <- get.abscissae(27)

Here, z is a vector over which we’ll do all our function evaluations. Next, I define the information parameters, and set up boundaries and the errors associated with each boundary.

# Information accrues linearly: I_k = k * period.information
Info.k <- (1:K)*period.information
Delta.k <- c(0, Info.k[-1] - Info.k[-K])

# Initialize the error rates
cumulative.alpha <- spend.alpha(seq(from=0, to=1, length.out=K+1))
alpha.k <- cumulative.alpha[-1] - cumulative.alpha[-length(cumulative.alpha)]
cumulative.beta <- spend.beta(seq(from=0, to=1, length.out=K+1))
beta.k <- cumulative.beta[-1] - cumulative.beta[-length(cumulative.beta)]

# Define the boundaries
a.k <- rep(0, K)
b.k <- rep(0, K)

# The first boundaries can be found with r's qnorm function
a.k[1] <- qnorm(beta.k[1], mean=theta1 * sqrt(Info.k[1]), lower.tail=TRUE)
b.k[1] <- qnorm(alpha.k[1], mean=theta0 * sqrt(Info.k[1]), lower.tail=FALSE)


# vector z holds the n abscissae at which we will evaluate the functions
# matrix f.Z0 is an n x K matrix containing at (i,k) the evaluation of g_k(z_i)
#     under the null hypothesis theta=theta0
# matrix f.Z1 is an n x K matrix containing at (i,k) the evaluation of g_k(z_i)
#     under the specific alternative theta=theta1
f.Z0 <- matrix(rep(0, length(z) * K), ncol=K)
f.Z1 <- matrix(rep(0, length(z) * K), ncol=K)

# We can again use R's functions to evaluate the points for the first period.
f.Z0[,1] <- dnorm(z, mean=theta0 * sqrt(Info.k[1]))
f.Z1[,1] <- dnorm(z, mean=theta1 * sqrt(Info.k[1]))

Next, I’ll define some functions to do the integration and evaluation of the densities involved.

##
## Integration of quadratic interpolation
## Given a function f with f(a)=fa, f(b)=fb, and f(c)=fc, returns the integral
## of the interpolating polynomial from l to u.
##
integrate.interpolant <- Vectorize(function(a, b, c, fa, fb, fc, l, u){
    (fa / (a - b) / (a - c) + fb / (b - a) / (b - c) + fc / (c - a) / (c - b)) * (-l ^ 3 + u ^ 3) / 3 + ((-b * fa - c * fa) / (a - b) / (a - c) + (-fb * a - c * fb) / (b - a) / (b - c) + (-fc * a - b * fc) / (c - a) / (c - b)) * (-l ^ 2 + u ^ 2) / 2 + b * c * fa / (a - b) / (a - c) * (u - l) + a * c * fb / (b - a) / (b - c) * (u - l) + a * b * fc / (c - a) / (c - b) * (u - l)
})


##
## Numerical integration using fixed abscissae
##
numerically.integrate <- function(f.z, lower=-Inf, upper=Inf){
    if(lower >= upper) stop('Illegal bounds.')
    inside <- which(z >= lower & z <= upper)
    
    if(length(inside) > 3){
        # integrate the center
        i.x <- seq(from=min(inside), to=max(inside) - 2, by=2)
        i.l <- i.x[1]
        i.u <- i.x[length(i.x)] + 2
        masses <- c(
            0,
            0,
            integrate.interpolant(
                z[i.x], z[i.x + 1], z[i.x + 2],
                f.z[i.x], f.z[i.x + 1], f.z[i.x + 2],
                z[i.x], z[i.x + 2]
            )
        )
        # fill in the little bit if it exists
        if(is.finite(lower) && lower < z[i.l]){
            masses[1] <- integrate.interpolant(
                z[i.l-1], z[i.l], z[i.l+1], 
                f.z[i.l-1], f.z[i.l], f.z[i.l+1], 
                lower, z[i.l]
            )
        }
        if(is.finite(upper) && upper > z[i.u]){
            masses[2] <- integrate.interpolant(
                z[i.u-1], z[i.u], z[i.u+1], 
                f.z[i.u-1], f.z[i.u], f.z[i.u+1], 
                z[i.u], upper
            )
        }
        # return the estimate
        return(sum(sort(masses)))
    }
    
    i.l <- max(which(z < lower))
    i.u <- min(which(z > upper))
    if(i.l < i.u - 2){  # Indicates it needs two evaluations
        return(
            integrate.interpolant(
                z[i.l], z[i.l+1], z[i.l+2], 
                f.z[i.l], f.z[i.l+1], f.z[i.l+2], 
                lower, z[i.l+2]
            ) + integrate.interpolant(
                z[i.l+2], z[i.l+3], z[i.l+4], 
                f.z[i.l+2], f.z[i.l+3], f.z[i.l+4], 
                z[i.l+2], upper
            )
        )   
    } else { # Indicates it needs one evaluation
        return(
            integrate.interpolant(
                z[i.l], z[i.l+1], z[i.l+2], 
                f.z[i.l], f.z[i.l+1], f.z[i.l+2], 
                lower, upper
            )
        )
    }
}


##
## Conditional density f_k(z.i, z_{k-1}; theta) of Z_k at z.i given Z_{k-1},
## evaluated at every abscissa z_{k-1} in the vector z
##
fk <- function(k, z.i, theta){
    (sqrt(Info.k[k])/sqrt(Delta.k[k])) * dnorm(
        (z.i * sqrt(Info.k[k]) - z * sqrt(Info.k[k-1]) - theta * Delta.k[k])
        /
        sqrt(Delta.k[k])
    )
}

We’ve already filled in the values for the first step, so this loop fills in values for the second all the way to the \(K\)th analyses.

for(k in 2:K){
    f.Z0[,k] <- vapply(
        z,
        function(z.i){
            numerically.integrate(
                fk(k, z.i, theta0) * f.Z0[,k-1],
                a.k[k-1],
                b.k[k-1]
            )
        },
        0.
    )
    
    f.Z1[,k] <- vapply(
        z,
        function(z.i){
            numerically.integrate(
                fk(k, z.i, theta1) * f.Z1[,k-1],
                a.k[k-1],
                b.k[k-1]
            )
        },
        0.
    )
    
    b.k[k] <- uniroot(
        function(b){
            numerically.integrate(f.Z0[,k], b, Inf) - alpha.k[k]
        },
        lower=-5,
        upper=5,
        tol=1e-12
    )$root
    
    a.k[k] <- uniroot(
        function(a){
            numerically.integrate(f.Z1[,k], -Inf, a) - beta.k[k]
        },
        lower=-5,
        upper=5,
        tol=1e-12
    )$root
}

We can plot the constructed boundary and find that the boundary matches that from the gsDesign package.
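Here is a minimal plotting sketch (my own, assuming the a.k, b.k, and Info.k objects computed above) for eyeballing that boundary:

## Sketch: plot the computed stopping boundaries against information.
plot(Info.k, b.k, type="b", pch=19, ylim=range(c(a.k, b.k)),
     xlab="Information", ylab="Z boundary", main="Computed stopping boundaries")
lines(Info.k, a.k, type="b", pch=19)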

Next, I construct the exit probabilities, and the expected stopping time under \(\theta_0\) and under \(\theta_1\).

## Expected Stopping Times
prob0.stop.at.k <- cbind(
    c(
        pnorm(a.k[1], mean=theta0*sqrt(Info.k[1])),
        pnorm(b.k[1], mean=theta0*sqrt(Info.k[1]), lower.tail=FALSE)
    ),
    vapply(
        2:K,
        function(k){
            c(
                numerically.integrate(f.Z0[,k], -Inf, a.k[k]),
                numerically.integrate(f.Z0[,k], b.k[k], Inf)
            )
        },
        c(0., 0.)
    )
)

These exit probabilities under \(\theta = \theta_0\) are as follows:

               Stage 1    Stage 2    Stage 3    Stage 4    Stage 5
Lower Boundary 0.3621825  0.3047309  0.1732508  0.0809166  0.0289178
Upper Boundary 0.0147697  0.0113871  0.0092688  0.0078163  0.0067580

I repeat the calculation under \(\theta = \theta_1\) as follows.

prob1.stop.at.k <- cbind(
    c(
        pnorm(a.k[1], mean=theta1*sqrt(Info.k[1])),
        pnorm(b.k[1], mean=theta1*sqrt(Info.k[1]), lower.tail=FALSE)
    ),
    vapply(
        2:K,
        function(k){
            c(
                numerically.integrate(f.Z1[,k], -Inf, a.k[k]),
                numerically.integrate(f.Z1[,k], b.k[k], Inf)
            )
        },
        c(0., 0.)
    )
)

These exit probabilities under \(\theta = \theta_1\) are as follows:

               Stage 1    Stage 2    Stage 3    Stage 4    Stage 5
Lower Boundary 0.0295395  0.0227743  0.0185376  0.0156327  0.0135160
Upper Boundary 0.2606844  0.2819827  0.1986904  0.1117025  0.0469428

Using these we can calculate the expected stopping times as follows.

ex.stop.theta0 <- sum((1:K) * (prob0.stop.at.k[1,] + prob0.stop.at.k[2,]))
ex.stop.theta1 <- sum((1:K) * (prob1.stop.at.k[1,] + prob1.stop.at.k[2,]))

These times are \(E_{\theta_0}[T] = 2.0900584\) and \(E_{\theta_1}[T] = 2.3630567\).

Finally, we calculate the bias we expect given our parameter assumptions as follows:

##
## Estimation and bias
##

bias0 <- sum(vapply(
    1:K,
    function(k){
        numerically.integrate((z/sqrt(Info.k[k])) * f.Z0[,k] * (z > b.k[k] | z < a.k[k]), -Inf, a.k[k]) +
            numerically.integrate((z/sqrt(Info.k[k])) * f.Z0[,k] * (z > b.k[k] | z < a.k[k]), b.k[k], Inf)
    },
    0.
)) - theta0

bias1 <- sum(vapply(
    1:K,
    function(k){
        numerically.integrate((z/sqrt(Info.k[k])) * f.Z1[,k] * (z > b.k[k] | z < a.k[k]), -Inf, a.k[k]) +
            numerically.integrate((z/sqrt(Info.k[k])) * f.Z1[,k] * (z > b.k[k] | z < a.k[k]), b.k[k], Inf)
    },
    0.
)) - theta1

We find \(\mathbb{E}_{\theta_0}[\hat{\theta} - \theta] = -0.0177018\) and \(\mathbb{E}_{\theta_1}[\hat{\theta} - \theta] = 0.0143384\).

Conclusion

Okay, so we did a lot here. We constructed the one-sided design for \(\alpha\)- and \(\beta\)-spending functions. We cheated a little in our manual construction in that we found the right period.information to make the endpoints meet, whereas an operational version would need to search for such a value, or increase \(K\). We didn’t construct confidence intervals or \(p\)-values:

  • We can invert our boundaries to get \(p\)-values as the largest \(\alpha\) where the statistic’s path crosses the upper boundary.
  • A one-sided confidence interval is awkward for presenting to stakeholders, so we probably want a two-sided interval. This is a different calculation but we can do it by inverting our two-sided version of the test (that ignores the \(\beta\)-spending but provides coverage). It’s possible the one-sided and two-sided tests can disagree, but I’m not sure what else to do about this.

So, yeah. These are my notes on this. Good luck!

References

Armitage, P., C. K. McPherson, and B. C. Rowe. 1969. “Repeated Significance Tests on Accumulating Data.” Journal of the Royal Statistical Society: Series A (General) 132 (2): 235–44. https://doi.org/10.2307/2343787.
Jennison, Christopher, and Bruce W. Turnbull. 2001. “Group Sequential Methods with Applications to Clinical Trials.” Statistics in Medicine 20 (14): 390. https://doi.org/10.1002/sim.900.