Your A/B Test May Be Three Different Tests

A/B Testing
Planning

People tend to think of an A/B test as a simple comparison of a challenger against a control, but the data we collect is usually put to three distinct uses that we often fail to plan for. Here, I discuss those uses and ways to address them.

Published

December 23, 2022

Planning an A/B test of conversions often focuses just on the hypothesis test of a challenger treatment outperforming a control treatment. However, the data collected is often used to do three different things:

  1. Perform the hypothesis test
  2. Give an estimate of the impact of the change
  3. Estimate the impact of all successfully tested changes this quarter or for a particular initiative

The planning we do for (1), and the data collected under that plan, is not automatically appropriate for all three concerns. For this discussion let’s assume our test will be a two-sample test of proportions where the control treatment converts at a rate of \(p_0\) and the challenger treatment converts at a rate of \(p_1\). Let \(\lambda\) represent how we’re going to split our \(n\) incoming samples, so the control will get \(\lambda n = n_0\) samples and the challenger will get \((1 - \lambda) n = n_1\) samples.

Should We Plan for a One-Sided or Two-Sided Test?

The test in (1) is often just to see if there’s evidence that the challenger is better than the control.¹ When there is little value in knowing that the challenger is significantly worse than the control, efficiency favors the one-sided test.

Yet estimating the impact should use a two-sided interval: for example, a \(100(1-\alpha)\%\) confidence interval for the impact, \([xxx\%, yyy\%]\). We can present this to stakeholders as: any increase between \(xxx\%\) and \(yyy\%\) is not inconsistent with the data we saw. If the \(\alpha\) used in (2) is the same as in (1), we can find that the challenger is significantly better while the two-sided confidence interval for the difference crosses zero and extends into negative territory.

We can address this problem in one of two ways:

  • Construct \(100(1 - 2 \alpha)\%\) confidence intervals
  • Do the two-sided test in (1) instead of the one-sided test

If we chose \(\alpha\) carefully based on our appetite for risk, it may be that the confidence interval coverage is more flexible, and adjusting the coverage of the confidence intervals is fine. But if we want the coverage to match the type 1 error rate from (1), then we should do the two-sided test for (1). Either way, it should be something about which you make a reasoned decision.
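To make the coverage bookkeeping concrete, here is a minimal sketch in Python of the duality involved. The counts and \(\alpha\) are made up, and I use the unpooled standard error for both the test and the interval so the correspondence is exact: the one-sided test at level \(\alpha\) rejects precisely when the lower endpoint of the \(100(1-2\alpha)\%\) interval is above zero.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
x0, n0 = 480, 5000   # control conversions / samples (hypothetical numbers)
x1, n1 = 545, 5000   # challenger conversions / samples (hypothetical numbers)

p0_hat, p1_hat = x0 / n0, x1 / n1
diff = p1_hat - p0_hat
# Unpooled standard error, used for both the test and the interval below.
se = np.sqrt(p0_hat * (1 - p0_hat) / n0 + p1_hat * (1 - p1_hat) / n1)

# One-sided test of H0: p1 <= p0 at level alpha.
z = diff / se
reject = z > norm.ppf(1 - alpha)

# 100(1 - 2*alpha)% two-sided confidence interval for the lift.
half_width = norm.ppf(1 - alpha) * se
lo, hi = diff - half_width, diff + half_width

print(f"z = {z:.3f}, reject one-sided H0 at alpha = {alpha}: {reject}")
print(f"{100 * (1 - 2 * alpha):.0f}% CI for the lift: ({lo:.4f}, {hi:.4f})")
```

The pooled-variance statistic used in the worked example later in the post has a slightly different standard error under the null, so the match there is approximate rather than exact.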

Will the Confidence Interval be Too Wide?

When planning a test we plan for some difference of consequence: this ensures we have a certain power to detect that effect, which in turn makes the confidence interval short enough not to cover the null value when that specific difference of consequence is real. We don’t usually have to worry that the confidence interval is too wide; no matter how we address the previous question, we should get decent confidence intervals.

Note that you could plan the test purely in terms of the width of the two-sided confidence interval. Let \(n_0 = \lambda n\) and \(n_1 = (1-\lambda) n\), so that \(n\) is the combined sample size. When \(w\) is the desired width of the confidence interval, choose

\[ n = 4 \left(\frac{z_{\alpha/2}}{w}\right)^2~\left(\frac{p_0 (1 - p_0)}{\lambda} + \frac{p_1 (1 - p_1)}{1-\lambda}\right)~. \]
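Here’s a minimal sketch of that width-based plan in Python; the baseline rate, hoped-for rate, split, and target width are all made-up planning values.

```python
from scipy.stats import norm

def n_for_ci_width(p0, p1, lam, w, alpha=0.05):
    """Total sample size n so the two-sided 100(1 - alpha)% confidence
    interval for the difference in conversion rates has approximate width w,
    using the formula above. lam is the fraction of traffic given to control."""
    z = norm.ppf(1 - alpha / 2)
    return 4 * (z / w) ** 2 * (p0 * (1 - p0) / lam + p1 * (1 - p1) / (1 - lam))

# Hypothetical plan: 10% baseline, hoping for 11%, a 50/50 split, and a
# confidence interval no wider than two percentage points.
print(n_for_ci_width(p0=0.10, p1=0.11, lam=0.5, w=0.02))
```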

Many times I’ve seen tests come through with huge confidence intervals. This is a clue that the test was run inappropriately. Recall that a fixed-sample test should run until all samples are collected, and only then should the analysis be conducted. One common way to royally fuck this up is to monitor the “\(p\)-value” every day and stop when the test “reaches significance”. The quotes here are because the “\(p\)-value” is no longer really a \(p\)-value, and “reaches significance” hides that a fixed sample size test is being treated as though it were a sequential procedure.

Early in the experiment the variability of the observed difference is high because the sample size is small. These early swings in the observed difference correspond to swings in the “\(p\)-value”, so when the experimenter stops early because the “\(p\)-value” reached significance we are left with a small \(n\) and a large empirical difference. The result is a wide confidence interval and an estimate of impact that is biased high.
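A small simulation makes the point. Everything below is made up for illustration: no true difference between the arms, fifty days of traffic, and an experimenter who stops the first day the naive “\(p\)-value” dips below 0.05.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

p_true, days, per_day, n_sims = 0.10, 50, 200, 2000
stopped_lifts, stopped_n = [], []

for _ in range(n_sims):
    # Cumulative conversions for control and challenger, peeked at daily.
    c = rng.binomial(per_day, p_true, size=days).cumsum()
    t = rng.binomial(per_day, p_true, size=days).cumsum()
    n = per_day * np.arange(1, days + 1)
    p0_hat, p1_hat = c / n, t / n
    pooled = (c + t) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    pvals = 2 * norm.sf(np.abs((p1_hat - p0_hat) / se))  # naive daily "p-value"
    hits = np.flatnonzero(pvals < 0.05)
    if hits.size:  # stop at the first "significant" peek
        d = hits[0]
        stopped_lifts.append(p1_hat[d] - p0_hat[d])
        stopped_n.append(n[d])

print(f"null experiments stopped as 'significant': {len(stopped_n) / n_sims:.2f}")
print(f"median per-arm n at stopping: {np.median(stopped_n):.0f}")
print(f"mean |observed lift| at stopping: {np.mean(np.abs(stopped_lifts)):.4f}")
```

Even with no true lift, a sizable fraction of runs stop “significant”, and the runs that stop do so with observed lifts far from the true value of zero and with smaller \(n\) than a fixed-sample plan would have required.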

Whenever I see a too-wide confidence interval I immediately ask what effect size they planned for and their power to have detected that effect size. At that point the statistical sins are usually confessed.

Can We Add the Impact to our Overall Estimate of Impact?

The point estimate of the difference in conversion rates is an unbiased estimator of the true improvement, so one might be tempted to simply add it into an aggregated estimate of impact. But we only add the estimates that were large enough to pass our threshold for type 1 errors. That sum is no longer an unbiased estimate, as it contains:

  • Some small number of false alarms, and
  • Some number of effects that by chance appeared larger than they were.

The smaller-by-chance observations tend to get filtered out by our threshold, so the balance of negative and positive errors that makes each individual estimate unbiased is no longer present in the aggregate sum.
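Here is a small simulation of that selection effect; the portfolio size, traffic, baseline rate, and true lift are all assumed for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)

# Hypothetical portfolio: 40 tests, each with a small true lift, 5,000
# visitors per arm, and a 10% baseline conversion rate.
n_tests, n_per_arm, p0, true_lift, alpha = 40, 5000, 0.10, 0.005, 0.05

total_true, total_reported = [], []
for _ in range(500):
    x0 = rng.binomial(n_per_arm, p0, size=n_tests)
    x1 = rng.binomial(n_per_arm, p0 + true_lift, size=n_tests)
    p0_hat, p1_hat = x0 / n_per_arm, x1 / n_per_arm
    pooled = (x0 + x1) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    winners = (p1_hat - p0_hat) / se > norm.ppf(1 - alpha)   # one-sided "ship it"
    total_true.append(true_lift * winners.sum())             # true lift of shipped changes
    total_reported.append((p1_hat - p0_hat)[winners].sum())  # sum of observed winning lifts

print(f"mean true aggregate lift of shipped changes: {np.mean(total_true):.4f}")
print(f"mean reported aggregate lift (sum of winners): {np.mean(total_reported):.4f}")
```

The reported aggregate comes out noticeably larger than the true aggregate lift of the shipped changes, purely because only the large-looking results survive the filter.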

If we want an unbiased estimate of the aggregate impact then we need an independent set of observations from which to generate it. This can be done by extending the length of the experiment and using the new data for the estimate, or by keeping a permanent hold-out group and observing its difference from the treated population over time.

If we don’t necessarily care that the estimate is unbiased we might try penalizing our estimate of the total sum of impacts to account for our expectation that the raw sum overestimates impact. I’ll cover this in a separate post.

Two Sample Test of Proportions Example

Let’s work out the planning for the two-sample test of proportions, as this is one setting I see often. See Montgomery and Runger (2020) for other tests. Let the type 1 error rate be \(\alpha\) and the type 2 error rate be \(\beta\). Recall that the split between challenger and control is \(\lambda\), so that the number of people receiving the control treatment is \(n_0 = \lambda n\) and the number receiving the challenger treatment is \(n_1 = (1-\lambda) n\). Let

\[ p = \frac{p_0 n_0 + p_1 n_1}{n_0 + n_1} = \lambda p_0 + (1-\lambda)p_1~. \]

Under \(H_0\), \(p_0 = p_1 = p\), so the statistic \(Z\)

\[ Z = \frac{ p_1 - p_0 }{ \sqrt{p (1 - p)\left(\frac{1}{n_0} + \frac{1}{n_1}\right)} } \]

is asymptotically normally distributed, and we can replace \(p\), \(p_0\), and \(p_1\) with their estimates \(\hat{p} = \frac{X_0 + X_1}{n_0 + n_1}\), \(\hat{p}_0 = \frac{X_0}{n_0}\), and \(\hat{p}_1 = \frac{X_1}{n_1}\).
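In Python, that pooled statistic and its \(p\)-values look something like this; the counts are arbitrary and the function name is mine.

```python
import numpy as np
from scipy.stats import norm

def two_sample_prop_z(x0, n0, x1, n1):
    """Pooled two-sample test of proportions: returns the Z statistic above
    along with its one-sided and two-sided p-values (normal approximation)."""
    p0_hat, p1_hat = x0 / n0, x1 / n1
    p_hat = (x0 + x1) / (n0 + n1)                      # pooled estimate of p
    se = np.sqrt(p_hat * (1 - p_hat) * (1 / n0 + 1 / n1))
    z = (p1_hat - p0_hat) / se
    return z, norm.sf(z), 2 * norm.sf(abs(z))

# Hypothetical counts: 10.0% vs 11.2% conversion on a 50/50 split.
z, p_one, p_two = two_sample_prop_z(x0=500, n0=5000, x1=560, n1=5000)
print(f"Z = {z:.3f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```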

When \(\lambda = 0.5\) and \(n_0 = n_1 = n'\), then for the two-sided test the required per-arm sample size is \(n'\), where

\[ \frac{ \left[ z_\beta \sqrt{p_0(1-p_0) + p_1(1-p_1)} + z_{\alpha/2} ~ \sqrt{2p(1-p)} \right]^2 }{ (p_1 - p_0)^2 } = n'~. \]
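A direct transcription of that formula, with made-up planning values:

```python
from scipy.stats import norm

def n_per_arm_balanced(p0, p1, alpha=0.05, beta=0.20):
    """Per-arm sample size n' for the balanced (lambda = 0.5) two-sided
    two-sample test of proportions, using the formula above."""
    p_bar = (p0 + p1) / 2                      # pooled p when lambda = 0.5
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
    num = (z_b * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5
           + z_a * (2 * p_bar * (1 - p_bar)) ** 0.5) ** 2
    return num / (p1 - p0) ** 2

# Hypothetical plan: detect a lift from 10% to 11% with 80% power.
print(n_per_arm_balanced(p0=0.10, p1=0.11, alpha=0.05, beta=0.20))
```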

We can solve for the imbalanced case too, but doing so involves the roots of fourth-order polynomials. The closed-form solution exists, but it’s ugly. I’ve included a calculator below.

Imbalanced Sample Size Calculator

(Interactive calculator: reports \(n'\), \(n_0\), \(n_1\), and the resulting CI width.)

If you want to plan for an imbalanced split and a one-sided test, simply enter an \(\alpha\) twice the size of the one you really want; that should work.
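Since the interactive calculator doesn’t travel well, here is a rough numeric stand-in. Rather than the closed form, it solves the usual normal-approximation power equation for the total \(n\) by root-finding; the function name and the example numbers are mine, and the one-sided trick above applies here too: pass an \(\alpha\) twice the size you want.

```python
from scipy.optimize import brentq
from scipy.stats import norm

def n_total_imbalanced(p0, p1, lam, alpha=0.05, beta=0.20):
    """Numerically find the total sample size n giving power 1 - beta for the
    two-sided pooled test with a lam / (1 - lam) control/challenger split.
    Uses the standard normal-approximation power equation and root-finding
    rather than the closed form mentioned above."""
    p_bar = lam * p0 + (1 - lam) * p1
    z_a = norm.ppf(1 - alpha / 2)

    def power_gap(n):
        n0, n1 = lam * n, (1 - lam) * n
        se0 = (p_bar * (1 - p_bar) * (1 / n0 + 1 / n1)) ** 0.5   # SE under H0 (pooled)
        se1 = (p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1) ** 0.5   # SE under H1
        return norm.sf((z_a * se0 - abs(p1 - p0)) / se1) - (1 - beta)

    return brentq(power_gap, 10, 1e9)

# Hypothetical plan: 10% -> 11% with 80% power and a 70/30 split toward control.
n = n_total_imbalanced(p0=0.10, p1=0.11, lam=0.7)
print(f"n = {n:.0f}, n0 = {0.7 * n:.0f}, n1 = {0.3 * n:.0f}")
```

At \(\lambda = 0.5\) this agrees with twice the per-arm size from the balanced formula above, which is a useful sanity check.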

Conclusions

It’s important to think about all the ways you’re going to use the data you collect, and to plan accordingly. It’s also important for everyone involved to understand that the sum of observed positive impacts isn’t an unbiased estimator of the sum of true impacts. Finally, people should demand both confidence intervals and \(p\)-values; each offers a different insight into the data.

References

Montgomery, D. C., and G. C. Runger. 2020. Applied Statistics and Probability for Engineers. Wiley. https://books.google.com/books?id=c8SuzQEACAAJ.

Footnotes

  1. It may be that learning there is a negative impact is as important as learning there is a positive impact. I think having a lot of tests that are only one-sided suggests you may be throwing “stuff” against the wall and seeing what sticks instead of running a strategic testing program.↩︎