You Probably Don't Want To Pick The Winner

By Edgar Hassler
Category: OCEAn

Consider a web site having a search interface that they're looking to improve. They've developed \(k\) additional search interfaces to test along with the control (the existing interface) with the goal of improving the customer experience. Everyone involved in this project thinks they want to use statistics to help pick the best of the \(k+1\) possible search interfaces, but they don't really understand what that entails, and it's left to the analyst to try to explain something that's weird and complicated and ain't no one got time for that. Yet here we are.

Picking the best is terribly impractical for the following reasons:

  • You have to be able to tell the difference between best and second best, and this can be a very small difference requiring very large sample sizes. If you thought your sample size requirements were huge in your regular old A/B test then prepare yourself for a new unparalleled world of pain.
  • To know something is best means that it's better than the rest. This means you have to look at each variant and see if it's better than the other variants. This means you're comparing everything to everything. Hello, multiple comparison problems.

I advise people to make one of the following compromises:

  • Picking something better: Use Dunnett's test to compare each search interface to the control. Of those that are statistically superior, make a business decision which to use. You're guaranteed (up to your error rates) to have made an improvement in your process.
  • Picking amongst the best: Use Hsu's constrained multiple comparison with the best method to determine a statistically non-dominated set of interfaces. This is a set in which no interface was statistically worse than any other interface in the experiment (including control). If control is in this set then keep control. Otherwise, make a business decision to pick from the members of this set. You're guaranteed (up to your error rates) to have made an improvement in your process.

Though Hsu's method is more aligned with what people most want to do, it also requires a larger sample size than Dunnett's method. But overall they are both far more practical choices of action than setting up your test to identify the unique best when it exists.

Separating Best From Next Best

If we only had a control and a new variant then we'd proceed using the sort of classical treatment you find in most textbooks. Our hypotheses would be

$$ H_0: p_1 = p_0 \quad\text{versus}\quad H_A: p_1 > p_0 $$

and we would begin planning by choosing a type 1 error rate \(\alpha\), then deciding on the minimum detectable effect size \(\delta\) at which we'd like to set our type 2 error rate \(\beta\):

$$ \beta=P\bigl(\text{we fail to reject}~H_0 | p_1 = p_0 + \delta\bigr)~. $$

Then we would plug that into our trusty sample size calculator to get a required \(n\) to achieve error rates \(\alpha\) and \(\beta\). Here, \(\delta\) is simply the improvement over the old search interface that we want to be confident of detecting (when it is present).

Now, imagine there's a second, third, and fourth variant, and your goal is to pick the winner. To know something is a winner means knowing it's better than all the other variants. That means that we need to be able to detect the difference between the best and second best, and that means drastically smaller minimum detectable effect sizes are required.

Let's say that the control \(p_0\) and the variants \(p_1,\ldots,p_4\) are as follows:

\begin{align} p_0&=0.080~,\\ p_1&=0.084~,\\ p_2&=0.081~,\\ p_3&=0.079~,\quad\text{and}\\ p_4&=0.083~. \end{align}

To identify \(p_1\) as best we need to be able to detect that it's better than \(p_4\), so we're looking at \(\delta = 0.001\) as opposed to \(\delta = 0.004\). Sample size requirements grow roughly with \(1/\delta^2\), so they balloon as \(\delta\) shrinks. For concreteness, if we want to use \(\delta=0.004\) at \(\alpha=0.05\) and \(\beta=0.10\) then \(n=80580\) per variant is required. If instead we use \(\delta = 0.001\) then \(n=1267786\) is required, a sample more than 15 times as large.
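For readers who want to poke at these numbers, here is a minimal sketch of a sample size calculator. It assumes a one-sided two-proportion z-test with the usual normal approximation (pooled variance under the null, unpooled under the alternative); the function name `sample_size_per_arm` is mine, and this is not necessarily the calculator used for the figures above, though it lands close to them.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p0, delta, alpha=0.05, beta=0.10):
    """Approximate per-arm n for a one-sided two-proportion z-test.

    Assumption: normal approximation, equal n in both arms.
    """
    p1 = p0 + delta
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value under H0
    z_beta = NormalDist().inv_cdf(1 - beta)    # quantile for the power target
    null_sd = sqrt(2 * p_bar * (1 - p_bar))            # pooled, under H0
    alt_sd = sqrt(p0 * (1 - p0) + p1 * (1 - p1))       # unpooled, under HA
    return ceil(((z_alpha * null_sd + z_beta * alt_sd) / delta) ** 2)


n_large_effect = sample_size_per_arm(0.08, 0.004)
n_small_effect = sample_size_per_arm(0.08, 0.001)
print(n_large_effect, n_small_effect, n_small_effect / n_large_effect)
```

Cutting \(\delta\) to a quarter of its size multiplies the requirement by a little under 16, which is the \(1/\delta^2\) scaling at work.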

Such a dramatic increase in sample size does not even take into account that we're also facing an inflation in our errors due to multiple testing.

Two Types of Multiple Testing Adjustments

Most people who have heard the terms multiple comparison adjustment or multiple testing adjustment have heard them in connection with the Bonferroni correction. They know that each test has some probability of a type 1 error occurring, so if you do a lot of tests the probability that at least one ends in a type 1 error goes up. To combat that they use the simple technique of dividing \(\alpha\) among the tests to ensure that, overall, the probability of one or more type 1 errors is no more than the \(\alpha\) they started with.

Great.

Cool people call this controlling the disjunctive type 1 error rate. We can break down error rates for multiple comparisons as follows:

  • Conjunctive error rate - Sometimes called the "and" error rate, this is the rate at which all tests end in the specific error.
  • Disjunctive error rate - Sometimes called the "or" error rate, this is the rate at which at least one test ends in the specific error.
  • Average error rate - This is the rate at which the tests end in error on average, and usually matches the unadjusted rate.

If you've seen "Family-wise Error Rate" this is the rate at which at least one error occurred and is hence a disjunctive error rate.
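To get a feel for how far apart these three rates can be, here is a small sketch. It assumes the tests are independent and that every null hypothesis is true, which is a simplification: comparisons that share a control are correlated. The function name `error_rates` is mine.

```python
def error_rates(alpha, m):
    """Type 1 error rates for m independent tests, each run at level alpha.

    Assumption: independence and all nulls true (illustration only).
    """
    return {
        "conjunctive": alpha ** m,            # every test is a false alarm
        "disjunctive": 1 - (1 - alpha) ** m,  # at least one false alarm
        "average": alpha,                     # per-test rate, unchanged
    }


unadjusted = error_rates(0.05, 4)       # four tests, no correction
bonferroni = error_rates(0.05 / 4, 4)   # alpha split across the four tests
print(unadjusted)
print(bonferroni)
```

With four unadjusted tests the disjunctive rate is well above 5%, while splitting \(\alpha\) four ways brings it back just under 5%.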

The Bonferroni correction above guarantees the disjunctive type 1 error rate, but the type 2 error rate is still an average error rate. We could do a Bonferroni correction there too, but now our sample size calculation has again ballooned.
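The guarantee on the disjunctive type 1 error rate comes from the union bound. As a short sketch, write \(m\) for the number of tests and \(A_i\) for the event that test \(i\) ends in a type 1 error (both symbols are mine and are not used elsewhere in this post). Running each test at level \(\alpha/m\) gives

$$ P\Bigl(\bigcup_{i=1}^{m} A_i\Bigr) \le \sum_{i=1}^{m} P(A_i) \le m\cdot\frac{\alpha}{m} = \alpha~. $$

The bound needs no independence between the tests, which is convenient, but it is also why the correction tends to be conservative when the tests are correlated.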

Let's say we want to compare all four new search interfaces to the control. Letting \(p_0 = 0.08\) and \(\delta=0.004\), the required sample sizes per variant are as follows:

Bonferroni Adjusted Sample Size Requirements

| Type 1 Error Rate | Type 2 Error Rate | \(\alpha\) | \(\beta\) | \(n\) |
|---|---|---|---|---|
| Average | Average | 0.0500 | 0.100 | 80580 |
| Disjunctive | Average | 0.0125 | 0.100 | 116781 |
| Average | Disjunctive | 0.0500 | 0.025 | 122271 |
| Disjunctive | Disjunctive | 0.0125 | 0.025 | 166088 |

If we want to limit the appearance of any false alarms to 5% of the time then we need to set \(\alpha = 0.0125\). If we want to limit type 2 errors occurring at all to 10% of the time then we need to set \(\beta = 0.025\). We're left requiring 166,088 samples per variant, so 830k samples overall, to maintain those error rates.
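The grid of adjustments can be sketched in a few lines. This assumes the same normal-approximation calculator for a one-sided two-proportion z-test as is standard in textbooks, takes \(\delta = 0.004\) (the value that matches the sample sizes in the table), and uses helper names (`n_per_variant`, `bonferroni_grid`) that are mine rather than anything from the original analysis.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_variant(p0, delta, alpha, beta):
    """Per-arm n for a one-sided two-proportion z-test (normal approximation)."""
    p1 = p0 + delta
    p_bar = (p0 + p1) / 2
    z = NormalDist().inv_cdf
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    return ceil(((z(1 - alpha) * null_sd + z(1 - beta) * alt_sd) / delta) ** 2)


def bonferroni_grid(p0, delta, alpha, beta, m):
    """Sample sizes with alpha and/or beta divided across m comparisons."""
    grid = {}
    for a_label, a in (("Average", alpha), ("Disjunctive", alpha / m)):
        for b_label, b in (("Average", beta), ("Disjunctive", beta / m)):
            grid[(a_label, b_label)] = n_per_variant(p0, delta, a, b)
    return grid


grid = bonferroni_grid(p0=0.08, delta=0.004, alpha=0.05, beta=0.10, m=4)
for (type1, type2), n in grid.items():
    print(f"{type1:<12}{type2:<12}{n}")
```

Changing `m` shows how quickly the requirement climbs as more comparisons are added to the family.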

So Do We Really Want To Maintain Those Error Rates?

We're faced with choosing to maintain the disjunctive, average, or conjunctive error rates. We don't really ever look at the conjunctive error rates. If we maintain the conjunctive type 2 error rate then, when all of the variants are better than control by the minimum detectable effect (MDE), the probability we miss all of these is maintained at \(\beta\). But if only some are better and others are not then the rate at which the experiment misses all of the improvements exceeds \(\beta\).

If we let \(\beta\) stay an average error rate then if only one variant is better than control we maintain \(\beta\), but if more are present then the probability of detecting at least one is better than control goes up. It may be worth it to have a smaller sample size requirement and know that you're likely to find some improvement over control but not necessarily know that you've identified all which improved over control.

Looking At All Pairwise Comparisons

We just looked at finding something better than a control but ignored the real goal of finding the winner. If we need to identify a winner then we have to do all \(\binom{5}{2}=10\) comparisons. That affects our table somewhat:

Bonferroni Adjusted Sample Size Requirements

| Type 1 Error Rate | Type 2 Error Rate | \(\alpha\) | \(\beta\) | \(n\) |
|---|---|---|---|---|
| Average | Average | 0.050 | 0.10 | 80580 |
| Disjunctive | Average | 0.005 | 0.10 | 140006 |
| Average | Disjunctive | 0.050 | 0.01 | 148388 |
| Disjunctive | Disjunctive | 0.005 | 0.01 | 226118 |

The effect quickly gets large, since the number of pairwise comparisons among the \(k+1\) interfaces grows quadratically in \(k\):

$$ \frac{(k+1)k}{2}~. $$

In fact, the Bonferroni correction is conservative, and there are ways we can do better, namely Dunnett's test and Hsu's method.

A Better Strategy: Dunnett's Test and Hsu's Method

Dunnett's test is a simultaneous test of all \(k\) variants against a control. Dunnett's test takes into account that the control is used in every comparison when defining critical values. For example, the plot below shows the relative width of Dunnett confidence intervals to Bonferroni corrected confidence intervals when the sample sizes are large.

[Figure: relative width of Dunnett confidence intervals to Bonferroni-corrected confidence intervals for large sample sizes]

As a sidebar, note that because the control appears in every comparison, you can exploit this by giving it a larger sample size than the other variants.
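One common way to do this is the square-root rule: give the control \(\sqrt{k}\) times as many samples as each treatment arm. The sketch below assumes equal per-observation variance in every arm and takes "optimal" to mean minimizing the summed variance of the \(k\) control-versus-treatment differences for a fixed total; the function names are mine and the total is only an illustrative number.

```python
from math import sqrt


def sqrt_k_allocation(total, k):
    """Split `total` so the control gets sqrt(k) times each treatment arm."""
    per_treatment = total / (k + sqrt(k))
    return sqrt(k) * per_treatment, per_treatment


def summed_variance(n_control, n_treatment, k):
    """Sum over k comparisons of Var(control mean - treatment mean),
    assuming unit variance per observation in every arm."""
    return k * (1 / n_control + 1 / n_treatment)


k, total = 4, 360_000
n_control, n_treatment = sqrt_k_allocation(total, k)
balanced = total / (k + 1)

print(n_control, n_treatment)
print(summed_variance(n_control, n_treatment, k))  # sqrt(k) rule
print(summed_variance(balanced, balanced, k))      # equal split
```

For \(k=4\) the control ends up with twice the volume of each treatment, and the summed variance comes out lower than with an equal split of the same total.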

Hsu's method is related to Dunnett's method but the goal is different. Below I've calculated sample size requirements (using optimal control sample sizes for Dunnett's test) when \(p_0 = 0.08\) and we want to detect \(\delta=0.004\) amongst the \(k=4\) non-control variants while holding \(\alpha=0.05\) and \(\beta=0.10\).

Sample Size Requirements for Multiple Comparison Adjustments

| Method | Type 1 Error | Type 2 Error | Control | Treatment | Total |
|---|---|---|---|---|---|
| Unadjusted, A-optimal | Average | Average | 120757 | 60379 | 362273 |
| Unadjusted, Balanced | Average | Average | NA | 80580 | 402900 |
| Dunnett | Familywise | Average | 172196 | 86098 | 516588 |
| Dunnett w/Power Adj. | Familywise | Familywise | 233378 | 116689 | 700134 |
| Hsu's Method* | Familywise | Familywise | NA | 150739 | 753695 |

Above, the NA entries for control indicate that the control gets the same volume as the treatments. The thing that is most striking is that Dunnett's test with \(A\)-optimal allocation requires 516,588 runs whereas the Bonferroni correction with balanced samples requires 700,030 runs. That's quite a savings. Hsu's method requires 25% more samples than Dunnett's but it gives us that set of non-inferior options that I crave.

I hope the above example demonstrates the need for pragmatism in our approach. By trying to pick something better or among the best we can move more quickly than if we are intent on identifying the best.