Multivariate Tests on the Web

Factorial Designs

Factorial and orthogonal array designs have a place in marketing experimentation, despite their unpopularity amongst some in the A/B testing world. Here, we get into some of the efficiency and philosophy behind these experimental designs.


January 18, 2023

No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire.

Ronald Fisher, Rothamsted, 1929

Fisher called it, almost a century ago. People tend to think they need to separate and test things one at a time, when in reality good planning yields a strategy that lets us test multiple things all at once. I will attempt to give an example of the kind of efficiency gains we find with factorial designs. The efficiency of such designs hinges on some philosophical assumptions that are true in many cases (and false in some important cases as well). Yet when these assumptions are met we can learn a great deal from a carefully selected set of feature combinations far smaller than the set of every possible combination.

Naive Insufficiency

When testing is not well thought out, the approach most people take is to test one thing at a time. This is not only inefficient but sometimes insufficient to find what you want.

When testing is done by changing one thing at a time we are often completely blind to synergistic and antagonistic effects, or we see such effects and wrongly attribute them to whichever individual change we happened to make, rather than correctly attributing them to interactions between settings. Finding synergies, avoiding antagonisms, and correctly understanding the impact of individual changes are critical for most marketing experiments, where the understanding gained is what guides future efforts.

Even if there’s only one feature but multiple possible challengers, we need to be careful to test them together. The following example comes from John Cook’s blog post “A/B testing and a voting paradox”. Assume that we have 3 different treatments (A, B, and C) and can divide our population into three equal sized groups with the following preferences:

  • Like \(A > B > C\),
  • Like \(B > C > A\), and
  • Like \(C > A > B\).

If we do this as a sequence of 2 tests then we get different results depending on the specific choice of treatments:

  • Starting with \(A\) vs \(B\) yields \(A\) is superior for \(2/3\)s of the population, then testing \(A\) vs. \(C\) yields \(C\) as \(C\) is superior for \(2/3\)s of the population.
  • Starting with \(A\) vs \(C\) yields \(C\) is superior for \(2/3\)s of the population, then testing \(B\) vs. \(C\) yields \(B\) as \(B\) is superior for \(2/3\)s of the population.
  • Starting with \(B\) vs \(C\) yields \(B\) is superior for \(2/3\)s of the population, then testing \(A\) vs. \(B\) yields \(A\) as \(A\) is superior for \(2/3\)s of the population.

Cook refers to this as the Condorcet voting paradox. The general problem here is that group preferences may be nontransitive (do not have the property that \(A > B\) and \(B > C\) implies \(A > C\), see Wikipedia on dice with this property). If we had done a single A/B/n test with the three treatments we’d be able to notice that there really is no difference between the treatments.
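A quick sketch makes the cycle concrete (Python assumed; `groups` and `share_preferring` are names invented for this illustration):

```python
# Cook's three equal-sized groups, each listing treatments from most to
# least preferred; the preferences form a cycle.
groups = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def share_preferring(x, y):
    """Fraction of the population preferring treatment x over treatment y."""
    wins = sum(g.index(x) < g.index(y) for g in groups)
    return wins / len(groups)

# Every head-to-head A/B test is won 2-to-1, yet the winners form a cycle:
# A beats B, B beats C, and C beats A, so no treatment is best overall.
```

A single test covering all three treatments would instead reveal that each one is the top preference for exactly a third of the population.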

Next, let us move to an example to demonstrate why factorial designs are so efficient.

Factorial Example

Factorial structures are easy to express mathematically but their efficiency is difficult to visualize. I’ll begin with a small motivating example. Let’s say we have 12 seeds and we wish to know how two factors affect the growth of the seeds in terms of the height of the plant after several weeks. Factor \(A\) will represent either plain soil (denoted \(a\) in the run) or fertilizer (denoted \(A\) in the run). Factor \(B\) will represent either irrigation (denoted \(b\) in the run) or misters (denoted \(B\) in the run). Each of the 12 runs of the experiment will have some level of factor \(A\) (either \(a\) or \(A\)) and factor \(B\) (either \(b\) or \(B\)):

  • Run \(ab\) is regular soil and irrigation.
  • Run \(Ab\) is fertilized soil and irrigation.
  • Run \(aB\) is regular soil and misters.
  • Run \(AB\) is fertilized soil and misters.

By comparing runs we can estimate the effect of each factor on the overall height of the plant. For example, if we measured the height of a plant grown under \(Ab\) and compared that to the height of a plant grown under \(ab\) we could estimate the effect of fertilizer over regular soil on plant height. We could also estimate this effect by comparing \(AB\) to \(aB\). When we are using \(k\) pairs of runs to estimate the effect of a factor, say factor \(A\), I’ll denote this with the string

\[ \underbrace{A \ldots A}_{k~\text{times}}~. \]

This string represents how precise our estimate of the factor effect is, as more repetitions mean a lower variance for the estimator.

How should we conduct this experiment? There are three obvious ways.

  1. We can run a sequence of two A/B experiments. The first experiment chooses one factor level to use, and the second experiment chooses the second given the first choice. This process is called one factor at a time (OFAT) experimentation.
  2. We can run this as an \(A/B/C\) experiment. Here we’ll have a control of \(ab\) and compare this to a run with fertilizer \(Ab\) and a run with mister \(aB\).
  3. We can run this as a factorial where every combination of levels \(ab\), \(Ab\), \(aB\) and \(AB\) are examined.

We begin with the OFAT design.

One Factor at a Time Design

To use A/B tests alone we have to start with one factor, so let’s begin with an experiment to determine if soil has an effect. We conduct an experiment using three seeds for control (regular soil and irrigation) and three seeds for the fertilizer/irrigation combination. We’ll then compare these to determine if there’s an effect due to soil. This first experiment pits three runs of \(ab\) against three runs of \(Ab\).

From this experiment, we could take the average of the three \(Ab\) measurements and subtract the mean of the three \(ab\) measurements as our estimate of the soil effect; this estimate, a difference of two means of three runs each, has variance \(2\sigma^2/3\), and we write its precision as \(AAA\). Depending on the results we might follow up with a second experiment on watering: either \(ab\) versus \(aB\) or \(Ab\) versus \(AB\), again with three seeds per run.

Between the two experiments, we find the precision for estimating the soil effect to be \(AAA\) and the precision for estimating the watering effect to be \(BBB\).

A/B/n Design

In the OFAT approach above we used six runs on the regular soil and irrigation combinations, three in each experiment. However, if we ran everything together as one experiment we could instead use four runs for each of \(ab\), \(Ab\), and \(aB\).

With the above design we can make the following comparisons:

  • The four runs at \(ab\) to the four runs at \(Ab\) give us precision \(AAAA\).
  • The four runs at \(ab\) to the four runs at \(aB\) give us precision \(BBBB\).

This yields more precise estimates of the impact of each factor than the pair of A/B tests described above. Let’s now look at a true factorial design.

Factorial Design

Consider now the design where \(ab\), \(Ab\), \(aB\), and \(AB\) each occur three times, so that every combination of factor \(A\) and factor \(B\) occurs the same number of times.

We can make the following comparisons with the above design.

  • The three runs at \(Ab\) to the three runs at \(AB\) gives us \(BBB\).
  • The three runs at \(ab\) to the three runs at \(aB\) gives us another \(BBB\).
  • The three runs at \(aB\) to the three runs at \(AB\) gives us \(AAA\).
  • The three runs at \(ab\) to the three runs at \(Ab\) gives us another \(AAA\).

Thus our total precision in the factorial design is \(AAAAAA\) and \(BBBBBB\). This is an improvement over both the sequence of A/B tests and the A/B/n test.

When we’re estimating our main effects — that is to say estimating the effect due to changing from soil to fertilizer and the effect due to changing from irrigation to misters — the factorial design’s variances are much smaller than those of the other two designs. These variances are (in terms of the variance of a single observation \(\sigma^2\)):

\[ \mathrm{Var}(\text{main effect estimate}) = \left\{ \begin{array}{ll} \frac{2\sigma^2}{3} & \text{when OFAT,} \\[5pt] \frac{2\sigma^2}{4} & \text{when A/B/n,} \\[5pt] \frac{2\sigma^2}{6} & \text{when factorial.} \end{array} \right. \]
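These variances are easy to check by simulation. Each estimator is a difference of two independent sample means over \(k\) pairs of runs, so its variance is \(2\sigma^2/k\), and the designs scale as \(1/3 : 1/4 : 1/6\). A minimal Monte Carlo sketch, assuming normal noise and numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, sigma = 200_000, 1.0

def diff_of_means(k):
    """Simulate (mean of k obs) - (mean of k obs) under the null, n_sims times."""
    a = rng.normal(0.0, sigma, (n_sims, k)).mean(axis=1)
    b = rng.normal(0.0, sigma, (n_sims, k)).mean(axis=1)
    return a - b

# OFAT compares 3 runs to 3; A/B/n compares 4 to 4; the factorial pools
# both pair-comparisons, effectively comparing 6 runs to 6.
v_ofat = np.var(diff_of_means(3))   # ~ 2*sigma^2/3
v_abn  = np.var(diff_of_means(4))   # ~ 2*sigma^2/4
v_fac  = np.var(diff_of_means(6))   # ~ 2*sigma^2/6
```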

Simply by carefully arranging our runs we have managed to squeeze more information out of our experiment. In addition, the factorial is the only design of those above that can estimate all of the main effects and interaction effects. The picture we get from a factorial is more complete than the one we get from the other designs.

Let’s now move away from the specific example and towards a general formulation.

Factorial Efficiency

The field of optimal experimental design has various criteria for describing the performance of different designs. The \(A\)-optimality criterion is the sum of the effect estimate variances and ignores covariance. It is the trace of the covariance matrix. Here I am using a modified version of the \(A\) criterion where I ignore any intercept variances and instead look only at the variances of the estimated difference between the main effect and the control.

We’ll assume the design is a probability measure so that we can divide the sample size in different ways without worrying if it’s possible to do so with finite numbers of things. We begin by looking at the performance of the \(2^p\) full factorial.

\(2^p\) Full Factorial

Consider a \(2^p\) factorial design. Here we have \(p\) factors each at two possible levels. This design has \(2^p\) distinct combinations of factor levels representing every possible combination of factor levels.

Let’s assume each combination of factor levels occurs \(r\) times. Our total sample size is \(r 2^p\). Let the factors be encoded as \(0\) for off and \(1\) for on. For each factor the zero level represents the control version of the treatment and the one level represents the modified version of the treatment we wish to investigate. We examine the \(A\)-criterion, written \(\Psi_\mathrm{A}(\cdot)\), which looks at the sum of the variances of all the main effects. The experimental design with the lowest variances would then be our most efficient design. For the \(2^p\) full factorial the covariance matrix \(\boldsymbol\Sigma_\mathrm{Fac}\) is

\[ \boldsymbol\Sigma_\mathrm{Fac} = \left[ \sigma^2 \bigl(\mathbf{X}^\mathrm{T} \mathbf{X}\bigr)^{-1} \right]_{2:(p+1),2:(p+1)} = \mathrm{diag}_{i=1,\ldots,p} \left\{\frac{\sigma^2}{r ~2^{p-2}} \right\} \quad\text{so}\quad \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac}) = \frac{p ~ \sigma^2}{r~ 2^{p-2}}~. \]
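As a check, the criterion can be computed directly from the design matrix. A sketch assuming numpy; `a_criterion_full_factorial` is a name invented for this illustration:

```python
import numpy as np
from itertools import product

def a_criterion_full_factorial(p, r, sigma2=1.0):
    """Sum of main-effect variances for an r-replicated 2^p factorial,
    {0,1} coding, main-effects model (intercept variance ignored)."""
    runs = np.array(list(product([0, 1], repeat=p)))
    X = np.column_stack([np.ones(2**p), runs])   # intercept plus p factors
    X = np.repeat(X, r, axis=0)                  # r replicates of each run
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.trace(cov[1:, 1:])                 # drop intercept row/column

# Agrees with p * sigma^2 / (r * 2^(p-2)): e.g. p=3, r=2 gives 3/4.
```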

Note that the coding used by software such as JMP, where the off state is given \(-1\) and the on state is given \(1\), estimates twice the effect. Since it’s generally true that \(\mathrm{Var}[aX] = a^2 \mathrm{Var}[X]\) we’d expect the criterion for the \(\{-1,1\}\)-coded covariance \(\boldsymbol\Sigma'_\mathrm{Fac}\) to have the property

\[ \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac}') = 4 \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac}) \]

and this is in fact the case. The \(\{0,1\}\) coding is chosen as it captures the desire to measure the difference from a control \(0\) to a treatment \(1\) and thus is the most appropriate comparison to OFAT A/B tests and A/B/n tests.

One Factor at a Time: Sequence of A/B Tests

Consider a sequence of A/B tests where the \(i\)th test compares a control group to a group receiving the \(i\)th of \(p\) treatments. Assume that we’re testing using a \(z\)-test and that we know the variance. Note that the \(z\)-test doesn’t explicitly have an encoding, but it is modeling the difference from control and not twice the difference from control. Thus it is most naturally compared to the factorial under a \(\{0,1\}\)-coding. For a one-factor-at-a-time (OFAT) experiment, to do \(p\) comparisons against control with the same sample size as in the factorial design we must divide the sample into groups of size

\[ \frac{r~2^p}{2p} = \frac{r}{p} ~ 2^{p-1}~. \]

Under the standard \(z\)-test the covariance is

\[ \boldsymbol\Sigma_\mathrm{OFAT} = \mathrm{diag}_{i=1,\ldots,p} \left\{\frac{p}{r ~ 2^{p-2}} \sigma^2\right\} \quad\text{so}\quad \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{OFAT}) = \frac{p^2 \sigma^2}{r~2^{p-2}} = p~ \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac}) ~. \]

The relative efficiency of OFAT to factorial is \(p\):

\[ \frac{ \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{OFAT}) }{ \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac}) } = p~. \]

Under balance it must be that the standard errors of the main effects are equal, and so we see that the variances of our estimators of improvement from doing a sequence of \(p\) A/B tests are \(p\) times as large as the variances had we used a factorial design. The OFAT procedure is extremely inefficient compared to a factorial design, so factorial designs are to be preferred to doing a sequence of A/B tests.

An A/B/n Test with a Control Level

Instead of a sequence of A/B tests we could do one long test with the \(p\) levels (one for each main effect where just that effect is active) and a control level (so there are \(p+1\) levels overall). Let \(i=0\) be the control level. The covariance of the initial estimators of mean performance is

\[ \boldsymbol\Sigma_\mathrm{A/B/n} = \mathrm{diag}_{i=0,\ldots,p} \left[ \frac{\sigma^2}{n_i} \right] \]

Because of exchangeability of the non-control factors let \(n_1=\ldots=n_p\). Then the covariance of the estimators of the difference in mean from a factor to the control is

\[ \boldsymbol\Sigma_\mathrm{ABnDiff} = \mathrm{diag}_{i=1,\ldots,p} \left[ \left(\frac{1}{n_0} + \frac{1}{n_1}\right) \sigma^2 \right]~. \]

Note that we can choose from many values of \(n_0\) and \(n_1\) and such a choice results in different \(A\)-criterion values. The \(A\)-optimal design in terms of \(n_0\) and \(n_1\) puts more experimental units in the control group than in the other treatment groups. Under this optimal allocation our \(A\)-efficiency is

\[ \frac{ \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{ABnDiff}) }{ \Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac}) } = \frac{1}{4} \bigl(\sqrt{p} + 1\bigr)^2 \]

which is clearly greater than one for \(p = 2,3,4,\ldots\) since

\[ \frac{1}{4} \bigl(\sqrt{p} + 1\bigr)^2 > \frac{p + 3}{4} > 1~. \]

Again the factorial design is more efficient: the summed variance from the A/B/n design is \(\frac{1}{4}\bigl(\sqrt{p}+1\bigr)^2\) times that of the factorial, a factor that grows on the order of \(p\).
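The optimal-allocation result can also be verified numerically by sweeping over the control group’s share of the sample. A sketch assuming numpy (`abn_vs_factorial` is a hypothetical helper):

```python
import numpy as np

def abn_vs_factorial(p, total=1.0, grid=40001):
    """A-criterion of an A/B/n test (control + p single-treatment arms),
    minimized over the control's share of the sample, relative to the
    full factorial's criterion at the same total sample size."""
    share = np.linspace(1e-4, 1 - 1e-4, grid)   # fraction given to control
    n0 = share * total
    n1 = (1 - share) * total / p                # equal split of the rest
    psi_abn = p * (1.0 / n0 + 1.0 / n1)         # sigma^2 = 1
    psi_fac = 4.0 * p / total                   # p/(r*2^(p-2)) with r*2^p = total
    i = int(np.argmin(psi_abn))
    return psi_abn[i] / psi_fac, share[i]

ratio, best_share = abn_vs_factorial(p=4)
# ratio ~ (sqrt(4)+1)^2/4 = 2.25, at control share sqrt(p)/(p + sqrt(p)) = 1/3
```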

The factorial design is much more efficient than the other designs. Looking at the sums of the variances of our main effects, scaled by \(\Psi_\mathrm{A}(\boldsymbol\Sigma_\mathrm{Fac})\), we find:

\[\begin{align*} \text{Factorial} &:\quad 1~,\\ \text{A/B/n} &:\quad \frac{1}{4} \bigl(\sqrt{p} + 1\bigr)^2~,\\ \text{OFAT (sequence of A/B)} &:\quad p~.\\ \end{align*}\]

Sparsity of Effects

Our analysis thus far has ignored the possible interaction effect. What if the combination of fertilizer and misters had a synergistic effect? What if the combined effect differs from what we’d expected given the sum of the impacts due to each main effect? It would look something like

\[ \mu_{AB} \ne \underbrace{(\mu_{Ab} - \mu_{ab})}_{\text{effect for }A} + \underbrace{(\mu_{aB} - \mu_{ab})}_{\text{effect for }B} + \mu_{ab} \]

where \(\mu_\cdot\) represents the mean at that level of the factors.
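In code, the interaction is just the gap between the combined cell mean and the additive prediction. A small sketch with a hypothetical `interaction` helper and made-up plant heights:

```python
def interaction(mu_ab, mu_Ab, mu_aB, mu_AB):
    """How far the combined cell mean sits from the additive prediction."""
    effect_A = mu_Ab - mu_ab      # fertilizer effect on its own
    effect_B = mu_aB - mu_ab      # mister effect on its own
    return mu_AB - (mu_ab + effect_A + effect_B)

# A return value of 0 means the factors act additively; a positive value
# indicates synergy, a negative value antagonism.
```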

In a 2-level factorial design for \(p\) factors there are \(2^p\) model terms we could estimate. The terms that depend only on a single factor are called main effects. Terms that rely on two factors together are called two-factor interactions. Terms with more dependencies have higher-order names, and in general there are \(\binom{p}{k}\) of the \(k\)-factor interactions.
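Counting terms this way shows how quickly interactions come to dominate a model. For instance, with \(p = 8\) factors:

```python
from math import comb

p = 8
# Number of k-factor interaction terms among p two-level factors.
n_terms = {k: comb(p, k) for k in range(1, p + 1)}
# 8 main effects, 28 two-factor interactions, 56 three-factor, and so on;
# together with the intercept they account for all 2^p = 256 model terms.
```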

Montgomery (2008), Wu and Hamada (2011) and Box, Hunter, and Hunter (2005) all describe the principle of sparsity of effects. The idea is that most systems don’t have lots of higher order interaction terms. In industrial settings this is often the case due to physical principles at play. In marketing, some caution is required.

Imagine a \(2^8\) experiment on an ad where each factor represented some element of the visual composition. This produces 256 different ads. Do we really believe that a human being will have 256 distinct reactions to the different ads? Or are they likely to have a handful of reactions based on some important themes and not really notice the rest? If the latter is true, the sparsity of effects can be said to hold.

One important exception is when a test involves items that are reviewed by customers in a linear order. For example, consider a page with two places for a coupon. At the top, a 20% off coupon is tested, and at the bottom the same coupon is tested. Let’s say either coupon on its own produces the same increase in conversion rate \(\delta\). Do we expect that putting the coupon in both places would yield a \(2\delta\) increase? No. In fact we’d expect the combination to produce the same improvement as either coupon on its own, \(\delta\). The linear order of experiences results in negative interactions.

If there were three such factors that a customer experiences in linear order, then all three of the 2-factor interactions would be negative, with a positive 3-factor interaction to account for the case when all three factors are active at once. This pattern continues as the number of linearly experienced factors increases.
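We can check this sign pattern by fitting a saturated model to the coupon scenario. A sketch assuming numpy, with made-up baseline and lift values:

```python
import numpy as np
from itertools import product

base, delta = 0.10, 0.02   # hypothetical baseline conversion rate and lift
runs = np.array(list(product([0, 1], repeat=3)))   # three coupon placements

# A customer reading top-to-bottom acts on the first coupon seen: any
# nonzero number of coupons lifts conversion by delta, never more.
mu = base + delta * (runs.sum(axis=1) > 0)

# Saturated model in {0,1} coding: intercept, 3 mains, 3 two-factor
# interactions, and 1 three-factor interaction.
x1, x2, x3 = runs[:, 0], runs[:, 1], runs[:, 2]
X = np.column_stack([np.ones(8), x1, x2, x3,
                     x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])
beta = np.linalg.solve(X, mu)
# beta: mains all +delta, two-factor interactions all -delta, and a
# +delta three-factor interaction restoring mu_111 = base + delta.
```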

Analysis Benefits of Sparsity

The traditional approach in industrial settings is to be a little loose with our error control and test for the presence of each order of effects before looking at the models that contain each individual effect. Another principle described in Montgomery (2008), Wu and Hamada (2011), and Box, Hunter, and Hunter (2005) is effect heredity. The idea here is that if we don’t see a lower order effect then the factor probably doesn’t play a part in higher order effects. Again, this may be appropriate for certain marketing experiments.

In these cases we may take a step-wise approach to the order of the model, first estimating main effects, then, if interactions appear to be present, testing the next level of the model. This does not strictly preserve our error rate guarantees, but the practical impact is somewhat negligible. Holm and Hochberg adjustments are available for those wanting finer control over error probabilities, and a closure method could provide efficiency (see Tamhane (2012), pp. 128–132).
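For reference, the Holm step-down adjustment is only a few lines. A hand-rolled sketch (not any particular library’s API) with made-up p-values:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values; controls the familywise error
    rate with no independence assumptions on the tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Three raw main-effect p-values; only the first survives at level 0.05.
adj = holm_adjust([0.01, 0.04, 0.03])
```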

Another option is to simply schedule a confirmation study where we test the variants that seemed effective again. In this setup the probability of repeating a type 1 error causes the overall error rate to be diminished, and we gain some comfort that the results are reproducible later in time. A major assumption in marketing testing is that customers at the time of the test are generalizable to customers in the future, and that’s quite a perilous assumption.

Combinatorial Design

In software it is often easy to produce every possible combination of things, but in many settings the manual effort to create each combination can become overwhelming. When effect sparsity is reasonable we can assume that higher order effects are zero and carefully choose combinations of runs so that we can still estimate the important effects.

Consider a \(2^3\) factorial design. Let’s use \(\{-,+\}\) to indicate the levels of each factor. Below is the full factorial design. Bolded are the 4 runs — the control run plus each run with a single factor at its high level — sufficient to estimate each of the main effects.

Full Factorial

| \(X_1\) | \(X_2\) | \(X_3\) |
|:---:|:---:|:---:|
| **-** | **-** | **-** |
| **+** | **-** | **-** |
| **-** | **+** | **-** |
| + | + | - |
| **-** | **-** | **+** |
| + | - | + |
| - | + | + |
| + | + | + |

There’s lots of material on fractional factorial designs, but these all start from the restriction that the number of levels is the same for each factor. When this is not the case we have orthogonal arrays. These have the estimation capacity of many of the fractional factorial designs and offer massive savings in the number of level combinations you need to test, while supporting factors with different numbers of levels. It’s hard to overstate how difficult a problem finding orthogonal arrays is, but we can frequently use numerical optimization to generate nearly orthogonal arrays with good properties for use when an orthogonal array cannot be found. See Hedayat, Sloane, and Stufken (2012) for more on orthogonal arrays. Montgomery (2008) extensively covers fractional factorial designs.
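As a taste of fractioning, the simplest case is the regular half fraction, which keeps only the runs satisfying a single defining relation. A sketch in Python (`half_fraction` is a hypothetical helper):

```python
from itertools import product

def half_fraction(p):
    """Regular 2^(p-1) fractional factorial with defining relation
    I = X1 X2 ... Xp: keep the runs whose +/-1 levels multiply to +1."""
    kept = []
    for run in product([-1, 1], repeat=p):
        sign = 1
        for level in run:
            sign *= level
        if sign == 1:
            kept.append(run)
    return kept

frac = half_fraction(3)
# 4 of the 8 runs; every main effect is still estimable, but each is
# aliased with a two-factor interaction (X1 with X2*X3, and so on).
```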


Factorial and orthogonal array designs have a place in marketing experimentation, despite their unpopularity amongst some. Here I’ve sought to give a flavor for why they are efficient before jumping into some real calculations of efficiency. In certain settings (and, importantly, not in others) we can leverage such designs to quickly explore and understand a space of many parallel questions, answering them together while not greatly increasing our error rates.


Box, G. E. P., J. S. Hunter, and W. G. Hunter. 2005. Statistics for Experimenters: Design, Innovation, and Discovery. Wiley Series in Probability and Statistics. Wiley-Interscience.
Hedayat, A. S., N. J. A. Sloane, and J. Stufken. 2012. Orthogonal Arrays: Theory and Applications. Springer Series in Statistics. Springer New York.
Montgomery, D. C. 2008. Design and Analysis of Experiments. Seventh. John Wiley & Sons.
Tamhane, A. C. 2012. Statistical Analysis of Designed Experiments: Theory and Applications. Wiley Series in Probability and Statistics. Wiley.
Wu, C. F. J., and M. S. Hamada. 2011. Experiments: Planning, Analysis, and Optimization. Wiley Series in Probability and Statistics. Wiley.