A/B Testing: Oops I got A in my B!
Sometimes failed randomization can put people into both the control and challenger treatments, and counting them in both groups versus throwing them out leads to different conclusions…
I recently saw an experiment with results that looked totally wrong. Digging into it, I found that some users were part of both the control and the challenger groups. Problematically:
- If you just analyzed the data as is it looked like control was significantly better.
- If you threw out the data from customers assigned to both groups then the challenger was significantly better.
Which is right? This situation made plain the need to account for the deviation from simple random sampling, a problem that’s easy to overlook when an experiment suffers from it less severely. Let’s jump into the best way to address this problem.
A Cursed Experiment
The experiment in question was one where customers who looked like they might abandon their cart were given a coupon for a discount on their order. The owners of the experiment had previous results suggesting that it would produce a good lift in sales, so they kept 5% of the customers in a control group (that got no discount offer) and let the other 95% be in the challenger group (that got the discount offer). Since the offer serves to sweeten the deal we don’t have to worry about a negative impact; it’s just a question of how much bang we’re going to get for our buck.
And yet… when looking at the results of the experiment, something very wrong emerges:

| | Customers | Sales | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 2,149 | 915 | 42.6% | 0.01067 | 2.43765 | 0.01478 |
| Variant | 38,694 | 15,442 | 39.9% | 0.00249 | | |
Why would a discount cause fewer customers to buy than if we offered no discount? Obviously, something is rotten in the state of Denmark. Looking into the data, I found that several customers had revisited the site multiple times, with long stretches in between their visits, and each time they were re-randomized into control or the variant. Not thinking it through, I simply decided to drop the users with multiple conflicting treatments and redo my analysis.
| | Customers | Sales | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 1,261 | 409 | 32.4% | 0.01318 | -5.26985 | 0.00000 |
| Variant | 37,806 | 14,936 | 39.5% | 0.00251 | | |

This looks more right, but something new is amiss: the conversion rates of the removed users were much higher than those of the users who remained, which indicated to me that our overall estimates might be biased low.
The lower conversion rates are perhaps unsurprising if we imagine that people revisiting the site and hemming and hawing over their order are more likely to order than those that visit a single time. Indeed, you could see it directly in the data.
Removing people that got placed into multiple arms meant that users with multiple visits were thinned out of our sample, so the sample was no longer representative. Thankfully, the application of multiple treatments is a well-understood process: on each visit the user had a 5% chance of being placed in control and a 95% chance of being shown the discount, so conditional on the number of visits the assignments follow a binomial distribution. We then have to decide what to count as control, what to count as variant, and what to throw out.
- Strategy 1: Any customer that received any discount offer is considered part of the challenger group. The control group is everyone that never received the discount offer. We throw out no data.
- Strategy 2: Only customers that received no offer are left in the control group, and only customers that received the discount offer on every visit are left in the challenger group. We throw out everyone who received a mix of control and challenger experiences. (The inclusion probabilities each strategy implies are sketched just below.)
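To make the two strategies concrete, here is a minimal sketch of the per-customer inclusion probabilities they imply, assuming the 5%/95% per-visit split above and independent re-randomization on every visit (the function names here are mine, for illustration):

```python
# Per-visit assignment: 5% control, 95% discount (challenger), re-drawn on each visit.
P_CONTROL = 0.05

def strategy_1_probs(visits: int) -> tuple[float, float]:
    """Strategy 1: 'control' = every visit got control; 'challenger' = any visit got the discount."""
    p_control_group = P_CONTROL ** visits
    p_challenger_group = 1.0 - p_control_group
    return p_control_group, p_challenger_group

def strategy_2_probs(visits: int) -> tuple[float, float]:
    """Strategy 2: keep only all-control or all-discount customers; the rest are thrown out."""
    p_control_group = P_CONTROL ** visits
    p_challenger_group = (1.0 - P_CONTROL) ** visits
    return p_control_group, p_challenger_group

# e.g. a customer with 4 visits has only a 0.05**4 = 1/160,000 chance of being a "clean" control.
```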
For both strategies we still must account for the lack of simple random sampling. We can get unbiased estimators of the rate and the variance using something called a Horvitz-Thompson estimator.
The Horvitz-Thompson Estimator
Consider a stream of \(N\) users arriving at a web site for an A/B test of control versus a variant. If each user \(i\) has probability \(\pi_i\) of being assigned to the variant, then each user we observe in the variant stands in for roughly \(1/\pi_i\) users in the full population. Let \(1_i\) be a random variable where
\[ 1_i = \left\{ \begin{array}{ll} 1 & \text{if}~i~\text{got the variant, and} \\ 0 & \text{if}~i~\text{got control} \end{array} \right. \]
with \(P(1_i = 1) = \pi_i\) and \(P(1_i = 0) = 1-\pi_i\).
If there’s some value \(y_i\) we are concerned with for each user \(i\) then we might estimate the mean of that variable for the population of \(N\) users with \(\bar{Y}\) where
\[ \bar{Y} = \frac{1}{N} \sum_{i=1}^N \frac{y_i 1_i}{\pi_i}~. \]
It’s easy to see that
\[ \mathbf{E}[\bar{Y}] = \frac{1}{N} \sum_{i=1}^N \frac{y_i \pi_i}{\pi_i} = \frac{\sum_{i=1}^N y_i}{N} = \bar{y}~. \]
Further, we know we can calculate this value since the only \(y_i\) we need to use are the ones where \(1_i = 1\). The variance of our estimator is
\[ \mathrm{Var}\left(\bar{Y}\right) = \mathbb{E}\left[\bar{Y}^2\right] - \bar{y}^2 \]
where
\[ \mathbb{E}\left[\bar{Y}^2\right] = \frac{1}{N^2}\mathbb{E}\left[\left(\sum_{i=1}^N \frac{y_i 1_i}{\pi_i}\right)^2\right] = \frac{1}{N^2}\mathbb{E}\left[\sum_{i=1}^N \frac{y_i^2 1_i}{\pi_i^2} + \sum_{i=1}^N \sum_{j \ne i} \frac{y_i y_j 1_i 1_j}{\pi_i \pi_j}\right] = \frac{1}{N^2}\left(\sum_{i=1}^N \frac{y_i^2}{\pi_i} + \sum_{i=1}^N \sum_{j \ne i} y_i y_j\right) \]
and
\[ \bar{y}^2 = \frac{1}{N^2} \left(\sum_{i=1}^N y_i^2 + \sum_{i=1}^N \sum_{j \ne i} y_i y_j\right)~. \]
Thus the variance is
\[ \mathrm{Var}(\bar{Y}) = \mathbb{E}\left[\bar{Y}^2\right] - \bar{y}^2 = \frac{1}{N^2}\left(\sum_{i=1}^N \frac{y_i^2}{\pi_i} - \sum_{i=1}^N y_i^2\right) = \frac{1}{N^2}~\sum_{i=1}^N \frac{(1 - \pi_i)}{\pi_i} ~y_i^2 ~. \]
Unfortunately, we don’t have \(y_i\) for \(i\) not in the sample, so an estimable version of the variance must be constructed. Note that
\[ \mathbb{E}\left[ \frac{1}{N^2}~\sum_{i=1}^N \frac{(1 - \pi_i)}{\pi_i} ~y_i^2~\frac{1_i}{\pi_i} \right] = \mathrm{Var}(\bar{Y}) \]
so an unbiased estimator for the variance \(\widehat{\mathrm{Var}}(\bar{Y})\) is
\[ \widehat{\mathrm{Var}}(\bar{Y}) = \frac{1}{N^2}~\sum_{i=1}^N \frac{(1 - \pi_i)}{\pi_i^2} ~y_i^2~1_i~. \]
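Translated into code, the estimator and its estimated variance are only a few lines. Here is a minimal sketch of my own (not code from the original analysis) that takes the observed outcomes, their inclusion probabilities, and the population size:

```python
import numpy as np

def horvitz_thompson(y_obs, pi_obs, N):
    """Horvitz-Thompson estimate of the population mean of y and its
    estimated standard error, using only the observed units (those with 1_i = 1).

    y_obs  : outcomes y_i for the observed units
    pi_obs : inclusion probabilities pi_i for those same units
    N      : number of units in the full population
    """
    y = np.asarray(y_obs, dtype=float)
    pi = np.asarray(pi_obs, dtype=float)
    y_bar = np.sum(y / pi) / N                          # unbiased estimate of the mean
    var_hat = np.sum((1.0 - pi) / pi**2 * y**2) / N**2  # unbiased estimate of its variance
    return y_bar, np.sqrt(var_hat)
```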
Strange Behaviors of Improbable Events
One strange behavior of the HT estimator is that it’s not necessarily true that adding more data decreases variance, and in fact we can increase our variance sizeably by adding observations of less-likely events. Recall the true variance is
\[ \mathrm{Var}(\bar{Y}) = \frac{1}{N^2}~\sum_{i=1}^N \frac{(1 - \pi_i)}{\pi_i} ~y_i^2 ~. \]
It is instructive to look at the negative derivative to see how the variance changes as an already-small inclusion probability becomes even smaller:
\[ -\frac{\partial \mathrm{Var}(\bar{Y})}{\partial \pi_i} = \frac{1}{\pi_i^2} \left(\frac{y_i^2}{N^2}\right)~. \]
If control and variant had been assigned at a 50%/50% split, then a user with 4 visits would have a \(1/16\) chance of getting control every time, and an infinitesimal decrease in the inclusion probability \(\pi_i\) would see the variance rise at a rate of only about 0.00000016 (taking \(y_i = 1\) and \(N \approx 40{,}000\), roughly the size of our experiment). In the case of a 5%/95% split things are very different. These users had a \(1/160000\) probability of getting control every time, and an infinitesimal decrease in \(\pi_i\) would see the variance rise at a rate of around 16.
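A quick numeric check of those two rates of change, using the same assumption of \(y_i = 1\) and \(N = 40{,}000\):

```python
# Negative derivative of Var(Y_bar) with respect to pi_i, assuming y_i = 1
# and N = 40,000 (roughly the size of the experiment above).
N, y_i = 40_000, 1.0

def variance_sensitivity(pi_i: float) -> float:
    return (y_i ** 2) / (N ** 2 * pi_i ** 2)

print(variance_sensitivity(0.5 ** 4))    # ~1.6e-07: 50%/50% split, 4 visits
print(variance_sensitivity(0.05 ** 4))   # ~16:      5%/95% split, 4 visits
```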
If we could add another person to the experiment and we knew they ended up being assigned control 4 consecutive times, then our estimate of the variance would become
\[ 0.999975 ~(\text{Old Variance}) + 4.00448 ~ y_{N+1}^2~. \]
Our variance would increase due to adding such a user.
Another pathology related to small inclusion probabilities is that we can have
\[ \sum_{i \in \text{Sample}} \frac{1}{\pi_i} > N \]
so that our estimate of the rate can be larger than 1. This is disturbing in practice, and Thompson (2012) notes that we can use \(\sum_{i \in \text{Sample}} \frac{1}{\pi_i}\) in place of \(N\), since that sum itself estimates the total \(N\). The result is that our estimate of the variance gains a bias. Thompson notes this bias is small, but when combined with the fact that the central limit theorem doesn’t kick in very well in this kind of pathological case, we may find our error probabilities to be off.
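Here is a minimal sketch of that normalized, ratio-style variant (my own helper, reusing the observed outcomes and inclusion probabilities from before); for 0/1 outcomes it keeps the rate estimate inside \([0, 1]\), at the cost of the small bias just mentioned:

```python
import numpy as np

def normalized_ht_mean(y_obs, pi_obs):
    """Replace N with the estimated population size sum(1/pi_i) over the
    sample; for 0/1 outcomes the estimate always lands in [0, 1]."""
    y = np.asarray(y_obs, dtype=float)
    pi = np.asarray(pi_obs, dtype=float)
    return np.sum(y / pi) / np.sum(1.0 / pi)
```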
Applying the Estimator to the Two Strategies
Using strategy 1 (where a customer counts toward the control group only if they received the control assignment on every visit, and counts toward the challenger group if they received the discount offer at any point), the application of the HT estimator results in the following:
| | n | x | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 1,261 | 409 | 32.1% | 0.16548 | -0.4671 | 0.6405 |
| Challenger | 38,694 | 15,442 | 39.9% | 0.00062 | | |
Using strategy 2 (just as above but we throw out the users that got both a control and challenger assignment over multiple visits), we end up with the following analysis:
| | n | x | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 1,261 | 409 | 32.1% | 0.16548 | -0.4451 | 0.6563 |
| Challenger | 37,806 | 14,936 | 39.5% | 0.00085 | | |
What’s striking is that the two analyses are nearly identical, with strategy 1 having a slightly smaller standard error; it would be my preferred approach. But both have massive standard errors, under which even a 10-percentage-point difference would fail to be statistically significant. The bulk of this error comes from people with many visits, so a possible next step is to condition the analysis on people with a small number of visits.
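For reference, the z scores and p-values in these tables appear to be the usual two-sample comparison of the HT estimates; here is a small sketch using the rounded strategy 1 numbers, so it only roughly reproduces the table:

```python
import numpy as np
from scipy.stats import norm

def ht_z_test(rate_ctrl, se_ctrl, rate_chal, se_chal):
    """Two-sided z test treating each HT rate estimate as approximately normal."""
    z = (rate_ctrl - rate_chal) / np.hypot(se_ctrl, se_chal)
    return z, 2.0 * norm.sf(abs(z))

print(ht_z_test(0.321, 0.16548, 0.399, 0.00062))   # roughly (-0.47, 0.64)
```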
Conditioning on Number of Visits
I’ll apply the H-T estimator to the data but condition on the number of visits being below some threshold. This may be reasonable because the number of customers with many visits is very small, so our inference on this sub-population may be close enough to the full population that we can accept it. If we limit our analysis to those having 4 or fewer visits then we’d have:
| | n | x | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 2,081 | 880 | 32.1% | 0.16548 | -0.4657 | 0.6414 |
| Challenger | 38,293 | 15,273 | 39.8% | 0.00063 | | |
Conditioning on 3 or fewer visits decreases the standard error (which is still huge) and gives us a wildly different rate estimate for control.
| | n | x | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 1,980 | 819 | 24.9% | 0.07566 | -1.9777 | 0.0480 |
| Challenger | 37,680 | 15,052 | 39.9% | 0.00064 | | |
Looking only at customers with one or two visits, we see the standard error narrow further:
| | n | x | Rate | SE | Z score | p-Value |
|---|---|---|---|---|---|---|
| Control | 1,752 | 704 | 31.2% | 0.03568 | -2.4464 | 0.0144 |
| Challenger | 35,831 | 14,327 | 39.9% | 0.00067 | | |
Searching for a particular cutoff is problematic for our inferential error rates. But had we known about this problem ahead of time, we could have chosen a cutoff for the conditional analysis based on the distribution of visit counts, and that could have been a good way to address the problem.
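As a sketch of how a pre-chosen cutoff could be wired in, reusing the horvitz_thompson helper and the strategy 1 inclusion probabilities sketched earlier (the customer record fields visits, converted, and all_control are hypothetical):

```python
def conditioned_ht(customers, max_visits, p_control=0.05):
    """HT rate and SE for control and challenger among customers with at most
    `max_visits` visits (the cutoff chosen from the visit distribution, not by
    hunting for a p-value)."""
    subset = [c for c in customers if c["visits"] <= max_visits]
    N = len(subset)
    ctrl = [c for c in subset if c["all_control"]]
    chal = [c for c in subset if not c["all_control"]]
    rate_c, se_c = horvitz_thompson(
        [c["converted"] for c in ctrl],
        [p_control ** c["visits"] for c in ctrl], N)
    rate_v, se_v = horvitz_thompson(
        [c["converted"] for c in chal],
        [1.0 - p_control ** c["visits"] for c in chal], N)
    return (rate_c, se_c), (rate_v, se_v)
```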
Which Was Worse: Sloppy Assignment or Imbalance?
To what extent was this a problem of unclean randomization versus imbalanced design? You can think of the unclean randomization as worsening the imbalance, since it further reduces the size of the clean control group. First, let’s look at the false alarm rate when we use a \(p\)-value of \(0.05\) as a cutoff. Here I simulated 100,000 experiments with 10,000 customers and a base conversion rate of 35%. We see that the H-T estimator is a little liberal; for controlling the false alarm rate, it seems clean randomization is more important than balance in our example:
| | Dirty | Clean |
|---|---|---|
| Balanced | 6.9% | 5.0% |
| Imbalanced | 11.2% | 5.0% |
Shifting the treatment conversion rate to 39% and re-running the simulation, we see that the impact on power is almost non-existent when we have a balanced design. For this example it seems that balance matters more for power than clean randomization does:
| | Dirty | Clean |
|---|---|---|
| Balanced | 98.0% | 98.5% |
| Imbalanced | 20.8% | 44.8% |
Like everything in statistics, we don’t find a clear answer; it just seems to depend on a number of things. What is clear is that there should be strong justification for using heavily imbalanced sample sizes, since imbalance makes our procedures less robust to problems like the particular failure of randomization described above.
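The kind of simulation behind those tables is straightforward to sketch. The exact design isn’t spelled out here, so the following is my own stripped-down version of the dirty, imbalanced cell (the visit-count distribution and the strategy 1 analysis are my assumptions); switching p_ctrl to 0.5 gives the balanced column, and assigning each customer a single arm regardless of visits gives the clean column:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def one_dirty_experiment(n=10_000, p_ctrl=0.05, base=0.35, lift=0.0):
    """One simulated experiment with per-visit ('dirty') re-randomization,
    analyzed with strategy 1 and the HT estimator."""
    visits = rng.geometric(0.7, size=n)            # assumed visit-count distribution
    pi_ctrl = p_ctrl ** visits                     # P(control on every visit)
    in_ctrl = rng.random(n) < pi_ctrl              # who ends up a clean control
    converted = (rng.random(n) < np.where(in_ctrl, base, base + lift)).astype(float)
    # HT rate and SE for each arm, using the same formulas derived above.
    def ht(y, pi):
        return np.sum(y / pi) / n, np.sqrt(np.sum((1 - pi) / pi**2 * y**2)) / n
    r_c, se_c = ht(converted[in_ctrl], pi_ctrl[in_ctrl])
    r_v, se_v = ht(converted[~in_ctrl], 1.0 - pi_ctrl[~in_ctrl])
    z = (r_c - r_v) / np.hypot(se_c, se_v)
    return 2.0 * norm.sf(abs(z))

# False alarm rate (no lift) and power (35% -> 39%) over a modest number of runs:
null_p = [one_dirty_experiment(lift=0.00) for _ in range(2_000)]
alt_p  = [one_dirty_experiment(lift=0.04) for _ in range(2_000)]
print("false alarm rate:", np.mean(np.array(null_p) < 0.05))
print("power:           ", np.mean(np.array(alt_p) < 0.05))
```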
Conclusion
- When a user can receive multiple arm assignments, you can’t just drop those that got conflicting assignments and pretend your sampling is “simple random sampling”.
- Dropping people with conflicting assignments would remove some of your most engaged users and bias some of your estimates downward.
- If the multiple assignments come from session-based (as opposed to identifier-based) randomization, then you know the assignment mechanism is binomial, with size equal to the number of visits.
- You need to use something like the Horvitz-Thompson estimator to account for the likelihood of getting particular assignments.
- A problem with Horvitz-Thompson estimators is that when probabilities of inclusion become very small they can blow up the variance, so:
  - Avoid overly imbalanced sampling of arms, and
  - Condition your analysis on customers below a cutoff of some number of visits (decided based on the distribution of visits, not hunting for a \(p\)-value) when you know you’re going to have this problem.
- The best solution is to fix your randomization to avoid people getting multiple assignments.