Quantifying the CUPED Improvement with Binomial Responses

CUPED is a method for reducing the variance of an estimator by leveraging another correlated variable. Many tout the importance and impact of such a method on speeding up A/B tests, but CUPED is not a panacea, and here I show some cases where it provides little value.

Published: April 6, 2023

CUPED (Deng, Yuan, and Salama-Manteau 2021) is a method to reduce the variance of a target random variable by involving a correlated random variable with known expectation. The magnitude of the savings is related to the magnitude of the correlation. After a short review of how CUPED works, I give a relationship between a predictive model with a certain precision and recall and the reduction in variance of the outcomes when CUPED is employed. Next, I assume the true population probability of conversion is described by a beta distribution, and show how sometimes even knowledge of this secret number can fail to significantly reduce the variance via CUPED in a somewhat degenerate case.

How CUPED Works

The way CUPED works is simple. Consider a random variable \(Y\) with mean \(\mu_Y\) and variance \(\sigma_Y^2\) that is our target. We want to use \(Y\) to estimate \(\mu_Y\), but \(\sigma_Y^2\) is large.

Let’s say some random variable \(X\) exists with mean \(\mu_X\) and variance \(\sigma_X^2\) but where it varies with \(Y\) according to

\[ \mathrm{Cov}(X,Y) = \tau~. \]

We may use the new random variable \(Y'\) to estimate \(\mu_Y\) where

\[ Y' = Y - \theta X + \theta \mu_X \]

for some value of \(\theta\) and note that \(\mathbb{E}[Y'] = \mathbb{E}[Y] = \mu_Y\). The variance of \(Y'\) is then

\[ \mathrm{Var}(Y') = \sigma_Y^2 + \theta^2 \sigma_X^2 - 2 \theta \tau \]

for our choice of \(\theta\). We can choose an optimal \(\theta\) to minimize the variance of \(Y'\) by taking

\[ \theta = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} = \frac{\tau}{\sigma_X^2}~. \]

Using this, we find the variance of \(Y'\) to be better than that of \(Y\) when

\[ \theta^2 \sigma_X^2 - 2 \theta \tau < 0 \quad\text{or equivalently when}\quad \frac{\tau^2}{\sigma_X^2} > 0~. \]

Since \(\sigma_X^2 > 0\) by definition, whenever \(\tau^2 > 0\) we find an improvement!

Wow! What a panacea for all our problems! But there’s a devil in the details here. As both estimators \(Y\) and \(Y'\) are unbiased, we find the relative efficiency of \(Y\) to \(Y'\) (using the optimal \(\theta\)) to be

\[ \frac{\mathrm{Var}(Y')}{\mathrm{Var}(Y)} = \frac{\sigma_Y^2 + \theta^2 \sigma_X^2 - 2 \theta \tau}{\sigma_Y^2} = 1 - \frac{\tau^2}{\sigma_X^2 \sigma_Y^2} = 1 - \rho^2 \]

where \(\rho\) is Pearson’s correlation between \(Y\) and \(X\).
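Here’s a minimal simulation sketch (my own illustration, not from the paper) confirming that the CUPED-adjusted variable attains the \(1 - \rho^2\) factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Y is correlated with X through a shared component
x = rng.normal(0.0, 1.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

theta = np.cov(x, y, bias=True)[0, 1] / x.var()  # optimal theta = Cov(X,Y)/Var(X)
y_prime = y - theta * x + theta * 0.0            # mu_X = 0 is known here

rho = np.corrcoef(x, y)[0, 1]
print(y_prime.var() / y.var())  # ~0.8
print(1 - rho**2)               # ~0.8
```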

This gives us the proportion of the original estimator \(Y\)’s variance that the augmented estimator \(Y'\) retains. The problem is that \(\rho\) might not be very big at all, so let’s characterize this in terms of a predictor’s precision and recall.

Efficiency In Terms Of Precision and Recall

Let \(Y\) be the target Bernoulli random variable with success rate \(p\) and \(\mathrm{Var}(Y) = p(1-p)\), and let \(X\) be something that predicts \(Y\) with precision and recall:

  • Precision \(a = P(Y=1|X=1)\), and
  • Recall \(b = P(X=1|Y=1)\).

We know that

  • \(P(X=1,Y=1) = bp = P(Y=1|X=1)P(X=1) = a P(X=1)\), so
  • \(P(X=1) = \frac{b}{a}p\) and
  • \(P(X=0) = 1-\frac{b}{a}p~.\)

To calculate correlation we need some moments of \(X\) and the covariance of \(X\) and \(Y\). Note that \[\begin{align*} \mathbb{E}[X] &= P(X=1) = \frac{b}{a}p = \mathbb{E}[X^2]~\quad\text{and}\\ \mathrm{Var}(X) &= \frac{b}{a}p\left(1-\frac{b}{a}p\right)~. \end{align*}\]

For covariance, \[\begin{align*} \mathbb{E}[XY] &= P(X=1,Y=1) = bp~,\quad\text{so}\\ \mathrm{Cov}(X,Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = bp - \frac{b}{a} p^2 \\ &= bp \left(1 - \frac{1}{a}p\right)~. \end{align*}\]

The Pearson correlation is then

\[ \rho= \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}} = (a-p)\sqrt{\frac{b}{(1-p)(a-bp)}} \]
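As a quick check of this formula, here’s a throwaway simulation (the joint pmf construction is my own) that draws \((X, Y)\) pairs with a given precision, recall, and base rate, then compares the empirical correlation to the closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, p = 0.8, 0.6, 0.1  # precision, recall, conversion rate

# Joint pmf of (X, Y) implied by precision a, recall b, and base rate p
p11 = b * p                  # P(X=1, Y=1)
p10 = (b / a) * p - p11      # P(X=1, Y=0)
p01 = p - p11                # P(X=0, Y=1)
p00 = 1 - p11 - p10 - p01    # P(X=0, Y=0)

cells = rng.choice(4, size=1_000_000, p=[p11, p10, p01, p00])
x = (cells <= 1).astype(float)                    # cells 0 and 1 have X=1
y = ((cells == 0) | (cells == 2)).astype(float)   # cells 0 and 2 have Y=1

print(np.corrcoef(x, y)[0, 1])                         # empirical rho
print((a - p) * np.sqrt(b / ((1 - p) * (a - b * p))))  # closed form, ~0.664
```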

Let’s look at what the savings looks like at four different values of \(p\). Note that the axes of our plots will run from \(p\) to 1, since we shouldn’t be able to do worse than having precision \(p\) and recall \(p\). To see why, note that random guessing can achieve a precision and recall of \(p\): assume \(X\) and \(Y\) are independent, then

\[ \text{Precision} = \frac{P(X=1,Y=1)}{P(X=1)} = \frac{p^2}{p} = p = \frac{P(X=1,Y=1)}{P(Y=1)} = \text{Recall}~. \]

Below are four contour plots of the variance savings across precision and recall, the top-left in the case that \(p=0.5\) and the top-right when \(p=0.1\). On the bottom left and right are the plots for very small values of \(p\).

One thing that is surprising about this (to me at least) is that precision is more important than recall. Anyway, here is a calculator to play around with some values.
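In code form, a minimal Python version of that calculator (it just evaluates the \(1 - \rho^2\) formula above; the function name is mine):

```python
import math

def relative_variance(precision: float, recall: float, p: float) -> float:
    """Relative variance Var(Y')/Var(Y) = 1 - rho^2 for a binary
    predictor X of a Bernoulli(p) outcome Y."""
    a, b = precision, recall
    rho = (a - p) * math.sqrt(b / ((1 - p) * (a - b * p)))
    return 1 - rho**2

# e.g. precision 0.8, recall 0.6, conversion rate 0.1
print(relative_variance(0.8, 0.6, 0.1))  # ~0.56: CUPED keeps ~56% of the variance
```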


Precision and recall, combined with the overall success rate, can give us an idea of how much we can save with a CUPED estimator. Next, let’s examine a different model of what’s going on with CUPED, one based upon having true knowledge about how likely each person is to make a purchase.

A Beta-Binomial Model of Conversions

Consider a population of \(n\) different people each with a random probability of converting \(p_i\) for \(i=1,\ldots,n\) where:

\[ p_i \overset{\mathrm{iid}}{\sim}\mathrm{Beta}(\alpha,\beta),\quad \mathbb{E}[p_i^k] = \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r},\quad \mathrm{Var}(p_i) = \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}~. \]

We observe subject \(i\)’s conversion \(X_i\), distributed conditionally on \(p_i\) as:

\[ X_i | p_i \sim \mathrm{Bin}(1,p_i),\quad \mathbb{E}[X_i|p_i] = p_i,\quad \mathrm{Var}(X_i|p_i)=p_i (1-p_i)~ \text{for}~i=1,\ldots,n. \]
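Before going further, a quick numerical sanity check of the beta moment formula above (a throwaway sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, k = 10.0, 50.0, 3

samples = rng.beta(alpha, beta, size=1_000_000)
print((samples**k).mean())                                            # empirical
print(np.prod([(alpha + r) / (alpha + beta + r) for r in range(k)]))  # ~0.00582
```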

This model is flexible in that, for \(\alpha > 1\) and \(\beta > 1\), we get a smooth unimodal distribution describing a heterogeneous population with somewhat similar probabilities of conversion. This case feels more reasonable to me as a model for an online experiment.

If instead \(\alpha < 1\) and \(\beta < 1\) then we get a two-peaked distribution of our \(p_i\)’s where almost everyone is near zero or near one. This situation is interesting in that it may inform what happens if we have a good predictive model (high sensitivity and specificity).

We wish to use the CUPED estimator \(\hat{p}\), where

\[ \hat{p} = \frac{1}{n} \sum_{i=1}^n X_i - \theta \frac{1}{n} \sum_{i=1}^n p_i + \theta \mathbb{E}[p_1]~. \]

Variance of CUPED Estimator

The law of total variance gives us

\[ \mathrm{Var}(\hat{p}) = \mathbb{E}[\mathrm{Var}(\hat{p}|p_i,~i=1,\ldots,n)] + \mathrm{Var}(\mathbb{E}[\hat{p}|p_i,~i=1,\ldots,n]) \]

so we can calculate the variance of our estimator in two pieces. The first term is \[\begin{align*} \mathbb{E}\bigl[\mathrm{Var}(\hat{p}|p_i,i=1,\ldots,n)\bigr] &= \frac{1}{n^2} \mathbb{E}\left[ \sum_{i=1}^n p_i (1 - p_i)\right] = \frac{1}{n} \left(\mathbb{E}[p_1] - \mathbb{E}[p_1^2]\right) \\ &= \frac{1}{n} \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha + \beta)}~. \end{align*}\] The second term is \[\begin{align*} \mathrm{Var}\bigl(\mathbb{E}[\hat{p}|p_i, i=1,\ldots,n]\bigr) &= \mathrm{Var}\left( \frac{1-\theta}{n}\sum_{i=1}^n p_i + \theta \mathbb{E}[p_1] \right) = \left(\frac{1-\theta}{n}\right)^2 \sum_{i=1}^n \mathrm{Var}(p_i) \\ &= \frac{(1-\theta)^2}{n} \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha + \beta + 1)}~. \end{align*}\] Thus, the total variance of the CUPED estimator is

\[ \mathrm{Var}(\hat{p}) = \left(1 + \frac{(1-\theta)^2}{\alpha+\beta}\right) \frac{1}{n} \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha + \beta)} \]

which is obviously minimized at \(\theta=1\) where

\[ \mathrm{Var}(\hat{p}) = \frac{1}{n} \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha + \beta)}~. \]

Variance of Naive Estimator

Let’s now look at the ratio of the variance of the original estimator \(\bar{X}\) to the variance of the CUPED estimator. Note that this is the inverse of the ratio we used in the last section, so larger is better here.

The variance of the naive estimator \(\bar{X}\) goes in the numerator. Applying the law of total variance again, the first term is

\[ \mathbb{E}\bigl[\mathrm{Var}(\bar{X}|p_i,i=1,\ldots,n)\bigr] = \frac{1}{n^2} \mathbb{E}\left[ \sum_{i=1}^n p_i (1 - p_i)\right] = \frac{1}{n} \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha + \beta)}~. \]

and the second term is

\[ \mathrm{Var}\bigl(\mathbb{E}[\bar{X}|p_i, i=1,\ldots,n]\bigr) = \mathrm{Var}\left( \frac{1}{n}\sum_{i=1}^n p_i \right) = \frac{1}{n}\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}~. \]

Thus the total variance of the naive estimator is

\[ \mathrm{Var}(\bar{X}) = \left(1 + \frac{1}{\alpha + \beta}\right) \frac{1}{n}\frac{\alpha \beta}{(\alpha + \beta) (\alpha + \beta + 1)}~. \]

Ratio of Variances

The ratio of the variance of the naive estimator to that of the optimal CUPED estimator is thus:

\[ \frac{\mathrm{Var}(\bar{X})}{\mathrm{Var}(\hat{p})} = 1+\frac{1}{\alpha + \beta}~. \]

Note that this is the inverse of the previous variance ratio from the precision and recall section, where a larger value means a smaller CUPED estimator variance.

If our prior is uniform (\(\alpha = \beta = 1\)) then \(\frac{\mathrm{Var}(\bar{X})}{\mathrm{Var}(\hat{p})} = \frac{3}{2}\), which shows a substantial improvement from the CUPED estimator. Note that as \(\alpha\) and \(\beta\) get larger and larger the prior becomes more focused around a smaller set of values and the improvement in variance falls toward 1. For concreteness, let’s look at an example. Consider the following distribution of \(p_i\)’s with mean \(1/6\), having parameters \(\alpha = 10\) and \(\beta = 50\) (the mean is the red dashed line and plus/minus 2 standard deviations are given by the light pink dashed lines). Here \(\alpha + \beta = 60\), so the ratio is only \(1 + \frac{1}{60} \approx 1.017\): less than a two percent reduction in variance.
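A small Monte Carlo sketch (my own, using the known \(p_i\)’s as the covariate with \(\theta = 1\)) bears this out:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, n, reps = 10.0, 50.0, 100, 100_000

mu = alpha / (alpha + beta)          # E[p_i], known to the CUPED estimator
p = rng.beta(alpha, beta, size=(reps, n))
x = rng.binomial(1, p)

naive = x.mean(axis=1)
cuped = naive - p.mean(axis=1) + mu  # theta = 1

print(naive.var() / cuped.var())     # empirical ratio, ~1.017
print(1 + 1 / (alpha + beta))        # theoretical: 1 + 1/60
```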

Having the values of the \(p_i\)’s is the absolute best anyone could do to describe the customers in this model, but the overall uncertainty due to the binarization of the response is too large for CUPED to make a sizeable contribution when the distribution of customer conversion probabilities is this tight.

If we instead turn to a model with \(\alpha < 1\) and \(\beta < 1\), so that our prior spikes probability around 0 and 1, then letting \(\alpha \downarrow 0\) and \(\beta \downarrow 0\) we find obscene CUPED improvements, since \(1 + \frac{1}{\alpha + \beta}\) blows up. Such a situation means the \(p_i\)’s only take values very near zero and one, so this is when you have a predictor with very high sensitivity and specificity.

Discussion

Lots of people like to use CUPED, and when we can find variables with good correlation (and measurable before the experiment, within your filtration or whatever) it can produce some good savings in variance. But I don’t see a lot of people talking about how much savings there is, or about the places where it doesn’t work great. Here, we saw two ways of looking at the problem: one giving us a way to estimate our savings from using CUPED, and the other suggesting certain situations aren’t helped much by CUPED.

Anyway, here are some links about CUPED or whatever.

References

Deng, Alex, Lo-Hua Yuan, and Alexandre Salama-Manteau. 2021. “Variance Reduction for Experiments with One-Sided Triggering Using CUPED.” https://www.researchgate.net/publication/357365361_Variance_Reduction_for_Experiments_with_One-Sided_Triggering_using_CUPED.