Binomial Proportion Confidence Intervals are Ugly and Messy

Estimation
Confidence Intervals
Binomial Data

Binomial confidence intervals will let you down and hurt you. Here, I look at how, when, and why, and examine alternative formulations of the intervals that have better behavior.

Published

April 21, 2016

Confidence intervals on proportions get a lot of use in industrial settings. A \(100(1-\alpha)\%\) confidence interval is a function of data that covers the true value \(100 (1-\alpha) \%\) of the time. This can be difficult or impossible to do correctly, and when a consumer of analysis sees a confidence interval there’s no real tell when things went wrong.

What do I mean by “things went wrong?” I mean that the interval doesn’t cover the true value \(100(1-\alpha)\%\) of the time. When the interval covers the true value too often, we lose some power to detect differences from nearby values. When the interval covers the true value less often than advertised, we’re more likely to commit a type I error and believe a difference exists where none does. Note that in analyzing yield and defect rates, believing things are different when they are not, and believing things are the same when they are different, can both be expensive missteps.

It turns out that the method you find in most textbooks for generating proportion confidence intervals tends to fail very often when the proportion is near 0 or 1. This is often the case in industrial settings, where you’d expect defect rates to be small or yields to be high. Here, I’ll describe some failures I’ve seen, some alternative intervals in the literature, and discuss the true coverage of such intervals.

A Common Occurrence

Let’s say an engineer runs binary data \(X_i\), \(i=1,\ldots,n\) through Minitab’s interval plotting function. What comes out? Minitab’s interval plots are all based on the normal theory intervals, separately estimating the mean and standard deviation of the data, and constructing an interval using the critical values of the \(t\) distribution. If there are \(k\) “successes” out of the \(n\) “trials”, then the estimate of the mean is \(\hat{p} = \frac{k}{n}\). Minitab estimates the unbiased sample variance

\[ S^2 = \frac{\sum_{i=1}^n (X_i-\bar{X})^2}{n-1} = \frac{(n-k)\hat{p}^2+k(1-\hat{p})^2}{n-1} \]

and then constructs the interval

\[ \hat{p} \pm t_{\alpha/2,n-1} \sqrt{\frac{(n-k)\hat{p}^2+k(1-\hat{p})^2}{n(n-1)}}. \]

This isn’t the interval we’d get from the central limit theorem. \(\sqrt{n}(\hat{p} - p)\) converges in distribution to a normal with mean zero and variance \(p(1-p)\). This is where the traditional Wald interval comes from; the interval is

\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]

This is what you find in many textbooks. And while it’s more appropriate than the method Minitab uses by default, it’s still often not appropriate.

Things Go Off The Rails

Even though the Wald interval is in some sense “more” appropriate, both it and the normal theory interval almost never attain the coverage probability we want. We may construct “\(95\%\)” confidence intervals with these procedures and find they cover the true proportion only \(60\%\) of the time or less.

It turns out that this approximation (though it appears in so many textbooks) is just awful. Many papers have been written on this problem, e.g. [@brown2001], [@brown2002], and [@brown2003]. The problems are most extreme when \(p\) is near 0 or 1 and least severe near \(p=0.5\). Thus, for most applications involving defects or yield, these approximations are likely inappropriate.

There are many other confidence intervals for binomial proportions proposed in the literature, and they all have somewhat different coverage probabilities. I want to discuss a few of them, but first let’s discuss how to calculate the coverage probability in R.

Calculating Coverage Probabilities in R

The easiest way that I know to calculate coverage probabilities in R is to exploit the fact that there are only \(n+1\) possible values for \(k\), and the interval depends on the data only through \(k\). For a given \(k\) the procedure must construct the same interval every time; if it didn’t, re-running the analysis on the same data would yield different results, and this makes everyone uncomfortable.

First, I’ll define some constants, a helper operator, and then the functions to construct the confidence intervals.

##
## Constants
##
n <- 35          ## number of trials
alpha <- 0.05    ## for 95% confidence intervals

##
## Utility operator: is a inside the [lower, upper] interval given by
## each row of the two-column matrix b?
##
`%between%` <- function(a,b){ a>=b[,1] & a<=b[,2] }

##
## Confidence Intervals
##
normal.ci <- function(k,n){
    p.hat <- k/n
    ## standard error of the mean, using the unbiased sample variance
    s <- sqrt( 
        (
            (n-k)*p.hat^2 + k*(1-p.hat)^2
        ) / (
            n*(n-1)
        )
    )
    cbind(
        p.hat + qt(alpha/2,n-1)*s,
        p.hat + qt(1-alpha/2,n-1)*s
    )
}

Wald.ci <- function(k,n){
    p.hat <- k/n
    cbind(
        p.hat + qnorm(alpha/2)*sqrt(p.hat*(1-p.hat)/n),
        p.hat + qnorm(1-alpha/2)*sqrt(p.hat*(1-p.hat)/n)
    )
}

Now that we have those, finding the coverage probability is as simple as using

get.coverage.probs <- function(f.ci){
    p.seq <- seq(from=0,to=1,length.out=1000)
    ## the n+1 candidate intervals don't depend on p, so build them once
    ci <- f.ci(0:n,n)
    coverage.probs <- vapply(
        p.seq,
        function(p){
            sum(
                dbinom(0:n,n,p)*(p %between% ci)
            )
        },
        0.
    )
    return(coverage.probs)
}

on the confidence interval functions defined above.

What is happening is that for each \(p\) we’re looking at all of the possible \(k\) we could observe. We multiply the probability of seeing each specific \(k\) by 1 if the interval for \((k,n)\) covers the true proportion and by 0 if it does not; summing these terms gives the proportion of the time the interval covers \(p\).

So with this we can start quantifying how badly these intervals perform relative to one another. For the two confidence intervals defined so far, here’s how bad things get.
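A quick sketch of how to draw that comparison with base graphics (the plotting choices here are arbitrary; any layout works):

p.seq <- seq(from=0, to=1, length.out=1000)  ## same grid as get.coverage.probs
plot(p.seq, get.coverage.probs(normal.ci), type="l", col="blue",
     ylim=c(0,1), xlab="True proportion p", ylab="Coverage probability")
lines(p.seq, get.coverage.probs(Wald.ci), col="red")
abline(h=1-alpha, lty=2)  ## nominal 95% line
legend("bottomright", legend=c("Normal theory","Wald"),
       col=c("blue","red"), lty=1)

Running this shows what the papers cited above describe: coverage sags below the dashed nominal line over much of the range, and collapses near 0 and 1.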

Alternate Confidence Intervals

The series of papers by Brown et al. mentioned above reviewed several alternative confidence intervals. One of the most promising was the Agresti-Coull interval. For this interval, let \[\begin{align*} \tilde{k} &= k + \frac{Z_{\alpha/2}^2}{2},\\ \tilde{n} &= n + Z_{\alpha/2}^2, \end{align*}\] where \(Z_{\alpha/2}\) is the critical value of the normal distribution having \(\alpha/2\) probability to the left. Then the confidence interval is identical to the Wald interval except using \(\tilde{k}\) for \(k\) and \(\tilde{n}\) for \(n\). In R this is

AgrestiCoull.ci  <- function(k,n){
    kk <- k+qnorm(alpha/2)^2/2
    nn <- n+qnorm(alpha/2)^2
    p.hat <- kk/nn
    cbind(
        p.hat + qnorm(alpha/2)*sqrt(p.hat*(1-p.hat)/nn),
        p.hat + qnorm(1-alpha/2)*sqrt(p.hat*(1-p.hat)/nn)
    )
}
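For \(\alpha = 0.05\), \(Z_{\alpha/2}^2 \approx 3.84\), so this amounts to adding roughly two successes and two failures before computing a Wald-style interval. A quick illustration of why that helps (the specific numbers are just an example):

Wald.ci(0, n)          ## degenerate interval [0, 0] when nothing is observed
AgrestiCoull.ci(0, n)  ## positive width; the lower endpoint can dip below 0
                       ## and is often clipped to [0, 1] in practice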

[Figure: Coverage probabilities of Wald and Agresti-Coull intervals.]

The Agresti-Coull interval is nice in that it never drops below the nominal coverage rate, but it’s also very conservative, which is a problem in its own right: the intervals it produces are too wide.

Another option is less pretty: it inverts the binomial test, taking as the interval the set of \(p\) under which the observed \(k\) is not too extreme, i.e. under which neither binomial tail probability at \(k\) falls below \(\alpha/2\). Some people (me included) call these exact binomial confidence intervals, but they’re really not exact. Brown et al. refer to them as Clopper-Pearson intervals. Clopper-Pearson intervals are nice because they use the actual binomial distribution, whereas the other methods mentioned so far lean on a normal approximation. But working directly with the discrete distribution makes the endpoints harder to compute. Fortunately, the binomial tail probabilities are related to the beta distribution, so we can use R’s qbeta to find what we need.

ClopperPearson.ci <- function(k,n){
    cbind(
        ifelse(k==0, 0, qbeta(alpha/2, k, n-k+1)),    ## lower bound
        ifelse(k==n, 1, qbeta(1-alpha/2, k+1, n-k))   ## upper bound
    )
}
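The beta quantiles really are just the binomial tails in disguise: the upper bound \(u\) satisfies \(P(X \le k \mid n, u) = \alpha/2\), and the lower bound \(l\) satisfies \(P(X \ge k \mid n, l) = \alpha/2\). A quick sanity check (the choice \(k=7\) is arbitrary):

k0 <- 7
u <- qbeta(1-alpha/2, k0+1, n-k0)  ## upper Clopper-Pearson bound
pbinom(k0, n, u)                   ## equals alpha/2 = 0.025
l <- qbeta(alpha/2, k0, n-k0+1)    ## lower bound
1 - pbinom(k0-1, n, l)             ## also equals alpha/2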

A closely related interval comes from a Bayesian analysis in which the proportion \(p\) is given the Jeffreys prior, a \(\mathrm{Beta}(1/2,1/2)\) distribution on \([0,1]\). It turns out that constructing the equal-tailed credible interval (leaving \(\alpha/2\) posterior probability out of each tail) yields a very similar function called the Jeffreys interval.

JeffreysInterval.ci <- function(k,n){
    cbind(
        qbeta(alpha/2,k+0.5,n-k+0.5),
        qbeta(1-alpha/2,k+0.5,n-k+0.5)
    )
}
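To see the difference in conservatism concretely (my own example, using the constants above), compare the two at \(k=0\):

ClopperPearson.ci(0, n)    ## upper bound around 0.10 for n = 35
JeffreysInterval.ci(0, n)  ## a visibly shorter interval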

The performance of these two is interesting. Clopper-Pearson is very conservative, as Agresti-Coull was. The Jeffreys interval is less conservative and still stays “near” the nominal \(100(1-\alpha)\%\) coverage.

[Figure: Coverage probabilities of Clopper-Pearson and Jeffreys intervals.]

The Clopper-Pearson and Agresti-Coull intervals look similar, but require vastly different amounts of computation.

[Figure: Clopper-Pearson vs. Agresti-Coull intervals.]
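A crude way to see the computational gap (timings vary by machine; this is only a sketch):

system.time(for(i in 1:10000) ClopperPearson.ci(0:n, n))  ## numeric inversion of the beta CDF
system.time(for(i in 1:10000) AgrestiCoull.ci(0:n, n))    ## closed-form arithmetic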

Parsimony, When We Need It And When We Do Not

At the end of the day, what’s the best recommendation for the engineer? We want a parsimonious outcome — one which is not too complicated to convey and implement and is also very accurate. Brown et al. come up with different recommendations for different situations. I like the Agresti-Coull intervals overall for calculations on yield. They’re simple and their conservative nature feels safer than some of the other intervals.

The conservative nature of the Agresti-Coull intervals may be problematic if you’re more concerned about power than type 1 error, such that having conservative intervals lowers your overall power beyond what is acceptable. In these cases (e.g. defect monitoring) it may make more sense to use other intervals.

Today we have modern computers that do the hard work for us, so who cares how complex a method we use if it’s just a click of a button? The R library binGroup includes many types of proportion confidence intervals, among them a “second order corrected” one whose coverage, averaged over \(p\), lands much closer to the nominal \(100(1-\alpha)\%\) rate than the Agresti-Coull and Wald intervals.

library(binGroup)

SecondOrderCorrected.ci <- function(k,n){
    ## loop over k with vapply, collecting each (lower, upper) pair into a row
    t(vapply(
        k,
        function(k0){
            binSOC(n, k0, conf.level=1-alpha, alternative="two.sided")
        },
        c(0.,0.)
    ))
}
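We can reuse get.coverage.probs to see how it behaves next to the earlier intervals (the plotting choices are mine):

p.seq <- seq(from=0, to=1, length.out=1000)
plot(p.seq, get.coverage.probs(SecondOrderCorrected.ci), type="l", col="darkgreen",
     ylim=c(0,1), xlab="True proportion p", ylab="Coverage probability")
lines(p.seq, get.coverage.probs(Wald.ci), col="red")
lines(p.seq, get.coverage.probs(AgrestiCoull.ci), col="blue")
abline(h=1-alpha, lty=2)
legend("bottomright", legend=c("Second order corrected","Wald","Agresti-Coull"),
       col=c("darkgreen","red","blue"), lty=1)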

The resulting plot is interesting because it appears that the second order corrected method is a good tradeoff between the traditional Wald interval and the Agresti-Coull interval. Though coverage does move around more than we’d like, it’s got a better average performance than Wald or Agresti-Coull.

Note that, although these perform better than the Wald interval, they still perform poorly when used for rates too close to 0 or 1.

Another way we can compare these intervals is in terms of the average width of the confidence interval. When the width is small then we can more often detect differences from nearby values than when the width is large.

Let’s first construct an R function to calculate average confidence interval widths at each \(p\).

get.widths <- function(f.ci){
    p.seq <- seq(from=0,to=1,length.out=1000)
    ## again, the n+1 candidate intervals don't depend on p
    ci <- f.ci(0:n,n)
    ci.widths <- vapply(
        p.seq,
        function(p){
            ## expected width: each interval's width weighted by P(K = k)
            sum(
                dbinom(0:n,n,p)*(ci[,2]-ci[,1])
            )
        },
        0.
    )
    return(ci.widths)
}

Comparing confidence intervals by their expected widths is kind of a hack, but if one interval excludes more wrong proportions than another it is in some sense “better”. The following plot shows the average confidence interval width at a whole bunch of different proportions.
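A sketch of that plot using get.widths (again, the layout choices are arbitrary):

p.seq <- seq(from=0, to=1, length.out=1000)
plot(p.seq, get.widths(AgrestiCoull.ci), type="l", col="blue",
     xlab="True proportion p", ylab="Average interval width")
lines(p.seq, get.widths(Wald.ci), col="red")
lines(p.seq, get.widths(SecondOrderCorrected.ci), col="darkgreen")
legend("topright", legend=c("Agresti-Coull","Wald","Second order corrected"),
       col=c("blue","red","darkgreen"), lty=1)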

What’s worth noting is that the Agresti-Coull intervals have greater average width than the Wald and second order corrected intervals everywhere. The conservative nature of the Agresti-Coull intervals translates directly into width.

What is also interesting is that the second order corrected intervals have the narrowest average width over much of the space, even though their coverage oscillates somewhat. For this reason I like to use the second order corrected confidence intervals in every application. When people who don’t have access to R ask how to construct confidence intervals, I either do it for them or show them how to use the Agresti-Coull intervals.