Testing for the Best of Times Without Knowing the Worst of Times

Planning
Bayesian Methods

I was asked to answer a question about failure times when we had no data and the prospect of tremendous losses if we took a bad approach. Here’s my attempt to formulate some plan in such an environment.

Published

April 25, 2016

How do you determine how often to check a device that you’ve never seen fail? It’s a weird question that’s been on my mind today. At first I threw up my hands in the air — to choose any failure rate is folly! But then I had the idea of dusting off the old Bayesian Statistics book and going to town on the problem. Also, my reliability friends were all busy so I was on my own.

Consider a system that requires involved testing to determine if it's functioning correctly. Testing currently occurs once a day (the most frequently it can happen), and the system has gone \(n_0\) days in a row without a failure. Surely the fact that a failure hasn't occurred in \(n_0\) days should tell us something about the failure rate of the system.

Here I apply a uniform prior to the failure rate \(p\); then, once \(n_0\) failure-free days have passed, each subsequent inspection date is chosen so that the probability of a failure occurring before that inspection is \(\gamma\). A second approach, suggested by DCM, is to simply assume a failure at time zero and then estimate the failure rate at each step; it can be viewed as a particular prior within this same general framework.

Binomial Model

Let \(n_0\) represent the number of days without failure under a given failure rate \(p\). Further, let \(p \sim \mathrm{Beta}(\alpha,\beta)\). Then \[\begin{align*} \pi(n_0|p) &= (1-p)^{n_0} \\ \pi(p) &\propto p^{\alpha-1} (1-p)^{\beta-1} \\ \pi(p|n_0) &\propto \pi(n_0|p) \pi(p) \\ &= p^{\alpha-1} (1-p)^{n_0+\beta-1}\\ &\therefore\quad p \mid n_0 \sim \mathrm{Beta}(\alpha,\beta+n_0). \end{align*}\] Observing \(n\) further days without failure has likelihood \(\pi(n|p) = (1-p)^n\). The posterior predictive probability of \(n\) more failure-free days is then \[\begin{align*} \pi(n|n_0) &= \int \pi(n|p)~\pi(p|n_0)~\mathrm{d}p \\ &= \int (1-p)^n \frac{\Gamma(\alpha+\beta+n_0)}{\Gamma(\alpha)\Gamma(\beta+n_0)} p^{\alpha-1} (1-p)^{\beta+n_0-1}~\mathrm{d}p \\ &= \frac{\Gamma(\alpha+\beta+n_0)}{\Gamma(\alpha)\Gamma(\beta+n_0)} \int p^{\alpha-1} (1-p)^{\beta+n_0+n-1}~\mathrm{d}p \\ &= \frac{\Gamma(\alpha+\beta+n_0)\Gamma(\beta+n+n_0)}{\Gamma(\alpha+\beta+n_0+n)\Gamma(\beta+n_0)}. \end{align*}\] Thus, given \(n_0\), one simply finds the \(n\) at which \(\pi(n|n_0) = 1-\gamma\), i.e., the point where the predicted probability of a failure within the next \(n\) days reaches \(\gamma\), and rounds up to a whole number of days.
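
For the uniform prior used later (\(\alpha = \beta = 1\)) this has a simple closed form worth noting, since it makes the suggested step easy to sanity-check by hand: \[\begin{align*} \pi(n|n_0) &= \frac{\Gamma(2+n_0)\,\Gamma(1+n_0+n)}{\Gamma(2+n_0+n)\,\Gamma(1+n_0)} = \frac{1+n_0}{1+n_0+n}, \\ \pi(n|n_0) = 1-\gamma &\iff n = \frac{\gamma\,(1+n_0)}{1-\gamma}. \end{align*}\]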

There is one technical detail required for the implementation in R. Sometimes even inspecting the very next day (the earliest possible) leaves us with more than a \(\gamma\) probability of observing a failure by then. In this case we just use 1 day anyway, since we can sample no sooner. The R implementation follows:

##
## Posterior predictive probability of n additional failure-free days,
## given n0 failure-free days and a Beta(alpha, beta) prior on p
##
pos.pred.dist <- function(n,n0,alpha,beta){
    exp(
        lgamma(alpha+beta+n0) +
        lgamma(beta+n+n0) -
        lgamma(alpha+beta+n0+n) -
        lgamma(beta+n0)
    )
}

##
## Given n0 days without failure, find next date with 
## interval failure probability gamma.target
##
find.next.step.bayes <- function(gamma.target,n0,alpha,beta){
    q <- 1-gamma.target
    # If the probability of a failure for single day is too large then
    # simply force the use of a day - the smallest period.
    if(pos.pred.dist(1,n0,alpha,beta) <= q){
        return(1)
    }
    target.val <- uniroot(
        function(n){
            pos.pred.dist(n,n0,alpha,beta)-q
        },
        interval=c(1,n0/gamma.target),
        tol=1e-8
    )
    return(ceiling(target.val$root))
}

##
## Assume at time 0 there was a failure and use that to estimate the failure rate.
## (alpha and beta are unused here; kept so the signature matches find.next.step.bayes.)
##
find.next.step.assuming.t0.failure <- function(gamma.target,n0,alpha,beta){
    q <- 1-gamma.target
    p <- 1/(1+n0)
    if(1-p <= q){
        return(1)
    }
    target.val <- uniroot(
        function(n){
            exp(n*log(1-p))-q
        },
        interval=c(1,n0/gamma.target),
        tol=1e-8
    )
    return(ceiling(target.val$root))
}
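
As a quick illustration of how the two step functions are used (a small case I worked by hand, not from the original analysis): with \(n_0 = 10\) failure-free days and \(\gamma = 0.1\), the Bayes root is \(0.1 \times 11 / 0.9 \approx 1.22\) and the assumed-failure root is \(\log(0.9)/\log(10/11) \approx 1.11\), so both round up to a 2-day step.

find.next.step.bayes(0.1, 10, 1, 1)                  # 2
find.next.step.assuming.t0.failure(0.1, 10, 1, 1)    # 2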

##
## Given gamma.target, the max probability of failure in a period, construct
## a sampling plan out to max.time days.
##
get.sampling.plan <- function(
    gamma.target, 
    max.time, 
    alpha, 
    beta, 
    find.next.step=find.next.step.bayes
){
    plan.path <- c(1)
    while(sum(plan.path) < max.time){
        plan.path <- c(
            plan.path,
            find.next.step(gamma.target,sum(plan.path),alpha,beta)
        )
    }
    return(plan.path)
}

In terms of prior, I chose \(\alpha = \beta = 1\) to get a uniform distribution. Had we any idea what the failure rate might be we could adjust the prior accordingly.

Note that once a failure has occurred I’d switch to a different approach. Since we only detect failures at inspection we only know an interval in which the failure occurred, not the actual failure time.

Sampling Plans

For \(\gamma =0.1\) the sampling plan looks reasonable. Below, you see a plan that inspects every day for the first 9 days, then every other day for the next 8 days (four 2-day steps), and so on.
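
For reference, the plan below comes from a call along these lines (the ten-year max.time is my guess at the horizon used):

plan.10pct <- get.sampling.plan(gamma.target=0.1, max.time=3650, alpha=1, beta=1)
plan.10pct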

 [1]   1   1   1   1   1   1   1   1   1   2   2   2   2   3   3   3   4   4
[19]   4   5   5   6   7   7   8   9  10  11  12  14  15  17  19  21  23  26
[37]  29  32  35  39  44  49  54  60  67  74  82  91 102 113 126 139 155 172
[55] 191 213 236 262 291 324 360 400

If we haven't seen a failure after an entire year then we're inspecting roughly every month (32-day steps), and if we've gone almost 10 years without seeing a failure we only inspect about once a year. In reality I'd put a cap on how far apart we test - maybe never less often than once a month.
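
If one wanted that cap in code, a small variation on get.sampling.plan would do it (a sketch; the 30-day cap and the name get.capped.sampling.plan are my own choices):

##
## As get.sampling.plan, but never let the interval exceed cap days.
##
get.capped.sampling.plan <- function(
    gamma.target,
    max.time,
    alpha,
    beta,
    cap=30,
    find.next.step=find.next.step.bayes
){
    plan.path <- c(1)
    while(sum(plan.path) < max.time){
        next.step <- find.next.step(gamma.target,sum(plan.path),alpha,beta)
        plan.path <- c(plan.path, min(next.step,cap))
    }
    return(plan.path)
}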

All of the sampling plans are in a separate page noFailureSamplingPlan.htm due to how long and ugly they are. The two methods, Bayesian (uniform prior) and assumed failure, are given at \(\gamma \in \{0.5,0.25,0.1,0.05,0.01\}\). The number of steps of each size is given (I applied R's table function to the plans so I could align them for comparison). For example, for Bayes with \(\gamma = 0.01\) the "1" step size has 99 listed, so you would do every-day sampling for 99 days before moving to the next entry, \((50 \times 2)\) days of every-other-day sampling.
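
That summary is just the step sizes tabulated. For example, the Bayes plan at \(\gamma = 0.01\) can be summarized with something like this (the ten-year horizon is again my assumption):

table(get.sampling.plan(gamma.target=0.01, max.time=3650, alpha=1, beta=1))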

Note that the resulting sampling plans vary quite a bit based on \(\gamma\). If \(\gamma=0.01\) or \(\gamma=0.05\) you spend a lot of time daily sampling. At \(\gamma=0.1\) you have fewer daily runs. At \(\gamma = 0.5\) the sampling plan is very close to powers of 2.

Also striking is how close the uniform prior and the assume-one-failure approaches are. Given that assuming a time-zero failure requires far less mathematical hullabaloo, it may be an easier method for some people to describe to non-statistician peers.
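
A quick way to see why they agree: with the uniform prior the next step solves \(n = \gamma(1+n_0)/(1-\gamma)\), while the assumed-failure method solves \((1-p)^n = 1-\gamma\) with \(p = 1/(1+n_0)\). For small \(\gamma\) and large \(n_0\), \[ n_{\text{Bayes}} = \frac{\gamma(1+n_0)}{1-\gamma} \approx \gamma(1+n_0) \approx \frac{\log(1-\gamma)}{\log\!\left(1-\frac{1}{1+n_0}\right)} = n_{\text{assumed}}, \] so the two plans only separate noticeably when \(\gamma\) is large.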

One way we can evaluate sampling plans is by the expected length of the interval in which the failure is finally observed. This still integrates over all \(p\), assuming the uniform prior, but gives us an idea of what to expect. The expected interval length at failure is given in the table below:

\(\gamma\)    Bayes          Fail at Time Zero
0.50          11.34778495    7.982190177
0.25          5.319543854    4.456042322
0.10          2.486273413    2.367727009
0.05          1.673004032    1.651998871
0.01          1.094079621    1.093274017

These expectations are comforting in that the expected length of the interval containing the failure (and hence the delay before we would catch it) is rather small for \(\gamma\) values of 0.1 and below. For completeness, I'm including the function I wrote to calculate these expectations:

##
## Find the expected length of the step with failure
##
expected.interval.of.failure <- function(plan,alpha,beta){
    cplan <- cumsum(plan)
    n.probs <- vapply(
        2:length(plan),
        function(i){
            n <- plan[i]
            n0 <- cplan[i-1]  # failure-free days before this step
            prob.at.n <- integrate(
                function(p){
                    # Failure within this step * no failure before it * prior
                    # (1-(1-p)^n) * (1-p)^n0 * dbeta(p,alpha,beta)
                    (1-exp(n*log(1-p))) * exp(n0*log(1-p)) * dbeta(p,alpha,beta)
                },
                lower=0,
                upper=1,
                rel.tol=1e-8
            )$value
        },
        0.
    )
    sum(plan[-1]*n.probs/sum(n.probs))
}
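
To compute one of these expectations, hand a plan to the function, e.g. (the ten-year horizon is again my assumption):

plan.10pct <- get.sampling.plan(gamma.target=0.10, max.time=3650, alpha=1, beta=1)
expected.interval.of.failure(plan.10pct, alpha=1, beta=1)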

Extensions and Alternatives

One trivial extension is to work in units of days, weeks, and months. Specifically, as you build the plan, round each suggested step down to the largest of those units that fits (as sketched below). This caps the gap at roughly a month, which prevents those very sparse end sections where one waits 2,000 days before the next inspection, and makes it easier to plan.
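
A sketch of that snapping rule (the unit lengths and the helper name snap.to.unit are my own choices):

##
## Round a suggested step down to the largest of 1 day, 7 days, or ~30 days.
##
snap.to.unit <- function(n, units=c(1,7,30)){
    max(units[units <= n])
}
snap.to.unit(19)   # 7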

An alternative approach may be to take whatever method you would use for analysis once failures are observed, and treat the inspection sequence as a designed experiment, choosing inspection times to minimize the variability of the failure rate estimate. I haven't thought this one out very far.

Fake Data Sidebar

Some may object to simply pretending a failure has occurred. It turns out there's quite a body of work out there discussing the uses of faking your data (at least some of it). Some Bayesian problems can be formulated by adding fake observations instead of incorporating a prior, which is one way of looking at this problem. "Pseudo-data" techniques allow the hard-core Bayesians to use both faked data and a prior, essentially blending priors in a convenient way. Other problems can benefit from fake data as well. At the last QPRC, Peter Qian gave a talk about using fake data in large matrices to make them amenable (usually by making them orthogonal) to processing with more efficient algorithms.

Conclusion

I still don’t know the right way to address this problem. I’m leaving the comments open in the hopes that someone will suggest a better way. That’s it. Ball’s in your court now.