Notes on Hsu’s Method for Conversion Data
Hsu’s method is a way to compare each treatment to the best of the remaining treatments while preserving the overall type 1 error rate. The result is confidence intervals that tell you whether a particular treatment is dominated by another or whether it is among the best. Here are my notes on the procedure.
Author’s Note: This post is me pulling together notes and reworking a few things, so it is even less guaranteed to be correct than other posts (which have no guarantee of correctness, so make of that what you will). If you find errors, please let me know.
Hsu’s method, often called multiple comparisons with the best (MCB), allows us to construct confidence intervals that identify which treatments have the largest conversion rates and which are dominated by some other treatment. The confidence intervals produced have simultaneous coverage of the difference, for each treatment \(i\),
\[ \mu_i - \max_{j \ne i} \mu_j~. \]
When the interval covers positive values we know the treatment is among the candidates for the best. Otherwise, we know that it is dominated by some other treatment in our experiment, and we get some sense of how inferior it could be.
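For example, with hypothetical true rates \(\mu = (0.10, 0.11, 0.12)\), the parameter for treatment 3 is \(0.12 - \max(0.10, 0.11) = 0.01\), so it beats the best of the rest by a percentage point, while for treatment 1 it is \(0.10 - \max(0.11, 0.12) = -0.02\), so it is dominated by two points.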
Hsu’s Method
First, define the convenience functions \(x^+\) and \(x^-\) where
\[ x^+ = \left\{ \begin{array}{ll} x & \text{if}~x > 0,\\ 0 & \text{otherwise,} \end{array} \right. \quad\text{and}\quad -x^- = \left\{ \begin{array}{ll} x & \text{if}~x < 0,\\ 0 & \text{otherwise.} \end{array} \right. \]
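In code these are just \(x\) clipped at zero; a minimal NumPy sketch (the variable names are mine):

```python
import numpy as np

x = np.array([-0.3, 0.0, 0.7])
x_plus = np.maximum(x, 0.0)           # x^+  -> [ 0. ,  0. ,  0.7]
minus_x_minus = np.minimum(x, 0.0)    # -x^- -> [-0.3,  0. ,  0. ]
```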
Confidence intervals are of the form \([D_i^-, D_i^+],~i=1,\ldots,k\) where \[\begin{align*} D_i^+ &= \left(\min_{j \ne i} \left\{ \hat{\mu}_i - \hat{\mu}_j + d \sqrt{\sigma_i^2 + \sigma_j^2}\right\}\right)^+, \\ G &= \{i : D_i^+ > 0\},\quad\text{and} \\ D_i^- &= \left\{ \begin{array}{l} 0 \quad\text{if}\quad G = \{i\},\\ \min_{j \in G, j\ne i}\left\{\hat{\mu}_i - \hat{\mu}_j - d \sqrt{\sigma_i^2 + \sigma_j^2}\right\}\quad\text{otherwise.} \end{array} \right. \end{align*}\] The \(d\) are identical to the one-sided Dunnett’s constants as discussed in my notes on Dunnett’s procedure. Note that when the sample sizes are imbalanced we’ll be generating a \(d_i\), treating level \(i\) as the control, for each of the \(i = 1,\ldots,k\) treatments.
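To make the construction concrete, here is a minimal sketch in Python under the normal approximation \(\hat\sigma_i^2 = \hat{p}_i(1-\hat{p}_i)/n_i\) for conversion data. The function name `hsu_mcb_intervals` and the convention of passing precomputed Dunnett constants are my own; the constants themselves would come from the computation in the Dunnett notes.

```python
import numpy as np

def hsu_mcb_intervals(successes, trials, d):
    """Constrained MCB intervals [D_i^-, D_i^+] for mu_i - max_{j != i} mu_j.

    successes, trials: per-treatment conversion counts and sample sizes.
    d: one-sided Dunnett constant(s); a scalar, or one d_i per treatment
       (d_i generated treating level i as the control).
    """
    successes, trials = np.asarray(successes, float), np.asarray(trials, float)
    p = successes / trials
    var = p * (1 - p) / trials                      # normal-approximation variances
    k = len(p)
    d = np.broadcast_to(np.asarray(d, dtype=float), (k,))

    # D_i^+ = (min_{j != i} { p_i - p_j + d_i sqrt(var_i + var_j) })^+
    D_plus = np.array([max(0.0, min(p[i] - p[j] + d[i] * np.sqrt(var[i] + var[j])
                                    for j in range(k) if j != i))
                       for i in range(k)])

    # G: treatments not yet ruled out as the best
    G = [i for i in range(k) if D_plus[i] > 0]

    # D_i^- = 0 if G = {i}, otherwise min over j in G, j != i
    D_minus = np.array([0.0 if G == [i] else
                        min(p[i] - p[j] - d[i] * np.sqrt(var[i] + var[j])
                            for j in G if j != i)
                        for i in range(k)])
    return D_minus, D_plus
```

For example, `hsu_mcb_intervals([510, 540, 560], [5000, 5000, 5000], 2.06)` (with a made-up \(d\)) returns a lower and upper bound per arm for \(\mu_i - \max_{j \ne i} \mu_j\).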
Why Does This Work?
Let \((1),\ldots,(k)\) represent the unknown indices such that \(\mu_{(1)} \le \mu_{(2)} \le \ldots \le \mu_{(k)}\). We don’t know \((k)\) but it is a quantity of interest. Consider the event \(E\)
\[ E = \left\{ \hat{\mu}_{(k)} - \mu_{(k)} > \hat{\mu}_i - \mu_i - d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\}~. \]
By definition of \(d\) we have \(P(E) = 1-\alpha\).
We don’t know \((k)\) but we can give two larger events that don’t rely on knowledge of \((k)\) and whose intersection contains \(E\). First, \[\begin{align*} E &= \left\{ \hat{\mu}_{(k)} - \mu_{(k)} > \hat{\mu}_i - \mu_i - d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &= \left\{ \mu_{(k)} - \mu_{i} < \hat{\mu}_{(k)} - \hat{\mu}_i + d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &\subseteq \left\{ \mu_{(k)} - \max_{j \ne (k)} \mu_j < \hat{\mu}_{(k)} - \hat{\mu}_i + d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &= \left\{ \mu_{(k)} - \max_{j \ne (k)} \mu_j < \hat{\mu}_{(k)} - \max_{j \ne (k)} \hat{\mu}_j + d\sqrt{\sigma_i^2 + \sigma_{(k)}^2} \right\} \\ &= \left\{ \mu_{(k)} - \max_{j \ne (k)} \mu_j < \hat{\mu}_{(k)} - \max_{j \ne (k)} \hat{\mu}_j + d\sqrt{\sigma_i^2 + \sigma_{(k)}^2} \quad\text{and}\quad \mu_i - \max_{j \ne i} \mu_j \le 0 ~~\text{for all}~~i \ne (k)\right\} \\ &\subseteq \left\{ \mu_i - \max_{j \ne i} \mu_j \le \left(\hat{\mu}_{i} - \max_{j \ne i} \hat{\mu}_j + d\sqrt{\sigma_i^2 + \sigma_j^2}\right)^+~~\text{for all}~~i \right\} \\ &=: E_1~. \end{align*}\] Next, \[\begin{align*} E &= \left\{ \hat{\mu}_{(k)} - \mu_{(k)} > \hat{\mu}_i - \mu_i - d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &= \left\{ \mu_{i} - \mu_{(k)} > \hat{\mu}_{i} - \hat{\mu}_{(k)} - d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &= \left\{ \mu_{i} - \max_{j \ne i} \mu_j > \hat{\mu}_{i} - \hat{\mu}_{(k)} - d\sqrt{\sigma_i^2 + \sigma_{(k)}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &\subseteq \left\{ \mu_{i} - \max_{j \ne i} \mu_j > \hat{\mu}_{i} - \max_{j \ne i} \hat{\mu}_{j} - d\sqrt{\sigma_i^2 + \sigma_{j}^2}~~\text{for all}~~i, i \ne (k) \right\} \\ &= \left\{ \mu_{i} - \max_{j \ne i} \mu_j > \hat{\mu}_{i} - \max_{j \ne i} \hat{\mu}_{j} - d\sqrt{\sigma_i^2 + \sigma_{j}^2}~~\text{for all}~~i, i \ne (k) \quad\text{and}\quad \mu_{(k)} - \max_{j \ne (k)} \mu_j \ge 0 \right\} \\ &\subseteq \left\{ \mu_i - \max_{j \ne i} \mu_j \ge -\left(\hat{\mu}_{i} - \max_{j \ne i} \hat{\mu}_j - d\sqrt{\sigma_i^2 + \sigma_j^2}\right)^-~~\text{for all}~~i \right\} \\ &=: E_2~. \end{align*}\] In each chain the extra condition tacked on in the second-to-last step holds automatically because \(\mu_{(k)}\) is the largest mean, so adding it doesn’t change the event; it just lets us extend the bound to all \(i\). By construction, \(E \subseteq E_1 \cap E_2\) so
\[ 1-\alpha = P(E) \le P(E_1 \cap E_2) = P(D_i^- \le \mu_i - \max_{j \ne i} \mu_j \le D_i^+~~\text{for}~~i=1,\ldots,k)~. \]
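A quick Monte Carlo check can make the coverage claim tangible. The sketch below is mine and leans on several simplifying assumptions: hypothetical true rates, a balanced design, a single \(d\) estimated by simulation straight from the definition of \(E\) (using the true \(\sigma_i\)), and the normal approximation for the binomial conversion counts, so the estimate is only rough.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 5000, 2000
mu = np.array([0.10, 0.11, 0.12, 0.12])        # hypothetical true conversion rates
k = len(mu)
sigma2 = mu * (1 - mu) / n
best = int(np.argmax(mu))
target = np.array([mu[i] - max(np.delete(mu, i)) for i in range(k)])

# d straight from the definition of E: the (1 - alpha) quantile of
#   max_{i != (k)} [ (eps_i - eps_(k)) / sqrt(sigma_i^2 + sigma_(k)^2) ]
eps = rng.normal(size=(200_000, k)) * np.sqrt(sigma2)
ratio = (eps - eps[:, [best]]) / np.sqrt(sigma2 + sigma2[best])
d = np.quantile(np.delete(ratio, best, axis=1).max(axis=1), 1 - alpha)

covered = 0
for _ in range(n_sims):
    p_hat = rng.binomial(n, mu) / n
    var = p_hat * (1 - p_hat) / n
    D_plus = np.array([max(0.0, min(p_hat[i] - p_hat[j] + d * np.sqrt(var[i] + var[j])
                                    for j in range(k) if j != i)) for i in range(k)])
    G = [i for i in range(k) if D_plus[i] > 0]
    D_minus = np.array([0.0 if G == [i] else
                        min(p_hat[i] - p_hat[j] - d * np.sqrt(var[i] + var[j])
                            for j in G if j != i) for i in range(k)])
    covered += np.all((D_minus <= target) & (target <= D_plus))

print(f"simultaneous coverage ~ {covered / n_sims:.3f} (target >= {1 - alpha})")
```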
Sample Size
The definition of power is murkier here than for Dunnett’s method. Consider the situation where the means \(\mu_1,\ldots,\mu_k\) are each either \(p_0\) or \(p_1\) with \(p_0 = \mu_{(1)} < \mu_{(2)} = \mu_{(3)} = \ldots = \mu_{(k)} = p_1\). Power would be the rate at which we detect that each of \(\mu_{(2)},\ldots,\mu_{(k)}\) is greater than \(\mu_{(1)}\). But this is exactly the definition used in Dunnett’s one-sided test (with \(\mu_{(1)}\) standing in for the control); we just don’t know which treatment is \(\mu_{(1)}\) ahead of time. That doesn’t matter, though, because whichever index it really is, we have the correct probability of detecting that the others are superior.
Just make sure you take into account that there is no separate control treatment indexed at zero; one of the \(k\) treatments plays that role.
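Here is a small simulation sketch of that power calculation in the spirit above. Everything in it is an assumption for illustration: the arm count, sample size, rates, and the constant `d = 2.2`, which in practice would come from the one-sided Dunnett computation.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, p0, p1, d, n_sims = 4, 4000, 0.10, 0.12, 2.2, 5000

mu = np.array([p0] + [p1] * (k - 1))   # arm 0 plays the role of mu_(1)
wins = 0
for _ in range(n_sims):
    p_hat = rng.binomial(n, mu) / n
    var = p_hat * (1 - p_hat) / n
    # one-sided Dunnett-style lower bounds for mu_i - mu_0, i = 1, ..., k-1
    lower = p_hat[1:] - p_hat[0] - d * np.sqrt(var[1:] + var[0])
    wins += np.all(lower > 0)

print(f"estimated power ~ {wins / n_sims:.3f}")
```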
If you want to get into the weeds on Hsu’s method check out his book (Hsu 1996). Details can also be found in the documentation of the PASS software.