HSEM 3010 notes 004                                        February 11, 2007

Statistical Power and Sample Size
---------------------------------

Statistics is full of technical terms which in ordinary nontechnical
language have emotional content: terms like 'significance' and 'influence'
and 'normal'. 'Power' is one such term. It has nothing to do with political
power (as in the power of a dictator, or of an abusive family member) or
with power in physics (the rate of doing work; watts, etc.) or with muscle
power. Briefly, it is the probability that an experiment will have a
statistically significant outcome, given a hypothesis about the true
underlying relationship of the treatment factors to the endpoints.

Some other concepts and terminology must be explained first.

1. What is a p-value?

Assume you are carrying out an experiment to test a certain hypothesis
(which here will be called the NULL HYPOTHESIS, H_0). The null hypothesis
is that there is no true difference between two treatments, A and B. You
randomize some people to either Treatment A or Treatment B. You select a
measure X as the outcome. You collect data on X for each person in your
study. You find the mean value of X for treatment A and the mean value for
treatment B. Call these two values XBARA and XBARB. You compute the
difference, XBARA - XBARB. These are SAMPLE MEANS. The null hypothesis says
that the TRUE means, MU_A and MU_B, are the same. That is, H_0: MU_A = MU_B.

The outcome X for each person is a random variable. The sample means XBARA
and XBARB are random variables. That is, before you do the study, you
cannot say what XBARA and XBARB will be. They are unpredictable. If you do
the whole study over again, you will almost certainly get different values
for XBARA and XBARB. And, of course, XBARA will almost never equal the
'true' value MU_A. Similarly, XBARB in general will not equal the true
mean MU_B. Sometimes the difference between XBARA and XBARB is small.
Sometimes it is large.

A (two-sided) p-value for a given study is the probability that, if the
null hypothesis is true, you would get an absolute difference between the
two sample means as large as, or larger than, what you actually observed.
If the p-value is very small, that is an indication that, if the null
hypothesis is true, you have observed an unusual event. If the p-value is
large (e.g., p = .333), then you have observed an event which, if the null
hypothesis is true, is not particularly unlikely.

You take certain actions or make certain decisions depending on the size
of the p-value. If it is really small, you decide to REJECT the null
hypothesis. That is, the results of the experiment are so unlikely if the
null hypothesis is true that you decide it must be false. If the p-value
is large, you do NOT reject the null hypothesis.

Statisticians do not ever claim to have PROVED the null hypothesis, no
matter how large the p-value is. A large p-value is often evidence that
the difference between the true means is not very large. But it is never
PROOF, in the mathematical sense that you can, for example, prove there
are infinitely many prime numbers. Similarly, a very small p-value does
not DISPROVE the null hypothesis, even though people act like it does.
Even if p = 0.000001, there is a small probability that, if the null
hypothesis is true, you have observed an unusual event with a large
difference between the sample mean values XBARA and XBARB.

A very common cutoff for deciding when you reject the null hypothesis is
p < 0.05. This is widely accepted as 'significant' in medical studies.
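Aside: if you like to check things by computer, here is a minimal Python
sketch of the two-sided p-value just described. It is an illustration
only: it assumes the outcome X has a KNOWN common standard deviation s in
both groups (so the standardized difference is standard normal under H_0),
and the sample sizes and observed means below are made up. The scipy
library is assumed.

  # Sketch: two-sided p-value for the difference of two sample means,
  # assuming (for illustration only) a known common standard deviation s,
  # so that under H_0 the standardized difference is ~ N(0, 1).
  from scipy.stats import norm

  def two_sided_p_value(xbar_a, xbar_b, s, n_a, n_b):
      # Standard error of XBARA - XBARB under the null hypothesis
      se = s * (1.0 / n_a + 1.0 / n_b) ** 0.5
      z = (xbar_a - xbar_b) / se
      # Probability, under H_0, of an absolute difference at least
      # this large
      return 2 * norm.sf(abs(z))

  # Hypothetical numbers: 50 people per arm, s = 10, means 52 and 48
  print(two_sided_p_value(52.0, 48.0, 10.0, 50, 50))   # about 0.046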
It is a fact that, under the null hypothesis, p-values have what is called
a uniform distribution. What this means is that a p-value has an equal
chance of taking on any value between zero and one. More precisely, the
probability that a p-value is less than any given number, t, between 0 and
1, is equal to t itself. For example, the probability that a p-value is
less than .333 is .333; the probability that it is less than .777 is .777.
Again, all this is assuming the null hypothesis is true.

The probability that p < 0.05 is 0.05, which happens to equal 1/20. That
is, the probability of obtaining a 'significant' p-value, if the null
hypothesis is true, is 1 in 20. Not really all that unlikely.

Some statisticians think that saying 0.05 is the threshold for
significance is arbitrary, silly, and unscientific. Why should one
threshold work for all sorts of decisions? In one case, you may be using
it to decide which of two brands of popsicles tastes better. In another
case, you may be using it to decide whether a certain cancer drug is more
or less toxic than the standard therapy. In the latter case, the decision
could affect patients' chances of surviving chemotherapy. You might want a
stronger degree of certainty. R. A. Fisher, the most famous statistician
of the 20th century, actually did not like using 0.05 as a threshold. He
preferred 0.01. I assume he wanted to be more certain about rejecting null
hypotheses.

2. What is power?

You start with a test statistic (like the chi-square statistic, or the
t-statistic) and a level of significance, alpha. You specify a null
hypothesis. If your experiment (or study, or clinical trial) gives you a
p-value below alpha, you reject the null hypothesis. That's the test
procedure.

Now suppose that some hypothesis other than the null hypothesis is true.
This is called the alternative hypothesis, H_A. If the null hypothesis is:

  H_0: MU_A = MU_B (or equivalently, H_0: MU_A - MU_B = 0),

then the alternative hypothesis might be:

  H_A: MU_A - MU_B = 2.0

In general, a two-sided test is going to tend to have smaller p-values if
the alternative hypothesis is true than if the null hypothesis is true.
The probability that, if the alternative hypothesis is true, the observed
p-value is less than the specified significance level alpha, is called
POWER. In symbols, you can say

  power = prob(p-value < alpha | H_A),

that is, the power is the probability that your test statistic gives a
p-value less than your significance level alpha, given that the
alternative hypothesis is true.

So: power depends on the alternative hypothesis. The farther away the
alternative hypothesis is from the null hypothesis, the greater the power.
Power also depends on the significance level, alpha. The smaller alpha is,
the [greater, smaller] the power is. [Choose one!] Finally, power depends
on sample size. If the alternative hypothesis happens to be true, and you
carry out a clinical trial with a lot of participants (say, 10,000), you
have a better chance of getting a small p-value than if you do a small
clinical trial (with, say, 20 participants).

For some study designs and some statistical tests, it is possible to write
down a formula for computing power as a function of sample size, alpha,
and the alternative hypothesis.
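Aside: both facts above can be checked by simulation - that under H_0 a
p-value lands below alpha with probability alpha, and that under an
alternative it does so with probability equal to the power. A minimal
Python sketch (numpy and scipy assumed; the one-sided test and the values
of alpha and mu here are arbitrary illustrations):

  # Sketch: (a) under H_0 the p-value is uniform, so the fraction of
  # p-values below alpha is about alpha; (b) under an alternative that
  # fraction is larger -- it is the power. One-sided test of H_0: mu = 0
  # based on a single draw X ~ N(mu, 1).
  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(0)
  alpha, mu_alt, n_sims = 0.05, 2.0, 100_000

  x_null = rng.normal(0.0, 1.0, n_sims)    # draws under H_0
  x_alt = rng.normal(mu_alt, 1.0, n_sims)  # draws under H_A

  p_null = norm.sf(x_null)  # one-sided p-values
  p_alt = norm.sf(x_alt)

  print(np.mean(p_null < alpha))  # close to alpha = 0.05
  print(np.mean(p_alt < alpha))   # close to the power (about 0.64 here)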
3. How do you compute power?

Here is perhaps the simplest situation where you can compute power.
Suppose you have a random variable X which has a normal distribution with
standard deviation 1, and an unknown mean, mu. That is, X ~ N(mu, 1).

You consider two hypotheses: the null hypothesis, H_0, and an alternative
hypothesis, H_A.

  H_0: mu = 0. That is, X ~ N(0, 1).
  H_A: mu = some nonzero value. Then X ~ N(mu, 1).

You specify a significance level, alpha. You take a random sample value
from X's distribution: say, X_obs. If the probability is computed under
the assumption that the null hypothesis holds, then the p-value is the
probability that a standard normal random variable takes on a value bigger
than or equal to X_obs. Call this p-value p_obs.

If p_obs < alpha, then you *** reject *** the null hypothesis.
If p_obs >= alpha, then you *** do not reject *** the null hypothesis.

DEFINITION. Power is the probability that you will reject the null
hypothesis, given that some alternative hypothesis is true. In symbols,

  power = prob(reject H_0 | H_A is true) = prob(p_obs < alpha | H_A).

Another way to say this:

  power = prob(p_obs < alpha | X ~ N(mu, 1)).

Notation: c_alpha = the value such that if Z ~ N(0, 1), then
prob(Z > c_alpha) = alpha. For example, if alpha = 0.05, then
c_alpha = 1.645. (See normal tables.)

So another way to express power is:

  power = prob(X > c_alpha | X ~ N(mu, 1)).

This still is not quite enough that we can actually compute power. But
note that

  prob(X > c_alpha) = prob(X - mu > c_alpha - mu).

This follows just from subtracting mu from both sides of the inequality
inside the parentheses. But if X ~ N(mu, 1), then X - mu ~ N(0, 1). Let
W = X - mu. Then W ~ N(0, 1) if the alternative hypothesis is true. So
what you conclude from this is that

  power = prob(W > c_alpha - mu | W ~ N(0, 1)).

This now does make it possible to compute power.

Here is an example. Say the significance level alpha is 0.01. Then
c_alpha = 2.32. Say the alternative hypothesis is H_A: X ~ N(2.5, 1).
Then

  power = prob(X - mu > c_alpha - mu) = prob(W > c_alpha - mu | W ~ N(0, 1)).

But c_alpha - mu = 2.32 - 2.50 = -0.18. Thus

  power = prob(W > -0.18 | W ~ N(0, 1)).

You can look this up in the standard normal tables. You find that
prob(W < -0.18) = 0.4286. Therefore

  power = prob(W > -0.18) = 1 - 0.4286 = .5714.

In words: the probability of rejecting the null hypothesis, given that
X ~ N(2.5, 1), is .5714.

Example: Suppose the alternative hypothesis is X ~ N(2.0, 1) and the
significance level is alpha = 0.025. What is the power?

Answer: c_alpha = 1.96. The power is:

  power = prob(W > c_alpha - mu) = prob(W > 1.96 - 2.00) = prob(W > -0.04),

where W has a standard normal distribution, W ~ N(0, 1). From the normal
tables, you find prob(W < -0.04) = .4840, so the power is:

  power = prob(W > -0.04) = 1 - .4840 = .5160.

Essentially, what you do to compute power is start with a random variable
X having a distribution under the alternative hypothesis, and try to find
a transformation of X which has a known, tabled distribution. You must
correspondingly transform the inequality which X must satisfy. This makes
it possible to compute the power.
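Aside: the two worked examples can be reproduced by computer. A minimal
Python sketch, assuming scipy (the function name here is mine, not
standard), of power = prob(W > c_alpha - mu):

  # Sketch: power of the one-sided test of H_0: mu = 0 when X ~ N(mu, 1).
  from scipy.stats import norm

  def power_normal_mean(mu, alpha):
      c_alpha = norm.isf(alpha)     # value with prob(Z > c_alpha) = alpha
      return norm.sf(c_alpha - mu)  # prob(W > c_alpha - mu), W ~ N(0, 1)

  print(power_normal_mean(2.5, 0.01))   # about 0.569 (the .5714 above
                                        # comes from rounding c_alpha to 2.32)
  print(power_normal_mean(2.0, 0.025))  # about 0.516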
=============================================================================

Now we complicate things a little bit. The null hypothesis is:

  H_0: X ~ N(0, s^2).

The alternative hypothesis is:

  H_A: X ~ N(mu, s^2).

The question again is: what is the power? That is, what is

  power = prob(p-value < alpha | H_A is true)?

The test which you use to decide whether or not to reject H_0 this time is
based on

        X
  U = ----- .
        s

Note that under the null hypothesis, U has the distribution N(0, 1). So
what we are asking is: what is prob(U > c_alpha | H_A is true)?

What we have to do is work with the inequality U > c_alpha and end up with
an equivalent inequality which has a random variable on the left side that
has distribution N(0, 1), provided the alternative hypothesis is true.
Here is how we do this. I will write <===> to mean "is equivalent to".

  U > c_alpha  <===>  X = U * s > c_alpha * s.
                                        [multiply both sides by s]

Then

  X > c_alpha * s  <===>  X - mu > c_alpha * s - mu.
                                        [subtract mu from both sides]

Then

  X - mu > c_alpha * s - mu  <===>  (X - mu)/s > (c_alpha * s - mu)/s.
                                        [divide both sides by s]

Finally, you have

  X - mu
  ------- > c_alpha - mu/s,
     s

an inequality equivalent to the original expression. But now, under the
alternative hypothesis, the left side has distribution N(0, 1). That is,
if I let

        X - mu
  Z = ---------- ,
           s

then Z ~ N(0, 1). Therefore the question of what the power is has been
transformed to: what is prob(Z > c_alpha - mu/s), where Z ~ N(0, 1)?

The right side of the inequality is computable: we know alpha, mu, and s.
So in theory, all we have to do is compute it and then look up the answer
using the normal tables.

Here is an example. Say the alternative hypothesis is X ~ N(3, 4). That
is, mu = 3 and s = 2. Say alpha = 0.05. What is the power?

Answer:

  power = prob(Z > c_alpha - mu/s)
        = prob(Z > 1.645 - 3/2)
        = prob(Z > 1.645 - 1.500)
        = prob(Z > .145)

The normal table says prob(Z < .145) = .5572, so the power is:

  power = 1 - .5572 = .4428.

All of the above concerns power for studies in which the outcome variable
is a continuous measure with a normal distribution. There are lots of
studies in which the outcome variable is NOT normally distributed. For
example, consider a study of a resuscitation method for use in hospital
emergency rooms, in which the outcome variable is survival (vs. death).
You would probably assign a numeric code for the outcome, like X = 1
indicates survival, X = 0 indicates death. The random variable X is very
definitely not normally distributed.

However, there is a deep and famous theorem in probability theory (the
Central Limit Theorem) which says: suppose you have independent outcome
variables X1, X2, ..., Xn for each of n people, and each of X1, X2, ...,
Xn has the same distribution with the same mean value mu and standard
deviation s. Let XBAR be the average of X1, X2, ..., Xn - that is,

  XBAR = (X1 + X2 + ... + Xn) / n.

Then XBAR has an **approximately** normal distribution with mean equal to
mu and standard deviation equal to s/sqrt(n). You can transform XBAR to
have approximately the standard normal distribution N(0, 1) by letting

         XBAR - mu
  Z = -------------- .
      (s / sqrt(n))

That is, Z ~a N(0, 1), where "~a" is defined as "approximately distributed
as".

Here is how this gets used for dichotomous endpoints like survival as
described above. You let X1 = 1 if the person survived and X1 = 0 if the
person died. You do the same for X2, X3, ..., Xn. You compute the average
XBAR. Since the survivors have X's equal to 1 and the people who die have
X = 0, the sum of the X's is just the number of survivors in your sample.
The average is the ***proportion*** who survived. You assume at the
beginning that each person has some probability, p, of surviving. As
mentioned earlier in these notes, the standard deviation of each of the
X's is sqrt(p*(1 - p)). Therefore the standard deviation of the proportion
who survive is

  s = sqrt(p * (1 - p) / n).
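Aside: you can watch the theorem work by simulation. A minimal Python
sketch (numpy assumed; the values of p and n are arbitrary illustrations)
showing that the standardized proportion behaves approximately like
N(0, 1), with the standard deviation sqrt(p*(1 - p)/n) just derived:

  # Sketch: the sample proportion XBAR of n Bernoulli(p) outcomes,
  # standardized by its mean p and sd sqrt(p*(1-p)/n), is approximately
  # standard normal.
  import numpy as np

  rng = np.random.default_rng(0)
  p, n, n_sims = 0.6, 100, 50_000

  xbar = rng.binomial(n, p, n_sims) / n        # simulated proportions
  z = (xbar - p) / np.sqrt(p * (1 - p) / n)    # standardized proportions

  print(z.mean(), z.std())   # close to 0 and 1
  print(np.mean(z > 1.645))  # close to 0.05, as for a true N(0, 1)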
So here is how you would compute power for a simple survival study in
which you want to compare proportion survival in a given sample of n
people with a fixed hypothesized proportion:

  1. Assume significance level alpha = .025.
  2. Assume n = 100 people will be studied.
  3. Null hypothesis: assume the probability of surviving is p_0 = 0.50.
  4. Alternative hypothesis: assume the probability of surviving is
     p_A = 0.60.

Let Y be the proportion who survive in your sample of 100 people. Under
the null hypothesis, the expectation of Y is p_0. Under the alternative
hypothesis, the expectation of Y is 0.60. The standard deviation of Y
under H_0 is:

  sqrt(.5 * .5 / 100) = 0.05.

Let X = Y - .5. Then the expectation of X is 0. Its standard deviation is
the same as that of Y, that is, 0.05. Then the null hypothesis and the
alternative hypothesis can be stated in terms of X as follows:

  H_0: mean of X is 0
  H_A: mean of X is .10

Now, as above, define W = X / s = X / .05. The null hypothesis and
alternative hypothesis can now be re-stated in terms of W as:

  H_0: mean of W is 0
  H_A: mean of W is .10/.05 = 2.0

Let Z = W - 2.0. Then under H_A, Z ~a N(0, 1). Since Z has approximately a
normal distribution with standard deviation 1, you can compute power using
the methods above. Since alpha = .025, c_alpha = 1.96.

  prob(W > c_alpha | H_A) = prob(W - 2.0 > c_alpha - 2.0)
                          = prob(Z > 1.96 - 2.0)
                          = prob(Z > -0.04)
                          = .5160.

So the power in this case is about equal to 52%.
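Aside: the same arithmetic in code form. A minimal Python sketch, assuming
scipy (the function name is mine), of the one-sample proportion power
computation just carried out:

  # Sketch: power for testing H_0: p = p0 against H_A: p = pA with a
  # sample of n people, using the normal approximation above. As in the
  # text, the standardization uses the null sd sqrt(p0*(1-p0)/n).
  from scipy.stats import norm

  def one_sample_prop_power(p0, pA, n, alpha):
      s = (p0 * (1 - p0) / n) ** 0.5   # sd of the proportion under H_0
      c_alpha = norm.isf(alpha)
      shift = (pA - p0) / s            # mean of W under H_A
      return norm.sf(c_alpha - shift)  # prob(Z > c_alpha - shift)

  print(one_sample_prop_power(0.50, 0.60, 100, 0.025))  # about 0.516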
==================================================================================================

Now a still more complicated situation: Comparison of two samples.

Assume you have two drugs, A and B. Assume that you randomize N people to
each drug. Your outcome is success or failure. The probability of success
for drug A is p_A. The probability of success for drug B is p_B. The null
hypothesis and the alternative hypothesis are specified as follows:

  H_0: p_A - p_B = 0.
  H_A: p_A - p_B = d > 0.

Now some technical notation is needed. Assume that the random variable X
has a standard normal distribution - that is, X ~ N(0, 1). For c a number
between 0 and 1, define Z_c to be the number such that prob(X > Z_c) = c.
For example: say c = 0.025. Then Z_c = 1.96, because the probability that
a standard normal random variable is greater than 1.96 is 0.025.

Back to your clinical trial. Assume you randomize N people to each of drug
A and drug B. Assume your significance level is alpha. Let p_A and p_B be
the success probabilities under the alternative hypothesis, and let pbar
be the average of p_A and p_B. Let d = abs(p_A - p_B). Then power is the
probability that a standard normal random variable X will exceed W, where

        Z_c * sqrt(2*pbar*(1 - pbar)/N) - d
  W = ----------------------------------------- ,
      sqrt{[p_A*(1 - p_A) + p_B*(1 - p_B)]/N}

and where c = alpha/2. This looks horrendous, no question about it, and
deriving this formula requires some algebra and some rather subtle
thinking. And it is an approximation. But it's not so hard to use.

Here is an example. Suppose you are going to randomize 60 people to each
of drug A and drug B. Under the null hypothesis, you assume
p_A = p_B = .50 - that is, there is a 50% success rate for each drug.
Under the alternative hypothesis, you assume p_A = .40 and p_B = .60. Then
the absolute difference d between the success rates is d = 0.20. Note that
d is also known as the "effect size" or "treatment effect". Assume a
significance level alpha = 0.05. Thus c = 0.025, and Z_c = 1.96. Note that
pbar = (.40 + .60)/2 = .50.

We now have all the ingredients to compute W in the formula above:

        Z_c * sqrt(2*pbar*(1 - pbar)/N) - d
  W = -----------------------------------------
      sqrt{[p_A*(1 - p_A) + p_B*(1 - p_B)]/N}

        1.96 * sqrt(2 * .5 * .5 / 60) - 0.20
    = -----------------------------------------  =  -0.236
           sqrt{[.4 * .6 + .6 * .4]/60}

Now what you need is the probability that a standard normal random
variable is larger than -0.236. From tables, you find that this is
approximately 0.593. The power of the clinical trial is therefore
approximately 59%. That is, with the specified alternative hypothesis and
a sample size of 60 in each group, the probability of obtaining a
two-sided p-value less than alpha = 0.05 is about 0.59.

-----------------------------------------------------------------------------------------
Related issue: proportions and chi-square statistics
-----------------------------------------------------------------------------------------

Assume you carry out a clinical trial of the ever-popular Drug A versus
Drug B. Assume a dichotomous endpoint (success vs. failure). You randomize
m people to each of the two drugs. The data are represented as:

                Drug A   Drug B
              ---------------------
              |        |        |
   Success    |   a    |   b    |   n1
              |        |        |
              ---------------------
              |        |        |
   Failure    |   c    |   d    |   n2
              |        |        |
              ---------------------
                  m1       m2       N

You can test for whether there is a difference in success rates between
the two drugs by using the chi-square statistic,

        N * (ad - bc)^2
  X2 = -------------------
       m1 * m2 * n1 * n2

and comparing it to the chi-square distribution table for 1 degree of
freedom. However, you could also test for a difference by first computing
the proportions of successes in each drug group:

                         Number of    Proportion of
  Drug Group   Total     Successes    Successes
  ----------   -----     ---------    -------------
      A         m1          a         p1 = a/m1
      B         m2          b         p2 = b/m2

The difference in success rates between the drug groups is d = p1 - p2.
Under the null hypothesis, p1 and p2 would have the same expectation. To
approximate a common probability of success in the two groups, we let
pbar = (a + b)/(m1 + m2) = (a + b)/N. An approximate standard error for d
is:

  s.e.(d) = sqrt(pbar*(1 - pbar)/m1 + pbar*(1 - pbar)/m2)
          = sqrt(pbar*(1 - pbar)*(1/m1 + 1/m2)).

The null hypothesis here is that the true proportions of success in the
two groups are equal. This can also be stated as: H_0: the expectation of
d is 0. This can be tested by computing a statistic which is approximately
normally distributed, as follows:

  Z = d / s.e.(d),

where the approximate standard error of d is as shown above. As it turns
out, the chi-square statistic X2 discussed above exactly equals the square
of the normal statistic Z.

4. How do you compute sample size?

People who want to carry out a clinical trial need to know how many people
they should randomize. This is essentially the inverse problem to
computing power. That is, given the power that you want, you can solve the
equation

        Z_c * sqrt(2*pbar*(1 - pbar)/N) - d
  W = -----------------------------------------
      sqrt{[p_A*(1 - p_A) + p_B*(1 - p_B)]/N}

for N. First you set W equal to -Z_b, where b = 1 - power. (Why the minus
sign? The power is the probability that a standard normal random variable
exceeds W, and a standard normal random variable exceeds -Z_b with
probability 1 - b = power.) Then the solution for N is:

      {Z_c * sqrt(2*pbar*(1 - pbar)) + Z_b * sqrt(p_A*(1 - p_A) + p_B*(1 - p_B))}^2
  N = -----------------------------------------------------------------------------
                                         d^2

It is simple algebra to go from the formula for W to the formula for N.
Simple, but tedious.
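Aside: both formulas are easy to mechanize. A minimal Python sketch,
assuming scipy (the function names are mine); the first check reproduces
the W = -0.236 power example above, and the second previews the worked
sample-size example that follows:

  # Sketch: normal-approximation power and per-group sample size for
  # comparing two proportions, following the formulas above.
  from scipy.stats import norm

  def two_prop_power(pA, pB, N, alpha):
      pbar = (pA + pB) / 2
      d = abs(pA - pB)
      z_c = norm.isf(alpha / 2)     # c = alpha/2 for a two-sided test
      num = z_c * (2 * pbar * (1 - pbar) / N) ** 0.5 - d
      den = ((pA * (1 - pA) + pB * (1 - pB)) / N) ** 0.5
      return norm.sf(num / den)     # prob(standard normal > W)

  def two_prop_sample_size(pA, pB, alpha, power):
      pbar = (pA + pB) / 2
      d = abs(pA - pB)
      z_c = norm.isf(alpha / 2)
      z_b = norm.isf(1 - power)     # Z_b with b = 1 - power
      num = (z_c * (2 * pbar * (1 - pbar)) ** 0.5 +
             z_b * (pA * (1 - pA) + pB * (1 - pB)) ** 0.5) ** 2
      return num / d ** 2           # N per group

  print(two_prop_power(0.40, 0.60, 60, 0.05))          # about 0.59
  print(two_prop_sample_size(0.75, 0.50, 0.05, 0.85))  # about 65.8; the
      # hand calculation below rounds Z_.15 to 1.04 and gets 65.9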
Here is an example. Suppose you want to carry out a clinical trial of drug
A versus drug B, and for your alternative hypothesis you assume that the
success rate for drug A is 75% and that for drug B is 50%. You want equal
sample sizes in both groups. You assume a two-sided significance level
alpha = 0.05. You want your statistical power to be 85%.

The effect size is d = .75 - .50 = .25. Again, Z_c = 1.96. For Z_b, note
that b = 1 - power = 1 - .85 = .15. From tables, you find that
Z_.15 = 1.04. You find pbar = (.50 + .75)/2 = .625. Putting all this in
the formula above for N, you get:

      {1.96*sqrt(2*.625*.375) + 1.04*sqrt(.75*.25 + .5*.5)}^2
  N = ---------------------------------------------------------  =  65.9
                             (.25)^2

So you need about 66 people in each of groups A and B.

One important thing about the sample size formula. Notice that the
difference d in success rates between the two groups occurs in the
denominator, and it is squared. What this means is that the size of d has
enormous leverage in determining the sample size. In the example above,
d = .25, and d^2 = .0625, and the inverse of .0625 is 16. If instead
d = .20, then d^2 = .04 and its inverse is 25. If d = .1, the inverse of
d^2 is 100. So the effect of decreasing d is to dramatically increase the
sample size.

In fact it is useful to know how the sample size changes as a function of
the other variables:

  1. If the effect size d decreases, sample size increases.
  2. If power increases, sample size increases.
  3. If significance level alpha decreases, sample size increases.

The approximate formulas above for sample size and power apply to studies
where the results are judged according to the proportions of successes in
each of two equal-sized groups. There are many other variations. You may
want to plan a study with more than two groups. This increases the sample
size. You may want to consider an outcome which is quantitative (like
change in blood pressure) rather than success proportions. The sample size
formula is similar, but not the same. A good reference for this situation
is: Statistical Methods, by Snedecor and Cochran (Iowa State University
Press, 1967). You may want to consider a clinical trial in which the
outcome is measured as the time to an event (e.g., time to having a heart
attack or death, whichever comes first). There is a different sample size
formula for this.

The common factors in all sample size formulas are significance level,
power, and the expected size of the treatment effects under the
alternative hypothesis. These are the key ingredients.

There are good references for sample size and power. One of the best for
dichotomous outcomes is the book Statistical Methods for Rates and
Proportions, 3rd Edition, by Joseph Fleiss et al. (Wiley Interscience,
2003). See also the web site:

  http://obssr.od.nih.gov/Conf_Wkshp/RCT03/Lectures/Catellier_Sample_Size.pdf

Last date revised: February 13, 2007