HSEM 3010 notes 004                                        February 11, 2007

Statistical Power and Sample Size
---------------------------------

Statistics is full of technical terms which in ordinary nontechnical
language have emotional content: terms like 'significance' and 'influence'
and 'normal'. 'Power' is one such term. It has nothing to do with political
power (as in the power of a dictator, or of an abusive family member) or
with power in physics (the rate of doing work; watts, etc.) or with muscle
power. Briefly, it is the probability that an experiment will have a
statistically significant outcome, given a hypothesis about the true
underlying relationship of the treatment factors to the endpoints.

Some other concepts and terminology must be explained first.

1. What is a p-value?

Assume you are carrying out an experiment to test a certain hypothesis
(which here will be called the NULL HYPOTHESIS, H_0). The null hypothesis
is that there is no true difference between two treatments, A and B. You
randomize some people to either Treatment A or Treatment B. You select a
measure X as the outcome. You collect data on X for each person in your
study. You find the mean value of X for treatment A and the mean value for
treatment B. Call these two values XBARA and XBARB. You compute the
difference, XBARA - XBARB. These are SAMPLE MEANS. The null hypothesis says
that the TRUE means, MU_A and MU_B, are the same. That is, H_0: MU_A = MU_B.

The outcome X for each person is a random variable. The sample means XBARA
and XBARB are random variables. That is, before you do the study, you
cannot say what XBARA and XBARB will be. They are unpredictable. If you do
the whole study over again, you will almost certainly get different values
for XBARA and XBARB. And, of course, XBARA will almost never equal the
'true' value MU_A. Similarly, XBARB in general will not equal the true
mean MU_B. Sometimes the difference between XBARA and XBARB is small.
Sometimes it is large.

A (two-sided) p-value for a given study is the probability that, if the
null hypothesis is true, you would get an absolute difference between the
two sample means as large as, or larger than, what you actually observed.
If the p-value is very small, that is an indication that, if the null
hypothesis is true, you have observed an unusual event. If the p-value is
large (e.g., p = .333), then you have observed an event which, if the null
hypothesis is true, is not particularly unlikely.

You take certain actions or make certain decisions depending on the size
of the p-value. If it is really small, you decide to REJECT the null
hypothesis. That is, the results of the experiment are so unlikely if the
null hypothesis is true that you decide it must be false. If the p-value
is large, you do NOT reject the null hypothesis.

Statisticians do not ever claim to have PROVED the null hypothesis, no
matter how large the p-value is. A large p-value is often evidence that
the difference between the true means is not very large. But it is never
PROOF, in the mathematical sense that you can, for example, prove there
are infinitely many prime numbers. Similarly, a very small p-value does
not DISPROVE the null hypothesis, even though people act like it does.
Even if p = 0.000001, there is a small probability that, if the null
hypothesis is true, you have observed an unusual event with a large
difference between the sample mean values XBARA and XBARB.

A very common cutoff for deciding when you reject the null hypothesis is
p < 0.05. This is widely accepted as 'significant' in medical studies.
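Aside: if you like to check things by computer, here is a minimal Python
sketch of the two-sided p-value just described. It is an illustration
only: it assumes the outcome X has a KNOWN common standard deviation s in
both groups (so the standardized difference is standard normal under H_0),
and the sample sizes and observed means below are made up. The scipy
library is assumed.

  # Sketch: two-sided p-value for the difference of two sample means,
  # assuming (for illustration only) a known common standard deviation s,
  # so that under H_0 the standardized difference is ~ N(0, 1).
  from scipy.stats import norm

  def two_sided_p_value(xbar_a, xbar_b, s, n_a, n_b):
      # Standard error of XBARA - XBARB under the null hypothesis
      se = s * (1.0 / n_a + 1.0 / n_b) ** 0.5
      z = (xbar_a - xbar_b) / se
      # Probability, under H_0, of an absolute difference at least
      # this large
      return 2 * norm.sf(abs(z))

  # Hypothetical numbers: 50 people per arm, s = 10, means 52 and 48
  print(two_sided_p_value(52.0, 48.0, 10.0, 50, 50))   # about 0.046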
It is a fact that, under the null hypothesis, p-values have what is called
a uniform distribution. What this means is that a p-value has an equal
chance of taking on any value between zero and one. More precisely, the
probability that a p-value is less than any given number, t, between 0 and
1, is equal to t itself. For example, the probability that a p-value is
less than .333 is .333; the probability that it is less than .777 is .777.
Again, all this is assuming the null hypothesis is true.

The probability that p < 0.05 is 0.05, which happens to equal 1/20. That
is, the probability of obtaining a 'significant' p-value, if the null
hypothesis is true, is 1 in 20. Not really all that unlikely.

Some statisticians think that saying 0.05 is the threshold for
significance is arbitrary, silly, and unscientific. Why should one
threshold work for all sorts of decisions? In one case, you may be using
it to decide which of two brands of popsicles tastes better. In another
case, you may be using it to decide whether a certain cancer drug is more
or less toxic than the standard therapy. In the latter case, the decision
could affect patients' chances of surviving chemotherapy. You might want a
stronger degree of certainty. R. A. Fisher, the most famous statistician
of the 20th century, actually did not like using 0.05 as a threshold. He
preferred 0.01. I assume he wanted to be more certain about rejecting null
hypotheses.

2. What is power?

You start with a test statistic (like the chi-square statistic, or the
t-statistic) and a level of significance, alpha. You specify a null
hypothesis. If your experiment (or study, or clinical trial) gives you a
p-value below alpha, you reject the null hypothesis. That's the test
procedure.

Now suppose that some hypothesis other than the null hypothesis is true.
This is called the alternative hypothesis, H_A. If the null hypothesis is:

  H_0: MU_A = MU_B (or equivalently, H_0: MU_A - MU_B = 0),

then the alternative hypothesis might be:

  H_A: MU_A - MU_B = 2.0

In general, a two-sided test is going to tend to have smaller p-values if
the alternative hypothesis is true than if the null hypothesis is true.
The probability that, if the alternative hypothesis is true, the observed
p-value is less than the specified significance level alpha, is called
POWER. In symbols, you can say

  power = prob(p-value < alpha | H_A),

that is, the power is the probability that your test statistic gives a
p-value less than your significance level alpha, given that the
alternative hypothesis is true.

So: power depends on the alternative hypothesis. The farther away the
alternative hypothesis is from the null hypothesis, the greater the power.
Power also depends on the significance level, alpha. The smaller alpha is,
the [greater, smaller] the power is. [Choose one!] Finally, power depends
on sample size. If the alternative hypothesis happens to be true, and you
carry out a clinical trial with a lot of participants (say, 10,000), you
have a better chance of getting a small p-value than if you do a small
clinical trial (with, say, 20 participants).

For some study designs and some statistical tests, it is possible to write
down a formula for computing power as a function of sample size, alpha,
and the alternative hypothesis.
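Aside: both facts above can be checked by simulation - that under H_0 a
p-value lands below alpha with probability alpha, and that under an
alternative it does so with probability equal to the power. A minimal
Python sketch (numpy and scipy assumed; the one-sided test and the values
of alpha and mu here are arbitrary illustrations):

  # Sketch: (a) under H_0 the p-value is uniform, so the fraction of
  # p-values below alpha is about alpha; (b) under an alternative that
  # fraction is larger -- it is the power. One-sided test of H_0: mu = 0
  # based on a single draw X ~ N(mu, 1).
  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(0)
  alpha, mu_alt, n_sims = 0.05, 2.0, 100_000

  x_null = rng.normal(0.0, 1.0, n_sims)    # draws under H_0
  x_alt = rng.normal(mu_alt, 1.0, n_sims)  # draws under H_A

  p_null = norm.sf(x_null)  # one-sided p-values
  p_alt = norm.sf(x_alt)

  print(np.mean(p_null < alpha))  # close to alpha = 0.05
  print(np.mean(p_alt < alpha))   # close to the power (about 0.64 here)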
3. How do you compute power?

Here is perhaps the simplest situation where you can compute power.
Suppose you have a random variable X which has a normal distribution with
standard deviation 1, and an unknown mean, mu. That is, X ~ N(mu, 1).

You consider two hypotheses: the null hypothesis, H_0, and an alternative
hypothesis, H_A.

  H_0: mu = 0. That is, X ~ N(0, 1).
  H_A: mu = some nonzero value. Then X ~ N(mu, 1).

You specify a significance level, alpha. You take a random sample value
from X's distribution: say, X_obs. If the probability is computed under
the assumption that the null hypothesis holds, then the p-value is the
probability that a standard normal random variable takes on a value bigger
than or equal to X_obs. Call this p-value p_obs.

If p_obs < alpha, then you *** reject *** the null hypothesis.
If p_obs >= alpha, then you *** do not reject *** the null hypothesis.

DEFINITION. Power is the probability that you will reject the null
hypothesis, given that some alternative hypothesis is true. In symbols,

  power = prob(reject H_0 | H_A is true) = prob(p_obs < alpha | H_A).

Another way to say this:

  power = prob(p_obs < alpha | X ~ N(mu, 1)).

Notation: c_alpha = the value such that if Z ~ N(0, 1), then
prob(Z > c_alpha) = alpha. For example, if alpha = 0.05, then
c_alpha = 1.645. (See normal tables.)

So another way to express power is:

  power = prob(X > c_alpha | X ~ N(mu, 1)).

This still is not quite enough that we can actually compute power. But
note that

  prob(X > c_alpha) = prob(X - mu > c_alpha - mu).

This follows just from subtracting mu from both sides of the inequality
inside the parentheses. But if X ~ N(mu, 1), then X - mu ~ N(0, 1). Let
W = X - mu. Then W ~ N(0, 1) if the alternative hypothesis is true. So
what you conclude from this is that

  power = prob(W > c_alpha - mu | W ~ N(0, 1)).

This now does make it possible to compute power.

Here is an example. Say the significance level alpha is 0.01. Then
c_alpha = 2.32. Say the alternative hypothesis is H_A: X ~ N(2.5, 1).
Then

  power = prob(X - mu > c_alpha - mu) = prob(W > c_alpha - mu | W ~ N(0, 1)).

But c_alpha - mu = 2.32 - 2.50 = -0.18. Thus

  power = prob(W > -0.18 | W ~ N(0, 1)).

You can look this up in the standard normal tables. You find that
prob(W < -0.18) = 0.4286. Therefore

  power = prob(W > -0.18) = 1 - 0.4286 = .5714.

In words: the probability of rejecting the null hypothesis, given that
X ~ N(2.5, 1), is .5714.

Example: Suppose the alternative hypothesis is X ~ N(2.0, 1) and the
significance level is alpha = 0.025. What is the power?

Answer: c_alpha = 1.96. The power is:

  power = prob(W > c_alpha - mu) = prob(W > 1.96 - 2.00) = prob(W > -0.04),

where W has a standard normal distribution, W ~ N(0, 1). From the normal
tables, you find prob(W < -0.04) = .4840, so the power is:

  power = prob(W > -0.04) = 1 - .4840 = .5160.

Essentially, what you do to compute power is start with a random variable
X having a distribution under the alternative hypothesis, and try to find
a transformation of X which has a known, tabled distribution. You must
correspondingly transform the inequality which X must satisfy. This makes
it possible to compute the power.
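Aside: the two worked examples can be reproduced by computer. A minimal
Python sketch, assuming scipy (the function name here is mine, not
standard), of power = prob(W > c_alpha - mu):

  # Sketch: power of the one-sided test of H_0: mu = 0 when X ~ N(mu, 1).
  from scipy.stats import norm

  def power_normal_mean(mu, alpha):
      c_alpha = norm.isf(alpha)     # value with prob(Z > c_alpha) = alpha
      return norm.sf(c_alpha - mu)  # prob(W > c_alpha - mu), W ~ N(0, 1)

  print(power_normal_mean(2.5, 0.01))   # about 0.569 (the .5714 above
                                        # comes from rounding c_alpha to 2.32)
  print(power_normal_mean(2.0, 0.025))  # about 0.516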
=============================================================================

Now we complicate things a little bit. The null hypothesis is:

  H_0: X ~ N(0, s^2).

The alternative hypothesis is:

  H_A: X ~ N(mu, s^2).

The question again is: what is the power? That is, what is

  power = prob(p-value < alpha | H_A is true)?

The test which you use to decide whether or not to reject H_0 this time is
based on

        X
  U = ----- .
        s

Note that under the null hypothesis, U has the distribution N(0, 1). So
what we are asking is: what is prob(U > c_alpha | H_A is true)?

What we have to do is work with the inequality U > c_alpha and end up with
an equivalent inequality which has a random variable on the left side that
has distribution N(0, 1), provided the alternative hypothesis is true.
Here is how we do this. I will write <===> to mean "is equivalent to".

  U > c_alpha  <===>  X = U * s > c_alpha * s.
                                        [multiply both sides by s]

Then

  X > c_alpha * s  <===>  X - mu > c_alpha * s - mu.
                                        [subtract mu from both sides]

Then

  X - mu > c_alpha * s - mu  <===>  (X - mu)/s > (c_alpha * s - mu)/s.
                                        [divide both sides by s]

Finally, you have

  X - mu
  ------- > c_alpha - mu/s,
     s

an inequality equivalent to the original expression. But now, under the
alternative hypothesis, the left side has distribution N(0, 1). That is,
if I let

        X - mu
  Z = ---------- ,
           s

then Z ~ N(0, 1). Therefore the question of what the power is has been
transformed to: what is prob(Z > c_alpha - mu/s), where Z ~ N(0, 1)?

The right side of the inequality is computable: we know alpha, mu, and s.
So in theory, all we have to do is compute it and then look up the answer
using the normal tables.

Here is an example. Say the alternative hypothesis is X ~ N(3, 4). That
is, mu = 3 and s = 2. Say alpha = 0.05. What is the power?

Answer:

  power = prob(Z > c_alpha - mu/s)
        = prob(Z > 1.645 - 3/2)
        = prob(Z > 1.645 - 1.500)
        = prob(Z > .145)

The normal table says prob(Z < .145) = .5572, so the power is:

  power = 1 - .5572 = .4428.

All of the above concerns power for studies in which the outcome variable
is a continuous measure with a normal distribution. There are lots of
studies in which the outcome variable is NOT normally distributed. For
example, consider a study of a resuscitation method for use in hospital
emergency rooms, in which the outcome variable is survival (vs. death).
You would probably assign a numeric code for the outcome, like X = 1
indicates survival, X = 0 indicates death. The random variable X is very
definitely not normally distributed.

However, there is a deep and famous theorem in probability theory (the
Central Limit Theorem) which says: suppose you have independent outcome
variables X1, X2, ..., Xn for each of n people, and each of X1, X2, ...,
Xn has the same distribution with the same mean value mu and standard
deviation s. Let XBAR be the average of X1, X2, ..., Xn - that is,

  XBAR = (X1 + X2 + ... + Xn) / n.

Then XBAR has an **approximately** normal distribution with mean equal to
mu and standard deviation equal to s/sqrt(n). You can transform XBAR to
have approximately the standard normal distribution N(0, 1) by letting

         XBAR - mu
  Z = -------------- .
      (s / sqrt(n))

That is, Z ~a N(0, 1), where "~a" is defined as "approximately distributed
as".

Here is how this gets used for dichotomous endpoints like survival as
described above. You let X1 = 1 if the person survived and X1 = 0 if the
person died. You do the same for X2, X3, ..., Xn. You compute the average
XBAR. Since the survivors have X's equal to 1 and the people who die have
X = 0, the sum of the X's is just the number of survivors in your sample.
The average is the ***proportion*** who survived. You assume at the
beginning that each person has some probability, p, of surviving. As
mentioned earlier in these notes, the standard deviation of each of the
X's is sqrt(p*(1 - p)). Therefore the standard deviation of the proportion
who survive is

  s = sqrt(p * (1 - p) / n).
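Aside: you can watch the theorem work by simulation. A minimal Python
sketch (numpy assumed; the values of p and n are arbitrary illustrations)
showing that the standardized proportion behaves approximately like
N(0, 1), with the standard deviation sqrt(p*(1 - p)/n) just derived:

  # Sketch: the sample proportion XBAR of n Bernoulli(p) outcomes,
  # standardized by its mean p and sd sqrt(p*(1-p)/n), is approximately
  # standard normal.
  import numpy as np

  rng = np.random.default_rng(0)
  p, n, n_sims = 0.6, 100, 50_000

  xbar = rng.binomial(n, p, n_sims) / n        # simulated proportions
  z = (xbar - p) / np.sqrt(p * (1 - p) / n)    # standardized proportions

  print(z.mean(), z.std())   # close to 0 and 1
  print(np.mean(z > 1.645))  # close to 0.05, as for a true N(0, 1)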
So here is how you would compute power for a simple survival study in
which you want to compare proportion survival in a given sample of n
people with a fixed hypothesized proportion:

  1. Assume significance level alpha = .025.
  2. Assume n = 100 people will be studied.
  3. Null hypothesis: assume the probability of surviving is p_0 = 0.50.
  4. Alternative hypothesis: assume the probability of surviving is
     p_A = 0.60.

Let Y be the proportion who survive in your sample of 100 people. Under
the null hypothesis, the expectation of Y is p_0. Under the alternative
hypothesis, the expectation of Y is 0.60. The standard deviation of Y
under H_0 is:

  sqrt(.5 * .5 / 100) = 0.05.

Let X = Y - .5. Then the expectation of X is 0. Its standard deviation is
the same as that of Y, that is, 0.05. Then the null hypothesis and the
alternative hypothesis can be stated in terms of X as follows:

  H_0: mean of X is 0
  H_A: mean of X is .10

Now, as above, define W = X / s = X / .05. The null hypothesis and
alternative hypothesis can now be re-stated in terms of W as:

  H_0: mean of W is 0
  H_A: mean of W is .10/.05 = 2.0

Let Z = W - 2.0. Then under H_A, Z ~a N(0, 1). Since Z has approximately a
normal distribution with standard deviation 1, you can compute power using
the methods above. Since alpha = .025, c_alpha = 1.96.

  prob(W > c_alpha | H_A) = prob(W - 2.0 > c_alpha - 2.0)
                          = prob(Z > 1.96 - 2.0)
                          = prob(Z > -0.04)
                          = .5160.

So the power in this case is about equal to 52%.
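Aside: the same arithmetic in code form. A minimal Python sketch, assuming
scipy (the function name is mine), of the one-sample proportion power
computation just carried out:

  # Sketch: power for testing H_0: p = p0 against H_A: p = pA with a
  # sample of n people, using the normal approximation above. As in the
  # text, the standardization uses the null sd sqrt(p0*(1-p0)/n).
  from scipy.stats import norm

  def one_sample_prop_power(p0, pA, n, alpha):
      s = (p0 * (1 - p0) / n) ** 0.5   # sd of the proportion under H_0
      c_alpha = norm.isf(alpha)
      shift = (pA - p0) / s            # mean of W under H_A
      return norm.sf(c_alpha - shift)  # prob(Z > c_alpha - shift)

  print(one_sample_prop_power(0.50, 0.60, 100, 0.025))  # about 0.516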
==================================================================================================

Now a still more complicated situation: Comparison of two samples.

Assume you have two drugs, A and B. Assume that you randomize N people to
each drug. Your outcome is success or failure. The probability of success
for drug A is p_A. The probability of success for drug B is p_B. The null
hypothesis and the alternative hypothesis are specified as follows:

  H_0: p_A - p_B = 0.
  H_A: p_A - p_B = d > 0.

Now some technical notation is needed. Assume that the random variable X
has a standard normal distribution - that is, X ~ N(0, 1). For c a number
between 0 and 1, define Z_c to be the number such that prob(X > Z_c) = c.
For example: say c = 0.025. Then Z_c = 1.96, because the probability that
a standard normal random variable is greater than 1.96 is 0.025.

Back to your clinical trial. Assume you randomize N people to each of drug
A and drug B. Assume your significance level is alpha. Let p_A and p_B be
the success probabilities under the alternative hypothesis, and let pbar
be the average of p_A and p_B. Let d = abs(p_A - p_B). Then power is the
probability that a standard normal random variable X will exceed W, where

        Z_c * sqrt(2*pbar*(1 - pbar)/N) - d
  W = ----------------------------------------- ,
      sqrt{[p_A*(1 - p_A) + p_B*(1 - p_B)]/N}

and where c = alpha/2. This looks horrendous, no question about it, and
deriving this formula requires some algebra and some rather subtle
thinking. And it is an approximation. But it's not so hard to use.

Here is an example. Suppose you are going to randomize 60 people to each
of drug A and drug B. Under the null hypothesis, you assume
p_A = p_B = .50 - that is, there is a 50% success rate for each drug.
Under the alternative hypothesis, you assume p_A = .40 and p_B = .60. Then
the absolute difference d between the success rates is d = 0.20. Note that
d is also known as the "effect size" or "treatment effect". Assume a
significance level alpha = 0.05. Thus c = 0.025, and Z_c = 1.96. Note that
pbar = (.40 + .60)/2 = .50.

We now have all the ingredients to compute W in the formula above:

        Z_c * sqrt(2*pbar*(1 - pbar)/N) - d
  W = -----------------------------------------
      sqrt{[p_A*(1 - p_A) + p_B*(1 - p_B)]/N}

        1.96 * sqrt(2 * .5 * .5 / 60) - 0.20
    = -----------------------------------------  =  -0.236
           sqrt{[.4 * .6 + .6 * .4]/60}

Now what you need is the probability that a standard normal random
variable is larger than -0.236. From tables, you find that this is
approximately 0.593. The power of the clinical trial is therefore
approximately 59%. That is, with the specified alternative hypothesis and
a sample size of 60 in each group, the probability of obtaining a
two-sided p-value less than alpha = 0.05 is about 0.59.

-----------------------------------------------------------------------------------------
Related issue: proportions and chi-square statistics
-----------------------------------------------------------------------------------------

Assume you carry out a clinical trial of the ever-popular Drug A versus
Drug B. Assume a dichotomous endpoint (success vs. failure). You randomize
m people to each of the two drugs. The data are represented as:

                Drug A   Drug B
              ---------------------
              |        |        |
   Success    |   a    |   b    |   n1
              |        |        |
              ---------------------
              |        |        |
   Failure    |   c    |   d    |   n2
              |        |        |
              ---------------------
                  m1       m2       N

You can test for whether there is a difference in success rates between
the two drugs by using the chi-square statistic,

        N * (ad - bc)^2
  X2 = -------------------
       m1 * m2 * n1 * n2

and comparing it to the chi-square distribution table for 1 degree of
freedom. However, you could also test for a difference by first computing
the proportions of successes in each drug group:

                         Number of    Proportion of
  Drug Group   Total     Successes    Successes
  ----------   -----     ---------    -------------
      A         m1          a         p1 = a/m1
      B         m2          b         p2 = b/m2

The difference in success rates between the drug groups is d = p1 - p2.
Under the null hypothesis, p1 and p2 would have the same expectation. To
approximate a common probability of success in the two groups, we let
pbar = (a + b)/(m1 + m2) = (a + b)/N. An approximate standard error for d
is:

  s.e.(d) = sqrt(pbar*(1 - pbar)/m1 + pbar*(1 - pbar)/m2)
          = sqrt(pbar*(1 - pbar)*(1/m1 + 1/m2)).

The null hypothesis here is that the true proportions of success in the
two groups are equal. This can also be stated as: H_0: the expectation of
d is 0. This can be tested by computing a statistic which is approximately
normally distributed, as follows:

  Z = d / s.e.(d),

where the approximate standard error of d is as shown above. As it turns
out, the chi-square statistic X2 discussed above exactly equals the square
of the normal statistic Z.

4. How do you compute sample size?

People who want to carry out a clinical trial need to know how many people
they should randomize. This is essentially the inverse problem to
computing power. That is, given the power that you want, you can solve the
equation

        Z_c * sqrt(2*pbar*(1 - pbar)/N) - d
  W = -----------------------------------------
      sqrt{[p_A*(1 - p_A) + p_B*(1 - p_B)]/N}

for N. First you set W equal to -Z_b, where b = 1 - power. (Why the minus
sign? The power is the probability that a standard normal random variable
exceeds W, and a standard normal random variable exceeds -Z_b with
probability 1 - b = power.) Then the solution for N is:

      {Z_c * sqrt(2*pbar*(1 - pbar)) + Z_b * sqrt(p_A*(1 - p_A) + p_B*(1 - p_B))}^2
  N = -----------------------------------------------------------------------------
                                         d^2

It is simple algebra to go from the formula for W to the formula for N.
Simple, but tedious.
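Aside: both formulas are easy to mechanize. A minimal Python sketch,
assuming scipy (the function names are mine); the first check reproduces
the W = -0.236 power example above, and the second previews the worked
sample-size example that follows:

  # Sketch: normal-approximation power and per-group sample size for
  # comparing two proportions, following the formulas above.
  from scipy.stats import norm

  def two_prop_power(pA, pB, N, alpha):
      pbar = (pA + pB) / 2
      d = abs(pA - pB)
      z_c = norm.isf(alpha / 2)     # c = alpha/2 for a two-sided test
      num = z_c * (2 * pbar * (1 - pbar) / N) ** 0.5 - d
      den = ((pA * (1 - pA) + pB * (1 - pB)) / N) ** 0.5
      return norm.sf(num / den)     # prob(standard normal > W)

  def two_prop_sample_size(pA, pB, alpha, power):
      pbar = (pA + pB) / 2
      d = abs(pA - pB)
      z_c = norm.isf(alpha / 2)
      z_b = norm.isf(1 - power)     # Z_b with b = 1 - power
      num = (z_c * (2 * pbar * (1 - pbar)) ** 0.5 +
             z_b * (pA * (1 - pA) + pB * (1 - pB)) ** 0.5) ** 2
      return num / d ** 2           # N per group

  print(two_prop_power(0.40, 0.60, 60, 0.05))          # about 0.59
  print(two_prop_sample_size(0.75, 0.50, 0.05, 0.85))  # about 65.8; the
      # hand calculation below rounds Z_.15 to 1.04 and gets 65.9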
Here is an example. Suppose you want to carry out a clinical trial of drug
A versus drug B, and for your alternative hypothesis you assume that the
success rate for drug A is 75% and that for drug B is 50%. You want equal
sample sizes in both groups. You assume a two-sided significance level
alpha = 0.05. You want your statistical power to be 85%.

The effect size is d = .75 - .50 = .25. Again, Z_c = 1.96. For Z_b, note
that b = 1 - power = 1 - .85 = .15. From tables, you find that
Z_.15 = 1.04. You find pbar = (.50 + .75)/2 = .625. Putting all this in
the formula above for N, you get:

      {1.96*sqrt(2*.625*.375) + 1.04*sqrt(.75*.25 + .5*.5)}^2
  N = ---------------------------------------------------------  =  65.9
                             (.25)^2

So you need about 66 people in each of groups A and B.

One important thing about the sample size formula. Notice that the
difference d in success rates between the two groups occurs in the
denominator, and it is squared. What this means is that the size of d has
enormous leverage in determining the sample size. In the example above,
d = .25, and d^2 = .0625, and the inverse of .0625 is 16. If instead
d = .20, then d^2 = .04 and its inverse is 25. If d = .1, the inverse of
d^2 is 100. So the effect of decreasing d is to dramatically increase the
sample size.

In fact it is useful to know how the sample size changes as a function of
the other variables:

  1. If the effect size d decreases, sample size increases.
  2. If power increases, sample size increases.
  3. If significance level alpha decreases, sample size increases.

The approximate formulas above for sample size and power apply to studies
where the results are judged according to the proportions of successes in
each of two equal-sized groups. There are many other variations. You may
want to plan a study with more than two groups. This increases the sample
size. You may want to consider an outcome which is quantitative (like
change in blood pressure) rather than success proportions. The sample size
formula is similar, but not the same. A good reference for this situation
is: Statistical Methods, by Snedecor and Cochran (Iowa State University
Press, 1967). You may want to consider a clinical trial in which the
outcome is measured as the time to an event (e.g., time to having a heart
attack or death, whichever comes first). There is a different sample size
formula for this.

The common factors in all sample size formulas are significance level,
power, and the expected size of the treatment effects under the
alternative hypothesis. These are the key ingredients.

There are good references for sample size and power. One of the best for
dichotomous outcomes is the book Statistical Methods for Rates and
Proportions, 3rd Edition, by Joseph Fleiss et al. (Wiley Interscience,
2003). See also the web site:

  http://obssr.od.nih.gov/Conf_Wkshp/RCT03/Lectures/Catellier_Sample_Size.pdf

Last date revised: February 13, 2007