HSEM 3010  Spring 2007  Clinical Trials notes.001  Some statistical methods
==========================================================================================

1. Chi-square test for 2 x 2 tables:

Assume a clinical trial is conducted with a dichotomous endpoint (success or failure),
and the following data are observed:

                  Active Drug   Placebo
                -------------------------
                |           |           |
      Failure   |     a     |     b     |   n1
                |           |           |
                -------------------------
                |           |           |
      Success   |     c     |     d     |   n2
                |           |           |
                -------------------------
                     m1          m2          N

The column margins are m1 and m2.  The row margins are n1 and n2.  The total number of
people studied in this clinical trial is N = n1 + n2 = m1 + m2.  The numbers of people
within the cells are a, b, c, and d.  The column margins, m1 and m2, are approximately
equal (because people are randomly assigned with equal probability to either Active Drug
or Placebo).

The (uncorrected) chi-square statistic X2 is computed as follows:

              N * (a*d - b*c)^2
     X2  =  ---------------------
             m1 * m2 * n1 * n2

The Yates-corrected chi-square is:

              N * (|a*d - b*c| - N/2)^2
     X2c  =  ---------------------------
                 m1 * m2 * n1 * n2

In general, if the Active Drug does not produce different results from the Placebo, both
X2 and X2c are small.  If the proportion of failures in the Active Drug group is very
different from that in the Placebo group, X2 and X2c are large.  Note that X2 is always
larger than X2c.

If the Active Drug has no real effect, then X2 and X2c will have a distribution which is
approximately chi-square with 1 degree of freedom.  You can use tables to find a
probability associated with a given value of X2 or X2c.  This is the probability of
observing a difference in the proportions of failures in the two groups which is as large
as, or larger than, that which occurred in data from the clinical trial, assuming that the
Active Drug is really not different from the Placebo.

Example:
                  Active Drug   Placebo
                -------------------------
                |           |           |
      Failure   |    20     |    10     |    30
                |           |           |
                -------------------------
                |           |           |
      Success   |    80     |    90     |   170
                |           |           |
                -------------------------
                    100         100          200

Note that the failure rates in the two groups are:

     Active drug: 20%
     Placebo    : 10%

The failure rate is sometimes called RISK.  Thus the risk in the Active Drug group is 20%,
while the risk in the Placebo group is 10%.  RELATIVE RISK is defined as the quotient of
the two risks.  In this case, the relative risk is 0.20 / 0.10 = 2.00.  You can interpret
this as saying the risk of failure in the Active Drug group is twice as large as in the
Placebo group.

Here X2 = 3.922.  The corresponding probability is 0.0477.  And X2c = 3.176, with
corresponding probability 0.0747.  Verify the computations of X2 and X2c with a
calculator.  Check that the p-values agree with the class handout on the chi-square
distribution.

There is another test statistic for 2 x 2 tables: the Fisher Exact Test.  The method of
computation of the Fisher Exact Test is too complicated to describe here.  It also
produces a probability.  In this case, the Fisher Exact Test probability is:

     Fisher Exact Test probability: 0.0734.

Note that the Yates-corrected chi-square probability agrees closely with the Fisher Exact
Test probability.

Which of these should you use?  Statisticians actually don't completely agree on this.
However, most statisticians prefer to report the Fisher Exact Test value.  If they cannot
compute the Fisher Exact Test, they will usually prefer the Yates-corrected chi-square
probability.  All these probabilities are also referred to as "p-values".
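If you have access to Python, the following minimal sketch reproduces the X2, X2c, risk,
and relative-risk computations for the Example table above.  The p-values come from the
fact (discussed in Section 2) that a chi-square variable with 1 degree of freedom is the
square of a standard normal variable, so its tail probability is erfc(sqrt(x/2)).  This is
only an illustration of the formulas given above, not a replacement for the class handout.

------------------------------------------------------------------------------------------
import math

# Example 2 x 2 table from the notes (Active Drug vs Placebo)
a, b = 20, 10          # failures
c, d = 80, 90          # successes
n1, n2 = a + b, c + d  # row margins
m1, m2 = a + c, b + d  # column margins
N = n1 + n2            # total number of people

# Uncorrected and Yates-corrected chi-square statistics, as defined above
x2  = N * (a * d - b * c) ** 2 / (m1 * m2 * n1 * n2)
x2c = N * (abs(a * d - b * c) - N / 2) ** 2 / (m1 * m2 * n1 * n2)

def chi2_1df_pvalue(x):
    # If Z is standard normal, Z^2 has a chi-square distribution with 1 df,
    # so P(chi-square_1df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2))

risk_active, risk_placebo = a / m1, b / m2
print("Risk (Active Drug):", risk_active)                    # 0.20
print("Risk (Placebo):    ", risk_placebo)                   # 0.10
print("Relative risk:     ", risk_active / risk_placebo)     # 2.0
print("X2  = %.3f, p = %.4f" % (x2,  chi2_1df_pvalue(x2)))   # 3.922, about 0.0477
print("X2c = %.3f, p = %.4f" % (x2c, chi2_1df_pvalue(x2c)))  # 3.176, about 0.0747
------------------------------------------------------------------------------------------

If the SciPy package happens to be installed, scipy.stats.fisher_exact([[20, 10], [80, 90]])
can supply the Fisher Exact Test p-value as well.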
There is a Web site which will perform the computations for you if you enter the
appropriate data:

     http://www.graphpad.com/quickcalcs/index.cfm

To do the chi-square and Fisher exact test computations, click on the third button,
labelled "Fisher's and chi-square. Analyze a 2x2 contingency table."  Enter the data and
choose the test that you want.  Also, choose the "Two-tailed" test option rather than the
one-tailed.  Try this for the Example table above and confirm that you get the same
numbers.  Try this also for Table 2 in the NEJM paper on the clinical trial of
Dexamethasone as a treatment for cerebral malaria.  See if you get the same p-values as
are shown in the paper.

==========================================================================================

2. The Normal distribution

A random variable is a measurement which (in general) can take on values that are not
completely predictable.  For example, diastolic blood pressure (DBP).  If you measure DBP
on a sample of people, you will find values which you cannot predict in advance.  However
you will know in advance something about the kind of values that you will see - you will
not find values much less than 40 or much bigger than 110 - values in those ranges might
occur a few times in a sample of 1000 people.  Values like 20 or 200 simply will not occur
in normal healthy people, unless the blood pressure measuring equipment malfunctions or is
used incorrectly.

There is a DISTRIBUTION of possible values for DBP.  The average value across a large
population of adults might be about 80 mm Hg (millimeters of mercury).  About half of the
values will be between 74 and 86.  If you measure 1000 people and you plot the frequency
of observed values as a bar graph, you may get something that looks like the following:

     [Histogram of a sample of 10,000 diastolic blood pressures: percentage of the sample
      plotted against DBP midpoints from about 50 to 110 mm Hg.  The bars follow an
      approximately bell-shaped curve centered near 80.]

The histogram approximately follows a 'bell-shaped' curve.  There is a formula for this
curve, which describes the DISTRIBUTION of blood pressures.  In general the formula is:

                      1                   (x - m)^2
     f(x)  =  ------------------ * exp(- -----------),
               s * sqrt(2 * pi)            2 * s^2

where pi = 3.14159... and 'm' is the overall mean value (80 in the case of the DBP
distribution) and 's' is the standard deviation of the x-values.  The 'exp' term
represents exponentiation to the base 'e' for natural logarithms; 'e' is approximately
equal to 2.71828.

The overall mean m is also where the curve f(x) has its peak.  The standard deviation 's'
of the x-values is a measure of how 'spread out' the distribution is.  If 's' is small,
the curve has a narrow peak around the mean value m.  If 's' is large, the curve is wider
and has a more 'gentle' peak.
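To make the formula for f(x) concrete, here is a small Python sketch that evaluates the
normal curve.  The values m = 80 and s = 10 are illustrative only (the mean of about 80
comes from the DBP discussion above; the standard deviation of 10 matches the worked
example further below).  The sketch also checks numerically that the total area under the
curve is close to 1, a property discussed next.

------------------------------------------------------------------------------------------
import math

def normal_pdf(x, m, s):
    # The normal distribution function f(x) as written in the notes
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

m, s = 80.0, 10.0   # illustrative mean and standard deviation

# The curve has its peak at the mean; points the same distance above and
# below the mean give the same value (the curve is symmetric around m).
print(normal_pdf(80, m, s))   # largest value, about 0.0399
print(normal_pdf(70, m, s))   # about 0.0242
print(normal_pdf(90, m, s))   # same as at 70, about 0.0242

# Crude numerical check that the total area under the curve is about 1
step = 0.01
area = sum(normal_pdf(i * step, m, s) for i in range(0, 20000)) * step  # x from 0 to 200
print("approximate area under f(x):", area)   # very close to 1.0
------------------------------------------------------------------------------------------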
Probability distribution functions like f(x) above are constructed so that the total area
under the curve is 1.00.  Note also that the curve is SYMMETRIC around the mean value 'm'
- that is, an observation is just as likely to be x units below the mean value as it is to
be x units above the mean value.  Normal distributions are always symmetric around their
mean.  There are other distributions which are asymmetric (i.e., 'lopsided').

There is one special case of the normal distribution which is called the STANDARD normal
distribution.  This has mean m = 0 and standard deviation s = 1.  The probability
distribution function for the standard normal distribution is

                     1               x^2
     f(x)  =  -------------- * exp(- ---),
               sqrt(2 * pi)           2

so it is a little simpler than other normal distributions.

There are tables which relate the standard normal distribution values to certain
probabilities.  See the class handout for such a table.  If X has a standard normal
distribution, you can use the table to evaluate certain probabilities.  For example, the
table indicates the following:

     1. The probability that X is less than 1.00 is p = 0.8413.
     2. The probability that X is less than 1.65 is p = 0.9505.
     3. The probability that X is less than 2.50 is p = 0.9938.

You can compute other probabilities from such tables.  For example, since the total
probability must add up to 1, the probability that X is greater than 1.00 is
1 - 0.8413 = 0.1587.  The probability that X is between 1.65 and 2.50 is
.9938 - .9505 = .0433.

Since the standard normal distribution is symmetric around the mean m = 0, you can say
that:

     1. prob (X < 0) = .5
     2. prob (X < -1.65) = 1 - .9505 = .0495
     3. prob (-2.50 < X < -1.65) = .0433

----------------------------------------------------------------------------------------
Exercise:  Find the following probabilities for a standard normal random variable:

     a) prob (X < 0.2)
     b) prob (X < 0.3)
     c) prob (X > 0.5)
     d) prob (0.4 < X < 0.8)
     e) prob (-1 < X < +1)
     f) prob (X > 1.96 or X < -1.96)
     g) prob (X > 2.32)
     h) prob (X^2 < 2)
     i) prob (X + 1 < 1.5)
     j) prob ((X - 1)^2 < 1.5)
----------------------------------------------------------------------------------------

If X has a normal distribution with mean 'm' and standard deviation (SDEV) 's', then X can
be related back to a standard normal distribution by a TRANSFORMATION.  In this case, the
transformation is:

            X - m
     Z  =  ------- .
              s

That is: if X has a normal distribution with mean 'm' and standard deviation 's', then the
random variable Z given by the formula above has a normal distribution with mean 0 and
standard deviation 1.  This makes it possible to use the tables for the standard normal
distribution to compute probabilities for other normal distributions.

For example:  Assume X has mean 80 and standard deviation 10.  What is the probability
that X is between 70 and 90?

Answer: let Z = (X - 80)/10.  When X = 70, Z = -1.  When X = 80, Z = 0.  When X = 90,
Z = +1.  Therefore, prob(70 < X < 90) = prob(-1 < Z < 1).  Because Z has a standard normal
distribution, you can use the standard normal table handout to compute prob(-1 < Z < 1).
Check to see that you get 0.6826.

You can state this in another way: if X has a normal distribution, the probability that
values of X will lie within 1 standard deviation of the mean value is 0.6826.
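If you want to check the table values and the transformation example by computer, the
sketch below uses Python's math.erf function; for a standard normal X, prob(X < x) equals
0.5 * (1 + erf(x / sqrt(2))).  This reproduces the numbers quoted above and is meant only
as a cross-check of the handout table.

------------------------------------------------------------------------------------------
import math

def phi(x):
    # prob(X < x) for a standard normal random variable X
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Values from the table discussion above
print(phi(1.00))               # about 0.8413
print(phi(1.65))               # about 0.9505
print(phi(2.50))               # about 0.9938
print(phi(2.50) - phi(1.65))   # about 0.0433

# The transformation example: X normal with mean 80 and SDEV 10,
# prob(70 < X < 90) = prob(-1 < Z < 1), where Z = (X - 80)/10
m, s = 80.0, 10.0
z_lo, z_hi = (70 - m) / s, (90 - m) / s
print(phi(z_hi) - phi(z_lo))   # about 0.6826
------------------------------------------------------------------------------------------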
Exercise:  Suppose X has a normal distribution with mean 100 and standard deviation 25.
Compute the following probabilities:

     a) The probability that X > 125
     b) Prob (X > 110)
     c) Prob (X < 0)
     d) Prob (80 < X < 120)
     e) Prob (X > 130)
     f) Prob (X < 60)
     g) The probability that X lies within 2 standard deviations of the mean
     h) The probability that X lies within .5 standard deviations of the mean
     i) Find the value c such that Prob(X > c) = 0.20.
     j) Find the value c such that Prob(X < c) = 0.30.

Some facts about the normal distribution:

     1) If X has a normal distribution with mean 'm', then X + h has a normal
        distribution with mean 'm + h'.
     2) If X and Y both have normal distributions with means 'u' and 'v', then W = X + Y
        has a normal distribution with mean 'u + v'.

Definition:  The VARIANCE is the square of the standard deviation.

Notation:  If X has a normal distribution with mean 'm' and variance 'v', then we say:
X ~ N(m, v).  Thus to say that X has a standard normal distribution is the same as saying
X ~ N(0, 1).

More facts:

     3) For any random variables X and Y (not just normal ones), it is true that
        mean(X + Y) = mean(X) + mean(Y).
     4) In general, it is NOT true that SDEV(X + Y) = SDEV(X) + SDEV(Y).
     5) It is also NOT true in general that VAR(X + Y) = VAR(X) + VAR(Y).
     6) For any random variable, VAR(X) = mean(X^2) - (mean(X))^2, that is, the variance
        of X is the mean of the square of X minus the square of the mean of X.

Note that I have not yet given the definition of 'mean' for any random variable other than
normal random variables.  The general definition requires calculus.  The basic idea
however is fairly simple.  A mean is essentially a weighted average.  To find the mean of
a random variable, you find all the possible values that the random variable can take on.
You multiply each value by the probability that that value can occur.  You add up all
these products.  The resulting value is the mean.  The complication is that if the random
variable can take on infinitely many values, then you need to add up infinitely many
products.  That is where calculus comes in.

You might think that the mean value of a random variable X is the value such that the
probability that X is less than this value is 0.50 and the probability that X is greater
than this value is also 0.50.  This is true for normal random variables, but it is not
true in general.  There is another name for the 'center' of a distribution.  The value w
such that prob(X < w) = 0.50 is called the MEDIAN.  In general the median is not the same
as the mean.

You might also think that the mean of a distribution X which has distribution function
f(X) is the value where the function f(X) has its highest point.  Again, this is true for
the normal distribution - that is, if X ~ N(m, v), then f(X) has its highest point at the
mean 'm' - but this is not true for other distributions.  The point where f(X) has its
highest value is called the MODE of the distribution.

Suppose you have a random variable X which is a measure of some kind - for example, X is
the height of a person.  You can estimate the mean of X by taking a sample of people and
measuring their heights and then taking the average.  If there are n people and their
measured heights are X1, X2, X3, ..., Xn, then the SAMPLE MEAN is defined as

     XBAR = (X1 + X2 + X3 + ... + Xn) / n.

Note that the sample mean in general does not equal the 'true' mean.  That is, if
X ~ N(m, v), then in general you do not get m = XBAR.  However, if the sample is large,
the sample mean XBAR tends to be close to the true mean m.  You can also estimate the
variance v from a sample.  See the next section for how to compute the sample variance and
sample standard deviation.
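To make the sample mean and fact 6) concrete, here is a small Python sketch using ten
made-up height measurements (the numbers are arbitrary, chosen only for illustration).
It computes the sample mean XBAR and checks fact 6) using the 'divide by n' form of the
variance; the N - 1 form used for samples is given in the next section.

------------------------------------------------------------------------------------------
# Hypothetical measurements, for illustration only
x = [62, 70, 68, 74, 66, 71, 65, 69, 73, 67]
n = len(x)

xbar = sum(x) / n                               # sample mean XBAR
mean_of_squares = sum(v * v for v in x) / n     # mean of the squared values

# Fact 6): variance = mean of the squares minus square of the mean
# (this is the 'divide by n' variance; Section 3 divides by N - 1 for samples)
var = mean_of_squares - xbar ** 2

print("sample mean:", xbar)                     # 68.5 for these numbers
print("variance (divide-by-n form):", var)      # 12.25 for these numbers
------------------------------------------------------------------------------------------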
Exercises:

     a) Serum cholesterol has a normal distribution with a mean of 200 and a standard
        deviation of 25.  You go to the doctor for a checkup.  He orders a measurement of
        serum cholesterol.  The value is 250 mg/dl (milligrams per deciliter).  The doctor
        says, "You realize that only _____ percent of the population has a higher serum
        cholesterol than you do."  [Fill in the blank]
     b) The doctor says, "90% of the population has a serum cholesterol that is within
        ______ units of the mean."
     c) The doctor says, "I want you to be down at the 30th percentile of the population.
        I am going to give you a cholesterol-lowering drug.  That should bring your serum
        cholesterol down to ______ mg/dl."
     d) You are measuring lengths of turtles.  You know that Species A of turtle has
        lengths which are normally distributed with a mean of 20 cm and a standard
        deviation of 8 cm.  One of your turtles has a length of 40 cm.  You want to
        compute the probability that, if the turtle belongs to Species A, you would
        encounter a length as great as or greater than 40 cm.

This last exercise is a common kind of statistical problem, the situation of testing a
hypothesis.  You want to test the hypothesis that your turtle belongs to Species A.  You
state the hypothesis in the form of a NULL HYPOTHESIS, H0, based on the mean length of
turtles in Species A:

     H0:  m = 20 cm.

You are asking the question: if the null hypothesis H0 is true, what is the probability
that I would observe a turtle with length 40 cm or greater?  The answer in this case turns
out to be about 0.00621.

In certain situations, the average of two measurements of a random variable has a smaller
standard deviation than a single measurement has.  By 'certain situations', I mean that
the two measurements are INDEPENDENT.  This means that the value of one measurement does
not depend on the value of the other measurement.

This is a slightly difficult concept.  It might be best explained by an example where two
measurements are NOT independent.  Suppose you are testing people's ability to estimate
lengths.  You make a straight mark on a piece of paper and you bring Mr. Smith into the
testing room and you ask him to guess how long the mark is.  He says "3 inches".  Then you
dismiss Mr. Smith and bring in Mr. Jones and ask him the same thing.  He says "4 inches".
Of course, Mr. Jones might give a different answer if he was able to overhear Mr. Smith's
answer.  If that happened, it is likely that Mr. Smith's answer and Mr. Jones's answer
would NOT be independent.  However, if they could not overhear each other or otherwise
communicate their guesses to each other, their answers would most likely be independent.

If X and Y are independent measurements of the same thing, and X and Y have the same
standard deviation s, then the average is (X + Y)/2, and the standard deviation of the
average is s/sqrt(2).  In general, if X1, X2, X3, ..., Xn are independent measurements,
all of which have the same standard deviation s, then the standard deviation of the
average is s/sqrt(n).  This has a special name.  It is called the Standard Error of the
Mean (SEM).
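Two quick Python checks of the ideas just stated: the turtle probability quoted above
(about 0.00621), and the standard error of the mean s/sqrt(n).  The phi() function is the
same standard normal cumulative probability used in the earlier sketch, and the numbers
fed to sem() below simply restate values already given in the text.

------------------------------------------------------------------------------------------
import math

def phi(x):
    # prob(X < x) for a standard normal random variable
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Turtle example: Species A lengths are normal with mean 20 cm and SDEV 8 cm.
# Probability that a single Species A turtle is 40 cm or longer:
z = (40 - 20) / 8
print(1 - phi(z))      # about 0.00621

def sem(s, n):
    # Standard Error of the Mean for n independent measurements with SDEV s
    return s / math.sqrt(n)

# Standard deviation of the average of two independent measurements (s / sqrt(2))
print(sem(8.0, 2))     # about 5.66
------------------------------------------------------------------------------------------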
Exercises:

     a) Suppose X is height in inches.  It has a mean of 68 and a standard deviation s of
        10 inches.  What is the standard deviation of the average of 5 heights?
     b) Suppose you measure 9 turtles from the same swamp.  As above, you know that
        turtles from Species A have a mean length of 20 cm, with a standard deviation of
        8 cm.  The average length of your 9 turtles is 30 cm.  What is the probability
        that you would observe an average length of at least 30, given the null hypothesis
        that all 9 turtles belong to Species A?

A lot of distributions are studied by statisticians.  However, the most important
distribution by far is the normal distribution.  There are several reasons for this.  One
is that many random variables that are based on measurements have approximately a normal
distribution - for example, blood pressure (both systolic and diastolic), serum
cholesterol, kidney function measurements, heights, weights, lung volume, IQ, hematocrit
values, and many others.  A second reason is that if you add two random variables together
which are not normally distributed, the sum very often is much closer to having a normal
distribution than either of the two original variables, even if the two random variables
are not related to each other.  This tendency becomes even stronger when you add together
many random variables, or compute their average.  This is a consequence of a very deep
theorem in probability called the Central Limit Theorem.  You will learn about this if you
take a more advanced course in probability and statistics.

Note that if X has a normal distribution, it is theoretically possible for X to have any
value from minus infinity to plus infinity.  If X ~ N(0, 1), it is possible, but extremely
unlikely, that X might take on a value as low as -8.  It is equally unlikely that X will
be larger than +8.  In fact the probability of either of these happening is approximately
0.000000000000000622.

Note that some random variables are NOT normally distributed.  An example is your platelet
count.  This tends to be high in most people (on the order of 300,000 per microliter of
blood), but for a few people with certain diseases it will be very low.  This is an
example of what is called a skewed distribution.  Another example: if X has a standard
normal distribution, X^2 has a very skewed distribution, with values tending to cluster
close to 0 and no negative values at all.  In fact if Y is the square of a random variable
X which has the N(0, 1) distribution, then Y has a chi-square distribution with degrees of
freedom equal to 1.

Finally: it is important to distinguish between the 'true' mean and the mean of a sample.
The 'true' mean of X is the average of all possible measurements of X.  In general you may
not know what the true mean is.  In practice, you take a sample from a large background
population - for example, you choose 10 people at random in Ramsey County, and you measure
their weights - and you find the average of the random variable X as
XBAR = (X1 + X2 + X3 + ... + X10) / 10, and that is the SAMPLE ESTIMATE of the true mean
m.  If you take a different sample you will (almost always) get a different sample
estimate.  The general objective of the science of statistics is to estimate the truth
about a population from relatively small samples.
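The remark above that the square of a standard normal variable has a chi-square
distribution with 1 degree of freedom can be checked by simulation.  The sketch below
(illustrative only) draws standard normal values with Python's random.gauss, squares them,
and compares the fraction exceeding a few cutoffs with the tail probability
erfc(sqrt(x/2)) used in the Section 1 sketch.  The cutoffs chosen are arbitrary.

------------------------------------------------------------------------------------------
import math
import random

random.seed(1)   # fixed seed so the sketch gives reproducible output

# Squares of simulated standard normal values
n = 200000
squares = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

# Compare the simulated tail probability P(Z^2 > x) with the chi-square
# (1 degree of freedom) tail probability erfc(sqrt(x / 2)).
for x in (1.0, 2.706, 3.841):
    simulated = sum(1 for v in squares if v > x) / n
    exact = math.erfc(math.sqrt(x / 2))
    print("x = %.3f  simulated %.4f  chi-square(1 df) %.4f" % (x, simulated, exact))
    # the exact values are about 0.317, 0.100, and 0.050
------------------------------------------------------------------------------------------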
==========================================================================================

3. T-test for differences in the mean values of two groups:

Suppose you have 10 observations on diastolic blood pressure (DBP) in each of two groups:

     Group 1:  92, 80, 70, 104, 86, 86, 72, 98, 74, 68
     Group 2:  68, 90, 64,  88, 76, 70, 82, 94, 76, 80

You compute the MEAN, or average, DBP in each group as:

     Group 1 mean:  (92 + 80 + 70 + ... + 68) / 10 = 83.0
     Group 2 mean:  (68 + 90 + 64 + ... + 80) / 10 = 78.8

You compute the VARIANCE for the sample from each group as:

     Group 1 variance VAR1:  [(92^2 + 80^2 + 70^2 + ... + 68^2) - 10 * MEAN1 * MEAN1] / 9
                          =  [70260 - 10 * 83 * 83] / 9  =  152.22

     Group 2 variance VAR2:  [(68^2 + 90^2 + 64^2 + ... + 80^2) - 10 * MEAN2 * MEAN2] / 9
                          =  [62976 - 10 * 78.8 * 78.8] / 9  =  97.96

   --------------------------------------------------------------------------------------
   In general, to compute the VARIANCE, VAR, of a set of N numbers,

        x1, x2, x3, ..., xN:

   First compute the MEAN, and then:

        VAR = [(sum of squares of the N numbers) - N * (square of the mean)] / (N - 1).

   The STANDARD DEVIATION, SDEV, of a set of N numbers is the square root of the
   variance:

        SDEV = sqrt(VAR).

   The STANDARD ERROR OF THE MEAN, SEM, of a set of N numbers is:

        SEM = SDEV / sqrt(N) = sqrt(VAR/N).
   --------------------------------------------------------------------------------------

Note that for Group 1 and Group 2 above, the standard errors of the means are:

     Group 1:  SEM1 = sqrt(152.22/10) = 3.90
     Group 2:  SEM2 = sqrt(97.96/10)  = 3.13

The t-statistic is defined as:

            Mean1 - Mean2
     t  =  --------------- ,
                SEM12

where

     SEM12 = sqrt(SEM1*SEM1 + SEM2*SEM2) = sqrt(VAR1/N1 + VAR2/N2),

and where N1 = count of values in Group 1, N2 = count of values in Group 2.

For the data above, SEM12 is:

     SEM12 = sqrt(VAR1/N1 + VAR2/N2) = sqrt(15.222 + 9.796) = 5.002,

and

            83.0 - 78.8
     t  =  -------------  =  4.2/5.002  =  0.8397.
               5.002

To find the p-value corresponding to this t-statistic, you need a table of t-values.  To
use this table, you need to know the DEGREES OF FREEDOM of the t-statistic.  This is
defined as DF = N1 + N2 - 2.  As a rule, you want a two-sided p-value for a given
t-statistic.  From the handout table, with t = 0.8397, the p-value for the upper tail of
the t distribution with 18 degrees of freedom is between 0.25 and 0.20.  Therefore the
two-sided p-value is between 0.50 and 0.40:

     0.40 < two-sided p-value < 0.50.

The exact p-value for this t-statistic, from a statistical computing package, is
p-value = 0.4121.  Note that this p-value can also be computed by the use of the Web site
noted above, http://www.graphpad.com/quickcalcs/index.cfm

==========================================================================================
Last update: January 22, 2007
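A final computational note: the Section 3 arithmetic can also be checked with a short
Python sketch that follows the formulas above.  The exact two-sided p-value at the end
uses scipy.stats.t, which assumes the SciPy package is installed; if it is not, the sketch
still prints the means, variances, and the t-statistic, and you fall back on the t table.

------------------------------------------------------------------------------------------
import math

group1 = [92, 80, 70, 104, 86, 86, 72, 98, 74, 68]
group2 = [68, 90, 64, 88, 76, 70, 82, 94, 76, 80]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Sample variance: [(sum of squares) - N * (mean^2)] / (N - 1), as in the box above
    n = len(xs)
    return (sum(v * v for v in xs) - n * mean(xs) ** 2) / (n - 1)

m1, m2 = mean(group1), mean(group2)          # 83.0 and 78.8
v1, v2 = variance(group1), variance(group2)  # about 152.22 and 97.96
n1, n2 = len(group1), len(group2)

sem12 = math.sqrt(v1 / n1 + v2 / n2)         # about 5.002
t = (m1 - m2) / sem12                        # about 0.8397
df = n1 + n2 - 2                             # 18

print("Group means:", m1, m2)
print("Group variances: %.2f %.2f" % (v1, v2))
print("t = %.4f with %d degrees of freedom" % (t, df))

try:
    from scipy.stats import t as t_dist     # optional; requires SciPy
    p = 2 * t_dist.sf(abs(t), df)           # two-sided p-value, about 0.4121
    print("two-sided p-value = %.4f" % p)
except ImportError:
    print("SciPy not installed; use the t table (p is between 0.40 and 0.50)")
------------------------------------------------------------------------------------------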