HSEM 3010  Spring 2007  Clinical Trials notes.001  Some statistical methods
==========================================================================================

1. Chi-square test for 2 x 2 tables:

Assume a clinical trial is conducted with a dichotomous endpoint (success or failure),
and the following data are observed:

                  Active Drug   Placebo
                -------------------------
                |           |           |
      Failure   |     a     |     b     |   n1
                |           |           |
                -------------------------
                |           |           |
      Success   |     c     |     d     |   n2
                |           |           |
                -------------------------
                     m1          m2          N

The column margins are m1 and m2.  The row margins are n1 and n2.  The total number of
people studied in this clinical trial is N = n1 + n2 = m1 + m2.  The numbers of people
within the cells are a, b, c, and d.  The column margins, m1 and m2, are approximately
equal (because people are randomly assigned with equal probability to either Active Drug
or Placebo).

The (uncorrected) chi-square statistic X2 is computed as follows:

              N * (a*d - b*c)^2
     X2  =  ---------------------
             m1 * m2 * n1 * n2

The Yates-corrected chi-square is:

              N * (|a*d - b*c| - N/2)^2
     X2c  =  ---------------------------
                 m1 * m2 * n1 * n2

In general, if the Active Drug does not produce different results from the Placebo, both
X2 and X2c are small.  If the proportion of failures in the Active Drug group is very
different from that in the Placebo group, X2 and X2c are large.  Note that X2 is always
larger than X2c.

If the Active Drug has no real effect, then X2 and X2c will have a distribution which is
approximately chi-square with 1 degree of freedom.  You can use tables to find a
probability associated with a given value of X2 or X2c.  This is the probability of
observing a difference in the proportions of failures in the two groups which is as large
as, or larger than, that which occurred in data from the clinical trial, assuming that the
Active Drug is really not different from the Placebo.

Example:
                  Active Drug   Placebo
                -------------------------
                |           |           |
      Failure   |    20     |    10     |    30
                |           |           |
                -------------------------
                |           |           |
      Success   |    80     |    90     |   170
                |           |           |
                -------------------------
                    100         100          200

Note that the failure rates in the two groups are:

     Active drug: 20%
     Placebo    : 10%

The failure rate is sometimes called RISK.  Thus the risk in the Active Drug group is 20%,
while the risk in the Placebo group is 10%.  RELATIVE RISK is defined as the quotient of
the two risks.  In this case, the relative risk is 0.20 / 0.10 = 2.00.  You can interpret
this as saying the risk of failure in the Active Drug group is twice as large as in the
Placebo group.

Here X2 = 3.922.  The corresponding probability is 0.0477.  And X2c = 3.176, with
corresponding probability 0.0747.  Verify the computations of X2 and X2c with a
calculator.  Check that the p-values agree with the class handout on the chi-square
distribution.

There is another test statistic for 2 x 2 tables: the Fisher Exact Test.  The method of
computation of the Fisher Exact Test is too complicated to describe here.  It also
produces a probability.  In this case, the Fisher Exact Test probability is:

     Fisher Exact Test probability: 0.0734.

Note that the Yates-corrected chi-square probability agrees closely with the Fisher Exact
Test probability.

Which of these should you use?  Statisticians actually don't completely agree on this.
However, most statisticians prefer to report the Fisher Exact Test value.  If they cannot
compute the Fisher Exact Test, they will usually prefer the Yates-corrected chi-square
probability.  All these probabilities are also referred to as "p-values".
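If you have access to Python, the following minimal sketch reproduces the X2, X2c, risk,
and relative-risk computations for the Example table above.  The p-values come from the
fact (discussed in Section 2) that a chi-square variable with 1 degree of freedom is the
square of a standard normal variable, so its tail probability is erfc(sqrt(x/2)).  This is
only an illustration of the formulas given above, not a replacement for the class handout.

------------------------------------------------------------------------------------------
import math

# Example 2 x 2 table from the notes (Active Drug vs Placebo)
a, b = 20, 10          # failures
c, d = 80, 90          # successes
n1, n2 = a + b, c + d  # row margins
m1, m2 = a + c, b + d  # column margins
N = n1 + n2            # total number of people

# Uncorrected and Yates-corrected chi-square statistics, as defined above
x2  = N * (a * d - b * c) ** 2 / (m1 * m2 * n1 * n2)
x2c = N * (abs(a * d - b * c) - N / 2) ** 2 / (m1 * m2 * n1 * n2)

def chi2_1df_pvalue(x):
    # If Z is standard normal, Z^2 has a chi-square distribution with 1 df,
    # so P(chi-square_1df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2))

risk_active, risk_placebo = a / m1, b / m2
print("Risk (Active Drug):", risk_active)                    # 0.20
print("Risk (Placebo):    ", risk_placebo)                   # 0.10
print("Relative risk:     ", risk_active / risk_placebo)     # 2.0
print("X2  = %.3f, p = %.4f" % (x2,  chi2_1df_pvalue(x2)))   # 3.922, about 0.0477
print("X2c = %.3f, p = %.4f" % (x2c, chi2_1df_pvalue(x2c)))  # 3.176, about 0.0747
------------------------------------------------------------------------------------------

If the SciPy package happens to be installed, scipy.stats.fisher_exact([[20, 10], [80, 90]])
can supply the Fisher Exact Test p-value as well.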
There is a Web site which will perform the computations for you if you enter the
appropriate data:

     http://www.graphpad.com/quickcalcs/index.cfm

To do the chi-square and Fisher exact test computations, click on the third button,
labelled "Fisher's and chi-square. Analyze a 2x2 contingency table."  Enter the data and
choose the test that you want.  Also, choose the "Two-tailed" test option rather than the
one-tailed.  Try this for the Example table above and confirm that you get the same
numbers.  Try this also for Table 2 in the NEJM paper on the clinical trial of
Dexamethasone as a treatment for cerebral malaria.  See if you get the same p-values as
are shown in the paper.

==========================================================================================

2. The Normal distribution

A random variable is a measurement which (in general) can take on values that are not
completely predictable.  For example, diastolic blood pressure (DBP).  If you measure DBP
on a sample of people, you will find values which you cannot predict in advance.  However
you will know in advance something about the kind of values that you will see - you will
not find values much less than 40 or much bigger than 110 - values in those ranges might
occur a few times in a sample of 1000 people.  Values like 20 or 200 simply will not occur
in normal healthy people, unless the blood pressure measuring equipment malfunctions or is
used incorrectly.

There is a DISTRIBUTION of possible values for DBP.  The average value across a large
population of adults might be about 80 mm Hg (millimeters of mercury).  About half of the
values will be between 74 and 86.  If you measure 1000 people and you plot the frequency
of observed values as a bar graph, you may get something that looks like the following:

     [Histogram of a sample of 10,000 diastolic blood pressures: percentage of the sample
      plotted against DBP midpoints from about 50 to 110 mm Hg.  The bars follow an
      approximately bell-shaped curve centered near 80.]

The histogram approximately follows a 'bell-shaped' curve.  There is a formula for this
curve, which describes the DISTRIBUTION of blood pressures.  In general the formula is:

                      1                   (x - m)^2
     f(x)  =  ------------------ * exp(- -----------),
               s * sqrt(2 * pi)            2 * s^2

where pi = 3.14159... and 'm' is the overall mean value (80 in the case of the DBP
distribution) and 's' is the standard deviation of the x-values.  The 'exp' term
represents exponentiation to the base 'e' for natural logarithms; 'e' is approximately
equal to 2.71828.

The overall mean m is also where the curve f(x) has its peak.  The standard deviation 's'
of the x-values is a measure of how 'spread out' the distribution is.  If 's' is small,
the curve has a narrow peak around the mean value m.  If 's' is large, the curve is wider
and has a more 'gentle' peak.
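To make the formula for f(x) concrete, here is a small Python sketch that evaluates the
normal curve.  The values m = 80 and s = 10 are illustrative only (the mean of about 80
comes from the DBP discussion above; the standard deviation of 10 matches the worked
example further below).  The sketch also checks numerically that the total area under the
curve is close to 1, a property discussed next.

------------------------------------------------------------------------------------------
import math

def normal_pdf(x, m, s):
    # The normal distribution function f(x) as written in the notes
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

m, s = 80.0, 10.0   # illustrative mean and standard deviation

# The curve has its peak at the mean; points the same distance above and
# below the mean give the same value (the curve is symmetric around m).
print(normal_pdf(80, m, s))   # largest value, about 0.0399
print(normal_pdf(70, m, s))   # about 0.0242
print(normal_pdf(90, m, s))   # same as at 70, about 0.0242

# Crude numerical check that the total area under the curve is about 1
step = 0.01
area = sum(normal_pdf(i * step, m, s) for i in range(0, 20000)) * step  # x from 0 to 200
print("approximate area under f(x):", area)   # very close to 1.0
------------------------------------------------------------------------------------------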
Probability distribution functions like f(x) above are constructed so that the total area
under the curve is 1.00.  Note also that the curve is SYMMETRIC around the mean value 'm'
- that is, an observation is just as likely to be x units below the mean value as it is to
be x units above the mean value.  Normal distributions are always symmetric around their
mean.  There are other distributions which are asymmetric (i.e., 'lopsided').

There is one special case of the normal distribution which is called the STANDARD normal
distribution.  This has mean m = 0 and standard deviation s = 1.  The probability
distribution function for the standard normal distribution is

                     1               x^2
     f(x)  =  -------------- * exp(- ---),
               sqrt(2 * pi)           2

so it is a little simpler than other normal distributions.

There are tables which relate the standard normal distribution values to certain
probabilities.  See the class handout for such a table.  If X has a standard normal
distribution, you can use the table to evaluate certain probabilities.  For example, the
table indicates the following:

     1. The probability that X is less than 1.00 is p = 0.8413.
     2. The probability that X is less than 1.65 is p = 0.9505.
     3. The probability that X is less than 2.50 is p = 0.9938.

You can compute other probabilities from such tables.  For example, since the total
probability must add up to 1, the probability that X is greater than 1.00 is
1 - 0.8413 = 0.1587.  The probability that X is between 1.65 and 2.50 is
.9938 - .9505 = .0433.

Since the standard normal distribution is symmetric around the mean m = 0, you can say
that:

     1. prob (X < 0) = .5
     2. prob (X < -1.65) = 1 - .9505 = .0495
     3. prob (-2.50 < X < -1.65) = .0433

----------------------------------------------------------------------------------------
Exercise:  Find the following probabilities for a standard normal random variable:

     a) prob (X < 0.2)
     b) prob (X < 0.3)
     c) prob (X > 0.5)
     d) prob (0.4 < X < 0.8)
     e) prob (-1 < X < +1)
     f) prob (X > 1.96 or X < -1.96)
     g) prob (X > 2.32)
     h) prob (X^2 < 2)
     i) prob (X + 1 < 1.5)
     j) prob ((X - 1)^2 < 1.5)
----------------------------------------------------------------------------------------

If X has a normal distribution with mean 'm' and standard deviation (SDEV) 's', then X can
be related back to a standard normal distribution by a TRANSFORMATION.  In this case, the
transformation is:

            X - m
     Z  =  ------- .
              s

That is: if X has a normal distribution with mean 'm' and standard deviation 's', then the
random variable Z given by the formula above has a normal distribution with mean 0 and
standard deviation 1.  This makes it possible to use the tables for the standard normal
distribution to compute probabilities for other normal distributions.

For example:  Assume X has mean 80 and standard deviation 10.  What is the probability
that X is between 70 and 90?

Answer: let Z = (X - 80)/10.  When X = 70, Z = -1.  When X = 80, Z = 0.  When X = 90,
Z = +1.  Therefore, prob(70 < X < 90) = prob(-1 < Z < 1).  Because Z has a standard normal
distribution, you can use the standard normal table handout to compute prob(-1 < Z < 1).
Check to see that you get 0.6826.

You can state this in another way: if X has a normal distribution, the probability that
values of X will lie within 1 standard deviation of the mean value is 0.6826.
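If you want to check the table values and the transformation example by computer, the
sketch below uses Python's math.erf function; for a standard normal X, prob(X < x) equals
0.5 * (1 + erf(x / sqrt(2))).  This reproduces the numbers quoted above and is meant only
as a cross-check of the handout table.

------------------------------------------------------------------------------------------
import math

def phi(x):
    # prob(X < x) for a standard normal random variable X
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Values from the table discussion above
print(phi(1.00))               # about 0.8413
print(phi(1.65))               # about 0.9505
print(phi(2.50))               # about 0.9938
print(phi(2.50) - phi(1.65))   # about 0.0433

# The transformation example: X normal with mean 80 and SDEV 10,
# prob(70 < X < 90) = prob(-1 < Z < 1), where Z = (X - 80)/10
m, s = 80.0, 10.0
z_lo, z_hi = (70 - m) / s, (90 - m) / s
print(phi(z_hi) - phi(z_lo))   # about 0.6826
------------------------------------------------------------------------------------------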
Exercise:  Suppose X has a normal distribution with mean 100 and standard deviation 25.
Compute the following probabilities:

     a) The probability that X > 125
     b) Prob (X > 110)
     c) Prob (X < 0)
     d) Prob (80 < X < 120)
     e) Prob (X > 130)
     f) Prob (X < 60)
     g) The probability that X lies within 2 standard deviations of the mean
     h) The probability that X lies within .5 standard deviations of the mean
     i) Find the value c such that Prob(X > c) = 0.20.
     j) Find the value c such that Prob(X < c) = 0.30.

Some facts about the normal distribution:

     1) If X has a normal distribution with mean 'm', then X + h has a normal
        distribution with mean 'm + h'.
     2) If X and Y both have normal distributions with means 'u' and 'v', then W = X + Y
        has a normal distribution with mean 'u + v'.

Definition:  The VARIANCE is the square of the standard deviation.

Notation:  If X has a normal distribution with mean 'm' and variance 'v', then we say:
X ~ N(m, v).  Thus to say that X has a standard normal distribution is the same as saying
X ~ N(0, 1).

More facts:

     3) For any random variables X and Y (not just normal ones), it is true that
        mean(X + Y) = mean(X) + mean(Y).
     4) In general, it is NOT true that SDEV(X + Y) = SDEV(X) + SDEV(Y).
     5) It is also NOT true in general that VAR(X + Y) = VAR(X) + VAR(Y).
     6) For any random variable, VAR(X) = mean(X^2) - (mean(X))^2, that is, the variance
        of X is the mean of the square of X minus the square of the mean of X.

Note that I have not yet given the definition of 'mean' for any random variable other than
normal random variables.  The general definition requires calculus.  The basic idea
however is fairly simple.  A mean is essentially a weighted average.  To find the mean of
a random variable, you find all the possible values that the random variable can take on.
You multiply each value by the probability that that value can occur.  You add up all
these products.  The resulting value is the mean.  The complication is that if the random
variable can take on infinitely many values, then you need to add up infinitely many
products.  That is where calculus comes in.

You might think that the mean value of a random variable X is the value such that the
probability that X is less than this value is 0.50 and the probability that X is greater
than this value is also 0.50.  This is true for normal random variables, but it is not
true in general.  There is another name for the 'center' of a distribution.  The value w
such that prob(X < w) = 0.50 is called the MEDIAN.  In general the median is not the same
as the mean.

You might also think that the mean of a distribution X which has distribution function
f(X) is the value where the function f(X) has its highest point.  Again, this is true for
the normal distribution - that is, if X ~ N(m, v), then f(X) has its highest point at the
mean 'm' - but this is not true for other distributions.  The point where f(X) has its
highest value is called the MODE of the distribution.

Suppose you have a random variable X which is a measure of some kind - for example, X is
the height of a person.  You can estimate the mean of X by taking a sample of people and
measuring their heights and then taking the average.  If there are n people and their
measured heights are X1, X2, X3, ..., Xn, then the SAMPLE MEAN is defined as

     XBAR = (X1 + X2 + X3 + ... + Xn) / n.

Note that the sample mean in general does not equal the 'true' mean.  That is, if
X ~ N(m, v), then in general you do not get m = XBAR.  However, if the sample is large,
the sample mean XBAR tends to be close to the true mean m.  You can also estimate the
variance v from a sample.  See the next section for how to compute the sample variance and
sample standard deviation.
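To make the sample mean and fact 6) concrete, here is a small Python sketch using ten
made-up height measurements (the numbers are arbitrary, chosen only for illustration).
It computes the sample mean XBAR and checks fact 6) using the 'divide by n' form of the
variance; the N - 1 form used for samples is given in the next section.

------------------------------------------------------------------------------------------
# Hypothetical measurements, for illustration only
x = [62, 70, 68, 74, 66, 71, 65, 69, 73, 67]
n = len(x)

xbar = sum(x) / n                               # sample mean XBAR
mean_of_squares = sum(v * v for v in x) / n     # mean of the squared values

# Fact 6): variance = mean of the squares minus square of the mean
# (this is the 'divide by n' variance; Section 3 divides by N - 1 for samples)
var = mean_of_squares - xbar ** 2

print("sample mean:", xbar)                     # 68.5 for these numbers
print("variance (divide-by-n form):", var)      # 12.25 for these numbers
------------------------------------------------------------------------------------------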
Exercises:

     a) Serum cholesterol has a normal distribution with a mean of 200 and a standard
        deviation of 25.  You go to the doctor for a checkup.  He orders a measurement of
        serum cholesterol.  The value is 250 mg/dl (milligrams per deciliter).  The doctor
        says, "You realize that only _____ percent of the population has a higher serum
        cholesterol than you do."  [Fill in the blank]
     b) The doctor says, "90% of the population has a serum cholesterol that is within
        ______ units of the mean."
     c) The doctor says, "I want you to be down at the 30th percentile of the population.
        I am going to give you a cholesterol-lowering drug.  That should bring your serum
        cholesterol down to ______ mg/dl."
     d) You are measuring lengths of turtles.  You know that Species A of turtle has
        lengths which are normally distributed with a mean of 20 cm and a standard
        deviation of 8 cm.  One of your turtles has a length of 40 cm.  You want to
        compute the probability that, if the turtle belongs to Species A, you would
        encounter a length as great as or greater than 40 cm.

This last exercise is a common kind of statistical problem, the situation of testing a
hypothesis.  You want to test the hypothesis that your turtle belongs to Species A.  You
state the hypothesis in the form of a NULL HYPOTHESIS, H0, based on the mean length of
turtles in Species A:

     H0:  m = 20 cm.

You are asking the question: if the null hypothesis H0 is true, what is the probability
that I would observe a turtle with length 40 cm or greater?  The answer in this case turns
out to be about 0.00621.

In certain situations, the average of two measurements of a random variable has a smaller
standard deviation than a single measurement has.  By 'certain situations', I mean that
the two measurements are INDEPENDENT.  This means that the value of one measurement does
not depend on the value of the other measurement.

This is a slightly difficult concept.  It might be best explained by an example where two
measurements are NOT independent.  Suppose you are testing people's ability to estimate
lengths.  You make a straight mark on a piece of paper and you bring Mr. Smith into the
testing room and you ask him to guess how long the mark is.  He says "3 inches".  Then you
dismiss Mr. Smith and bring in Mr. Jones and ask him the same thing.  He says "4 inches".
Of course, Mr. Jones might give a different answer if he was able to overhear Mr. Smith's
answer.  If that happened, it is likely that Mr. Smith's answer and Mr. Jones's answer
would NOT be independent.  However, if they could not overhear each other or otherwise
communicate their guesses to each other, their answers would most likely be independent.

If X and Y are independent measurements of the same thing, and X and Y have the same
standard deviation s, then the average is (X + Y)/2, and the standard deviation of the
average is s/sqrt(2).  In general, if X1, X2, X3, ..., Xn are independent measurements,
all of which have the same standard deviation s, then the standard deviation of the
average is s/sqrt(n).  This has a special name.  It is called the Standard Error of the
Mean (SEM).
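Two quick Python checks of the ideas just stated: the turtle probability quoted above
(about 0.00621), and the standard error of the mean s/sqrt(n).  The phi() function is the
same standard normal cumulative probability used in the earlier sketch, and the numbers
fed to sem() below simply restate values already given in the text.

------------------------------------------------------------------------------------------
import math

def phi(x):
    # prob(X < x) for a standard normal random variable
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Turtle example: Species A lengths are normal with mean 20 cm and SDEV 8 cm.
# Probability that a single Species A turtle is 40 cm or longer:
z = (40 - 20) / 8
print(1 - phi(z))      # about 0.00621

def sem(s, n):
    # Standard Error of the Mean for n independent measurements with SDEV s
    return s / math.sqrt(n)

# Standard deviation of the average of two independent measurements (s / sqrt(2))
print(sem(8.0, 2))     # about 5.66
------------------------------------------------------------------------------------------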
Exercises:

     a) Suppose X is height in inches.  It has a mean of 68 and a standard deviation s of
        10 inches.  What is the standard deviation of the average of 5 heights?
     b) Suppose you measure 9 turtles from the same swamp.  As above, you know that
        turtles from Species A have a mean length of 20 cm, with a standard deviation of
        8 cm.  The average length of your 9 turtles is 30 cm.  What is the probability
        that you would observe an average length of at least 30, given the null hypothesis
        that all 9 turtles belong to Species A?

A lot of distributions are studied by statisticians.  However, the most important
distribution by far is the normal distribution.  There are several reasons for this.  One
is that many random variables that are based on measurements have approximately a normal
distribution - for example, blood pressure (both systolic and diastolic), serum
cholesterol, kidney function measurements, heights, weights, lung volume, IQ, hematocrit
values, and many others.  A second reason is that if you add two random variables together
which are not normally distributed, the sum very often is much closer to having a normal
distribution than either of the two original variables, even if the two random variables
are not related to each other.  This tendency becomes even stronger when you add together
many random variables, or compute their average.  This is a consequence of a very deep
theorem in probability called the Central Limit Theorem.  You will learn about this if you
take a more advanced course in probability and statistics.

Note that if X has a normal distribution, it is theoretically possible for X to have any
value from minus infinity to plus infinity.  If X ~ N(0, 1), it is possible, but extremely
unlikely, that X might take on a value as low as -8.  It is equally unlikely that X will
be larger than +8.  In fact the probability of either of these happening is approximately
0.000000000000000622.

Note that some random variables are NOT normally distributed.  An example is your platelet
count.  This tends to be high in most people (on the order of 300,000 per microliter of
blood), but for a few people with certain diseases it will be very low.  This is an
example of what is called a skewed distribution.  Another example: if X has a standard
normal distribution, X^2 has a very skewed distribution, with values tending to cluster
close to 0 and no negative values at all.  In fact if Y is the square of a random variable
X which has the N(0, 1) distribution, then Y has a chi-square distribution with degrees of
freedom equal to 1.

Finally: it is important to distinguish between the 'true' mean and the mean of a sample.
The 'true' mean of X is the average of all possible measurements of X.  In general you may
not know what the true mean is.  In practice, you take a sample from a large background
population - for example, you choose 10 people at random in Ramsey County, and you measure
their weights - and you find the average of the random variable X as
XBAR = (X1 + X2 + X3 + ... + X10) / 10, and that is the SAMPLE ESTIMATE of the true mean
m.  If you take a different sample you will (almost always) get a different sample
estimate.  The general objective of the science of statistics is to estimate the truth
about a population from relatively small samples.
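The remark above that the square of a standard normal variable has a chi-square
distribution with 1 degree of freedom can be checked by simulation.  The sketch below
(illustrative only) draws standard normal values with Python's random.gauss, squares them,
and compares the fraction exceeding a few cutoffs with the tail probability
erfc(sqrt(x/2)) used in the Section 1 sketch.  The cutoffs chosen are arbitrary.

------------------------------------------------------------------------------------------
import math
import random

random.seed(1)   # fixed seed so the sketch gives reproducible output

# Squares of simulated standard normal values
n = 200000
squares = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

# Compare the simulated tail probability P(Z^2 > x) with the chi-square
# (1 degree of freedom) tail probability erfc(sqrt(x / 2)).
for x in (1.0, 2.706, 3.841):
    simulated = sum(1 for v in squares if v > x) / n
    exact = math.erfc(math.sqrt(x / 2))
    print("x = %.3f  simulated %.4f  chi-square(1 df) %.4f" % (x, simulated, exact))
    # the exact values are about 0.317, 0.100, and 0.050
------------------------------------------------------------------------------------------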
==========================================================================================

3. T-test for differences in the mean values of two groups:

Suppose you have 10 observations on diastolic blood pressure (DBP) in each of two groups:

     Group 1:  92, 80, 70, 104, 86, 86, 72, 98, 74, 68
     Group 2:  68, 90, 64,  88, 76, 70, 82, 94, 76, 80

You compute the MEAN, or average, DBP in each group as:

     Group 1 mean:  (92 + 80 + 70 + ... + 68) / 10 = 83.0
     Group 2 mean:  (68 + 90 + 64 + ... + 80) / 10 = 78.8

You compute the VARIANCE for the sample from each group as:

     Group 1 variance VAR1:  [(92^2 + 80^2 + 70^2 + ... + 68^2) - 10 * MEAN1 * MEAN1] / 9
                          =  [70260 - 10 * 83 * 83] / 9  =  152.22

     Group 2 variance VAR2:  [(68^2 + 90^2 + 64^2 + ... + 80^2) - 10 * MEAN2 * MEAN2] / 9
                          =  [62976 - 10 * 78.8 * 78.8] / 9  =  97.96

   --------------------------------------------------------------------------------------
   In general, to compute the VARIANCE, VAR, of a set of N numbers,

        x1, x2, x3, ..., xN:

   First compute the MEAN, and then:

        VAR = [(sum of squares of the N numbers) - N * (square of the mean)] / (N - 1).

   The STANDARD DEVIATION, SDEV, of a set of N numbers is the square root of the
   variance:

        SDEV = sqrt(VAR).

   The STANDARD ERROR OF THE MEAN, SEM, of a set of N numbers is:

        SEM = SDEV / sqrt(N) = sqrt(VAR/N).
   --------------------------------------------------------------------------------------

Note that for Group 1 and Group 2 above, the standard errors of the means are:

     Group 1:  SEM1 = sqrt(152.22/10) = 3.90
     Group 2:  SEM2 = sqrt(97.96/10)  = 3.13

The t-statistic is defined as:

            Mean1 - Mean2
     t  =  --------------- ,
                SEM12

where

     SEM12 = sqrt(SEM1*SEM1 + SEM2*SEM2) = sqrt(VAR1/N1 + VAR2/N2),

and where N1 = count of values in Group 1, N2 = count of values in Group 2.

For the data above, SEM12 is:

     SEM12 = sqrt(VAR1/N1 + VAR2/N2) = sqrt(15.222 + 9.796) = 5.002,

and

            83.0 - 78.8
     t  =  -------------  =  4.2/5.002  =  0.8397.
               5.002

To find the p-value corresponding to this t-statistic, you need a table of t-values.  To
use this table, you need to know the DEGREES OF FREEDOM of the t-statistic.  This is
defined as DF = N1 + N2 - 2.  As a rule, you want a two-sided p-value for a given
t-statistic.  From the handout table, with t = 0.8397, the p-value for the upper tail of
the t distribution with 18 degrees of freedom is between 0.25 and 0.20.  Therefore the
two-sided p-value is between 0.50 and 0.40:

     0.40 < two-sided p-value < 0.50.

The exact p-value for this t-statistic, from a statistical computing package, is
p-value = 0.4121.  Note that this p-value can also be computed by the use of the Web site
noted above, http://www.graphpad.com/quickcalcs/index.cfm

==========================================================================================
Last update: January 22, 2007
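A final computational note: the Section 3 arithmetic can also be checked with a short
Python sketch that follows the formulas above.  The exact two-sided p-value at the end
uses scipy.stats.t, which assumes the SciPy package is installed; if it is not, the sketch
still prints the means, variances, and the t-statistic, and you fall back on the t table.

------------------------------------------------------------------------------------------
import math

group1 = [92, 80, 70, 104, 86, 86, 72, 98, 74, 68]
group2 = [68, 90, 64, 88, 76, 70, 82, 94, 76, 80]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Sample variance: [(sum of squares) - N * (mean^2)] / (N - 1), as in the box above
    n = len(xs)
    return (sum(v * v for v in xs) - n * mean(xs) ** 2) / (n - 1)

m1, m2 = mean(group1), mean(group2)          # 83.0 and 78.8
v1, v2 = variance(group1), variance(group2)  # about 152.22 and 97.96
n1, n2 = len(group1), len(group2)

sem12 = math.sqrt(v1 / n1 + v2 / n2)         # about 5.002
t = (m1 - m2) / sem12                        # about 0.8397
df = n1 + n2 - 2                             # 18

print("Group means:", m1, m2)
print("Group variances: %.2f %.2f" % (v1, v2))
print("t = %.4f with %d degrees of freedom" % (t, df))

try:
    from scipy.stats import t as t_dist     # optional; requires SciPy
    p = 2 * t_dist.sf(abs(t), df)           # two-sided p-value, about 0.4121
    print("two-sided p-value = %.4f" % p)
except ImportError:
    print("SciPy not installed; use the t table (p is between 0.40 and 0.50)")
------------------------------------------------------------------------------------------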