Web address of this page: http://www.biostat.umn.edu/~john-c/ph5450.formulas.html
Most recent update: February 1, 2004.
A. Formulas for Samples
1. Sample mean of observations x1, x2, ..., xn :
xbar = (1/n) * SUM(xi)
2. Sample variance of observations x1, x2, ..., xn:
Method 1 : Variance = {SUM[(xi - xbar)^2]}/(n - 1),
where xbar is the sample mean.
Method 2 : Variance = {SUM(xi^2) - (1/n) * [SUM(xi)]^2} / (n - 1)
Note: Both methods give the same answer.
3. Sample standard deviation of observations x1, x2, ..., xn:
s = square root(Variance)
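As an illustration, here is a short Python sketch of formulas 1-3, using a small made-up data set:
    xs = [2.0, 4.0, 4.0, 5.0, 7.0]                          # hypothetical observations
    n = len(xs)
    xbar = sum(xs) / n                                      # sample mean
    var1 = sum((x - xbar)**2 for x in xs) / (n - 1)         # variance, Method 1
    var2 = (sum(x**2 for x in xs) - sum(xs)**2 / n) / (n - 1)   # variance, Method 2
    s = var1 ** 0.5                                         # sample standard deviation
    # var1 and var2 are equal (here both are 3.3).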
4. Sample median of observations x1, x2, ..., xn (sorted from lowest to highest) :
If n is odd, the median is the value of the middle observation.
If n is even, the median is the average of the two middle
observations.
5. Lower quartile: Median of the observations below the median.
(Also called the first quartile.)
6. Upper quartile: Median of the observations above the median.
(Also called the third quartile.)
7. Interquartile range (IQR): The difference between the upper
quartile and the lower quartile.
8. Def. A point in a sample is an outlier if it is larger than
third quartile + 1.5 * IQR,
or smaller than
first quartile - 1.5 * IQR.
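A Python sketch of items 4-8, following the "median of the lower/upper half" definition used here (statistical software may use slightly different quartile rules):
    def median(v):
        v = sorted(v)
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 == 1 else (v[mid - 1] + v[mid]) / 2

    xs = sorted([1, 3, 4, 7, 8, 9, 12, 40])      # hypothetical data
    m = median(xs)
    lower = [x for x in xs if x < m]             # observations below the median
    upper = [x for x in xs if x > m]             # observations above the median
    q1, q3 = median(lower), median(upper)        # lower and upper quartiles
    iqr = q3 - q1                                # interquartile range
    outliers = [x for x in xs if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr]
    # Here q1 = 3.5, q3 = 10.5, iqr = 7, and 40 is flagged as an outlier.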
9. Correlation coefficient, r:
Given observations
x1, x2, ..., xi, ... , xn and
y1, y2, ..., yi, ... , yn,
where each yi is paired with the corresponding xi,
the CORRELATION r of the x's and the y's is
r = (1/(n - 1)) * SUM{(xi - xbar)*(yi - ybar)} / (sx*sy),
where xbar and ybar are the sample means of the x's and y's,
and sx and sy are the sample standard deviations.
Another formula for r:
r = (1/(n - 1)) * [SUM(xi*yi) - (1/n)*SUM(xi)*SUM(yi)] / (sx*sy).
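A Python sketch of the correlation formula, using hypothetical paired data:
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]               # hypothetical x's
    ys = [2.0, 1.0, 4.0, 3.0, 6.0]               # hypothetical paired y's
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = (sum((x - xbar)**2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - ybar)**2 for y in ys) / (n - 1)) ** 0.5
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)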
10. Regression coefficients:
Given observations
x1, x2, ..., xi, ... , xn and
y1, y2, ..., yi, ... , yn,
The SLOPE b of the least-squares regression line is
b = r * sy / sx,
where r is the correlation coefficient and sx and sy are
sample standard deviations of the two sets of observations.
The INTERCEPT a of the regression line is
a = ybar - b * xbar,
where ybar and xbar are the sample means and b is the slope of
the regression line.
The complete equation of the regression line is thus
y = a + b*x.
11. Another formula for the slope b:
b = TOP / BOTTOM, where
TOP = SUM(xi * yi) - (1/n) * SUM(xi) * SUM(yi), and
BOTTOM = SUM(xi^2) - (1/n) * SUM(xi) * SUM(xi).
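A Python sketch of the slope and intercept formulas (items 10-11), repeating the same hypothetical data as in the correlation sketch so that it runs on its own:
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.0, 1.0, 4.0, 3.0, 6.0]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = (sum((x - xbar)**2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - ybar)**2 for y in ys) / (n - 1)) ** 0.5
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx                              # slope (item 10)
    a = ybar - b * xbar                          # intercept
    # Alternative slope formula (item 11) gives the same value:
    top = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    bottom = sum(x**2 for x in xs) - sum(xs)**2 / n
    b_alt = top / bottom                         # equals b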
12. Formulas involving sums of squares:
Given observations of paired values of x's and y's,
x1, x2, ..., xi, ... , xn and
y1, y2, ..., yi, ... , yn,
SSTOT = adjusted total sum of squares = SUM{(yi - ybar)^2},
SSREG = sum of squares due to regression = SUM{(yhati - ybar)^2},
SSRES = sum of squares residual = SUM{(yi - yhati)^2},
where ybar = mean of the observed y's, and
yhati = predicted i-th y, yhati = a + b * xi.
FACT 1 : SSREG + SSRES = SSTOT
FACT 2 : r^2 = SSREG / SSTOT (that is, the
square of the correlation is the ratio of the regression
sum of squares to the adjusted total sum of squares.)
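These two facts can be checked numerically; a self-contained Python sketch with the same hypothetical data as above:
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.0, 1.0, 4.0, 3.0, 6.0]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum(x*y for x, y in zip(xs, ys)) - sum(xs)*sum(ys)/n) / \
        (sum(x**2 for x in xs) - sum(xs)**2/n)
    a = ybar - b * xbar
    yhat = [a + b * x for x in xs]               # predicted y's
    sstot = sum((y - ybar)**2 for y in ys)
    ssreg = sum((yh - ybar)**2 for yh in yhat)
    ssres = sum((y - yh)**2 for y, yh in zip(ys, yhat))
    # FACT 1: ssreg + ssres equals sstot (up to rounding).
    # FACT 2: ssreg / sstot equals r^2, the squared correlation.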
B. Formulas for Random Variables:
13. Mean(X):
Given a random variable X which can take on values
X1, X2, X3, ..., Xn, with corresponding probabilities
p1, p2, p3, ..., pn, then
Mean(X) = SUM(pi * Xi).
14. Variance, Var(X):
Var(X) = SUM{pi * (Xi - Mean(X))^2}
Another formula for Var(X):
Var(X) = Mean(X^2) - (Mean(X))^2.
15. Standard Deviation, SD(X):
SD(X) = square root of Var(X).
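A short Python sketch of formulas 13-15 for a discrete random variable given as a list of values and probabilities (hypothetical numbers):
    values = [0, 1, 2, 3]                        # hypothetical values of X
    probs  = [0.1, 0.4, 0.3, 0.2]                # corresponding probabilities (sum to 1)
    mean_x = sum(p * x for p, x in zip(probs, values))
    var_x  = sum(p * (x - mean_x)**2 for p, x in zip(probs, values))
    var_x2 = sum(p * x**2 for p, x in zip(probs, values)) - mean_x**2   # same value
    sd_x   = var_x ** 0.5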
16. Independent Random Variables:
If X and Y are independent random variables, then
Var(X + Y) = Var(X) + Var(Y).
C. Formulas Involving Probabilities:
17. If A represents an event, then
0 <= prob(A) <= 1.
18. If -A is the event "not A" (that is, event A does NOT occur), then
prob(-A) = 1 - prob(A).
19. If A and B are events, then
prob(A or B) = prob(A) + prob(B) - prob(A and B).
20. If A and B are exclusive [or disjoint] events, then
prob(A or B) = prob(A) + prob(B), and
prob(A and B) = 0.
[Definition: A and B are exclusive events if A and B cannot both occur.]
21. Definition. A and B are independent events if and only if
prob(A and B) = prob(A) * prob(B).
D. Binomial Distribution:
22. A random variable X has a binomial distribution with
parameters N and p if, for any integer m between 0 and N,
prob(X = m) = C(N, m) * p^m * (1 - p)^(N - m),
where C(N, m) is the binomial coefficient, defined as
C(N, m) = N! / (m! * (N - m)!).
Notation: X ~ Binom(N, p).
23. Note that:
(1) The only values that such a binomial random variable X can
take on are between 0 and N. It cannot have fractional
or negative values.
(2) X can be thought of as the sum of N independent Bernoulli
random variables with the same parameter p,
X = X1 + X2 + ... + XN,
where a Bernoulli random variable Xi is a random
variable which takes on the value 1 with probability p and
the value 0 with probability (1 - p). We say that
Xi ~ Ber(p).
'Bernoulli' is a special case of 'Binomial': thus you can
also say
Xi ~ Binom(1, p).
24. Mean and Variance:
If X ~ Binom(N, p), then:
Mean(X) = N * p, and
Var(X) = N * p * (1 - p), and therefore
StdDev(X) = sqrt(N * p * (1 - p)).
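A Python sketch of the binomial probability, mean, and variance, using math.comb for the binomial coefficient C(N, m) (N and p are hypothetical):
    from math import comb, sqrt

    N, p = 10, 0.3                               # hypothetical parameters
    def binom_prob(m):
        return comb(N, m) * p**m * (1 - p)**(N - m)

    mean_x = N * p                               # 3.0
    var_x  = N * p * (1 - p)                     # 2.1
    sd_x   = sqrt(var_x)
    total  = sum(binom_prob(m) for m in range(N + 1))   # sums to 1.0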
E. Tests and Confidence Intervals: Normal Distributions
25. Suppose X1, X2, ..., Xn are
independent random variables with distribution N(mu, sigma).
Then
z = (xbar - mu)/(sigma/sqrt(n))
has distribution N(0, 1). This fact can be used to test
hypotheses about the original distribution. For example, if
the null hypothesis is H0: mu = mu0, and the absolute
value of z, |z|, is larger than 1.96, you would reject (at the
.05 significance level) the null hypothesis. This is a two-sided
test, i.e., you reject H0 if z is bigger than 1.96 or
smaller than -1.96.
Note that this test requires that you know sigma in advance.
26. A 95% confidence interval for the true mean mu is given by
(xbar - 1.96*sigma/sqrt(n), xbar + 1.96*sigma/sqrt(n)).
This means that in a long run of experiments involving observations
from the distribution N(mu, sigma), you can expect that the true mean
mu will lie between the confidence limits given above about 95% of
the time. Note that the true mean is a fixed number which
does not change from one experiment to the next; instead, it is the
confidence limits which vary; they are in fact themselves
random variables.
These confidence limits also are based on the assumption that sigma
is known. Because this is not usually the case, they are therefore
of limited value.
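A Python sketch of the z-test and 95% confidence interval in items 25-26, assuming sigma is known (all numbers are hypothetical):
    from math import sqrt

    xbar, mu0, sigma, n = 5.4, 5.0, 1.2, 36      # hypothetical summary numbers
    z = (xbar - mu0) / (sigma / sqrt(n))
    reject_05 = abs(z) > 1.96                    # two-sided test at the .05 level
    half_width = 1.96 * sigma / sqrt(n)
    ci = (xbar - half_width, xbar + half_width)  # 95% confidence interval for mu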
27. One-sample t-test:
If X1, X2, ..., Xn are as described in 25. above, but sigma is not known,
tests are carried out using t-statistics rather than z-statistics. Let
t = (xbar - mu0) / (s/sqrt(n)),
where s is the sample standard deviation (see 3. above). Then to
test the hypothesis H0: mu = mu0, you compare the observed value of
the statistic t to values for the t-distribution with n-1 degrees of freedom
(Table T-11 or Table D in Moore & McCabe).
28. A 100*(1 - 2p)% confidence interval for the true mean mu is given by
(xbar - t(p, n-1)*s/sqrt(n), xbar + t(p, n-1)*s/sqrt(n)),
where t(p, n-1) is the value such that for a random variable with a t
distribution with (n - 1) degrees of freedom, the probability of being
larger than t(p, n-1) is p.
[Example: for a 95% confidence interval with 10 degrees of freedom,
t(.025, 10) = 2.228.]
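A Python sketch of the one-sample t-statistic and t confidence interval; the critical value t(p, n-1) can be taken from the t table (as here) or computed with scipy.stats.t.ppf. The data are hypothetical, with n = 11 so the tabled value t(.025, 10) = 2.228 applies:
    from math import sqrt

    xs = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.5, 4.9, 5.2, 5.4, 5.0]   # hypothetical data
    n = len(xs)
    xbar = sum(xs) / n
    s = (sum((x - xbar)**2 for x in xs) / (n - 1)) ** 0.5
    mu0 = 5.0                                    # hypothesized mean
    t = (xbar - mu0) / (s / sqrt(n))             # compare to t with n - 1 = 10 d.f.
    tcrit = 2.228                                # t(.025, 10) from the table
    ci = (xbar - tcrit * s / sqrt(n), xbar + tcrit * s / sqrt(n))   # 95% CI for mu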
29. Two-sample t-test:
If X1, X2, ..., Xn1 are an iid
sample from the distribution N(mu1, sigma1), and
Y1, Y2, ..., Yn2 are similarly an iid
sample from N(mu2, sigma2), then a two-sample t-test
of the hypothesis
H0: mu1 = mu2
is based on the statistic
t = (xbar - ybar) / sqrt(s1^2/n1 + s2^2/n2).
The statistic t is compared to a t-distribution. The degrees of freedom
for this test may be computed in two different ways:
(1) (Conservative) Let d.f. = min(n1 - 1, n2 - 1).
(2) (Satterthwaite) Let d.f. = top / bottom, where
top = (s1^2/n1 + s2^2/n2)^2, and
bottom = (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1).
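A Python sketch of this two-sample t-statistic with both degree-of-freedom choices (the summary statistics are hypothetical):
    xbar, s1, n1 = 10.3, 2.1, 25                 # hypothetical: mean, s.d., size, sample 1
    ybar, s2, n2 = 9.1, 3.4, 30                  # hypothetical: mean, s.d., size, sample 2
    se = (s1**2 / n1 + s2**2 / n2) ** 0.5
    t = (xbar - ybar) / se
    df_conservative = min(n1 - 1, n2 - 1)        # option (1)
    top = (s1**2 / n1 + s2**2 / n2) ** 2         # option (2), Satterthwaite
    bottom = (s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1)
    df_satterthwaite = top / bottom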
30. Alternative two-sample t-test:
Assume samples X1, X2, ..., Xn1 and Y1, Y2, ..., Yn2 random samples as
in 29. above. Define the pooled estimate sp^2 of the variance as
sp^2 = {(n1 - 1) * s1^2 + (n2 - 1) * s2^2} / (n1 + n2 - 2),
and then define the t-statistic as
t = (xbar - ybar) / (sp * sqrt(1/n1 + 1/n2)).
Compare this to a t-distribution with (n1 + n2 - 2) degrees of freedom.
This test should be used only when s1 and s2 are reasonably close in value.
There is a test for equality of sigma1 and sigma2 which is carried out by
the t-test procedure (proc ttest) in SAS.
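A sketch of the pooled version in Python, reusing the same hypothetical summary statistics as in the sketch above (appropriate only when s1 and s2 are reasonably close):
    xbar, s1, n1 = 10.3, 2.1, 25                 # hypothetical summary statistics
    ybar, s2, n2 = 9.1, 3.4, 30
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    sp = sp2 ** 0.5                              # pooled standard deviation
    t_pooled = (xbar - ybar) / (sp * (1/n1 + 1/n2) ** 0.5)
    df_pooled = n1 + n2 - 2                      # compare to t with these d.f.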
31. Overall guidelines for t-tests:
The overall guidelines for the use of t-tests are the following:
(1) The t-test described in 29. above, with the Satterthwaite degrees of
freedom, is usually satisfactory.
(2) The t-test described in 29., with d.f. = min(n1 - 1, n2 - 1), tends to
be more conservative; that is, less likely to reject the null hypothesis.
(3) The t-test described in 30. above may be preferable if s1 and s2 are
close together in value, so that the pooled s.d. sp can be used.
F. Proportions: Confidence Intervals and Tests
32. Wilson's Estimate of the Sample Proportion: One Sample.
Given X ~ Binom(N, p), Wilson's estimate of the sample proportion is:
pW = (X + 2) / (N + 4).
Approximate standard error of this estimate:
SE(pW) = sqrt(pW * (1 - pW) / (N + 4))
95% Confidence Interval for the true value p :
pW +/- 1.96 * SE(pW).
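A Python sketch of the Wilson ("plus four") interval, with hypothetical counts:
    X, N = 12, 40                                # hypothetical: 12 successes in 40 trials
    pW = (X + 2) / (N + 4)
    se = (pW * (1 - pW) / (N + 4)) ** 0.5
    ci = (pW - 1.96 * se, pW + 1.96 * se)        # approximate 95% CI for p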
33. One-Sample Test for a Proportion
Given X ~ Binom(N, p) and hypothesis H0: p = p0
a test for H0 can be based on
z = (X/N - p0) / sqrt(p0*(1 - p0) / N),
which is compared to the N(0, 1) distribution.
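A matching sketch for the one-sample test, where p0 is the hypothesized proportion (counts and p0 are hypothetical):
    X, N, p0 = 12, 40, 0.5                       # hypothetical counts and null value
    z = (X / N - p0) / (p0 * (1 - p0) / N) ** 0.5
    reject_05 = abs(z) > 1.96                    # two-sided test at the .05 level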
34. Proportions from Two Independent Samples: Confidence Interval for Difference.
Assume X1 ~ Binom(N1, p1) and X2 ~ Binom(N2, p2)
Let pW1 = (X1 + 1)/(N1 + 2) and pW2 = (X2 + 1)/(N2 + 2),
and DW = pW1 - pW2. Then the approximate standard error of DW is
SE(DW) = sqrt(pW1*(1 - pW1)/(N1 + 2) + pW2*(1 - pW2)/(N2 + 2)).
A 99% confidence interval for the true difference D is given by:
DW +/- 2.576*SE(DW).
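A Python sketch of this interval for the difference of proportions (the counts are hypothetical):
    X1, N1 = 18, 50                              # hypothetical counts, sample 1
    X2, N2 = 10, 45                              # hypothetical counts, sample 2
    pW1 = (X1 + 1) / (N1 + 2)
    pW2 = (X2 + 1) / (N2 + 2)
    DW = pW1 - pW2
    se = (pW1 * (1 - pW1) / (N1 + 2) + pW2 * (1 - pW2) / (N2 + 2)) ** 0.5
    ci_99 = (DW - 2.576 * se, DW + 2.576 * se)   # 99% CI for p1 - p2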
35. Two-Sample Test for Equality of Proportions:
Assume X1 ~ Binom(N1, p1) and X2 ~ Binom(N2, p2)
Assuming H0: p1 = p2 = p, the pooled estimate of p is defined as:
ppool = (X1 + X2) / (N1 + N2).
An approximate standard error of ppool is given by
SE(ppool) = sqrt{ppool*(1 - ppool)*(1/N1 + 1/N2)}.
A test of the hypothesis H0: p1 = p2 is based on the z-statistic
z = (X1/N1 - X2/N2) / SE(ppool),
which is compared to the N(0, 1) distribution.
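The corresponding pooled z-test, using the same hypothetical counts as above:
    X1, N1 = 18, 50                              # hypothetical counts, sample 1
    X2, N2 = 10, 45                              # hypothetical counts, sample 2
    ppool = (X1 + X2) / (N1 + N2)
    se_pool = (ppool * (1 - ppool) * (1/N1 + 1/N2)) ** 0.5
    z = (X1/N1 - X2/N2) / se_pool
    reject_05 = abs(z) > 1.96                    # two-sided test at the .05 level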
36. Chi-Square Test for Equality of Proportions: 2 x 2 Table
Assume X1 ~ Binom(N1, p1) and X2 ~ Binom(N2, p2)
and assume the data are represented in a 2 x 2 table as
follows:
1 2
---------------------
| | |
Event | X1 | X2 | X1 + X2
| | |
---------------------
| | |
No event | N1 - X1 | N2 - X2 | N1 + N2 - (X1 + X2)
| | |
-------------------------
N1 N2 | N1 + N2
Represent the numbers in the table more compactly as:
1 2
---------------------
| | |
Event | a | b | a + b
| | |
---------------------
| | |
No event | c | d | c + d
| | |
-------------------------
a + c b + d | N = a + b + c + d
Then the chi-square statistic for this table is computed as
X^2 = N*(a*d - b*c)^2 / [(a + b)*(c + d)*(a + c)*(b + d)].
To test the hypothesis H0: p1 = p2,
compare X^2 to a chi-square distribution with 1 degree of freedom.
[Note: this is equivalent to a two-sided z-test as described in 35. above.]
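A Python sketch of the 2 x 2 chi-square statistic from cell counts a, b, c, d (hypothetical counts); the p-value can be read from a chi-square table or computed with scipy.stats.chi2.sf(X2, 1):
    a, b = 20, 30                                # hypothetical cell counts, Event row
    c, d = 80, 70                                # hypothetical cell counts, No-event row
    N = a + b + c + d
    X2 = N * (a * d - b * c)**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    reject_05 = X2 > 3.84                        # 3.84 = .95 quantile of chi-square, 1 d.f.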
37. Chi-Square Test for Higher-Dimensional Tables:
Assume subjects in a study are cross-classified by two characteristics
(for example, gender and age category). Assume the counts of
individuals are displayed in cells of a table like the following:
Gender
Men Women
-----------------------
| | |
18-35 | a | b | rm1
| | |
-----------------------
AGE | | |
36-49 | c | d | rm2
CATEGORY | | |
-----------------------
| | |
50-65 | e | f | rm3
| | |
-----------------------
| | |
Over 65 | g | h | rm4
| | |
----------------------------
cm1 cm2 | N
The numbers rm1, rm2, rm3, and rm4 are called row margins,
and are the sums of the counts within the cells in the corresponding
row; for example, rm1 = a + b. Similarly, cm1 and cm2 are column
margins; in the table above, cm1 = a + c + e + g. The total, N,
of all the cells in the table is equal to the sum of the column margins
and also to the sum of the row margins: in the table above,
N = cm1 + cm2 = rm1 + rm2 + rm3 + rm4.
The expected count for the cell in the i-th row and j-th
column, conditional on the observed margins, is defined as
Ei,j = rmi * cmj / N.
Let Oi,j = observed count for the i,j-th cell;
for example, O3,2 = f in the table above.
The Pearson chi-squared statistic X^2 for this table
is defined as
X^2 = SUM{ (Oi,j - Ei,j)^2 / Ei,j }.
This statistic is used for testing the null hypothesis that the
proportion of counts in the j-th column of each row is the same
regardless of the row.
The X^2 statistic is compared to a chi-square distribution
with (R - 1) x (C - 1) degrees of freedom, where R is the number of rows
and C is the number of columns. The null hypothesis is rejected if
the value of X^2 is large, or equivalently, the associated p-value is small.
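A Python sketch of the Pearson statistic for a general R x C table of observed counts (the counts are hypothetical; scipy.stats.chi2_contingency carries out the same test):
    # Hypothetical 4 x 2 table (rows = age categories, columns = gender).
    observed = [[20, 30],
                [25, 25],
                [15, 35],
                [10, 40]]
    R, C = len(observed), len(observed[0])
    row_margins = [sum(row) for row in observed]
    col_margins = [sum(observed[i][j] for i in range(R)) for j in range(C)]
    N = sum(row_margins)
    X2 = sum((observed[i][j] - row_margins[i] * col_margins[j] / N) ** 2
             / (row_margins[i] * col_margins[j] / N)
             for i in range(R) for j in range(C))
    df = (R - 1) * (C - 1)                       # compare X2 to chi-square with df d.f.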