Web address of this page: http://www.biostat.umn.edu/~john-c/ph5450.formulas.html
Most recent update: February 1, 2004.
A. Formulas for Samples
1. Sample mean of observations x1, x2, ..., xn :
xbar = (1/n) * SUM(xi)
2. Sample variance of observations x1, x2, ..., xn:
Method 1 : Variance = {SUM[(xi - xbar)^2]}/(n - 1),
where xbar is the sample mean.
Method 2 : Variance = {SUM(xi^2) - (1/n) * [SUM(xi)]^2} / (n - 1)
Note: Both methods give the same answer.
3. Sample standard deviation of observations x1, x2, ..., xn:
s = square root(Variance)
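As an illustration, here is a short Python sketch of formulas 1-3, using a small made-up data set:
    xs = [2.0, 4.0, 4.0, 5.0, 7.0]                          # hypothetical observations
    n = len(xs)
    xbar = sum(xs) / n                                      # sample mean
    var1 = sum((x - xbar)**2 for x in xs) / (n - 1)         # variance, Method 1
    var2 = (sum(x**2 for x in xs) - sum(xs)**2 / n) / (n - 1)   # variance, Method 2
    s = var1 ** 0.5                                         # sample standard deviation
    # var1 and var2 are equal (here both are 3.3).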
4. Sample median of observations x1, x2, ..., xn (sorted from lowest to highest) :
If n is odd, the median is the value of the middle observation.
If n is even, the median is the average of the two middle
observations.
5. Lower quartile: Median of the observations below the median.
(Also called the first quartile.)
6. Upper quartile: Median of the observations above the median.
(Also called the third quartile.)
7. Interquartile range (IQR): The difference between the upper
quartile and the lower quartile.
8. Def. A point in a sample is an outlier if it is larger than
third quartile + 1.5 * IQR,
or smaller than
first quartile - 1.5 * IQR.
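A Python sketch of items 4-8, following the "median of the lower/upper half" definition used here (statistical software may use slightly different quartile rules):
    def median(v):
        v = sorted(v)
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 == 1 else (v[mid - 1] + v[mid]) / 2

    xs = sorted([1, 3, 4, 7, 8, 9, 12, 40])      # hypothetical data
    m = median(xs)
    lower = [x for x in xs if x < m]             # observations below the median
    upper = [x for x in xs if x > m]             # observations above the median
    q1, q3 = median(lower), median(upper)        # lower and upper quartiles
    iqr = q3 - q1                                # interquartile range
    outliers = [x for x in xs if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr]
    # Here q1 = 3.5, q3 = 10.5, iqr = 7, and 40 is flagged as an outlier.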
9. Correlation coefficient, r:
Given observations
x1, x2, ..., xi, ... , xn and
y1, y2, ..., yi, ... , yn,
where each yi is paired with the corresponding xi,
the CORRELATION r of the x's and the y's is
r = (1/(n - 1)) * SUM{(xi - xbar)*(yi - ybar)} / (sx*sy),
where xbar and ybar are the sample means of the x's and y's,
and sx and sy are the sample standard deviations.
Another formula for r:
r = (1/(n - 1)) * [SUM(xi*yi) - (1/n)*SUM(xi)*SUM(yi)] / (sx*sy).
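A Python sketch of the correlation formula, using hypothetical paired data:
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]               # hypothetical x's
    ys = [2.0, 1.0, 4.0, 3.0, 6.0]               # hypothetical paired y's
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = (sum((x - xbar)**2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - ybar)**2 for y in ys) / (n - 1)) ** 0.5
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)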
10. Regression coefficients:
Given observations
x1, x2, ..., xi, ... , xn and
y1, y2, ..., yi, ... , yn,
The SLOPE b of the least-squares regression line is
b = r * sy / sx,
where r is the correlation coefficient and sx and sy are
sample standard deviations of the two sets of observations.
The INTERCEPT a of the regression line is
a = ybar - b * xbar,
where ybar and xbar are the sample means and b is the slope of
the regression line.
The complete equation of the regression line is thus
y = a + b*x.
11. Another formula for the slope b:
b = TOP / BOTTOM, where
TOP = SUM(xi * yi) - (1/n) * SUM(xi) * SUM(yi), and
BOTTOM = SUM(xi^2) - (1/n) * SUM(xi) * SUM(xi).
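A Python sketch of the slope and intercept formulas (items 10-11), repeating the same hypothetical data as in the correlation sketch so that it runs on its own:
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.0, 1.0, 4.0, 3.0, 6.0]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = (sum((x - xbar)**2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - ybar)**2 for y in ys) / (n - 1)) ** 0.5
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx                              # slope (item 10)
    a = ybar - b * xbar                          # intercept
    # Alternative slope formula (item 11) gives the same value:
    top = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    bottom = sum(x**2 for x in xs) - sum(xs)**2 / n
    b_alt = top / bottom                         # equals b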
12. Formulas involving sums of squares:
Given observations of paired values of x's and y's,
x1, x2, ..., xi, ... , xn and
y1, y2, ..., yi, ... , yn,
SSTOT = adjusted total sum of squares = SUM{(yi - ybar)^2},
SSREG = sum of squares due to regression = SUM{(yhati - ybar)^2},
SSRES = sum of squares residual = SUM{(yi - yhati)^2},
where ybar = mean of the observed y's, and
yhati = predicted i-th y, yhati = a + b * xi.
FACT 1 : SSREG + SSRES = SSTOT
FACT 2 : r^2 = SSREG / SSTOT (that is, the
square of the correlation is the ratio of the regression
sum of squares to the adjusted total sum of squares.)
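These two facts can be checked numerically; a self-contained Python sketch with the same hypothetical data as above:
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.0, 1.0, 4.0, 3.0, 6.0]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum(x*y for x, y in zip(xs, ys)) - sum(xs)*sum(ys)/n) / \
        (sum(x**2 for x in xs) - sum(xs)**2/n)
    a = ybar - b * xbar
    yhat = [a + b * x for x in xs]               # predicted y's
    sstot = sum((y - ybar)**2 for y in ys)
    ssreg = sum((yh - ybar)**2 for yh in yhat)
    ssres = sum((y - yh)**2 for y, yh in zip(ys, yhat))
    # FACT 1: ssreg + ssres equals sstot (up to rounding).
    # FACT 2: ssreg / sstot equals r^2, the squared correlation.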
B. Formulas for Random Variables:
13. Mean(X):
Given a random variable X which can take on values
X1, X2, X3, ..., Xn, with corresponding probabilities
p1, p2, p3, ..., pn, then
Mean(X) = SUM(pi * Xi).
14. Variance, Var(X):
Var(X) = SUM{pi * (Xi - Mean(X))^2}
Another formula for Var(X):
Var(X) = Mean(X^2) - (Mean(X))^2.
15. Standard Deviation, SD(X):
SD(X) = square root of Var(X).
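A short Python sketch of formulas 13-15 for a discrete random variable given as a list of values and probabilities (hypothetical numbers):
    values = [0, 1, 2, 3]                        # hypothetical values of X
    probs  = [0.1, 0.4, 0.3, 0.2]                # corresponding probabilities (sum to 1)
    mean_x = sum(p * x for p, x in zip(probs, values))
    var_x  = sum(p * (x - mean_x)**2 for p, x in zip(probs, values))
    var_x2 = sum(p * x**2 for p, x in zip(probs, values)) - mean_x**2   # same value
    sd_x   = var_x ** 0.5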
16. Independent Random Variables:
If X and Y are independent random variables, then
Var(X + Y) = Var(X) + Var(Y).
C. Formulas Involving Probabilities:
17. If A represents an event, then
0 <= prob(A) <= 1.
18. If -A is the event "not A" (that is, event A does NOT occur), then
prob(-A) = 1 - prob(A).
19. If A and B are events, then
prob(A or B) = prob(A) + prob(B) - prob(A and B).
20. If A and B are exclusive [or disjoint] events, then
prob(A or B) = prob(A) + prob(B), and
prob(A and B) = 0.
[Definition: A and B are exclusive events if A and B cannot both occur.]
21. Definition. A and B are independent events if and only if
prob(A and B) = prob(A) * prob(B).
D. Binomial Distribution:
22. A random variable X has a binomial distribution with
parameters N and p if, for any integer m between 0 and N,
prob(X = m) = C(N, m) * p^m * (1 - p)^(N - m),
where C(N, m) is the binomial coefficient, defined as
C(N, m) = N! / (m! * (N - m)!).
Notation: X ~ Binom(N, p).
23. Note that:
(1) The only values that such a binomial random variable X can
take on are between 0 and N. It cannot have fractional
or negative values.
(2) X can be thought of as the sum of N independent Bernoulli
random variables with the same parameter p,
X = X1 + X2 + ... + XN,
where a Bernoulli random variable Xi is a random
variable which takes on the value 1 with probability p and
the value 0 with probability (1 - p). We say that
Xi ~ Ber(p).
'Bernoulli' is a special case of 'Binomial': thus you can
also say
Xi ~ Binom(1, p).
24. Mean and Variance:
If X ~ Binom(N, p), then:
Mean(X) = N * p, and
Var(X) = N * p * (1 - p), and therefore
StdDev(X) = sqrt(N * p * (1 - p)).
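A Python sketch of the binomial probability, mean, and variance, using math.comb for the binomial coefficient C(N, m) (N and p are hypothetical):
    from math import comb, sqrt

    N, p = 10, 0.3                               # hypothetical parameters
    def binom_prob(m):
        return comb(N, m) * p**m * (1 - p)**(N - m)

    mean_x = N * p                               # 3.0
    var_x  = N * p * (1 - p)                     # 2.1
    sd_x   = sqrt(var_x)
    total  = sum(binom_prob(m) for m in range(N + 1))   # sums to 1.0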
E. Tests and Confidence Intervals: Normal Distributions
25. Suppose X1, X2, ..., Xn are
independent random variables with distribution N(mu, sigma).
Then
z = (xbar - mu)/(sigma/sqrt(n))
has distribution N(0, 1). This fact can be used to test
hypotheses about the original distribution. For example, if
the null hypothesis is H0: mu = mu0, and the absolute
value of z, |z|, is larger than 1.96, you would reject (at the
.05 significance level) the null hypothesis. This is a two-sided
test, i.e., you reject H0 if z is bigger than 1.96 or
smaller than -1.96.
Note that this test requires that you know sigma in advance.
26. A 95% confidence interval for the true mean mu is given by
(xbar - 1.96*sigma/sqrt(n), xbar + 1.96*sigma/sqrt(n)).
This means that in a long run of experiments involving observations
from the distribution N(mu, sigma), you can expect that the true mean
mu will lie between the confidence limits given above about 95% of
the time. Note that the true mean is a fixed number which
does not change from one experiment to the next; instead, it is the
confidence limits which vary; they are in fact themselves
random variables.
These confidence limits also are based on the assumption that sigma
is known. Because this is not usually the case, they are therefore
of limited value.
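A Python sketch of the z-test and 95% confidence interval in items 25-26, assuming sigma is known (all numbers are hypothetical):
    from math import sqrt

    xbar, mu0, sigma, n = 5.4, 5.0, 1.2, 36      # hypothetical summary numbers
    z = (xbar - mu0) / (sigma / sqrt(n))
    reject_05 = abs(z) > 1.96                    # two-sided test at the .05 level
    half_width = 1.96 * sigma / sqrt(n)
    ci = (xbar - half_width, xbar + half_width)  # 95% confidence interval for mu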
27. One-sample t-test:
If X1, X2, ..., Xn are as described in 25. above, but sigma is not known,
tests are carried out using t-statistics rather than z-statistics. Let
t = (xbar - mu0) / (s/sqrt(n)),
where s is the sample standard deviation (see 3. above). Then to
test the hypothesis H0: mu = mu0, you compare the observed value of
the statistic t to values for the t-distribution with n-1 degrees of freedom
(Table T-11 or Table D in Moore & McCabe).
28. A 100*(1 - 2p)% confidence interval for the true mean mu is given by
(xbar - t(p, n-1)*s/sqrt(n), xbar + t(p, n-1)*s/sqrt(n)),
where t(p, n-1) is the value such that for a random variable with a t
distribution with (n - 1) degrees of freedom, the probability of being
larger than t(p, n-1) is p.
[Example: for a 95% confidence interval with 10 degrees of freedom,
t(.025, 10) = 2.228.]
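A Python sketch of the one-sample t-statistic and t confidence interval; the critical value t(p, n-1) can be taken from the t table (as here) or computed with scipy.stats.t.ppf. The data are hypothetical, with n = 11 so the tabled value t(.025, 10) = 2.228 applies:
    from math import sqrt

    xs = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.5, 4.9, 5.2, 5.4, 5.0]   # hypothetical data
    n = len(xs)
    xbar = sum(xs) / n
    s = (sum((x - xbar)**2 for x in xs) / (n - 1)) ** 0.5
    mu0 = 5.0                                    # hypothesized mean
    t = (xbar - mu0) / (s / sqrt(n))             # compare to t with n - 1 = 10 d.f.
    tcrit = 2.228                                # t(.025, 10) from the table
    ci = (xbar - tcrit * s / sqrt(n), xbar + tcrit * s / sqrt(n))   # 95% CI for mu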
29. Two-sample t-test:
If X1, X2, ..., Xn1 are an iid
sample from the distribution N(mu1, sigma1), and
Y1, Y2, ..., Yn2 are similarly an iid
sample from N(mu2, sigma2), then a two-sample t-test
of the hypothesis
H0: mu1 = mu2
is based on the statistic
t = (xbar - ybar) / sqrt(s1^2/n1 + s2^2/n2).
The statistic t is compared to a t-distribution. The degrees of freedom
for this test may be computed in two different ways:
(1) (Conservative) Let d.f. = min(n1 - 1, n2 - 1).
(2) (Satterthwaite) Let d.f. = top / bottom, where
top = (s1^2/n1 + s2^2/n2)^2, and
bottom = (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1).
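A Python sketch of this two-sample t-statistic with both degree-of-freedom choices (the summary statistics are hypothetical):
    xbar, s1, n1 = 10.3, 2.1, 25                 # hypothetical: mean, s.d., size, sample 1
    ybar, s2, n2 = 9.1, 3.4, 30                  # hypothetical: mean, s.d., size, sample 2
    se = (s1**2 / n1 + s2**2 / n2) ** 0.5
    t = (xbar - ybar) / se
    df_conservative = min(n1 - 1, n2 - 1)        # option (1)
    top = (s1**2 / n1 + s2**2 / n2) ** 2         # option (2), Satterthwaite
    bottom = (s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1)
    df_satterthwaite = top / bottom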
30. Alternative two-sample t-test:
Assume samples X1, X2, ..., Xn1 and Y1, Y2, ..., Yn2 random samples as
in 29. above. Define the pooled estimate sp^2 of the variance as
sp^2 = {(n1 - 1) * s1^2 + (n2 - 1) * s2^2} / (n1 + n2 - 2),
and then define the t-statistic as
t = (xbar - ybar) / (sp * sqrt(1/n1 + 1/n2)).
Compare this to a t-distribution with (n1 + n2 - 2) degrees of freedom.
This test should be used only when s1 and s2 are reasonably close in value.
There is a test for equality of sigma1 and sigma2 which is carried out by
the t-test procedure (proc ttest) in SAS.
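A sketch of the pooled version in Python, reusing the same hypothetical summary statistics as in the sketch above (appropriate only when s1 and s2 are reasonably close):
    xbar, s1, n1 = 10.3, 2.1, 25                 # hypothetical summary statistics
    ybar, s2, n2 = 9.1, 3.4, 30
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    sp = sp2 ** 0.5                              # pooled standard deviation
    t_pooled = (xbar - ybar) / (sp * (1/n1 + 1/n2) ** 0.5)
    df_pooled = n1 + n2 - 2                      # compare to t with these d.f.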
31. Overall guidelines for t-tests:
The overall guidelines for the use of t-tests are the following:
(1) The t-test described in 29. above, with the Satterthwaite degrees of
freedom, is usually satisfactory.
(2) The t-test described in 29., with d.f. = min(n1 - 1, n2 - 1), tends to
be more conservative; that is, less likely to reject the null hypothesis.
(3) The t-test described in 30. above may be preferable if s1 and s2 are
close together in value, so that the pooled s.d. sp can be used.
F. Proportions: Confidence Intervals and Tests
32. Wilson's Estimate of the Sample Proportion: One Sample.
Given X ~ Binom(N, p), Wilson's estimate of the sample proportion is:
pW = (X + 2) / (N + 4).
Approximate standard error of this estimate:
SE(pW) = sqrt(pW * (1 - pW) / (N + 4))
95% Confidence Interval for the true value p :
pW +/- 1.96 * SE(pW).
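A Python sketch of the Wilson ("plus four") interval, with hypothetical counts:
    X, N = 12, 40                                # hypothetical: 12 successes in 40 trials
    pW = (X + 2) / (N + 4)
    se = (pW * (1 - pW) / (N + 4)) ** 0.5
    ci = (pW - 1.96 * se, pW + 1.96 * se)        # approximate 95% CI for p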
33. One-Sample Test for a Proportion
Given X ~ Binom(N, p) and hypothesis H0: p = p0
a test for H0 can be based on
z = (X/N - p0) / sqrt(p0*(1 - p0) / N),
which is compared to the N(0, 1) distribution.
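A matching sketch for the one-sample test, where p0 is the hypothesized proportion (counts and p0 are hypothetical):
    X, N, p0 = 12, 40, 0.5                       # hypothetical counts and null value
    z = (X / N - p0) / (p0 * (1 - p0) / N) ** 0.5
    reject_05 = abs(z) > 1.96                    # two-sided test at the .05 level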
34. Proportions from Two Independent Samples: Confidence Interval for Difference.
Assume X1 ~ Binom(N1, p1) and X2 ~ Binom(N2, p2)
Let pW1 = (X1 + 1)/(N1 + 2) and pW2 = (X2 + 1)/(N2 + 2),
and DW = pW1 - pW2. Then the approximate standard error of DW is
SE(DW) = sqrt(pW1*(1 - pW1)/(N1 + 2) + pW2*(1 - pW2)/(N2 + 2)).
A 99% confidence interval for the true difference D is given by:
DW +/- 2.576*SE(DW).
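A Python sketch of this interval for the difference of proportions (the counts are hypothetical):
    X1, N1 = 18, 50                              # hypothetical counts, sample 1
    X2, N2 = 10, 45                              # hypothetical counts, sample 2
    pW1 = (X1 + 1) / (N1 + 2)
    pW2 = (X2 + 1) / (N2 + 2)
    DW = pW1 - pW2
    se = (pW1 * (1 - pW1) / (N1 + 2) + pW2 * (1 - pW2) / (N2 + 2)) ** 0.5
    ci_99 = (DW - 2.576 * se, DW + 2.576 * se)   # 99% CI for p1 - p2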
35. Two-Sample Test for Equality of Proportions:
Assume X1 ~ Binom(N1, p1) and X2 ~ Binom(N2, p2)
Assuming H0: p1 = p2 = p, the pooled estimate of p is defined as:
ppool = (X1 + X2) / (N1 + N2).
An approximate standard error of ppool is given by
SE(ppool) = sqrt{ppool*(1 - ppool)*(1/N1 + 1/N2)}.
A test of the hypothesis H0: p1 = p2 is based on the z-statistic
z = (X1/N1 - X2/N2) / SE(ppool),
which is compared to the N(0, 1) distribution.
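The corresponding pooled z-test, using the same hypothetical counts as above:
    X1, N1 = 18, 50                              # hypothetical counts, sample 1
    X2, N2 = 10, 45                              # hypothetical counts, sample 2
    ppool = (X1 + X2) / (N1 + N2)
    se_pool = (ppool * (1 - ppool) * (1/N1 + 1/N2)) ** 0.5
    z = (X1/N1 - X2/N2) / se_pool
    reject_05 = abs(z) > 1.96                    # two-sided test at the .05 level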
36. Chi-Square Test for Equality of Proportions: 2 x 2 Table
Assume X1 ~ Binom(N1, p1) and X2 ~ Binom(N2, p2)
and assume the data are represented in a 2 x 2 table as
follows:
1 2
---------------------
| | |
Event | X1 | X2 | X1 + X2
| | |
---------------------
| | |
No event | N1 - X1 | N2 - X2 | N1 + N2 - (X1 + X2)
| | |
-------------------------
N1 N2 | N1 + N2
Represent the numbers in the table more compactly as:
1 2
---------------------
| | |
Event | a | b | a + b
| | |
---------------------
| | |
No event | c | d | c + d
| | |
-------------------------
a + c b + d | N = a + b + c + d
Then the chi-square statistic for this table is computed as
X^2 = N*(a*d - b*c)^2 / [(a + b)*(c + d)*(a + c)*(b + d)].
To test the hypothesis H0: p1 = p2,
compare X^2 to a chi-square distribution with 1 degree of freedom.
[Note: this is equivalent to a two-sided z-test as described in 35. above.]
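A Python sketch of the 2 x 2 chi-square statistic from cell counts a, b, c, d (hypothetical counts); the p-value can be read from a chi-square table or computed with scipy.stats.chi2.sf(X2, 1):
    a, b = 20, 30                                # hypothetical cell counts, Event row
    c, d = 80, 70                                # hypothetical cell counts, No-event row
    N = a + b + c + d
    X2 = N * (a * d - b * c)**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    reject_05 = X2 > 3.84                        # 3.84 = .95 quantile of chi-square, 1 d.f.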
37. Chi-Square Test for Higher-Dimensional Tables:
Assume subjects in a study are cross-classified by two characteristics
(for example, gender and age category). Assume the counts of
individuals are displayed in cells of a table like the following:
Gender
Men Women
-----------------------
| | |
18-35 | a | b | rm1
| | |
-----------------------
AGE | | |
36-49 | c | d | rm2
CATEGORY | | |
-----------------------
| | |
50-65 | e | f | rm3
| | |
-----------------------
| | |
Over 65 | g | h | rm4
| | |
----------------------------
cm1 cm2 | N
The numbers rm1, rm2, rm3, and rm4 are called row margins,
and are the sums of the counts within the cells in the corresponding
row; for example, rm1 = a + b. Similarly, cm1 and cm2 are column
margins; in the table above, cm1 = a + c + e + g. The total, N,
of all the cells in the table is equal to the sum of the column margins
and also to the sum of the row margins: in the table above,
N = cm1 + cm2 = rm1 + rm2 + rm3 + rm4.
The expected count for the cell in the i-th row and j-th
column, conditional on the observed margins, is defined as
Ei,j = rmi * cmj / N.
Let Oi,j = observed count for the i,j-th cell;
for example, O3,2 = f in the table above.
The Pearson chi-squared statistic X^2 for this table
is defined as
X^2 = SUM{ (Oi,j - Ei,j)^2 / Ei,j }.
This statistic is used for testing the null hypothesis that the
proportion of counts in the j-th column of each row is the same
regardless of the row.
The X^2 statistic is compared to a chi-square distribution
with (R - 1) x (C - 1) degrees of freedom, where R is the number of rows
and C is the number of columns. The null hypothesis is rejected if
the value of X^2 is large, or equivalently, the associated p-value is small.
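A Python sketch of the Pearson statistic for a general R x C table of observed counts (the counts are hypothetical; scipy.stats.chi2_contingency carries out the same test):
    # Hypothetical 4 x 2 table (rows = age categories, columns = gender).
    observed = [[20, 30],
                [25, 25],
                [15, 35],
                [10, 40]]
    R, C = len(observed), len(observed[0])
    row_margins = [sum(row) for row in observed]
    col_margins = [sum(observed[i][j] for i in range(R)) for j in range(C)]
    N = sum(row_margins)
    X2 = sum((observed[i][j] - row_margins[i] * col_margins[j] / N) ** 2
             / (row_margins[i] * col_margins[j] / N)
             for i in range(R) for j in range(C))
    df = (R - 1) * (C - 1)                       # compare X2 to chi-square with df d.f.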