MORE ON SIMULATIONS                                   SPH 5421 notes.007

      Z-statistics and t-statistics are used constantly in applications
involving random variables that are not necessarily normally distributed.  
The reason for this is essentially the Central Limit Theorem (CLT), which
says that a sum (or average) of independent random variables Xi satisfying
some reasonable conditions will be approximately normally distributed.  
If the random variables are independent and identically distributed, with
mean mu and standard deviation sigma, then the sum of N of them will have
mean N*mu and standard deviation sqrt(N) * sigma.  Of course you can prove this
without using the CLT.  What the CLT actually says is that

           W = sqrt(N) * (average(Xi) - mu) / sigma

has a CDF which is close to the CDF of an N(0, 1) random variable.
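
For example, for U[0, 1] random variables mu = 1/2 and sigma = sqrt(1/12),
so W can be computed directly.  A minimal SAS sketch (the choice N = 30
and the seed are arbitrary):

      * Compute W for one sample of N = 30 uniform observations ;
      data wstat;
         n = 30;
         mu = 0.5;                 /* mean of U[0,1]               */
         sigma = sqrt(1/12);       /* standard deviation of U[0,1] */
         sum = 0;
         do i = 1 to n;
            sum = sum + ranuni(12345);   /* one U[0,1] draw        */
         end;
         w = sqrt(n) * (sum/n - mu) / sigma;
         put w=;                   /* write W to the SAS log       */
      run;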

     This is still somewhat imprecise, because "close to" is not
defined.  Let Fw be the CDF for W and let Fz be the CDF for a random
variable Z which has an N(0, 1) distribution.  For any value t, Fw(t)
and Fz(t) are the values of the two CDFs at t.  By definition,

            Fw(t) = prob(W <= t), and

            Fz(t) = prob(Z <= t).

Then a precise statement of the CLT is that

            max (abs(Fw(t) - Fz(t)))
             t

converges to 0 as N goes to infinity.  There are other valid
ways to express the convergence in the CLT as well.

     However, precision in making this sort of statement is not necessarily
useful.  The fact that something converges to something else as
N --> infinity does not tell you when the approximation of Fw by Fz is
good.  What you want to know is: how big does N have to be before the
approximation is good?  Is N = 20 big enough?  N = 100? N = 5?

     Simulation makes it possible for you to study this and draw
practical conclusions.  The logical thing to do is start with a
distribution which is very different from normal, and see how large
N has to be before the sum of iid random variables from that
distribution is approximately normal.

     Here is the basic plan for studying the question.  Choose a
cdf F.  Choose N.  Simulate N observations from a random variable
with cdf F.  Compute the average of the N observations.  Do this
whole thing several thousand times, until you have a good estimate of
the CDF for the average.  Then perform a test on that (empirical)
CDF to see if it is very different from the CDF of a normal
distribution with the same mean and standard deviation.
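
One way to program the simulation step, sketched in SAS for the
U[0, 1] case (N = 5, the 2000 replications, and the seed are all
arbitrary choices here):

      * Simulate 2000 averages, each of N = 5 U[0,1] observations ;
      data avg;
         n = 5;
         do rep = 1 to 2000;           /* number of simulated averages */
            sum = 0;
            do i = 1 to n;
               sum = sum + ranuni(20010213);   /* one U[0,1] draw      */
            end;
            a = sum / n;               /* the average of the N draws   */
            output;
         end;
         keep a;
      run;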

     What test might you perform?

     One useful test is carried out by PROC UNIVARIATE when the
NORMAL option is specified.  The test is based on either the
Shapiro-Wilk statistic [see Shapiro & Wilk, Biometrika 52,
pp 591-611, 1965] if N < 2000, or the Kolmogorov-Smirnov statistic
if N >= 2000.  [Note: the Kolmogorov-Smirnov statistic is closely
related to the 'max' expression above.]

     If you use the PLOT and NORMAL options in PROC UNIVARIATE,
it will also produce a normal probability plot, a kind of
quantile-quantile plot.  This gives a visual test of normality:
if the data are approximately normal, the plotted points should
lie approximately on a straight line.
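
Applied to the simulated data set AVG from the sketch above (with
variable A), both the test and the plots can be requested by:

      * Test the 2000 simulated averages for normality ;
      proc univariate data=avg normal plot;
         var a;
      run;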

     You might want to start with a distribution which is as
non-normal as you can possibly get, but which still has a mean
and variance (the Cauchy distribution, for example, has
neither).  Intuitively, the uniform distribution and the
Bernoulli(.5) distribution seem very non-normal.  The sum of N
independent Bernoullis with the same parameter p is of course
binomial(N, p), a discrete distribution [note: the Central Limit
Theorem is about the CDF, not the PDF: the PDF for a discrete
distribution can never be close to the PDF for a normal distribution].

PROBLEM 9

1.  Let A be the average of N independent observations from a U[0, 1]
    distribution.  For N = 1, 2, 5, 10, 20, and 30, carry out an
    appropriate simulation study (with 2000 observations per
    simulation) comparing the distribution of A to a normal
    distribution.  Also produce histograms of the distribution of
    A for each choice of N.  Summarize your conclusions.

2.  Do the same thing for A being the average of N independent observations
    from a Bernoulli(.5) distribution.

3.  Do the same thing for A being the average of N independent
    observations from a Poisson distribution with parameter lambda = 1.

4.  Comment on which of the three distributions above shows the
    slowest convergence to normality.

5.  For the binomial distribution, you don't need to do a simulation 
    study; you can compute the exact level of disagreement of the
    CDFs of the binomial and the approximating normal.  Do this 
    for N = 1, 2, 5, 10, 20, and 30 and p = .5.
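
As a hint for part 5: both CDFs can be evaluated exactly with the
PROBBNML and PROBNORM functions.  Since the binomial CDF is a step
function and the normal CDF is continuous, the largest discrepancy
occurs at one of the jump points k = 0, 1, ..., N, approached from
either the left or the right.  A sketch for one value of N:

      * Exact max discrepancy, binomial(N, .5) CDF vs normal CDF ;
      data exact;
         n = 10;  p = 0.5;             /* one of the required cases */
         mu = n*p;  sigma = sqrt(n*p*(1-p));
         maxdiff = 0;
         do k = 0 to n;
            fn = probnorm((k - mu)/sigma);  /* normal CDF at jump    */
            fb = probbnml(p, n, k);         /* P(X <= k)             */
            if k = 0 then fblo = 0;
            else fblo = probbnml(p, n, k-1); /* left limit at jump   */
            maxdiff = max(maxdiff, abs(fb - fn), abs(fblo - fn));
         end;
         put n= maxdiff=;
      run;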

/home/gnome/john-c/5421/notes.007    Most recent update: Feb 13, 2001