MORE ON SIMULATIONS                                   SPH 5421 notes.007

Z-statistics and t-statistics are used constantly in applications involving random variables that are not necessarily normally distributed. The reason for this is essentially the Central Limit Theorem (CLT), which says that a sum (or average) of independent random variables Xi satisfying some reasonable conditions will be approximately normally distributed.

If the random variables are independent and identically distributed with mean mu and standard deviation sigma, then the sum of N of them will have mean N*mu and standard deviation sqrt(N) * sigma. Of course you can prove this without using the CLT. What the CLT actually says is that

     W = sqrt(N) * (average(Xi) - mu) / sigma

has a CDF which is close to the CDF of an N(0, 1) random variable.

This is still somewhat imprecise because "close to" is not defined or specified. Let Fw be the CDF for W and let Fz be the CDF for a random variable Z which has an N(0, 1) distribution. For any value t, let Fw(t) and Fz(t) be the values of the CDFs at that value of t. By definition, Fw(t) = prob(W <= t), and Fz(t) = prob(Z <= t). Then a precise statement of the CLT is that

     max  abs(Fw(t) - Fz(t))
      t

converges to 0 as N goes to infinity. There are other valid ways to express CLT convergence also.

However, precision in making this sort of statement is not necessarily useful. The fact that something converges to something else as N --> infinity does not tell you when the approximation of Fw by Fz is good. What you want to know is: how big does N have to be before the approximation is good? Is N = 20 big enough? N = 100? N = 5?

Simulation makes it possible for you to study this and draw practical conclusions. The logical thing to do is start with a distribution which is very different from normal, and see how large N has to be before the sum of iid random variables from that distribution is approximately normal.

Here is the basic plan for studying the question.

     Choose a cdf F.
     Choose N.
     Simulate N observations from a random variable with cdf F.
     Compute the average of the N observations.
     Repeat the last two steps several thousand times, until you have a good estimate of the CDF for the average.
     Then perform a test on that (empirical) CDF to see if it is very different from that for a normal distribution with the same mean and standard deviation.

What test might you perform? One useful test is carried out as part of PROC UNIVARIATE. The test statistic is called NORMAL. Here the relevant sample size is the number of observations given to PROC UNIVARIATE (that is, the number of simulated averages), not the N above. The statistic is based on either the Shapiro-Wilk statistic [See Shapiro & Wilk, Biometrika 52, pp 591-611, 1965] if that sample size is < 2000 or the Kolmogorov-Smirnov statistic if it is >= 2000. [Note: the Kolmogorov-Smirnov statistic is closely related to the 'max' expression above].

If you use the PLOT and NORMAL options in PROC UNIVARIATE, it will also produce a quantile-quantile plot. This gives a kind of visual test of normality. If the data are approximately normal, the quantile-quantile plot should lie approximately on a straight line.

You might want to start with a distribution which is as non-normal as you can possibly get, but which still has a mean and variance (the Cauchy distribution has neither). Intuitively, the uniform distribution and the Bernoulli(.5) distribution seem very non-normal. The sum of N independent Bernoullis with the same parameter p is of course binomial(N, p), a discrete distribution [note: the Central Limit Theorem is about the CDF, not the PDF: a discrete distribution has a probability mass function rather than a density, so it can never be close to the PDF for a normal distribution].

PROBLEM 9

1. Let A be the average of N independent observations from a U[0, 1] distribution. For N = 1, 2, 5, 10, 20, and 30, carry out an appropriate simulation study (with 2000 observations per simulation) comparing the distribution of A to a normal distribution. Also produce histograms of the distribution of A for each choice of N. Summarize your conclusions.

2. Do the same thing for A being the average of N independent observations from a Bernoulli(.5) distribution.

3. Do the same thing for A being the average of N independent observations from a Poisson distribution with parameter lambda = 1.

4. Comment on which of the three distributions above seems to converge to normality most slowly.

5. For the binomial distribution, you don't need to do a simulation study; you can compute the exact level of disagreement of the CDFs of the binomial and the approximating normal. Do this for N = 1, 2, 5, 10, 20, and 30 and p = .5.

/home/gnome/john-c/5421/notes.007
Most recent update: Feb 13, 2001
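
The basic plan described in these notes (choose F, choose N, simulate averages, compare the empirical CDF to a normal CDF) can be sketched in a few lines. This is only an illustrative sketch, written in Python with the standard library rather than the SAS PROC UNIVARIATE approach the notes refer to. The function names simulate_averages, max_cdf_gap and uniform_draw, the seed, and the default of 2000 repetitions are assumptions made for this example. The comparison it computes is the 'max' distance between the empirical CDF and the normal CDF, which is the quantity behind the Kolmogorov-Smirnov statistic; it does not compute a p-value and it is not the Shapiro-Wilk test.

```python
import math
import random
from statistics import NormalDist


def simulate_averages(draw, n, reps=2000, seed=5421):
    """Return `reps` simulated averages, each of n draws from draw(rng).

    `draw` is any function taking a random.Random instance and returning
    one observation from the chosen distribution F (hypothetical helper).
    """
    rng = random.Random(seed)
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]


def max_cdf_gap(averages, mu, sigma, n):
    """Largest gap between the empirical CDF of the averages and the
    CDF of a normal with mean mu and standard deviation sigma/sqrt(n)."""
    normal = NormalDist(mu, sigma / math.sqrt(n))
    xs = sorted(averages)
    m = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = normal.cdf(x)
        # The empirical CDF steps from i/m to (i+1)/m at x; check both sides.
        gap = max(gap, abs((i + 1) / m - f), abs(i / m - f))
    return gap


def uniform_draw(rng):
    """One observation from U[0, 1)."""
    return rng.random()


if __name__ == "__main__":
    # U[0, 1] has mean 1/2 and variance 1/12.
    mu, sigma = 0.5, math.sqrt(1 / 12)
    for n in (1, 2, 5, 10, 20, 30):
        avgs = simulate_averages(uniform_draw, n)
        print(n, round(max_cdf_gap(avgs, mu, sigma, n), 4))
```

To try another distribution, pass a different draw function together with its mean and standard deviation, for example lambda rng: 1 if rng.random() < 0.5 else 0 with mu = 0.5 and sigma = 0.5 for Bernoulli(.5). With 2000 repetitions the gap cannot be expected to shrink much below the sampling noise of the empirical CDF itself, so small differences between large values of N should not be over-interpreted.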