Grades: 100 96 85 83 70 65 63 58

October 28, 2008                                                     Page 1 of 5
PubH 7460 - Fall 2008 - Exam 1          Name:___________________________________
=================================================================================

1. A person throws darts at a map. The map includes only the States of
   Missouri, Iowa and Minnesota. The probability that the dart lands in
   State X is proportional to the area of State X. Dart-throws where the
   dart misses the map entirely are not counted. The areas of the states are:

       Missouri :  69,709 square miles
       Iowa     :  56,276 square miles
       Minnesota:  86,943 square miles

   a) Write a program which simulates 1,000 dart-throws. The program should
      produce the simulated count of darts that land within each of the
      three States.                                                      [13]

      data darts ;
        * State areas in square miles ;
        amissouri  = 69709 ;
        aiowa      = 56276 ;
        aminnesota = 86943 ;
        totalarea  = amissouri + aiowa + aminnesota ;

        * Landing probabilities, proportional to area ;
        pmissouri  = amissouri  / totalarea ;
        piowa      = aiowa      / totalarea ;
        pminnesota = aminnesota / totalarea ;

        nmissouri = 0 ; niowa = 0 ; nminnesota = 0 ;

        do i = 1 to 1000 ;
          r = ranuni(-1) ;
          if r < pmissouri then nmissouri = nmissouri + 1 ;
          else if r < pmissouri + piowa then niowa = niowa + 1 ;
          else nminnesota = nminnesota + 1 ;
        end ;

        output ;
      run ;

      proc print data = darts ;
      run ;

   b) Let M = the number of darts that land in Missouri, and I = the number
      of darts that land in Iowa, where again you assume that 1,000 darts are
      thrown. How would you use the simulated data from your program to
      estimate the covariance of M and I ?                               [12]

      Easiest answer:

      Let p1 = prob(dart lands in Missouri),
          p2 = prob(dart lands in Iowa).

      Let Xi = 1 if the i-th dart lands in Missouri, 0 otherwise.
      Let Yi = 1 if the i-th dart lands in Iowa, 0 otherwise.

      Note that M = X1 + X2 + ... + X1000 and I = Y1 + Y2 + ... + Y1000.
      We assume that the dart-throws are independent, so that X1, X2, ...,
      X1000 are independent, Y1, Y2, ..., Y1000 are independent, and Xi and
      Yj are independent whenever i <> j.

      COV(M, I) = E((X1 + X2 + ... + X1000)*(Y1 + Y2 + ... + Y1000))
                  - E(M) * E(I)

      Note that Xi * Yi = 0, since the i-th dart cannot land in both Missouri
      and Iowa. Therefore

      E((X1 + X2 + ... + X1000)*(Y1 + Y2 + ... + Y1000))
          = (Sum over i <> j of) E(Xi * Yj)
          = (Sum over i <> j of) E(Xi)E(Yj)
          = 1000 * 999 * p1 * p2.

      Also E(M) = 1000 * p1 and E(I) = 1000 * p2. Therefore

      COV(M, I) = (1000 * 999) * p1 * p2 - 1000 * p1 * 1000 * p2
                = - 1000 * p1 * p2.

      But M/1000 is an estimate of p1 and I/1000 is an estimate of p2, so
      COV(M, I) is estimated by

                - 1000 * (M/1000) * (I/1000) = - M * I / 1000.

2. Assume X has a standard normal distribution (mean 0, variance 1).
   Assume Y is the absolute value of X.

   a) What is the CDF for Y [may be stated in terms of the CDF for X].    [5]

      For y >= 0,

      F_Y(y) = prob(Y < y) = prob(abs(X) < y)
             = 1 - 2 * prob(X < -y)
             = 1 - 2 * F_X(-y),

      and F_Y(y) = 0 for y < 0.

   b) What is the pdf for Y ?                                             [5]

      For y >= 0, f_Y(y) = 2 * f_X(-y), where f_X(x) is the pdf of the
      standard normal, f_X(x) = (1/sqrt(2*pi)) exp(-x^2/2).

   c) What is the expectation of Y?                                       [7]

      E(Y) = integral from 0 to infinity of
             [2 * (1/sqrt(2*pi)) * y * exp(-y^2/2) dy]
           = 2 / sqrt(2*pi) = .798 approx.

   d) How might you use distribution functions in SAS or R to find
      median(Y) ?                                                         [8]

      The median of Y is the value of y such that F_Y(y) = 1/2.
      From part a), let 1/2 = 1 - 2 * F_X(-y), so F_X(-y) = 1/4.
      Therefore -y = (inverse of F_X)(1/4). In SAS, the inverse of the
      normal CDF is PROBIT (in R it is qnorm). Therefore
      y = -PROBIT(1/4) = .6744 approx.

3.
   Given a set of 100 numbers x_1, x_2, x_3, ..., x_100, the 5% Winsorized
   mean is defined by replacing the lowest 5 numbers by the 6th lowest
   number, and the highest 5 numbers by the 6th highest number, and then
   computing the mean of this modified set of numbers.

   a) Why might someone compute a Winsorized mean instead of the usual mean?
      What is a disadvantage of the Winsorized mean?                     [10]

      Advantage: It is less influenced by outliers, and in general will give
      a more robust estimate of the true mean (if the assumed model is
      incorrect).

      Disadvantage: The standard error will be underestimated. Also, for
      skewed distributions, the Winsorized mean may be biased.

   b) Write a program which will estimate the variance of the 5% Winsorized
      mean of a set of 100 numbers having an exponential distribution with
      hazard 0.5.                                                        [15]

      The following is a clever way to do this problem and is due to one of
      the students in the class:

      data xdata ;
        * Assumption: x100.file holds the 100 values sorted in ascending order ;
        infile 'x100.file' end = endmark ;
        retain n 0 sumx 0 sumxx 0 ;
        input x ;
        n = n + 1 ;

        * The 6th lowest value also stands in for the 5 lowest values ;
        if n = 6 then do ;
          sumx  = sumx  + 6*x ;
          sumxx = sumxx + 6*x*x ;
        end ;

        * Values 7 through 94 are counted once each ;
        if n > 6 and n < 95 then do ;
          sumx  = sumx  + x ;
          sumxx = sumxx + x*x ;
        end ;

        * The 6th highest value also stands in for the 5 highest values ;
        if n = 95 then do ;
          sumx  = sumx  + 6*x ;
          sumxx = sumxx + 6*x*x ;
        end ;

        * Sample variance of the 100 Winsorized values ;
        if endmark = 1 then
          winsorvariance = (sumxx - sumx * sumx / 100) / 99 ;
      run ;

      proc print data = xdata ;
      run ;

      endsas ;

4. Assume X and Y are independent random variables, both having standard
   normal distributions (that is, mean 0 and variance 1). Let T be the
   linear transformation defined by:

         | X |     | U |     | 2*X - 5*Y |
       T |   |  =  |   |  =  |           |
         | Y |     | V |     |  X + 3*Y  |

   a) Sketch the image: T(unit square). What is the area of the resulting
      figure?                                                             [7]

      [Sketch: T maps the unit square in the (X, Y) plane to a big
      parallelogram in the (U, V) plane, with vertices (0, 0), (2, 1),
      (-3, 4) and (-5, 3).]

      Area = abs(det(T)) = abs(2*3 - (-5)*1) = 11.

   b) Find Var(U) and Cov(U, V).
      Var(U) = Var(2*X - 5*Y) = 2*2 + 5*5 = 29.                           [8]
      Cov(U, V) = Cov(2*X - 5*Y, X + 3*Y) = 2*1 - 5*3 = 2 - 15 = -13.

   c) Write a simulation program in SAS which produces an estimate of
      Corr(U, V), based on a sample of size 10000. Do not use a SAS
      procedure.                                                         [10]

      data simuv ;
        n = 10000 ;
        sumu = 0 ; sumuu = 0 ; sumv = 0 ; sumvv = 0 ; sumuv = 0 ;

        do i = 1 to n ;
          x = rannor(-1) ;
          y = rannor(-1) ;
          u = 2 * x - 5 * y ;
          v = x + 3 * y ;
          sumu  = sumu  + u ;
          sumv  = sumv  + v ;
          sumuu = sumuu + u*u ;
          sumvv = sumvv + v*v ;
          sumuv = sumuv + u*v ;
        end ;

        * Sample covariance and variances, each with divisor n - 1 ;
        covuv = (sumuv - sumu * sumv / n) / (n - 1) ;
        varu  = (sumuu - sumu * sumu / n) / (n - 1) ;
        varv  = (sumvv - sumv * sumv / n) / (n - 1) ;
        corruv = covuv / sqrt(varu * varv) ;
        output ;
      run ;

      proc print data = simuv ;
        var n varu varv covuv corruv ;
      run ;
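
      As a rough check on the simulation, the value that corruv should
      approximate can be worked out from Var(U) = 29 and Cov(U, V) = -13.
      Var(V) is not computed in the answer above; the sketch below assumes
      the same rule used for Var(U), namely that X and Y are independent
      with variance 1.

```latex
\begin{align*}
\operatorname{Var}(V) &= \operatorname{Var}(X + 3Y) = 1^2 + 3^2 = 10, \\
\operatorname{Corr}(U, V)
  &= \frac{\operatorname{Cov}(U, V)}
          {\sqrt{\operatorname{Var}(U)\,\operatorname{Var}(V)}}
   = \frac{-13}{\sqrt{29 \cdot 10}}
   = \frac{-13}{\sqrt{290}}
   \approx -0.763 .
\end{align*}
```

      With a sample of size 10000, the simulated corruv would be expected to
      fall close to this value, though not exactly on it.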