Grades on this exam: 92 90 80 75 SPH 7460 Final Exam December 22, 2010 page 1 of 5 Four Problems - 5 pages Name: _________________________________________ ===================================================================================== 1. A study is conducted of the survival of bacteria in the presence of heat. A total of 1000 bacteria are studied. Each bacterium is observed until it dies (no censoring). The outcome of interest is the survival time in hours. Assume that the bacteria in the sample are a mixture of two strains of bacteria. The survival time distribution for each strain is exponential, but with different parameters. Further, the proportions of bacteria from each strain is not known. a) Write down an expression for the probability distribution function for the survival times [see part b) for names of parameters] [10%] f(t) = alpha*lambda1*exp(-lambda1*t) + (1 - alpha)*lambda2*exp(-lambda2*t) b) Write a PROC IML program which will provide maximum likelihood estimates of each of the exponential parameters, lambda1 and lambda2, and of the mixing alpha (i.e., the proportion of Population 1 is assumed to be alpha and of Population 2 is (1 - alpha)). The data for this problem are assumed to be of the form: Obs Survival Time --- ------------- 1 3.2 2 2.8 3 5.1 . . . . . . 1000 9.0 [30%] The following program is a Newton-algorithm-based approach to the problem, using proc iml: ============================================================================================ options linesize = 80 ; footnote "~john-c/5421/expo.mixture.sas &sysdate &systime" ; data mixture ; b1 = .1 ; b2 = 1 ; alpha = .4 ; n = 400 ; do i = 1 to n ; r = ranuni(-1) ; group = 1 ; if r > alpha then group = 2 ; if group eq 1 then do ; u = ranuni(-1) ; t = - (1/b1) * log(1 - u) ; output ; end ; if group eq 2 then do ; u = ranuni(-1) ; t = - (1/b2) * log(1 - u) ; output ; end ; end ; run ; proc iml ; use mixture ; read all var {t} into t ; read all var {group} into group ; start loglike(t, b1, b2, alpha, n, l, dl, d2l) ; suml = 0 ; sumdl = {0, 0, 0} ; sumd2l = {0 0 0, 0 0 0, 0 0 0} ; e1 = {1, 0, 0} ; e2 = {0, 1, 0} ; e3 = {0, 0, 1} ; s1 = {1 0 0, 0 0 0, 0 0 0} ; s2 = {0 1 0, 0 0 0, 0 0 0} ; s3 = {0 0 1, 0 0 0, 0 0 0} ; s4 = {0 0 0, 1 0 0, 0 0 0} ; s5 = {0 0 0, 0 1 0, 0 0 0} ; s6 = {0 0 0, 0 0 1, 0 0 0} ; s7 = {0 0 0, 0 0 0, 1 0 0} ; s8 = {0 0 0, 0 0 0, 0 1 0} ; s9 = {0 0 0, 0 0 0, 0 0 1} ; do i = 1 to n ; ti = t[i] ; like = alpha * b1 * exp(-b1 * ti) + (1 - alpha) * b2 * exp(-b2 * ti) ; suml = suml + log(like) ; dlogldalpha = (b1 * exp(-b1 * ti) - b2 * exp(-b2 * ti)) / like ; dlogldb1 = alpha * exp(-b1*ti) * (1 - b1*ti) / like ; dlogldb2 = (1 - alpha) * exp(-b2*ti) * (1 - b2*ti) / like ; sumdl = sumdl + dlogldb1 * e1 + dlogldb2 * e2 + dlogldalpha * e3 ; sumd2l = sumd2l + dlogldalpha * dlogldalpha * s9 + dlogldalpha * dlogldb1 * s7 + dlogldalpha * dlogldb2 * s6 + dlogldalpha * dlogldb1 * s3 + dlogldb1 * dlogldb1 * s1 + dlogldb1 * dlogldb2 * s2 + dlogldalpha * dlogldb2 * s8 + dlogldb1 * dlogldb2 * s4 + dlogldb2 * dlogldb2 * s5 ; end ; l = suml ; dl = sumdl ; d2l = sumd2l ; finish ; beta = {.2, 1.2, .5} ; oldbeta = beta ; n = 400 ; itlim = 1000 ; diff = 9999 ; eps = 1e-6 ; do iter = 1 to itlim while (diff > eps) ; b1 = beta[1] ; b2 = beta[2] ; alpha = beta[3] ; run loglike(t, b1, b2, alpha, n, l, dl, d2l) ; beta = oldbeta - inv(d2l) * dl ; print "Iteration #" iter oldbeta beta l dl d2l ; oldbeta = beta ; b1 = beta[1] ; b2 = beta[2]; alpha = beta[3] ; end ; quit ; endsas ; ============================================================================================ However, the program above is extremely unlikely to converge. The following program uses an EM approach, and converges quite reliably to a reasonable answer: ============================================================================================ options linesize = 80 ; footnote "~john-c/5421/em.expo.mixture.sas &sysdate &systime" ; data mixture ; b1 = .1 ; b2 = 1 ; alpha = .4 ; n = 400 ; do i = 1 to n ; r = ranuni(-1) ; group = 1 ; if r > alpha then group = 2 ; if group eq 1 then do ; u = ranuni(-1) ; t = - (1/b1) * log(1 - u) ; output ; end ; if group eq 2 then do ; u = ranuni(-1) ; t = - (1/b2) * log(1 - u) ; output ; end ; end ; run ; proc iml ; use mixture ; read all var {t} into t ; read all var {group} into group ; start loglike(t, b1, b2, alpha, n, l, newb1, newb2, newalpha) ; suml = 0 ; sumz1 = 0 ; sumz2 = 0 ; sumnumermean1 = 0 ; sumnumermean2 = 0 ; do i = 1 to n ; ti = t[i] ; like = alpha * b1 * exp(-b1 * ti) + (1 - alpha) * b2 * exp(-b2 * ti) ; suml = suml + log(like) ; prob1 = b1 * exp(-b1 * ti) ; prob2 = b2 * exp(-b2 * ti) ; a1 = alpha * prob1 ; a2 = (1 - alpha) * prob2 ; z1i = a1 / (a1 + a2) ; z2i = a2 / (a1 + a2) ; firstpart1 = log(alpha) * z1i ; secondpart1 = log(prob1) * z1i ; firstpart2 = log(1 - alpha) * z2i ; secondpart2 = log(prob2) * z2i ; numermean1i = ti * z1i ; numermean2i = ti * z2i ; sumnumermean1 = sumnumermean1 + numermean1i ; sumnumermean2 = sumnumermean2 + numermean2i ; sumz1 = sumz1 + z1i ; sumz2 = sumz2 + z2i ; end ; newmean1 = sumnumermean1 / sumz1 ; newmean2 = sumnumermean2 / sumz2 ; newalpha = sumz1 / n ; newb1 = 1 / newmean1 ; newb2 = 1 / newmean2 ; l = suml ; finish ; b1 = .2 ; b2 = 1 ; alpha = .5 ; itlim = 200 ; n = 400 ; diff = 99999 ; eps = 1e-8 ; do iter = 1 to itlim while (diff > eps) ; run loglike(t, b1, b2, alpha, n, l, newb1, newb2, newalpha) ; diff = abs(b1 - newb1) + abs(b2 - newb2) + abs(alpha - newalpha) ; print "iter = " iter l b1 newb1 b2 newb2 alpha newalpha diff ; alpha = newalpha ; b1 = newb1 ; b2 = newb2 ; end ; quit ; endsas ; ============================================================================================ The above EM-based program converges after about 60 iterations typically, and as claimed by the theory, the likelihood increases from one EM step to the next. ============================================================================================ SPH 7460 Final Exam December 22, 2010 page 3 of 5 Name: _________________________________________ ===================================================================================== 2. Write a program which simulates 500 pairs of observations X and Y such that (1) X and Y have normal distributions both of which have mean 0 and variance 1. (2) Corr(X, Y) = .5 Include a computation of the observed correlation in your simulated data, without using a SAS or R procedure. [20%] The key here is to note that if X = Z + e1 and Y = Z + e2, where Z, e1, and e2 are all independent and have a standard normal (N(0, 1)) distribution, then: 1) var(X) = 1 + 1 = 2 = var(Y). 2) cov(X, Y) = cov(Z + e1, Z + e2) = cov(Z, Z) + cov(Z, e2) + cov(Z, e1) + cov(e1, e2) = var(Z) + 0 + 0 + 0 = 1. 3) corr(X, Y) = cov(X, Y) / sqrt(var(X) * var(Y)) = 1 / sqrt(2 * 2) = 1/2. 4) Now replace X by X' = (1/sqrt(2))*X and Y by Y' = (1/sqrt(2))*Y. Then corr(X', Y') = corr(X, Y). and both X' and Y' have mean 0 and variance 1. Thus the program to simulate such a bivariate distribution is: --------------------------------------------------------------------------------------- data xy ; file 'corrxy.out' ; n = 500 ; sumx = 0 ; sumy = 0 ; sumxy = 0 ; sumx2 = 0 ; sumy2 = 0 ; do i = 1 to n ; z = rannor(-1) ; e1 = rannor(-1) ; e2 = rannor(-1) ; x = (z + e1) / sqrt(2) ; y = (z + e2) / sqrt(2) ; output ; sumx = sumx + x ; sumx2 = sumx2 + x*x ; sumy = sumy + y ; sumy2 = sumy2 + y*y ; sumxy = sumxy + x * y ; end ; covxy = (sumxy - sumx * sumy / n) / n ; varx = (sumx2 - sumx * sumx / n) / n ; vary = (sumy2 - sumy * sumy / n) / n ; meanx = sumx / n ; meany = sumy / n ; corrxy = covxy / sqrt(varx * vary) ; if i ge n then do ; put "n = " n ; put "mean(x) = " meanx ; put "mean(y) = " meany ; put "cov(x, y) = " covxy ; put "corr(x, y) = " corrxy ; end ; run ; proc corr data = xy ; var x y ; run ; endsas ; -------------------------------------------------------------------------------------------- SPH 7460 Final Exam December 22, 2010 page 4 of 5 Name: _________________________________________ ===================================================================================== 3. Write a program which simulates the number of times you flip a coin until you get 5 heads in a row. data coinflips ; nheads = 0 ; do i = 1 to 1000 while(nheads < 5) ; r = ranuni(-1) ; if r < .5 then nheads = nheads + 1 ; if r > .5 then nheads = 0 ; output ; end ; [20%] run ; proc print data = coinflips ; run ; SPH 7460 Final Exam December 22, 2010 page 5 of 5 Name: _________________________________________ ===================================================================================== 4. Let A be the matrix: | 1 a | | | | b 1 | For what ranges of values for a and b is it true that A is the covariance matrix of a bivariate distribution? (Hint: what must be true for the eigenvalues of A?) First, a covariance matrix must be symmetric, so b = a. Second, a covariance matrix must be positive definite. This means that both eigenvalues are real and positive. The eigenvalues are the solution x to: | 1 - x a | det | | = 0 . | a 1 - x | This determinant is: x^2 - 2*x + 1 - a^2. The solutions are: (2 + sqrt(4 + 4(1 - a^2))) / 2 and (2 - sqrt(4 - 4(1 - a^2))) / 2 These solutions reduce to: 1 + abs(a) and 1 - abs(a). Both solutions are real if abs(a) is greater than or equal to zero. This is always true. Both solutions are nonnegative if abs(a) < 1. [20%]