Grades on this exam:

      92   90  80  75


SPH 7460 Final Exam  December 22, 2010                                    page 1 of 5

Four Problems - 5 pages               Name: _________________________________________
=====================================================================================

1.  A study is conducted of the survival of bacteria in the presence of heat.  A
    total of 1000 bacteria are studied.  Each bacterium is observed until it dies
    (no censoring).  The outcome of interest is the survival time in hours.  Assume that the
    bacteria in the sample are a mixture of two strains of bacteria.  The survival
    time distribution for each strain is exponential, but with different parameters.
    Further, the proportions of bacteria from each strain is not known.

    a) Write down an expression for the probability distribution function
       for the survival times [see part b) for names of parameters]


[10%]  f(t) = alpha*lambda1*exp(-lambda1*t) + (1 - alpha)*lambda2*exp(-lambda2*t)


    b) Write a PROC IML program which will provide maximum likelihood estimates of
       each of the exponential parameters, lambda1 and lambda2, and of the mixing
       alpha (i.e., the proportion of Population 1 is assumed to be alpha and of
       Population 2 is (1 - alpha)).

    The data for this problem are assumed to be of the form:

    Obs    Survival Time
    ---    -------------
     1          3.2
     2          2.8
     3          5.1
     .           .
     .           .
     .           .
  1000          9.0

[30%]


The following program is a Newton-algorithm-based approach to the problem, using
proc iml:


============================================================================================
options linesize = 80 ;
footnote "~john-c/5421/expo.mixture.sas &sysdate &systime" ;

data mixture ;
     b1 = .1 ;
     b2 = 1 ;
     alpha = .4 ;

     n = 400 ;

do i = 1 to n ;

   r = ranuni(-1) ;

   group = 1 ;
   if r > alpha then group = 2 ;

   if group eq 1 then do ;

      u = ranuni(-1) ;
      t = - (1/b1) * log(1 - u) ;
      output ;

   end ;

   if group eq 2 then do ;

      u = ranuni(-1) ;
      t = - (1/b2) * log(1 - u) ;
      output ;

   end ;

end ;

run ;

proc iml ;

     use mixture ;
     read all var {t} into t ;
     read all var {group} into group ;


start loglike(t, b1, b2, alpha, n, l, dl, d2l) ;

     suml = 0 ;
     sumdl = {0, 0, 0} ;
     sumd2l = {0 0 0, 0 0 0, 0 0 0} ;

     e1 = {1, 0, 0} ; e2 = {0, 1, 0} ; e3 = {0, 0, 1} ;

     s1 = {1 0 0, 0 0 0, 0 0 0} ;
     s2 = {0 1 0, 0 0 0, 0 0 0} ;
     s3 = {0 0 1, 0 0 0, 0 0 0} ;
     s4 = {0 0 0, 1 0 0, 0 0 0} ;
     s5 = {0 0 0, 0 1 0, 0 0 0} ;
     s6 = {0 0 0, 0 0 1, 0 0 0} ;
     s7 = {0 0 0, 0 0 0, 1 0 0} ;
     s8 = {0 0 0, 0 0 0, 0 1 0} ;
     s9 = {0 0 0, 0 0 0, 0 0 1} ;

    do i = 1 to n ;

      ti = t[i] ;

      like = alpha * b1 * exp(-b1 * ti) + (1 - alpha) * b2 * exp(-b2 * ti) ;
      suml = suml + log(like) ;

      dlogldalpha = (b1 * exp(-b1 * ti) - b2 * exp(-b2 * ti)) / like ;
      dlogldb1    = alpha * exp(-b1*ti) * (1 - b1*ti) / like ;
      dlogldb2    = (1 - alpha) * exp(-b2*ti) * (1 - b2*ti) / like ;

      sumdl = sumdl + dlogldb1 * e1 + dlogldb2 * e2 + dlogldalpha * e3 ;


      sumd2l = sumd2l + dlogldalpha * dlogldalpha * s9
                      + dlogldalpha * dlogldb1 * s7
                      + dlogldalpha * dlogldb2 * s6
                      + dlogldalpha * dlogldb1 * s3
                      + dlogldb1 * dlogldb1 * s1
                      + dlogldb1 * dlogldb2 * s2
                      + dlogldalpha * dlogldb2 * s8
                      + dlogldb1 * dlogldb2 * s4
                      + dlogldb2 * dlogldb2 * s5 ;

    end ;

    l = suml ; dl = sumdl ; d2l = sumd2l ;

finish ;

beta = {.2, 1.2, .5} ;

oldbeta = beta ;

n = 400 ;
itlim = 1000 ;
diff = 9999 ;
eps = 1e-6 ;

do iter = 1 to itlim while (diff > eps) ;

   b1 = beta[1] ; b2 = beta[2] ; alpha = beta[3] ;

   run loglike(t, b1, b2, alpha, n, l, dl, d2l) ;

   beta = oldbeta - inv(d2l) * dl ;

   print "Iteration #" iter oldbeta beta l dl d2l ;

   oldbeta = beta ;

   b1 = beta[1] ; b2 = beta[2]; alpha = beta[3] ;

end ;

quit ;

endsas ;

============================================================================================

However, the program above is extremely unlikely to converge.  The following program
uses an EM approach, and converges quite reliably to a reasonable answer:

============================================================================================


options linesize = 80 ;
footnote "~john-c/5421/em.expo.mixture.sas &sysdate &systime" ;

data mixture ;
     b1 = .1 ;
     b2 = 1 ;
     alpha = .4 ;

     n = 400 ;

do i = 1 to n ;

   r = ranuni(-1) ;

   group = 1 ;
   if r > alpha then group = 2 ;

   if group eq 1 then do ;

      u = ranuni(-1) ;
      t = - (1/b1) * log(1 - u) ;
      output ;

   end ;

   if group eq 2 then do ;

      u = ranuni(-1) ;
      t = - (1/b2) * log(1 - u) ;
      output ;

   end ;

end ;

run ;

proc iml ;

     use mixture ;
     read all var {t} into t ;
     read all var {group} into group ;

start loglike(t, b1, b2, alpha, n, l, newb1, newb2, newalpha) ;

     suml = 0 ;
     sumz1 = 0 ;
     sumz2 = 0 ;
     sumnumermean1 = 0 ;
     sumnumermean2 = 0 ;

    do i = 1 to n ;

      ti = t[i] ;

      like = alpha * b1 * exp(-b1 * ti) + (1 - alpha) * b2 * exp(-b2 * ti) ;
      suml = suml + log(like) ;

      prob1 = b1 * exp(-b1 * ti) ;
      prob2 = b2 * exp(-b2 * ti) ;

      a1 = alpha * prob1 ;
      a2 = (1 - alpha) * prob2 ;

      z1i = a1 / (a1 + a2) ;
      z2i = a2 / (a1 + a2) ;


      firstpart1 = log(alpha) * z1i ;
      secondpart1 = log(prob1) * z1i ;
      firstpart2 = log(1 - alpha) * z2i ;
      secondpart2 = log(prob2) * z2i ;

      numermean1i = ti * z1i ;
      numermean2i = ti * z2i ;

      sumnumermean1 = sumnumermean1 + numermean1i ;
      sumnumermean2 = sumnumermean2 + numermean2i ;

      sumz1 = sumz1 + z1i ;
      sumz2 = sumz2 + z2i ;

    end ;

    newmean1 = sumnumermean1 / sumz1 ;
    newmean2 = sumnumermean2 / sumz2 ;

    newalpha = sumz1 / n ;

    newb1 = 1 / newmean1 ;
    newb2 = 1 / newmean2 ;

    l = suml ;

finish ;

b1 = .2 ; b2 = 1 ; alpha = .5 ;

itlim = 200 ;

n = 400 ;
diff = 99999 ;
eps = 1e-8 ;

do iter = 1 to itlim while (diff > eps) ;

   run loglike(t, b1, b2, alpha, n, l, newb1, newb2, newalpha) ;
   diff = abs(b1 - newb1) + abs(b2 - newb2) + abs(alpha - newalpha) ;

   print "iter = " iter l b1 newb1 b2 newb2 alpha newalpha diff ;

   alpha = newalpha ;
   b1 = newb1 ;
   b2 = newb2 ;

end ;

quit ;

endsas ;

============================================================================================

The above EM-based program converges after about 60 iterations typically, and as claimed by the
theory, the likelihood increases from one EM step to the next.

============================================================================================

SPH 7460 Final Exam  December 22, 2010                                    page 3 of 5

                                      Name: _________________________________________
=====================================================================================

2.  Write a program which simulates 500 pairs of observations X and Y such that
    (1) X and Y have normal distributions both of which have mean 0 and variance 1.
    (2) Corr(X, Y) = .5

    Include a computation of the observed correlation in your simulated data, without
    using a SAS or R procedure.

 [20%]

    The key here is to note that if X = Z + e1  and Y = Z + e2, where Z, e1, and e2
    are all independent and have a standard normal (N(0, 1)) distribution, then:

    1) var(X) = 1 + 1 = 2 = var(Y).

    2) cov(X, Y) = cov(Z + e1, Z + e2) = cov(Z, Z) + cov(Z, e2) + cov(Z, e1) + cov(e1, e2)

                 = var(Z) + 0 + 0 + 0 = 1.

    3) corr(X, Y) = cov(X, Y) / sqrt(var(X) * var(Y)) = 1 / sqrt(2 * 2) = 1/2.

    4) Now replace X by X' = (1/sqrt(2))*X and Y by Y' = (1/sqrt(2))*Y.  Then corr(X', Y') = corr(X, Y).
       and both X' and Y' have mean 0 and variance 1.

    Thus the program to simulate such a bivariate distribution is:

    ---------------------------------------------------------------------------------------

    data xy ;

    file 'corrxy.out' ;

         n = 500 ;
         sumx = 0 ; sumy = 0 ; sumxy = 0 ; sumx2 = 0 ; sumy2 = 0 ;

         do i = 1 to n ;

            z = rannor(-1) ; e1 = rannor(-1) ; e2 = rannor(-1) ;
            x = (z + e1) / sqrt(2) ;
            y = (z + e2) / sqrt(2) ;

            output ;

            sumx = sumx + x ;  sumx2 = sumx2 + x*x ;
            sumy = sumy + y ;  sumy2 = sumy2 + y*y ;
            sumxy = sumxy + x * y ;

         end ;

         covxy = (sumxy - sumx * sumy / n) / n ;
         varx = (sumx2 - sumx * sumx / n) / n  ;
         vary = (sumy2 - sumy * sumy / n) / n  ;

         meanx = sumx / n ;
         meany = sumy / n ;

         corrxy = covxy / sqrt(varx * vary) ;


         if i ge n then do ;

           put "n          = " n ;
           put "mean(x)    = " meanx ;
           put "mean(y)    = " meany ;

           put "cov(x, y)  = " covxy ;
           put "corr(x, y) = " corrxy ;

        end ;

run ;

proc corr data = xy ;
     var x y ;
run ;

endsas ;


--------------------------------------------------------------------------------------------


SPH 7460 Final Exam  December 22, 2010                                    page 4 of 5

                                      Name: _________________________________________
=====================================================================================

3.  Write a program which simulates the number of times you flip a coin until you
    get 5 heads in a row.


    data coinflips ;

         nheads = 0 ;

         do i = 1 to 1000 while(nheads < 5) ;

            r = ranuni(-1) ;

            if r < .5 then nheads = nheads + 1 ;

            if r > .5 then nheads = 0 ;

            output ;

         end ;
[20%]
      run ;

      proc print data = coinflips ;
      run ;


SPH 7460 Final Exam  December 22, 2010                                    page 5 of 5

                                      Name: _________________________________________
=====================================================================================

4.  Let A be the matrix:


    | 1    a |
    |        |
    | b    1 |

    For what ranges of values for a and b is it true that A is the covariance matrix of a
    bivariate distribution? (Hint: what must be true for the eigenvalues of A?)


    First, a covariance matrix must be symmetric, so b = a.

    Second, a covariance matrix must be positive definite.  This means that both eigenvalues
    are real and positive.  The eigenvalues are the solution x to:

              | 1 - x    a |
          det |            |  = 0 .
              | a    1 - x |

    This determinant is:   x^2 - 2*x + 1 - a^2.

    The solutions are:  

        (2 + sqrt(4 + 4(1 - a^2))) / 2   and
        (2 - sqrt(4 - 4(1 - a^2))) / 2

    These solutions reduce to: 1 + abs(a) and 1 - abs(a).

    Both solutions are real if abs(a) is greater than or equal to zero.
    This is always true.

    Both solutions are nonnegative if abs(a) < 1.

[20%]