APPLICATIONS OF SAS IML: SPH 5421 notes.019
FINDING MAXIMUM LIKELIHOOD ESTIMATES
Why might it be useful in statistics to be able to find minima or maxima of a
given nonlinear function?
Suppose L(B) = L(Y, X;B) is the loglikelihood for a given data set. Here Y
could indicate an outcome variable (dependent variable) and X a vector of
predictors. Both Y and X are measured variables in the data set. The vector B
represents unknown coefficients in the model.
The object of maximum likelihood estimation is to find the value of B which
maximizes L(Y, X ; B) for the observed values of Y and X. As above this reduces to
finding where the partial derivatives of L are 0. That is, the object is to find B =
(B1, B2, B3, ... , Bp) such that
[1]       dL/dB1 = 0
          dL/dB2 = 0
            ...
          dL/dBp = 0.
This is a system of p simultaneous nonlinear equations. In general the
solution cannot be found by a formula. It must be estimated numerically by an
iterative process like Newton's method.
The system of equations [1] can be written in vector form as
[2] dL/dB = 0,
where 0 is the 0-vector in p-dimensional space.
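One step of Newton's method for solving [2] replaces the current estimate, B(old),
by
          B(new) = B(old) - [d2L/dB2]^(-1) * (dL/dB),
where d2L/dB2 is the p x p matrix of second partial derivatives of L and both
derivatives are evaluated at B(old). This is the 'key Newton step' that appears in
the program below.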
In theory, finding estimates of the parameter vector B which maximize the
likelihood is no different from the problem, discussed above, of minimizing or
maximizing a nonlinear function. In practice, however, log likelihood functions are
often awkward to work with because the expressions for them involve not only the
unknown parameters, but also the independent variables X in the design matrix and
the outcome variable Y. This implies that each evaluation of the likelihood
function requires a complete pass through the entire design matrix.
The following example is typical. It uses PROC IML in SAS to produce maximum
likelihood estimates for a logistic model. The basic algorithm is Newton's method,
which requires repeated evaluation of the loglikelihood, its first derivative
vector, and its second derivative matrix. Output
from this program is also appended.
The parameter vector, BETA, is assumed to be p x 1. The first derivative
vector of the log-likelihood L is also p x 1. The second derivative matrix of the
loglikelihood is a square, symmetric p x p matrix.
========================================================================
/* Program to compute maximum likelihood estimates .... */;
/* Based on Newton-Raphson methods - uses IML */;
footnote "program: /home/walleye/john-c/5421/imlml.sas &sysdate &systime" ;
options linesize = 80 ;
data sim ;
one = 1 ;
a = 2 ;
b = 1 ;
n = 300 ;
seed = 20000214 ;
/* Simulated data from logistic distribution */;
do i = 1 to n ;
u = 2 * i / n ;
p = 1 / (1 + exp(-a - b * u)) ;
r = ranuni(seed) ;
yi = 0 ;
if r lt p then yi = 1 ;
output ;
end ;
run ;
/* Logistic analysis of simulated data ...... */;
proc logistic descending ;
model yi = u ;
title1 'Proc logistic results ... ' ;
run ;
/* Begin proc iml .............................. */;
title1 'PROC IML Results ...' ;
proc iml ;
use sim ; /* Input the simulated dataset */;
read all var {one u} into x ; /* Read the design matrix into x */;
read all var {yi} into y ; /* Read the outcome data into y */;
p = 2 ; /* Set the number of parameters */;
beta = {0.00, 0.00} ; /* Initialize the param. vector */;
/* -------------------------------------------------------------------*/;
/* Module which computes loglikelihood and derivatives ... */;
start loglike(beta, p, x, y, l, dl, d2l) ;
n = 300 ;
l = 0 ; /* Initialize log likelihood ... */;
dl = j(p, 1, 0) ; /* Initialize 1st derivatives... */;
d2l = j(p, p, 0) ; /* Initialize 2nd derivatives... */;
do i = 1 to n ;
xi = x[i,] ;
yi = y[i] ;
xibeta = xi * beta ;
w = exp(-xibeta) ;
/* The log likelihood for the i-th observation ... */;
iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
l = l + iloglike ;
do j = 1 to p ;
xij = xi[j] ;
/* The jth 1st derivative of the log likelihood for the ith obs */;
jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
+ w * xij / (1 + w) ;
dtemp = dl[j] + jdlogl ;
dl[j] = dtemp ;
do k = 1 to p ;
xik = xi[k] ;
/* The jkth 2nd derivative of the log likelihood for the ith obs */;
jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
-(1 - yi) * w * xij * ((1 - yi) * w * xik))/
((1 - yi) * w + yi)**2
+ (-w * xij * xik * (1 + w) + w * w * xij * xik)/
(1 + w)**2 ;
d2temp = d2l[j, k] ;
d2l[j, k] = d2temp + jkd2logl ;
end ;
end ;
end ;
finish loglike ;
/* -------------------------------------------------------------------*/;
eps = 1e-8 ;
diff = 1 ;
/* The following do loop stops when the increments in beta are */;
/* sufficiently small, or when number of iterations reaches 20 */;
do iter = 1 to 20 while(diff > eps) ;
run loglike(beta, p, x, y, l, dl, d2l) ;
invd2l = inv(d2l) ;
beta = beta - invd2l * dl ; /* The key Newton step ... */;
diff = max(abs(invd2l * dl)) ;
print iter l dl d2l diff beta ;
end ;
b1 = beta[1] ;
b2 = beta[2] ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
covar = -invd2l ;
llratio = - 2 * l ;
print ' -2 * loglikelihood = ' llratio ;
print 'b1 coeff, std err : ' b1 serr1 ;
print 'b2 coeff, std err : ' b2 serr2 ;
print 'covariance matrix : ' covar ;
quit ;
========================================================================
Proc logistic results ... 1
10:55 Saturday, March 18, 2000
The LOGISTIC Procedure
Data Set: WORK.SIM
Response Variable: YI
Response Levels: 2
Number of Observations: 300
Link Function: Logit
Response Profile
Ordered
Value YI Count
1 1 278
2 0 22
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 159.306 153.781 .
SC 163.010 161.188 .
-2 LOG L 157.306 149.781 7.525 with 1 DF (p=0.0061)
Score . . 7.266 with 1 DF (p=0.0070)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 1.5917 0.3766 17.8675 0.0001 . .
U 1 1.1108 0.4273 6.7592 0.0093 0.354175 3.037
Association of Predicted Probabilities and Observed Responses
Concordant = 66.7% Somers' D = 0.344
Discordant = 32.3% Gamma = 0.348
Tied = 1.0% Tau-a = 0.047
(6116 pairs) c = 0.672
program: /home/walleye/john-c/5421/imlml.sas 18MAR00 10:55
PROC IML Results ... 2
10:55 Saturday, March 18, 2000
ITER L DL D2L DIFF BETA
1 -207.9442 128 -75 -75.25 1.42466 1.42466
135.45333 -75.25 -100.5006 0.2810698
ITER L DL D2L DIFF BETA
2 -86.11339 24.433947 -39.1126 -36.7227 0.382878 1.6898849
27.881783 -36.7227 -47.38327 0.6639477
ITER L DL D2L DIFF BETA
3 -76.10077 5.4183855 -24.6185 -20.35354 0.3405272 1.6284455
7.0927236 -20.35354 -24.50093 1.0044749
ITER L DL D2L DIFF BETA
4 -74.93779 0.8082239 -20.61113 -15.14088 0.1008562 1.5935698
1.1867474 -15.14088 -17.00239 1.1053311
ITER L DL D2L DIFF BETA
5 -74.89053 0.04008 -19.90169 -14.11319 0.0054776 1.5916993
0.0585386 -14.11319 -15.50634 1.1108087
ITER L DL D2L DIFF BETA
6 -74.8904 0.0001116 -19.8646 -14.05981 0.0000151 1.5916942
0.0001618 -14.05981 -15.4291 1.1108238
ITER L DL D2L DIFF BETA
7 -74.8904 8.55E-10 -19.86449 -14.05967 1.157E-10 1.5916942
1.2388E-9 -14.05967 -15.42889 1.1108238
LLRATIO
-2 * loglikelihood = 149.78081
B1 SERR1
b1 coeff, std err : 1.5916942 0.376554
B2 SERR2
b2 coeff, std err : 1.1108238 0.4272664
COVAR
covariance matrix : 0.1417929 -0.12921
-0.12921 0.1825565
program: /home/walleye/john-c/5421/imlml.sas 18MAR00 10:55
========================================================================
The difficult part of this kind of program is writing out the expressions for
the 1st and 2nd partial derivatives. To some extent this problem cannot be avoided.
The choice of initial values for the parameter vector BETA can strongly affect
the behavior of this kind of program. In general you want to choose a value for
BETA that is close to the maximum point of the log likelihood. However, you usually
don't know where that is. In this case, (0, 0)' was chosen and the algorithm
converged quickly and satisfactorily. In other cases you may need to make an
informed guess of where the maximum might be. You may be able to do this by trying
several values for BETA and examining the corresponding values of the log
likelihood. If the number of parameters is moderate or large (> 5, say), this can
be very time-consuming.
In this case the algorithm converges in 7 iterations to the same answer as
that given by PROC LOGISTIC. Note that during the iterations,
(1) The log-likelihood L increases
(2) The derivative vector, DL, gets close to (0, 0)'.
(3) DIFF --> 0.
The program above is a very rudimentary one for this purpose. There is no
reason to use this program rather than PROC LOGISTIC. It is intended as an example
of how the algorithm can be implemented. Features which should be added include the
following:
1. A way of checking that the likelihood is actually increasing.
2. Check that the derivatives of the log-likelihood are very small
when convergence is achieved: typically, less than 10^(-5) in
absolute value.
3. A provision for what to do if the algorithm does not converge
within the prescribed number of steps.
4. A provision for 'step-halving': if a given iteration does not
actually increase the likelihood, the parameter estimates can
instead be incremented by one-half the Newton step (a code
sketch appears after this list). Step-halving can be implemented
as follows:
1) start with f = 1
2) compute beta = beta - f * inv(d2l) * dl ;
3) evaluate the log-likelihood at the new beta
4) if the loglikelihood decreases at a given iteration,
replace f by f/2 and go back to step 2.
5) keep doing this until you either do 10 step-halves, or
the loglikelihood actually increases. If after 10 step-
halves, the loglikelihood has not improved, declare
nonconvergence and exit.
The rationale for step-halving is as follows. The change in the
parameters at a given step may be in the correct direction, but
too large in magnitude to increase the log-likelihood. Step-halving
produces changes in the parameters in the same direction as that specified
by the Newton process, but of smaller magnitude.
5. Provisions for dealing with missing data must either be built
into the program, or the observations with missing data must be
deleted from the dataset before the IML procedure begins.
6. Provision to print an estimate of the standard errors of the coefficient
estimates. This can easily be added to the above program at the end,
as follows:
invd2l = inv(d2l) ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
Here serr1 will be the estimated standard error of the intercept
term, b0, and serr2 will be the estimated standard error of the
coefficient b1. The reason this works is that the covariance matrix
of the parameters in maximum likelihood estimation is asymptotically
equal to the inverse of the Fisher information. The Fisher
information is the negative of the expectation of the 2nd derivative
matrix of the log likelihood. The expectation is usually approximated
by simply evaluating the 2nd derivative matrix at the maximum
likelihood estimate itself; in this case that 2nd derivative matrix
is d2l, so -inv(d2l) approximates the covariance matrix.
Fisher information is an important idea.
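Here is a minimal IML sketch of how step-halving (item 4 above) could be
grafted onto the iteration loop of imlml.sas. It is only an illustration, not
part of the original program; it assumes the module loglike and the quantities
beta, p, x, y, eps, and diff have been set up exactly as above, and the names
step, f, improved, betanew, lold, and lnew are introduced here for convenience.
   do iter = 1 to 20 while(diff > eps) ;
      run loglike(beta, p, x, y, lold, dl, d2l) ;
      invd2l = inv(d2l) ;
      step = invd2l * dl ;                    /* full Newton increment       */;
      f = 1 ;                                 /* step-length multiplier      */;
      improved = 0 ;
      do half = 1 to 10 while(improved = 0) ;
         betanew = beta - f * step ;          /* trial update                */;
         run loglike(betanew, p, x, y, lnew, dlnew, d2lnew) ;
         if lnew > lold then improved = 1 ;   /* likelihood went up: accept  */;
         else f = f / 2 ;                     /* otherwise halve the step    */;
      end ;
      if improved = 0 then do ;
         print 'No increase after 10 step-halves: declaring nonconvergence.' ;
         diff = 0 ;                           /* force the outer loop to end */;
      end ;
      else do ;
         beta = betanew ;
         diff = max(abs(f * step)) ;
      end ;
   end ;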
-------------------------------------------------------------------------------------
Finally, why are programs like this useful, if you already have things like
PROC LOGISTIC to find coefficient estimates and other statistics?
SAS and Splus both provide a wide but still limited range of statistical
routines for regression and analysis of variance. In SAS, the primary areas are (1)
linear models, (2) a subset of generalized linear models and loglinear models (PROC
GENMOD), (3) Mixed models (PROC MIXED), (4) nonlinear least squares (PROC NLIN), and
(5) proportional hazards models for survival analysis. SAS does not have procedures
for some models in survival analysis (e.g., Gompertz or Pareto), for general forms
of Poisson regression, and for many other models. If you need to find maximum
likelihood estimates for a model not handled by a SAS or Splus procedure, you may
need to write a program something like the example above, or you may need to make
use of a statistical package like BMDP, which does include a procedure (LE) for
maximum-likelihood estimation.
=================================================================================
PROBLEM 18
1. Use a program like that given above to do the following logistic regression
problem: in the Lung Health Study dataset, assume that the probability of quitting
smoking at Year 1 is given by:
prob(quit smoke at A1) = 1 / (1 + exp(-b0 - b1*age)).
Thus, the problem is to estimate b0 and b1. Check your answer by using PROC
LOGISTIC or the Splus version of logistic regression.
2. Again, use a similar program to estimate the coefficients for the following
model:
prob(quit smoke at A1) = 1 /(1 + exp(-b0 - b1*age - b2*gender)) ;
Also estimate standard errors of the coefficients.
=================================================================================
/home/walleye/john-c/5421/notes.019 Last update: October 21, 2003
=================================================================================
SAS PROC IML SPH 5421 notes.logistic
Maximum likelihood estimation, continued
APPENDIX to notes.019:
The Logistic Model: Likelihood, Log-likelihood, Derivatives, and Maximum
Likelihood Estimation.
The logistic model is used to model the probability of an event as a function
of variables called risk factors.
Occurrence of the event (the outcome variable) is represented by a variable Y,
where Y = 1 means the event occurred and Y = 0 means it did not occur. The
probability that the event occurs is often a function of certain risk factors. For
example, if the event is death from stroke, the probability of the event within a
given fixed time interval (for example, one year) depends on the age of the person
and the person's systolic blood pressure. Older people are at higher risk and
people with high SBP are at higher risk.
For the stroke-death example, the logistic model is
prob(Y = 1 | X1 = age, X2 = SBP) = 1 / (1 + exp(-b0 - b1*X1 - b2*X2))
Here b0, b1, and b2 are coefficients that are usually unknown. The typical
problem for a given study is to collect data and use a program to estimate the
values of b0, b1, and b2. Note that the logistic probability is always a number
between 0 and 1.
Also note that if, for example, b2 is positive, then an increase in X2 will
produce an increase in the probability that Y = 1. Conversely, if b2 is negative,
then an increase in X2 will cause a decrease in the probability that Y = 1.
It is easy to show that
prob(Y = 0 | X1, X2) = 1 - prob(Y = 1 | X1, X2)
= exp(-b0 - b1*X1 - b2*X2)/ (1 + exp(-b0 - b1*X1 - b2*X2))
Assume you are carrying out an observational study in which you determine each
participant's age and systolic blood pressure at a baseline visit, and then observe
what happens to the person during a one-year followup period.
Suppose a person is 85 years of age and has SBP = 160, and that he dies of
stroke within the one-year period of observation. The likelihood for this
observation for this single individual is:
L1 = 1 / (1 + exp(-b0 - b1*85 - b2*160)).
On the other hand, if he does NOT die of stroke within a year, the outcome
Y is set to zero, and the likelihood for the observation is
L0 = 1 - L1 = exp(-b0 - b1*85 - b2*160) / (1 + exp(-b0 - b1*85 - b2*160))
It is possible to write the likelihood as a single expression which is
a function of Y, X1, and X2:
L = Y / (1 + exp(-b0 - b1*X1 - b2*X2)) +
(1 - Y) * exp(-b0 - b1*X1 - b2*X2) / (1 + exp(-b0 - b1*X1 - b2*X2))
(Note that if Y = 1, the second part of the expression is 0, while if Y = 0,
the first part of the expression is 0.)
If you knew what the values of b0, b1, and b2 were, you could explicitly
calculate what the person's risk of an event would be. The occurrence or
non-occurrence of the event for a single individual can be viewed as a Bernoulli
random variable.
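For example, if the coefficients were known to be, say, b0 = -10, b1 = 0.05, and
b2 = 0.03 (purely hypothetical values, chosen only for illustration), a short DATA
step computes this person's risk and the likelihood contribution for either outcome:
   data one_person ;
      b0 = -10 ; b1 = 0.05 ; b2 = 0.03 ;         /* hypothetical coefficients */;
      x1 = 85 ; x2 = 160 ;                       /* age and SBP               */;
      p = 1 / (1 + exp(-b0 - b1*x1 - b2*x2)) ;   /* prob(Y = 1)               */;
      do y = 0, 1 ;
         like = y * p + (1 - y) * (1 - p) ;      /* L0 when y = 0, L1 when y = 1 */;
         output ;
      end ;
   run ;
   proc print data = one_person ; run ;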
The likelihood of the entire set of observations from the whole study is a
product of the individual likelihoods. Different individuals have different values
of age and SBP, and possibly different outcomes. Thus the total likelihood of all
the observations could be written
L = product{Yi / (1 + exp(-b0 - b1*X1i - b2*X2i)) +
(1 - Yi) * exp(-b0 - b1*X1i - b2*X2i) / (1 + exp(-b0 - b1*X1i - b2*X2i))},
where the i subscript refers to the i-th person in the study, and the product is
taken across all the observations, i = 1, 2, ..., n.
The main question in a typical study is, what coefficients 'best' fit the
observed data? Note that the likelihood is a function both of the observed data
(Yi, X1i, X2i) and of the unknown coefficients b0, b1, and b2. The observed data
are fixed constants, and there are typically many data points and measurements
making them up. If you have a study involving 1000 people with one outcome variable Y
and two covariates X1 and X2 for each person, the number of observed data values is
3000. On the other hand, there are only 3 unknown coefficients. Thus, in a sense,
the likelihood is a function of only 3 variables: b0, b1, and b2.
The goal of maximum likelihood estimation is to find the values of the unknown
coefficients which maximize the likelihood. The problem of finding maximum
likelihood estimates, therefore, is just a problem of maximizing a nonlinear
function with a relatively small number of unknowns. It is a very complicated
function, but, in the case of the stroke-death study, only 3 unknown coefficients
need to be estimated.
Again, the likelihood is a product of probabilities. Many of the probabilities
will be small, and of course none are larger than 1 (or smaller than 0). In most
studies, for realistic choices of the coefficients b0, b1, and b2, the likelihood is
a very small number, perhaps on the order of e^(-1000). This means that it is very
hard to do accurate computations. It is much easier computationally to deal with
the LOG of the likelihood. The log (natural log) function is a monotone increasing
function, so that if you find values which maximize the log of the likelihood, those
same values will also maximize the likelihood itself.
Another good thing about taking the log of the likelihood is that the PRODUCT
in the likelihood expression gets converted to a SUM. Sums are generally easier to
compute with than products.
The expression for the log-likelihood of the data set is:
l = sum{log[Yi + (1 - Yi) * exp(-Xi*b)] - log[1 + exp(-Xi*b)]},
where the sum is over i = 1, 2, ..., n. Note also that Xi*b is matrix
shorthand:
Xi = (1, X1i, X2i) and b = (b0, b1, b2)`, so that
exp(-Xi*b) = exp(-b0 - b1*X1i - b2*X2i).
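As a check on this formula, the whole sum can be evaluated in a few IML
statements, without an explicit loop over observations. This is only a sketch; it
uses the simulated dataset sim from imlml.sas, and the trial value b = (0.5, 0.5)`
is arbitrary:
   proc iml ;
      use sim ;
      read all var {one u} into x ;      /* n x p design matrix              */;
      read all var {yi} into y ;         /* n x 1 outcome vector             */;
      b = {0.5, 0.5} ;                   /* an arbitrary trial value of b    */;
      w = exp(-x * b) ;                  /* n x 1 vector of exp(-Xi*b)       */;
      l = sum(log(y + (1 - y) # w) - log(1 + w)) ;  /* the log-likelihood    */;
      print l ;
   quit ;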
The method of finding the values of b0, b1, and b2 that maximize the log
likelihood is simply the Newton method for finding the maximum point of a nonlinear
function. It is not particularly hard to write down a PROC IML routine which will
do this. The only tricky part of it is writing the correct expressions for the
first and second-order partial derivatives of the log likelihood.
Here is the first partial derivative of the log-likelihood l with respect
to b1:
dl/db1 = sum{-(1 - Yi)*X1i*exp(-Xi*b) / (Yi + (1 - Yi)*exp(-Xi*b))
+ X1i*exp(-Xi*b) / ( 1 + exp(-Xi*b))}.
The first partial derivative with respect to b2 is similar. The first partial
derivative with respect to b0 is also similar, except that where X1i occurs in the
expression above, it is replaced by 1.
The second partial derivative of l with respect to b1 (twice) is
d2l/db1b1 = sum{[(1 - Yi)*(X1i^2)*exp(-Xi*b)*(Yi + (1 - Yi)*exp(-Xi*b))
- ((1 - Yi)^2)*(X1i^2)*exp(-2*Xi*b)] / (Yi + (1 - Yi)*exp(-Xi*b))^2
+ [-(X1i^2)*exp(-Xi*b)*(1 + exp(-Xi*b))
+ (X1i^2)*exp(-2*Xi*b)] / (1 + exp(-Xi*b))^2}
The mixed second partial derivative of l with respect to b1 and b2 is
d2l/db1b2 = sum{[(1 - Yi)*(X1i*X2i)*exp(-Xi*b)*(Yi + (1 - Yi)*exp(-Xi*b))
- ((1 - Yi)^2)*(X1i*X2i)*exp(-2*Xi*b)] / (Yi + (1 - Yi)*exp(-Xi*b))^2
+ [-(X1i*X2i)*exp(-Xi*b)*(1 + exp(-Xi*b))
+ (X1i*X2i)*exp(-2*Xi*b)] / (1 + exp(-Xi*b))^2}
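These expressions simplify greatly if you write pi = 1/(1 + exp(-Xi*b)): the
i-th term of the first derivative is then just X1i*(Yi - pi), and the i-th term of
the second derivative is -X1i*X1i*pi*(1 - pi) (with -X1i*X2i*pi*(1 - pi) for the
mixed derivative). This is a standard identity for the logistic model, although it
is not the form used in imlml.sas. A minimal vectorized IML sketch, assuming the x,
y, and beta of that program (pvec is introduced here for the vector of fitted
probabilities):
      pvec = 1 / (1 + exp(-x * beta)) ;          /* n x 1 fitted probabilities */;
      dl   = x` * (y - pvec) ;                   /* p x 1 first derivatives    */;
      d2l  = -x` * (x # (pvec # (1 - pvec))) ;   /* p x p second derivatives   */;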
Most of the derivation of this awful-looking expression is just application of
the quotient rule for derivatives. You do have to be very careful about plus and
minus signs. This expression occurs in a slightly disguised form in the program
imlml.sas, in notes.019.
There are some operations in the above expression which can consume a lot of
computer time. The expression exp(-Xi*b) occurs over and over again. Recall that
Xi is a 1 x p matrix, and b is a p x 1 matrix. Every computation of Xi*b involves
some arithmetic: p multiplications and p additions. You don't see this explicitly
in IML, but it occurs and requires some computer time. Also, in terms of computer
time, functions like exponentiation are rather time-consuming. It is better to do
these things as few times as possible. That is why, in the program in
notes.019,
xibeta = Xi * beta
and
w = exp(-xibeta)
are each computed only once per observation within each iteration, rather than once
for every derivative term. After these two computations, all occurrences of
exp(-Xi*b) inside the j and k loops are replaced by w. This saves some time. There
are other places in the program where some savings in computer time could be
realized, but they are less important.
The logistic likelihood is fairly typical of likelihood expressions. In
practice more complicated likelihood expressions are often encountered. If the
Newton process of finding maximum likelihood estimates is used, some way of
estimating first and second derivatives must be available. You can try numerical
estimation, as is done in the program numml.sas in notes.020. In that example,
the first derivatives are computed analytically but the second derivatives are
estimated numerically. The process appears to work satisfactorily for a simulated
dataset. However, in general, numerical estimates of derivatives are less trustworthy
than analytic expressions. They also typically require more function evaluations,
and hence more computer time.
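The idea behind numerically estimating the second derivatives can be sketched
as follows. This is only an illustration (it is not the actual code of numml.sas);
it assumes a hypothetical IML module score(beta, x, y, dl) that returns the p x 1
vector of analytic first derivatives at beta, and it approximates each column of
the second-derivative matrix by a forward difference of that vector:
      h = 1e-5 ;                            /* finite-difference step size    */;
      run score(beta, x, y, dl0) ;          /* gradient at beta               */;
      d2l = j(p, p, 0) ;
      do k = 1 to p ;
         betak = beta ;
         betak[k] = betak[k] + h ;          /* perturb the kth parameter      */;
         run score(betak, x, y, dlk) ;      /* gradient at the perturbed beta */;
         d2l[, k] = (dlk - dl0) / h ;       /* kth column of 2nd-deriv matrix */;
      end ;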
Programs exist which can read programming code for a function and produce an
exact analytic expression for the derivative. The math package Mathematica has such
a program. It may be worthwhile to use such a program if you are faced with a very
complicated likelihood function.
VARIANCE ESTIMATION AND TESTING
Maximum likelihood estimation has several good features, first noticed by
R. A. Fisher. Maximum likelihood estimates (MLEs) are consistent (and hence
asymptotically unbiased), they are asymptotically efficient (they attain the
smallest possible asymptotic variance), and they are approximately normally
distributed in large samples. An approximate variance-covariance matrix of the
estimates is the negative inverse of the second derivative matrix evaluated at the
MLE. This is computed explicitly in the PROC IML program. The square roots of the
diagonal elements of this matrix are approximate standard errors of the coefficient
estimates, and can be used to compute confidence intervals for the true
coefficients.
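For example, continuing at the end of the IML program in notes.019, approximate
95 percent confidence intervals for the two coefficients could be printed as follows
(a sketch only; b1, b2, serr1, and serr2 are as already defined in that program, and
z, ci1, and ci2 are names introduced here):
      z = probit(0.975) ;                             /* approximately 1.96    */;
      ci1 = b1 || (b1 - z*serr1) || (b1 + z*serr1) ;  /* estimate, lower, upper*/;
      ci2 = b2 || (b2 - z*serr2) || (b2 + z*serr2) ;  /* estimate, lower, upper*/;
      print 'b1: estimate and 95% CI ' ci1 ;
      print 'b2: estimate and 95% CI ' ci2 ;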
The log-likelihood itself can be used for testing hypotheses. Suppose you
are considering two models for the probability of stroke death:
Model 1: prob(Y = 1 | age) = 1 / (1 + exp(-b0 - b1*age))
Model 2: prob(Y = 1 | age, SBP) = 1 / (1 + exp(-b0 - b1*age - b2*SBP))
You use a program like that used in notes.019 to compute MLEs for both
models. The program computes estimates of the log-likelihood itself in the course
of computing the MLEs. These two models are hierarchically related: that is,
Model 2 is obtained from Model 1 by simply adding a new risk factor (SBP).
A test for whether adding SBP to the model makes a difference can be obtained
by computing the difference in -2 * loglikelihood for the two models; that is,
D = (-2*logL[Model 1]) - (-2*logL[Model 2]).
Note that the statistic D is always nonnegative. There is a theorem which
implies that, under the null hypothesis that adding SBP to a model that already
includes age does not make a difference, the statistic D has approximately a
chi-square distribution with 1 degree of freedom. [The null hypothesis here can also
be stated as: H0: b2 = 0.] If D is large, you will tend to reject the null
hypothesis. For example, if D > 3.84, you would reject the null hypothesis at the
.05 significance level.
You need to be careful in using this test. The two models must be
hierarchically related as described above.
Note that you can use the same kind of procedure to test whether Model 1 fits
the data better than the intercept-only model (Model 0). Equivalently, this is a
test of whether b1 = 0 in Model 1.
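The computation of D and its p-value is simple once the two values of
-2 * loglikelihood are available. A sketch in IML (m2logl1 and m2logl2 are names
introduced here for the fitted values of -2 * loglikelihood under Model 1 and
Model 2):
      d = m2logl1 - m2logl2 ;                  /* likelihood ratio statistic  */;
      pvalue = 1 - probchi(d, 1) ;             /* chi-square with 1 df        */;
      print d pvalue ;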
========================================================================
SECOND APPENDIX TO notes.019:
SIMPLIFYING COMPUTATIONS
Fisher information is defined as the expectation of the
square of the first derivative of the log-likelihood,
I(B) = E{[d log(L) / dB] * [d log(L) / dB]`},
where B represents a p x 1 vector of parameters and d log(L)/dB
is the p x 1 vector of first derivatives of the log likelihood.
Note that the vector product inside the expectation is in fact a p x p
matrix of products of first derivatives.
There is a theorem which says that another formula for I(B) is
I(B) = -E{ d2 log(L) / dB2},
that is, Fisher information also equals the negative of the expectation
of the second derivative of the log likelihood. Note that the
second-derivative matrix is also a p x p matrix. A precise statement
of this theorem for one-parameter distributions can be found in various
textbooks, including Bickel and Doksum (Mathematical Statistics;
Prentice-Hall, 1977), p. 139 and Exercise 4.3.9, p. 148.
As it happens this theorem is useful for simplifying maximum
likelihood estimation programs which make use of Newton's method.
The Newton iteration step,
V1 = V0 - inv(d2logL) * d1logL,
requires the second-derivative matrix. The theorem above
implies that you might get approximately the same answer if you replace
the second derivative matrix by the negative of the matrix of
first derivative products:
V1 = V0 + inv(d1logL * d1logL`) * d1logL,
where the matrix d1logL * d1logL` is understood as the sum, over observations, of
the products of the per-observation first derivatives (this is how it is assembled
in the program below; the outer product of the total first-derivative vector with
itself would not be invertible).
If this works, it has the significant advantage that you need only
derive expressions for the first derivatives and then assemble the
appropriate products in order to carry out the iterative step in the
algorithm.
This variant of Newton's method is called either Gauss's method
or the Gauss-Newton method.
It is implemented below in a variation of the program imlml.sas
which appears on 19.1-19.3 of notes.019. The only difference in these
two programs is that in imlml.sas, the following code appears:
------------------------------------------------------------------------
/* The jkth 2nd derivative of the log likelihood for the ith obs */;
jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
-(1 - yi) * w * xij * ((1 - yi) * w * xik))/
((1 - yi) * w + yi)**2
+ (-w * xij * xik * (1 + w) + w * w * xij * xik)/
(1 + w)**2 ;
d2temp = d2l[j, k] ;
d2l[j, k] = d2temp + jkd2logl ;
------------------------------------------------------------------------
In the modified program gaussimlml.sas, this section is replaced
by:
------------------------------------------------------------------------
/* The kth 1st derivative of the log likelihood for the ith obs */;
kdlogl = -(1 - yi) * w * xik / ((1 - yi) * w + yi)
+ w * xik / (1 + w) ;
d2temp = -jdlogl * kdlogl ;
d2l[j, k] = d2l[j, k] + d2temp ;
------------------------------------------------------------------------
Note that kdlogl, the derivative of the log likelihood of the data
for the i-th observation with respect to the k-th parameter, has the
same form as that of jdlogl, except that the covariate xij is replaced
in two places by xik. Also note that d2temp is the NEGATIVE of the
product of jdlogl and kdlogl.
Here is the complete program gaussimlml.sas, followed by printout
for the same data that occurred in the earlier example for imlml.sas:
========================================================================
/* Program to compute maximum likelihood estimates .... */;
/* Based on Gauss-Newton methods - uses IML */;
footnote "program: /home/walleye/john-c/5421/gaussimlml.sas &sysdate &systime" ;
options linesize = 80 ;
data sim ;
one = 1 ;
a = 2 ;
b = 1 ;
n = 300 ;
seed = 20000214 ;
/* Simulated data from logistic distribution */;
do i = 1 to n ;
u = 2 * i / n ;
p = 1 / (1 + exp(-a - b * u)) ;
r = ranuni(seed) ;
yi = 0 ;
if r lt p then yi = 1 ;
output ;
end ;
run ;
/* Logistic analysis of simulated data ...... */;
proc logistic descending ;
model yi = u ;
title1 'Proc logistic results ... ' ;
run ;
/* Begin proc iml .............................. */;
title1 'PROC IML Results ...' ;
proc iml ;
use sim ; /* Input the simulated dataset */;
read all var {one u} into x ; /* Read the design matrix into x */;
read all var {yi} into y ; /* Read the outcome data into y */;
p = 2 ; /* Set the number of parameters */;
beta = {0.00, 0.00} ; /* Initialize the param. vector */;
/* -------------------------------------------------------------------*/;
/* Module which computes loglikelihood and derivatives ... */;
start loglike(beta, p, x, y, l, dl, d2l) ;
n = 300 ;
l = 0 ; /* Initialize log likelihood ... */;
dl = j(p, 1, 0) ; /* Initialize 1st derivatives... */;
d2l = j(p, p, 0) ; /* Initialize 2nd derivatives... */;
do i = 1 to n ;
xi = x[i,] ;
yi = y[i] ;
xibeta = xi * beta ;
w = exp(-xibeta) ;
/* The log likelihood for the i-th observation ... */;
iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
l = l + iloglike ;
do j = 1 to p ;
xij = xi[j] ;
/* The jth 1st derivative of the log likelihood for the ith obs */;
jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
+ w * xij / (1 + w) ;
dtemp = dl[j] + jdlogl ;
dl[j] = dtemp ;
do k = 1 to p ;
xik = xi[k] ;
/* The kth 1st derivative of the log likelihood for the ith obs */;
kdlogl = -(1 - yi) * w * xik / ((1 - yi) * w + yi)
+ w * xik / (1 + w) ;
d2temp = -jdlogl * kdlogl ;
d2l[j, k] = d2l[j, k] + d2temp ;
end ;
end ;
end ;
finish loglike ;
/* -------------------------------------------------------------------*/;
eps = 1e-8 ;
diff = 1 ;
/* The following do loop stops when the increments in beta are */;
/* sufficiently small, or when number of iterations reaches 20 */;
do iter = 1 to 20 while(diff > eps) ;
run loglike(beta, p, x, y, l, dl, d2l) ;
invd2l = inv(d2l) ;
beta = beta - invd2l * dl ; /* The key Newton step ... */;
diff = max(abs(invd2l * dl)) ;
print iter l dl d2l diff beta ;
end ;
b1 = beta[1] ;
b2 = beta[2] ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
covar = -invd2l ;
llratio = - 2 * l ;
print ' -2 * loglikelihood = ' llratio ;
print 'b1 coeff, std err : ' b1 serr1 ;
print 'b2 coeff, std err : ' b2 serr2 ;
print 'covariance matrix : ' covar ;
quit ;
========================================================================
Proc logistic results ... 1
16:30 Thursday, March 15, 2001
The LOGISTIC Procedure
Data Set: WORK.SIM
Response Variable: YI
Response Levels: 2
Number of Observations: 300
Link Function: Logit
Response Profile
Ordered
Value YI Count
1 1 278
2 0 22
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 159.306 153.781 .
SC 163.010 161.188 .
-2 LOG L 157.306 149.781 7.525 with 1 DF (p=0.0061)
Score . . 7.266 with 1 DF (p=0.0070)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 1.5917 0.3766 17.8675 0.0001 . .
U 1 1.1108 0.4273 6.7592 0.0093 0.354175 3.037
Association of Predicted Probabilities and Observed Responses
Concordant = 66.7% Somers' D = 0.344
Discordant = 32.3% Gamma = 0.348
Tied = 1.0% Tau-a = 0.047
(6116 pairs) c = 0.672
program: /home/walleye/john-c/5421/gaussimlml.sas 15MAR01 16:30
PROC IML Results ... 2
16:30 Thursday, March 15, 2001
ITER L DL D2L DIFF BETA
1 -207.9442 128 -75 -75.25 1.42466 1.42466
135.45333 -75.25 -100.5006 0.2810698
ITER L DL D2L DIFF BETA
2 -86.11339 24.433947 -21.98682 -16.7216 1.3524393 1.5073912
27.881783 -16.7216 -19.59303 1.6335091
ITER L DL D2L DIFF BETA
3 -76.18316 -4.486689 -19.88465 -14.20949 0.3422821 1.5263487
-5.234316 -14.20949 -16.0794 1.291227
ITER L DL D2L DIFF BETA
4 -75.01091 -1.128447 -19.82677 -14.12761 0.15291 1.5783898
-1.706508 -14.12761 -15.96837 1.138317
ITER L DL D2L DIFF BETA
5 -74.89283 -0.120079 -19.82628 -14.11533 0.0251799 1.5902601
-0.234054 -14.11533 -15.94952 1.1131371
ITER L DL D2L DIFF BETA
6 -74.89042 -0.004023 -19.82753 -14.11462 0.0021435 1.591583
-0.015512 -14.11462 -15.94839 1.1109937
ITER L DL D2L DIFF BETA
7 -74.8904 -0.00018 -19.82766 -14.11456 0.0001575 1.5916861
-0.001058 -14.11456 -15.94832 1.1108361
ITER L DL D2L DIFF BETA
8 -74.8904 -0.000013 -19.82767 -14.11456 0.0000114 1.5916936
-0.000076 -14.11456 -15.94831 1.1108247
ITER L DL D2L DIFF BETA
9 -74.8904 -9.028E-7 -19.82767 -14.11455 8.2616E-7 1.5916942
-5.518E-6 -14.11455 -15.94831 1.1108239
ITER L DL D2L DIFF BETA
10 -74.8904 -6.533E-8 -19.82767 -14.11455 5.9797E-8 1.5916942
-3.993E-7 -14.11455 -15.94831 1.1108238
ITER L DL D2L DIFF BETA
11 -74.8904 -4.728E-9 -19.82767 -14.11455 4.328E-9 1.5916942
-2.89E-8 -14.11455 -15.94831 1.1108238
PROC IML Results ... 3
16:30 Thursday, March 15, 2001
LLRATIO
-2 * loglikelihood = 149.78081
B1 SERR1
b1 coeff, std err : 1.5916942 0.3692068
B2 SERR2
b2 coeff, std err : 1.1108238 0.4116691
COVAR
covariance matrix : 0.1363137 -0.12064
-0.12064 0.1694714
program: /home/walleye/john-c/5421/gaussimlml.sas 15MAR01 16:30
========================================================================
Comparing the two printouts for imlml.sas and gaussimlml.sas
shows that they produce exactly the same parameter estimates. The
variance estimates differ slightly. The theorem which justifies the
Gauss method does not say that the observed matrices will exactly
agree; it says that they have the same expectation. In fact they
will differ markedly from each other in regions of the parameter space
which are not near the maximum likelihood estimates themselves.
The Gauss-method program takes 11 iterations to converge in this
case, while the Newton method requires only 7. The final estimates of
the D2L matrices do not agree, and, although identical parameter
estimates are obtained, the estimated standard errors also do not agree.
Which method is better?
It's not clear. The Newton second-derivative method is based
on a Taylor expansion of the first derivative. It is evaluated at
the current estimate. The Gauss method depends on the theorem noted
above,
I(B) = -E{ d2 log(L) / dB2},
where I(B) is Fisher information, defined as
I(B) = E{[d log(L) / dB] * [d log(L) / dB]`}.
In that sense, the Gauss method is an approximation, since the
expectation is not known exactly. The Gauss method may not do well
until you are close to the MLE, because the theorem above applies
only to that point. However, Taylor's theorem is itself an approximation.
It is not clear which approximation is better in general. I am slightly
inclined to think the Newton method will perform better, but it has
the disadvantage that it requires computation of second derivatives.
SAS, in the nonlinear least squares procedure PROC NLIN, makes
use of either the Newton or Gauss methods, as well as several others.
The default method for PROC NLIN is Gauss. The Newton method can
be specified as an option. Note that in the SAS implementation, if
the Newton method is used, the second derivatives are estimated
numerically, whereas the first derivatives are specified analytically.
My own experience with PROC NLIN indicates that both the Gauss and
Newton methods are not quite as reliable as a third variant, MARQUARDT.
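For reference, the fitting method in PROC NLIN is chosen with the METHOD=
option. A minimal sketch for the simulated logistic data above (note that PROC NLIN
minimizes a sum of squares rather than maximizing the likelihood, so its estimates
will not exactly equal the MLEs):
      proc nlin data = sim method = marquardt ;
         parms b0 = 0  b1 = 0 ;                   /* starting values          */;
         model yi = 1 / (1 + exp(-b0 - b1*u)) ;   /* logistic mean function   */;
      run ;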
========================================================================
PROBLEM 18A.
The following data set shows values of a variable X whose
distribution is a mixture of two normal distributions.
That is, there is a probability p1 that X is drawn from a normal
distribution with mean mu1 and standard deviation s, and a
probability 1 - p1 that X is drawn from another normal distribution
with mean mu2 and the same standard deviation s.
A histogram is shown below the dataset.
The unknown parameters are: mu1, mu2, s, and p1.
Write a program which will compute maximum likelihood
estimates of these parameters.
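A starting point: the likelihood contribution of the i-th observation under
this mixture model can be written, for example in IML, as
      li = p1 * pdf('normal', x[i], mu1, s) + (1 - p1) * pdf('normal', x[i], mu2, s) ;
where pdf('normal', ., mu, s) is the normal density with mean mu and standard
deviation s. The log-likelihood is the sum of log(li) over the 200 observations.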
Obs x
1 0.80625
2 1.03691
3 1.46767
4 0.74435
5 0.76884
6 1.15201
7 1.76863
8 1.35288
9 1.39510
10 1.38580
11 1.36411
12 1.51375
13 0.68575
14 1.11369
15 1.48374
16 1.15337
17 1.23556
18 1.11115
19 1.31487
20 1.52987
21 1.33171
22 1.45874
23 0.59281
24 1.51549
25 1.24540
26 0.82244
27 1.31432
28 0.54313
29 0.79331
30 1.92460
31 0.65663
32 1.33266
33 1.04949
34 0.97332
35 1.49776
36 1.53517
37 0.81621
38 1.48104
39 0.90330
40 1.31403
41 0.93707
42 0.85454
43 1.59543
44 1.85178
45 0.79937
46 1.91335
47 1.12369
48 1.02486
49 0.98977
50 1.28284
51 0.43247
52 0.83582
53 0.87399
54 0.95347
55 1.41779
56 0.78824
57 0.43659
58 1.47591
59 1.76129
60 1.29779
61 1.21414
62 1.30930
63 1.69770
64 1.53651
65 1.54546
66 1.62319
67 0.83417
68 0.50302
69 0.76503
70 1.06025
71 1.39400
72 1.20968
73 0.84474
74 1.10303
75 0.94593
76 1.24635
77 0.77448
78 1.22901
79 1.75974
80 0.70282
81 1.33777
82 0.87791
83 1.68290
84 0.57039
85 1.58249
86 1.26539
87 1.14392
88 1.57961
89 0.95139
90 0.86490
91 0.69971
92 1.05069
93 1.06263
94 0.83222
95 1.24652
96 0.69845
97 1.51541
98 1.25000
99 0.49979
100 1.22121
101 1.21017
102 0.73679
103 0.62642
104 1.62786
105 0.62392
106 1.68102
107 0.85341
108 0.72261
109 0.98163
110 1.02806
111 1.02553
112 1.60963
113 1.02189
114 0.89691
115 1.39085
116 0.92491
117 0.89822
118 0.96070
119 1.27105
120 1.29916
121 1.12870
122 1.20774
123 1.40016
124 1.48849
125 0.91027
126 1.51901
127 0.92196
128 1.27068
129 0.98230
130 1.20435
131 0.66172
132 1.33744
133 1.17518
134 1.33228
135 0.47799
136 1.39439
137 0.88771
138 1.28552
139 0.75350
140 0.81856
141 0.64155
142 1.08935
143 1.76052
144 0.77502
145 0.65953
146 0.87828
147 1.34773
148 0.93423
149 0.83391
150 0.97298
151 0.69556
152 0.57002
153 0.82694
154 1.27763
155 0.57893
156 1.41407
157 0.84335
158 1.29575
159 1.22293
160 1.15268
161 1.24789
162 0.77514
163 1.26712
164 1.74086
165 1.65762
166 0.70038
167 0.61442
168 0.87938
169 1.40703
170 1.48931
171 1.02206
172 1.46202
173 1.72561
174 0.80142
175 1.43894
176 0.60786
177 0.93005
178 1.98187
179 1.31041
180 1.56395
181 1.07366
182 1.04669
183 0.64663
184 0.60105
185 0.59709
186 0.59587
187 1.19483
188 1.42432
189 0.76585
190 0.88321
191 1.42488
192 0.73576
193 1.62198
194 0.68361
195 0.87568
196 0.44357
197 1.39430
198 0.95161
199 0.75769
200 1.07720
Problem 18A, Histogram ... 2
Use of PROC CHART
18:27 Wednesday, November 15, 2006
Frequency
| ****
| **** ****
30 + **** ****
| **** **** ****
| **** **** **** ****
| **** **** **** ****
| **** **** **** **** ****
20 + **** **** **** **** **** ****
| **** **** **** **** **** **** ****
| **** **** **** **** **** **** ****
| **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** ****
10 + **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** **** **** ****
---------------------------------------------------------------------
0.45 0.60 0.75 0.90 1.05 1.20 1.35 1.50 1.65 1.80 1.95
x Midpoint
========================================================================
~john-c/5421/notes.logistic Last update: November 15, 2010