APPLICATIONS OF SAS IML                              SPH 5421 notes.019

               FINDING MAXIMUM LIKELIHOOD ESTIMATES

     Why might it be useful in statistics to be able to find minima or
maxima of a given nonlinear function?

     Suppose L(B) = L(Y, X; B) is the loglikelihood for a given data
set.  Here Y could indicate an outcome variable (dependent variable)
and X a vector of predictors.  Both Y and X are measured variables in
the data set.  The vector B represents unknown coefficients in the
model.  The object of maximum likelihood estimation is to find the
value of B which maximizes L(Y, X; B) for the observed values of Y
and X.

     As discussed above, this reduces to finding where the partial
derivatives of L are 0.  That is, the object is to find
B = (B1, B2, B3, ..., Bp) such that

          dL/dB1 = 0
          dL/dB2 = 0
            .                                                     [1]
            .
            .
          dL/dBp = 0.

This is a system of p simultaneous nonlinear equations.  In general
the solution cannot be found by a formula.  It must be estimated
numerically by an iterative process like Newton's method.

     The system of equations [1] can be written in vector form as

          dL/dB = 0,                                              [2]

where 0 is the 0-vector in p-dimensional space.
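     Newton's method solves [2] iteratively.  Each step updates the
current estimate V0 of B to a new estimate V1 by

          V1 = V0 - inv(d2L/dB2) * (dL/dB),

where dL/dB is the p x 1 vector of first derivatives and d2L/dB2 is
the p x p matrix of second partial derivatives of L, both evaluated
at V0.  This is the same update step, written
V1 = V0 - inv(d2logL) * d1logL, that appears in the programs and
appendices below.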
     In theory, finding estimates of the parameter vector B which
maximize the likelihood is no different from the problem discussed
above of minimizing or maximizing a nonlinear function.  In practice,
however, it is often rather difficult to work with log likelihood
functions, because the expressions for them involve not only the
unknown parameters but also the independent variables X in the design
matrix and the outcome variable Y.  This implies that each evaluation
of the likelihood function must include a complete pass through the
entire design matrix.

     The following example is typical.  It uses PROC IML in SAS to
produce maximum likelihood estimates for a logistic model.  The basic
algorithm is Newton's method, which requires repeated evaluation of
the loglikelihood, the first derivative vector of the loglikelihood,
and the 2nd derivative matrix of the loglikelihood.  Output from this
program is also appended.

     The parameter vector, BETA, is assumed to be p x 1.  The first
derivative vector of the log-likelihood L is also p x 1.  The second
derivative matrix of the loglikelihood is a square, symmetric p x p
matrix.

========================================================================

/*  Program to compute maximum likelihood estimates ....             */;
/*  Based on Newton-Raphson methods - uses IML                       */;

footnote "program: /home/walleye/john-c/5421/imlml.sas &sysdate &systime" ;
options linesize = 80 ;

data sim ;
     one = 1 ;
     a = 2 ;
     b = 1 ;
     n = 300 ;
     seed = 20000214 ;

/*  Simulated data from logistic distribution                        */;

     do i = 1 to n ;
        u = 2 * i / n ;
        p = 1 / (1 + exp(-a - b * u)) ;
        r = ranuni(seed) ;
        yi = 0 ;
        if r lt p then yi = 1 ;
        output ;
     end ;
run ;

/*  Logistic analysis of simulated data ......                       */;

proc logistic descending ;
     model yi = u ;
     title1 'Proc logistic results ... ' ;
run ;

/*  Begin proc iml ..............................                    */;

title1 'PROC IML Results ...' ;

proc iml ;

     use sim ;                      /*  Input the simulated dataset   */;
     read all var {one u} into x ;  /*  Read the design matrix into x */;
     read all var {yi} into y ;     /*  Read the outcome data into y  */;

     p = 2 ;                        /*  Set the number of parameters  */;
     beta = {0.00, 0.00} ;          /*  Initialize the param. vector  */;

/* ------------------------------------------------------------------ */;
/*  Module which computes loglikelihood and derivatives ...           */;

     start loglike(beta, p, x, y, l, dl, d2l) ;

     n = 300 ;
     l = 0 ;                        /*  Initialize log likelihood ... */;
     dl = j(p, 1, 0) ;              /*  Initialize 1st derivatives... */;
     d2l = j(p, p, 0) ;             /*  Initialize 2nd derivatives... */;

     do i = 1 to n ;

        xi = x[i,] ;
        yi = y[i] ;
        xibeta = xi * beta ;
        w = exp(-xibeta) ;

/*  The log likelihood for the i-th observation ...                   */;

        iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
        l = l + iloglike ;

        do j = 1 to p ;

           xij = xi[j] ;

/*  The jth 1st derivative of the log likelihood for the ith obs      */;

           jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
                    + w * xij / (1 + w) ;
           dtemp = dl[j] + jdlogl ;
           dl[j] = dtemp ;

           do k = 1 to p ;

              xik = xi[k] ;

/*  The jkth 2nd derivative of the log likelihood for the ith obs     */;

              jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
                         - (1 - yi) * w * xij * ((1 - yi) * w * xik))
                         / ((1 - yi) * w + yi)**2
                         + (-w * xij * xik * (1 + w) + w * w * xij * xik)
                         / (1 + w)**2 ;
              d2temp = d2l[j, k] ;
              d2l[j, k] = d2temp + jkd2logl ;

           end ;
        end ;
     end ;

     finish loglike ;

/* ------------------------------------------------------------------ */;

     eps = 1e-8 ;
     diff = 1 ;

/*  The following do loop stops when the increments in beta are       */;
/*  sufficiently small, or when number of iterations reaches 20       */;

     do iter = 1 to 20 while(diff > eps) ;

        run loglike(beta, p, x, y, l, dl, d2l) ;
        invd2l = inv(d2l) ;
        beta = beta - invd2l * dl ;    /*  The key Newton step ...    */;
        diff = max(abs(invd2l * dl)) ;
        print iter l dl d2l diff beta ;

     end ;

     b1 = beta[1] ;
     b2 = beta[2] ;
     serr1 = sqrt(-invd2l[1, 1]) ;
     serr2 = sqrt(-invd2l[2, 2]) ;
     covar = -invd2l ;
     llratio = - 2 * l ;

     print ' -2 * loglikelihood = ' llratio ;
     print 'b1 coeff, std err :  ' b1 serr1 ;
     print 'b2 coeff, std err :  ' b2 serr2 ;
     print 'covariance matrix :  ' covar ;

quit ;

========================================================================
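     For comparison, the same three quantities can be computed without
the observation loops, using matrix operations.  The following module
is a sketch only, not part of the original program: the name LOGLIKE2
is made up, and the DIAG approach builds an n x n matrix, which is
acceptable for n = 300 but wasteful for large n.  It uses the standard
identities for the logistic model: the score vector is X`(y - mu) and
the second derivative matrix is -X`WX, where W is diagonal with
entries mu*(1 - mu).

     start loglike2(beta, x, y, l, dl, d2l) ;
        eta = x * beta ;                 /*  n x 1 linear predictors  */
        mu  = 1 / (1 + exp(-eta)) ;      /*  fitted probabilities     */
        l   = sum(y # log(mu) + (1 - y) # log(1 - mu)) ;
        dl  = x` * (y - mu) ;            /*  p x 1 first derivatives  */
        d2l = -x` * diag(mu # (1 - mu)) * x ;  /*  p x p 2nd derivs   */
     finish loglike2 ;

Substituting LOGLIKE2 for LOGLIKE in the iteration loop above should
produce the same sequence of estimates.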
Proc logistic results ...

                      The LOGISTIC Procedure

     Data Set: WORK.SIM
     Response Variable: YI
     Response Levels: 2
     Number of Observations: 300
     Link Function: Logit

                         Response Profile

                     Ordered
                       Value     YI     Count
                           1      1       278
                           2      0        22

      Model Fitting Information and Testing Global Null
                     Hypothesis BETA=0

                     Intercept
         Intercept      and
Criterion     Only   Covariates   Chi-Square for Covariates

AIC        159.306      153.781    .
SC         163.010      161.188    .
-2 LOG L   157.306      149.781    7.525 with 1 DF (p=0.0061)
Score         .             .      7.266 with 1 DF (p=0.0070)

             Analysis of Maximum Likelihood Estimates

            Parameter  Standard     Wald        Pr >     Standardized   Odds
Variable DF  Estimate     Error  Chi-Square  Chi-Square    Estimate    Ratio

INTERCPT  1    1.5917    0.3766    17.8675      0.0001       .           .
U         1    1.1108    0.4273     6.7592      0.0093     0.354175    3.037

   Association of Predicted Probabilities and Observed Responses

          Concordant = 66.7%      Somers' D = 0.344
          Discordant = 32.3%      Gamma     = 0.348
          Tied       =  1.0%      Tau-a     = 0.047
          (6116 pairs)            c         = 0.672

      program: /home/walleye/john-c/5421/imlml.sas 18MAR00 10:55

PROC IML Results ...

     ITER         L          DL                 D2L             DIFF       BETA
        1  -207.9442        128        -75        -75.25     1.42466    1.42466
                      135.45333     -75.25     -100.5006               0.2810698

     ITER         L          DL                 D2L             DIFF       BETA
        2  -86.11339  24.433947   -39.1126      -36.7227    0.382878   1.6898849
                      27.881783   -36.7227     -47.38327               0.6639477

     ITER         L          DL                 D2L             DIFF       BETA
        3  -76.10077  5.4183855   -24.6185     -20.35354   0.3405272   1.6284455
                      7.0927236  -20.35354     -24.50093               1.0044749

     ITER         L          DL                 D2L             DIFF       BETA
        4  -74.93779  0.8082239  -20.61113     -15.14088   0.1008562   1.5935698
                      1.1867474  -15.14088     -17.00239               1.1053311

     ITER         L          DL                 D2L             DIFF       BETA
        5  -74.89053    0.04008  -19.90169     -14.11319   0.0054776   1.5916993
                      0.0585386  -14.11319     -15.50634               1.1108087

     ITER         L          DL                 D2L             DIFF       BETA
        6   -74.8904  0.0001116   -19.8646     -14.05981   0.0000151   1.5916942
                      0.0001618  -14.05981      -15.4291               1.1108238

     ITER         L          DL                 D2L             DIFF       BETA
        7   -74.8904    8.55E-10  -19.86449    -14.05967   1.157E-10   1.5916942
                       1.2388E-9  -14.05967    -15.42889               1.1108238

                                  LLRATIO
          -2 * loglikelihood =  149.78081

                                       B1      SERR1
          b1 coeff, std err :   1.5916942   0.376554

                                       B2      SERR2
          b2 coeff, std err :   1.1108238  0.4272664

                                    COVAR
          covariance matrix :   0.1417929   -0.12921
                                 -0.12921  0.1825565

      program: /home/walleye/john-c/5421/imlml.sas 18MAR00 10:55

========================================================================

     The difficult part of this kind of program is writing out the
expressions for the 1st and 2nd partial derivatives.  To some extent
this problem cannot be avoided.

     The choice of initial values for the parameter vector BETA can
strongly affect the behavior of this kind of program.  In general you
want to choose a value for BETA that is close to the maximum point.
However, you usually don't know where that is.  In this case, (0, 0)'
was chosen and the algorithm converged quickly and satisfactorily.  In
other cases you may need to make an informed guess of where the
maximum might be.  You may be able to do this by trying several values
for BETA and examining the corresponding values of the log likelihood.
If the number of parameters is moderate or large (> 5), this can be
very time-consuming.

     In this case the algorithm converges in 7 iterations to the same
answer as that given by PROC LOGISTIC.  Note that during the
iterations,

     (1)  The log-likelihood L increases;
     (2)  The derivative vector, DL, gets close to (0, 0)';
     (3)  DIFF --> 0.

     The program above is a very rudimentary one for this purpose.
There is no reason to use this program rather than PROC LOGISTIC.  It
is intended as an example of how the algorithm can be implemented.
Features which should be added include the following:

     1.  A way of checking that the likelihood is actually increasing.

     2.  A check that the derivatives of the log-likelihood are very
         small when convergence is achieved: typically, less than
         10^(-5) in absolute value.

     3.  A provision for what to do if the algorithm does not converge
         within the prescribed number of steps.

     4.  A provision for 'step-halving': if a given iteration does not
         actually increase the likelihood, the parameter estimates can
         instead be incremented by one-half the amount.  Step-halving
         can be implemented as follows:

           1)  start with f = 1;
           2)  compute beta = beta - f * inv(d2l) * dl;
           3)  evaluate the log-likelihood at the new beta;
           4)  if the loglikelihood decreases at a given iteration,
               replace f by f/2 and go back to step 2;
           5)  keep doing this until you either do 10 step-halves, or
               the loglikelihood actually increases.  If after 10
               step-halves the loglikelihood has not improved, declare
               nonconvergence and exit.

         The rationale for step-halving is as follows.  The change in
         the parameters at a given step may be in the correct
         direction, but too large in magnitude to increase the
         log-likelihood.  Step-halving produces changes in the
         parameters in the same direction as that specified by the
         Newton process, but of smaller magnitude.  (A sketch of this
         modification appears after this list.)

     5.  Provisions for dealing with missing data need to be either
         built into the program, or the dataset used in the analysis
         must have the observations with missing data deleted before
         the IML procedure begins.

     6.  Provision to print an estimate of the standard errors of the
         coefficient estimates.  This can easily be added at the end
         of the above program, as follows:

              invd2l = inv(d2l) ;
              serr1 = sqrt(-invd2l[1, 1]) ;
              serr2 = sqrt(-invd2l[2, 2]) ;

         Here serr1 will be the estimated standard error of the
         intercept term, b0, and serr2 will be the estimated standard
         error for the coefficient b1.
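     Here is one way the step-halving modification of item 4 might
look inside the iteration loop of the program above.  This is a sketch
only: the names STEP, BETANEW, LNEW, and NHALF are made up, and the
surrounding program is assumed to be imlml.sas as listed above.

     do iter = 1 to 20 while(diff > eps) ;

        run loglike(beta, p, x, y, l, dl, d2l) ;
        step = inv(d2l) * dl ;
        f = 1 ;
        nhalf = 0 ;
        betanew = beta - f * step ;
        run loglike(betanew, p, x, y, lnew, dlnew, d2lnew) ;

        /*  Halve the step until the loglikelihood improves,   */
        /*  up to 10 times; then declare nonconvergence.       */
        do while (lnew <= l & nhalf < 10) ;
           f = f / 2 ;
           nhalf = nhalf + 1 ;
           betanew = beta - f * step ;
           run loglike(betanew, p, x, y, lnew, dlnew, d2lnew) ;
        end ;

        if nhalf = 10 & lnew <= l then do ;
           print 'Step-halving failed: declaring nonconvergence.' ;
           diff = 0 ;               /*  force the outer loop to stop  */
        end ;
        else do ;
           diff = max(abs(f * step)) ;
           beta = betanew ;
        end ;

     end ;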
     The reason the standard-error computation in item 6 works is that
the covariance matrix of the parameter estimates in maximum likelihood
estimation is asymptotically equal to the inverse of the Fisher
information.  The Fisher information is the negative of the
expectation of the 2nd derivative matrix of the log likelihood.  The
expectation is usually approximated by simply evaluating the 2nd
derivative matrix at the maximum likelihood estimate itself.  In this
case that 2nd derivative matrix is d2l, so the covariance matrix is
approximated by -inv(d2l).  Fisher information is an important idea.
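     Since the coefficient estimates are approximately normally
distributed, the standard errors from item 6 can be turned into
confidence intervals.  The following lines are a sketch of 95% Wald
intervals (the names LO1, HI1, LO2, and HI2 are made up; 1.96 is the
97.5th percentile of the standard normal distribution):

              lo1 = b1 - 1.96 * serr1 ;  hi1 = b1 + 1.96 * serr1 ;
              lo2 = b2 - 1.96 * serr2 ;  hi2 = b2 + 1.96 * serr2 ;
              print 'Intercept 95% CI : ' lo1 hi1 ;
              print 'Slope     95% CI : ' lo2 hi2 ;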
-------------------------------------------------------------------------------------

     Finally, why are programs like this useful, if you already have
things like PROC LOGISTIC to find coefficient estimates and other
statistics?

     SAS and Splus both provide a wide but still limited range of
statistical routines for regression and analysis of variance.  In SAS,
the primary areas are (1) linear models, (2) a subset of generalized
linear models and loglinear models (PROC GENMOD), (3) mixed models
(PROC MIXED), (4) nonlinear least squares (PROC NLIN), and (5)
proportional hazards models for survival analysis.  SAS does not have
a procedure to deal with some models in survival analysis (e.g.,
Gompertz or Pareto), general forms of Poisson regression, and many
other models.  If you need to find maximum likelihood estimates for a
model not handled by a SAS or Splus procedure, you may need to write a
program something like the example above, or you may need to make use
of a statistical package like BMDP, which does include a procedure
(LE) for maximum-likelihood estimation.

=================================================================================

                              PROBLEM 18

1.  Use a program like that given above to do the following logistic
    regression problem: in the Lung Health Study dataset, assume that
    the probability of quitting smoking at Year 1 is given by:

         prob(quit smoke at A1) = 1 / (1 + exp(-b0 - b1*age)).

    Thus, the problem is to estimate b0 and b1.  Check your answer by
    using PROC LOGISTIC or the Splus version of logistic regression.

2.  Again, use a similar program to estimate the coefficients for the
    following model:

         prob(quit smoke at A1) = 1 / (1 + exp(-b0 - b1*age - b2*gender)).

    Also estimate standard errors of the coefficients.

=================================================================================
/home/walleye/john-c/5421/notes.019    Last update: October 21, 2003
=================================================================================

SAS PROC IML                                   SPH 5421 notes.logistic

              Maximum likelihood estimation, continued

APPENDIX to notes.019:  The Logistic Model: Likelihood, Log-likelihood,
Derivatives, and Maximum Likelihood Estimation.

     The logistic model is used to model the probability of an event
as a function of variables called risk factors.  Occurrence of the
event (the outcome variable) is represented by a variable Y, where
Y = 1 means the event occurred and Y = 0 means it did not occur.

     The probability that the event occurs is often a function of
certain risk factors.  For example, if the event is death from stroke,
the probability of the event within a given fixed time interval (for
example, one year) depends on the age of the person and the person's
systolic blood pressure.  Older people are at higher risk, and people
with high SBP are at higher risk.

     For the stroke-death example, the logistic model is

     prob(Y = 1 | X1 = age, X2 = SBP) = 1 / (1 + exp(-b0 - b1*X1 - b2*X2)).

Here b0, b1, and b2 are coefficients that are usually unknown.  The
typical problem for a given study is to collect data and use a program
to estimate the values of b0, b1, and b2.

     Note that the logistic probability is always a number between 0
and 1.  Also note that if, for example, b2 is positive, then an
increase in X2 will produce an increase in the probability that Y = 1.
Conversely, if b2 is negative, then an increase in X2 will cause a
decrease in the probability that Y = 1.

     It is easy to show that

     prob(Y = 0 | X1, X2) = 1 - prob(Y = 1 | X1, X2)

                          = exp(-b0 - b1*X1 - b2*X2) /
                            (1 + exp(-b0 - b1*X1 - b2*X2)).

     Assume you are carrying out an observational study in which you
determine each participant's age and systolic blood pressure at a
baseline visit, and then observe what happens to the person during a
one-year followup period.  Suppose a person is 85 years of age and has
SBP = 160, and that he dies of stroke within the one-year period of
observation.  The likelihood for this observation for this single
individual is:

     L1 = 1 / (1 + exp(-b0 - b1*85 - b2*160)).

On the other hand, if he does NOT die of stroke within a year, the
outcome Y is set to zero, and the likelihood for the observation is

     L0 = 1 - L1 = exp(-b0 - b1*85 - b2*160) /
                   (1 + exp(-b0 - b1*85 - b2*160)).

     It is possible to write the likelihood as a single expression
which is a function of Y, X1, and X2:

     L = Y / (1 + exp(-b0 - b1*X1 - b2*X2))
         + (1 - Y) * exp(-b0 - b1*X1 - b2*X2) /
                     (1 + exp(-b0 - b1*X1 - b2*X2)).

(Note that if Y = 1, the second part of the expression is 0, while if
Y = 0, the first part of the expression is 0.)
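     To make this concrete, the following data step evaluates L1 and
L0 for the 85-year-old with SBP = 160 at one set of coefficient
values.  The values of b0, b1, and b2 here are made-up numbers chosen
only for illustration, not estimates from any fitted model:

     data _null_ ;
          b0 = -9 ; b1 = 0.08 ; b2 = 0.01 ;  /* hypothetical coeffs */
          x1 = 85 ; x2 = 160 ;               /* age and SBP         */
          eta = b0 + b1*x1 + b2*x2 ;
          l1 = 1 / (1 + exp(-eta)) ;         /* likelihood if Y = 1 */
          l0 = 1 - l1 ;                      /* likelihood if Y = 0 */
          put l1= l0= ;
     run ;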
     If you knew what the values of b0, b1, and b2 were, you could
explicitly calculate what the person's risk of an event would be.  The
occurrence or non-occurrence of the event for a single individual can
be viewed as a Bernoulli random variable.

     The likelihood of the entire set of observations from the whole
study is a product of the individual likelihoods.  Different
individuals have different values of age and SBP, and possibly
different outcomes.  Thus the total likelihood of all the observations
could be written

     L = product{Yi / (1 + exp(-b0 - b1*X1i - b2*X2i))
                 + (1 - Yi) * exp(-b0 - b1*X1i - b2*X2i) /
                              (1 + exp(-b0 - b1*X1i - b2*X2i))},

where the i subscript refers to the i-th person in the study, and the
product is taken across all the observations, i = 1, 2, ..., n.

     The main question in a typical study is: what coefficients 'best'
fit the observed data?  Note that the likelihood is a function both of
the observed data (Yi, X1i, X2i) and of the unknown coefficients b0,
b1, and b2.  The observed data are fixed constants, and there are
typically many data points and measurements which make them up.

     If you have a study involving 1000 people with one outcome
variable Y and two covariates X1 and X2 for each person, the number of
observed data values is 3000.  On the other hand, there are only 3
unknown coefficients.  Thus in a sense, the likelihood is a function
of only 3 variables, b0, b1, and b2.  The goal of maximum likelihood
estimation is to find the values of the unknown coefficients which
maximize the likelihood.  The problem of finding maximum likelihood
estimates, therefore, is just a problem of maximizing a nonlinear
function with a relatively small number of unknowns.  It is a very
complicated function, but, in the case of the stroke-death study, only
3 unknown coefficients need to be estimated.

     Again, the likelihood is a product of probabilities.  Many of the
probabilities will be small, and of course none are larger than 1 (or
smaller than 0).  In most studies, for realistic choices of the
coefficients b0, b1, and b2, the likelihood is a very small number,
perhaps on the order of e^(-1000).  This means that it is very hard to
do accurate computations with the likelihood itself.  It is much
easier computationally to deal with the LOG of the likelihood.  The
log (natural log) function is a monotone increasing function, so if
you find values which maximize the log of the likelihood, those same
values will also maximize the likelihood itself.  Another good thing
about taking the log of the likelihood is that the PRODUCT in the
likelihood expression gets converted to a SUM.  Sums are generally
easier to compute with than products.

     The expression for the log-likelihood of the data set is:

     l = sum{log[Yi + (1 - Yi) * exp(-Xi*b)] - log[1 + exp(-Xi*b)]},

where the sum is over i = 1, 2, ..., n.  Note also that Xi*b is matrix
shorthand: Xi = (1, X1i, X2i) and b = (b0, b1, b2)', so that

     exp(-Xi*b) = exp(-b0 - b1*X1i - b2*X2i).

     The method of finding the values of b0, b1, and b2 that maximize
the log likelihood is simply the Newton method for finding the maximum
point of a nonlinear function.  It is not particularly hard to write
down a PROC IML routine which will do this.  The only tricky part is
writing the correct expressions for the first- and second-order
partial derivatives of the log likelihood.

     Here is the first partial derivative of the log-likelihood l with
respect to b1:

     dl/db1 = sum{-(1 - Yi)*X1i*exp(-Xi*b) / (Yi + (1 - Yi)*exp(-Xi*b))
                  + X1i*exp(-Xi*b) / (1 + exp(-Xi*b))}.

The first partial derivative with respect to b2 is similar.  The first
partial derivative with respect to b0 is also similar, except that
where X1i occurs in the expression above, it is replaced by 1.

     The second partial derivative of l with respect to b1 (twice) is

     d2l/db1db1 = sum{[(1 - Yi)*(X1i^2)*exp(-Xi*b)*(Yi + (1 - Yi)*exp(-Xi*b))
                       - ((1 - Yi)^2)*(X1i^2)*exp(-2*Xi*b)]
                      / (Yi + (1 - Yi)*exp(-Xi*b))^2
                      + [-(X1i^2)*exp(-Xi*b)*(1 + exp(-Xi*b))
                         + (X1i^2)*exp(-2*Xi*b)]
                      / (1 + exp(-Xi*b))^2}.

     The mixed second partial derivative of l with respect to b1 and
b2 is

     d2l/db1db2 = sum{[(1 - Yi)*(X1i*X2i)*exp(-Xi*b)*(Yi + (1 - Yi)*exp(-Xi*b))
                       - ((1 - Yi)^2)*(X1i*X2i)*exp(-2*Xi*b)]
                      / (Yi + (1 - Yi)*exp(-Xi*b))^2
                      + [-(X1i*X2i)*exp(-Xi*b)*(1 + exp(-Xi*b))
                         + (X1i*X2i)*exp(-2*Xi*b)]
                      / (1 + exp(-Xi*b))^2}.

     Most of the derivation of these awful-looking expressions is just
application of the quotient rule for derivatives.  You do have to be
very careful about plus and minus signs.  The same expression occurs
in a slightly disguised form in the program imlml.sas, in notes.019.
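     Because sign errors are so easy to make, it is worth checking the
analytic derivatives numerically before trusting them.  The following
IML fragment is a sketch of a central-difference check on the first
derivatives.  It assumes the module LOGLIKE and the matrices x and y
from imlml.sas are already defined, and the names BPLUS, BMINUS,
NUMDER, and ANADER are made up:

     h = 1e-5 ;
     run loglike(beta, p, x, y, l0, dl0, d2l0) ;  /* analytic derivs */
     do jj = 1 to p ;
        e = j(p, 1, 0) ;
        e[jj] = 1 ;                       /*  jj-th unit vector      */
        bplus  = beta + h * e ;
        bminus = beta - h * e ;
        run loglike(bplus,  p, x, y, lp, dp, d2p) ;
        run loglike(bminus, p, x, y, lm, dm, d2m) ;
        numder = (lp - lm) / (2 * h) ;    /*  central difference     */
        anader = dl0[jj] ;                /*  analytic value         */
        print jj numder anader ;
     end ;

The two printed values for each coefficient should agree to several
decimal places.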
     There are some operations in the above expressions which can
consume a lot of computer time.  The expression exp(-Xi*b) occurs over
and over again.  Recall that Xi is a 1 x p matrix and b is a p x 1
matrix.  Every computation of Xi*b involves some arithmetic: p
multiplications and p - 1 additions.  You don't see this explicitly in
IML, but it occurs and requires some computer time.  Also, functions
like exponentiation are rather time-consuming.  It is better to do
these things as few times as possible.

     That is why, in the program in notes.019, xibeta = xi * beta is
computed only once for each observation, and w = exp(-xibeta) is
computed only once for each observation.  After these two
computations, all the occurrences of exp(-Xi*b) inside the j and k
loops are replaced by w.  This saves some time.  There are other
places in the program where some savings in computer time could be
realized, but they are less important.

     The logistic likelihood is fairly typical of likelihood
expressions.  In practice more complicated likelihood expressions are
often encountered.  If the Newton process of finding maximum
likelihood estimates is used, some way of estimating first and second
derivatives must be available.  You can try numerical estimation, as
is done in the program numml.sas in notes.020.  In that example, the
first derivatives are computed analytically but the second derivatives
are estimated numerically.  The process appears to work satisfactorily
for a simulated dataset.  However, in general numerical estimates of
derivatives are less trustworthy than analytic expressions.  They also
typically require more function evaluations, and hence more computer
time.

     Programs exist which can read programming code for a function and
produce an exact analytic expression for the derivative.  The math
package Mathematica has such a capability.  It may be worthwhile to
use such a program if you are faced with a very complicated likelihood
function.

                  VARIANCE ESTIMATION AND TESTING

     Maximum likelihood estimation has several good features, first
noticed by R. A. Fisher.  Maximum likelihood estimates (MLEs) are
consistent (hence asymptotically unbiased), are asymptotically
efficient (smallest asymptotic variance), and are approximately
normally distributed in large samples.  An approximate
variance-covariance matrix of the estimates is the negative inverse of
the second derivative matrix evaluated at the MLE.  This is computed
explicitly in the PROC IML program.  The square roots of the diagonal
elements of this matrix are approximate standard errors of the
coefficient estimates, and can be used to compute confidence intervals
for the true coefficients.

     The log-likelihood itself can be used for testing hypotheses.
Suppose you are considering two models for the probability of stroke
death:

     Model 1:  prob(Y = 1 | age)      = 1 / (1 + exp(-b0 - b1*age))

     Model 2:  prob(Y = 1 | age, SBP) = 1 / (1 + exp(-b0 - b1*age - b2*SBP))

You use a program like that used in notes.019 to compute MLEs for both
models.  The program computes estimates of the log-likelihood itself
in the course of computing the MLEs.

     These two models are hierarchically related: that is, Model 2 is
obtained from Model 1 by simply adding a new risk factor (SBP).  A
test for whether adding SBP to the model makes a difference can be
obtained by computing the difference in -2 * loglikelihood for the two
models; that is,

     D = (-2*logL[Model 1]) - (-2*logL[Model 2]).

Note that the statistic D is always nonnegative.
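     As a worked illustration, the PROC LOGISTIC output earlier in
these notes reports -2 LOG L = 157.306 for the intercept-only model
and 149.781 for the model that adds the covariate U, so
D = 157.306 - 149.781 = 7.525.  The following data step is a sketch of
how the test could be computed in SAS (PROBCHI gives the lower tail of
the chi-square distribution):

     data _null_ ;
          m2logl0 = 157.306 ;          /*  -2 log L, smaller model   */
          m2logl1 = 149.781 ;          /*  -2 log L, larger model    */
          d = m2logl0 - m2logl1 ;      /*  likelihood ratio statistic */
          pvalue = 1 - probchi(d, 1) ; /*  1 df: one added term      */
          put d= pvalue= ;
     run ;

This reproduces the "7.525 with 1 DF (p=0.0061)" line in the PROC
LOGISTIC output.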
     There is a theorem which implies that, under the null hypothesis
that adding SBP to a model that already includes age does not make a
difference, the statistic D has approximately a chi-square
distribution with 1 degree of freedom.  [The null hypothesis here can
also be stated as:  H0: b2 = 0.]  If D is large, you will tend to
reject the null hypothesis.  For example, if D > 3.84, you would
reject the null hypothesis at the .05 significance level.

     You need to be careful in using this test.  The two models must
be hierarchically related as described above.  Note that you can use
the same kind of procedure to test whether Model 1 fits the data
better than the intercept-only model (Model 0).  Equivalently, this is
a test of whether b1 = 0 in Model 1.

========================================================================

SECOND APPENDIX TO notes.019:  SIMPLIFYING COMPUTATIONS

     Fisher information is defined as the expectation of the square of
the first derivative of the log-likelihood,

     I(B) = E{[d log(L) / dB] * [d log(L) / dB]`},

where B represents a p x 1 vector of parameters and d log(L)/dB is the
p x 1 vector of first derivatives of the log likelihood.  Note that
the vector product inside the expectation is in fact a p x p matrix of
products of first derivatives.

     There is a theorem which says that another formula for I(B) is

     I(B) = -E{d2 log(L) / dB2},

that is, Fisher information also equals the negative of the
expectation of the second derivative of the log likelihood.  Note that
the second-derivative matrix is also a p x p matrix.  A precise
statement of this theorem for one-parameter distributions can be found
in various textbooks, including Bickel and Doksum (Mathematical
Statistics; Prentice-Hall, 1977), p. 139 and Exercise 4.3.9, p. 148.

     As it happens, this theorem is useful for simplifying maximum
likelihood estimation programs which make use of Newton's method.
Newton's method requires the second-derivative matrix in the iteration
step

     V1 = V0 - inv(d2logL) * d1logL.

The theorem above implies that you might get approximately the same
answer if you replace the second derivative matrix by the negative of
the matrix of first-derivative products, accumulated observation by
observation as in the program below:

     V1 = V0 + inv(d1logL * d1logL`) * d1logL.

If this works, it has the significant advantage that you need only
derive expressions for the first derivatives, and then assemble the
appropriate products in order to carry out the iterative step in the
algorithm.

     This variant of Newton's method is called either Gauss's method
or the Gauss-Newton method.  It is implemented below in a variation of
the program imlml.sas which appears on pages 19.1-19.3 of notes.019.
     The only difference between these two programs is that in
imlml.sas, the following code appears:

------------------------------------------------------------------------

/*  The jkth 2nd derivative of the log likelihood for the ith obs     */;

              jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
                         - (1 - yi) * w * xij * ((1 - yi) * w * xik))
                         / ((1 - yi) * w + yi)**2
                         + (-w * xij * xik * (1 + w) + w * w * xij * xik)
                         / (1 + w)**2 ;
              d2temp = d2l[j, k] ;
              d2l[j, k] = d2temp + jkd2logl ;

------------------------------------------------------------------------

In the modified program gaussimlml.sas, this section is replaced by:

------------------------------------------------------------------------

/*  The kth 1st derivative of the log likelihood for the ith obs      */;

              kdlogl = -(1 - yi) * w * xik / ((1 - yi) * w + yi)
                       + w * xik / (1 + w) ;
              d2temp = -jdlogl * kdlogl ;
              d2l[j, k] = d2l[j, k] + d2temp ;

------------------------------------------------------------------------

     Note that kdlogl, the derivative of the log likelihood of the
data for the i-th observation with respect to the k-th parameter, has
the same form as jdlogl, except that the covariate xij is replaced in
two places by xik.  Also note that d2temp is the NEGATIVE of the
product of jdlogl and kdlogl.

     Here is the complete program gaussimlml.sas, followed by printout
for the same data that occurred in the earlier example for imlml.sas:

========================================================================

/*  Program to compute maximum likelihood estimates ....              */;
/*  Based on Gauss-Newton methods - uses IML                          */;

footnote "program: /home/walleye/john-c/5421/gaussimlml.sas &sysdate &systime" ;
options linesize = 80 ;

data sim ;
     one = 1 ;
     a = 2 ;
     b = 1 ;
     n = 300 ;
     seed = 20000214 ;

/*  Simulated data from logistic distribution                         */;

     do i = 1 to n ;
        u = 2 * i / n ;
        p = 1 / (1 + exp(-a - b * u)) ;
        r = ranuni(seed) ;
        yi = 0 ;
        if r lt p then yi = 1 ;
        output ;
     end ;
run ;

/*  Logistic analysis of simulated data ......                        */;

proc logistic descending ;
     model yi = u ;
     title1 'Proc logistic results ... ' ;
run ;

/*  Begin proc iml ..............................                     */;

title1 'PROC IML Results ...' ;

proc iml ;

     use sim ;                      /*  Input the simulated dataset   */;
     read all var {one u} into x ;  /*  Read the design matrix into x */;
     read all var {yi} into y ;     /*  Read the outcome data into y  */;

     p = 2 ;                        /*  Set the number of parameters  */;
     beta = {0.00, 0.00} ;          /*  Initialize the param. vector  */;

/* ------------------------------------------------------------------ */;
/*  Module which computes loglikelihood and derivatives ...           */;

     start loglike(beta, p, x, y, l, dl, d2l) ;

     n = 300 ;
     l = 0 ;                        /*  Initialize log likelihood ... */;
     dl = j(p, 1, 0) ;              /*  Initialize 1st derivatives... */;
     d2l = j(p, p, 0) ;             /*  Initialize 2nd derivatives... */;

     do i = 1 to n ;

        xi = x[i,] ;
        yi = y[i] ;
        xibeta = xi * beta ;
        w = exp(-xibeta) ;

/*  The log likelihood for the i-th observation ...                   */;

        iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
        l = l + iloglike ;

        do j = 1 to p ;

           xij = xi[j] ;

/*  The jth 1st derivative of the log likelihood for the ith obs      */;

           jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
                    + w * xij / (1 + w) ;
           dtemp = dl[j] + jdlogl ;
           dl[j] = dtemp ;

           do k = 1 to p ;

              xik = xi[k] ;

/*  The kth 1st derivative of the log likelihood for the ith obs      */;

              kdlogl = -(1 - yi) * w * xik / ((1 - yi) * w + yi)
                       + w * xik / (1 + w) ;
              d2temp = -jdlogl * kdlogl ;
              d2l[j, k] = d2l[j, k] + d2temp ;

           end ;
        end ;
     end ;

     finish loglike ;

/* ------------------------------------------------------------------ */;

     eps = 1e-8 ;
     diff = 1 ;

/*  The following do loop stops when the increments in beta are       */;
/*  sufficiently small, or when number of iterations reaches 20       */;

     do iter = 1 to 20 while(diff > eps) ;

        run loglike(beta, p, x, y, l, dl, d2l) ;
        invd2l = inv(d2l) ;
        beta = beta - invd2l * dl ;    /*  The key Newton step ...    */;
        diff = max(abs(invd2l * dl)) ;
        print iter l dl d2l diff beta ;

     end ;

     b1 = beta[1] ;
     b2 = beta[2] ;
     serr1 = sqrt(-invd2l[1, 1]) ;
     serr2 = sqrt(-invd2l[2, 2]) ;
     covar = -invd2l ;
     llratio = - 2 * l ;

     print ' -2 * loglikelihood = ' llratio ;
     print 'b1 coeff, std err :  ' b1 serr1 ;
     print 'b2 coeff, std err :  ' b2 serr2 ;
     print 'covariance matrix :  ' covar ;

quit ;

========================================================================
Proc logistic results ...

                      The LOGISTIC Procedure

     Data Set: WORK.SIM
     Response Variable: YI
     Response Levels: 2
     Number of Observations: 300
     Link Function: Logit

                         Response Profile

                     Ordered
                       Value     YI     Count
                           1      1       278
                           2      0        22

      Model Fitting Information and Testing Global Null
                     Hypothesis BETA=0

                     Intercept
         Intercept      and
Criterion     Only   Covariates   Chi-Square for Covariates

AIC        159.306      153.781    .
SC         163.010      161.188    .
-2 LOG L   157.306      149.781    7.525 with 1 DF (p=0.0061)
Score         .             .      7.266 with 1 DF (p=0.0070)

             Analysis of Maximum Likelihood Estimates

            Parameter  Standard     Wald        Pr >     Standardized   Odds
Variable DF  Estimate     Error  Chi-Square  Chi-Square    Estimate    Ratio

INTERCPT  1    1.5917    0.3766    17.8675      0.0001       .           .
U         1    1.1108    0.4273     6.7592      0.0093     0.354175    3.037

   Association of Predicted Probabilities and Observed Responses

          Concordant = 66.7%      Somers' D = 0.344
          Discordant = 32.3%      Gamma     = 0.348
          Tied       =  1.0%      Tau-a     = 0.047
          (6116 pairs)            c         = 0.672

    program: /home/walleye/john-c/5421/gaussimlml.sas 15MAR01 16:30

PROC IML Results ...
     ITER         L          DL                 D2L             DIFF       BETA
        1  -207.9442        128        -75        -75.25     1.42466    1.42466
                      135.45333     -75.25     -100.5006               0.2810698

     ITER         L          DL                 D2L             DIFF       BETA
        2  -86.11339  24.433947  -21.98682      -16.7216   1.3524393   1.5073912
                      27.881783   -16.7216     -19.59303               1.6335091

     ITER         L          DL                 D2L             DIFF       BETA
        3  -76.18316  -4.486689  -19.88465     -14.20949   0.3422821   1.5263487
                      -5.234316  -14.20949      -16.0794               1.291227

     ITER         L          DL                 D2L             DIFF       BETA
        4  -75.01091  -1.128447  -19.82677     -14.12761     0.15291   1.5783898
                      -1.706508  -14.12761     -15.96837               1.138317

     ITER         L          DL                 D2L             DIFF       BETA
        5  -74.89283  -0.120079  -19.82628     -14.11533   0.0251799   1.5902601
                      -0.234054  -14.11533     -15.94952               1.1131371

     ITER         L          DL                 D2L             DIFF       BETA
        6  -74.89042  -0.004023  -19.82753     -14.11462   0.0021435   1.591583
                      -0.015512  -14.11462     -15.94839               1.1109937

     ITER         L          DL                 D2L             DIFF       BETA
        7   -74.8904   -0.00018  -19.82766     -14.11456   0.0001575   1.5916861
                      -0.001058  -14.11456     -15.94832               1.1108361

     ITER         L          DL                 D2L             DIFF       BETA
        8   -74.8904  -0.000013  -19.82767     -14.11456   0.0000114   1.5916936
                      -0.000076  -14.11456     -15.94831               1.1108247

     ITER         L          DL                 D2L             DIFF       BETA
        9   -74.8904  -9.028E-7  -19.82767     -14.11455   8.2616E-7   1.5916942
                      -5.518E-6  -14.11455     -15.94831               1.1108239

     ITER         L          DL                 D2L             DIFF       BETA
       10   -74.8904  -6.533E-8  -19.82767     -14.11455   5.9797E-8   1.5916942
                      -3.993E-7  -14.11455     -15.94831               1.1108238

     ITER         L          DL                 D2L             DIFF       BETA
       11   -74.8904  -4.728E-9  -19.82767     -14.11455    4.328E-9   1.5916942
                       -2.89E-8  -14.11455     -15.94831               1.1108238

                                  LLRATIO
          -2 * loglikelihood =  149.78081

                                       B1      SERR1
          b1 coeff, std err :   1.5916942  0.3692068

                                       B2      SERR2
          b2 coeff, std err :   1.1108238  0.4116691

                                    COVAR
          covariance matrix :   0.1363137   -0.12064
                                 -0.12064  0.1694714

    program: /home/walleye/john-c/5421/gaussimlml.sas 15MAR01 16:30

========================================================================

     Comparing the two printouts for imlml.sas and gaussimlml.sas
shows that they produce exactly the same parameter estimates.  The
variance estimates differ slightly.  The theorem which justifies the
Gauss method does not say that the observed matrices will exactly
agree; it says that they have the same expectation.  In fact they will
differ markedly from each other in regions of the parameter space
which are not near the maximum likelihood estimates themselves.

     The Gauss-method program takes 11 iterations to converge in this
case, while the Newton method requires only 7.  The final estimates of
the D2L matrices do not agree, and, although identical parameter
estimates are obtained, the estimated standard errors also do not
agree.

     Which method is better?  It's not clear.  The Newton
second-derivative method is based on a Taylor expansion of the first
derivative, evaluated at the current estimate.  The Gauss method
depends on the theorem noted above,

     I(B) = -E{d2 log(L) / dB2},

where I(B) is Fisher information, defined as

     I(B) = E{[d log(L) / dB] * [d log(L) / dB]`}.

In that sense, the Gauss method is an approximation, since the
expectation is not known exactly.  The Gauss method may not do well
until you are close to the MLE, because the theorem above applies only
at that point.  However, Taylor's theorem is itself an approximation.
It is not clear which approximation is better in general.  I am
slightly inclined to think the Newton method will perform better, but
it has the disadvantage that it requires computation of second
derivatives.

     SAS, in the nonlinear least squares procedure PROC NLIN, makes
use of either the Newton or Gauss methods, as well as several others.
The default method for PROC NLIN is Gauss.  The Newton method can be
specified as an option.  Note that in the SAS implementation, if the
Newton method is used, the second derivatives are estimated
numerically, whereas the first derivatives are specified analytically.
My own experience with PROC NLIN indicates that both the Gauss and
Newton methods are not quite as reliable as a third variant,
MARQUARDT.
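     For reference, here is a sketch of how the fitting method can be
specified in PROC NLIN, using the simulated data set SIM from above.
(PROC NLIN fits by nonlinear least squares rather than maximum
likelihood, so the estimates will not match the IML programs exactly;
this fragment only illustrates the METHOD= option.)

     proc nlin data = sim method = marquardt ;  /* or gauss, newton */
          parms b0 = 0  b1 = 0 ;                /* starting values  */
          model yi = 1 / (1 + exp(-b0 - b1*u)) ;
     run ;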
========================================================================

                             PROBLEM 18A

     The following data set shows values of a variable X whose
distribution is a mixture of two normal distributions.  That is, there
is a probability p1 that X is drawn from a normal distribution with
mean mu1 and standard deviation s, and a probability 1 - p1 that X is
drawn from another normal distribution with mean mu2 and the same
standard deviation s.  A histogram is shown below the dataset.

     The unknown parameters are: mu1, mu2, s, and p1.  Write a program
which will compute maximum likelihood estimates of these parameters.

   Obs    x         Obs    x         Obs    x         Obs    x
     1  0.80625      51  0.43247     101  1.21017     151  0.69556
     2  1.03691      52  0.83582     102  0.73679     152  0.57002
     3  1.46767      53  0.87399     103  0.62642     153  0.82694
     4  0.74435      54  0.95347     104  1.62786     154  1.27763
     5  0.76884      55  1.41779     105  0.62392     155  0.57893
     6  1.15201      56  0.78824     106  1.68102     156  1.41407
     7  1.76863      57  0.43659     107  0.85341     157  0.84335
     8  1.35288      58  1.47591     108  0.72261     158  1.29575
     9  1.39510      59  1.76129     109  0.98163     159  1.22293
    10  1.38580      60  1.29779     110  1.02806     160  1.15268
    11  1.36411      61  1.21414     111  1.02553     161  1.24789
    12  1.51375      62  1.30930     112  1.60963     162  0.77514
    13  0.68575      63  1.69770     113  1.02189     163  1.26712
    14  1.11369      64  1.53651     114  0.89691     164  1.74086
    15  1.48374      65  1.54546     115  1.39085     165  1.65762
    16  1.15337      66  1.62319     116  0.92491     166  0.70038
    17  1.23556      67  0.83417     117  0.89822     167  0.61442
    18  1.11115      68  0.50302     118  0.96070     168  0.87938
    19  1.31487      69  0.76503     119  1.27105     169  1.40703
    20  1.52987      70  1.06025     120  1.29916     170  1.48931
    21  1.33171      71  1.39400     121  1.12870     171  1.02206
    22  1.45874      72  1.20968     122  1.20774     172  1.46202
    23  0.59281      73  0.84474     123  1.40016     173  1.72561
    24  1.51549      74  1.10303     124  1.48849     174  0.80142
    25  1.24540      75  0.94593     125  0.91027     175  1.43894
    26  0.82244      76  1.24635     126  1.51901     176  0.60786
    27  1.31432      77  0.77448     127  0.92196     177  0.93005
    28  0.54313      78  1.22901     128  1.27068     178  1.98187
    29  0.79331      79  1.75974     129  0.98230     179  1.31041
    30  1.92460      80  0.70282     130  1.20435     180  1.56395
    31  0.65663      81  1.33777     131  0.66172     181  1.07366
    32  1.33266      82  0.87791     132  1.33744     182  1.04669
    33  1.04949      83  1.68290     133  1.17518     183  0.64663
    34  0.97332      84  0.57039     134  1.33228     184  0.60105
    35  1.49776      85  1.58249     135  0.47799     185  0.59709
    36  1.53517      86  1.26539     136  1.39439     186  0.59587
    37  0.81621      87  1.14392     137  0.88771     187  1.19483
    38  1.48104      88  1.57961     138  1.28552     188  1.42432
    39  0.90330      89  0.95139     139  0.75350     189  0.76585
    40  1.31403      90  0.86490     140  0.81856     190  0.88321
    41  0.93707      91  0.69971     141  0.64155     191  1.42488
    42  0.85454      92  1.05069     142  1.08935     192  0.73576
    43  1.59543      93  1.06263     143  1.76052     193  1.62198
    44  1.85178      94  0.83222     144  0.77502     194  0.68361
    45  0.79937      95  1.24652     145  0.65953     195  0.87568
    46  1.91335      96  0.69845     146  0.87828     196  0.44357
    47  1.12369      97  1.51541     147  1.34773     197  1.39430
    48  1.02486      98  1.25000     148  0.93423     198  0.95161
    49  0.98977      99  0.49979     149  0.83391     199  0.75769
    50  1.28284     100  1.22121     150  0.97298     200  1.07720
                     Problem 18A, Histogram ...
                        Use of PROC CHART

     [PROC CHART histogram of x: frequency (vertical axis, 0 to about
      30) by x midpoint, with midpoints 0.45, 0.60, ..., 1.95 in steps
      of 0.15.]

========================================================================
~john-c/5421/notes.logistic    Last update: November 15, 2010