APPLICATIONS OF SAS IML: SPH 5421 notes.019
FINDING MAXIMUM LIKELIHOOD ESTIMATES
Why might it be useful in statistics to be able to find minima or maxima of a
given nonlinear function?
Suppose L(B) = L(Y, X;B) is the loglikelihood for a given data set. Here Y
could indicate an outcome variable (dependent variable) and X a vector of
predictors. Both Y and X are measured variables in the data set. The vector B
represents unknown coefficients in the model.
The object of maximum likelihood estimation is to find the value of B which
maximizes L(Y, X ; B) for the observed values of Y and X. As above this reduces to
finding where the partial derivatives of L are 0. That is, the object is to find B =
(B1, B2, B3, ... , Bp) such that
[1]       dL/dB1 = 0
          dL/dB2 = 0
            ...
          dL/dBp = 0.
This is a system of p simultaneous nonlinear equations. In general the
solution cannot be found by a formula. It must be estimated numerically by an
iterative process like Newton's method.
The system of equations [1] can be written in vector form as
[2] dL/dB = 0,
where 0 is the 0-vector in p-dimensional space.
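One step of Newton's method for solving [2] replaces the current estimate, B(old),
by
          B(new) = B(old) - [d2L/dB2]^(-1) * (dL/dB),
where d2L/dB2 is the p x p matrix of second partial derivatives of L and both
derivatives are evaluated at B(old). This is the 'key Newton step' that appears in
the program below.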
In theory, finding estimates of the parameter vector B which maximize the
likelihood is no different from the problem, discussed above, of minimizing or
maximizing a nonlinear function. In practice, however, log likelihood functions are
often awkward to work with because the expressions for them involve not only the
unknown parameters, but also the independent variables X in the design matrix and
the outcome variable Y. This implies that each evaluation of the likelihood
function requires a complete pass through the entire design matrix.
The following example is typical. It uses PROC IML in SAS to produce maximum
likelihood estimates for a logistic model. The basic algorithm is Newton's method,
which requires repeated evaluation of the loglikelihood, its first derivative
vector, and its second derivative matrix. Output
from this program is also appended.
The parameter vector, BETA, is assumed to be p x 1. The first derivative
vector of the log-likelihood L is also p x 1. The second derivative matrix of the
loglikelihood is a square, symmetric p x p matrix.
========================================================================
/* Program to compute maximum likelihood estimates .... */;
/* Based on Newton-Raphson methods - uses IML */;
footnote "program: /home/walleye/john-c/5421/imlml.sas &sysdate &systime" ;
options linesize = 80 ;
data sim ;
one = 1 ;
a = 2 ;
b = 1 ;
n = 300 ;
seed = 20000214 ;
/* Simulated data from logistic distribution */;
do i = 1 to n ;
u = 2 * i / n ;
p = 1 / (1 + exp(-a - b * u)) ;
r = ranuni(seed) ;
yi = 0 ;
if r lt p then yi = 1 ;
output ;
end ;
run ;
/* Logistic analysis of simulated data ...... */;
proc logistic descending ;
model yi = u ;
title1 'Proc logistic results ... ' ;
run ;
/* Begin proc iml .............................. */;
title1 'PROC IML Results ...' ;
proc iml ;
use sim ; /* Input the simulated dataset */;
read all var {one u} into x ; /* Read the design matrix into x */;
read all var {yi} into y ; /* Read the outcome data into y */;
p = 2 ; /* Set the number of parameters */;
beta = {0.00, 0.00} ; /* Initialize the param. vector */;
/* -------------------------------------------------------------------*/;
/* Module which computes loglikelihood and derivatives ... */;
start loglike(beta, p, x, y, l, dl, d2l) ;
n = 300 ;
l = 0 ; /* Initialize log likelihood ... */;
dl = j(p, 1, 0) ; /* Initialize 1st derivatives... */;
d2l = j(p, p, 0) ; /* Initialize 2nd derivatives... */;
do i = 1 to n ;
xi = x[i,] ;
yi = y[i] ;
xibeta = xi * beta ;
w = exp(-xibeta) ;
/* The log likelihood for the i-th observation ... */;
iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
l = l + iloglike ;
do j = 1 to p ;
xij = xi[j] ;
/* The jth 1st derivative of the log likelihood for the ith obs */;
jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
+ w * xij / (1 + w) ;
dtemp = dl[j] + jdlogl ;
dl[j] = dtemp ;
do k = 1 to p ;
xik = xi[k] ;
/* The jkth 2nd derivative of the log likelihood for the ith obs */;
jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
-(1 - yi) * w * xij * ((1 - yi) * w * xik))/
((1 - yi) * w + yi)**2
+ (-w * xij * xik * (1 + w) + w * w * xij * xik)/
(1 + w)**2 ;
d2temp = d2l[j, k] ;
d2l[j, k] = d2temp + jkd2logl ;
end ;
end ;
end ;
finish loglike ;
/* -------------------------------------------------------------------*/;
eps = 1e-8 ;
diff = 1 ;
/* The following do loop stops when the increments in beta are */;
/* sufficiently small, or when number of iterations reaches 20 */;
do iter = 1 to 20 while(diff > eps) ;
run loglike(beta, p, x, y, l, dl, d2l) ;
invd2l = inv(d2l) ;
beta = beta - invd2l * dl ; /* The key Newton step ... */;
diff = max(abs(invd2l * dl)) ;
print iter l dl d2l diff beta ;
end ;
b1 = beta[1] ;
b2 = beta[2] ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
covar = -invd2l ;
llratio = - 2 * l ;
print ' -2 * loglikelihood = ' llratio ;
print 'b1 coeff, std err : ' b1 serr1 ;
print 'b2 coeff, std err : ' b2 serr2 ;
print 'covariance matrix : ' covar ;
quit ;
========================================================================
Proc logistic results ... 1
10:55 Saturday, March 18, 2000
The LOGISTIC Procedure
Data Set: WORK.SIM
Response Variable: YI
Response Levels: 2
Number of Observations: 300
Link Function: Logit
Response Profile
Ordered
Value YI Count
1 1 278
2 0 22
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 159.306 153.781 .
SC 163.010 161.188 .
-2 LOG L 157.306 149.781 7.525 with 1 DF (p=0.0061)
Score . . 7.266 with 1 DF (p=0.0070)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 1.5917 0.3766 17.8675 0.0001 . .
U 1 1.1108 0.4273 6.7592 0.0093 0.354175 3.037
Association of Predicted Probabilities and Observed Responses
Concordant = 66.7% Somers' D = 0.344
Discordant = 32.3% Gamma = 0.348
Tied = 1.0% Tau-a = 0.047
(6116 pairs) c = 0.672
program: /home/walleye/john-c/5421/imlml.sas 18MAR00 10:55
PROC IML Results ... 2
10:55 Saturday, March 18, 2000
ITER L DL D2L DIFF BETA
1 -207.9442 128 -75 -75.25 1.42466 1.42466
135.45333 -75.25 -100.5006 0.2810698
ITER L DL D2L DIFF BETA
2 -86.11339 24.433947 -39.1126 -36.7227 0.382878 1.6898849
27.881783 -36.7227 -47.38327 0.6639477
ITER L DL D2L DIFF BETA
3 -76.10077 5.4183855 -24.6185 -20.35354 0.3405272 1.6284455
7.0927236 -20.35354 -24.50093 1.0044749
ITER L DL D2L DIFF BETA
4 -74.93779 0.8082239 -20.61113 -15.14088 0.1008562 1.5935698
1.1867474 -15.14088 -17.00239 1.1053311
ITER L DL D2L DIFF BETA
5 -74.89053 0.04008 -19.90169 -14.11319 0.0054776 1.5916993
0.0585386 -14.11319 -15.50634 1.1108087
ITER L DL D2L DIFF BETA
6 -74.8904 0.0001116 -19.8646 -14.05981 0.0000151 1.5916942
0.0001618 -14.05981 -15.4291 1.1108238
ITER L DL D2L DIFF BETA
7 -74.8904 8.55E-10 -19.86449 -14.05967 1.157E-10 1.5916942
1.2388E-9 -14.05967 -15.42889 1.1108238
LLRATIO
-2 * loglikelihood = 149.78081
B1 SERR1
b1 coeff, std err : 1.5916942 0.376554
B2 SERR2
b2 coeff, std err : 1.1108238 0.4272664
COVAR
covariance matrix : 0.1417929 -0.12921
-0.12921 0.1825565
program: /home/walleye/john-c/5421/imlml.sas 18MAR00 10:55
========================================================================
The difficult part of this kind of program is writing out the expressions for
the 1st and 2nd partial derivatives. To some extent this problem cannot be avoided.
The choice of initial values for the parameter vector BETA can strongly affect
the behavior of this kind of program. In general you want to choose a value for
BETA that is close to the maximum point of the log likelihood. However, you usually
don't know where that is. In this case, (0, 0)' was chosen and the algorithm
converged quickly and satisfactorily. In other cases you may need to make an
informed guess of where the maximum might be. You may be able to do this by trying
several values for BETA and examining the corresponding values of the log
likelihood. If the number of parameters is moderate or large (> 5, say), this can
be very time-consuming.
In this case the algorithm converges in 7 iterations to the same answer as
that given by PROC LOGISTIC. Note that during the iterations,
(1) The log-likelihood L increases
(2) The derivative vector, DL, gets close to (0, 0)'.
(3) DIFF --> 0.
The program above is a very rudimentary one for this purpose. There is no
reason to use this program rather than PROC LOGISTIC. It is intended as an example
of how the algorithm can be implemented. Features which should be added include the
following:
1. A way of checking that the likelihood is actually increasing.
2. Check that the derivatives of the log-likelihood are very small
when convergence is achieved: typically, less than 10^(-5) in
absolute value.
3. A provision for what to do if the algorithm does not converge
within the prescribed number of steps.
4. A provision for 'step-halving': if a given iteration does not
actually increase the likelihood, the parameter estimates can
instead be incremented by one-half the Newton step (a code
sketch appears after this list). Step-halving can be implemented
as follows:
1) start with f = 1
2) compute beta = beta - f * inv(d2l) * dl ;
3) evaluate the log-likelihood at the new beta
4) if the loglikelihood decreases at a given iteration,
replace f by f/2 and go back to step 2.
5) keep doing this until you either do 10 step-halves, or
the loglikelihood actually increases. If after 10 step-
halves, the loglikelihood has not improved, declare
nonconvergence and exit.
The rationale for step-halving is as follows. The change in the
parameters at a given step may be in the correct direction, but
too large in magnitude to increase the log-likelihood. Step-halving
produces changes in the parameters in the same direction as that specified
by the Newton process, but of smaller magnitude.
5. Provisions for dealing with missing data must either be built
into the program, or the observations with missing data must be
deleted from the dataset before the IML procedure begins.
6. Provision to print an estimate of the standard errors of the coefficient
estimates. This can easily be added to the above program at the end,
as follows:
invd2l = inv(d2l) ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
Here serr1 will be the estimated standard error of the intercept
term, b0, and serr2 will be the estimated standard error of the
coefficient b1. The reason this works is that the covariance matrix
of the parameters in maximum likelihood estimation is asymptotically
equal to the inverse of the Fisher information. The Fisher
information is the negative of the expectation of the 2nd derivative
matrix of the log likelihood. The expectation is usually approximated
by simply evaluating the 2nd derivative matrix at the maximum
likelihood estimate itself; in this case that 2nd derivative matrix
is d2l, so -inv(d2l) approximates the covariance matrix.
Fisher information is an important idea.
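Here is a minimal IML sketch of how step-halving (item 4 above) could be
grafted onto the iteration loop of imlml.sas. It is only an illustration, not
part of the original program; it assumes the module loglike and the quantities
beta, p, x, y, eps, and diff have been set up exactly as above, and the names
step, f, improved, betanew, lold, and lnew are introduced here for convenience.
   do iter = 1 to 20 while(diff > eps) ;
      run loglike(beta, p, x, y, lold, dl, d2l) ;
      invd2l = inv(d2l) ;
      step = invd2l * dl ;                    /* full Newton increment       */;
      f = 1 ;                                 /* step-length multiplier      */;
      improved = 0 ;
      do half = 1 to 10 while(improved = 0) ;
         betanew = beta - f * step ;          /* trial update                */;
         run loglike(betanew, p, x, y, lnew, dlnew, d2lnew) ;
         if lnew > lold then improved = 1 ;   /* likelihood went up: accept  */;
         else f = f / 2 ;                     /* otherwise halve the step    */;
      end ;
      if improved = 0 then do ;
         print 'No increase after 10 step-halves: declaring nonconvergence.' ;
         diff = 0 ;                           /* force the outer loop to end */;
      end ;
      else do ;
         beta = betanew ;
         diff = max(abs(f * step)) ;
      end ;
   end ;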
-------------------------------------------------------------------------------------
Finally, why are programs like this useful, if you already have things like
PROC LOGISTIC to find coefficient estimates and other statistics?
SAS and Splus both provide a wide but still limited range of statistical
routines for regression and analysis of variance. In SAS, the primary areas are (1)
linear models, (2) a subset of generalized linear models and loglinear models (PROC
GENMOD), (3) Mixed models (PROC MIXED), (4) nonlinear least squares (PROC NLIN), and
(5) proportional hazards models for survival analysis. SAS does not have procedures
for some models in survival analysis (e.g., Gompertz or Pareto), for general forms
of Poisson regression, and for many other models. If you need to find maximum
likelihood estimates for a model not handled by a SAS or Splus procedure, you may
need to write a program something like the example above, or you may need to make
use of a statistical package like BMDP, which does include a procedure (LE) for
maximum-likelihood estimation.
=================================================================================
PROBLEM 18
1. Use a program like that given above to do the following logistic regression
problem: in the Lung Health Study dataset, assume that the probability of quitting
smoking at Year 1 is given by:
prob(quit smoke at A1) = 1 / (1 + exp(-b0 - b1*age)).
Thus, the problem is to estimate b0 and b1. Check your answer by using PROC
LOGISTIC or the Splus version of logistic regression.
2. Again, use a similar program to estimate the coefficients for the following
model:
prob(quit smoke at A1) = 1 /(1 + exp(-b0 - b1*age - b2*gender)) ;
Also estimate standard errors of the coefficients.
=================================================================================
/home/walleye/john-c/5421/notes.019 Last update: October 21, 2003
=================================================================================
SAS PROC IML SPH 5421 notes.logistic
Maximum likelihood estimation, continued
APPENDIX to notes.019:
The Logistic Model: Likelihood, Log-likelihood, Derivatives, and Maximum
Likelihood Estimation.
The logistic model is used to model the probability of an event as a function
of variables called risk factors.
Occurrence of the event (the outcome variable) is represented by a variable Y,
where Y = 1 means the event occurred and Y = 0 means it did not occur. The
probability that the event occurs is often a function of certain risk factors. For
example, if the event is death from stroke, the probability of the event within a
given fixed time interval (for example, one year) depends on the age of the person
and the person's systolic blood pressure. Older people are at higher risk and
people with high SBP are at higher risk.
For the stroke-death example, the logistic model is
prob(Y = 1 | X1 = age, X2 = SBP) = 1 / (1 + exp(-b0 - b1*X1 - b2*X2))
Here b0, b1, and b2 are coefficients that are usually unknown. The typical
problem for a given study is to collect data and use a program to estimate the
values of b0, b1, and b2. Note that the logistic probability is always a number
between 0 and 1.
Also note that if, for example, b2 is positive, then an increase in X2 will
produce an increase in the probability that Y = 1. Conversely, if b2 is negative,
then an increase in X2 will cause a decrease in the probability that Y = 1.
It is easy to show that
prob(Y = 0 | X1, X2) = 1 - prob(Y = 1 | X1, X2)
= exp(-b0 - b1*X1 - b2*X2)/ (1 + exp(-b0 - b1*X1 - b2*X2))
Assume you are carrying out an observational study in which you determine each
participant's age and systolic blood pressure at a baseline visit, and then observe
what happens to the person during a one-year followup period.
Suppose a person is 85 years of age and has SBP = 160, and that he dies of
stroke within the one-year period of observation. The likelihood for this
observation for this single individual is:
L1 = 1 / (1 + exp(-b0 - b1*85 - b2*160)).
On the other hand, if he does NOT die of stroke within a year, the outcome
Y is set to zero, and the likelihood for the observation is
L0 = 1 - L1 = exp(-b0 - b1*85 - b2*160) / (1 + exp(-b0 - b1*85 - b2*160))
It is possible to write the likelihood as a single expression which is
a function of Y, X1, and X2:
L = Y / (1 + exp(-b0 - b1*X1 - b2*X2)) +
(1 - Y) * exp(-b0 - b1*X1 - b2*X2) / (1 + exp(-b0 - b1*X1 - b2*X2))
(Note that if Y = 1, the second part of the expression is 0, while if Y = 0,
the first part of the expression is 0.)
If you knew what the values of b0, b1, and b2 were, you could explicitly
calculate what the person's risk of an event would be. The occurrence or
non-occurrence of the event for a single individual can be viewed as a Bernoulli
random variable.
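For example, if the coefficients were known to be, say, b0 = -10, b1 = 0.05, and
b2 = 0.03 (purely hypothetical values, chosen only for illustration), a short DATA
step computes this person's risk and the likelihood contribution for either outcome:
   data one_person ;
      b0 = -10 ; b1 = 0.05 ; b2 = 0.03 ;         /* hypothetical coefficients */;
      x1 = 85 ; x2 = 160 ;                       /* age and SBP               */;
      p = 1 / (1 + exp(-b0 - b1*x1 - b2*x2)) ;   /* prob(Y = 1)               */;
      do y = 0, 1 ;
         like = y * p + (1 - y) * (1 - p) ;      /* L0 when y = 0, L1 when y = 1 */;
         output ;
      end ;
   run ;
   proc print data = one_person ; run ;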
The likelihood of the entire set of observations from the whole study is a
product of the individual likelihoods. Different individuals have different values
of age and SBP, and possibly different outcomes. Thus the total likelihood of all
the observations could be written
L = product{Yi / (1 + exp(-b0 - b1*X1i - b2*X2i)) +
(1 - Yi) * exp(-b0 - b1*X1i - b2*X2i) / (1 + exp(-b0 - b1*X1i - b2*X2i))},
where the i subscript refers to the i-th person in the study, and the product is
taken across all the observations, i = 1, 2, ..., n.
The main question in a typical study is, what coefficients 'best' fit the
observed data? Note that the likelihood is a function both of the observed data
(Yi, X1i, X2i) and of the unknown coefficients b0, b1, and b2. The observed data
are fixed constants, and there are typically many data points and measurements
making them up. If you have a study involving 1000 people with one outcome variable Y
and two covariates X1 and X2 for each person, the number of observed data values is
3000. On the other hand, there are only 3 unknown coefficients. Thus, in a sense,
the likelihood is a function of only 3 variables: b0, b1, and b2.
The goal of maximum likelihood estimation is to find the values of the unknown
coefficients which maximize the likelihood. The problem of finding maximum
likelihood estimates, therefore, is just a problem of maximizing a nonlinear
function with a relatively small number of unknowns. It is a very complicated
function, but, in the case of the stroke-death study, only 3 unknown coefficients
need to be estimated.
Again, the likelihood is a product of probabilities. Many of the probabilities
will be small, and of course none are larger than 1 (or smaller than 0). In most
studies, for realistic choices of the coefficients b0, b1, and b2, the likelihood is
a very small number, perhaps on the order of e^(-1000). This means that it is very
hard to do accurate computations. It is much easier computationally to deal with
the LOG of the likelihood. The log (natural log) function is a monotone increasing
function, so that if you find values which maximize the log of the likelihood, those
same values will also maximize the likelihood itself.
Another good thing about taking the log of the likelihood is that the PRODUCT
in the likelihood expression gets converted to a SUM. Sums are generally easier to
compute with than products.
The expression for the log-likelihood of the data set is:
l = sum{log[Yi + (1 - Yi) * exp(-Xi*b)] - log[1 + exp(-Xi*b)]},
where the sum is over i = 1, 2, ..., n. Note also that Xi*b is matrix
shorthand:
Xi = (1, X1i, X2i) and b = (b0, b1, b2)`, so that
exp(-Xi*b) = exp(-b0 - b1*X1i - b2*X2i).
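As a check on this formula, the whole sum can be evaluated in a few IML
statements, without an explicit loop over observations. This is only a sketch; it
uses the simulated dataset sim from imlml.sas, and the trial value b = (0.5, 0.5)`
is arbitrary:
   proc iml ;
      use sim ;
      read all var {one u} into x ;      /* n x p design matrix              */;
      read all var {yi} into y ;         /* n x 1 outcome vector             */;
      b = {0.5, 0.5} ;                   /* an arbitrary trial value of b    */;
      w = exp(-x * b) ;                  /* n x 1 vector of exp(-Xi*b)       */;
      l = sum(log(y + (1 - y) # w) - log(1 + w)) ;  /* the log-likelihood    */;
      print l ;
   quit ;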
The method of finding the values of b0, b1, and b2 that maximize the log
likelihood is simply the Newton method for finding the maximum point of a nonlinear
function. It is not particularly hard to write down a PROC IML routine which will
do this. The only tricky part of it is writing the correct expressions for the
first and second-order partial derivatives of the log likelihood.
Here is the first partial derivative of the log-likelihood l with respect
to b1:
dl/db1 = sum{-(1 - Yi)*X1i*exp(-Xi*b) / (Yi + (1 - Yi)*exp(-Xi*b))
+ X1i*exp(-Xi*b) / ( 1 + exp(-Xi*b))}.
The first partial derivative with respect to b2 is similar. The first partial
derivative with respect to b0 is also similar, except that where X1i occurs in the
expression above, it is replaced by 1.
The second partial derivative of l with respect to b1 (twice) is
d2l/db1b1 = sum{[(1 - Yi)*(X1i^2)*exp(-Xi*b)*(Yi + (1 - Yi)*exp(-Xi*b))
- ((1 - Yi)^2)*(X1i^2)*exp(-2*Xi*b)] / (Yi + (1 - Yi)*exp(-Xi*b))^2
+ [-(X1i^2)*exp(-Xi*b)*(1 + exp(-Xi*b))
+ (X1i^2)*exp(-2*Xi*b)] / (1 + exp(-Xi*b))^2}
The mixed second partial derivative of l with respect to b1 and b2 is
d2l/db1b2 = sum{[(1 - Yi)*(X1i*X2i)*exp(-Xi*b)*(Yi + (1 - Yi)*exp(-Xi*b))
- ((1 - Yi)^2)*(X1i*X2i)*exp(-2*Xi*b)] / (Yi + (1 - Yi)*exp(-Xi*b))^2
+ [-(X1i*X2i)*exp(-Xi*b)*(1 + exp(-Xi*b))
+ (X1i*X2i)*exp(-2*Xi*b)] / (1 + exp(-Xi*b))^2}
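These expressions simplify greatly if you write pi = 1/(1 + exp(-Xi*b)): the
i-th term of the first derivative is then just X1i*(Yi - pi), and the i-th term of
the second derivative is -X1i*X1i*pi*(1 - pi) (with -X1i*X2i*pi*(1 - pi) for the
mixed derivative). This is a standard identity for the logistic model, although it
is not the form used in imlml.sas. A minimal vectorized IML sketch, assuming the x,
y, and beta of that program (pvec is introduced here for the vector of fitted
probabilities):
      pvec = 1 / (1 + exp(-x * beta)) ;          /* n x 1 fitted probabilities */;
      dl   = x` * (y - pvec) ;                   /* p x 1 first derivatives    */;
      d2l  = -x` * (x # (pvec # (1 - pvec))) ;   /* p x p second derivatives   */;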
Most of the derivation of this awful-looking expression is just application of
the quotient rule for derivatives. You do have to be very careful about plus and
minus signs. This expression occurs in a slightly disguised form in the program
imlml.sas, in notes.019.
There are some operations in the above expression which can consume a lot of
computer time. The expression exp(-Xi*b) occurs over and over again. Recall that
Xi is a 1 x p matrix, and b is a p x 1 matrix. Every computation of Xi*b involves
some arithmetic: p multiplications and p additions. You don't see this explicitly
in IML, but it occurs and requires some computer time. Also, in terms of computer
time, functions like exponentiation are rather time-consuming. It is better to do
these things as few times as possible. That is why, in the program in
notes.019,
xibeta = Xi * beta
and
w = exp(-xibeta)
are each computed only once per observation within each iteration, rather than once
for every derivative term. After these two computations, all occurrences of
exp(-Xi*b) inside the j and k loops are replaced by w. This saves some time. There
are other places in the program where some savings in computer time could be
realized, but they are less important.
The logistic likelihood is fairly typical of likelihood expressions. In
practice more complicated likelihood expressions are often encountered. If the
Newton process of finding maximum likelihood estimates is used, some way of
estimating first and second derivatives must be available. You can try numerical
estimation, as is done in the program numml.sas in notes.020. In that example,
the first derivatives are computed analytically but the second derivatives are
estimated numerically. The process appears to work satisfactorily for a simulated
dataset. However, in general, numerical estimates of derivatives are less trustworthy
than analytic expressions. They also typically require more function evaluations,
and hence more computer time.
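The idea behind numerically estimating the second derivatives can be sketched
as follows. This is only an illustration (it is not the actual code of numml.sas);
it assumes a hypothetical IML module score(beta, x, y, dl) that returns the p x 1
vector of analytic first derivatives at beta, and it approximates each column of
the second-derivative matrix by a forward difference of that vector:
      h = 1e-5 ;                            /* finite-difference step size    */;
      run score(beta, x, y, dl0) ;          /* gradient at beta               */;
      d2l = j(p, p, 0) ;
      do k = 1 to p ;
         betak = beta ;
         betak[k] = betak[k] + h ;          /* perturb the kth parameter      */;
         run score(betak, x, y, dlk) ;      /* gradient at the perturbed beta */;
         d2l[, k] = (dlk - dl0) / h ;       /* kth column of 2nd-deriv matrix */;
      end ;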
Programs exist which can read programming code for a function and produce an
exact analytic expression for the derivative. The math package Mathematica has such
a program. It may be worthwhile to use such a program if you are faced with a very
complicated likelihood function.
VARIANCE ESTIMATION AND TESTING
Maximum likelihood estimation has several good features, first noticed by
R. A. Fisher. Maximum likelihood estimates (MLEs) are consistent (and hence
asymptotically unbiased), they are asymptotically efficient (they attain the
smallest possible asymptotic variance), and they are approximately normally
distributed in large samples. An approximate variance-covariance matrix of the
estimates is the negative inverse of the second derivative matrix evaluated at the
MLE. This is computed explicitly in the PROC IML program. The square roots of the
diagonal elements of this matrix are approximate standard errors of the coefficient
estimates, and can be used to compute confidence intervals for the true
coefficients.
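For example, continuing at the end of the IML program in notes.019, approximate
95 percent confidence intervals for the two coefficients could be printed as follows
(a sketch only; b1, b2, serr1, and serr2 are as already defined in that program, and
z, ci1, and ci2 are names introduced here):
      z = probit(0.975) ;                             /* approximately 1.96    */;
      ci1 = b1 || (b1 - z*serr1) || (b1 + z*serr1) ;  /* estimate, lower, upper*/;
      ci2 = b2 || (b2 - z*serr2) || (b2 + z*serr2) ;  /* estimate, lower, upper*/;
      print 'b1: estimate and 95% CI ' ci1 ;
      print 'b2: estimate and 95% CI ' ci2 ;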
The log-likelihood itself can be used for testing hypotheses. Suppose you
are considering two models for the probability of stroke death:
Model 1: prob(Y = 1 | age) = 1 / (1 + exp(-b0 - b1*age))
Model 2: prob(Y = 1 | age, SBP) = 1 / (1 + exp(-b0 - b1*age - b2*SBP))
You use a program like that used in notes.019 to compute MLEs for both
models. The program computes estimates of the log-likelihood itself in the course
of computing the MLEs. These two models are hierarchically related: that is,
Model 2 is obtained from Model 1 by simply adding a new risk factor (SBP).
A test for whether adding SBP to the model makes a difference can be obtained
by computing the difference in -2 * loglikelihood for the two models; that is,
D = (-2*logL[Model 1]) - (-2*logL[Model 2]).
Note that the statistic D is always nonnegative. There is a theorem which
implies that, under the null hypothesis that adding SBP to a model that already
includes age does not make a difference, the statistic D has approximately a
chi-square distribution with 1 degree of freedom. [The null hypothesis here can also
be stated as: H0: b2 = 0.] If D is large, you will tend to reject the null
hypothesis. For example, if D > 3.84, you would reject the null hypothesis at the
.05 significance level.
You need to be careful in using this test. The two models must be
hierarchically related as described above.
Note that you can use the same kind of procedure to test whether Model 1 fits
the data better than the intercept-only model (Model 0). Equivalently, this is a
test of whether b1 = 0 in Model 1.
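The computation of D and its p-value is simple once the two values of
-2 * loglikelihood are available. A sketch in IML (m2logl1 and m2logl2 are names
introduced here for the fitted values of -2 * loglikelihood under Model 1 and
Model 2):
      d = m2logl1 - m2logl2 ;                  /* likelihood ratio statistic  */;
      pvalue = 1 - probchi(d, 1) ;             /* chi-square with 1 df        */;
      print d pvalue ;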
========================================================================
SECOND APPENDIX TO notes.019:
SIMPLIFYING COMPUTATIONS
Fisher information is defined as the expectation of the
square of the first derivative of the log-likelihood,
I(B) = E{[d log(L) / dB] * [d log(L) / dB]`},
where B represents a p x 1 vector of parameters and d log(L)/dB
is the p x 1 vector of first derivatives of the log likelihood.
Note that the vector product inside the expectation is in fact a p x p
matrix of products of first derivatives.
There is a theorem which says that another formula for I(B) is
I(B) = -E{ d2 log(L) / dB2},
that is, Fisher information also equals the negative of the expectation
of the second derivative of the log likelihood. Note that the
second-derivative matrix is also a p x p matrix. A precise statement
of this theorem for one-parameter distributions can be found in various
textbooks, including Bickel and Doksum (Mathematical Statistics;
Prentice-Hall, 1977), p. 139 and Exercise 4.3.9, p. 148.
As it happens this theorem is useful for simplifying maximum
likelihood estimation programs which make use of Newton's method.
The Newton iteration step,
V1 = V0 - inv(d2logL) * d1logL,
requires the second-derivative matrix. The theorem above
implies that you might get approximately the same answer if you replace
the second derivative matrix by the negative of the matrix of
first derivative products:
V1 = V0 + inv(d1logL * d1logL`) * d1logL,
where the matrix d1logL * d1logL` is understood as the sum, over observations, of
the products of the per-observation first derivatives (this is how it is assembled
in the program below; the outer product of the total first-derivative vector with
itself would not be invertible).
If this works, it has the significant advantage that you need only
derive expressions for the first derivatives and then assemble the
appropriate products in order to carry out the iterative step in the
algorithm.
This variant of Newton's method is called either Gauss's method
or the Gauss-Newton method.
It is implemented below in a variation of the program imlml.sas
which appears on 19.1-19.3 of notes.019. The only difference in these
two programs is that in imlml.sas, the following code appears:
------------------------------------------------------------------------
/* The jkth 2nd derivative of the log likelihood for the ith obs */;
jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
-(1 - yi) * w * xij * ((1 - yi) * w * xik))/
((1 - yi) * w + yi)**2
+ (-w * xij * xik * (1 + w) + w * w * xij * xik)/
(1 + w)**2 ;
d2temp = d2l[j, k] ;
d2l[j, k] = d2temp + jkd2logl ;
------------------------------------------------------------------------
In the modified program gaussimlml.sas, this section is replaced
by:
------------------------------------------------------------------------
/* The kth 1st derivative of the log likelihood for the ith obs */;
kdlogl = -(1 - yi) * w * xik / ((1 - yi) * w + yi)
+ w * xik / (1 + w) ;
d2temp = -jdlogl * kdlogl ;
d2l[j, k] = d2l[j, k] + d2temp ;
------------------------------------------------------------------------
Note that kdlogl, the derivative of the log likelihood of the data
for the i-th observation with respect to the k-th parameter, has the
same form as that of jdlogl, except that the covariate xij is replaced
in two places by xik. Also note that d2temp is the NEGATIVE of the
product of jdlogl and kdlogl.
Here is the complete program gaussimlml.sas, followed by printout
for the same data that occurred in the earlier example for imlml.sas:
========================================================================
/* Program to compute maximum likelihood estimates .... */;
/* Based on Gauss-Newton methods - uses IML */;
footnote "program: /home/walleye/john-c/5421/gaussimlml.sas &sysdate &systime" ;
options linesize = 80 ;
data sim ;
one = 1 ;
a = 2 ;
b = 1 ;
n = 300 ;
seed = 20000214 ;
/* Simulated data from logistic distribution */;
do i = 1 to n ;
u = 2 * i / n ;
p = 1 / (1 + exp(-a - b * u)) ;
r = ranuni(seed) ;
yi = 0 ;
if r lt p then yi = 1 ;
output ;
end ;
run ;
/* Logistic analysis of simulated data ...... */;
proc logistic descending ;
model yi = u ;
title1 'Proc logistic results ... ' ;
run ;
/* Begin proc iml .............................. */;
title1 'PROC IML Results ...' ;
proc iml ;
use sim ; /* Input the simulated dataset */;
read all var {one u} into x ; /* Read the design matrix into x */;
read all var {yi} into y ; /* Read the outcome data into y */;
p = 2 ; /* Set the number of parameters */;
beta = {0.00, 0.00} ; /* Initialize the param. vector */;
/* -------------------------------------------------------------------*/;
/* Module which computes loglikelihood and derivatives ... */;
start loglike(beta, p, x, y, l, dl, d2l) ;
n = 300 ;
l = 0 ; /* Initialize log likelihood ... */;
dl = j(p, 1, 0) ; /* Initialize 1st derivatives... */;
d2l = j(p, p, 0) ; /* Initialize 2nd derivatives... */;
do i = 1 to n ;
xi = x[i,] ;
yi = y[i] ;
xibeta = xi * beta ;
w = exp(-xibeta) ;
/* The log likelihood for the i-th observation ... */;
iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
l = l + iloglike ;
do j = 1 to p ;
xij = xi[j] ;
/* The jth 1st derivative of the log likelihood for the ith obs */;
jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
+ w * xij / (1 + w) ;
dtemp = dl[j] + jdlogl ;
dl[j] = dtemp ;
do k = 1 to p ;
xik = xi[k] ;
/* The kth 1st derivative of the log likelihood for the ith obs */;
kdlogl = -(1 - yi) * w * xik / ((1 - yi) * w + yi)
+ w * xik / (1 + w) ;
d2temp = -jdlogl * kdlogl ;
d2l[j, k] = d2l[j, k] + d2temp ;
end ;
end ;
end ;
finish loglike ;
/* -------------------------------------------------------------------*/;
eps = 1e-8 ;
diff = 1 ;
/* The following do loop stops when the increments in beta are */;
/* sufficiently small, or when number of iterations reaches 20 */;
do iter = 1 to 20 while(diff > eps) ;
run loglike(beta, p, x, y, l, dl, d2l) ;
invd2l = inv(d2l) ;
beta = beta - invd2l * dl ; /* The key Newton step ... */;
diff = max(abs(invd2l * dl)) ;
print iter l dl d2l diff beta ;
end ;
b1 = beta[1] ;
b2 = beta[2] ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
covar = -invd2l ;
llratio = - 2 * l ;
print ' -2 * loglikelihood = ' llratio ;
print 'b1 coeff, std err : ' b1 serr1 ;
print 'b2 coeff, std err : ' b2 serr2 ;
print 'covariance matrix : ' covar ;
quit ;
========================================================================
Proc logistic results ... 1
16:30 Thursday, March 15, 2001
The LOGISTIC Procedure
Data Set: WORK.SIM
Response Variable: YI
Response Levels: 2
Number of Observations: 300
Link Function: Logit
Response Profile
Ordered
Value YI Count
1 1 278
2 0 22
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 159.306 153.781 .
SC 163.010 161.188 .
-2 LOG L 157.306 149.781 7.525 with 1 DF (p=0.0061)
Score . . 7.266 with 1 DF (p=0.0070)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 1.5917 0.3766 17.8675 0.0001 . .
U 1 1.1108 0.4273 6.7592 0.0093 0.354175 3.037
Association of Predicted Probabilities and Observed Responses
Concordant = 66.7% Somers' D = 0.344
Discordant = 32.3% Gamma = 0.348
Tied = 1.0% Tau-a = 0.047
(6116 pairs) c = 0.672
program: /home/walleye/john-c/5421/gaussimlml.sas 15MAR01 16:30
PROC IML Results ... 2
16:30 Thursday, March 15, 2001
ITER L DL D2L DIFF BETA
1 -207.9442 128 -75 -75.25 1.42466 1.42466
135.45333 -75.25 -100.5006 0.2810698
ITER L DL D2L DIFF BETA
2 -86.11339 24.433947 -21.98682 -16.7216 1.3524393 1.5073912
27.881783 -16.7216 -19.59303 1.6335091
ITER L DL D2L DIFF BETA
3 -76.18316 -4.486689 -19.88465 -14.20949 0.3422821 1.5263487
-5.234316 -14.20949 -16.0794 1.291227
ITER L DL D2L DIFF BETA
4 -75.01091 -1.128447 -19.82677 -14.12761 0.15291 1.5783898
-1.706508 -14.12761 -15.96837 1.138317
ITER L DL D2L DIFF BETA
5 -74.89283 -0.120079 -19.82628 -14.11533 0.0251799 1.5902601
-0.234054 -14.11533 -15.94952 1.1131371
ITER L DL D2L DIFF BETA
6 -74.89042 -0.004023 -19.82753 -14.11462 0.0021435 1.591583
-0.015512 -14.11462 -15.94839 1.1109937
ITER L DL D2L DIFF BETA
7 -74.8904 -0.00018 -19.82766 -14.11456 0.0001575 1.5916861
-0.001058 -14.11456 -15.94832 1.1108361
ITER L DL D2L DIFF BETA
8 -74.8904 -0.000013 -19.82767 -14.11456 0.0000114 1.5916936
-0.000076 -14.11456 -15.94831 1.1108247
ITER L DL D2L DIFF BETA
9 -74.8904 -9.028E-7 -19.82767 -14.11455 8.2616E-7 1.5916942
-5.518E-6 -14.11455 -15.94831 1.1108239
ITER L DL D2L DIFF BETA
10 -74.8904 -6.533E-8 -19.82767 -14.11455 5.9797E-8 1.5916942
-3.993E-7 -14.11455 -15.94831 1.1108238
ITER L DL D2L DIFF BETA
11 -74.8904 -4.728E-9 -19.82767 -14.11455 4.328E-9 1.5916942
-2.89E-8 -14.11455 -15.94831 1.1108238
PROC IML Results ... 3
16:30 Thursday, March 15, 2001
LLRATIO
-2 * loglikelihood = 149.78081
B1 SERR1
b1 coeff, std err : 1.5916942 0.3692068
B2 SERR2
b2 coeff, std err : 1.1108238 0.4116691
COVAR
covariance matrix : 0.1363137 -0.12064
-0.12064 0.1694714
program: /home/walleye/john-c/5421/gaussimlml.sas 15MAR01 16:30
========================================================================
Comparing the two printouts for imlml.sas and gaussimlml.sas
shows that they produce exactly the same parameter estimates. The
variance estimates differ slightly. The theorem which justifies the
Gauss method does not say that the observed matrices will exactly
agree; it says that they have the same expectation. In fact they
will differ markedly from each other in regions of the parameter space
which are not near the maximum likelihood estimates themselves.
The Gauss-method program takes 11 iterations to converge in this
case, while the Newton method requires only 7. The final estimates of
the D2L matrices do not agree, and, although identical parameter
estimates are obtained, the estimated standard errors also do not agree.
Which method is better?
It's not clear. The Newton second-derivative method is based
on a Taylor expansion of the first derivative. It is evaluated at
the current estimate. The Gauss method depends on the theorem noted
above,
I(B) = -E{ d2 log(L) / dB2},
where I(B) is Fisher information, defined as
I(B) = E{[d log(L) / dB] * [d log(L) / dB]`}.
In that sense, the Gauss method is an approximation, since the
expectation is not known exactly. The Gauss method may not do well
until you are close to the MLE, because the theorem above applies
only to that point. However, Taylor's theorem is itself an approximation.
It is not clear which approximation is better in general. I am slightly
inclined to think the Newton method will perform better, but it has
the disadvantage that it requires computation of second derivatives.
SAS, in the nonlinear least squares procedure PROC NLIN, makes
use of either the Newton or Gauss methods, as well as several others.
The default method for PROC NLIN is Gauss. The Newton method can
be specified as an option. Note that in the SAS implementation, if
the Newton method is used, the second derivatives are estimated
numerically, whereas the first derivatives are specified analytically.
My own experience with PROC NLIN indicates that both the Gauss and
Newton methods are not quite as reliable as a third variant, MARQUARDT.
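For reference, the fitting method in PROC NLIN is chosen with the METHOD=
option. A minimal sketch for the simulated logistic data above (note that PROC NLIN
minimizes a sum of squares rather than maximizing the likelihood, so its estimates
will not exactly equal the MLEs):
      proc nlin data = sim method = marquardt ;
         parms b0 = 0  b1 = 0 ;                   /* starting values          */;
         model yi = 1 / (1 + exp(-b0 - b1*u)) ;   /* logistic mean function   */;
      run ;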
========================================================================
PROBLEM 18A.
The following data set shows values of a variable X whose
distribution is a mixture of two normal distributions.
That is, there is a probability p1 that X is drawn from a normal
distribution with mean mu1 and standard deviation s, and a
probability 1 - p1 that X is drawn from another normal distribution
with mean mu2 and the same standard deviation s.
A histogram is shown below the dataset.
The unknown parameters are: mu1, mu2, s, and p1.
Write a program which will compute maximum likelihood
estimates of these parameters.
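A starting point: the likelihood contribution of the i-th observation under
this mixture model can be written, for example in IML, as
      li = p1 * pdf('normal', x[i], mu1, s) + (1 - p1) * pdf('normal', x[i], mu2, s) ;
where pdf('normal', ., mu, s) is the normal density with mean mu and standard
deviation s. The log-likelihood is the sum of log(li) over the 200 observations.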
Obs x
1 0.80625
2 1.03691
3 1.46767
4 0.74435
5 0.76884
6 1.15201
7 1.76863
8 1.35288
9 1.39510
10 1.38580
11 1.36411
12 1.51375
13 0.68575
14 1.11369
15 1.48374
16 1.15337
17 1.23556
18 1.11115
19 1.31487
20 1.52987
21 1.33171
22 1.45874
23 0.59281
24 1.51549
25 1.24540
26 0.82244
27 1.31432
28 0.54313
29 0.79331
30 1.92460
31 0.65663
32 1.33266
33 1.04949
34 0.97332
35 1.49776
36 1.53517
37 0.81621
38 1.48104
39 0.90330
40 1.31403
41 0.93707
42 0.85454
43 1.59543
44 1.85178
45 0.79937
46 1.91335
47 1.12369
48 1.02486
49 0.98977
50 1.28284
51 0.43247
52 0.83582
53 0.87399
54 0.95347
55 1.41779
56 0.78824
57 0.43659
58 1.47591
59 1.76129
60 1.29779
61 1.21414
62 1.30930
63 1.69770
64 1.53651
65 1.54546
66 1.62319
67 0.83417
68 0.50302
69 0.76503
70 1.06025
71 1.39400
72 1.20968
73 0.84474
74 1.10303
75 0.94593
76 1.24635
77 0.77448
78 1.22901
79 1.75974
80 0.70282
81 1.33777
82 0.87791
83 1.68290
84 0.57039
85 1.58249
86 1.26539
87 1.14392
88 1.57961
89 0.95139
90 0.86490
91 0.69971
92 1.05069
93 1.06263
94 0.83222
95 1.24652
96 0.69845
97 1.51541
98 1.25000
99 0.49979
100 1.22121
101 1.21017
102 0.73679
103 0.62642
104 1.62786
105 0.62392
106 1.68102
107 0.85341
108 0.72261
109 0.98163
110 1.02806
111 1.02553
112 1.60963
113 1.02189
114 0.89691
115 1.39085
116 0.92491
117 0.89822
118 0.96070
119 1.27105
120 1.29916
121 1.12870
122 1.20774
123 1.40016
124 1.48849
125 0.91027
126 1.51901
127 0.92196
128 1.27068
129 0.98230
130 1.20435
131 0.66172
132 1.33744
133 1.17518
134 1.33228
135 0.47799
136 1.39439
137 0.88771
138 1.28552
139 0.75350
140 0.81856
141 0.64155
142 1.08935
143 1.76052
144 0.77502
145 0.65953
146 0.87828
147 1.34773
148 0.93423
149 0.83391
150 0.97298
151 0.69556
152 0.57002
153 0.82694
154 1.27763
155 0.57893
156 1.41407
157 0.84335
158 1.29575
159 1.22293
160 1.15268
161 1.24789
162 0.77514
163 1.26712
164 1.74086
165 1.65762
166 0.70038
167 0.61442
168 0.87938
169 1.40703
170 1.48931
171 1.02206
172 1.46202
173 1.72561
174 0.80142
175 1.43894
176 0.60786
177 0.93005
178 1.98187
179 1.31041
180 1.56395
181 1.07366
182 1.04669
183 0.64663
184 0.60105
185 0.59709
186 0.59587
187 1.19483
188 1.42432
189 0.76585
190 0.88321
191 1.42488
192 0.73576
193 1.62198
194 0.68361
195 0.87568
196 0.44357
197 1.39430
198 0.95161
199 0.75769
200 1.07720
Problem 18A, Histogram ... 2
Use of PROC CHART
18:27 Wednesday, November 15, 2006
Frequency
| ****
| **** ****
30 + **** ****
| **** **** ****
| **** **** **** ****
| **** **** **** ****
| **** **** **** **** ****
20 + **** **** **** **** **** ****
| **** **** **** **** **** **** ****
| **** **** **** **** **** **** ****
| **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** ****
10 + **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** **** **** ****
| **** **** **** **** **** **** **** **** **** **** ****
---------------------------------------------------------------------
0.45 0.60 0.75 0.90 1.05 1.20 1.35 1.50 1.65 1.80 1.95
x Midpoint
========================================================================
~john-c/5421/notes.logistic Last update: November 15, 2010