VARIANCE ESTIMATION SPH 5421 notes.028
Variance-covariance matrices for maximum-likelihood estimates of parameters are
useful for constructing confidence regions and performing tests. SAS and Splus
linear and nonlinear regression routines all provide ways to estimate the asymptotic
covariance matrix for covariates. For maximum likelihood problems in general, the
covariance matrix is the inverse of the 'Fisher Information' matrix.
Fisher information is the expectation of the negative of the matrix of second
partial derivatives of the log likelihood. Let l = log(L; B), where L is the
likelihood of an observed data set and B is a vector of parameters. Then
Fisher Information = E{dl/dbi * dl/dbj}.
The expectation is taken across all the possible realizations of the data.
There is a theorem which says
[1] -E{d2l/dbidbj} = E{dl/dbi * dl/dbj}.
(See, e.g., P. Bickel & K. Doksum, Mathematical Statistics: Basic Ideas and
Selected Topics, Holden-Day, Inc., San Francisco, 1977, p 139.)
Thus Fisher information can be expressed in terms of the expectation of a
matrix of second derivatives.
The problem with Fisher information is that the expectation is very difficult
to calculate. What is done in fact is to evaluate the matrix of second derivatives
at the MLE itself. This works well if you have exact expressions for the second
derivative matrix. Numerical approximation of the second derivative matrix is possible,
(as described below) but in general it is subject to numerical instability because
of small denominators.
An approach using a numerical estimate of the second derivative is implemented
in the program below. In this program, which was discussed originally in notes.019, maximum
likelihood was used to approximate coefficients b0, and b1 for a
logistic risk function. The covariance matrix was estimated using
-inv(d2l/dbidbj)
evaluated analytically at the maximum likelihood estimates themselves.
Numerical approximation of the second derivatives was added to the latter part
of this program. Here is the underlying idea. Let l(b0, b1)
represent the log likelihood, where b0 and b1 are the MLEs.
Then the various second derivatives of the log likelihood may be approximated as follows:
d2l(b0, b1)/db02 = {l(b0 + h, b1) - 2l(b0, b1) + l(b0 - h, b1)} / (h^2)
d2l(b0, b1)/db12 = {l(b0, b1 + h) - 2l(b0, b1) + l(b0, b1 - h)} / (h^2)
d2l(b0, b1)/db1db2 = {l(b0 + h, b1 + h) - l(b0 + h, b1 - h) - l(b0 - h, b1 + h) + l(b0 - h, b1 - h)} / (4*h^2)
These estimates are computed in the following section of the program:
h = 1e-6 ;
beta1ph = beta ;
beta1ph[1] = beta[1] + h ;
beta1mh = beta ;
beta1mh[1] = beta[1] - h ;
beta2ph = beta ;
beta2ph[2] = beta[2] + h ;
beta2mh = beta ;
beta2mh[2] = beta[2] - h ;
beta12phph = beta ;
beta12phph[1] = beta[1] + h ;
beta12phph[2] = beta[2] + h ;
beta12phmh = beta ;
beta12phmh[1] = beta[1] + h ;
beta12phmh[2] = beta[2] - h ;
beta12mhph = beta ;
beta12mhph[1] = beta[1] - h ;
beta12mhph[2] = beta[2] + h ;
beta12mhmh = beta ;
beta12mhmh[1] = beta[1] - h ;
beta12mhmh[2] = beta[2] - h ;
run loglikeonly(beta, p, x, y, l20h0h) ;
run loglikeonly(beta1ph, p, x, y, l2ph0h) ;
run loglikeonly(beta1mh, p, x, y, l2mh0h) ;
run loglikeonly(beta2ph, p, x, y, l20hph) ;
run loglikeonly(beta2mh, p, x, y, l20hmh) ;
run loglikeonly(beta12phph, p, x, y, l2phph) ;
run loglikeonly(beta12phmh, p, x, y, l2phmh) ;
run loglikeonly(beta12mhph, p, x, y, l2mhph) ;
run loglikeonly(beta12mhmh, p, x, y, l2mhmh) ;
d2l11 = (l2ph0h - 2 * l20h0h + l2mh0h) / h**2 ;
d2l22 = (l20hph - 2 * l20h0h + l20hmh) / h**2 ;
d2l12 = (l2phph - l2phmh - l2mhph + l2mhmh) / (4*h*h) ;
d2lnumer = j(2, 2, 0) ;
d2lnumer[1, 1] = -d2l11 ;
d2lnumer[1, 2] = -d2l12 ;
d2lnumer[2, 1] = -d2l12 ;
d2lnumer[2, 2] = -d2l22 ;
covarnumer = inv(d2lnumer) ;
print "Current value of beta: " beta ;
print "Numerical Estimate of -Fisher Information: " d2lnumer ;
print "Numerical Estimate of Covariance Matrix: " covarnumer ;
beta1 = beta[1] ;
beta2 = beta[2] ;
serr1 = sqrt(covarnumer[1, 1]) ;
serr2 = sqrt(covarnumer[2, 2]) ;
print "Numerical: beta1, serr1 = " beta1 serr1 ;
print "Numerical: beta2, serr2 = " beta2 serr2 ;
----------------------------------------------------------------------------------
Note that in this case, h was set equal to 10^(-6). This is about as small as it is
safe to use with the computer's 64-bit arithmetic, because in the expressions above,
h is squared, meaning that the denominators are of the order of 10^(-12).
/* Program to compute maximum likelihood estimates .... */;
/* Based on Newton-Raphson methods - uses IML */;
footnote "program: /home/walleye/john-c/5421/imlml.fisher.sas &sysdate &systime" ;
options linesize = 80 ;
data sim ;
one = 1 ;
a = 2 ;
b = 1 ;
n = 300 ;
seed = 20000214 ;
/* Simulated data from logistic distribution */;
do i = 1 to n ;
u = 2 * i / n ;
p = 1 / (1 + exp(-a - b * u)) ;
r = ranuni(seed) ;
yi = 0 ;
if r lt p then yi = 1 ;
output ;
end ;
run ;
/* Logistic analysis of simulated data ...... */;
proc logistic descending ;
model yi = u ;
title1 'Proc logistic results ... ' ;
run ;
/* Begin proc iml .............................. */;
title1 'PROC IML Results ...' ;
proc iml ;
use sim ; /* Input the simulated dataset */;
read all var {one u} into x ; /* Read the design matrix into x */;
read all var {yi} into y ; /* Read the outcome data into y */;
p = 2 ; /* Set the number of parameters */;
beta = {0.00, 0.00} ; /* Initialize the param. vector */;
/* -------------------------------------------------------------------*/;
/* Module which computes loglikelihood and derivatives ... */;
start loglike(beta, p, x, y, l, dl, d2l) ;
n = 300 ;
l = 0 ; /* Initialize log likelihood ... */;
dl = j(p, 1, 0) ; /* Initialize 1st derivatives... */;
d2l = j(p, p, 0) ; /* Initialize 2nd derivatives... */;
do i = 1 to n ;
xi = x[i,] ;
yi = y[i] ;
xibeta = xi * beta ;
w = exp(-xibeta) ;
/* The log likelihood for the i-th observation ... */;
iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
l = l + iloglike ;
do j = 1 to p ;
xij = xi[j] ;
/* The jth 1st derivative of the log likelihood for the ith obs */;
jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi)
+ w * xij / (1 + w) ;
dtemp = dl[j] + jdlogl ;
dl[j] = dtemp ;
do k = 1 to p ;
xik = xi[k] ;
/* The jkth 2nd derivative of the log likelihood for the ith obs */;
jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi)
-(1 - yi) * w * xij * ((1 - yi) * w * xik))/
((1 - yi) * w + yi)**2
+ (-w * xij * xik * (1 + w) + w * w * xij * xik)/
(1 + w)**2 ;
d2temp = d2l[j, k] ;
d2l[j, k] = d2temp + jkd2logl ;
end ;
end ;
end ;
finish loglike ;
/* -------------------------------------------------------------------*/;
start loglikeonly(beta, p, x, y, l) ;
n = 300 ;
l = 0 ; /* Initialize log likelihood ... */;
do i = 1 to n ;
xi = x[i,] ;
yi = y[i] ;
xibeta = xi * beta ;
w = exp(-xibeta) ;
/* The log likelihood for the i-th observation ... */;
iloglike = log((1 - yi) * w + yi) - log(1 + w) ;
l = l + iloglike ;
end ;
finish loglikeonly ;
/* -------------------------------------------------------------------*/;
eps = 1e-8 ;
diff = 1 ;
/* The following do loop stops when the increments in beta are */;
/* sufficiently small, or when number of iterations reaches 20 */;
do iter = 1 to 20 while(diff > eps) ;
run loglike(beta, p, x, y, l, dl, d2l) ;
invd2l = inv(d2l) ;
beta = beta - invd2l * dl ; /* The key Newton step ... */;
diff = max(abs(invd2l * dl)) ;
print iter l dl d2l diff beta ;
end ;
b1 = beta[1] ;
b2 = beta[2] ;
serr1 = sqrt(-invd2l[1, 1]) ;
serr2 = sqrt(-invd2l[2, 2]) ;
covar = -invd2l ;
llratio = - 2 * l ;
print ' -2 * loglikelihood = ' llratio ;
print 'b1 coeff, std err : ' b1 serr1 ;
print 'b2 coeff, std err : ' b2 serr2 ;
print 'covariance matrix : ' covar ;
h = 1e-6 ;
beta1ph = beta ;
beta1ph[1] = beta[1] + h ;
beta1mh = beta ;
beta1mh[1] = beta[1] - h ;
beta2ph = beta ;
beta2ph[2] = beta[2] + h ;
beta2mh = beta ;
beta2mh[2] = beta[2] - h ;
beta12phph = beta ;
beta12phph[1] = beta[1] + h ;
beta12phph[2] = beta[2] + h ;
beta12phmh = beta ;
beta12phmh[1] = beta[1] + h ;
beta12phmh[2] = beta[2] - h ;
beta12mhph = beta ;
beta12mhph[1] = beta[1] - h ;
beta12mhph[2] = beta[2] + h ;
beta12mhmh = beta ;
beta12mhmh[1] = beta[1] - h ;
beta12mhmh[2] = beta[2] - h ;
run loglikeonly(beta, p, x, y, l20h0h) ;
run loglikeonly(beta1ph, p, x, y, l2ph0h) ;
run loglikeonly(beta1mh, p, x, y, l2mh0h) ;
run loglikeonly(beta2ph, p, x, y, l20hph) ;
run loglikeonly(beta2mh, p, x, y, l20hmh) ;
run loglikeonly(beta12phph, p, x, y, l2phph) ;
run loglikeonly(beta12phmh, p, x, y, l2phmh) ;
run loglikeonly(beta12mhph, p, x, y, l2mhph) ;
run loglikeonly(beta12mhmh, p, x, y, l2mhmh) ;
d2l11 = (l2ph0h - 2 * l20h0h + l2mh0h) / h**2 ;
d2l22 = (l20hph - 2 * l20h0h + l20hmh) / h**2 ;
d2l12 = (l2phph - l2phmh - l2mhph + l2mhmh) / (4*h*h) ;
d2lnumer = j(2, 2, 0) ;
d2lnumer[1, 1] = -d2l11 ;
d2lnumer[1, 2] = -d2l12 ;
d2lnumer[2, 1] = -d2l12 ;
d2lnumer[2, 2] = -d2l22 ;
covarnumer = inv(d2lnumer) ;
print "Current value of beta: " beta ;
print "Numerical Estimate of -Fisher Information: " d2lnumer ;
print "Numerical Estimate of Covariance Matrix: " covarnumer ;
beta1 = beta[1] ;
beta2 = beta[2] ;
serr1 = sqrt(covarnumer[1, 1]) ;
serr2 = sqrt(covarnumer[2, 2]) ;
print "Numerical: beta1, serr1 = " beta1 serr1 ;
print "Numerical: beta2, serr2 = " beta2 serr2 ;
quit ;
--------------------------------------------------------------------------------------------------
Proc logistic results ... 1
17:34 Wednesday, November 30, 2011
The LOGISTIC Procedure
Model Information
Data Set WORK.SIM
Response Variable yi
Number of Response Levels 2
Number of Observations 300
Model binary logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value yi Frequency
1 1 278
2 0 22
Probability modeled is yi=1.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 159.306 153.781
SC 163.010 161.188
-2 Log L 157.306 149.781
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 7.5255 1 0.0061
Score 7.2657 1 0.0070
Wald 6.7583 1 0.0093
program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34
Proc logistic results ... 2
17:34 Wednesday, November 30, 2011
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 1.5917 0.3766 17.8686 <.0001
u 1 1.1107 0.4272 6.7583 0.0093
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
u 3.036 1.314 7.015
Association of Predicted Probabilities and Observed Responses
Percent Concordant 66.7 Somers' D 0.344
Percent Discordant 32.3 Gamma 0.348
Percent Tied 1.0 Tau-a 0.047
Pairs 6116 c 0.672
Estimated Covariance Matrix
Variable Intercept u
Intercept 0.14179 -0.1292
u -0.1292 0.18254
program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34
PROC IML Results ... 3
17:34 Wednesday, November 30, 2011
ITER L DL D2L DIFF BETA
1 -207.9442 128 -75 -75.25 1.42466 1.42466
135.45333 -75.25 -100.5006 0.2810698
ITER L DL D2L DIFF BETA
2 -86.11339 24.433947 -39.1126 -36.7227 0.382878 1.6898849
27.881783 -36.7227 -47.38327 0.6639477
ITER L DL D2L DIFF BETA
3 -76.10077 5.4183855 -24.6185 -20.35354 0.3405272 1.6284455
7.0927236 -20.35354 -24.50093 1.0044749
ITER L DL D2L DIFF BETA
4 -74.93779 0.8082239 -20.61113 -15.14088 0.1008562 1.5935698
1.1867474 -15.14088 -17.00239 1.1053311
ITER L DL D2L DIFF BETA
5 -74.89053 0.04008 -19.90169 -14.11319 0.0054776 1.5916993
0.0585386 -14.11319 -15.50634 1.1108087
ITER L DL D2L DIFF BETA
6 -74.8904 0.0001116 -19.8646 -14.05981 0.0000151 1.5916942
0.0001618 -14.05981 -15.4291 1.1108238
ITER L DL D2L DIFF BETA
7 -74.8904 8.55E-10 -19.86449 -14.05967 1.157E-10 1.5916942
1.2388E-9 -14.05967 -15.42889 1.1108238
LLRATIO
-2 * loglikelihood = 149.78081
B1 SERR1
b1 coeff, std err : 1.5916942 0.376554
program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34
PROC IML Results ... 4
17:34 Wednesday, November 30, 2011
B2 SERR2
b2 coeff, std err : 1.1108238 0.4272664
COVAR
covariance matrix : 0.1417929 -0.12921
-0.12921 0.1825565
BETA
Current value of beta: 1.5916942
1.1108238
D2LNUMER
Numerical Estimate of -Fisher Information: 19.753088 14.061641
14.061641 15.319301
COVARNUMER
Numerical Estimate of Covariance Matrix: 0.1460737 -0.134082
-0.134082 0.1883511
BETA1 SERR1
Numerical: beta1, serr1 = 1.5916942 0.3821959
BETA2 SERR2
Numerical: beta2, serr2 = 1.1108238 0.4339943
--------------------------------------------------------------------------------------------------
Here is a comparison of the resulting covariance matrices and resulting standard errors of
parameter estimates:
Estimates of the Covariance Matrix
From PROC LOGISTIC From -d2l / db2: From Numerical Approx
of -d2l/db2
-------------------------------------- ---------------------- ---------------------
Variable Intercept u Intercept u Intercept u
Intercept 0.14179 -0.1292 0.14179 -0.1291 0.1461 -0.1341
u -0.1292 0.18254 -0.1292 0.1856 -0.1341 0.1856
----------------------- ---------------------- -----------------------
Std Err Intcpt: 0.3766 0.3766 0.3822
Std Err b1 (u): 0.4272 0.4273 0.4340
Conclusion: In this case numerical approximation of 2nd derivatives worked reasonably
well, but it does not agree exactly with the MLE computation based on the analytic
second derivative matrix.
This method has the advantage that it can be used even in cases where the second
derivative is not used in the estimating program: e.g., when the EM algorithm is used.
All that is needed is the ability to compute the log likelihood of the observed data.
program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34
==================================================================================
/home/walleye/john-c/5421/notes.028 Last update: November 30, 2011.