VARIANCE ESTIMATION SPH 5421 notes.028 Variance-covariance matrices for maximum-likelihood estimates of parameters are useful for constructing confidence regions and performing tests. SAS and Splus linear and nonlinear regression routines all provide ways to estimate the asymptotic covariance matrix for covariates. For maximum likelihood problems in general, the covariance matrix is the inverse of the 'Fisher Information' matrix. Fisher information is the expectation of the negative of the matrix of second partial derivatives of the log likelihood. Let l = log(L; B), where L is the likelihood of an observed data set and B is a vector of parameters. Then Fisher Information = E{dl/dbi * dl/dbj}. The expectation is taken across all the possible realizations of the data. There is a theorem which says [1] -E{d2l/dbidbj} = E{dl/dbi * dl/dbj}. (See, e.g., P. Bickel & K. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, Inc., San Francisco, 1977, p 139.) Thus Fisher information can be expressed in terms of the expectation of a matrix of second derivatives. The problem with Fisher information is that the expectation is very difficult to calculate. What is done in fact is to evaluate the matrix of second derivatives at the MLE itself. This works well if you have exact expressions for the second derivative matrix. Numerical approximation of the second derivative matrix is possible, (as described below) but in general it is subject to numerical instability because of small denominators. An approach using a numerical estimate of the second derivative is implemented in the program below. In this program, which was discussed originally in notes.019, maximum likelihood was used to approximate coefficients b0, and b1 for a logistic risk function. The covariance matrix was estimated using -inv(d2l/dbidbj) evaluated analytically at the maximum likelihood estimates themselves. Numerical approximation of the second derivatives was added to the latter part of this program. Here is the underlying idea. Let l(b0, b1) represent the log likelihood, where b0 and b1 are the MLEs. Then the various second derivatives of the log likelihood may be approximated as follows: d2l(b0, b1)/db02 = {l(b0 + h, b1) - 2l(b0, b1) + l(b0 - h, b1)} / (h^2) d2l(b0, b1)/db12 = {l(b0, b1 + h) - 2l(b0, b1) + l(b0, b1 - h)} / (h^2) d2l(b0, b1)/db1db2 = {l(b0 + h, b1 + h) - l(b0 + h, b1 - h) - l(b0 - h, b1 + h) + l(b0 - h, b1 - h)} / (4*h^2) These estimates are computed in the following section of the program: h = 1e-6 ; beta1ph = beta ; beta1ph[1] = beta[1] + h ; beta1mh = beta ; beta1mh[1] = beta[1] - h ; beta2ph = beta ; beta2ph[2] = beta[2] + h ; beta2mh = beta ; beta2mh[2] = beta[2] - h ; beta12phph = beta ; beta12phph[1] = beta[1] + h ; beta12phph[2] = beta[2] + h ; beta12phmh = beta ; beta12phmh[1] = beta[1] + h ; beta12phmh[2] = beta[2] - h ; beta12mhph = beta ; beta12mhph[1] = beta[1] - h ; beta12mhph[2] = beta[2] + h ; beta12mhmh = beta ; beta12mhmh[1] = beta[1] - h ; beta12mhmh[2] = beta[2] - h ; run loglikeonly(beta, p, x, y, l20h0h) ; run loglikeonly(beta1ph, p, x, y, l2ph0h) ; run loglikeonly(beta1mh, p, x, y, l2mh0h) ; run loglikeonly(beta2ph, p, x, y, l20hph) ; run loglikeonly(beta2mh, p, x, y, l20hmh) ; run loglikeonly(beta12phph, p, x, y, l2phph) ; run loglikeonly(beta12phmh, p, x, y, l2phmh) ; run loglikeonly(beta12mhph, p, x, y, l2mhph) ; run loglikeonly(beta12mhmh, p, x, y, l2mhmh) ; d2l11 = (l2ph0h - 2 * l20h0h + l2mh0h) / h**2 ; d2l22 = (l20hph - 2 * l20h0h + l20hmh) / h**2 ; d2l12 = (l2phph - l2phmh - l2mhph + l2mhmh) / (4*h*h) ; d2lnumer = j(2, 2, 0) ; d2lnumer[1, 1] = -d2l11 ; d2lnumer[1, 2] = -d2l12 ; d2lnumer[2, 1] = -d2l12 ; d2lnumer[2, 2] = -d2l22 ; covarnumer = inv(d2lnumer) ; print "Current value of beta: " beta ; print "Numerical Estimate of -Fisher Information: " d2lnumer ; print "Numerical Estimate of Covariance Matrix: " covarnumer ; beta1 = beta[1] ; beta2 = beta[2] ; serr1 = sqrt(covarnumer[1, 1]) ; serr2 = sqrt(covarnumer[2, 2]) ; print "Numerical: beta1, serr1 = " beta1 serr1 ; print "Numerical: beta2, serr2 = " beta2 serr2 ; ---------------------------------------------------------------------------------- Note that in this case, h was set equal to 10^(-6). This is about as small as it is safe to use with the computer's 64-bit arithmetic, because in the expressions above, h is squared, meaning that the denominators are of the order of 10^(-12). /* Program to compute maximum likelihood estimates .... */; /* Based on Newton-Raphson methods - uses IML */; footnote "program: /home/walleye/john-c/5421/imlml.fisher.sas &sysdate &systime" ; options linesize = 80 ; data sim ; one = 1 ; a = 2 ; b = 1 ; n = 300 ; seed = 20000214 ; /* Simulated data from logistic distribution */; do i = 1 to n ; u = 2 * i / n ; p = 1 / (1 + exp(-a - b * u)) ; r = ranuni(seed) ; yi = 0 ; if r lt p then yi = 1 ; output ; end ; run ; /* Logistic analysis of simulated data ...... */; proc logistic descending ; model yi = u ; title1 'Proc logistic results ... ' ; run ; /* Begin proc iml .............................. */; title1 'PROC IML Results ...' ; proc iml ; use sim ; /* Input the simulated dataset */; read all var {one u} into x ; /* Read the design matrix into x */; read all var {yi} into y ; /* Read the outcome data into y */; p = 2 ; /* Set the number of parameters */; beta = {0.00, 0.00} ; /* Initialize the param. vector */; /* -------------------------------------------------------------------*/; /* Module which computes loglikelihood and derivatives ... */; start loglike(beta, p, x, y, l, dl, d2l) ; n = 300 ; l = 0 ; /* Initialize log likelihood ... */; dl = j(p, 1, 0) ; /* Initialize 1st derivatives... */; d2l = j(p, p, 0) ; /* Initialize 2nd derivatives... */; do i = 1 to n ; xi = x[i,] ; yi = y[i] ; xibeta = xi * beta ; w = exp(-xibeta) ; /* The log likelihood for the i-th observation ... */; iloglike = log((1 - yi) * w + yi) - log(1 + w) ; l = l + iloglike ; do j = 1 to p ; xij = xi[j] ; /* The jth 1st derivative of the log likelihood for the ith obs */; jdlogl = -(1 - yi) * w * xij / ((1 - yi) * w + yi) + w * xij / (1 + w) ; dtemp = dl[j] + jdlogl ; dl[j] = dtemp ; do k = 1 to p ; xik = xi[k] ; /* The jkth 2nd derivative of the log likelihood for the ith obs */; jkd2logl = ((1 - yi) * w * xij * xik * ((1 - yi) * w + yi) -(1 - yi) * w * xij * ((1 - yi) * w * xik))/ ((1 - yi) * w + yi)**2 + (-w * xij * xik * (1 + w) + w * w * xij * xik)/ (1 + w)**2 ; d2temp = d2l[j, k] ; d2l[j, k] = d2temp + jkd2logl ; end ; end ; end ; finish loglike ; /* -------------------------------------------------------------------*/; start loglikeonly(beta, p, x, y, l) ; n = 300 ; l = 0 ; /* Initialize log likelihood ... */; do i = 1 to n ; xi = x[i,] ; yi = y[i] ; xibeta = xi * beta ; w = exp(-xibeta) ; /* The log likelihood for the i-th observation ... */; iloglike = log((1 - yi) * w + yi) - log(1 + w) ; l = l + iloglike ; end ; finish loglikeonly ; /* -------------------------------------------------------------------*/; eps = 1e-8 ; diff = 1 ; /* The following do loop stops when the increments in beta are */; /* sufficiently small, or when number of iterations reaches 20 */; do iter = 1 to 20 while(diff > eps) ; run loglike(beta, p, x, y, l, dl, d2l) ; invd2l = inv(d2l) ; beta = beta - invd2l * dl ; /* The key Newton step ... */; diff = max(abs(invd2l * dl)) ; print iter l dl d2l diff beta ; end ; b1 = beta[1] ; b2 = beta[2] ; serr1 = sqrt(-invd2l[1, 1]) ; serr2 = sqrt(-invd2l[2, 2]) ; covar = -invd2l ; llratio = - 2 * l ; print ' -2 * loglikelihood = ' llratio ; print 'b1 coeff, std err : ' b1 serr1 ; print 'b2 coeff, std err : ' b2 serr2 ; print 'covariance matrix : ' covar ; h = 1e-6 ; beta1ph = beta ; beta1ph[1] = beta[1] + h ; beta1mh = beta ; beta1mh[1] = beta[1] - h ; beta2ph = beta ; beta2ph[2] = beta[2] + h ; beta2mh = beta ; beta2mh[2] = beta[2] - h ; beta12phph = beta ; beta12phph[1] = beta[1] + h ; beta12phph[2] = beta[2] + h ; beta12phmh = beta ; beta12phmh[1] = beta[1] + h ; beta12phmh[2] = beta[2] - h ; beta12mhph = beta ; beta12mhph[1] = beta[1] - h ; beta12mhph[2] = beta[2] + h ; beta12mhmh = beta ; beta12mhmh[1] = beta[1] - h ; beta12mhmh[2] = beta[2] - h ; run loglikeonly(beta, p, x, y, l20h0h) ; run loglikeonly(beta1ph, p, x, y, l2ph0h) ; run loglikeonly(beta1mh, p, x, y, l2mh0h) ; run loglikeonly(beta2ph, p, x, y, l20hph) ; run loglikeonly(beta2mh, p, x, y, l20hmh) ; run loglikeonly(beta12phph, p, x, y, l2phph) ; run loglikeonly(beta12phmh, p, x, y, l2phmh) ; run loglikeonly(beta12mhph, p, x, y, l2mhph) ; run loglikeonly(beta12mhmh, p, x, y, l2mhmh) ; d2l11 = (l2ph0h - 2 * l20h0h + l2mh0h) / h**2 ; d2l22 = (l20hph - 2 * l20h0h + l20hmh) / h**2 ; d2l12 = (l2phph - l2phmh - l2mhph + l2mhmh) / (4*h*h) ; d2lnumer = j(2, 2, 0) ; d2lnumer[1, 1] = -d2l11 ; d2lnumer[1, 2] = -d2l12 ; d2lnumer[2, 1] = -d2l12 ; d2lnumer[2, 2] = -d2l22 ; covarnumer = inv(d2lnumer) ; print "Current value of beta: " beta ; print "Numerical Estimate of -Fisher Information: " d2lnumer ; print "Numerical Estimate of Covariance Matrix: " covarnumer ; beta1 = beta[1] ; beta2 = beta[2] ; serr1 = sqrt(covarnumer[1, 1]) ; serr2 = sqrt(covarnumer[2, 2]) ; print "Numerical: beta1, serr1 = " beta1 serr1 ; print "Numerical: beta2, serr2 = " beta2 serr2 ; quit ; -------------------------------------------------------------------------------------------------- Proc logistic results ... 1 17:34 Wednesday, November 30, 2011 The LOGISTIC Procedure Model Information Data Set WORK.SIM Response Variable yi Number of Response Levels 2 Number of Observations 300 Model binary logit Optimization Technique Fisher's scoring Response Profile Ordered Total Value yi Frequency 1 1 278 2 0 22 Probability modeled is yi=1. Model Convergence Status Convergence criterion (GCONV=1E-8) satisfied. Model Fit Statistics Intercept Intercept and Criterion Only Covariates AIC 159.306 153.781 SC 163.010 161.188 -2 Log L 157.306 149.781 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 7.5255 1 0.0061 Score 7.2657 1 0.0070 Wald 6.7583 1 0.0093 program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34 Proc logistic results ... 2 17:34 Wednesday, November 30, 2011 The LOGISTIC Procedure Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 1.5917 0.3766 17.8686 <.0001 u 1 1.1107 0.4272 6.7583 0.0093 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits u 3.036 1.314 7.015 Association of Predicted Probabilities and Observed Responses Percent Concordant 66.7 Somers' D 0.344 Percent Discordant 32.3 Gamma 0.348 Percent Tied 1.0 Tau-a 0.047 Pairs 6116 c 0.672 Estimated Covariance Matrix Variable Intercept u Intercept 0.14179 -0.1292 u -0.1292 0.18254 program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34 PROC IML Results ... 3 17:34 Wednesday, November 30, 2011 ITER L DL D2L DIFF BETA 1 -207.9442 128 -75 -75.25 1.42466 1.42466 135.45333 -75.25 -100.5006 0.2810698 ITER L DL D2L DIFF BETA 2 -86.11339 24.433947 -39.1126 -36.7227 0.382878 1.6898849 27.881783 -36.7227 -47.38327 0.6639477 ITER L DL D2L DIFF BETA 3 -76.10077 5.4183855 -24.6185 -20.35354 0.3405272 1.6284455 7.0927236 -20.35354 -24.50093 1.0044749 ITER L DL D2L DIFF BETA 4 -74.93779 0.8082239 -20.61113 -15.14088 0.1008562 1.5935698 1.1867474 -15.14088 -17.00239 1.1053311 ITER L DL D2L DIFF BETA 5 -74.89053 0.04008 -19.90169 -14.11319 0.0054776 1.5916993 0.0585386 -14.11319 -15.50634 1.1108087 ITER L DL D2L DIFF BETA 6 -74.8904 0.0001116 -19.8646 -14.05981 0.0000151 1.5916942 0.0001618 -14.05981 -15.4291 1.1108238 ITER L DL D2L DIFF BETA 7 -74.8904 8.55E-10 -19.86449 -14.05967 1.157E-10 1.5916942 1.2388E-9 -14.05967 -15.42889 1.1108238 LLRATIO -2 * loglikelihood = 149.78081 B1 SERR1 b1 coeff, std err : 1.5916942 0.376554 program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34 PROC IML Results ... 4 17:34 Wednesday, November 30, 2011 B2 SERR2 b2 coeff, std err : 1.1108238 0.4272664 COVAR covariance matrix : 0.1417929 -0.12921 -0.12921 0.1825565 BETA Current value of beta: 1.5916942 1.1108238 D2LNUMER Numerical Estimate of -Fisher Information: 19.753088 14.061641 14.061641 15.319301 COVARNUMER Numerical Estimate of Covariance Matrix: 0.1460737 -0.134082 -0.134082 0.1883511 BETA1 SERR1 Numerical: beta1, serr1 = 1.5916942 0.3821959 BETA2 SERR2 Numerical: beta2, serr2 = 1.1108238 0.4339943 -------------------------------------------------------------------------------------------------- Here is a comparison of the resulting covariance matrices and resulting standard errors of parameter estimates: Estimates of the Covariance Matrix From PROC LOGISTIC From -d2l / db2: From Numerical Approx of -d2l/db2 -------------------------------------- ---------------------- --------------------- Variable Intercept u Intercept u Intercept u Intercept 0.14179 -0.1292 0.14179 -0.1291 0.1461 -0.1341 u -0.1292 0.18254 -0.1292 0.1856 -0.1341 0.1856 ----------------------- ---------------------- ----------------------- Std Err Intcpt: 0.3766 0.3766 0.3822 Std Err b1 (u): 0.4272 0.4273 0.4340 Conclusion: In this case numerical approximation of 2nd derivatives worked reasonably well, but it does not agree exactly with the MLE computation based on the analytic second derivative matrix. This method has the advantage that it can be used even in cases where the second derivative is not used in the estimating program: e.g., when the EM algorithm is used. All that is needed is the ability to compute the log likelihood of the observed data. program: /home/walleye/john-c/5421/imlml.fisher.sas 30NOV11 17:34 ================================================================================== /home/walleye/john-c/5421/notes.028 Last update: November 30, 2011.