VARIANCE ESTIMATION SPH 5421 notes.028
Variance-covariance matrices for maximum-likelihood estimates of parameters are
useful for constructing confidence regions and performing tests. SAS and Splus
linear and nonlinear regression routines all provide ways to estimate the asymptotic
covariance matrix for covariates. For maximum likelihood problems in general, the
covariance matrix is the inverse of the 'Fisher Information' matrix.
Fisher information is the expectation of the negative of the matrix of second
partial derivatives of the log likelihood. Let l = log(L; B), where L is the
likelihood of an observed data set and B is a vector of parameters. Then
Fisher Information = -E{d2l/dbidbj}.
The expectation is taken across all the possible realizations of the data.
There is a theorem which says
[1] -E{d2l/dbidbj} = E{dl/dbi * dl/dbj}.
(See, e.g., P. Bickel & K. Doksum, Mathematical Statistics: Basic Ideas and
Selected Topics, Holden-Day, Inc., San Francisco, 1977, p 139.)
Thus Fisher information can be expressed in terms of the expectation of a
matrix of first derivatives.
The problem with Fisher information is that the expectation is very difficult
to calculate. What is done in fact is to evaluate the matrix of second derivatives
at the MLE itself. This works well if you have exact expressions for the second
derivative matrix. Numerical approximation of the second derivative matrix does NOT
work very well, because the denominators of the approximating expressions are
extremely small, making the arithmetic unstable.
You might think that you could use the above theorem [1] and base the
approximations on the observed value of the matrix of products of the first
derivatives at the MLEs. In general, numerical approximations of first derivatives
are reliable. The problem with this approach is that if you evaluate the first
derivatives at the MLE, they will all be zero. The expectation of a product of
functions is not equal to the product of the expectations, in general.
To summarize: it is difficult or impossible to compute the expectations. Trying
the estimate the second derivative matrix numerically does not work well. Evaluating
the matrix of products of first derivatives at the MLE produces all zeroes. What
should be done?
An approach that appears to work well is to approximate the right side of [1]
by evaluating the first derivatives of the log likelihood for each individual
observation at the MLE, then adding them all up across the observations, then
computing the product shown in [1]. This appears to give a fairly good
approximation to the covariance of the parameter estimates.
This approach is implemented in the PROC IML 'amoeba2' program in notes.026.
Note that for this purpose the log-likelihood expression for a least-squares problem
must include a factor (1 / (2 * sig2)) for the variance parameter. In this case
sig2 is estimated as the sum of squares divided by n, where n is the number of
observations:
a
sig2 = sum of squared errors / n
[This is the maximum likelihood estimate of the true sig2, and, as usually
happens, it is biased, with the bias being greater in small samples. The bias
can be reduced by replacing the denominator n by n - p, where p is the number of
parameters being estimated.]
The estimated sig2 is used to provide an estimate of the Fisher information
in the following two statements in the program:
d1l = (1 / (2 * sig2)) * d1l ;
gtg = d1l` * d1l ;
The sections of the amoeba program which produce estimates of the covariance
matrix are the following:
==================================================================================
* Function to get log likelihood and pieces needed to approximate ;
* the first derivatives and Hessian matrix ;
*======================================================================;
start logldiff(x, y, n, p, beta, loglike, d1sum, gtg, fcall) ;
h = 1e-8;
fcall = fcall + 4 * p ;
hh = h * beta ;
d1 = j(n, p, 0) ;
d1sum = j(p, 1, 0) ;
do j = 1 to n ;
do i = 1 to p;
if abs(hh[i]) = 0 then hh[i] = h;
hhi = hh[i] ;
modparm1 = beta ;
modparm2 = beta ;
modparm3 = beta ;
modparm4 = beta ;
modparm1[i] = beta[i] + 1 * hhi ;
modparm2[i] = beta[i] + 2 * hhi ;
modparm3[i] = beta[i] - 1 * hhi ;
modparm4[i] = beta[i] - 2 * hhi ;
run loglikej(j, x, y, n, p, modparm1, ll) ;
llp1h = ll;
run loglikej(j, x, y, n, p, modparm2, ll) ;
llp2h = ll;
run loglikej(j, x, y, n, p, modparm3, ll) ;
llm1h = ll;
run loglikej(j, x, y, n, p, modparm4, ll) ;
llm2h = ll;
derivj = (llm2h - 8*llm1h + 8*llp1h - llp2h) / (12*hh[i]);
d1[j, i] = derivj ;
d1sum[i] = d1sum[i] + derivj ;
end ;
end ;
run loglike(x, y, n, p, beta, ll, fcall) ;
loglike = ll ;
sig2 = loglike / n ; *** Mean squared error computed here
d1 = (1 / (2 * sig2)) * d1 ; *** Used in covariance estimate here
gtg = (d1` * d1) ;
finish logldiff ;
*======================================================================;
The basic idea in this section of the amoeba2 program is to numerically compute
an estimate of the derivative of the log-likelihood for each observation, with
respect to each of the parameters. The simplest way to estimate a derivative for a
function f(x) and x0 is to evaluate
a
f'(x) = (f(x + h) - f(x)) / h,
for very small values of h (e.g., h < 10**(-6)).
A better numerical approximation to the first derivative is by a method called
'Richardson extrapolation.' The appropriate expression is
a
f'(x) = (f(x - 2*h) - 8*f(x - h) + 8*f(x + h) - f(x + 2*h)) / 12*h,
and this is the method used in amoeba2.
The simplex program amoeba2 in notes.026 is essentially a sum-of-squares
minimization program. As noted above, to estimate the standard error, it is
necessary to first obtain an estimate of the mean squared error, that is,
s2 = (1/n) * sum(Yi - f(Xi; B))^2.
This is used in estimating the covariance matrix of the coefficient estimates in the
lines above marked with ***. Note that this appears to differ from the method for
estimating the covariance matrix which was used in notes.019 (program
gaussimlml.sas). The reason for this is that simplex program is minimizing not the
complete expression for the negative of the log likelihood, but only the numerator
of this expression.
Note that the function logldiff is called only once in running the amoeba2
program: at the end just before the parameter estimates and their standard errors
are printed.
The above method of estimating the covariance matrix of parameter estimates
will work with other methods of obtaining parameter estimates. The following
example shows how it can be implemented in Splus (after obtaining parameter
estimates from nlmin):
==================================================================================
# This Splus program plots chlorine vs. time from the
# Draper dataset (see notes.026: 44 observations),
# and does least-squares regression using the S+ function nlmin.
# To run this program in Splus, say: source("draper.s").
# Then to see parameter estimates, say: m
# To see estimated standard errors, say: der.f
postscript("draper.ps", horizontal = F, onefile = T)
par(mfrow = c(2, 1), oma = c(2, 2, 2, 2))
rm(sasdat)
sasdat <- sas.get(library = "/home/gnome/john-c/5421", mem = "drape")
time <- sasdat[, "time"]
chlorine <- sasdat[, "chlorine"]
hist(chlorine, nclass = 6, main = "SPH 5421 Draper data: chlorine.")
par(new = F)
plot(time, chlorine)
b <- c(2, 2)
f1 <- function(v)
{
w <- exp(-v[2] * (time - 8))
f <- v[1] + (.49 - v[1]) * w
ssq <- sum((chlorine - f)^2)
ssq
}
m <- nlmin(f1, b, max.fcal = 1000, max.iter = 1000)
derivf <- function(v)
{
h <- 1e-6
a <- v[1]
b <- v[2]
aph <- v[1] + h
bph <- v[2] + h
w0 <- exp(-b * (time - 8))
f0 <- a + (.49 - a) * w0
ssq0 <- (chlorine - f0)^2
faph <- aph + (.49 - aph) * w0
ssqaph <- (chlorine - faph)^2
wbph <- exp(-bph * (time - 8))
fbph <- a + (.49 - a) * wbph
ssqbph <- (chlorine - fbph)^2
dla <- (ssqaph - ssq0) / h # partial derivs of SSQ estimated here ...
dlb <- (ssqbph - ssq0) / h
# Note that these derivative estimates are actually n x 1 vectors.
s2 <- (sum((chlorine - f0)^2)) / 44
dl <- cbind(dla, dlb)
dl <- (1 / (2 * s2)) * dl
# Note that here dl is an (n x p) matrix ...
gtg <- t(dl) %*% dl
cov <- solve(gtg)
se1 <- sqrt(cov[1,1])
se2 <- sqrt(cov[2,2])
serrs <- c(se1, se2)
serrs
}
beta <- c(.3901400, .1016327)
der.f <- derivf(beta)
# rm(time, chlorine, sasdat)
dev.off()
==================================================================================
Note that the derivative approximation here is based simply on the usual
definition of the derivative,
a
f'(x) = (f(x + h) - f(x)) / h,
rather than using the theoretically more accurate Richardson Extrapolation formula
that was used in the simplex program.
The program above produces the following estimated standard errors for the two
coefficient estimates:
std err(a) = .005637
std err(b) = .012618
These estimates are similar to, but not the same as, the estimates produced by
proc nlin, and identical to those produced by the simplex method:
proc nlin simplex nlmin
--------- -------- ----------
std err(a) = .005045 .005637 .004625
std err(b) = .013360 .012618 .012246
==================================================================================
PROBLEM 30
Refer to Problem 19 in notes.020: compute numerical estimates of the covariance
matrix for the estimates of B0 and B1, based on a method like the 'amoeba2' program.
Note that this is not a least-squares problem, and the parameter sig2 should not
enter into the calculations.
==================================================================================
/home/walleye/john-c/5421/notes.028 Last update: December 8, 2003.