VARIANCE ESTIMATION, Continued SPH 5421 notes.029 THE 'DELTA' METHOD The 'delta' method, also known as 'propagation of errors', is a useful way to estimate variances of functions of random variables. The basic problem is as follows: Assume A and B are random variables with means MUA and MUB, and variances SIG2A and SIG2B respectively, and covariance COVAB. Suppose f(A, B) is a real-valued function of these two random variables. What is the variance of f(A, B), and how may it be used for construction of confidence intervals? An answer to this is shown below. Before giving the answer, I must point out that it is not always explained very well what conditions must be met for the answer to be valid. Essentially the conditions are that X and Y must behave like maximum likelihood estimates of parameters. More precisely, the multivariate distribution of A and B must be such that sqrt(N) * [(A, B)` - (MUA, MUB}'] converges in distribution (as sample size N gets large) to a multivariate normal distribution with mean (MUA, MUB)` and covariance matrix SIGMA, where | SIG2A COVAB | SIGMA = | | | COVAB SIG2B |. When these conditions are met, the approximate variance of f(A, B) is given by a var(f(A, B)) = (df/dA)2 * var(A) + 2 * (df/dA)*(df/dB)*cov(A, B) + (df/dB)2 * var(B). This can be written in matrix terms as: a var(f(A, B)) = (df/dA df/dB) * SIGMA * (df/dA df/dB)`. This form of the delta method lends itself to computation in PROC IML. The 'delta method' is based on the first-degree Taylor's expansion of the function f(A, B), and requires that the function be at least differentiable. It generalizes to higher dimensions in a straightforward way. CONFIDENCE REGIONS Many statistical routines which compute maximum likelihood estimates also produce estimated covariance matrices for those estimates. If these are diagonal matrices, the corresponding confidence regions are ellipsoids whose axes are parallel to the coordinate axes in the parameter space. One can compute confidence regions for the estimates by describing an ellipsoid with a specified volume. In two dimensions, the area of an ellipse is the product pi * a * b, where a and b are the lengths of the major and minor axes of the ellipse. There is a more complicated expression for higher dimensions. In general the covariance matrix is not diagonal. Confidence regions are ellipses, but they do not have their axes parallel to the coordinate axes. To find their volumes (or areas) one needs to (1) find their major and minor axes, and (2) compute their volumes using the lengths of these axes. As explained in notes.021 the axes can be computed by finding the eigenvectors of the covariance matrix. EXAMPLE: Assume that T = time to death (in days) for a male mayfly, and A = number of encounters that the mayfly has with female mayflies during his lifetime. It will be assumed that A has a Poisson distribution with parameter equal to theta * T, where theta is an unknown parameter. That is, the longer the mayfly lives, the greater his number of likely encounters with female mayflies. Assume you have sampled N male mayflies. One plausible estimate of theta is: thetahat = (Sum of Ai) / (Sum of Ti). This can also be written as thetahat = (mean of Ai) / (mean of Ti). Use the delta method to compute the variance of thetahat. SOLUTION: Let Tbar = mean of the T's and Abar = mean of the A's. The estimate of thetahat is thus: thetahat = Abar / Tbar. You will need the covariance of T and A. Given a dataset with the following format, Obs T A --- ----- ----- 1 3.9 7 2 1.4 5 3 6.6 13 . . . . . . . . . N 2.0 6 you can find the estimated covariance of T and A by the use of PROC CORR: ---------------------------------- proc corr data = mayflies cov ; var t a ; run ; ---------------------------------- Then (assuming independence of the data for different mayflies) the estimated covariance of Tbar and Abar is (1/N) * covar(T, A). Further, var(Tbar) = (1/N) * var(T) and var(Abar) = (1/N) * var(A). Estimates of both var(T) and var(A) are produced by PROC CORR. Now let f(X, Y) = X / Y. Then the estimated variance of f(Abar, Tbar) is [1] (df/dX)^2 * var(Abar) + 2*(df/dX)*(df/dY)*cov(Abar, Tbar) + (df/dY)^2 * var(Tbar). Here df/dX will be evaluated at X = Abar, Y = Tbar. Note that df/dX = 1/Y and df/dY = -X/Y^2. The rest is arithmetic, inserting the right numbers into [1]. There is an important qualification regarding the delta method. It does not work for random variables in general. In fact, sufficient conditions on the random variables are that they are maximum likelihood estimates of the corresponding parameters. In the case of Abar and Tbar, these are both simple mean values, which are MLEs of the underlying true mean values, so the conditions are met. ================================================================================== PROBLEM 31 Refer to Problem 19, notes.020. Use a PROC IML implementation of the method described here to compute the standard error of the maximum likelihood estimate of 1 - exp(-b0 - b1 * 2)), where b0 and b1 are the maximum likelihood estimates of the true values B0 and B1. Note that you are assuming t = 1 and age = 2. Use this to compute 95% confidence limits of the true value of the above expression. ================================================================================== PROBLEM 31a: Assume (b1, b2) are maximum likelihood estimates of parameters (B1, B2), and that the estimated covariance matrix for (b1, b2) is | .02 -.01 | | | |-.01 .03 |. Find a 95% confidence region for the true parameters (B1, B2). ================================================================================== /home/walleye/john-c/5421/notes.029 Last update: December 7, 2011.