VARIANCE ESTIMATION, Continued                                 SPH 5421 notes.029

THE 'DELTA' METHOD

     The 'delta' method, also known as 'propagation of errors', is a useful
way to estimate variances of functions of random variables.  The basic problem
is as follows:

     Assume A and B are random variables with means MUA and MUB, and variances
SIG2A and SIG2B respectively, and covariance COVAB.  Suppose f(A, B) is a
real-valued function of these two random variables.  What is the variance of
f(A, B), and how may it be used for construction of confidence intervals?

     An answer to this is shown below.  Before giving the answer, I must point
out that it is not always explained very well what conditions must be met for
the answer to be valid.  Essentially the conditions are that X and Y must
behave like maximum likelihood estimates of parameters.  More precisely,
the multivariate distribution of A and B must be such that

                sqrt(N) * [(A, B)` - (MUA, MUB}']

converges in distribution (as sample size N gets large) to a multivariate normal
distribution with mean (MUA, MUB)` and covariance matrix SIGMA, where

                        |  SIG2A     COVAB  |
                SIGMA = |                   |
                        |  COVAB     SIG2B  |.

     When these conditions are met, the approximate variance of f(A, B) is
given by

                    a
       var(f(A, B)) = (df/dA)² * var(A) + 2 * (df/dA)*(df/dB)*cov(A, B) +
                      (df/dB)² * var(B).

       This can be written in matrix terms as:

                    a
       var(f(A, B)) = (df/dA  df/dB) * SIGMA * (df/dA  df/dB)`.


       This form of the delta method lends itself to computation in PROC IML.


       The 'delta method' is based on the first-degree Taylor's expansion of
the function f(A, B), and requires that the function be at least differentiable.
It generalizes to higher dimensions in a straightforward way.


CONFIDENCE REGIONS

     Many statistical routines which compute maximum likelihood
estimates also produce estimated covariance matrices for those
estimates.  If these are diagonal matrices, the corresponding
confidence regions are ellipsoids whose axes are parallel to the
coordinate axes in the parameter space.  One can compute confidence
regions for the estimates by describing an ellipsoid with a specified
volume.  In two dimensions, the area of an ellipse is the product

         pi * a * b,

where a and b are the lengths of the major and minor axes of the
ellipse.  There is a more complicated expression for higher
dimensions.

     In general the covariance matrix is not diagonal.  Confidence
regions are ellipses, but they do not have their axes parallel to
the coordinate axes.  To find their volumes (or areas) one needs to
(1) find their major and minor axes, and (2) compute their volumes
using the lengths of these axes.  As explained in notes.021 the
axes can be computed by finding the eigenvectors of the covariance
matrix.

EXAMPLE:

     Assume that T = time to death (in days) for a male mayfly, and
A = number of encounters that the mayfly has with female mayflies during
his lifetime.  It will be assumed that A has a Poisson distribution
with parameter equal to theta * T, where theta is an unknown parameter.
That is, the longer the mayfly lives, the greater his number of likely
encounters with female mayflies.

     Assume you have sampled N male mayflies.

     One plausible estimate of theta is:

     thetahat = (Sum of A_i) / (Sum of T_i).

     This can also be written as

     thetahat = (mean of A_i) / (mean of T_i).

     Use the delta method to compute the variance of thetahat.

SOLUTION:

     Let Tbar = mean of the T's and Abar = mean of the A's.

     The estimate of thetahat is thus: thetahat = Abar / Tbar.

     You will need the covariance of T and A.  Given a dataset with
the following format,

     Obs     T      A
     ---   -----  -----
      1     3.9     7
      2     1.4     5
      3     6.6    13
      .      .      .
      .      .      .
      .      .      .
      N     2.0     6

you can find the estimated covariance of T and A by the use of PROC CORR:

----------------------------------
proc corr data = mayflies cov ;
     var  t a ;
run ;

----------------------------------

Then (assuming independence of the data for different mayflies) the estimated 
covariance of Tbar and Abar is

           (1/N) * covar(T, A).

Further, var(Tbar) = (1/N) * var(T) and var(Abar) = (1/N) * var(A).

Estimates of both var(T) and var(A) are produced by PROC CORR.

Now let f(X, Y) = X / Y.  Then the estimated variance of f(Abar, Tbar) is

[1]  (df/dX)^2 * var(Abar) + 2*(df/dX)*(df/dY)*cov(Abar, Tbar) + (df/dY)^2 * var(Tbar).

Here df/dX will be evaluated at X = Abar, Y = Tbar.

Note that df/dX = 1/Y and df/dY = -X/Y^2.

The rest is arithmetic, inserting the right numbers into [1].

There is an important qualification regarding the delta method.  It does not work
for random variables in general.  In fact, sufficient conditions on the random
variables are that they are maximum likelihood estimates of the corresponding parameters.
In the case of Abar and Tbar, these are both simple mean values, which are MLEs of
the underlying true mean values, so the conditions are met.

==================================================================================

PROBLEM 31


     Refer to Problem 19, notes.020.  Use a PROC IML implementation of the
method described here to compute the standard error of the maximum
likelihood estimate of

     1 - exp(-b₀ - b₁ * 2)),

where b₀ and b₁ are the maximum likelihood estimates of the true values
B₀ and B₁.  Note that you are assuming t = 1 and age = 2.

     Use this to compute 95% confidence limits of the true value of
the above expression.

==================================================================================

PROBLEM 31a:

     Assume (b₁, b₂) are maximum likelihood estimates of
parameters (B₁, B₂), and that the estimated covariance matrix for
(b1, b2) is

           | .02   -.01 |
           |            |
           |-.01    .03 |.

     Find a 95% confidence region for the true parameters (B₁, B₂).

==================================================================================

/home/walleye/john-c/5421/notes.029   Last update: December 7, 2011.