notes.016b  October 16, 2010

Using PROC IML to Find Bootstrap Estimates ...
===============================================================================

The 'Bootstrap' procedure is a method for estimating characteristics of a distribution
without making parametric assumptions.  The basic procedure is simple.  Assume you
are given a datafile XDATA with the numeric variable X.  You want to estimate the variance
of the median of X.  Here is the procedure:

(1) You create a new file YDATA, same length as XDATA, by sampling from XDATA ***with replacement***.
    This is called a 'bootstrap sample'.

(2) You find the median of Y in YDATA.  You append this value to another file called ACCUM.MEDIANS.

(3) You cycle back and do steps (1) and (2) N times.  (You choose N to be, say, 1000).

(4) You use PROC MEANS or PROC UNIVARIATE to compute the variance of the accumulated bootstrap medians.

The bootstrap can also be used to estimate, for example, a 95% confidence interval for
the statistic you are interested in (the median, perhaps).

Another possible use is in comparing statistics from randomized groups in a clinical trial.
In that case, the bootstrap sampling must be constructed in such a way as to preserve
the clinical trial design.  If the clinical trial has groups A and B, and you are interested
in the variance of the difference of the medians between these two groups, the bootstrap
samples must be ***within*** the randomized groups.

The bootstrap is one of several computing-intensive nonparametric procedures.
was developed originally by Bradley Efron of Stanford University.  A good reference for the
bootstrap is the following:

Efron, B.; Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC.

Below is a program which (1) generates a random sample from a lognormal distribution, (2) uses
PROC IML to derive bootstrap samples, (3) uses PROC MEANS to find the median of each bootstrap
sample, (4) accumulates a file of the medians, and (5) uses PROC UNIVARIATE to estimate the
variance of the median.  The usefulness of PROC IML in this case is that it provides an
efficient way to construct bootstrap samples.  This is because it has the capability of
turning a datafile into an array.
*===================================================================;

options linesize = 80 MPRINT ;
footnote "~john-c/5421/bootstrap.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;
LIBNAME  loc '' ;

OPTIONS  LINESIZE = 80 MPRINT ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = landscape
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;        

data skewed ;

     n = 400 ;

     do i = 1 to n ;

        x = rannor(-1) ;
        y = exp(x) ;
        output ;

     end ;

run ;

%macro bootmedian(datafile, n, m, outfile) ;

       %do i = 1 %to &m ;

           proc iml ;

             y = j(&n, 1, 0) ;
             z = j(&n, 1, 0) ;

             use &datafile ;

             read all var {y} into y ;

             do j = 1 to &n ;

                yrandindex = 1 + int(&n * ranuni(-1)) ;
                z[j] = y[yrandindex] ;

             end ;

           varnames = "zvalues" ;

           create zboots from z [colname = varnames] ;
           append from z ;

        quit ;

        proc means data = zboots n mean stddev min max median noprint ;
             var zvalues ;
             output out = zmedian median = samplemedian ;
        run ;

        data outfile ;
             set outfile zmedian ;
        run ;

      %end ;

%mend ;


data outfile ;
run ;

proc means data = skewed n mean var stddev stderr median ;
     var y ;
title1 'Statistics for the original data ...' ;
run ;

%bootmedian(skewed, 400, 1000, outfile) ;

proc sort data = outfile ; by samplemedian ;
run ;

proc univariate data = outfile ;
     var samplemedian ;
run ;

proc print data = outfile ;
     var samplemedian ;
run ;

proc sort data = skewed ; by y ;
run ;

data skewed ;
     retain cumulative 0 casenum 0  cumold 0 yold ;
     retain cumlognormal 0 ;
     set skewed ;
     if casenum eq 0 then yold = y ;
     casenum = casenum + 1 ;
     cumlognormal = probnorm(log(y)) ;
     cumulative = cumulative + 1/ 400 ;
     yold = y ;
     output ;
run ;

proc print data = skewed ;
     var y ;
title 'Printout of original data ordered by y ...' ;
run ;

symbol1 i = j   v = none   c = black   w = 3   h = 3 ;

proc gplot data = skewed ;
     where y le 2 ;
     plot cumulative * y ;
title1 h = 2 'The Empirical CDF for Lognormal data' ;
run ;

goptions gsfmode = append ;

proc gplot data = skewed ;
     where y le 2 ;
     plot cumlognormal * y ;
title1 h = 2 'The True CDF for the Lognormal' ;
run ;


endsas ;

*===================================================================;


                      Statistics for the original data ...                     1
                                                  13:47 Sunday, October 17, 2010

                              The MEANS Procedure

                             Analysis Variable : y 
 
   N           Mean       Variance        Std Dev      Std Error         Median
 ------------------------------------------------------------------------------
 400      1.8081035      7.8903770      2.8089815      0.1404491      1.0516644
 ------------------------------------------------------------------------------
 
 
 
                    ~john-c/5421/bootstrap.sas 17OCT10 13:47
                      Statistics for the original data ...                     2
                                                  13:47 Sunday, October 17, 2010

                            The UNIVARIATE Procedure
                            Variable:  samplemedian

                                    Moments

        N                        1000    Sum Weights               1000
        Mean               1.05297519    Sum Observations    1052.97519
        Std Deviation      0.06992941    Variance            0.00489012
        Skewness           0.47589645    Kurtosis             0.2333124
        Uncorrected SS     1113.64198    Corrected SS          4.885232
        Coeff Variation    6.64112592    Std Error Mean      0.00221136


                           Basic Statistical Measures
 
                 Location                    Variability

             Mean     1.052975     Std Deviation            0.06993
             Median   1.052809     Variance                 0.00489
             Mode     1.016383     Range                    0.41545
                                   Interquartile Range      0.08393

    NOTE: The mode displayed is the smallest of 5 modes with a count of 19.


                           Tests for Location: Mu0=0
 
                Test           -Statistic-    -----p Value------

                Student's t    t  476.1659    Pr > |t|    <.0001
                Sign           M       500    Pr >= |M|   <.0001
                Signed Rank    S    250250    Pr >= |S|   <.0001


                            Quantiles (Definition 5)
 
                            Quantile       Estimate

                            100% Max       1.307558
                            99%            1.237299
                            95%            1.188665
                            90%            1.146497
                            75% Q3         1.090729
                            50% Median     1.052809
                            25% Q1         1.006804
                            10%            0.957871
                            5%             0.950592
                            1%             0.918669
                            0% Min         0.892106


 
 
 
 
 
                    ~john-c/5421/bootstrap.sas 17OCT10 13:47
                      Statistics for the original data ...                     3
                                                  13:47 Sunday, October 17, 2010

                            The UNIVARIATE Procedure
                            Variable:  samplemedian

                              Extreme Observations
 
                  ------Lowest------        -----Highest-----
 
                      Value      Obs           Value      Obs

                   0.892106        3         1.24358      997
                   0.892106        2         1.25550      998
                   0.892466        4         1.27638      999
                   0.892826        5         1.27638     1000
                   0.898620        6         1.30756     1001


                                 Missing Values
 
                                         -----Percent Of-----
                  Missing                             Missing
                    Value       Count     All Obs         Obs

                        .           1        0.10      100.00
 
 
 
                    ~john-c/5421/bootstrap.sas 17OCT10 13:47
                      Statistics for the original data ...                     4
                                                  13:47 Sunday, October 17, 2010

                               Obs    samplemedian

                                 1        .       
                                 2       0.89211  
                                 3       0.89211  
                                 4       0.89247  
                                 5       0.89283  
                                 6       0.89862  
                                 7       0.89873  
                                 8       0.90154  
                                 9       0.90154  
                                10       0.90839  
                                11       0.91528  
                                12       0.92206  
                                13       0.92211  
                                14       0.92211  
                                15       0.92216  
                                16       0.92216  
                                17       0.92341  
                                18       0.92552  
                                19       0.92637  
                                20       0.93073  
                                21       0.93158  
                                22       0.93158  
                                23       0.93410  
                                24       0.93680  
                                25       0.93680  
                                26       0.93932  
 Middle observations omitted ...

                               974       1.20931  
                               975       1.20931  
                               976       1.21232  
                               977       1.21232  
                               978       1.21232  
                               979       1.21232  
                               980       1.22435  
                               981       1.22584  
                               982       1.22584  
                               983       1.22733  
                               984       1.23144  
                               985       1.23144  
                               986       1.23144  
                               987       1.23407  
                               988       1.23556  
                               989       1.23599  
                               990       1.23641  
                               991       1.23641  
                               992       1.23819  
                               993       1.23861  
                               994       1.24082  
                               995       1.24082  
                               996       1.24082  
                               997       1.24358  
                               998       1.25550  
                               999       1.27638  
                              1000       1.27638  
                              1001       1.30756  
 
 
 
                    ~john-c/5421/bootstrap.sas 17OCT10 13:47



Note that this program yields two ways of computing a 95% confidence interval for the median.
One is by estimating the sample median +/- 1.96 * std err, where the standard error is the
square root of the variance of the bootstrap median estimates.  The other is by looking at
lower 2.5% and upper 2.5% of the bootstrap sample medians.  In general these are not
identical.  In general preference would be given to the latter method, because the former
method depends on asymptotic normality of the mean of the bootstrap sample medians.
The values obtained by these two methods are the following:

      median +/- 1.96 * stderr = 1.05166 +/- 1.96 * .06993 =                (0.9146, 1.18872)

      95% CL from the upper and lower 2.5% of the bootstrap sample medians: (0.9393, 1.2123)

There is an analytic expression for the large-sample variance of the median.  It is:

      var(x_median) = 1 / (4 * N * f^2(x_median)),

where N is the sample size and f(.) is the pdf for the distribution.  See:

      http://mathworld.wolfram.com/StatisticalMedian.html

This is not terribly useful in real life, because in those situations where you might 
want to use the bootstrap, it will often be true that you don't know what the 
pdf f(.) might be.

However, in the case of the lognormal, this can be computed.  The median for the standard
lognormal is exp(0) = 1.  The cdf is given by

          F(x) = (1/sqrt(2 * pi)) * integral [-infinity to log(x)] exp(-u^2 / 2) du]

The pdf is the derivative of the cdf.  Therefore the pdf is

          f(x) = F'(x) = (1/sqrt(2 * pi)) * (1/x) * exp(-log(x)^2 / 2).

          Then note that the median of x is at the value x = 1.  Therefore

          f(x_median) = f(1) = (1/sqrt( 2 * pi)) * 1/1 * exp(0)

                             = 1/sqrt(2*pi) = 1/sqrt(6.283).

Therefore the variance is 1/(4*N*f^2(1)) = 6.28/(4 * N).
In the example above, N = 400.  So the asymptotic variance is 6.28 / (4 * 400) = .003925.

You can compare this to the bootstrap estimate in the data above: variance = approx. .00489.
This is reasonably good agreement.


=======================================================================================================

Problem 1:

      Generate a file with 1000 pseudorandom observations from the exponential distribution.

      Let Y = abs(mean(X) - 1).

      Use the bootstrap to estimate (1) stderr(Y), and (2) 95% confidence limits for the expectation of Y.

=======================================================================================================

/home/gnome/john-c/5421/notes.016b     Last date revised: October 17, 2010.