notes.016b October 16, 2010
Using PROC IML to Find Bootstrap Estimates ...
===============================================================================
The 'Bootstrap' procedure is a method for estimating characteristics of a distribution
without making parametric assumptions. The basic procedure is simple. Assume you
are given a datafile XDATA with the numeric variable X. You want to estimate the variance
of the median of X. Here is the procedure:
(1) You create a new file YDATA, same length as XDATA, by sampling from XDATA ***with replacement***.
This is called a 'bootstrap sample'.
(2) You find the median of Y in YDATA. You append this value to another file called ACCUM.MEDIANS.
(3) You cycle back and do steps (1) and (2) N times. (You choose N to be, say, 1000).
(4) You use PROC MEANS or PROC UNIVARIATE to compute the variance of the accumulated bootstrap medians.
The bootstrap can also be used to estimate, for example, a 95% confidence interval for
the statistic you are interested in (the median, perhaps).
Another possible use is in comparing statistics from randomized groups in a clinical trial.
In that case, the bootstrap sampling must be constructed in such a way as to preserve
the clinical trial design. If the clinical trial has groups A and B, and you are interested
in the variance of the difference of the medians between these two groups, the bootstrap
samples must be ***within*** the randomized groups.
The bootstrap is one of several computing-intensive nonparametric procedures.
was developed originally by Bradley Efron of Stanford University. A good reference for the
bootstrap is the following:
Efron, B.; Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC.
Below is a program which (1) generates a random sample from a lognormal distribution, (2) uses
PROC IML to derive bootstrap samples, (3) uses PROC MEANS to find the median of each bootstrap
sample, (4) accumulates a file of the medians, and (5) uses PROC UNIVARIATE to estimate the
variance of the median. The usefulness of PROC IML in this case is that it provides an
efficient way to construct bootstrap samples. This is because it has the capability of
turning a datafile into an array.
*===================================================================;
options linesize = 80 MPRINT ;
footnote "~john-c/5421/bootstrap.sas &sysdate &systime" ;
FILENAME GRAPH 'gsas.grf' ;
LIBNAME loc '' ;
OPTIONS LINESIZE = 80 MPRINT ;
GOPTIONS
RESET = GLOBAL
ROTATE = landscape
FTEXT = SWISSB
DEVICE = PSCOLOR
GACCESS = SASGASTD
GSFNAME = GRAPH
GSFMODE = REPLACE
GUNIT = PCT BORDER
CBACK = WHITE
HTITLE = 2 HTEXT = 1 ;
*===================================================================== ;
data skewed ;
n = 400 ;
do i = 1 to n ;
x = rannor(-1) ;
y = exp(x) ;
output ;
end ;
run ;
%macro bootmedian(datafile, n, m, outfile) ;
%do i = 1 %to &m ;
proc iml ;
y = j(&n, 1, 0) ;
z = j(&n, 1, 0) ;
use &datafile ;
read all var {y} into y ;
do j = 1 to &n ;
yrandindex = 1 + int(&n * ranuni(-1)) ;
z[j] = y[yrandindex] ;
end ;
varnames = "zvalues" ;
create zboots from z [colname = varnames] ;
append from z ;
quit ;
proc means data = zboots n mean stddev min max median noprint ;
var zvalues ;
output out = zmedian median = samplemedian ;
run ;
data outfile ;
set outfile zmedian ;
run ;
%end ;
%mend ;
data outfile ;
run ;
proc means data = skewed n mean var stddev stderr median ;
var y ;
title1 'Statistics for the original data ...' ;
run ;
%bootmedian(skewed, 400, 1000, outfile) ;
proc sort data = outfile ; by samplemedian ;
run ;
proc univariate data = outfile ;
var samplemedian ;
run ;
proc print data = outfile ;
var samplemedian ;
run ;
proc sort data = skewed ; by y ;
run ;
data skewed ;
retain cumulative 0 casenum 0 cumold 0 yold ;
retain cumlognormal 0 ;
set skewed ;
if casenum eq 0 then yold = y ;
casenum = casenum + 1 ;
cumlognormal = probnorm(log(y)) ;
cumulative = cumulative + 1/ 400 ;
yold = y ;
output ;
run ;
proc print data = skewed ;
var y ;
title 'Printout of original data ordered by y ...' ;
run ;
symbol1 i = j v = none c = black w = 3 h = 3 ;
proc gplot data = skewed ;
where y le 2 ;
plot cumulative * y ;
title1 h = 2 'The Empirical CDF for Lognormal data' ;
run ;
goptions gsfmode = append ;
proc gplot data = skewed ;
where y le 2 ;
plot cumlognormal * y ;
title1 h = 2 'The True CDF for the Lognormal' ;
run ;
endsas ;
*===================================================================;
Statistics for the original data ... 1
13:47 Sunday, October 17, 2010
The MEANS Procedure
Analysis Variable : y
N Mean Variance Std Dev Std Error Median
------------------------------------------------------------------------------
400 1.8081035 7.8903770 2.8089815 0.1404491 1.0516644
------------------------------------------------------------------------------
~john-c/5421/bootstrap.sas 17OCT10 13:47
Statistics for the original data ... 2
13:47 Sunday, October 17, 2010
The UNIVARIATE Procedure
Variable: samplemedian
Moments
N 1000 Sum Weights 1000
Mean 1.05297519 Sum Observations 1052.97519
Std Deviation 0.06992941 Variance 0.00489012
Skewness 0.47589645 Kurtosis 0.2333124
Uncorrected SS 1113.64198 Corrected SS 4.885232
Coeff Variation 6.64112592 Std Error Mean 0.00221136
Basic Statistical Measures
Location Variability
Mean 1.052975 Std Deviation 0.06993
Median 1.052809 Variance 0.00489
Mode 1.016383 Range 0.41545
Interquartile Range 0.08393
NOTE: The mode displayed is the smallest of 5 modes with a count of 19.
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t 476.1659 Pr > |t| <.0001
Sign M 500 Pr >= |M| <.0001
Signed Rank S 250250 Pr >= |S| <.0001
Quantiles (Definition 5)
Quantile Estimate
100% Max 1.307558
99% 1.237299
95% 1.188665
90% 1.146497
75% Q3 1.090729
50% Median 1.052809
25% Q1 1.006804
10% 0.957871
5% 0.950592
1% 0.918669
0% Min 0.892106
~john-c/5421/bootstrap.sas 17OCT10 13:47
Statistics for the original data ... 3
13:47 Sunday, October 17, 2010
The UNIVARIATE Procedure
Variable: samplemedian
Extreme Observations
------Lowest------ -----Highest-----
Value Obs Value Obs
0.892106 3 1.24358 997
0.892106 2 1.25550 998
0.892466 4 1.27638 999
0.892826 5 1.27638 1000
0.898620 6 1.30756 1001
Missing Values
-----Percent Of-----
Missing Missing
Value Count All Obs Obs
. 1 0.10 100.00
~john-c/5421/bootstrap.sas 17OCT10 13:47
Statistics for the original data ... 4
13:47 Sunday, October 17, 2010
Obs samplemedian
1 .
2 0.89211
3 0.89211
4 0.89247
5 0.89283
6 0.89862
7 0.89873
8 0.90154
9 0.90154
10 0.90839
11 0.91528
12 0.92206
13 0.92211
14 0.92211
15 0.92216
16 0.92216
17 0.92341
18 0.92552
19 0.92637
20 0.93073
21 0.93158
22 0.93158
23 0.93410
24 0.93680
25 0.93680
26 0.93932
Middle observations omitted ...
974 1.20931
975 1.20931
976 1.21232
977 1.21232
978 1.21232
979 1.21232
980 1.22435
981 1.22584
982 1.22584
983 1.22733
984 1.23144
985 1.23144
986 1.23144
987 1.23407
988 1.23556
989 1.23599
990 1.23641
991 1.23641
992 1.23819
993 1.23861
994 1.24082
995 1.24082
996 1.24082
997 1.24358
998 1.25550
999 1.27638
1000 1.27638
1001 1.30756
~john-c/5421/bootstrap.sas 17OCT10 13:47
Note that this program yields two ways of computing a 95% confidence interval for the median.
One is by estimating the sample median +/- 1.96 * std err, where the standard error is the
square root of the variance of the bootstrap median estimates. The other is by looking at
lower 2.5% and upper 2.5% of the bootstrap sample medians. In general these are not
identical. In general preference would be given to the latter method, because the former
method depends on asymptotic normality of the mean of the bootstrap sample medians.
The values obtained by these two methods are the following:
median +/- 1.96 * stderr = 1.05166 +/- 1.96 * .06993 = (0.9146, 1.18872)
95% CL from the upper and lower 2.5% of the bootstrap sample medians: (0.9393, 1.2123)
There is an analytic expression for the large-sample variance of the median. It is:
var(x_median) = 1 / (4 * N * f^2(x_median)),
where N is the sample size and f(.) is the pdf for the distribution. See:
http://mathworld.wolfram.com/StatisticalMedian.html
This is not terribly useful in real life, because in those situations where you might
want to use the bootstrap, it will often be true that you don't know what the
pdf f(.) might be.
However, in the case of the lognormal, this can be computed. The median for the standard
lognormal is exp(0) = 1. The cdf is given by
F(x) = (1/sqrt(2 * pi)) * integral [-infinity to log(x)] exp(-u^2 / 2) du]
The pdf is the derivative of the cdf. Therefore the pdf is
f(x) = F'(x) = (1/sqrt(2 * pi)) * (1/x) * exp(-log(x)^2 / 2).
Then note that the median of x is at the value x = 1. Therefore
f(x_median) = f(1) = (1/sqrt( 2 * pi)) * 1/1 * exp(0)
= 1/sqrt(2*pi) = 1/sqrt(6.283).
Therefore the variance is 1/(4*N*f^2(1)) = 6.28/(4 * N).
In the example above, N = 400. So the asymptotic variance is 6.28 / (4 * 400) = .003925.
You can compare this to the bootstrap estimate in the data above: variance = approx. .00489.
This is reasonably good agreement.
=======================================================================================================
Problem 1:
Generate a file with 1000 pseudorandom observations from the exponential distribution.
Let Y = abs(mean(X) - 1).
Use the bootstrap to estimate (1) stderr(Y), and (2) 95% confidence limits for the expectation of Y.
=======================================================================================================
/home/gnome/john-c/5421/notes.016b Last date revised: October 17, 2010.