notes.016b October 16, 2010 Using PROC IML to Find Bootstrap Estimates ... =============================================================================== The 'Bootstrap' procedure is a method for estimating characteristics of a distribution without making parametric assumptions. The basic procedure is simple. Assume you are given a datafile XDATA with the numeric variable X. You want to estimate the variance of the median of X. Here is the procedure: (1) You create a new file YDATA, same length as XDATA, by sampling from XDATA ***with replacement***. This is called a 'bootstrap sample'. (2) You find the median of Y in YDATA. You append this value to another file called ACCUM.MEDIANS. (3) You cycle back and do steps (1) and (2) N times. (You choose N to be, say, 1000). (4) You use PROC MEANS or PROC UNIVARIATE to compute the variance of the accumulated bootstrap medians. The bootstrap can also be used to estimate, for example, a 95% confidence interval for the statistic you are interested in (the median, perhaps). Another possible use is in comparing statistics from randomized groups in a clinical trial. In that case, the bootstrap sampling must be constructed in such a way as to preserve the clinical trial design. If the clinical trial has groups A and B, and you are interested in the variance of the difference of the medians between these two groups, the bootstrap samples must be ***within*** the randomized groups. The bootstrap is one of several computing-intensive nonparametric procedures. was developed originally by Bradley Efron of Stanford University. A good reference for the bootstrap is the following: Efron, B.; Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC. Below is a program which (1) generates a random sample from a lognormal distribution, (2) uses PROC IML to derive bootstrap samples, (3) uses PROC MEANS to find the median of each bootstrap sample, (4) accumulates a file of the medians, and (5) uses PROC UNIVARIATE to estimate the variance of the median. The usefulness of PROC IML in this case is that it provides an efficient way to construct bootstrap samples. This is because it has the capability of turning a datafile into an array. *===================================================================; options linesize = 80 MPRINT ; footnote "~john-c/5421/bootstrap.sas &sysdate &systime" ; FILENAME GRAPH 'gsas.grf' ; LIBNAME loc '' ; OPTIONS LINESIZE = 80 MPRINT ; GOPTIONS RESET = GLOBAL ROTATE = landscape FTEXT = SWISSB DEVICE = PSCOLOR GACCESS = SASGASTD GSFNAME = GRAPH GSFMODE = REPLACE GUNIT = PCT BORDER CBACK = WHITE HTITLE = 2 HTEXT = 1 ; *===================================================================== ; data skewed ; n = 400 ; do i = 1 to n ; x = rannor(-1) ; y = exp(x) ; output ; end ; run ; %macro bootmedian(datafile, n, m, outfile) ; %do i = 1 %to &m ; proc iml ; y = j(&n, 1, 0) ; z = j(&n, 1, 0) ; use &datafile ; read all var {y} into y ; do j = 1 to &n ; yrandindex = 1 + int(&n * ranuni(-1)) ; z[j] = y[yrandindex] ; end ; varnames = "zvalues" ; create zboots from z [colname = varnames] ; append from z ; quit ; proc means data = zboots n mean stddev min max median noprint ; var zvalues ; output out = zmedian median = samplemedian ; run ; data outfile ; set outfile zmedian ; run ; %end ; %mend ; data outfile ; run ; proc means data = skewed n mean var stddev stderr median ; var y ; title1 'Statistics for the original data ...' ; run ; %bootmedian(skewed, 400, 1000, outfile) ; proc sort data = outfile ; by samplemedian ; run ; proc univariate data = outfile ; var samplemedian ; run ; proc print data = outfile ; var samplemedian ; run ; proc sort data = skewed ; by y ; run ; data skewed ; retain cumulative 0 casenum 0 cumold 0 yold ; retain cumlognormal 0 ; set skewed ; if casenum eq 0 then yold = y ; casenum = casenum + 1 ; cumlognormal = probnorm(log(y)) ; cumulative = cumulative + 1/ 400 ; yold = y ; output ; run ; proc print data = skewed ; var y ; title 'Printout of original data ordered by y ...' ; run ; symbol1 i = j v = none c = black w = 3 h = 3 ; proc gplot data = skewed ; where y le 2 ; plot cumulative * y ; title1 h = 2 'The Empirical CDF for Lognormal data' ; run ; goptions gsfmode = append ; proc gplot data = skewed ; where y le 2 ; plot cumlognormal * y ; title1 h = 2 'The True CDF for the Lognormal' ; run ; endsas ; *===================================================================; Statistics for the original data ... 1 13:47 Sunday, October 17, 2010 The MEANS Procedure Analysis Variable : y N Mean Variance Std Dev Std Error Median ------------------------------------------------------------------------------ 400 1.8081035 7.8903770 2.8089815 0.1404491 1.0516644 ------------------------------------------------------------------------------ ~john-c/5421/bootstrap.sas 17OCT10 13:47 Statistics for the original data ... 2 13:47 Sunday, October 17, 2010 The UNIVARIATE Procedure Variable: samplemedian Moments N 1000 Sum Weights 1000 Mean 1.05297519 Sum Observations 1052.97519 Std Deviation 0.06992941 Variance 0.00489012 Skewness 0.47589645 Kurtosis 0.2333124 Uncorrected SS 1113.64198 Corrected SS 4.885232 Coeff Variation 6.64112592 Std Error Mean 0.00221136 Basic Statistical Measures Location Variability Mean 1.052975 Std Deviation 0.06993 Median 1.052809 Variance 0.00489 Mode 1.016383 Range 0.41545 Interquartile Range 0.08393 NOTE: The mode displayed is the smallest of 5 modes with a count of 19. Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 476.1659 Pr > |t| <.0001 Sign M 500 Pr >= |M| <.0001 Signed Rank S 250250 Pr >= |S| <.0001 Quantiles (Definition 5) Quantile Estimate 100% Max 1.307558 99% 1.237299 95% 1.188665 90% 1.146497 75% Q3 1.090729 50% Median 1.052809 25% Q1 1.006804 10% 0.957871 5% 0.950592 1% 0.918669 0% Min 0.892106 ~john-c/5421/bootstrap.sas 17OCT10 13:47 Statistics for the original data ... 3 13:47 Sunday, October 17, 2010 The UNIVARIATE Procedure Variable: samplemedian Extreme Observations ------Lowest------ -----Highest----- Value Obs Value Obs 0.892106 3 1.24358 997 0.892106 2 1.25550 998 0.892466 4 1.27638 999 0.892826 5 1.27638 1000 0.898620 6 1.30756 1001 Missing Values -----Percent Of----- Missing Missing Value Count All Obs Obs . 1 0.10 100.00 ~john-c/5421/bootstrap.sas 17OCT10 13:47 Statistics for the original data ... 4 13:47 Sunday, October 17, 2010 Obs samplemedian 1 . 2 0.89211 3 0.89211 4 0.89247 5 0.89283 6 0.89862 7 0.89873 8 0.90154 9 0.90154 10 0.90839 11 0.91528 12 0.92206 13 0.92211 14 0.92211 15 0.92216 16 0.92216 17 0.92341 18 0.92552 19 0.92637 20 0.93073 21 0.93158 22 0.93158 23 0.93410 24 0.93680 25 0.93680 26 0.93932 Middle observations omitted ... 974 1.20931 975 1.20931 976 1.21232 977 1.21232 978 1.21232 979 1.21232 980 1.22435 981 1.22584 982 1.22584 983 1.22733 984 1.23144 985 1.23144 986 1.23144 987 1.23407 988 1.23556 989 1.23599 990 1.23641 991 1.23641 992 1.23819 993 1.23861 994 1.24082 995 1.24082 996 1.24082 997 1.24358 998 1.25550 999 1.27638 1000 1.27638 1001 1.30756 ~john-c/5421/bootstrap.sas 17OCT10 13:47 Note that this program yields two ways of computing a 95% confidence interval for the median. One is by estimating the sample median +/- 1.96 * std err, where the standard error is the square root of the variance of the bootstrap median estimates. The other is by looking at lower 2.5% and upper 2.5% of the bootstrap sample medians. In general these are not identical. In general preference would be given to the latter method, because the former method depends on asymptotic normality of the mean of the bootstrap sample medians. The values obtained by these two methods are the following: median +/- 1.96 * stderr = 1.05166 +/- 1.96 * .06993 = (0.9146, 1.18872) 95% CL from the upper and lower 2.5% of the bootstrap sample medians: (0.9393, 1.2123) There is an analytic expression for the large-sample variance of the median. It is: var(x_median) = 1 / (4 * N * f^2(x_median)), where N is the sample size and f(.) is the pdf for the distribution. See: http://mathworld.wolfram.com/StatisticalMedian.html This is not terribly useful in real life, because in those situations where you might want to use the bootstrap, it will often be true that you don't know what the pdf f(.) might be. However, in the case of the lognormal, this can be computed. The median for the standard lognormal is exp(0) = 1. The cdf is given by F(x) = (1/sqrt(2 * pi)) * integral [-infinity to log(x)] exp(-u^2 / 2) du] The pdf is the derivative of the cdf. Therefore the pdf is f(x) = F'(x) = (1/sqrt(2 * pi)) * (1/x) * exp(-log(x)^2 / 2). Then note that the median of x is at the value x = 1. Therefore f(x_median) = f(1) = (1/sqrt( 2 * pi)) * 1/1 * exp(0) = 1/sqrt(2*pi) = 1/sqrt(6.283). Therefore the variance is 1/(4*N*f^2(1)) = 6.28/(4 * N). In the example above, N = 400. So the asymptotic variance is 6.28 / (4 * 400) = .003925. You can compare this to the bootstrap estimate in the data above: variance = approx. .00489. This is reasonably good agreement. ======================================================================================================= Problem 1: Generate a file with 1000 pseudorandom observations from the exponential distribution. Let Y = abs(mean(X) - 1). Use the bootstrap to estimate (1) stderr(Y), and (2) 95% confidence limits for the expectation of Y. ======================================================================================================= /home/gnome/john-c/5421/notes.016b Last date revised: October 17, 2010.