SPH 5421 notes.006: Some program examples ... 1. Illustration of the Box-Muller algorithm for generating normal random variables from uniform random variables: See Box & Muller, Annals of Statistics, 1958. An Splus program is used to show the distribution of a sample generated by the Box-Muller program. 2. Example of using a sort-and-carry routine. The sorting algorithm employed in this program is called a 'bubble sort'. It is easy to write but it is not an efficient sort. For arrays of size 10 or less it performs adequately. More efficient sorting algorithms are described in the book 'Numerical Recipes', Flannery et al., and Knuth's book Sorting and Searching. 3. Example of a program to simulate 2 x 2 tables and compute chi-square statistics. =======================================================================; * SAS program boxmuller.sas which writes 1000 observations to the ; * SAS system file 'boxmull'. ; * ; libname sasplus '' ; footnote "~john-c/5421/boxmuller.sas &sysdate &systime" ; %let n = 10000 ; data rands ; twopi = 2 * 3.141592654 ; seed = today() ; low = -4 ; high = 4 ; h = (high - low) / &n ; do i = 1 to &n ; x = i * h + low ; r = ranuni(seed) ; s = ranuni(seed) ; z = sqrt(-2 * log(r)) * cos(twopi * s) ; pdfz = (1 / sqrt(twopi)) * exp(- x * x / 2) ; output ; end ; run ; data sasplus.boxmull ; set rands ; run ; proc univariate plot normal ; var z ; title "SAS program to illustrate the Box-Muller algorithm ... n = &n"; run ; =======================================================================; # Splus program hist.s, which (1) reads the SAS file boxmull, and (2) plots # two histograms and an overlaid probability density function. # The postscript graphics output file is hist.ps, which can be viewed and # printed using ghostview: ghostview hist.ps. # postscript("hist.ps", horizontal = F, onefile = T) par(mfrow = c(2,1), oma = c(2, 2, 2, 2)) rm(sasdat) sasdat <- sas.get(library = "/home/walleye/john-c/5421", mem = "boxmull") r <- sasdat[, "r"] s <- sasdat[, "s"] z <- sasdat[, "z"] x <- sasdat[, "x"] nobs <- max(sasdat[, "i"]) pdfz <- sasdat[, "pdfz"] y <- nobs * pdfz hist(r, nclass = 20, xlim = c(0, 1), cex = .7, main = "SPH 5421 Histogram example: r (U[0, 1])") hist(z, nclass = 20, xlim = c(-5, 5), cex = .7, main = "SPH 5421 Histogram example: z (N(0, 1))") par(new = T, xaxs = "d") plot(x, y, axes = F, type = "l") # rm(r, s, x, z, y) dev.off() =======================================================================; PRINTOUT FROM boxmuller.sas: SAS program to illustrate the Box-Muller algorithm ... n = 10000 1 18:28 Monday, September 22, 2003 Univariate Procedure Variable=Z Moments Quantiles(Def=5) Extremes N 10000 Sum Wgts 10000 100% Max 3.924911 99% 2.312681 Lowest Obs Highest Obs Mean 0.004308 Sum 43.08068 75% Q3 0.686031 95% 1.636393 -3.7367( 9062) 3.409322( 1858) Std Dev 1.005764 Variance 1.01156 50% Med 0.010654 90% 1.27482 -3.6254( 1192) 3.474899( 6128) Skewness -0.00433 Kurtosis -0.05557 25% Q1 -0.67618 10% -1.31243 -3.50457( 3876) 3.543545( 6560) USS 10114.78 CSS 10114.59 0% Min -3.7367 5% -1.66405 -3.49485( 5332) 3.610974( 280) CV 23346.05 Std Mean 0.010058 1% -2.29995 -3.1509( 8977) 3.924911( 5299) T:Mean=0 0.428338 Pr>|T| 0.6684 Range 7.661615 Num ^= 0 10000 Num > 0 5043 Q3-Q1 1.362207 M(Sign) 43 Pr>=|M| 0.3953 Mode -3.7367 Sgn Rank 182589 Pr>=|S| 0.5271 D:Normal 0.008707 Pr>D 0.0660 Histogram # Boxplot Normal Probability Plot 3.75+* 3 0 3.75+ * .* 13 0 | * .** 46 0 | * 2.25+***** 166 | 2.25+ ****** .************ 436 | | ****** .************************* 944 | | ****** 0.75+***************************************** 1567 +-----+ 0.75+ ****** .************************************************ 1868 *--+--* | ****** .************************************************ 1872 | | | ****** -0.75+************************************** 1466 +-----+ -0.75+ ****** .************************ 922 | | ****** .************ 462 | | ****** -2.25+***** 182 | -2.25+****** .** 44 0 |* .* 6 0 |* -3.75+* 3 0 -3.75+* ----+----+----+----+----+----+----+----+----+--- +----+----+----+----+----+----+----+----+----+----+ * may represent up to 39 counts -2 -1 0 +1 +2 ~john-c/5421/boxmuller.sas 22SEP03 18:28 =======================================================================; SORTING SMALL ARRAYS SAS has the feature that, within an observation in a data set, it is possible to define arrays. An array is a collection of variables with some sort of index (usually an integer). Here is an example: data examp ; array DBP(10) ; In this example, the array is called 'DBP' and could be used to denote diastolic blood pressure of an individual at 10 different time points. For example, DBP(4) would represent the person's diastolic BP at time point 4. It is frequently desirable to sort arrays by the values of the variables at the various 'time points'. However SAS does not have a function for sorting arrays. SAS does provide a way of sorting entire files (PROC SORT), but there is no built-in function to sort arrays. have a function for sorting arrays. Here is a simple program that can be used to sort arrays. This is called a 'bubble sort'. Assume that DBP(1), DBP(2), ..., DBP(10) are diastolic blood pressures. The bubble sort algorithm is as follows for sorting these values in ascending order: ==================================================================== do i = 2 to 10 ; do j = 1 to i - 1 ; temp = DBP(i) ; if DBP(j) > DBP(i) then do ; DBP(i) = DBP(j) ; DBP(j) = temp ; end ; end ; end ; Here is how this program works with the following data: data dbps ; array dbp(10) ; DBP(1) = 102; DBP(2) = 86; DBP(3) = 98; DBP(4) = 76; DBP(5) = 66 ; DBP(6) = 100; DBP(7) = 88; DBP(98 = 90; DBP(9) = 94; DBP(10) = 64 ; n = 10 ; do i = 2 to n ; do j = 1 to i - 1 ; temp = DBP(i) ; if DBP(j) > DBP(i) then do ; DBP(i) = DBP(j) ; DBP(j) = temp ; end ; output ; end ; end ; run ; proc print data = dbps ; var DBP1 DBP2 DBP3 DBP4 DBP5 DBP6 DBP7 DBP8 DBP9 DBP10 ; title 'Step-by-step execution of a bubble sort:' ; ==================================================================== Here is the printout from this program: Step-by-step execution of a bubble sort: 1 17:13 Sunday, September 12, 2010 Obs dbp1 dbp2 dbp3 dbp4 dbp5 dbp6 dbp7 dbp8 dbp9 dbp10 1 86 102 98 76 66 100 88 90 94 64 2 86 102 98 76 66 100 88 90 94 64 3 86 98 102 76 66 100 88 90 94 64 4 76 98 102 86 66 100 88 90 94 64 5 76 86 102 98 66 100 88 90 94 64 6 76 86 98 102 66 100 88 90 94 64 7 66 86 98 102 76 100 88 90 94 64 8 66 76 98 102 86 100 88 90 94 64 9 66 76 86 102 98 100 88 90 94 64 10 66 76 86 98 102 100 88 90 94 64 11 66 76 86 98 102 100 88 90 94 64 12 66 76 86 98 102 100 88 90 94 64 13 66 76 86 98 102 100 88 90 94 64 14 66 76 86 98 102 100 88 90 94 64 15 66 76 86 98 100 102 88 90 94 64 16 66 76 86 98 100 102 88 90 94 64 17 66 76 86 98 100 102 88 90 94 64 18 66 76 86 98 100 102 88 90 94 64 19 66 76 86 88 100 102 98 90 94 64 20 66 76 86 88 98 102 100 90 94 64 21 66 76 86 88 98 100 102 90 94 64 22 66 76 86 88 98 100 102 90 94 64 23 66 76 86 88 98 100 102 90 94 64 24 66 76 86 88 98 100 102 90 94 64 25 66 76 86 88 98 100 102 90 94 64 26 66 76 86 88 90 100 102 98 94 64 27 66 76 86 88 90 98 102 100 94 64 28 66 76 86 88 90 98 100 102 94 64 29 66 76 86 88 90 98 100 102 94 64 30 66 76 86 88 90 98 100 102 94 64 31 66 76 86 88 90 98 100 102 94 64 32 66 76 86 88 90 98 100 102 94 64 33 66 76 86 88 90 98 100 102 94 64 34 66 76 86 88 90 94 100 102 98 64 35 66 76 86 88 90 94 98 102 100 64 36 66 76 86 88 90 94 98 100 102 64 37 64 76 86 88 90 94 98 100 102 66 38 64 66 86 88 90 94 98 100 102 76 39 64 66 76 88 90 94 98 100 102 86 40 64 66 76 86 90 94 98 100 102 88 41 64 66 76 86 88 94 98 100 102 90 42 64 66 76 86 88 90 98 100 102 94 43 64 66 76 86 88 90 94 100 102 98 44 64 66 76 86 88 90 94 98 102 100 45 64 66 76 86 88 90 94 98 100 102 ~john-c/5421/bubble.sort.sas 12SEP10 17:13 ============================================================================= As you can see, 45 steps were required for the program to sort this array of 10 elements. The bubble sort is not very efficient, and it is not recommended for large arrays. For small arrays (in general 20 elements or less), its slow speed is usually not a problem. For large arrays, there are much faster sorting algorithms: quicksort, heapsort, and various merge sorts are asymptotically quite efficient. However programs to execute the faster sorts are considerably more complicated than that for the bubble sort. ======================================================================================== It is often desirable to do what is called 'sort and carry'. This means that you have two arrays with the same number of elements. You want to sort one of them, and permute the other one in parallel with the sorting of the first one. Here is a modification of the bubble sort program which does this: data dbps ; array dbp(10) ; array idx(10) ; DBP(1) = 102; DBP(2) = 86; DBP(3) = 98; DBP(4) = 76; DBP(5) = 66 ; DBP(6) = 100; DBP(7) = 88; DBP(98 = 90; DBP(9) = 94; DBP(10) = 64 ; idx(1) = 1 ; idx(2) = 2 ; idx(3) = 3 ; idx(4) = 4 ; idx(5) = 5 ; idx(6) = 6 ; idx(7) = 7 ; idx(8) = 8 ; idx(9) = 9 ; idx(10) = 10 ; n = 10 ; do i = 2 to n ; do j = 1 to i - 1 ; tempdbp = DBP(i) ; tempidx = idx(i) ; if DBP(j) > DBP(i) then do ; DBP(i) = DBP(j) ; DBP(j) = tempdbp ; idx(i) = idx(j) ; idx(j) = tempidx ; end ; output ; end ; end ; run ; ========================================================================= The above program will have the effect of permuting the array idx in parallel with the sorting of the array dbp. =======================================================================; * SPH 5421 SAS program ~john-c/5421/sim.sas ; * Example of simulating 2 x 2 tables and computing statistics ... ; * ; * SPH 5421 * ; * Example of simulating 2 x 2 tables and computing statistics ... * ; * * ; options linesize = 80 ; data sim ; seed = 20000122 ; pa = .5 ; pb = .7 ; na = 100 ; nb = 100 ; n = 3 ; do i = 1 to n ; a = ranbin(seed, na, pa) ; c = na - a ; b = ranbin(seed, nb, pb) ; d = nb - b ; m0 = a + c ; m1 = b + d ; n0 = a + b ; n1 = c + d ; total = a + b + c + d ; chisquar = total * (a * d - b * c)**2 / (m0 * m1 * n0 * n1) ; prob = 1 - probchi(chisquar, 1) ; fisherleft = probhypr(a + b + c + d, a + b, a + c, a) ; tableprob = probhypr(a + b + c + d, a + b, a + c, a) - probhypr(a + b + c + d, a + b, a + c, a - 1) ; output ; end ; run ; proc print ; var a b c d chisquar prob fisherleft tableprob ; title 'Simulated 2 x 2 tables: non-PROC computation of probabilities' ; data tabs ; set sim ; x = 0 ; y = 0 ; m = a ; output ; x = 0 ; y = 1 ; m = b ; output ; x = 1 ; y = 0 ; m = c ; output ; x = 1 ; y = 1 ; m = d ; output ; run ; proc freq data = tabs ; by i ; weight m ; tables x * y / chisqr ; title1 'PROC FREQ computation of chi-squares ...' ; *----------------------------------------------------------------------; * * Below is the output from sim.sas * =======================================================================; Simulated 2 x 2 tables: non-PROC computation of probabilities 1 18:10 Monday, October 1, 2007 Obs a b c d chisquar prob fisherleft tableprob 1 59 68 41 32 1.74738 0.18621 0.11995 0.049125 2 56 69 44 31 3.60533 0.05759 0.03966 0.019396 3 62 73 38 27 2.75783 0.09678 0.06540 0.030589 PROC FREQ computation of chi-squares ... 2 18:10 Monday, October 1, 2007 ------------------------------------- i=1 -------------------------------------- The FREQ Procedure Table of x by y x y Frequency| Percent | Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 0 | 59 | 68 | 127 | 29.50 | 34.00 | 63.50 | 46.46 | 53.54 | | 59.00 | 68.00 | ---------+--------+--------+ 1 | 41 | 32 | 73 | 20.50 | 16.00 | 36.50 | 56.16 | 43.84 | | 41.00 | 32.00 | ---------+--------+--------+ Total 100 100 200 50.00 50.00 100.00 Statistics for Table of x by y Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 1.7474 0.1862 Likelihood Ratio Chi-Square 1 1.7507 0.1858 Continuity Adj. Chi-Square 1 1.3806 0.2400 Mantel-Haenszel Chi-Square 1 1.7386 0.1873 Phi Coefficient -0.0935 Contingency Coefficient 0.0931 Cramer's V -0.0935 Fisher's Exact Test ---------------------------------- Cell (1,1) Frequency (F) 59 Left-sided Pr <= F 0.1200 Right-sided Pr >= F 0.9292 Table Probability (P) 0.0491 Two-sided Pr <= P 0.2399 Sample Size = 200 PROC FREQ computation of chi-squares ... 3 18:10 Monday, October 1, 2007 ------------------------------------- i=2 -------------------------------------- The FREQ Procedure Table of x by y x y Frequency| Percent | Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 0 | 56 | 69 | 125 | 28.00 | 34.50 | 62.50 | 44.80 | 55.20 | | 56.00 | 69.00 | ---------+--------+--------+ 1 | 44 | 31 | 75 | 22.00 | 15.50 | 37.50 | 58.67 | 41.33 | | 44.00 | 31.00 | ---------+--------+--------+ Total 100 100 200 50.00 50.00 100.00 Statistics for Table of x by y Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 3.6053 0.0576 Likelihood Ratio Chi-Square 1 3.6192 0.0571 Continuity Adj. Chi-Square 1 3.0720 0.0797 Mantel-Haenszel Chi-Square 1 3.5873 0.0582 Phi Coefficient -0.1343 Contingency Coefficient 0.1331 Cramer's V -0.1343 Fisher's Exact Test ---------------------------------- Cell (1,1) Frequency (F) 56 Left-sided Pr <= F 0.0397 Right-sided Pr >= F 0.9797 Table Probability (P) 0.0194 Two-sided Pr <= P 0.0793 Sample Size = 200 PROC FREQ computation of chi-squares ... 4 18:10 Monday, October 1, 2007 ------------------------------------- i=3 -------------------------------------- The FREQ Procedure Table of x by y x y Frequency| Percent | Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 0 | 62 | 73 | 135 | 31.00 | 36.50 | 67.50 | 45.93 | 54.07 | | 62.00 | 73.00 | ---------+--------+--------+ 1 | 38 | 27 | 65 | 19.00 | 13.50 | 32.50 | 58.46 | 41.54 | | 38.00 | 27.00 | ---------+--------+--------+ Total 100 100 200 50.00 50.00 100.00 Statistics for Table of x by y Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 2.7578 0.0968 Likelihood Ratio Chi-Square 1 2.7678 0.0962 Continuity Adj. Chi-Square 1 2.2792 0.1311 Mantel-Haenszel Chi-Square 1 2.7440 0.0976 Phi Coefficient -0.1174 Contingency Coefficient 0.1166 Cramer's V -0.1174 Fisher's Exact Test ---------------------------------- Cell (1,1) Frequency (F) 62 Left-sided Pr <= F 0.0654 Right-sided Pr >= F 0.9652 Table Probability (P) 0.0306 Two-sided Pr <= P 0.1308 Sample Size = 200 ====================================================================== ~john-c/5421/notes.006 Last update: September 13, 2010.