MYSTERIES OF THE SAS DATA STEP SPH 5421 notes.011 Probably the most complicated aspect of SAS is the DATA step. It is not obvious what the DATA step actually is, and it is not easy to predict what it produces when it is used with multiple SET statements, MERGE statements, and other statements. The role of the OUTPUT statement is also not necessarily clear. The SAS Language manual is the best source of information on the DATA step. It is particularly important to understand the SET statement. Below is an example which may shed some light on how the DATA step works. ================================================================================= * Program to explore the SAS DATA step, SET statement, MERGE statement; * program file: /home/gnome/john-c/5421/datastep.sas ; * ; options linesize = 80 ; data examp1 ; seed = 20000128 ; n = 10 ; do casenum = 1 to n ; x = rannor(seed) ; y = rannor(seed) ; output ; end ; run ; proc print data = examp1 ; title 'Listing of dataset examp1' ; data examp2 ; set examp1 ; if casenum eq 5 then output ; run ; proc print data = examp2 ; title 'Listing of dataset examp2' ; data examp3 ; seed = 200001281 ; n = 12 ; do casenum = 1 to 12 ; y = casenum ; z = -casenum * casenum ; output ; end ; run ; proc print data = examp3 ; title1 'Listing of dataset examp3' ; data examp4 ; set examp1 examp3 ; proc print data = examp4 ; title1 'Listing of dataset examp4' ; data examp5 ; merge examp1 examp3 ; by casenum ; proc print data = examp5 ; title1 'Listing of dataset examp5' ; data examp6 ; merge examp3 examp1 ; by casenum ; run ; proc print data = examp6 ; title1 'Listing of examp6' ; data examp7 ; set examp1 ; if casenum eq 7 then set examp3 ; run ; proc print data = examp7 ; title1 'Listing of examp7' ; data examp8 ; set examp1 examp3 ; by casenum ; proc print data = examp8 ; title1 'Listing of examp8' ; endsas ; ================================================================================= Listing of dataset examp1 1 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM X Y 1 20000128 10 1 -2.28501 1.19334 2 20000128 10 2 1.05265 -0.26866 3 20000128 10 3 -0.43929 -1.57871 4 20000128 10 4 0.83069 -0.41444 5 20000128 10 5 0.80023 -1.05532 6 20000128 10 6 1.14747 0.75366 7 20000128 10 7 -1.13534 0.98361 8 20000128 10 8 -1.35994 -1.19237 9 20000128 10 9 2.01172 1.67323 10 20000128 10 10 0.80378 -0.33271 ------------------------------------------------------------------------ Listing of dataset examp2 2 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM X Y 1 20000128 10 5 0.80023 -1.05532 ------------------------------------------------------------------------ Listing of dataset examp3 3 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM Y Z 1 200001281 12 1 1 -1 2 200001281 12 2 2 -4 3 200001281 12 3 3 -9 4 200001281 12 4 4 -16 5 200001281 12 5 5 -25 6 200001281 12 6 6 -36 7 200001281 12 7 7 -49 8 200001281 12 8 8 -64 9 200001281 12 9 9 -81 10 200001281 12 10 10 -100 11 200001281 12 11 11 -121 12 200001281 12 12 12 -144 ------------------------------------------------------------------------ Listing of dataset examp4 4 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM X Y Z 1 20000128 10 1 -2.28501 1.1933 . 2 20000128 10 2 1.05265 -0.2687 . 3 20000128 10 3 -0.43929 -1.5787 . 4 20000128 10 4 0.83069 -0.4144 . 5 20000128 10 5 0.80023 -1.0553 . 6 20000128 10 6 1.14747 0.7537 . 7 20000128 10 7 -1.13534 0.9836 . 8 20000128 10 8 -1.35994 -1.1924 . 9 20000128 10 9 2.01172 1.6732 . 10 20000128 10 10 0.80378 -0.3327 . 11 200001281 12 1 . 1.0000 -1 12 200001281 12 2 . 2.0000 -4 13 200001281 12 3 . 3.0000 -9 14 200001281 12 4 . 4.0000 -16 15 200001281 12 5 . 5.0000 -25 16 200001281 12 6 . 6.0000 -36 17 200001281 12 7 . 7.0000 -49 18 200001281 12 8 . 8.0000 -64 19 200001281 12 9 . 9.0000 -81 20 200001281 12 10 . 10.0000 -100 21 200001281 12 11 . 11.0000 -121 22 200001281 12 12 . 12.0000 -144 ------------------------------------------------------------------------ Listing of dataset examp5 5 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM X Y Z 1 200001281 12 1 -2.28501 1 -1 2 200001281 12 2 1.05265 2 -4 3 200001281 12 3 -0.43929 3 -9 4 200001281 12 4 0.83069 4 -16 5 200001281 12 5 0.80023 5 -25 6 200001281 12 6 1.14747 6 -36 7 200001281 12 7 -1.13534 7 -49 8 200001281 12 8 -1.35994 8 -64 9 200001281 12 9 2.01172 9 -81 10 200001281 12 10 0.80378 10 -100 11 200001281 12 11 . 11 -121 12 200001281 12 12 . 12 -144 ------------------------------------------------------------------------ Listing of examp6 6 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM Y Z X 1 20000128 10 1 1.1933 -1 -2.28501 2 20000128 10 2 -0.2687 -4 1.05265 3 20000128 10 3 -1.5787 -9 -0.43929 4 20000128 10 4 -0.4144 -16 0.83069 5 20000128 10 5 -1.0553 -25 0.80023 6 20000128 10 6 0.7537 -36 1.14747 7 20000128 10 7 0.9836 -49 -1.13534 8 20000128 10 8 -1.1924 -64 -1.35994 9 20000128 10 9 1.6732 -81 2.01172 10 20000128 10 10 -0.3327 -100 0.80378 11 200001281 12 11 11.0000 -121 . 12 200001281 12 12 12.0000 -144 . ------------------------------------------------------------------------ Listing of examp7 7 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM X Y Z 1 20000128 10 1 -2.28501 1.19334 . 2 20000128 10 2 1.05265 -0.26866 . 3 20000128 10 3 -0.43929 -1.57871 . 4 20000128 10 4 0.83069 -0.41444 . 5 20000128 10 5 0.80023 -1.05532 . 6 20000128 10 6 1.14747 0.75366 . 7 200001281 12 1 -1.13534 1.00000 -1 8 20000128 10 8 -1.35994 -1.19237 -1 9 20000128 10 9 2.01172 1.67323 -1 10 20000128 10 10 0.80378 -0.33271 -1 ------------------------------------------------------------------------ Listing of examp8 8 18:26 Tuesday, February 1, 2000 OBS SEED N CASENUM X Y Z 1 20000128 10 1 -2.28501 1.1933 . 2 200001281 12 1 . 1.0000 -1 3 20000128 10 2 1.05265 -0.2687 . 4 200001281 12 2 . 2.0000 -4 5 20000128 10 3 -0.43929 -1.5787 . 6 200001281 12 3 . 3.0000 -9 7 20000128 10 4 0.83069 -0.4144 . 8 200001281 12 4 . 4.0000 -16 9 20000128 10 5 0.80023 -1.0553 . 10 200001281 12 5 . 5.0000 -25 11 20000128 10 6 1.14747 0.7537 . 12 200001281 12 6 . 6.0000 -36 13 20000128 10 7 -1.13534 0.9836 . 14 200001281 12 7 . 7.0000 -49 15 20000128 10 8 -1.35994 -1.1924 . 16 200001281 12 8 . 8.0000 -64 17 20000128 10 9 2.01172 1.6732 . 18 200001281 12 9 . 9.0000 -81 19 20000128 10 10 0.80378 -0.3327 . 20 200001281 12 10 . 10.0000 -100 21 200001281 12 11 . 11.0000 -121 22 200001281 12 12 . 12.0000 -144 ================================================================================= Notes on the program and output: 1. The first 'proc print' section just prints the 10 observations in dataset examp1. 2. Dataset examp2 consists of only one observation. If the 'output' statement in the 'data examp2' section were omitted, examp2 would have 10 observations. On the other hand, if such a statement were included but never executed, as follows, if casenum eq 35 then output; then the datafile examp2 would have NO observations. Evidently the SAS compiler looks to see if there are any 'output' statements in a given data step before it decides how that datastep will be executed. 3. Dataset examp3 is a new dataset. Note that it includes variables y and z, but not x. (z is set equal to the negative of the square of the observation number). Dataset examp3 has 12 observations. 4. Dataset examp4 is equal to dataset examp3 appended to the end of dataset examp1. This is the result of the 'set examp1 examp3' statement. Note that there are 22 observations (10 + 12). The first 10 are missing for y and z, and the last 12 are missing for x. 5. Dataset examp5 is the result of merging datasets examp1 and examp3 by the variable casenum. Note that the x-values come from dataset examp1 (because x is not defined in dataset examp3), but the y- and z-values come from dataset examp3. 6. Dataset examp6 is the result of merging datasets examp3 and examp1 (in reverse order from examp5). The results are correspondingly different. Note that both datasets examp5 and examp6 have 12 observations. (the maximum number from either of datasets examp1 and examp3). 7. Dataset examp7 is rather hard to explain. It is defined by first saying 'set examp1', but then at the 7th observation of examp1, the program says 'set examp3'. At that point the variables seed, n, casenum, y, and z are entered from the first observation of examp3. Then in the 8th through 10th observations, seed, n, casenum, and y are from examp1. But z is still from the first observation of dataset examp3. Strange! 8. Dataset examp8 illustrates that if the 'set' statement names more than one dataset and is followed by a 'by' statement, the result is that the output file is interleaved (alternates) between the datasets. The basic conclusion from all this is that, in general, SAS probably produces what you expect. But you should not hesitate to use 'proc print' to find out exactly what is in any given dataset. ======================================================================== The following program shows an alternative method of detecting the end of a datafile in SAS, pointed out by Yang Xie. It uses the 'end = ' option on the SET statement: options linesize = 80 ; data randend ; seed = today() ; randnum = int(20 * ranuni(seed)) ; call symput('arand', left(put(randnum, 3.0))) ; do i = 1 to randnum ; x = ranuni(seed) ; output ; end ; run ; data endonly ; set randend end = endmark ; irandnum = &arand ; if endmark eq 1 then output ; run ; proc print data = randend ; title 'Print of dataset randend:' ; proc print data = endonly ; title 'Print of dataset endonly' ; ======================================================================== Print of dataset randend: 1 17:13 Saturday, October 5, 2002 OBS SEED RANDNUM I X 1 15618 14 1 0.84556 2 15618 14 2 0.45601 3 15618 14 3 0.28804 4 15618 14 4 0.58906 5 15618 14 5 0.17365 6 15618 14 6 0.24873 7 15618 14 7 0.96133 8 15618 14 8 0.68310 9 15618 14 9 0.63640 10 15618 14 10 0.11744 11 15618 14 11 0.80980 12 15618 14 12 0.31235 13 15618 14 13 0.28872 14 15618 14 14 0.79861 Print of dataset endonly 2 17:13 Saturday, October 5, 2002 OBS SEED RANDNUM I X IRANDNUM 1 15618 14 14 0.79861 14 ======================================================================== PROBLEM 12 1. Refer back to notes009. Two file structures were shown, File Structure 1 and File Structure 2. They had equivalent data, with File Structure 1 having 1 observation for each value of casenum, and 5 weights in each observation. File Structure 2 had 5 observations for each value of casenum and 1 weight per observation. The example code showed how, given data in File Structure 1, you can put it into File Structure 2. The problem here is to do the reverse. Start with a file called struct2. The only variables on it are casenum, visit, and weight. Write out a file with the same data in the form of File Structure 1, with all 5 weights for a given casenum on the same observation. 1.1 Simplest case: struct2 is ordered by casenum and, within casenum, by visit, and there are always 5 observations per casenum. 1.2 Next simplest case: struct2 is not necessarily ordered by casenum and visit, but there are always 5 observations per casenum. 1.3 Worst case: struct2 is not ordered, and the number of observations per casenum varies between 1 and 5. Various hints: you will probably need to use the RETAIN statement. For parts 1.2 and 1.3 you will need to use PROC SORT. For part 1.3, you may need to initialize some things to missing. SOME QUESTIONS See the SAS program above. What will the resulting datasets be in the following examples? ------------------------------------------------------------------------------ data examp8 ; set examp1 ; set examp3 ; ------------------------------------------------------------------------------ data examp9 ; set examp1 ; if casenum le 5 then set examp3 ; ------------------------------------------------------------------------------ data examp10 ; merge examp1 examp3 ; ------------------------------------------------------------------------------ ======================================================================== EXAMPLE: Use of macro variables, multiple output files, 'file' statement, and 'put' statements: ------------------------------------------------------------------------ options linesize = 80 MPRINT ; footnote "~john-c/5421/output.sas &sysdate &systime" ; %let infile = xdata ; * macro variable definition; %let outfile = lastobs ; * macro variable definition; data &infile &outfile ; * use of macro variable ; infile "&infile" end = endmark ; retain n 0 xsum 0 x2sum 0 xmin 9999999 xmax -9999999 ; input x ; if x ne . then do ; n = n + 1 ; xsum = xsum + x ; x2sum = x2sum + x*x ; if x lt xmin and x ne . then xmin = x ; if x gt xmax then xmax = x ; keep x n xsum x2sum xmin xmax ; output &infile ; * use of macro variable ; end ; if endmark = 1 then do ; xmean = xsum / n ; xvar = (x2sum - xsum*xsum/n)/(n - 1) ; xstd = sqrt(xvar) ; xserr = xstd/sqrt(n) ; keep n xmean xvar xstd xserr xmin xmax ; output &outfile ; end ; run ; data &outfile ; * use of macro var ; set &outfile ; * use of macro var ; file "&outfile" ; * use of macro var ; put "Summary Stats for dataset = &outfile" ; * use of macro var; put '------------------------------------' ; put ' ' ; put 'n = ' n ; put 'xmean = ' xmean ; put 'xvar = ' xvar ; * put statements ... ; put 'xstd = ' xstd ; put 'xserr = ' xserr ; put 'xmin = ' xmin ; put 'xmax = ' xmax ; put ' ' ; put "Output file: &outfile" ; * use of macro variable ; run ; proc print data = &infile ; title "Original data ... : dataset &infile" ; * use of macro var ; proc print data = &outfile ; title "Summary statistics, mean, variance, etc: dataset = &outfile" ; proc means data = xdata n mean std stderr min max ; var x ; title "Output from proc means: dataset &infile" ; * use of macro var; ======================================================================== Original data ... : dataset xdata 1 19:44 Wednesday, September 29, 2004 OBS N XSUM X2SUM XMIN XMAX X XMEAN XVAR XSTD XSERR 1 1 13.000 169 13.0000 13 13.000 . . . . 2 2 31.000 493 13.0000 18 18.000 . . . . 3 3 36.000 518 5.0000 18 5.000 . . . . 4 4 736.000 490518 5.0000 700 700.000 . . . . 5 5 836.000 500518 5.0000 700 100.000 . . . . 6 6 874.000 501962 5.0000 700 38.000 . . . . 7 7 874.000 501962 0.0001 700 0.000 . . . . 8 8 855.000 502323 -19.0000 700 -19.000 . . . . ~john-c/5421/output.sas 29SEP04 19:44 Summary statistics, mean, variance, etc: dataset = lastobs 2 19:44 Wednesday, September 29, 2004 OBS N XSUM X2SUM XMIN XMAX X XMEAN XVAR XSTD XSERR 1 8 855.000 502323 -19 700 -19 106.875 58706.41 242.294 85.6639 ~john-c/5421/output.sas 29SEP04 19:44 Output from proc means: dataset xdata 3 19:44 Wednesday, September 29, 2004 Analysis Variable : X N Mean Std Dev Std Error Minimum Maximum ----------------------------------------------------------------------- 8 106.8750125 242.2940521 85.6638836 -19.0000000 700.0000000 ----------------------------------------------------------------------- ~john-c/5421/output.sas 29SEP04 19:44 ======================================================================== Print of output file 'lastobs': Summary Stats for dataset = lastobs ------------------------------------ n = 8 xmean = 106.8750125 xvar = 58706.407661 xstd = 242.29405205 xserr = 85.663883624 xmin = -19 xmax = 700 Output file: lastobs ======================================================================== Problem 12.A Assume you have data on two files, data1 and data2. On data1, you have heights for a number of women. On data2, you have heights for a number of men. Your task is to (1) find the minimum height and maximum height for the women on data1, and (2) find how many men in data2 are shorter than the minimum height for women, and how many are taller than the maximum height for women. Write a program which will do this computation by creating macro variables using the 'symput' routine, and without using any SAS Procedures. Test your program by generating two sets of random data, with 100 observations in each set. The simulated heights of women in the first set should be normally distributed with mean 66 and standard deviation 3, and those for men in the second set should have mean 70 and standard deviation 4. ======================================================================== /home/gnome/john-c/5421/notes.011 Last update: October 14, 2005.