MYSTERIES OF THE SAS DATA STEP SPH 5421 notes.011
Probably the most complicated aspect of SAS is the DATA step. It is not
obvious what the DATA step actually is, and it is not easy to predict what it
produces when it is used with multiple SET statements, MERGE statements, and
other statements. The role of the OUTPUT statement is also not necessarily
clear.
The SAS Language manual is the best source of information on the DATA
step. It is particularly important to understand the SET statement.
Below is an example which may shed some light on how the DATA step works.
=================================================================================
* Program to explore the SAS DATA step, SET statement, MERGE statement;
* program file: /home/gnome/john-c/5421/datastep.sas ;
* ;
options linesize = 80 ;
data examp1 ;
seed = 20000128 ;
n = 10 ;
do casenum = 1 to n ;
x = rannor(seed) ;
y = rannor(seed) ;
output ;
end ;
run ;
proc print data = examp1 ;
title 'Listing of dataset examp1' ;
data examp2 ;
set examp1 ;
if casenum eq 5 then output ;
run ;
proc print data = examp2 ;
title 'Listing of dataset examp2' ;
data examp3 ;
seed = 200001281 ;
n = 12 ;
do casenum = 1 to 12 ;
y = casenum ;
z = -casenum * casenum ;
output ;
end ;
run ;
proc print data = examp3 ;
title1 'Listing of dataset examp3' ;
data examp4 ;
set examp1 examp3 ;
proc print data = examp4 ;
title1 'Listing of dataset examp4' ;
data examp5 ;
merge examp1 examp3 ; by casenum ;
proc print data = examp5 ;
title1 'Listing of dataset examp5' ;
data examp6 ;
merge examp3 examp1 ; by casenum ;
run ;
proc print data = examp6 ;
title1 'Listing of examp6' ;
data examp7 ;
set examp1 ;
if casenum eq 7 then set examp3 ;
run ;
proc print data = examp7 ;
title1 'Listing of examp7' ;
data examp8 ;
set examp1 examp3 ;
by casenum ;
proc print data = examp8 ;
title1 'Listing of examp8' ;
endsas ;
=================================================================================
Listing of dataset examp1 1
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM X Y
1 20000128 10 1 -2.28501 1.19334
2 20000128 10 2 1.05265 -0.26866
3 20000128 10 3 -0.43929 -1.57871
4 20000128 10 4 0.83069 -0.41444
5 20000128 10 5 0.80023 -1.05532
6 20000128 10 6 1.14747 0.75366
7 20000128 10 7 -1.13534 0.98361
8 20000128 10 8 -1.35994 -1.19237
9 20000128 10 9 2.01172 1.67323
10 20000128 10 10 0.80378 -0.33271
------------------------------------------------------------------------
Listing of dataset examp2 2
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM X Y
1 20000128 10 5 0.80023 -1.05532
------------------------------------------------------------------------
Listing of dataset examp3 3
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM Y Z
1 200001281 12 1 1 -1
2 200001281 12 2 2 -4
3 200001281 12 3 3 -9
4 200001281 12 4 4 -16
5 200001281 12 5 5 -25
6 200001281 12 6 6 -36
7 200001281 12 7 7 -49
8 200001281 12 8 8 -64
9 200001281 12 9 9 -81
10 200001281 12 10 10 -100
11 200001281 12 11 11 -121
12 200001281 12 12 12 -144
------------------------------------------------------------------------
Listing of dataset examp4 4
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM X Y Z
1 20000128 10 1 -2.28501 1.1933 .
2 20000128 10 2 1.05265 -0.2687 .
3 20000128 10 3 -0.43929 -1.5787 .
4 20000128 10 4 0.83069 -0.4144 .
5 20000128 10 5 0.80023 -1.0553 .
6 20000128 10 6 1.14747 0.7537 .
7 20000128 10 7 -1.13534 0.9836 .
8 20000128 10 8 -1.35994 -1.1924 .
9 20000128 10 9 2.01172 1.6732 .
10 20000128 10 10 0.80378 -0.3327 .
11 200001281 12 1 . 1.0000 -1
12 200001281 12 2 . 2.0000 -4
13 200001281 12 3 . 3.0000 -9
14 200001281 12 4 . 4.0000 -16
15 200001281 12 5 . 5.0000 -25
16 200001281 12 6 . 6.0000 -36
17 200001281 12 7 . 7.0000 -49
18 200001281 12 8 . 8.0000 -64
19 200001281 12 9 . 9.0000 -81
20 200001281 12 10 . 10.0000 -100
21 200001281 12 11 . 11.0000 -121
22 200001281 12 12 . 12.0000 -144
------------------------------------------------------------------------
Listing of dataset examp5 5
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM X Y Z
1 200001281 12 1 -2.28501 1 -1
2 200001281 12 2 1.05265 2 -4
3 200001281 12 3 -0.43929 3 -9
4 200001281 12 4 0.83069 4 -16
5 200001281 12 5 0.80023 5 -25
6 200001281 12 6 1.14747 6 -36
7 200001281 12 7 -1.13534 7 -49
8 200001281 12 8 -1.35994 8 -64
9 200001281 12 9 2.01172 9 -81
10 200001281 12 10 0.80378 10 -100
11 200001281 12 11 . 11 -121
12 200001281 12 12 . 12 -144
------------------------------------------------------------------------
Listing of examp6 6
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM Y Z X
1 20000128 10 1 1.1933 -1 -2.28501
2 20000128 10 2 -0.2687 -4 1.05265
3 20000128 10 3 -1.5787 -9 -0.43929
4 20000128 10 4 -0.4144 -16 0.83069
5 20000128 10 5 -1.0553 -25 0.80023
6 20000128 10 6 0.7537 -36 1.14747
7 20000128 10 7 0.9836 -49 -1.13534
8 20000128 10 8 -1.1924 -64 -1.35994
9 20000128 10 9 1.6732 -81 2.01172
10 20000128 10 10 -0.3327 -100 0.80378
11 200001281 12 11 11.0000 -121 .
12 200001281 12 12 12.0000 -144 .
------------------------------------------------------------------------
Listing of examp7 7
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM X Y Z
1 20000128 10 1 -2.28501 1.19334 .
2 20000128 10 2 1.05265 -0.26866 .
3 20000128 10 3 -0.43929 -1.57871 .
4 20000128 10 4 0.83069 -0.41444 .
5 20000128 10 5 0.80023 -1.05532 .
6 20000128 10 6 1.14747 0.75366 .
7 200001281 12 1 -1.13534 1.00000 -1
8 20000128 10 8 -1.35994 -1.19237 -1
9 20000128 10 9 2.01172 1.67323 -1
10 20000128 10 10 0.80378 -0.33271 -1
------------------------------------------------------------------------
Listing of examp8 8
18:26 Tuesday, February 1, 2000
OBS SEED N CASENUM X Y Z
1 20000128 10 1 -2.28501 1.1933 .
2 200001281 12 1 . 1.0000 -1
3 20000128 10 2 1.05265 -0.2687 .
4 200001281 12 2 . 2.0000 -4
5 20000128 10 3 -0.43929 -1.5787 .
6 200001281 12 3 . 3.0000 -9
7 20000128 10 4 0.83069 -0.4144 .
8 200001281 12 4 . 4.0000 -16
9 20000128 10 5 0.80023 -1.0553 .
10 200001281 12 5 . 5.0000 -25
11 20000128 10 6 1.14747 0.7537 .
12 200001281 12 6 . 6.0000 -36
13 20000128 10 7 -1.13534 0.9836 .
14 200001281 12 7 . 7.0000 -49
15 20000128 10 8 -1.35994 -1.1924 .
16 200001281 12 8 . 8.0000 -64
17 20000128 10 9 2.01172 1.6732 .
18 200001281 12 9 . 9.0000 -81
19 20000128 10 10 0.80378 -0.3327 .
20 200001281 12 10 . 10.0000 -100
21 200001281 12 11 . 11.0000 -121
22 200001281 12 12 . 12.0000 -144
=================================================================================
Notes on the program and output:
1. The first 'proc print' section just prints the 10 observations in
dataset examp1.
2. Dataset examp2 consists of only one observation. If the 'output'
statement in the 'data examp2' section were omitted, examp2 would
have 10 observations. On the other hand, if such a statement were
included but never executed, as follows,
if casenum eq 35 then output;
then the datafile examp2 would have NO observations. Evidently the SAS
compiler looks to see if there are any 'output' statements in a given
data step before it decides how that datastep will be executed.
3. Dataset examp3 is a new dataset. Note that it includes variables y and
z, but not x. (z is set equal to the negative of the square of the
observation number). Dataset examp3 has 12 observations.
4. Dataset examp4 is equal to dataset examp3 appended to the end of dataset
examp1. This is the result of the 'set examp1 examp3' statement.
Note that there are 22 observations (10 + 12). The first 10 are missing
for y and z, and the last 12 are missing for x.
5. Dataset examp5 is the result of merging datasets examp1 and examp3 by
the variable casenum. Note that the x-values come from dataset examp1
(because x is not defined in dataset examp3), but the y- and z-values
come from dataset examp3.
6. Dataset examp6 is the result of merging datasets examp3 and examp1
(in reverse order from examp5). The results are correspondingly
different. Note that both datasets examp5 and examp6 have 12 observations.
(the maximum number from either of datasets examp1 and examp3).
7. Dataset examp7 is rather hard to explain. It is defined by first
saying 'set examp1', but then at the 7th observation of examp1,
the program says 'set examp3'. At that point the variables seed, n, casenum,
y, and z are entered from the first observation of examp3. Then in the
8th through 10th observations, seed, n, casenum, and y are from examp1.
But z is still from the first observation of dataset examp3. Strange!
8. Dataset examp8 illustrates that if the 'set' statement names more
than one dataset and is followed by a 'by' statement, the
result is that the output file is interleaved (alternates)
between the datasets.
The basic conclusion from all this is that, in general, SAS probably
produces what you expect. But you should not hesitate to use 'proc print'
to find out exactly what is in any given dataset.
========================================================================
The following program shows an alternative method of detecting
the end of a datafile in SAS, pointed out by Yang Xie. It uses
the 'end = ' option on the SET statement:
options linesize = 80 ;
data randend ;
seed = today() ;
randnum = int(20 * ranuni(seed)) ;
call symput('arand', left(put(randnum, 3.0))) ;
do i = 1 to randnum ;
x = ranuni(seed) ;
output ;
end ;
run ;
data endonly ;
set randend end = endmark ;
irandnum = &arand ;
if endmark eq 1 then output ;
run ;
proc print data = randend ;
title 'Print of dataset randend:' ;
proc print data = endonly ;
title 'Print of dataset endonly' ;
========================================================================
Print of dataset randend: 1
17:13 Saturday, October 5, 2002
OBS SEED RANDNUM I X
1 15618 14 1 0.84556
2 15618 14 2 0.45601
3 15618 14 3 0.28804
4 15618 14 4 0.58906
5 15618 14 5 0.17365
6 15618 14 6 0.24873
7 15618 14 7 0.96133
8 15618 14 8 0.68310
9 15618 14 9 0.63640
10 15618 14 10 0.11744
11 15618 14 11 0.80980
12 15618 14 12 0.31235
13 15618 14 13 0.28872
14 15618 14 14 0.79861
Print of dataset endonly 2
17:13 Saturday, October 5, 2002
OBS SEED RANDNUM I X IRANDNUM
1 15618 14 14 0.79861 14
========================================================================
PROBLEM 12
1. Refer back to notes009. Two file structures were shown, File Structure 1
and File Structure 2. They had equivalent data, with File Structure 1
having 1 observation for each value of casenum, and 5 weights in each
observation. File Structure 2 had 5 observations for each value of
casenum and 1 weight per observation. The example code showed how,
given data in File Structure 1, you can put it into File Structure 2.
The problem here is to do the reverse. Start with a file called struct2.
The only variables on it are casenum, visit, and weight. Write out
a file with the same data in the form of File Structure 1, with all
5 weights for a given casenum on the same observation.
1.1 Simplest case: struct2 is ordered by casenum and, within casenum,
by visit, and there are always 5 observations per casenum.
1.2 Next simplest case: struct2 is not necessarily ordered by casenum and
visit, but there are always 5 observations per casenum.
1.3 Worst case: struct2 is not ordered, and the number of observations
per casenum varies between 1 and 5.
Various hints: you will probably need to use the RETAIN statement. For
parts 1.2 and 1.3 you will need to use PROC SORT. For part 1.3, you may need
to initialize some things to missing.
SOME QUESTIONS
See the SAS program above. What will the resulting datasets be in the
following examples?
------------------------------------------------------------------------------
data examp8 ;
set examp1 ;
set examp3 ;
------------------------------------------------------------------------------
data examp9 ;
set examp1 ;
if casenum le 5 then set examp3 ;
------------------------------------------------------------------------------
data examp10 ;
merge examp1 examp3 ;
------------------------------------------------------------------------------
========================================================================
EXAMPLE: Use of macro variables, multiple output files, 'file'
statement, and 'put' statements:
------------------------------------------------------------------------
options linesize = 80 MPRINT ;
footnote "~john-c/5421/output.sas &sysdate &systime" ;
%let infile = xdata ; * macro variable definition;
%let outfile = lastobs ; * macro variable definition;
data &infile &outfile ; * use of macro variable ;
infile "&infile" end = endmark ;
retain n 0 xsum 0 x2sum 0 xmin 9999999 xmax -9999999 ;
input x ;
if x ne . then do ;
n = n + 1 ;
xsum = xsum + x ;
x2sum = x2sum + x*x ;
if x lt xmin and x ne . then xmin = x ;
if x gt xmax then xmax = x ;
keep x n xsum x2sum xmin xmax ;
output &infile ; * use of macro variable ;
end ;
if endmark = 1 then do ;
xmean = xsum / n ;
xvar = (x2sum - xsum*xsum/n)/(n - 1) ;
xstd = sqrt(xvar) ;
xserr = xstd/sqrt(n) ;
keep n xmean xvar xstd xserr xmin xmax ;
output &outfile ;
end ;
run ;
data &outfile ; * use of macro var ;
set &outfile ; * use of macro var ;
file "&outfile" ; * use of macro var ;
put "Summary Stats for dataset = &outfile" ; * use of macro var;
put '------------------------------------' ;
put ' ' ;
put 'n = ' n ;
put 'xmean = ' xmean ;
put 'xvar = ' xvar ; * put statements ... ;
put 'xstd = ' xstd ;
put 'xserr = ' xserr ;
put 'xmin = ' xmin ;
put 'xmax = ' xmax ;
put ' ' ;
put "Output file: &outfile" ; * use of macro variable ;
run ;
proc print data = &infile ;
title "Original data ... : dataset &infile" ; * use of macro var ;
proc print data = &outfile ;
title "Summary statistics, mean, variance, etc: dataset = &outfile" ;
proc means data = xdata n mean std stderr min max ;
var x ;
title "Output from proc means: dataset &infile" ; * use of macro var;
========================================================================
Original data ... : dataset xdata 1
19:44 Wednesday, September 29, 2004
OBS N XSUM X2SUM XMIN XMAX X XMEAN XVAR XSTD XSERR
1 1 13.000 169 13.0000 13 13.000 . . . .
2 2 31.000 493 13.0000 18 18.000 . . . .
3 3 36.000 518 5.0000 18 5.000 . . . .
4 4 736.000 490518 5.0000 700 700.000 . . . .
5 5 836.000 500518 5.0000 700 100.000 . . . .
6 6 874.000 501962 5.0000 700 38.000 . . . .
7 7 874.000 501962 0.0001 700 0.000 . . . .
8 8 855.000 502323 -19.0000 700 -19.000 . . . .
~john-c/5421/output.sas 29SEP04 19:44
Summary statistics, mean, variance, etc: dataset = lastobs 2
19:44 Wednesday, September 29, 2004
OBS N XSUM X2SUM XMIN XMAX X XMEAN XVAR XSTD XSERR
1 8 855.000 502323 -19 700 -19 106.875 58706.41 242.294 85.6639
~john-c/5421/output.sas 29SEP04 19:44
Output from proc means: dataset xdata 3
19:44 Wednesday, September 29, 2004
Analysis Variable : X
N Mean Std Dev Std Error Minimum Maximum
-----------------------------------------------------------------------
8 106.8750125 242.2940521 85.6638836 -19.0000000 700.0000000
-----------------------------------------------------------------------
~john-c/5421/output.sas 29SEP04 19:44
========================================================================
Print of output file 'lastobs':
Summary Stats for dataset = lastobs
------------------------------------
n = 8
xmean = 106.8750125
xvar = 58706.407661
xstd = 242.29405205
xserr = 85.663883624
xmin = -19
xmax = 700
Output file: lastobs
========================================================================
Problem 12.A
Assume you have data on two files, data1 and data2.
On data1, you have heights for a number of women.
On data2, you have heights for a number of men.
Your task is to (1) find the minimum height and maximum height
for the women on data1, and (2) find how many men in data2 are
shorter than the minimum height for women, and how many are taller
than the maximum height for women.
Write a program which will do this computation by creating
macro variables using the 'symput' routine, and without using any
SAS Procedures.
Test your program by generating two sets of random data,
with 100 observations in each set. The simulated heights of
women in the first set should be normally distributed with
mean 66 and standard deviation 3, and those for men in the
second set should have mean 70 and standard deviation 4.
========================================================================
/home/gnome/john-c/5421/notes.011 Last update: October 14, 2005.