MYSTERIES OF THE SAS DATA STEP                            SPH 5421 notes.011
 
     Probably the most complicated aspect of SAS is the DATA step.  It is not
obvious what the DATA step actually is, and it is not easy to predict what it
produces when it is used with multiple SET statements, MERGE statements, and
other statements.  The role of the OUTPUT statement is also not necessarily
clear.

     The SAS Language manual is the best source of information on the DATA
step.  It is particularly important to understand the SET statement.

     Below is an example which may shed some light on how the DATA step works.

=================================================================================

*  Program to explore the SAS DATA step, SET statement, MERGE statement;
*  program file:  /home/gnome/john-c/5421/datastep.sas               ;
*                                                                      ;

options linesize = 80 ;

 data examp1 ;

      seed = 20000128 ;
      n = 10 ;

      do casenum = 1 to n ;

         x = rannor(seed) ;
         y = rannor(seed) ;
         output ;

      end ;

run ;

proc print data = examp1 ;
title 'Listing of dataset examp1' ;

data examp2 ;
     set examp1 ;
     if casenum eq 5 then output ;
run ;

proc print data = examp2 ;
title 'Listing of dataset examp2' ;

data examp3 ;

     seed = 200001281 ;
     n = 12 ;

     do casenum = 1 to 12 ;

        y = casenum ;
        z = -casenum * casenum ;
        output ;

     end ;

run ;

proc print data = examp3 ;
title1 'Listing of dataset examp3' ;

data examp4 ;
     set examp1 examp3 ;

proc print data = examp4 ;
title1 'Listing of dataset examp4' ;

data examp5 ;
     merge examp1 examp3 ; by casenum ;

proc print data = examp5 ;
title1 'Listing of dataset examp5' ;

data examp6 ;
     merge examp3 examp1 ; by casenum ;

run ;

proc print data = examp6 ;
title1 'Listing of examp6' ;

data examp7 ;
     set examp1 ;
     if casenum eq 7 then set examp3 ;

run ;

proc print data = examp7 ;
title1 'Listing of examp7' ;

data examp8 ;
     set examp1 examp3 ;
     by casenum ;

proc print data = examp8 ;
title1 'Listing of examp8' ;

endsas ;
=================================================================================
                           Listing of dataset examp1                           1
                                                 18:26 Tuesday, February 1, 2000

            OBS      SEED       N    CASENUM        X           Y

              1    20000128    10        1      -2.28501     1.19334
              2    20000128    10        2       1.05265    -0.26866
              3    20000128    10        3      -0.43929    -1.57871
              4    20000128    10        4       0.83069    -0.41444
              5    20000128    10        5       0.80023    -1.05532
              6    20000128    10        6       1.14747     0.75366
              7    20000128    10        7      -1.13534     0.98361
              8    20000128    10        8      -1.35994    -1.19237
              9    20000128    10        9       2.01172     1.67323
             10    20000128    10       10       0.80378    -0.33271
------------------------------------------------------------------------

                            Listing of dataset examp2                           2
                                                 18:26 Tuesday, February 1, 2000

            OBS      SEED       N    CASENUM       X           Y

             1     20000128    10       5       0.80023    -1.05532
------------------------------------------------------------------------

                            Listing of dataset examp3                           3
                                                 18:26 Tuesday, February 1, 2000

                OBS       SEED       N    CASENUM     Y       Z

                  1    200001281    12        1       1      -1
                  2    200001281    12        2       2      -4
                  3    200001281    12        3       3      -9
                  4    200001281    12        4       4     -16
                  5    200001281    12        5       5     -25
                  6    200001281    12        6       6     -36
                  7    200001281    12        7       7     -49
                  8    200001281    12        8       8     -64
                  9    200001281    12        9       9     -81
                 10    200001281    12       10      10    -100
                 11    200001281    12       11      11    -121
                 12    200001281    12       12      12    -144
------------------------------------------------------------------------

                            Listing of dataset examp4                           4
                                                 18:26 Tuesday, February 1, 2000

       OBS       SEED       N    CASENUM        X              Y       Z

         1     20000128    10        1      -2.28501      1.1933       .
         2     20000128    10        2       1.05265     -0.2687       .
         3     20000128    10        3      -0.43929     -1.5787       .
         4     20000128    10        4       0.83069     -0.4144       .
         5     20000128    10        5       0.80023     -1.0553       .
         6     20000128    10        6       1.14747      0.7537       .
         7     20000128    10        7      -1.13534      0.9836       .
         8     20000128    10        8      -1.35994     -1.1924       .
         9     20000128    10        9       2.01172      1.6732       .
        10     20000128    10       10       0.80378     -0.3327       .
        11    200001281    12        1        .           1.0000      -1
        12    200001281    12        2        .           2.0000      -4
        13    200001281    12        3        .           3.0000      -9
        14    200001281    12        4        .           4.0000     -16
        15    200001281    12        5        .           5.0000     -25
        16    200001281    12        6        .           6.0000     -36
        17    200001281    12        7        .           7.0000     -49
        18    200001281    12        8        .           8.0000     -64
        19    200001281    12        9        .           9.0000     -81
        20    200001281    12       10        .          10.0000    -100
        21    200001281    12       11        .          11.0000    -121
        22    200001281    12       12        .          12.0000    -144
------------------------------------------------------------------------

                            Listing of dataset examp5                           5
                                                 18:26 Tuesday, February 1, 2000

          OBS       SEED       N    CASENUM        X        Y       Z

            1    200001281    12        1      -2.28501     1      -1
            2    200001281    12        2       1.05265     2      -4
            3    200001281    12        3      -0.43929     3      -9
            4    200001281    12        4       0.83069     4     -16
            5    200001281    12        5       0.80023     5     -25
            6    200001281    12        6       1.14747     6     -36
            7    200001281    12        7      -1.13534     7     -49
            8    200001281    12        8      -1.35994     8     -64
            9    200001281    12        9       2.01172     9     -81
           10    200001281    12       10       0.80378    10    -100
           11    200001281    12       11        .         11    -121
           12    200001281    12       12        .         12    -144
------------------------------------------------------------------------

                                Listing of examp6                               6
                                                 18:26 Tuesday, February 1, 2000

       OBS       SEED       N    CASENUM           Y       Z        X

         1     20000128    10        1        1.1933      -1    -2.28501
         2     20000128    10        2       -0.2687      -4     1.05265
         3     20000128    10        3       -1.5787      -9    -0.43929
         4     20000128    10        4       -0.4144     -16     0.83069
         5     20000128    10        5       -1.0553     -25     0.80023
         6     20000128    10        6        0.7537     -36     1.14747
         7     20000128    10        7        0.9836     -49    -1.13534
         8     20000128    10        8       -1.1924     -64    -1.35994
         9     20000128    10        9        1.6732     -81     2.01172
        10     20000128    10       10       -0.3327    -100     0.80378
        11    200001281    12       11       11.0000    -121      .     
        12    200001281    12       12       12.0000    -144      .     
------------------------------------------------------------------------

                                Listing of examp7                               7
                                                 18:26 Tuesday, February 1, 2000

        OBS       SEED       N    CASENUM        X           Y        Z

          1     20000128    10        1      -2.28501     1.19334     .
          2     20000128    10        2       1.05265    -0.26866     .
          3     20000128    10        3      -0.43929    -1.57871     .
          4     20000128    10        4       0.83069    -0.41444     .
          5     20000128    10        5       0.80023    -1.05532     .
          6     20000128    10        6       1.14747     0.75366     .
          7    200001281    12        1      -1.13534     1.00000    -1
          8     20000128    10        8      -1.35994    -1.19237    -1
          9     20000128    10        9       2.01172     1.67323    -1
         10     20000128    10       10       0.80378    -0.33271    -1
------------------------------------------------------------------------

                                Listing of examp8                               8
                                                 18:26 Tuesday, February 1, 2000

       OBS       SEED       N    CASENUM        X              Y       Z

         1     20000128    10        1      -2.28501      1.1933       .
         2    200001281    12        1        .           1.0000      -1
         3     20000128    10        2       1.05265     -0.2687       .
         4    200001281    12        2        .           2.0000      -4
         5     20000128    10        3      -0.43929     -1.5787       .
         6    200001281    12        3        .           3.0000      -9
         7     20000128    10        4       0.83069     -0.4144       .
         8    200001281    12        4        .           4.0000     -16
         9     20000128    10        5       0.80023     -1.0553       .
        10    200001281    12        5        .           5.0000     -25
        11     20000128    10        6       1.14747      0.7537       .
        12    200001281    12        6        .           6.0000     -36
        13     20000128    10        7      -1.13534      0.9836       .
        14    200001281    12        7        .           7.0000     -49
        15     20000128    10        8      -1.35994     -1.1924       .
        16    200001281    12        8        .           8.0000     -64
        17     20000128    10        9       2.01172      1.6732       .
        18    200001281    12        9        .           9.0000     -81
        19     20000128    10       10       0.80378     -0.3327       .
        20    200001281    12       10        .          10.0000    -100
        21    200001281    12       11        .          11.0000    -121
        22    200001281    12       12        .          12.0000    -144
=================================================================================

Notes on the program and output:

   1.  The first 'proc print' section just prints the 10 observations in
       dataset examp1.

   2.  Dataset examp2 consists of only one observation.  If the 'output'
       statement in the 'data examp2' section were omitted, examp2 would
       have 10 observations.  On the other hand, if such a statement were
       included but never executed, as follows,

                  if casenum eq 35 then output;

       then the datafile examp2 would have NO observations.  Evidently the SAS
       compiler looks to see if there are any 'output' statements in a given
       data step before it decides how that datastep will be executed.

   3.  Dataset examp3 is a new dataset.  Note that it includes variables y and
       z, but not x. (z is set equal to the negative of the square of the
       observation number).  Dataset examp3 has 12 observations.

   4.  Dataset examp4 is equal to dataset examp3 appended to the end of dataset
       examp1.  This is the result of the 'set examp1 examp3' statement.
       Note that there are 22 observations (10 + 12).  The first 10 are missing
       for y and z, and the last 12 are missing for x.

   5.  Dataset examp5 is the result of merging datasets examp1 and examp3 by
       the variable casenum.  Note that the x-values come from dataset examp1
       (because x is not defined in dataset examp3), but the y- and z-values
       come from dataset examp3.

   6.  Dataset examp6 is the result of merging datasets examp3 and examp1
       (in reverse order from examp5).  The results are correspondingly
       different.  Note that both datasets examp5 and examp6 have 12 observations.
       (the maximum number from either of datasets examp1 and examp3).

   7.  Dataset examp7 is rather hard to explain.  It is defined by first
       saying 'set examp1', but then at the 7th observation of examp1,
       the program says 'set examp3'.  At that point the variables seed, n, casenum,
       y, and z are entered from the first observation of examp3.  Then in the
       8th through 10th observations, seed, n, casenum, and y are from examp1.
       But z is still from the first observation of dataset examp3.  Strange!

   8.  Dataset examp8 illustrates that if the 'set' statement names more
       than one dataset and is followed by a 'by' statement, the
       result is that the output file is interleaved (alternates)
       between the datasets.

      The basic conclusion from all this is that, in general, SAS probably
produces what you expect.  But you should not hesitate to use 'proc print'
to find out exactly what is in any given dataset.

========================================================================

     The following program shows an alternative method of detecting
the end of a datafile in SAS, pointed out by Yang Xie.  It uses
the 'end = ' option on the SET statement:


options linesize = 80 ;

data randend ;

     seed = today() ;

     randnum = int(20 * ranuni(seed)) ;

     call symput('arand', left(put(randnum, 3.0))) ;

     do i = 1 to randnum ;

        x = ranuni(seed) ;
        output ;

     end ;

 run ;

 data endonly ;
      set randend end = endmark ;
      irandnum = &arand ;
      if endmark eq 1 then output ;

 run ;

 proc print data = randend ;
 title 'Print of dataset randend:' ;

 proc print data = endonly ;
 title 'Print of dataset endonly' ;

========================================================================
                           Print of dataset randend:                           1
                                                 17:13 Saturday, October 5, 2002

                    OBS     SEED    RANDNUM     I       X

                      1    15618       14       1    0.84556
                      2    15618       14       2    0.45601
                      3    15618       14       3    0.28804
                      4    15618       14       4    0.58906
                      5    15618       14       5    0.17365
                      6    15618       14       6    0.24873
                      7    15618       14       7    0.96133
                      8    15618       14       8    0.68310
                      9    15618       14       9    0.63640
                     10    15618       14      10    0.11744
                     11    15618       14      11    0.80980
                     12    15618       14      12    0.31235
                     13    15618       14      13    0.28872
                     14    15618       14      14    0.79861

                             Print of dataset endonly                           2
                                                 17:13 Saturday, October 5, 2002

              OBS     SEED    RANDNUM     I       X       IRANDNUM

               1     15618       14      14    0.79861       14   
========================================================================

PROBLEM 12

1.  Refer back to notes009.  Two file structures were shown, File Structure 1
    and File Structure 2.  They had equivalent data, with File Structure 1
    having 1 observation for each value of casenum, and 5 weights in each
    observation.  File Structure 2 had 5 observations for each value of
    casenum and 1 weight per observation.  The example code showed how,
    given data in File Structure 1, you can put it into File Structure 2.

    The problem here is to do the reverse.  Start with a file called struct2.
    The only variables on it are casenum, visit, and weight.  Write out
    a file with the same data in the form of File Structure 1, with all
    5 weights for a given casenum on the same observation.

    1.1  Simplest case:  struct2 is ordered by casenum and, within casenum,
         by visit, and there are always 5 observations per casenum.

    1.2  Next simplest case: struct2 is not necessarily ordered by casenum and
         visit, but there are always 5 observations per casenum.

    1.3  Worst case: struct2 is not ordered, and the number of observations
         per casenum varies between 1 and 5.


    Various hints: you will probably need to use the RETAIN statement.  For
parts 1.2 and 1.3 you will need to use PROC SORT.  For part 1.3, you may need
to initialize some things to missing.

SOME QUESTIONS

    See the SAS program above.  What will the resulting datasets be in the
following examples?

------------------------------------------------------------------------------
    data examp8 ;
         set examp1 ;
         set examp3 ;
------------------------------------------------------------------------------
    data examp9 ;
         set examp1 ;
         if casenum le 5 then set examp3 ;
------------------------------------------------------------------------------
    data examp10 ;
         merge examp1 examp3 ;
------------------------------------------------------------------------------
========================================================================

EXAMPLE: Use of macro variables, multiple output files, 'file'
         statement, and 'put' statements:

------------------------------------------------------------------------

options linesize = 80 MPRINT ;
footnote "~john-c/5421/output.sas &sysdate &systime" ;

%let infile = xdata ;                  *  macro variable definition;
%let outfile = lastobs ;               *  macro variable definition;

data &infile &outfile ;                *  use of macro variable ;
     infile "&infile" end = endmark ;
     retain n 0  xsum 0  x2sum 0  xmin 9999999 xmax -9999999 ;

     input x ;

     if x ne . then do ;

        n = n + 1 ;
        xsum = xsum + x ;
        x2sum = x2sum + x*x ;
        if x lt xmin and x ne . then xmin = x ;
        if x gt xmax then xmax = x ;
        keep x n xsum x2sum xmin xmax ;

        output &infile ;                * use of macro variable ;

     end ;

     if endmark = 1 then do ;

        xmean = xsum / n ;
        xvar = (x2sum - xsum*xsum/n)/(n - 1) ;
        xstd = sqrt(xvar) ;
        xserr = xstd/sqrt(n) ;

        keep n xmean xvar xstd xserr xmin xmax ;

        output &outfile ;

      end ;

 run ;

 data &outfile ;                        *  use of macro var ;
 set &outfile ;                         *  use of macro var ;
 file "&outfile" ;                      *  use of macro var ;

      put "Summary Stats for dataset = &outfile" ;  *  use of macro var;
      put '------------------------------------' ;
      put ' ' ;
      put 'n     = ' n ;
      put 'xmean = ' xmean ;
      put 'xvar  = ' xvar ;             * put statements ... ;
      put 'xstd  = ' xstd ;
      put 'xserr = ' xserr ;
      put 'xmin  = ' xmin ;
      put 'xmax  = ' xmax ;
      put ' ' ;
      put "Output file: &outfile" ;     *  use of macro variable ;

run ;

 proc print data = &infile ;
 title "Original data ... : dataset &infile" ;  * use of macro var ;

 proc print data = &outfile ;
 title "Summary statistics, mean, variance, etc: dataset = &outfile" ;

 proc means data = xdata n mean std stderr min max ;
      var x ;
 title "Output from proc means: dataset &infile" ; *  use of macro var;

========================================================================

                       Original data ... : dataset xdata                       1
                                             19:44 Wednesday, September 29, 2004

  OBS  N    XSUM    X2SUM      XMIN  XMAX         X  XMEAN  XVAR  XSTD  XSERR

   1   1   13.000     169   13.0000    13    13.000    .      .     .     .  
   2   2   31.000     493   13.0000    18    18.000    .      .     .     .  
   3   3   36.000     518    5.0000    18     5.000    .      .     .     .  
   4   4  736.000  490518    5.0000   700   700.000    .      .     .     .  
   5   5  836.000  500518    5.0000   700   100.000    .      .     .     .  
   6   6  874.000  501962    5.0000   700    38.000    .      .     .     .  
   7   7  874.000  501962    0.0001   700     0.000    .      .     .     .  
   8   8  855.000  502323  -19.0000   700   -19.000    .      .     .     .  
 
 
                     ~john-c/5421/output.sas 29SEP04 19:44

            Summary statistics, mean, variance, etc: dataset = lastobs          2
                                             19:44 Wednesday, September 29, 2004

 OBS  N    XSUM    X2SUM  XMIN  XMAX   X    XMEAN     XVAR      XSTD    XSERR

  1   8  855.000  502323   -19   700  -19  106.875  58706.41  242.294  85.6639
 
 
                     ~john-c/5421/output.sas 29SEP04 19:44

                      Output from proc means: dataset xdata                     3
                                             19:44 Wednesday, September 29, 2004

    Analysis Variable : X


    N          Mean       Std Dev     Std Error       Minimum       Maximum
    -----------------------------------------------------------------------
    8   106.8750125   242.2940521    85.6638836   -19.0000000   700.0000000
    -----------------------------------------------------------------------
 
                     ~john-c/5421/output.sas 29SEP04 19:44

========================================================================

Print of output file 'lastobs':


Summary Stats for dataset = lastobs
------------------------------------
 
n     = 8
xmean = 106.8750125
xvar  = 58706.407661
xstd  = 242.29405205
xserr = 85.663883624
xmin  = -19
xmax  = 700
 
Output file: lastobs
========================================================================

Problem 12.A

   Assume you have data on two files, data1 and data2.

   On data1, you have heights for a number of women.

   On data2, you have heights for a number of men.

   Your task is to (1) find the minimum height and maximum height
for the women on data1, and (2) find how many men in data2 are
shorter than the minimum height for women, and how many are taller
than the maximum height for women.

   Write a program which will do this computation by creating
macro variables using the 'symput' routine, and without using any 
SAS Procedures.

   Test your program by generating two sets of random data,
with 100 observations in each set.  The simulated heights of
women in the first set should be normally distributed with
mean 66 and standard deviation 3, and those for men in the
second set should have mean 70 and standard deviation 4.

========================================================================
/home/gnome/john-c/5421/notes.011    Last update: October 14, 2005.