THE SAS DATA STEP / INPUT                                        n54703.001

     The Data step in SAS is probably its most complex feature.
It is not obvious what it is doing.  In most cases, the basic
structure is the following:

  1.  Input data with variables x, y, z, etc..
  2.  Transform some of the variables.
  3.  Output the data.

     Here is how this works in a typical program:

------------------------------------------------------------------------

1 data xyz ;                        /* Specify name of dataset    */
2
3  infile 'C:\datafiles\xyzdata' ;  /* Specify name of input file */
4  input x y z ;                    /* Input variables x, y, z    */
5
6  s = (x + y)**2 ;                 /* Create a new variable      */
7
8 run ;                             /* End of data step           */
9
------------------------------------------------------------------------

Line 1:  The dataset name is chosen by the user.  It specifies
         only a 'local' temporary file.

Line 3:  The data in this case are in a permanent file on the C: disk.

Line 4:  The input statement reads each line in the data file
         until it comes to the end.

Line 6:  A data transformation which creates a new variable is used.
         The new variable 's' will be on the dataset xyz along with
         x, y, and z.

Line 8:  The 'run' statement signifies the end of the data step.
         As each line of the input file is read, an output line
         is created on the temporary SAS dataset 'xyz' ;

------------------------------------------------------------------------

     There are enormous variations possible on this basic
outline.  If, for example, the following line is inserted after
line 4,

  if x < 3 then output ;

then the effect is to write out only observations which have x-values
less than 3, and only the variables x, y, and z are written to the
dataset.  The variable 's' is computed only after the 'output'
statement occurs.  Thus we would have the following:

------------------------------------------------------------------------

     Input file xyzdata         Output dataset xyz
     ------------------        -------------------
     Obs   x    y    z         Obs    x     y    z
     ---  ---  ---  ---        ---   ---   ---  ---
      1    1   1.4  14          1     1    1.4  14
      2    4   3.3  81          2     2    4.4  44
      3    2   4.4  44          3     1    1.2  65
      4    1   1.2  65          4     0    7.5  11
      5    5   6.0  28          5     1    3.3  29
      6    0   7.5  11
      7    3   7.5   8
      8    1   3.3  29

------------------------------------------------------------------------

     If the statement

     if x < 3 then output ;

were moved to just after the line in which 's' is created, the following
would be the result:

------------------------------------------------------------------------

     Input file xyzdata         Output dataset xyz
     ------------------        -------------------------
     Obs   x    y    z         Obs    x     y    z    s
     ---  ---  ---  ---        ---   ---   ---  --- -----
      1    1   1.4  14          1     1    1.4  14   5.76
      2    4   3.3  81          2     2    4.4  44  40.96
      3    2   4.4  44          3     1    1.2  65   4.84
      4    1   1.2  65          4     0    7.5  11  56.25
      5    5   6.0  28          5     1    3.3  29  18.49
      6    0   7.5  11
      7    3   7.5   8
      8    1   3.3  29

------------------------------------------------------------------------

     The 'input' statement is somewhat of a black box.  It looks like
a single instruction, but in fact it, and all the data transformations
following it, are repeated for every line of the input datafile.  If
the data step does not have an explicit 'output' statement, then a line
is created on the output dataset for each line that is read in the input
data file.

     The input data file can be embedded in the data step itself in the
form of data following either a 'cards' statement or a 'datalines'
statement.  This is what is done in most of the program/datafiles on the
Computer Programs and Datafiles web page for this course.

     Here is another simple example.  The lines are numbered for
reference later.
----------------------------------------------------------------------------------

1   data xyz ;
2     retain xcount 0 ;
3     input x y z ;
4
5   if x ne . the xcount = xcount + 1 ;
6   minxyz = min(x, y, z) ;
7   maxxyz = max(x, y, z) ;
8   datalines ;
9      14   18   2
10      .    7   7
11      s    3   1
12      .    8   4
13  ;
14  run ;
15
16  proc print data = xyz ;
17       var x y z minxyz maxxyz ;
18  title1 'Printout of the dataset xyz' ;
19  run ;
20
----------------------------------------------------------------------------------

Line 2:  The 'retain' statement causes the value of the specified variable to be
         retained as the program goes through the lines in the datafile.  In
         this case, the variable 'xcount' is initialized to zero.  It is
         intended to be a count of the number of observations in the datafile
         which have a nonmissing value for x.  Note that in SAS, missing
         values for numeric variables are denoted by '.'

Line 5:  Here the value of 'xcount' is increased by 1 for every occurrence
         in which x is not equal ('ne') to missing.  For the last line of
         the file, in this case xcount will equal  ???.

Line 16: proc print is one of the most useful procedures in SAS.  It is
         often essential to use proc print to figure out what a given
         datafile contains.

         If your datafile has 10,000 lines, when you use proc print, it will
         produce 10,000 lines of printout.  Usually you are not going to want
         to read 10,000 lines of printout, much less print it (it would run
         to about 200 pages).  You can often restrict how many lines are
         printed using a 'where' statement, as in the following example:

----------------------------------------------------------------------------------

1   data xyz ;
2     infile 'mystery.file' ;
3     retain case 0 ;
4     input x y z ;
5
6   case = case + 1 ;
7   minxyz = min(x, y, z) ;
8   maxxyz = max(x, y, z) ;
9
10  run ;
11
12  proc print data = xyz ;
13       where case le 10 ;
17       var case x y z minxyz maxxyz ;
18  title1 'Printout of the dataset xyz - first 10 observations only.' ;
19  run ;
20
21  data xyzw ;
22    set xyz ;
23
24    w = x + y + z ;
25
26  run ;
27
28
----------------------------------------------------------------------------------

Line 3:  Put a 'case' counter in a retain statement.

Line 6:  Increment the variable 'case' for each line of the input data file.

Line 13: Use the 'where' statement to print only the first 10 lines of the dataset.

----------------------------------------------------------------------------------

     Data sets may be modified and augmented in many ways.  In the example
just given, line 21 defines a new dataset called 'xyzw'.  The 'set xyz'
statement tells SAS to read in each line of the dataset xyz.  A new variable,
w, is added to each line of the new dataset, and it contains all the old
variables also.  Both datasets continue to exist during the running of this
program.

     If Line 21 of the preceding example had been replaced by:

21  data xyz ;

then the effect would have been to write over the old dataset, xyz, with
the new data.  Of course here the old data and the new data are almost
exactly the same, except that the new data have the additional variable w.


CONCATENATING DATASETS

In the following program, two datasets are concatenated together using
the 'set' command.  Then the two datasets are merge by 'name':

----------------------------------------------------------------------------------
options linesize = 80  pagesize = 30 ;

data men ;
length name $5 ;
input name  height weight ;
cards ;
  Black   74    208
  Blau    65    155
  Brown   71    199
  Green   63    104
  ;
run ;

data women ;
length name $5 ;
input name shoesize dressiz ;
cards ;
  Black   9    7
  Blau    6    8
  Brown   7    5
  Green   7   18
  ;
run ;

data both ;
  set men women ;
run ;

data merged ;
     merge men women ; by name ;
run ;

proc print data = both ;
title1 'Two datasets concatenated together' ;
run ;

proc print data = merged ;
title1 "Two datasets merged by the variable 'name'" ;
run ;
endsas ;
----------------------------------------------------------------------------------
                       Two datasets concatenated together                      1


                                                 18:01 Tuesday, January 20, 2004

            OBS    NAME     HEIGHT    WEIGHT    SHOESIZE    DRESSIZ

             1     Black      74        208         .           .  
             2     Blau       65        155         .           .  
             3     Brown      71        199         .           .  
             4     Green      63        104         .           .  
             5     Black       .          .         9           7  
             6     Blau        .          .         6           8  
             7     Brown       .          .         7           5  
             8     Green       .          .         7          18  


                    Two datasets merged by the variable 'name'                  2

            OBS    NAME     HEIGHT    WEIGHT    SHOESIZE    DRESSIZ

             1     Black      74        208         9           7  
             2     Blau       65        155         6           8  
             3     Brown      71        199         7           5  
             4     Green      63        104         7          18  

----------------------------------------------------------------------------------

     Note that the two datasets do not have the same variables.  The concatenated
dataset has missing values for the variables that are in one file but not
the other.

     Note that on the MERGED dataset, all five of the variables occur.

Note that the 'merge' command requires that both datasets are sorted
by the 'merge' variable.  If they are not, the merge will not work and an error
message will be printed on the log file.

----------------------------------------------------------------------------------

ALTERNATING OBSERVATIONS

     There are situations where you have two datasets with the
same number of observations, and you want to create a new dataset
which is comprised of alternating observations, rather than merged
observations, from each.  This can be accomplished by using the
'set' and 'output' commands in the right order, as in the following
example:

========================================================================

 options linesize = 80 pagesize = 30 ;
 footnote "~john-c/5421/altern.sas &sysdate &systime" ;

 data one ;
      input x y ;
cards;
1 2
3 4
5 6
7 8
9 10
;
run ;

 data two ;
      input x y ;
cards;
1 4
9 16
25 36
49 64
81 100
;
run ;

data onetwo ;
     set one ;
     output ;
     set two ;
     output ;

run ;

proc print data = one ;
title1 'Proc print of dataset one' ;
run ;
proc print data = two ;
title1 'Proc print of dataset two' ;
run ;

proc print data = onetwo ;
title1 'Proc print of dataset onetwo - ' ;
title2 'Note alternating observations from datasets one and two.' ;
run ;
------------------------------------------------------------------------

                           Proc print of dataset one                           1
                                                  12:00 Sunday, January 25, 2004

                                 OBS    X     Y

                                  1     1     2
                                  2     3     4
                                  3     5     6
                                  4     7     8
                                  5     9    10

                     ~john-c/5421/altern.sas 25JAN04 12:00

                            Proc print of dataset two                           2
                                                  12:00 Sunday, January 25, 2004

                                OBS     X      Y

                                 1      1      4
                                 2      9     16
                                 3     25     36
                                 4     49     64
                                 5     81    100
 
 
                     ~john-c/5421/altern.sas 25JAN04 12:00

                         Proc print of dataset onetwo -                         3
            Note alternating observations from datasets one and two.
                                                  12:00 Sunday, January 25, 2004

                                OBS     X      Y

                                  1     1      2
                                  2     1      4
                                  3     3      4
                                  4     9     16
                                  5     5      6
                                  6    25     36
                                  7     7      8
                                  8    49     64
                                  9     9     10
                                 10    81    100
 
                     ~john-c/5421/altern.sas 25JAN04 12:00

========================================================================

READING EXCEL FILES IN SAS:

     SAS can read datafiles in a number of 'external' formats from
database management systems and spreadsheets.  In particular, SAS
can read Excel files.  This is well-explained at the following
UCLA website:

     http://www.ats.ucla.edu/stat/sas/faq/readxls.htm

     It is important to note that SAS cannot read Excel spreadsheets
from the more recent versions of Excel.  You will need to save your
Excel file in Excel 5.0 or earlier, using the "Save as" option from
the File menu in Excel.

     Below is an example of a SAS program which reads an Excel
spreadsheet:

========================================================================

footnote "~john-c/5421/excelread.sas &sysdate &systime" ;
options linesize = 80 ;


------------------------------------------------------------------------

HOW TO WRITE OUT SAS SYSTEM FILES and HOW TO READ SAS SYSTEM FILES


  The following program writes out a SAS dataset as a permanent file:
------------------------------------------------------------------------

proc import datafile = "auto.xls" out = auto replace  ;
run ;

proc print data = auto ;
run ;
------------------------------------------------------------------------

libname sasperm '' ;
footnote "~john-c/5421/sasperm.sas &sysdate &systime" ;

options linesize = 80 ;

data indata ;
input x y n ;

   nxy = n * x * y ;

cards ;
    11 24  101
     5 18   88
    33 12  140
    25 25  100
    ;
run ;

data sasperm.output ;
     set indata ;
run ;

proc print data = sasperm.output ;
title "SAS program to illustrate making a SAS dataset into a permanent file' ;
run ;

------------------------------------------------------------------------

Notes:

1.  Note that a library name is specified:

    libname sasperm '' ;

    This is simply a "local" name for the library in which the
    file will be stored.

2.  Note that the actual name of the permanent file is indicated
    in the line

    data sasperm.output ;

    Here the prefix is the "local" library name.  The file is actually
    stored on the computer's hard disk as:

         output.ssd01

    The format for storage is determined by SAS.  You cannot examine
    the contents of this file with an ordinary text editor.  It is
    readable by SAS as will be shown below.

------------------------------------------------------------------------

  The following program reads in a SAS dataset stored as a permanent file:

------------------------------------------------------------------------

libname sasperm '' ;
footnote "~john-c/5421/inperm.sas &sysdate &systime" ;

options linesize = 80 ;

data sasperm.in ;
     set sasperm.output ;
run ;

proc contents data = sasperm.in ;
title1 'Example of the use of proc contents ...' ;
run ;

proc print data = sasperm.in ;
title 'SAS program to illustrate reading a permanent SAS datafile ' ;
run ;

------------------------------------------------------------------------

                    Example of the use of proc contents ...                    1
                                                  13:51 Monday, January 24, 2005

                               CONTENTS PROCEDURE

Data Set Name: SASPERM.IN                              Observations:         4 
Member Type:   DATA                                    Variables:            4 
Engine:        V612                                    Indexes:              0 
Created:       13:51 Monday, January 24, 2005          Observation Length:   32
Last Modified: 13:51 Monday, January 24, 2005          Deleted Observations: 0 
Protection:                                            Compressed:           NO
Data Set Type:                                         Sorted:               NO
Label:                                                                         

                  -----Engine/Host Dependent Information-----

           Data Set Page Size:       8192    
           Number of Data Set Pages: 1       
           File Format:              607     
           First Data Page:          1       
           Max Obs per Page:         254     
           Obs in First Data Page:   4       
           File Name:                /home/gnome/john-c/5421/in.ssd01
           Inode Number:             571724
           Access Permission:        rw-r--r--
           Owner Name:               john-c
           File Size (bytes):        16384

             -----Alphabetic List of Variables and Attributes-----
 
                      #    Variable    Type    Len    Pos
                      -----------------------------------
                      3    N           Num       8     16
                      4    NXY         Num       8     24
                      1    X           Num       8      0
                      2    Y           Num       8      8
 
 
                     ~john-c/5421/inperm.sas 24JAN05 13:51


           SAS program to illustrate reading a permanent SAS datafile           2
                                                  13:51 Monday, January 24, 2005

                        OBS     X     Y     N      NXY

                         1     11    24    101    26664
                         2      5    18     88     7920
                         3     33    12    140    55440
                         4     25    25    100    62500
 
 
                     ~john-c/5421/inperm.sas 24JAN05 13:51
----------------------------------------------------------------------------------
Notes:

1.  Note that again a library name is specified:

    libname sasperm '' ;

    This is simply a "local" name for the library in which the
    file will be stored.

2.  Note that the actual name of the permanent file is indicated
    in the line

    set sasperm.output ;

    Here the prefix is the "local" library name.  The file is actually
    stored on the computer's hard disk as:

         output.ssd01

  SAS System files, like that just shown, include not only the data,
but also variable names and labels and formatting information.  If
the data that was stored was a date, this fact is included in the
stored information also.  Running PROC CONTENTS, as in the example
just given, gives you a printout of this kind of information for
each variable, as well as how many observations are on the file, etc.

  Note that 'proc contents' is used in the above program.  This is a
helpful utility procedure that tells you the nature of the data on
on the input file.


There are several reasons for using SAS system files:

  1.  You do not have to recreate all the variables, their labels,
      etc., as you did in the original program.

  2.  It saves computer time.

  3.  SAS system files are often portable between different kinds of
      computers and operating systems.

  4.  The data are stored fairly efficiently; that is, system files
      generally do not take up as much space on disk as text files.


There are some disadvantages also:

  1.  The actual definitions of the variables are not included.  Thus
      if there was a mistake made in constructing the variables that
      are on the system file, you may not be able to discover it and
      you may not be able to fix it.

  2.  If the files from which the SAS system file was made are changed,
      then you will probably need to run the program over again which
      created the system file.  It is a good idea to have the date of
      creation of the system file as a variable on the file itself.

============================================================================= ;

How to identify specific lines in a file ...

  Within a given data step, SAS creates an 'automatic variable' called _n_,
which represent the observation number.  It is not, however, an actual variable
on the data file, and it cannot be used directly in procedures.  However it
can be used to create an actual variable on the file, which then CAN be used
later on.  Here is an example:
----------------------------------------------------------------------------------

options linesize = 80 ;
footnote "~john-c/5421/n.sas &sysdate &systime" ;

data ten ;
  input x @@ ;
  obsnum = _n_ ;

cards ;
1 4 9 16 25 36 49 64 81 100
;
run ;

proc print n = 'Number printed: ' ;
     where obsnum ge 3 and obsnum le 7 ;
     sum obsnum x ;
     var obsnum x ;
title1 'Printout of a selected subset of observations from a dataset ...' ;
title2 'obsnum = observation number - where obsnum ge 3 and obsnum le 7' ;
run ;

----------------------------------------------------------------------------------
Printout from this program:

----------------------------------------------------------------------------------

        Printout of a selected subset of observations from a dataset ...       1
        obsnum = observation number - where obsnum ge 3 and obsnum le 7
                                                  14:57 Sunday, January 23, 2005

                              Obs    obsnum      x

                                3       3        9
                                4       4       16
                                5       5       25
                                6       6       36
                                7       7       49
                                     ======    ===
                                       25      135

                               Number printed: 5
 
 
                        ~john-c/5421/n.sas 23JAN05 14:57
----------------------------------------------------------------------------------

  Note that there are two other features in the program n.sas: 

  1.  The 'proc print' line includes an "n = " option:

         proc print n = 'Number printed: ' ;

      This causes the line below the main table in the printout:

                               Number printed: 5

  2.  The proc print procedure includes the line:

        sum obsnum x ;

      which causes the printout to include the sum of the two specified variables.

==================================================================================

How to identify the LAST observation on a given dataset:

The following is an example which shows how to identify the last
observation on a file:

----------------------------------------------------------------------------------

options linesize = 80 ;
footnote "~john-c/5421/firstlast.sas &sysdate &systime" ;

data somedata ;
     infile 'somedata.extern' end = lastobs ;
     input id age weight ;
     obsnum = _n_ ;

if lastobs eq 1 then do ;

   file 'somedata.out' ;
   put 'Last observation on file somedata: ' obsnum
        '  age, weight = ' age weight ;

end ;

run ;

proc print data = somedata (firstobs = 3 obs = 4) n = 'Number printed: ' ;
     sum age weight ;
     var id age weight ;
title1 'Use of firstobs, obs in proc print ...' ;
run ;

----------------------------------------------------------------------------------

Printout from the program 'firstlast.sas':

1.  The file 'somedata.out' :

----------------------------------------------------------------------------------
Last observation on file somedata: 10   age, weight = 2 23
----------------------------------------------------------------------------------
2.  The listing file, 'firstlast.lst' :
----------------------------------------------------------------------------------

                     Use of firstobs, obs in proc print ...                    1
                                                  15:44 Sunday, January 23, 2005

                           Obs    id    age    weight

                             3     3     16      323 
                             4     4     51      108 
                                        ===    ======
                                         67      431 

                               Number printed: 2
 
 
                    ~john-c/5421/firstlast.sas 23JAN05 15:44
----------------------------------------------------------------------------------

Note that the data in this example are on an external file, 'somedata.extern'.

Note that the last observation is detected in the 'infile' statement:

     infile 'somedata.extern' end = lastobs ;

This creates a variable called 'lastobs'.  When the end of the file is
encountered, lastobs = 1.  For observations previous to the last one,
lastobs = 0.

Note that section which is executed conditional on lastobs = 1:

   if lastobs eq 1 then do ;

      file 'somedata.out' ;
      put 'Last observation on file somedata: ' obsnum
           '  age, weight = ' age weight ;

    end ;

This writes out the file 'somedata.out', which describes the number of
the last observation on the file, and the values of age and weight for
that observation.

Finally, the 'proc print' line:

   proc print data = somedata (firstobs = 3 obs = 4) n = 'Number printed: ' ;

This specifies that the first observation to be printed is the 3rd
observation on the file, and that the last observation to be printed is
the 4th observation on the file.

============================================================================= ;

PROBLEMS

Problem 1.  The first data file in Chapter 8 of Der and Everitt's text has 
            observations for both males (M) and females (F).  Create two 
            SAS data sets, one of which has only the observations for males 
            and the other of which has only the observations for females.  Use
            proc print to print each of these datasets separately.

Problem 2.  Create an Excel file where the first column is the first
            name of a country in Africa, and the second column is the
            the name of that country's capital, and the third column
            is the population of the country.  Include at least 10
            African countries, ordered alphabetically within the Excel
            file.  Read the Excel file using a SAS program.
            Sort the list by the size of the population.  Print the
            resulting dataset.

Problem 3.  A data file has the following structure:

     Obs      ID       Age      Weight
    -----   ------    -----    --------
      1     A00155      23       104
      2     A00155      28       120
      3     A02042      31       168
      4     A02042      36       167
      5     A02503      44       185
      6     A02503      49       204

            etc.

            That is, there are two observations for each ID, the first showing the
            weight of the person at a certain age and the second showing that
            person's weight 5 years later.

            Write a SAS program which prints out the change in the weight over
            5 years for each person, and the mean and standard deviation of 
            that change.

n54703.001  Last update: January 24, 2005.