THE SAS DATA STEP / INPUT n54703.001 The Data step in SAS is probably its most complex feature. It is not obvious what it is doing. In most cases, the basic structure is the following: 1. Input data with variables x, y, z, etc.. 2. Transform some of the variables. 3. Output the data. Here is how this works in a typical program: ------------------------------------------------------------------------ 1 data xyz ; /* Specify name of dataset */ 2 3 infile 'C:\datafiles\xyzdata' ; /* Specify name of input file */ 4 input x y z ; /* Input variables x, y, z */ 5 6 s = (x + y)**2 ; /* Create a new variable */ 7 8 run ; /* End of data step */ 9 ------------------------------------------------------------------------ Line 1: The dataset name is chosen by the user. It specifies only a 'local' temporary file. Line 3: The data in this case are in a permanent file on the C: disk. Line 4: The input statement reads each line in the data file until it comes to the end. Line 6: A data transformation which creates a new variable is used. The new variable 's' will be on the dataset xyz along with x, y, and z. Line 8: The 'run' statement signifies the end of the data step. As each line of the input file is read, an output line is created on the temporary SAS dataset 'xyz' ; ------------------------------------------------------------------------ There are enormous variations possible on this basic outline. If, for example, the following line is inserted after line 4, if x < 3 then output ; then the effect is to write out only observations which have x-values less than 3, and only the variables x, y, and z are written to the dataset. The variable 's' is computed only after the 'output' statement occurs. Thus we would have the following: ------------------------------------------------------------------------ Input file xyzdata Output dataset xyz ------------------ ------------------- Obs x y z Obs x y z --- --- --- --- --- --- --- --- 1 1 1.4 14 1 1 1.4 14 2 4 3.3 81 2 2 4.4 44 3 2 4.4 44 3 1 1.2 65 4 1 1.2 65 4 0 7.5 11 5 5 6.0 28 5 1 3.3 29 6 0 7.5 11 7 3 7.5 8 8 1 3.3 29 ------------------------------------------------------------------------ If the statement if x < 3 then output ; were moved to just after the line in which 's' is created, the following would be the result: ------------------------------------------------------------------------ Input file xyzdata Output dataset xyz ------------------ ------------------------- Obs x y z Obs x y z s --- --- --- --- --- --- --- --- ----- 1 1 1.4 14 1 1 1.4 14 5.76 2 4 3.3 81 2 2 4.4 44 40.96 3 2 4.4 44 3 1 1.2 65 4.84 4 1 1.2 65 4 0 7.5 11 56.25 5 5 6.0 28 5 1 3.3 29 18.49 6 0 7.5 11 7 3 7.5 8 8 1 3.3 29 ------------------------------------------------------------------------ The 'input' statement is somewhat of a black box. It looks like a single instruction, but in fact it, and all the data transformations following it, are repeated for every line of the input datafile. If the data step does not have an explicit 'output' statement, then a line is created on the output dataset for each line that is read in the input data file. The input data file can be embedded in the data step itself in the form of data following either a 'cards' statement or a 'datalines' statement. This is what is done in most of the program/datafiles on the Computer Programs and Datafiles web page for this course. Here is another simple example. The lines are numbered for reference later. ---------------------------------------------------------------------------------- 1 data xyz ; 2 retain xcount 0 ; 3 input x y z ; 4 5 if x ne . the xcount = xcount + 1 ; 6 minxyz = min(x, y, z) ; 7 maxxyz = max(x, y, z) ; 8 datalines ; 9 14 18 2 10 . 7 7 11 s 3 1 12 . 8 4 13 ; 14 run ; 15 16 proc print data = xyz ; 17 var x y z minxyz maxxyz ; 18 title1 'Printout of the dataset xyz' ; 19 run ; 20 ---------------------------------------------------------------------------------- Line 2: The 'retain' statement causes the value of the specified variable to be retained as the program goes through the lines in the datafile. In this case, the variable 'xcount' is initialized to zero. It is intended to be a count of the number of observations in the datafile which have a nonmissing value for x. Note that in SAS, missing values for numeric variables are denoted by '.' Line 5: Here the value of 'xcount' is increased by 1 for every occurrence in which x is not equal ('ne') to missing. For the last line of the file, in this case xcount will equal ???. Line 16: proc print is one of the most useful procedures in SAS. It is often essential to use proc print to figure out what a given datafile contains. If your datafile has 10,000 lines, when you use proc print, it will produce 10,000 lines of printout. Usually you are not going to want to read 10,000 lines of printout, much less print it (it would run to about 200 pages). You can often restrict how many lines are printed using a 'where' statement, as in the following example: ---------------------------------------------------------------------------------- 1 data xyz ; 2 infile 'mystery.file' ; 3 retain case 0 ; 4 input x y z ; 5 6 case = case + 1 ; 7 minxyz = min(x, y, z) ; 8 maxxyz = max(x, y, z) ; 9 10 run ; 11 12 proc print data = xyz ; 13 where case le 10 ; 17 var case x y z minxyz maxxyz ; 18 title1 'Printout of the dataset xyz - first 10 observations only.' ; 19 run ; 20 21 data xyzw ; 22 set xyz ; 23 24 w = x + y + z ; 25 26 run ; 27 28 ---------------------------------------------------------------------------------- Line 3: Put a 'case' counter in a retain statement. Line 6: Increment the variable 'case' for each line of the input data file. Line 13: Use the 'where' statement to print only the first 10 lines of the dataset. ---------------------------------------------------------------------------------- Data sets may be modified and augmented in many ways. In the example just given, line 21 defines a new dataset called 'xyzw'. The 'set xyz' statement tells SAS to read in each line of the dataset xyz. A new variable, w, is added to each line of the new dataset, and it contains all the old variables also. Both datasets continue to exist during the running of this program. If Line 21 of the preceding example had been replaced by: 21 data xyz ; then the effect would have been to write over the old dataset, xyz, with the new data. Of course here the old data and the new data are almost exactly the same, except that the new data have the additional variable w. CONCATENATING DATASETS In the following program, two datasets are concatenated together using the 'set' command. Then the two datasets are merge by 'name': ---------------------------------------------------------------------------------- options linesize = 80 pagesize = 30 ; data men ; length name $5 ; input name height weight ; cards ; Black 74 208 Blau 65 155 Brown 71 199 Green 63 104 ; run ; data women ; length name $5 ; input name shoesize dressiz ; cards ; Black 9 7 Blau 6 8 Brown 7 5 Green 7 18 ; run ; data both ; set men women ; run ; data merged ; merge men women ; by name ; run ; proc print data = both ; title1 'Two datasets concatenated together' ; run ; proc print data = merged ; title1 "Two datasets merged by the variable 'name'" ; run ; endsas ; ---------------------------------------------------------------------------------- Two datasets concatenated together 1 18:01 Tuesday, January 20, 2004 OBS NAME HEIGHT WEIGHT SHOESIZE DRESSIZ 1 Black 74 208 . . 2 Blau 65 155 . . 3 Brown 71 199 . . 4 Green 63 104 . . 5 Black . . 9 7 6 Blau . . 6 8 7 Brown . . 7 5 8 Green . . 7 18 Two datasets merged by the variable 'name' 2 OBS NAME HEIGHT WEIGHT SHOESIZE DRESSIZ 1 Black 74 208 9 7 2 Blau 65 155 6 8 3 Brown 71 199 7 5 4 Green 63 104 7 18 ---------------------------------------------------------------------------------- Note that the two datasets do not have the same variables. The concatenated dataset has missing values for the variables that are in one file but not the other. Note that on the MERGED dataset, all five of the variables occur. Note that the 'merge' command requires that both datasets are sorted by the 'merge' variable. If they are not, the merge will not work and an error message will be printed on the log file. ---------------------------------------------------------------------------------- ALTERNATING OBSERVATIONS There are situations where you have two datasets with the same number of observations, and you want to create a new dataset which is comprised of alternating observations, rather than merged observations, from each. This can be accomplished by using the 'set' and 'output' commands in the right order, as in the following example: ======================================================================== options linesize = 80 pagesize = 30 ; footnote "~john-c/5421/altern.sas &sysdate &systime" ; data one ; input x y ; cards; 1 2 3 4 5 6 7 8 9 10 ; run ; data two ; input x y ; cards; 1 4 9 16 25 36 49 64 81 100 ; run ; data onetwo ; set one ; output ; set two ; output ; run ; proc print data = one ; title1 'Proc print of dataset one' ; run ; proc print data = two ; title1 'Proc print of dataset two' ; run ; proc print data = onetwo ; title1 'Proc print of dataset onetwo - ' ; title2 'Note alternating observations from datasets one and two.' ; run ; ------------------------------------------------------------------------ Proc print of dataset one 1 12:00 Sunday, January 25, 2004 OBS X Y 1 1 2 2 3 4 3 5 6 4 7 8 5 9 10 ~john-c/5421/altern.sas 25JAN04 12:00 Proc print of dataset two 2 12:00 Sunday, January 25, 2004 OBS X Y 1 1 4 2 9 16 3 25 36 4 49 64 5 81 100 ~john-c/5421/altern.sas 25JAN04 12:00 Proc print of dataset onetwo - 3 Note alternating observations from datasets one and two. 12:00 Sunday, January 25, 2004 OBS X Y 1 1 2 2 1 4 3 3 4 4 9 16 5 5 6 6 25 36 7 7 8 8 49 64 9 9 10 10 81 100 ~john-c/5421/altern.sas 25JAN04 12:00 ======================================================================== READING EXCEL FILES IN SAS: SAS can read datafiles in a number of 'external' formats from database management systems and spreadsheets. In particular, SAS can read Excel files. This is well-explained at the following UCLA website: http://www.ats.ucla.edu/stat/sas/faq/readxls.htm It is important to note that SAS cannot read Excel spreadsheets from the more recent versions of Excel. You will need to save your Excel file in Excel 5.0 or earlier, using the "Save as" option from the File menu in Excel. Below is an example of a SAS program which reads an Excel spreadsheet: ======================================================================== footnote "~john-c/5421/excelread.sas &sysdate &systime" ; options linesize = 80 ; ------------------------------------------------------------------------ HOW TO WRITE OUT SAS SYSTEM FILES and HOW TO READ SAS SYSTEM FILES The following program writes out a SAS dataset as a permanent file: ------------------------------------------------------------------------ proc import datafile = "auto.xls" out = auto replace ; run ; proc print data = auto ; run ; ------------------------------------------------------------------------ libname sasperm '' ; footnote "~john-c/5421/sasperm.sas &sysdate &systime" ; options linesize = 80 ; data indata ; input x y n ; nxy = n * x * y ; cards ; 11 24 101 5 18 88 33 12 140 25 25 100 ; run ; data sasperm.output ; set indata ; run ; proc print data = sasperm.output ; title "SAS program to illustrate making a SAS dataset into a permanent file' ; run ; ------------------------------------------------------------------------ Notes: 1. Note that a library name is specified: libname sasperm '' ; This is simply a "local" name for the library in which the file will be stored. 2. Note that the actual name of the permanent file is indicated in the line data sasperm.output ; Here the prefix is the "local" library name. The file is actually stored on the computer's hard disk as: output.ssd01 The format for storage is determined by SAS. You cannot examine the contents of this file with an ordinary text editor. It is readable by SAS as will be shown below. ------------------------------------------------------------------------ The following program reads in a SAS dataset stored as a permanent file: ------------------------------------------------------------------------ libname sasperm '' ; footnote "~john-c/5421/inperm.sas &sysdate &systime" ; options linesize = 80 ; data sasperm.in ; set sasperm.output ; run ; proc contents data = sasperm.in ; title1 'Example of the use of proc contents ...' ; run ; proc print data = sasperm.in ; title 'SAS program to illustrate reading a permanent SAS datafile ' ; run ; ------------------------------------------------------------------------ Example of the use of proc contents ... 1 13:51 Monday, January 24, 2005 CONTENTS PROCEDURE Data Set Name: SASPERM.IN Observations: 4 Member Type: DATA Variables: 4 Engine: V612 Indexes: 0 Created: 13:51 Monday, January 24, 2005 Observation Length: 32 Last Modified: 13:51 Monday, January 24, 2005 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- Data Set Page Size: 8192 Number of Data Set Pages: 1 File Format: 607 First Data Page: 1 Max Obs per Page: 254 Obs in First Data Page: 4 File Name: /home/gnome/john-c/5421/in.ssd01 Inode Number: 571724 Access Permission: rw-r--r-- Owner Name: john-c File Size (bytes): 16384 -----Alphabetic List of Variables and Attributes----- # Variable Type Len Pos ----------------------------------- 3 N Num 8 16 4 NXY Num 8 24 1 X Num 8 0 2 Y Num 8 8 ~john-c/5421/inperm.sas 24JAN05 13:51 SAS program to illustrate reading a permanent SAS datafile 2 13:51 Monday, January 24, 2005 OBS X Y N NXY 1 11 24 101 26664 2 5 18 88 7920 3 33 12 140 55440 4 25 25 100 62500 ~john-c/5421/inperm.sas 24JAN05 13:51 ---------------------------------------------------------------------------------- Notes: 1. Note that again a library name is specified: libname sasperm '' ; This is simply a "local" name for the library in which the file will be stored. 2. Note that the actual name of the permanent file is indicated in the line set sasperm.output ; Here the prefix is the "local" library name. The file is actually stored on the computer's hard disk as: output.ssd01 SAS System files, like that just shown, include not only the data, but also variable names and labels and formatting information. If the data that was stored was a date, this fact is included in the stored information also. Running PROC CONTENTS, as in the example just given, gives you a printout of this kind of information for each variable, as well as how many observations are on the file, etc. Note that 'proc contents' is used in the above program. This is a helpful utility procedure that tells you the nature of the data on on the input file. There are several reasons for using SAS system files: 1. You do not have to recreate all the variables, their labels, etc., as you did in the original program. 2. It saves computer time. 3. SAS system files are often portable between different kinds of computers and operating systems. 4. The data are stored fairly efficiently; that is, system files generally do not take up as much space on disk as text files. There are some disadvantages also: 1. The actual definitions of the variables are not included. Thus if there was a mistake made in constructing the variables that are on the system file, you may not be able to discover it and you may not be able to fix it. 2. If the files from which the SAS system file was made are changed, then you will probably need to run the program over again which created the system file. It is a good idea to have the date of creation of the system file as a variable on the file itself. ============================================================================= ; How to identify specific lines in a file ... Within a given data step, SAS creates an 'automatic variable' called _n_, which represent the observation number. It is not, however, an actual variable on the data file, and it cannot be used directly in procedures. However it can be used to create an actual variable on the file, which then CAN be used later on. Here is an example: ---------------------------------------------------------------------------------- options linesize = 80 ; footnote "~john-c/5421/n.sas &sysdate &systime" ; data ten ; input x @@ ; obsnum = _n_ ; cards ; 1 4 9 16 25 36 49 64 81 100 ; run ; proc print n = 'Number printed: ' ; where obsnum ge 3 and obsnum le 7 ; sum obsnum x ; var obsnum x ; title1 'Printout of a selected subset of observations from a dataset ...' ; title2 'obsnum = observation number - where obsnum ge 3 and obsnum le 7' ; run ; ---------------------------------------------------------------------------------- Printout from this program: ---------------------------------------------------------------------------------- Printout of a selected subset of observations from a dataset ... 1 obsnum = observation number - where obsnum ge 3 and obsnum le 7 14:57 Sunday, January 23, 2005 Obs obsnum x 3 3 9 4 4 16 5 5 25 6 6 36 7 7 49 ====== === 25 135 Number printed: 5 ~john-c/5421/n.sas 23JAN05 14:57 ---------------------------------------------------------------------------------- Note that there are two other features in the program n.sas: 1. The 'proc print' line includes an "n = " option: proc print n = 'Number printed: ' ; This causes the line below the main table in the printout: Number printed: 5 2. The proc print procedure includes the line: sum obsnum x ; which causes the printout to include the sum of the two specified variables. ================================================================================== How to identify the LAST observation on a given dataset: The following is an example which shows how to identify the last observation on a file: ---------------------------------------------------------------------------------- options linesize = 80 ; footnote "~john-c/5421/firstlast.sas &sysdate &systime" ; data somedata ; infile 'somedata.extern' end = lastobs ; input id age weight ; obsnum = _n_ ; if lastobs eq 1 then do ; file 'somedata.out' ; put 'Last observation on file somedata: ' obsnum ' age, weight = ' age weight ; end ; run ; proc print data = somedata (firstobs = 3 obs = 4) n = 'Number printed: ' ; sum age weight ; var id age weight ; title1 'Use of firstobs, obs in proc print ...' ; run ; ---------------------------------------------------------------------------------- Printout from the program 'firstlast.sas': 1. The file 'somedata.out' : ---------------------------------------------------------------------------------- Last observation on file somedata: 10 age, weight = 2 23 ---------------------------------------------------------------------------------- 2. The listing file, 'firstlast.lst' : ---------------------------------------------------------------------------------- Use of firstobs, obs in proc print ... 1 15:44 Sunday, January 23, 2005 Obs id age weight 3 3 16 323 4 4 51 108 === ====== 67 431 Number printed: 2 ~john-c/5421/firstlast.sas 23JAN05 15:44 ---------------------------------------------------------------------------------- Note that the data in this example are on an external file, 'somedata.extern'. Note that the last observation is detected in the 'infile' statement: infile 'somedata.extern' end = lastobs ; This creates a variable called 'lastobs'. When the end of the file is encountered, lastobs = 1. For observations previous to the last one, lastobs = 0. Note that section which is executed conditional on lastobs = 1: if lastobs eq 1 then do ; file 'somedata.out' ; put 'Last observation on file somedata: ' obsnum ' age, weight = ' age weight ; end ; This writes out the file 'somedata.out', which describes the number of the last observation on the file, and the values of age and weight for that observation. Finally, the 'proc print' line: proc print data = somedata (firstobs = 3 obs = 4) n = 'Number printed: ' ; This specifies that the first observation to be printed is the 3rd observation on the file, and that the last observation to be printed is the 4th observation on the file. ============================================================================= ; PROBLEMS Problem 1. The first data file in Chapter 8 of Der and Everitt's text has observations for both males (M) and females (F). Create two SAS data sets, one of which has only the observations for males and the other of which has only the observations for females. Use proc print to print each of these datasets separately. Problem 2. Create an Excel file where the first column is the first name of a country in Africa, and the second column is the the name of that country's capital, and the third column is the population of the country. Include at least 10 African countries, ordered alphabetically within the Excel file. Read the Excel file using a SAS program. Sort the list by the size of the population. Print the resulting dataset. Problem 3. A data file has the following structure: Obs ID Age Weight ----- ------ ----- -------- 1 A00155 23 104 2 A00155 28 120 3 A02042 31 168 4 A02042 36 167 5 A02503 44 185 6 A02503 49 204 etc. That is, there are two observations for each ID, the first showing the weight of the person at a certain age and the second showing that person's weight 5 years later. Write a SAS program which prints out the change in the weight over 5 years for each person, and the mean and standard deviation of that change. n54703.001 Last update: January 24, 2005.