MORE ON COMPUTATION OF SUMMARY STATISTICS SPH 5421 notes.009 It is not terribly difficult to write SAS code which will compute least squares estimates of coefficients for simple (univariate) linear regression. There are two ways to approach writing a simple linear regression program in SAS. One is appropriate when you have a relatively small number of observations per participant, probably repeated measures of an outcome variable as a function of time. The other is appropriate when you have a large number of observations for different individuals on two variables that may be linked by a linear relationship. The first situation can be exemplified as follows. Suppose you have up to six observations on each of 500 people. The outcome variable of interest is the person's body weight. You want to know whether it is increasing or declining as a function of time. Each line of your input file has one observation per person, with body weight measured at 0 (baseline), 1, 2, 3, 4, and 5 years. You want to regress body weight on time (in years). The data for a given person may look like the following: Time: 0 1 2 3 4 5 Weight: 140 138 145 148 . 151 If you graph this person's body weight as a function of time, it will look like a slight trend toward increasing weight. A straight line appears to fit the data reasonably well. The intercept is at about 139. There is a missing measurement at year 4. The underlying linear relationship could be weight = b0 + b1 * time + e, where b0 and b1 are unknown coefficients and e is measurement error (presumed normally distributed with mean 0 and standard deviation sigma). Below is SAS code which computes estimates of b0 and b1 for this one person. Again line numbers have been added to facilitate the discussion: ================================================================================ 01 array time(6) time1 time2 time3 time4 time5 time6 ; 02 array weight(6) weight1 weight2 weight3 weight4 weight5 weight6 ; 03 04 . 05 . 06 . 07 08 09 tsum = 0 ; 10 wsum = 0 ; 11 twsum = 0 ; 12 count = 0 ; 13 t2sum = 0 ; 14 w2sum = 0 ; 15 16 do i = 1 to 6 ; 17 18 if time(i) ne . and weight(i) ne . then do ; 19 count = count + 1 ; 20 tsum = tsum + time(i) ; 21 wsum = wsum + weight(i) ; 22 twsum = twsum + time(i) * weight(i) ; 23 t2sum = t2sum + time(i) * time(i) ; 24 w2sum = w2sum + weight(i) * weight(i) ; 25 end ; 26 27 end ; 28 29 30 if count ge 2 then do ; 31 32 b1hat = (twsum - tsum * wsum / count) / (t2sum - tsum * tsum / count) ; 33 taver = tsum / count ; 34 waver = wsum / count ; 35 b0hat = waver - b1hat * taver ; 36 37 end ; 38 ================================================================================ This program-fragment is a bit messier than you might have guessed because of the necessity of having to deal with missing values. If either time is missing or weight is missing, that datapoint cannot be used. The basic thing this program does is count numbers of nonmissing values (in line 18), and then compute sums: sums of times (line 20), of weights (lines 21), of time * weight (line 22), of time squared (line 23), and of weight squared (line 24). These are the summary statistics needed for linear regression. Actually the sum of squared weights is not needed here, but it is useful for computing variances later. The last section of code, lines 30-37, uses the summary statistics to compute the slope estimate b1hat and the intercept estimate b0hat. ANOTHER WAY TO HANDLE THE MISSING-VALUE PROBLEM There is another way to deal with missing values for this problem. SAS has several functions which deal with collections of values: N, SUM, MEAN, STD. For example, if x = sum(1, 5, ., 6, 2, .) ; then x will have the value which is the sum of the nonmissing elements in parenthesis: x = 1 + 5 + 6 + 2 = 14. These functions make it possible to shorten the program above without losing clarity: ================================================================================== 01 array time(6) time1 time2 time3 time4 time5 time6 ; 02 array weight(6) weight1 weight2 weight3 weight4 weight5 weight6 ; 03 04 . 05 . 06 . 07 08 do i = 1 to 6 ; 09 if time(i) eq . or weight(i) eq . then do ; 10 time(i) = . ; weight(i) = . ; 11 end ; 12 end ; 13 14 count = n(time1, time2, time3, time4, time5, time6) ; 15 16 tsum = sum(time1, time2, time3, time4, time5, time6) ; 17 18 wsum = sum(weight1, weight2, weight3, weight4, weight5, weight6) ; 19 20 twsum = sum(time1 * weight1, time2 * weight2, time3 * weight3, 21 time4 * weight4, time5 * weight5, time6 * weight6) ; 22 23 t2sum = sum(time1**2, time2**2, time3**2, time4**2, time5**2, time6**2) ; 24 25 w2sum = sum(weight1**2, weight2**2, weight3**2, 26 weight4**2, weight5**2, weight6**2) ; 27 28 if count ge 2 then do ; 29 30 b1hat = (twsum - tsum * wsum / count) / (t2sum - tsum * tsum / count) ; 31 taver = tsum / count ; 32 waver = wsum / count ; 33 b0hat = waver - b1hat * taver ; 34 35 end ; 36 ================================================================================== The 'do' loop (lines 8-12 above) is still needed to ensure that if EITHER time or weight are missing at a given point, then the other is set to missing also and all the subsequent calculations are based only on indices for which both time and weight are not missing. The important thing to see here, however the calculations are done, is that a separate slope and intercept are computed for each person. This sort of calculation is useful in doing longitudinal analysis. One objective in longitudinal analysis is to predict rates of change of a given variable (like weight) as a function of time and of other covariates. The program code above computes a rate of change (slope b1) of weight for each individual on the file. You can use this to answer a question like: is the rate of weight gain different among people who quit smoking and people who continue to smoke? For such a question, after you have computed the slopes, you can apply a t-test to compare the mean slope for smokers versus quitters. A DIFFERENT APPROACH The datafile in the example above was organized so that each person's data was on one record: the person has, for example, six different weights recorded at six different time points, and all of this data occurs on one record in the datafile. That is often the structure for datafiles. However, SAS regression procedures like PROC REG or PROC GLM expect that the data are organized so that each datapoint is on a different record. The records are linked together by some kind of ID variable. That is, all the data for a given person has the same ID. The differences in organization are as follows: ================================================================================== FILE STRUCTURE 1 FILE STRUCTURE 2 ------------------------------------------- ---------------- Casenum wgt1 wgt2 wgt3 wgt4 wgt5 Casenum visit wgt ------- ---- ---- ---- ---- ---- ------- ----- --- 1 114 118 114 111 123 1 1 114 2 78 84 87 . 76 1 2 118 3 106 . . 115 121 1 3 114 . . . . . 1 4 111 . . . . . 1 5 123 . . . . . 2 1 78 2 2 84 2 3 87 2 4 . 2 5 76 3 1 106 3 2 . 3 3 . 3 4 115 3 5 121 . . . ================================================================================== Both of these files contain the same data. File Structure 2 has 5 times as many observations as File Structure 1. Note that CASENUM is the ID which links together the observations in File Structure 2. If you have the data in the form of File Structure 1 and you want to use a regression procedure like PROC REG, you will have to write out a new file in the format of File Structure 2. Here is how you do this: ================================================================================== data filstruc1 ; input casenum weight1 weight2 weight3 weight4 weight5 weight6 ; run ; data filstruc2 ; set filstruc1 ; keep casenum visit weight ; weight = weight1 ; time = 1 ; output ; weight = weight2 ; time = 2 ; output ; weight = weight3 ; time = 3 ; output ; weight = weight4 ; time = 4 ; output ; weight = weight5 ; time = 5 ; output ; weight = weight6 ; time = 6 ; output ; run ; proc sort ; by casenum ; proc reg ; by casenum ; model weight = time ; endsas ; ================================================================================== The 'keep' statement causes filstruc2 to include only the three variables, casenum, weight, and time. If the keep statement were omitted, then weight1, weight2, ... , weight6 would appear as well as the variable weight. As shown, it is possible to compute slopes for individual participants using File Structure 2 also. Note that before carrying out the regression, the data are sorted by casenum. Then PROC REG is invoked, with a 'by casenum' statement. This will compute (and print) separate regressions for each person. If you have a file with data for several hundred people, this method can generate a lot of output (but printing can be suppressed, as discussed later). As mentioned above, you might want to use the individual's slope (time rate-of-change of weight) or intercept as an outcome variable in another regression. In order to do this you would need to output the parameter estimates to a file. This can be done by modifying the 'proc reg' statement as follows: proc reg outest = regest ; by casenum ; model weight = time ; Below is a complete program which shows how this works, followed by part of the printout from the program. Note that the program also graphs line-plots for each person. The plot can be viewed on an x-terminal by invoking ghostview: ghostview gsas.grf and can be printed on a postscript printer by using lpr: lpr gsas.grf The file containing the program itself is: /home/walleye/john-c/5421/regsim.sas ================================================================================ FILENAME GRAPH 'gsas.grf' ; GOPTIONS RESET = GLOBAL ROTATE = PORTRAIT FTEXT = SWISSB DEVICE = PSCOLOR GACCESS = SASGASTD GSFNAME = GRAPH GSFMODE = REPLACE GUNIT = PCT BORDER CBACK = WHITE HTITLE = 2 HTEXT = 1 ; *===================================================================== ; footnote "program: john-c/5421/regsim &sysdate &systime" ; options linesize = 80 ; data byperson ; array weight(6) weight1 weight2 weight3 weight4 weight5 weight6 ; array time(6) time1 time2 time3 time4 time5 time6 ; seed = 20000127 ; sigma0 = 20 ; sigma1 = 3 ; sigmae = 6 ; nperson = 10 ; ntimes = 6 ; probmiss = .1 ; b0 = 110 ; b1 = 2 ; do casenum = 1 to nperson ; b0c = b0 + sigma0 * rannor(seed) ; b1c = b1 + sigma1 * rannor(seed) ; do j = 1 to ntimes ; rmiss = ranuni(seed) ; time(j) = j ; weight(j) = . ; if rmiss gt probmiss then do ; weight(j) = b0c + b1c * time(j) + sigmae * rannor(seed) ; end ; end ; output ; end ; run ; proc print data = byperson ; var seed casenum weight1 weight2 weight3 weight4 weight5 weight6 ; title1 'Randomly generated weight-gain data: by person.' ; proc sort ; by casenum ; data splitup ; keep casenum time weight ; set byperson ; time = time1 ; weight = weight1 ; output ; time = time2 ; weight = weight2 ; output ; time = time3 ; weight = weight3 ; output ; time = time4 ; weight = weight4 ; output ; time = time5 ; weight = weight5 ; output ; time = time6 ; weight = weight6 ; output ; run ; proc reg outest = regest ; by casenum ; model weight = time ; title1 'PROC REG, by person ...' ; proc print data = regest ; proc means n mean stddev min max data = regest ; var intercep time ; symbol1 i = j v = NONE w = 2 h = 2 f = swissb l = 1 c = black ; symbol2 i = j v = NONE w = 2 h = 2 f = swissb l = 2 c = black ; symbol3 i = j v = NONE w = 2 h = 2 f = swissb l = 3 c = black ; symbol4 i = j v = NONE w = 2 h = 2 f = swissb l = 4 c = black ; symbol5 i = j v = NONE w = 2 h = 2 f = swissb l = 5 c = black ; symbol6 i = j v = NONE w = 2 h = 2 f = swissb l = 6 c = black ; symbol7 i = j v = NONE w = 2 h = 2 f = swissb l = 7 c = black ; symbol8 i = j v = NONE w = 2 h = 2 f = swissb l = 8 c = black ; symbol9 i = j v = NONE w = 2 h = 2 f = swissb l = 9 c = black ; symbol10 i = j v = NONE w = 2 h = 2 f = swissb l = 10 c = black ; symbol11 i = j v = NONE w = 2 h = 2 f = swissb l = 11 c = black ; symbol12 i = j v = NONE w = 2 h = 2 f = swissb l = 12 c = black ; proc gplot data = splitup ; plot weight * time = casenum / haxis = axis1 ; axis1 order = 0 to 7 by 1 ; title1 h = 2 f = swissb 'Plot of Individual Weight Gains vs Time' ; title2 h = 2 f = swissb 'Simulated Data, 10 People' ; run ; ================================================================================ Randomly generated weight-gain data: by person. 1 19:17 Thursday, January 27, 2000 OBS SEED CASENUM WEIGHT1 WEIGHT2 WEIGHT3 WEIGHT4 WEIGHT5 WEIGHT6 1 20000127 1 91.545 102.438 107.308 . 101.081 123.224 2 20000127 2 112.864 109.058 111.153 107.183 109.205 102.882 3 20000127 3 124.563 131.927 128.347 142.570 132.426 146.963 4 20000127 4 146.708 168.673 160.849 162.241 165.137 166.079 5 20000127 5 120.168 121.934 135.480 120.177 143.883 . 6 20000127 6 105.461 . 110.599 . 119.471 . 7 20000127 7 121.027 137.399 127.861 128.941 . 138.852 8 20000127 8 98.032 95.300 99.945 103.112 102.912 96.992 9 20000127 9 129.300 121.120 123.028 118.114 119.684 113.134 10 20000127 10 107.472 . 105.533 116.697 102.381 105.584 program: john-c/5421/regsim 27JAN00 19:17 --------------------------------------------------------------------------------- PROC REG, by person ... 2 19:17 Thursday, January 27, 2000 ---------------------------------- CASENUM=1 ----------------------------------- Model: MODEL1 Dependent Variable: WEIGHT Analysis of Variance Sum of Mean Source DF Squares Square F Value Prob>F Model 1 336.38871 336.38871 4.949 0.1126 Error 3 203.92068 67.97356 C Total 4 540.30939 Root MSE 8.24461 R-square 0.6226 Dep Mean 105.11928 Adj R-sq 0.4968 C.V. 7.84310 Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Parameter=0 Prob > |T| INTERCEP 1 90.083176 7.69930304 11.700 0.0013 TIME 1 4.422384 1.98795150 2.225 0.1126 program: john-c/5421/regsim 27JAN00 19:17 --------------------------------------------------------------------------------- Printout of the output data set regest: OBS CASENUM _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ INTERCEP TIME WEIGHT 1 1 MODEL1 PARMS WEIGHT 8.24461 90.083 4.42238 -1 2 2 MODEL1 PARMS WEIGHT 2.18489 114.068 -1.52682 -1 3 3 MODEL1 PARMS WEIGHT 5.80824 121.694 3.64914 -1 4 4 MODEL1 PARMS WEIGHT 6.99318 152.851 2.50392 -1 5 5 MODEL1 PARMS WEIGHT 9.28116 114.627 4.56720 -1 6 6 MODEL1 PARMS WEIGHT 1.52440 101.337 3.50238 -1 7 7 MODEL1 PARMS WEIGHT 6.63631 123.199 2.38020 -1 8 8 MODEL1 PARMS WEIGHT 3.34326 97.302 0.59439 -1 9 9 MODEL1 PARMS WEIGHT 2.67682 129.736 -2.57299 -1 10 10 MODEL1 PARMS WEIGHT 6.19498 109.296 -0.46387 -1 program: john-c/5421/regsim 27JAN00 19:17 --------------------------------------------------------------------------------- Means of the Intercepts and Slopes 19:17 Thursday, January 27, 2000 Variable Label N Mean Std Dev Minimum Maximum ------------------------------------------------------------------------------- INTERCEP Intercept 10 115.4192153 17.9950467 90.0831762 152.8507688 TIME 10 1.7055947 2.5465471 -2.5729893 4.5671991 ------------------------------------------------------------------------------- program: john-c/5421/regsim 27JAN00 19:17 ================================================================================ (Topic continued in notes.010) /home/gnome/john-c/5421/notes.009 Last update: February 13, 2001