MORE ON COMPUTATION OF SUMMARY STATISTICS           SPH 5421 notes.009
 
     It is not terribly difficult to write SAS code which will compute least squares
estimates of coefficients for simple (univariate) linear regression.

     There are two ways to approach writing a simple linear regression program in
SAS.  One is appropriate when you have a relatively small number of observations per
participant, probably repeated measures of an outcome variable as a function of
time.  The other is appropriate when you have a large number of observations for
different individuals on two variables that may be linked by a linear relationship.

     The first situation can be exemplified as follows.  Suppose you have up to six
observations on each of 500 people.  The outcome variable of interest is the
person's body weight.  You want to know whether it is increasing or declining as a
function of time.

     Each line of your input file has one observation per person, with body weight
measured at 0 (baseline), 1, 2, 3, 4, and 5 years.  You want to regress body weight
on time (in years).

     The data for a given person may look like the following:

     Time:      0     1     2     3     4     5
     Weight:   140   138   145   148    .    151

     If you graph this person's body weight as a function of time, it will look like
a slight trend toward increasing weight.  A straight line appears to fit the data
reasonably well.  The intercept is at about 139. There is a missing measurement at
year 4.

     The underlying linear relationship could be

     weight = b0 + b1 * time + e,

     where b0 and b1 are unknown coefficients and e is measurement error (presumed
normally distributed with mean 0 and standard deviation sigma).

     Below is SAS code which computes estimates of b0 and b1 for this one person.  
Again line numbers have been added to facilitate the discussion:

================================================================================

01      array time(6)   time1   time2   time3   time4   time5   time6 ;
02      array weight(6) weight1 weight2 weight3 weight4 weight5 weight6 ;
03
04          .
05          .
06          .
07
08
09      tsum = 0 ;
10      wsum = 0 ;
11      twsum = 0 ;
12      count = 0 ;
13      t2sum = 0 ;
14      w2sum = 0 ;
15
16      do i = 1 to 6 ;
17
18         if time(i) ne . and weight(i) ne . then do ;
19            count = count + 1 ;
20            tsum = tsum + time(i) ;
21            wsum = wsum + weight(i) ;
22            twsum = twsum + time(i) * weight(i) ;
23            t2sum = t2sum + time(i) * time(i) ;
24            w2sum = w2sum + weight(i) * weight(i) ;
25         end ;
26
27      end ;
28
29
30      if count ge 2 then do ;
31
32         b1hat = (twsum - tsum * wsum / count) / (t2sum - tsum * tsum / count) ;
33         taver = tsum / count ;
34         waver = wsum / count ;
35         b0hat = waver - b1hat * taver ;
36
37      end ;
38
================================================================================

     This program-fragment is a bit messier than you might have guessed because of
the necessity of having to deal with missing values.  If either time is missing or
weight is missing, that datapoint cannot be used.

     The basic thing this program does is count numbers of nonmissing values (in
line 18), and then compute sums: sums of times (line 20), of weights (lines
21), of time * weight (line 22), of time squared (line 23), and of weight
squared (line 24).  These are the summary statistics needed for linear
regression.  Actually the sum of squared weights is not needed here, but it is
useful for computing variances later.

     The last section of code, lines 30-37, uses the summary statistics to compute
the slope estimate b1hat and the intercept estimate b0hat.



ANOTHER WAY TO HANDLE THE MISSING-VALUE PROBLEM

     There is another way to deal with missing values for this problem.  SAS
has several functions which deal with collections of values: N, SUM, MEAN, STD.
For example, if

     x = sum(1, 5, ., 6, 2, .) ;

then x will have the value which is the sum of the nonmissing elements in
parenthesis: x = 1 + 5 + 6 + 2 = 14.

     These functions make it possible to shorten the program above without
losing clarity:

==================================================================================

01      array time(6)   time1   time2   time3   time4   time5   time6 ;
02      array weight(6) weight1 weight2 weight3 weight4 weight5 weight6 ;
03
04          .
05          .
06          .
07
08      do i = 1 to 6 ;
09         if time(i) eq . or weight(i) eq . then do ;
10         time(i) = . ; weight(i) = . ;
11         end ;
12      end ;
13
14      count = n(time1, time2, time3, time4, time5, time6) ;
15
16      tsum  = sum(time1, time2, time3, time4, time5, time6) ;
17
18      wsum  = sum(weight1, weight2, weight3, weight4, weight5, weight6) ;
19
20      twsum = sum(time1 * weight1, time2 * weight2, time3 * weight3,
21                  time4 * weight4, time5 * weight5, time6 * weight6) ;
22
23      t2sum = sum(time1**2, time2**2, time3**2, time4**2, time5**2, time6**2) ;
24
25      w2sum = sum(weight1**2, weight2**2, weight3**2,
26                  weight4**2, weight5**2, weight6**2) ;
27
28      if count ge 2 then do ;
29
30         b1hat = (twsum - tsum * wsum / count) / (t2sum - tsum * tsum / count) ;
31         taver = tsum / count ;
32         waver = wsum / count ;
33         b0hat = waver - b1hat * taver ;
34
35      end ;
36
==================================================================================

     The 'do' loop (lines 8-12 above) is still needed to ensure that if EITHER
time or weight are missing at a given point, then the other is set to missing
also and all the subsequent calculations are based only on indices for which both
time and weight are not missing.

     The important thing to see here, however the calculations are done,
is that a separate slope and intercept are computed for each person.

     This sort of calculation is useful in doing longitudinal analysis.  One
objective in longitudinal analysis is to predict rates of change of a given variable
(like weight) as a function of time and of other covariates.  The program code above
computes a rate of change (slope b1) of weight for each individual on the file.  
You can use this to answer a question like: is the rate of weight gain different
among people who quit smoking and people who continue to smoke?  For such a
question, after you have computed the slopes, you can apply a t-test to compare the
mean slope for smokers versus quitters.


A DIFFERENT APPROACH

     The datafile in the example above was organized so that each person's data was
on one record: the person has, for example, six different weights recorded at six
different time points, and all of this data occurs on one record in the datafile.  
That is often the structure for datafiles.  However, SAS regression procedures like
PROC REG or PROC GLM expect that the data are organized so that each datapoint is on
a different record.  The records are linked together by some kind of ID variable.  
That is, all the data for a given person has the same ID.  The differences in
organization are as follows:
       
==================================================================================

                  FILE STRUCTURE 1                         FILE STRUCTURE 2
       -------------------------------------------         ----------------		
       Casenum  wgt1   wgt2   wgt3   wgt4   wgt5           Casenum visit wgt
       -------  ----   ----   ----   ----   ----           ------- ----- ---
          1      114    118    114    111    123              1      1   114        
          2       78     84     87      .     76              1      2   118
          3      106      .      .    115    121              1      3   114
              .       .      .      .      .                  1      4   111
              .       .      .      .      .                  1      5   123
              .       .      .      .      .                  2      1    78
                                                              2      2    84
                                                              2      3    87
                                                              2      4     .
                                                              2      5    76
                                                              3      1   106
                                                              3      2     .
                                                              3      3     .
                                                              3      4   115
                                                              3      5   121

                                                                     .
                                                                     .
                                                                     .

==================================================================================

     Both of these files contain the same data.  File Structure 2 has 5 times as
many observations as File Structure 1.  Note that CASENUM is the ID which links
together the observations in File Structure 2.

     If you have the data in the form of File Structure 1 and you want to use a
regression procedure like PROC REG, you will have to write out a new file in the
format of File Structure 2.  Here is how you do this:

==================================================================================

     data filstruc1 ;
          input casenum weight1 weight2 weight3 weight4 weight5 weight6 ;

     run ;

     data filstruc2 ;
          set filstruc1 ;
          keep casenum visit weight ;

          weight = weight1 ; time = 1 ; output ;
          weight = weight2 ; time = 2 ; output ;
          weight = weight3 ; time = 3 ; output ;
          weight = weight4 ; time = 4 ; output ;
          weight = weight5 ; time = 5 ; output ;
          weight = weight6 ; time = 6 ; output ;

     run ;

     proc sort ; by casenum ;

     proc reg ;
          by casenum ;
          model weight = time ;

     endsas ;

==================================================================================

     The 'keep' statement causes filstruc2 to include only the three variables,
casenum, weight, and time.  If the keep statement were omitted, then weight1,
weight2, ... , weight6 would appear as well as the variable weight.

     As shown, it is possible to compute slopes for individual participants using
File Structure 2 also.  Note that before carrying out the regression, the data are
sorted by casenum.  Then PROC REG is invoked, with a 'by casenum' statement.  This
will compute (and print) separate regressions for each person.  If you have a file
with data for several hundred people, this method can generate a lot of output (but
printing can be suppressed, as discussed later).

     As mentioned above, you might want to use the individual's slope (time
rate-of-change of weight) or intercept as an outcome variable in another regression.  
In order to do this you would need to output the parameter estimates to a file.
This can be done by modifying the 'proc reg' statement as follows:

     proc reg outest = regest ;
          by casenum ;
          model weight = time ;

     Below is a complete program which shows how this works, followed by part of the
printout from the program.  Note that the program also graphs line-plots for each
person.  The plot can be viewed on an x-terminal by invoking ghostview:

     ghostview gsas.grf

and can be printed on a postscript printer by using lpr:

     lpr gsas.grf

     The file containing the program itself is:

         /home/walleye/john-c/5421/regsim.sas

================================================================================

FILENAME GRAPH 'gsas.grf' ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = PORTRAIT
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;        
footnote "program: john-c/5421/regsim &sysdate &systime" ;

options linesize = 80 ;

data byperson ;
     array weight(6) weight1 weight2 weight3 weight4 weight5 weight6 ;
     array time(6) time1 time2 time3 time4 time5 time6 ;

     seed = 20000127 ;

     sigma0 = 20 ;
     sigma1 = 3 ;
     sigmae = 6 ;

     nperson = 10 ;
     ntimes = 6 ;
     probmiss = .1 ;

     b0 = 110 ;
     b1 = 2 ;

     do casenum = 1 to nperson ;

        b0c = b0 + sigma0 * rannor(seed) ;
        b1c = b1 + sigma1 * rannor(seed) ;

        do j = 1 to ntimes ;

           rmiss = ranuni(seed) ;
           time(j) = j ;
           weight(j) = . ;

           if rmiss gt probmiss then do ;

              weight(j) = b0c + b1c * time(j) + sigmae * rannor(seed) ;

           end ;

         end ;

         output ;

      end ;

   run ;

   proc print data = byperson ;
        var seed casenum weight1 weight2 weight3 weight4 weight5 weight6 ;
   title1 'Randomly generated weight-gain data: by person.' ;

   proc sort ; by casenum ;

   data splitup ;
        keep casenum time weight ;
        set byperson ;

        time = time1 ; weight = weight1 ; output ;
        time = time2 ; weight = weight2 ; output ;
        time = time3 ; weight = weight3 ; output ;
        time = time4 ; weight = weight4 ; output ;
        time = time5 ; weight = weight5 ; output ;
        time = time6 ; weight = weight6 ; output ;

   run ;

   proc reg outest = regest ;
        by casenum ;
        model weight = time ;
   title1 'PROC REG, by person ...' ;

   proc print data = regest ;

   proc means n mean stddev min max data = regest ;
        var intercep time ;

   symbol1  i = j v = NONE w = 2 h = 2 f = swissb l =  1 c = black ;
   symbol2  i = j v = NONE w = 2 h = 2 f = swissb l =  2 c = black ;
   symbol3  i = j v = NONE w = 2 h = 2 f = swissb l =  3 c = black ;
   symbol4  i = j v = NONE w = 2 h = 2 f = swissb l =  4 c = black ;
   symbol5  i = j v = NONE w = 2 h = 2 f = swissb l =  5 c = black ;
   symbol6  i = j v = NONE w = 2 h = 2 f = swissb l =  6 c = black ;
   symbol7  i = j v = NONE w = 2 h = 2 f = swissb l =  7 c = black ;
   symbol8  i = j v = NONE w = 2 h = 2 f = swissb l =  8 c = black ;
   symbol9  i = j v = NONE w = 2 h = 2 f = swissb l =  9 c = black ;
   symbol10 i = j v = NONE w = 2 h = 2 f = swissb l = 10 c = black ;
   symbol11 i = j v = NONE w = 2 h = 2 f = swissb l = 11 c = black ;
   symbol12 i = j v = NONE w = 2 h = 2 f = swissb l = 12 c = black ;

   proc gplot data = splitup ;
        plot  weight * time = casenum / haxis = axis1 ;
        axis1 order = 0 to 7 by 1 ;
   title1 h = 2 f = swissb 'Plot of Individual Weight Gains vs Time' ;
   title2 h = 2 f = swissb 'Simulated Data, 10 People' ;
   run ;

================================================================================

                Randomly generated weight-gain data: by person.                1

                                                19:17 Thursday, January 27, 2000

  OBS    SEED    CASENUM  WEIGHT1  WEIGHT2  WEIGHT3  WEIGHT4  WEIGHT5  WEIGHT6

    1  20000127      1     91.545  102.438  107.308     .     101.081  123.224
    2  20000127      2    112.864  109.058  111.153  107.183  109.205  102.882
    3  20000127      3    124.563  131.927  128.347  142.570  132.426  146.963
    4  20000127      4    146.708  168.673  160.849  162.241  165.137  166.079
    5  20000127      5    120.168  121.934  135.480  120.177  143.883     .   
    6  20000127      6    105.461     .     110.599     .     119.471     .   
    7  20000127      7    121.027  137.399  127.861  128.941     .     138.852
    8  20000127      8     98.032   95.300   99.945  103.112  102.912   96.992
    9  20000127      9    129.300  121.120  123.028  118.114  119.684  113.134
   10  20000127     10    107.472     .     105.533  116.697  102.381  105.584
 
 
                   program: john-c/5421/regsim 27JAN00 19:17

---------------------------------------------------------------------------------
                             PROC REG, by person ...                            2
                                                19:17 Thursday, January 27, 2000

---------------------------------- CASENUM=1 -----------------------------------

Model: MODEL1  
Dependent Variable: WEIGHT                                             

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            1    336.38871    336.38871        4.949       0.1126
        Error            3    203.92068     67.97356
        C Total          4    540.30939

            Root MSE       8.24461     R-square       0.6226
            Dep Mean     105.11928     Adj R-sq       0.4968
            C.V.           7.84310

                              Parameter Estimates

                       Parameter      Standard    T for H0:               
      Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

      INTERCEP   1     90.083176    7.69930304        11.700        0.0013
      TIME       1      4.422384    1.98795150         2.225        0.1126
 
                   program: john-c/5421/regsim 27JAN00 19:17


---------------------------------------------------------------------------------
                   Printout of the output data set regest:


  OBS  CASENUM  _MODEL_  _TYPE_  _DEPVAR_   _RMSE_  INTERCEP    TIME    WEIGHT

    1      1    MODEL1   PARMS    WEIGHT   8.24461    90.083   4.42238    -1
    2      2    MODEL1   PARMS    WEIGHT   2.18489   114.068  -1.52682    -1  
    3      3    MODEL1   PARMS    WEIGHT   5.80824   121.694   3.64914    -1  
    4      4    MODEL1   PARMS    WEIGHT   6.99318   152.851   2.50392    -1  
    5      5    MODEL1   PARMS    WEIGHT   9.28116   114.627   4.56720    -1  
    6      6    MODEL1   PARMS    WEIGHT   1.52440   101.337   3.50238    -1  
    7      7    MODEL1   PARMS    WEIGHT   6.63631   123.199   2.38020    -1  
    8      8    MODEL1   PARMS    WEIGHT   3.34326    97.302   0.59439    -1  
    9      9    MODEL1   PARMS    WEIGHT   2.67682   129.736  -2.57299    -1  
   10     10    MODEL1   PARMS    WEIGHT   6.19498   109.296  -0.46387    -1  
 
 
                   program: john-c/5421/regsim 27JAN00 19:17
---------------------------------------------------------------------------------

                      Means of the Intercepts and Slopes

                                                19:17 Thursday, January 27, 2000

Variable  Label       N          Mean       Std Dev       Minimum       Maximum
-------------------------------------------------------------------------------
INTERCEP  Intercept  10   115.4192153    17.9950467    90.0831762   152.8507688
TIME                 10     1.7055947     2.5465471    -2.5729893     4.5671991
-------------------------------------------------------------------------------
 
 
                   program: john-c/5421/regsim 27JAN00 19:17
================================================================================

(Topic continued in notes.010)

/home/gnome/john-c/5421/notes.009    Last update: February 13, 2001