MORE ON COMPUTATION AND USE OF SUMMARY STATISTICS         SPH 5421 notes.010
 
     The simulated weight-gain dataset used in notes.009 is an example of data which
includes random effects.  At the beginning of the simulation, coefficients b0 and b1
were specified, but each person was given an intercept and a slope which were a
random deviation away from b0 and b1 respectively.  Also a random error of
measurement was built in for individual weight measurements.

     This is fairly close to realistic.  In real life, people have different body
sizes, partly due to genetics, and different rates at which they gain or lose
weight.

     The two kinds of analysis that were described, in which a slope and intercept
were estimated for each person, are appropriate for this kind of data. These are
simple versions of longitudinal models, about which you will learn more if you take
one of the longitudinal models classes in biostatistics.  The most commonly used SAS
procedure for this kind of analysis is PROC MIXED.  The usual objective is to
examine the effects of other variables on the rate of increase of the outcome
variable (weight in the notes.009 simulations).  For example, will the rate of
increase in weight be different for smokers and nonsmokers?

     There is a file of real de-identified data from the Lung Health Study which you can use to
address these kinds of questions:

     /home/gnome/john-c/5421/lhs.data

     The SAS program below reads in data from this file.  Note that
the variables WEIGHT0, WEIGHT1, WEIGHT2, WEIGHT3, WEIGHT4,
WEIGHT5 are weights measured at annual visits.  WEIGHT0 was measured
at baseline when the person first entered the study.  For Problem 11,
you will need to create time variables which indicate the time of each
annual visit.

==================================================================================
footnote "/home/gnome/john-c/5421/lhs.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;

OPTIONS  LINESIZE = 80 ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = LANDSCAPE
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;        

 DATA lhs ;
      infile '/home/gnome/john-c/5421/lhs.data' ;


      INPUT CASENUM  AGE GENDER BASECIGS GROUP RANDDATE DEADDATE DEADCODE
            BODYMASS F31MSTAT
            VPCQUIT1 VPCQUIT2 VPCQUIT3  VPCQUIT4 VPCQUIT5
            CIGSA0   CIGSA1   CIGSA2    CIGSA3   CIGSA4   CIGSA5
            S1MFEV   S2FEVPRE  A1FEVPRE  A2FEVPRE A3FEVPRE A4FEVPRE A5FEVPRE
                     S2FEVPOS  A1FEVPOS  A2FEVPOS A3FEVPOS A4FEVPOS A5FEVPOS
                     WEIGHT0   WEIGHT1   WEIGHT2  WEIGHT3  WEIGHT4  WEIGHT5 ;

 RUN ;

*======================================================================;
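
     PROC REG with a BY statement expects one record per person per visit,
whereas the dataset lhs has one record per person.  The following is a minimal
sketch of one way to create the time variable and rearrange the weights.  The
names long, wt, time, and weight are illustrative choices, not part of the LHS
file, and the sketch assumes that a missed visit is coded as a SAS missing
value (.) and that visits are exactly 1 year apart.

 DATA long ;
      SET lhs ;
      ARRAY wt{0:5} WEIGHT0 - WEIGHT5 ;   * the six annual weights ;
      DO time = 0 TO 5 ;                  * years since baseline ;
         weight = wt{time} ;
         IF weight NE . THEN OUTPUT ;     * assumed coding of missed visits ;
      END ;
      KEEP CASENUM VPCQUIT1 time weight ;
 RUN ;

     To work with only the first 20 people, one option is to replace the SET
statement with SET lhs (OBS = 20), since lhs has one record per person.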

PROBLEM 11

1.  Use the above datafile to compute individual slopes and intercepts
    of weight (in kilograms) versus time for the first 20 people on the
    file.  Do this in two ways:

    1) As in the first or second program-examples in notes.009 (that is, not using
       any SAS procedure).  You can assume that the weights are measured
       at 1-year intervals, with weight0 occurring at time 0.

    2) As in the third program-example in notes.009, using a BY statement
       and PROC REG.

    The results you get for slopes and intercepts from these two methods
    should be the same.

    3) Graph the weights for the first 15 people on the file as a function
       of time.  You can use either Splus or SAS/GRAPH (PROC GPLOT).  You
       should overlay the graphs for the different people on the same piece
       of paper.

    Note that the file actually contains data for 500 people.  You will get
    large, messy output from PROC REG if you accidentally run it on the whole
    file instead of on a subset of 10 or 20 people.  Also the graph will look
    very messy.

    But also note that you can suppress printing from PROC REG as follows:

                       proc reg noprint outest = regest ;
                            by casenum ;
                            model weight = time ;
                       run ;

    This enables you to do a large number of PROC REGs without creating
    a monstrous pile of output.  The results for each CASENUM are stored
    sequentially in the file 'regest'.
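
    As a sketch of how the contents of 'regest' might be inspected: in an
    OUTEST= dataset, the estimated intercept is stored in the variable
    INTERCEPT, the slope is stored under the name of the regressor (here
    TIME), and the root mean square error is stored in _RMSE_.  The dataset
    name 'long' (one record per person per visit, with variables casenum,
    time, and weight) and the names 'slopes', b0, and b1 are assumptions
    made for illustration.

                       proc sort data = long ;
                            by casenum ;       * BY processing needs sorted data ;
                       run ;

                       proc reg data = long noprint outest = regest ;
                            by casenum ;
                            model weight = time ;
                       run ;

                       data slopes ;
                            set regest ;
                            keep casenum intercept time _rmse_ ;
                            rename intercept = b0  time = b1 ;
                       run ;

                       proc print data = slopes (obs = 20) ;
                       run ;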

2.  Investigate (for the WHOLE dataset) whether people who have quit smoking at
    year 1 have a different rate of weight change across the whole time period
    than those who are still smoking at year 1.  Smoking status at year 1 is defined by:

                    VPCQUIT1 = 0 :  smoking at 1 year
                    VPCQUIT1 = 1 :  quit smoking at 1 year

    Compute means of the rate of weight change and 95% confidence intervals for
    the two groups, and carry out an appropriate test for whether the true means
    are equal.
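
    One possible approach, sketched here under assumptions, is to attach
    VPCQUIT1 to the per-person slopes and compare the two groups with
    PROC TTEST, which prints group means with confidence limits and both the
    pooled and Satterthwaite t-tests.  The dataset 'slopes' and its variable
    b1 (one estimated slope per CASENUM) are hypothetical names; whether a
    two-sample t-test is the appropriate test is for you to justify.

                    proc sort data = lhs ;    by casenum ;  run ;
                    proc sort data = slopes ; by casenum ;  run ;

                    data slopegrp ;
                         merge slopes (in = inslope)
                               lhs (keep = casenum vpcquit1) ;
                         by casenum ;
                         * keep people with a slope and a known status ;
                         if inslope and vpcquit1 in (0, 1) ;
                    run ;

                    proc ttest data = slopegrp alpha = 0.05 ;
                         class vpcquit1 ;     * 0 = smoking, 1 = quit ;
                         var b1 ;             * per-person slope ;
                    run ;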

3.  Without using a SAS procedure, perform regression through the
    origin for the first 20 people on the LHS file.  The model is:

                      Y = b*X + e,

    where X is time (in years since baseline) and Y is CHANGE in
    weight from baseline to each annual visit, and where
    e ~ N(0, sig2).  Again, this should be done separately for each of
    the first 20 people on the file.  However your program should
    provide an OVERALL estimate of  b (by averaging the individual
    b's) and of sig2 [how?].

    Since Y is the change in weight from baseline, when time = 0,
    the change is 0.  Should the point (time = 0, change = 0) be
    used in computing the regression coefficient?  Why or why not?
    Do you get a different answer if you include this point?  Does the
    standard error of the coefficient change?  Why or why not?

    Check that your program works by comparing it to
    PROC REG with the NOINT option on the model statement.
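
    For a model with no intercept, the least-squares estimate is
    b = sum(X*Y) / sum(X*X), and the residual sum of squares can be written
    as sum(Y*Y) - b * sum(X*Y).  The following is a minimal sketch of that
    computation in a DATA step.  The names noint, wt, sxx, sxy, syy, n, b,
    and sse are illustrative, and the sketch assumes that missing weights are
    coded as SAS missing values (.).  The loop runs over visits 1 to 5 only;
    whether the baseline point belongs in the sums is the question posed
    above, and how to combine the per-person results into overall estimates
    is left to you.

                      data noint ;
                           set lhs (obs = 20) ;   * first 20 people ;
                           array wt{0:5} weight0 - weight5 ;
                           sxx = 0 ;  sxy = 0 ;  syy = 0 ;  n = 0 ;
                           do t = 1 to 5 ;
                              if wt{t} ne . and wt{0} ne . then do ;
                                 y   = wt{t} - wt{0} ;   * change from baseline ;
                                 sxx = sxx + t * t ;
                                 sxy = sxy + t * y ;
                                 syy = syy + y * y ;
                                 n   = n + 1 ;
                              end ;
                           end ;
                           if n > 0 then do ;
                              b   = sxy / sxx ;          * slope through origin ;
                              sse = syy - b * sxy ;      * residual sum of squares ;
                           end ;
                           keep casenum n b sse ;
                      run ;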

/home/gnome/john-c/5421/notes.010    Last update: October 3, 2012.