MORE ON COMPUTATION AND USE OF SUMMARY STATISTICS SPH 5421 notes.010
The simulated weight-gain dataset used in notes.009 is an example of data which
includes random effects. At the beginning of the simulation, coefficients b0 and b1
were specified, but each person was given an intercept and a slope which were a
random deviation away from b0 and b1 respectively. Also a random error of
measurement was built in for individual weight measurements.
This is fairly close to realistic. In real life, people have different body
sizes, partly due to genetics, and different rates at which they gain or lose
weight.
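The simulation scheme described above can be sketched roughly as follows. This is not the notes.009 program: the dataset name SIM, the seed, and every numeric value (b0, b1, and the three standard deviations) are assumptions chosen only for illustration.

```sas
data sim ;
  call streaminit(5421) ;               /* assumed seed, so the run is reproducible   */
  b0 = 70 ;  b1 = 0.5 ;                 /* assumed population intercept (kg) and slope (kg/year) */
  sd_int = 10 ;  sd_slope = 0.3 ;       /* assumed SDs of the person-level deviations */
  sd_err = 1 ;                          /* assumed SD of measurement error            */
  do id = 1 to 20 ;
    a = b0 + sd_int   * rand('normal') ;   /* intercept for this person */
    b = b1 + sd_slope * rand('normal') ;   /* slope for this person     */
    do time = 0 to 5 ;
      /* observed weight = personal line + measurement error */
      weight = a + b * time + sd_err * rand('normal') ;
      output ;
    end ;
  end ;
  keep id time a b weight ;
run ;
```

Each person contributes six records, and the true intercept and slope (a, b) are kept so that estimates can be compared against them.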
The two kinds of analysis that were described, in which a slope and intercept
were estimated for each person, are appropriate for this kind of data. These are
simple versions of longitudinal models, about which you will learn more if you take
one of the longitudinal models classes in biostatistics. The most commonly used SAS
procedure for this kind of analysis is PROC MIXED. The usual objective is to
examine the effects of other variables on the rate of increase of the outcome
variable (weight in the notes.009 simulations). For example, will the rate of
increase in weight be different for smokers and nonsmokers?
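As a rough illustration of what such a PROC MIXED analysis might look like, here is a minimal sketch. It assumes a hypothetical dataset LONG with one record per person per visit, and hypothetical variables CASENUM, TIME, WEIGHT, and a 0/1 indicator SMOKER; none of these names come from the notes, and other model specifications are possible.

```sas
proc mixed data = long ;
  class casenum ;
  /* the time*smoker coefficient estimates the difference in slope */
  /* between smokers and nonsmokers                                */
  model weight = time smoker time*smoker / solution ;
  /* each person gets a random intercept and a random slope */
  random intercept time / subject = casenum type = un ;
run ;
```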
There is a file of real de-identified data from the Lung Health Study which you can use to
address these kinds of questions:
/home/gnome/john-c/5421/lhs.data
The SAS program below reads in data from this file. Note
that the variables WEIGHT0, WEIGHT1, WEIGHT2, WEIGHT3, WEIGHT4,
WEIGHT5 are weights measured at annual visits. WEIGHT0 was measured
at baseline when the person first entered the study. For Problem 11,
you will need to create time variables which indicate the time of each
annual visit.
==================================================================================
footnote "/home/gnome/john-c/5421/lhs.sas &sysdate &systime" ;
FILENAME GRAPH 'gsas.grf' ;
OPTIONS LINESIZE = 80 ;
GOPTIONS
RESET = GLOBAL
ROTATE = LANDSCAPE
FTEXT = SWISSB
DEVICE = PSCOLOR
GACCESS = SASGASTD
GSFNAME = GRAPH
GSFMODE = REPLACE
GUNIT = PCT BORDER
CBACK = WHITE
HTITLE = 2 HTEXT = 1 ;
*===================================================================== ;
DATA lhs ;
infile '/home/gnome/john-c/5421/lhs.data' ;
INPUT CASENUM AGE GENDER BASECIGS GROUP RANDDATE DEADDATE DEADCODE
BODYMASS F31MSTAT
VPCQUIT1 VPCQUIT2 VPCQUIT3 VPCQUIT4 VPCQUIT5
CIGSA0 CIGSA1 CIGSA2 CIGSA3 CIGSA4 CIGSA5
S1MFEV S2FEVPRE A1FEVPRE A2FEVPRE A3FEVPRE A4FEVPRE A5FEVPRE
S2FEVPOS A1FEVPOS A2FEVPOS A3FEVPOS A4FEVPOS A5FEVPOS
WEIGHT0 WEIGHT1 WEIGHT2 WEIGHT3 WEIGHT4 WEIGHT5 ;
RUN ;
*======================================================================;
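One way to create the time variable is to turn each person's single record into one record per visit. The sketch below assumes the visits occur at exactly 0, 1, ..., 5 years; the dataset name LONG and the variable names TIME and WEIGHT are choices made here, not part of the program above.

```sas
data long ;
  set lhs ;
  array w{0:5} weight0 - weight5 ;   /* index the array by visit year */
  do time = 0 to 5 ;
    weight = w{time} ;
    output ;                         /* one record per person per visit */
  end ;
  keep casenum time weight vpcquit1 ;
run ;
```

The result has six records per person, which is the layout that a BY statement with PROC REG expects.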
PROBLEM 11
1. Use the above datafile to compute individual slopes and intercepts
of weight (in kilograms) versus time for the first 20 people on the
file. Do this in two ways:
1) As in the first or second program-examples in notes.009 (that is, not using
any SAS procedure). You can assume that the weights are measured
at 1-year intervals, with weight0 occurring at time 0.
2) As in the third program-example in notes.009, using a BY statement
and PROC REG.
The results you get for slopes and intercepts from these two methods
should be the same.
3) Graph the weights for the first 15 people on the file as a function
of time. You can use either Splus or SAS/GRAPH (PROC GPLOT). You
should overlay the graphs for the different people on the same piece
of paper.
Note that the file actually contains data for 500 people. You will get
large, messy output from PROC REG if you accidentally run it on the whole
file instead of on a subset of 10 or 20 people. Also the graph will look
very messy.
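For the overlaid graph, a minimal PROC GPLOT sketch is shown below. It assumes the dataset LHS created by the program above; the dataset name PLOT15 and the variable names TIME and WEIGHT are choices made here. The graphics options set earlier in the program still apply.

```sas
data plot15 ;
  set lhs (obs = 15) ;               /* first 15 people on the file */
  array w{0:5} weight0 - weight5 ;
  do time = 0 to 5 ;
    weight = w{time} ;
    output ;
  end ;
  keep casenum time weight ;
run ;

symbol1 interpol = join value = dot repeat = 15 ;  /* join each person's points */
proc gplot data = plot15 ;
  plot weight * time = casenum / nolegend ;        /* one line per CASENUM */
run ;
quit ;
```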
Note also that you can suppress printed output from PROC REG as follows:
proc reg noprint outest = regest ;
by casenum ;
model weight = time ;
run ;
This enables you to do a large number of PROC REGs without creating
a monstrous pile of output. The results are stored in the output
dataset 'regest', with one observation for each CASENUM.
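To see what was saved, you can print a few variables from the OUTEST dataset. In that dataset the estimated intercept is in the variable INTERCEPT, the estimated slope is in a variable with the same name as the regressor (here TIME), and _RMSE_ is the root mean squared error for that person's regression. The choice of variables to print is only a suggestion.

```sas
proc print data = regest ;
  var casenum intercept time _rmse_ ;   /* TIME here holds the slope estimate */
run ;
```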
2. Investigate (for the WHOLE dataset) whether people who are not smokers at
year 1 have a different rate of weight change across the whole time period
than those who are smoking at 1 year. Smoking status at 1 year is defined by:
VPCQUIT1 = 0 : smoking at 1 year
VPCQUIT1 = 1 : quit smoking at 1 year
Compute means of the rate of weight change and 95% confidence intervals for
the two groups, and carry out an appropriate test for whether the true means
are equal.
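One possible approach, sketched below, is to attach VPCQUIT1 to the individual slopes and compare the two groups with a two-sample t-test. It assumes that 'regest' was produced by the PROC REG step shown earlier, run on the whole file, and that CASENUM is unique on the LHS file. The names SLOPES and SLOPE are choices made here, and a t-test is only one reasonable choice of test.

```sas
proc sort data = lhs ;
  by casenum ;
run ;

data slopes ;
  /* in regest, the slope is stored under the regressor name TIME */
  merge regest (keep = casenum time rename = (time = slope))
        lhs    (keep = casenum vpcquit1) ;
  by casenum ;
run ;

proc ttest data = slopes ;
  class vpcquit1 ;     /* 0 = smoking at 1 year, 1 = quit at 1 year */
  var slope ;
run ;
```

PROC TTEST prints the mean and 95% confidence limits for each group along with both the pooled and the Satterthwaite versions of the test.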
3. Without using a SAS procedure, perform regression through the
origin for the first 20 people on the LHS file. The model is:
Y = b*X + e,
where X is time (in years since baseline) and Y is CHANGE in
weight from baseline to each annual visit, and where
e ~ N(0, sig2). Again, this should be done separately for each of
the first 20 people on the file. However, your program should
provide an OVERALL estimate of b (by averaging the individual
b's) and of sig2 [how?].
Since Y is the change in weight from baseline, when time = 0,
the change is 0. Should the point (time = 0, change in weight = 0) be
used in computing the regression coefficient? Why or why not?
Do you get a different answer if you include this point? Does the
standard error of the coefficient change? Why or why not?
Check that your program works by comparing it to
PROC REG with the NOINT option on the model statement.
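As a starting point, here is a sketch of the data step arithmetic for the slope only. For regression through the origin, the least-squares estimate is b = sum(X*Y) / sum(X*X). The sketch uses visits 1 through 5 only and does not estimate sig2; the dataset name RTO and the variable names are choices made here.

```sas
data rto ;
  set lhs (obs = 20) end = last ;    /* first 20 people on the file */
  array w{0:5} weight0 - weight5 ;
  sxy = 0 ;  sxx = 0 ;  n = 0 ;
  do time = 1 to 5 ;
    if w{time} ne . and weight0 ne . then do ;
      y = w{time} - weight0 ;        /* change in weight from baseline */
      sxy = sxy + time * y ;
      sxx = sxx + time * time ;
      n = n + 1 ;
    end ;
  end ;
  if sxx > 0 then do ;
    b = sxy / sxx ;                  /* slope through the origin for this person */
    bsum + b ;                       /* sum statements retain running totals     */
    nb + 1 ;
  end ;
  if last then do ;
    bbar = bsum / nb ;               /* average of the individual slopes */
    put 'Average of individual slopes: ' bbar ;
  end ;
  keep casenum n b ;
run ;
```

The average slope is written to the SAS log, and the individual slopes are kept in RTO for comparison with PROC REG.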
/home/gnome/john-c/5421/notes.010 Last update: October 3, 2012.