MORE ON COMPUTATION AND USE OF SUMMARY STATISTICS          SPH 5421 notes.010

The simulated weight-gain dataset used in notes.009 is an example of data which includes random effects. At the beginning of the simulation, coefficients b0 and b1 were specified, but each person was given an intercept and a slope which were random deviations from b0 and b1 respectively. A random measurement error was also built in for the individual weight measurements.

This is fairly realistic. In real life, people have different body sizes, partly due to genetics, and different rates at which they gain or lose weight.

The two kinds of analysis that were described, in which a slope and intercept were estimated for each person, are appropriate for this kind of data. These are simple versions of longitudinal models, about which you will learn more if you take one of the longitudinal models classes in biostatistics. The most commonly used SAS procedure for this kind of analysis is PROC MIXED.

The usual objective is to examine the effects of other variables on the rate of increase of the outcome variable (weight in the notes.009 simulations). For example, will the rate of increase in weight be different for smokers and nonsmokers?

There is a file of real de-identified data from the Lung Health Study which you can use to address these kinds of questions:

     /home/gnome/john-c/5421/lhs.data

The SAS program below reads in data from this file. Note that the variables WEIGHT0, WEIGHT1, WEIGHT2, WEIGHT3, WEIGHT4, WEIGHT5 are weights measured at annual visits. WEIGHT0 was measured at baseline, when the person first entered the study. For Problem 11, you will need to create time variables which indicate the time of each annual visit.
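One possible way to create the time variable is to turn each person's six weights into six records, one per visit. The sketch below is only an illustration, not the required solution. The dataset names 'wide' and 'long', the array name 'w', and the three subjects with their weights are all made up for this example; only CASENUM and WEIGHT0-WEIGHT5 are names from the LHS file. It assumes visits are exactly one year apart, with WEIGHT0 at time 0.

```sas
* Sketch only. Three made-up subjects stand in for the LHS file ;
data wide ;
  input casenum weight0-weight5 ;
  datalines ;
1 70.0 70.5 71.2 71.0 72.3 72.8
2 82.1 81.7 83.0 84.2 84.0 85.1
3 65.4 66.0 66.3 66.9 67.5 68.0
;
run ;

* One output record per person per visit, with time in years since baseline ;
data long ;
  set wide ;
  array w{0:5} weight0-weight5 ;   * w{0} is weight0, ..., w{5} is weight5 ;
  do time = 0 to 5 ;               * assumes visits exactly one year apart ;
    weight = w{time} ;
    output ;
  end ;
  keep casenum time weight ;
run ;

proc print data = long ;
run ;
```

The resulting dataset has one record per person per visit, sorted by CASENUM, which is the layout needed for a BY statement with a model of weight on time.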
==================================================================================

footnote "/home/gnome/john-c/5421/lhs.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;

OPTIONS  LINESIZE = 80 ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = LANDSCAPE
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2
         HTEXT = 1 ;

*===================================================================== ;

DATA lhs ;
     infile '/home/gnome/john-c/5421/lhs.data' ;

     INPUT  CASENUM  AGE      GENDER   BASECIGS GROUP    RANDDATE DEADDATE
            DEADCODE BODYMASS F31MSTAT
            VPCQUIT1 VPCQUIT2 VPCQUIT3 VPCQUIT4 VPCQUIT5
            CIGSA0   CIGSA1   CIGSA2   CIGSA3   CIGSA4   CIGSA5
            S1MFEV
            S2FEVPRE A1FEVPRE A2FEVPRE A3FEVPRE A4FEVPRE A5FEVPRE
            S2FEVPOS A1FEVPOS A2FEVPOS A3FEVPOS A4FEVPOS A5FEVPOS
            WEIGHT0  WEIGHT1  WEIGHT2  WEIGHT3  WEIGHT4  WEIGHT5 ;
RUN ;

*======================================================================;

PROBLEM 11

1.  Use the above datafile to compute individual slopes and intercepts of weight (in kilograms) versus time for the first 20 people on the file. Do this in two ways:

    1) As in the first or second program-examples in notes.009 (that is, not using any SAS procedure). You can assume that the weights are measured at 1-year intervals, with weight0 occurring at time 0.

    2) As in the third program-example in notes.009, using a BY statement and PROC REG.

    The results you get for slopes and intercepts from these two methods should be the same.

    3) Graph the weights for the first 15 people on the file as a function of time. You can use either Splus or SAS/GRAPH (PROC GPLOT). You should overlay the graphs for the different people on the same piece of paper.

    Note that the file actually contains data for 500 people. You will get large, messy output from PROC REG if you accidentally run it on the whole file instead of on a subset of 10 or 20 people. Also the graph will look very messy.
    But also note that you can suppress printing from PROC REG as follows:

         proc reg noprint outest = regest ;
              by casenum ;
              model weight = time ;
         run ;

    This enables you to do a large number of PROC REGs without creating a monstrous pile of output. The results for each CASENUM are stored sequentially in the dataset 'regest'.

2.  Investigate (for the WHOLE dataset) whether people who are not smokers at year 1 have a different rate of weight change across the whole time period than those who are smoking at 1 year. Smoking status at 1 year is defined by:

         VPCQUIT1 = 0 :  smoking at 1 year
         VPCQUIT1 = 1 :  quit smoking at 1 year

    Compute means of the rate of weight change and 95% confidence intervals for the two groups, and carry out an appropriate test for whether the true means are equal.

3.  Without using a SAS procedure, perform regression through the origin for the first 20 people on the LHS file. The model is:

         Y = b*X + e,

    where X is time (in years since baseline) and Y is CHANGE in weight from baseline to each annual visit, and where e ~ N(0, sig2).

    Again, this should be done separately for each of the first 20 people on the file. However, your program should provide an OVERALL estimate of b (by averaging the individual b's) and of sig2 [how?].

    Since Y is the change in weight from baseline, when time = 0, the change is 0. Should the point (time = 0, Y = 0) be used in computing the regression coefficient? Why or why not? Do you get a different answer if you include this point? Does the standard error of the coefficient change? Why or why not?

    Check that your program works by comparing it to PROC REG with the NOINT option on the model statement.

/home/gnome/john-c/5421/notes.010    Last update: October 3, 2012.