PubH 5470-3  Statistical Analysis Using SAS Procedures                        page 1 of 6
Final Exam - May 15, 2004                           Name: _____________________________
==========================================================================================

1. Data on individuals enrolled in the Happy Health HMO are recorded in two files:

   File 1: Demographic Data                File 2: Drug Prescription Data
   -----------------------------------     -------------------------------------------
                                                    Date of Last
    ID    Gender   Age   Weight   Race      ID      Prescription    Drug Code    Age
   ----   ------   ---   ------   ----     ----     ------------    ----------   ---
   0611     M       18     129     W       0345      2004-02-29         124       41
   0345     M       40     158     A       0111      2003-12-16         288       52
   0260     F       79     107     B       0064          .               .        63
   [more observations]                     [more observations]

   Your task is to merge the two files so that both the demographic data and the drug
   prescription data for a given person are part of one observation for that person.
   The two files both include 'ID', which is a unique identifier for the person.
   However, File 1 is sorted in ascending order by age, while File 2 is sorted in
   reverse order by Date of Last Prescription.

   Write a SAS program which merges the two files.
--------------------------------------------------------------------------------------
                                                                                  [13]
   data file1 ;
      infile 'file1.data' ;           /* hypothetical file name */
      length id $4 gender $1 race $1 ;
      input id gender age weight race ;
   run ;

   data file2 ;
      infile 'file2.data' ;           /* hypothetical file name */
      length id $4 ;
      input id @7 presdate yymmdd10. drugcode age ;
   run ;

   /* MERGE with a BY statement requires both files to be sorted by ID */
   proc sort data = file1 ;
      by id ;

   proc sort data = file2 ;
      by id ;

   data twofiles ;
      merge file1 file2 ;
      by id ;
   run ;
--------------------------------------------------------------------------------------

   Note that one person, ID #0345, has a different Age on File 2 than on File 1.
   What will happen to this variable when the files are merged?
--------------------------------------------------------------------------------------
   The value from the second file named in the MERGE statement (file2) will be used:
   for ID #0345, Age = 41.
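
   If both ages need to be kept, one option is to rename Age on one of the files
   before merging. The following is a minimal sketch, assuming file2 contains Age and
   both files are already sorted by ID; the names age_rx, in1, in2, and age_mismatch
   are illustrative and not part of the original answer.

```sas
data twofiles ;
   merge file1 (in = in1)
         file2 (in = in2 rename = (age = age_rx)) ;   /* keep File 2's age separately */
   by id ;
   /* flag people who appear in both files with two different ages */
   age_mismatch = (in1 and in2 and age ne age_rx) ;
run ;

proc print data = twofiles ;
   where age_mismatch = 1 ;
   var id age age_rx ;
run ;
```

   With the sample rows shown, ID #0345 would be flagged (Age 40 on File 1, 41 on
   File 2).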
--------------------------------------------------------------------------------------
                                                                                   [7]

2. The relationship of height to weight was studied in men and women in the Lung Health
   Study.  On the following pages are 4 PROC GLM analyses, and the corresponding
   printouts.

   a) Use the results of Model 1 and Model 2 to evaluate whether the relationship
      between height and weight is the same for men as it is for women.  Describe
      your reasoning in detail.
--------------------------------------------------------------------------------------
   Consider confidence intervals for intercepts and slopes, based on parameter
   estimates +/- 2 * standard errors.                                              [6]

      Men:     Intercept = -81.43 +/- 2*5.06 = (-91.55, -71.31)
               Slope     =  92.75 +/- 2*2.86 = ( 87.03,  98.47)

      Women:   Intercept = -54.83 +/- 2*6.36 = (-67.55, -42.11)
               Slope     =  73.14 +/- 2*3.88 = ( 65.38,  80.90)

   Note that the men's and women's intervals do not overlap, for either the intercept
   or the slope.  This indicates that the relationship between height and weight is
   not the same for men as for women.
--------------------------------------------------------------------------------------

   b) What do you conclude by comparing Model 3 and Model 4 ?  Justify your answer in
      detail.
--------------------------------------------------------------------------------------
   Because the 'gender' term in Model 4 is highly significant (Type III F = 222.59,
   p = 0.0001), the intercepts are not the same in a two-parallel-lines model.     [6]

   Also very likely: the slopes are not the same [note Model 3 slope is very
   different from Model 4 slope].
--------------------------------------------------------------------------------------

   c) Write another 'proc glm' ('Model 5') which models weight as a function of height
      with separate intercepts and separate slopes for women and men in one procedure.
--------------------------------------------------------------------------------------
   proc glm data = smoke ;                                                         [6]
      class gender ;
      model weight = height gender height * gender ;
   run ;
--------------------------------------------------------------------------------------

   d) Describe how you would use the printout from Model 5 and Model 3 to formally
      test whether Model 5 is better than Model 3.
--------------------------------------------------------------------------------------
   Compute the F-statistic:                                                        [6]

              (ErrSS3 - ErrSS5) / 2
         F = -----------------------
                ErrSS5 / (n - 4)

   Compare to F-distribution with (2, n - 4) degrees of freedom.

   Here ErrSS3 and ErrSS5 are the Error sums of squares from the Model 3 and Model 5
   printouts (ErrSS3 = 765796.48), and n = 5886 is the total number of observations.
--------------------------------------------------------------------------------------

========================================================================
Program and Printout for Problem 2

proc glm data = smoke ;
   where gender eq 0 ;
   model weight = height ;
   title1 'Model 1: gender = 0 [Men] only: regress weight on height' ;

proc glm data = smoke ;
   where gender eq 1 ;
   model weight = height ;
   title1 'Model 2: gender = 1 [Women] only: regress weight on height' ;

proc glm data = smoke ;
   model weight = height ;
   title1 'Model 3: both genders combined: regress weight on height' ;

proc glm data = smoke ;
   model weight = gender height ;
   title1 'Model 4: both genders, same slope, different intercepts' ;

------------------------------------------------------------------------
Model 1: gender = 0 [Men] only: regress weight on height                2
                                              16:24 Sunday, May 9, 2004

                    General Linear Models Procedure

Dependent Variable: WEIGHT
                                 Sum of            Mean
Source              DF          Squares          Square   F Value   Pr > F
Model                1     138223.66876    138223.66876   1051.04   0.0001
Error             3699     486458.48624       131.51081
Corrected Total   3700     624682.15500

            R-Square          C.V.      Root MSE    WEIGHT Mean
            0.221270      13.89803     11.467816      82.513996

Source              DF        Type I SS     Mean Square   F Value   Pr > F
HEIGHT               1     138223.66876    138223.66876   1051.04   0.0001

Source              DF      Type III SS     Mean Square   F Value   Pr > F
HEIGHT               1     138223.66876    138223.66876   1051.04   0.0001

                                T for H0:    Pr > |T|   Std Error of
Parameter          Estimate    Parameter=0                 Estimate
INTERCEPT      -81.43835688         -16.09     0.0001    5.06067757
HEIGHT          92.75304821          32.42     0.0001    2.86099902

------------------------------------------------------------------------
Model 2: gender = 1 [Women] only: regress weight on height              4
                                              16:24 Sunday, May 9, 2004

                    General Linear Models Procedure

Dependent Variable: WEIGHT
                                 Sum of            Mean
Source              DF          Squares          Square   F Value   Pr > F
Model                1     40585.548007    40585.548007    355.19   0.0001
Error             2183    249437.325009      114.263548
Corrected Total   2184    290022.873016

            R-Square          C.V.      Root MSE    WEIGHT Mean
            0.139939      16.44993     10.689413      64.981510

Source              DF        Type I SS     Mean Square   F Value   Pr > F
HEIGHT               1     40585.548007    40585.548007    355.19   0.0001

Source              DF      Type III SS     Mean Square   F Value   Pr > F
HEIGHT               1     40585.548007    40585.548007    355.19   0.0001

                                T for H0:    Pr > |T|   Std Error of
Parameter          Estimate    Parameter=0                 Estimate
INTERCEPT      -54.83183244          -8.62     0.0001    6.36142075
HEIGHT          73.14141859          18.85     0.0001    3.88089171

------------------------------------------------------------------------
Model 3: both genders combined: regress weight on height                6
                                              16:24 Sunday, May 9, 2004

                    General Linear Models Procedure

Dependent Variable: WEIGHT
                                 Sum of            Mean
Source              DF          Squares          Square   F Value   Pr > F
Model                1     571224.28119    571224.28119   4389.00   0.0001
Error             5884     765796.48491       130.14896
Corrected Total   5885    1337020.76610

            R-Square          C.V.      Root MSE    WEIGHT Mean
            0.427237      15.00980     11.408285      76.005590

Source              DF        Type I SS     Mean Square   F Value   Pr > F
HEIGHT               1     571224.28119    571224.28119   4389.00   0.0001

Source              DF      Type III SS     Mean Square   F Value   Pr > F
HEIGHT               1     571224.28119    571224.28119   4389.00   0.0001

                                T for H0:    Pr > |T|   Std Error of
Parameter          Estimate    Parameter=0                 Estimate
INTERCEPT      -114.1720521         -39.72     0.0001    2.87447389
HEIGHT          110.5977830          66.25     0.0001    1.66941166

------------------------------------------------------------------------
Model 4: both genders, same slope, different intercepts                 8
                                              16:24 Sunday, May 9, 2004

                    General Linear Models Procedure

Dependent Variable: WEIGHT
                                 Sum of            Mean
Source              DF          Squares          Square   F Value   Pr > F
Model                2     599142.93697    299571.46849   2388.44   0.0001
Error             5883     737877.82913       125.42543
Corrected Total   5885    1337020.76610

            R-Square          C.V.      Root MSE    WEIGHT Mean
            0.448118      14.73490     11.199350      76.005590

Source              DF        Type I SS     Mean Square   F Value   Pr > F
GENDER               1     422315.73809    422315.73809   3367.07   0.0001
HEIGHT               1     176827.19888    176827.19888   1409.82   0.0001

Source              DF      Type III SS     Mean Square   F Value   Pr > F
GENDER               1      27918.65578     27918.65578    222.59   0.0001
HEIGHT               1     176827.19888    176827.19888   1409.82   0.0001

                                T for H0:    Pr > |T|   Std Error of
Parameter          Estimate    Parameter=0                 Estimate
INTERCEPT      -70.31957232         -17.26     0.0001    4.07456022
GENDER          -6.33408349         -14.92     0.0001    0.42455048
HEIGHT          86.46279900          37.55     0.0001    2.30275410

LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 09MAY04 16:24
========================================================================

3. A cohort of people aged 80-100 was followed for one year.  A datafile was
   constructed with the following information:

      Person's ID
      Age in years
      Death (0 = no, 1 = yes)

   a) Describe the model in which probability of death is represented as a function
      of age.
--------------------------------------------------------------------------------------
   Model:  Prob(Death | age) = 1 / (1 + exp(-b0 - b1*age)).                        [3]
--------------------------------------------------------------------------------------

   b) Write a proc logistic procedure to analyze the data.
--------------------------------------------------------------------------------------
   /* descending: model Prob(death = 1); by default PROC LOGISTIC models   */
   /* the lower ordered value of the response, i.e. Prob(death = 0)        */
   proc logistic data = agedeath descending ;                                      [3]
      model death = age / clodds = pl ;
   run ;
--------------------------------------------------------------------------------------

   c) Suppose the coefficient estimates from proc logistic are as follows:

               Analysis of Maximum Likelihood Estimates

                            Parameter    Standard
         Variable     DF     Estimate       Error
         INTERCPT      1       -1.000      0.3000
         AGE           1        -.010      0.0020

      Compute the probability of dying within one year for a person who is 90 years
      old.
--------------------------------------------------------------------------------------
   prob = 1 / (1 + exp(1 + .01*90))                                                [4]
        = 1 / (1 + exp(1 + .9))
        = 1 / (1 + exp(1.9))
        = 1 / (1 + 6.686)
        = 1 / 7.686
        = .13
--------------------------------------------------------------------------------------

   d) Using the coefficient estimates in b) [misprint: should be c)], find the age for
      which a third of the people of that age would be expected to die within a year.
--------------------------------------------------------------------------------------
   Let  1/3 = 1 / (1 + exp(1 + .01*age)), or                                       [6]
          3 = 1 + exp(1 + .01*age)
          2 = exp(1 + .01*age)
       .693 = 1 + .01*age

   Therefore age = -30.7.
--------------------------------------------------------------------------------------

   Note: although the calculations above are technically correct, the answer -30.7
   years does not really make sense.  The coefficient b1 of age was erroneously
   printed as -.010 when it should have been +.010.
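
   For comparison, the same two calculations can be repeated with the corrected
   coefficient b1 = +.010. The following is a minimal sketch, not part of the original
   key; the variable names p90 and age_third are illustrative.

```sas
data _null_ ;
   b0 = -1.000 ;
   b1 =  0.010 ;                               /* corrected sign, per the note above */
   /* c) probability of death within one year at age 90 */
   p90 = 1 / (1 + exp(-b0 - b1*90)) ;
   /* d) solve logit(1/3) = b0 + b1*age for age */
   age_third = (log((1/3) / (2/3)) - b0) / b1 ;
   put p90= age_third= ;                       /* results are written to the SAS log */
run ;
```

   By hand, d) becomes .693 = 1 - .01*age, so age = 30.7.  This is positive, but it is
   still far outside the 80-100 age range of the cohort, so it is an extrapolation.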
4. An investigator wants to conduct a clinical trial.  A total of 200 people will be
   randomized: 100 to drug X and 100 to drug Y (placebo).  The outcome is resolution
   of depression.  The investigator thinks that the probability of resolution of
   depression in each of the two groups will be the following:

      Drug X :  .75
      Drug Y :  .60

   a) Write a SAS program which will simulate this clinical trial.  The output from
      the program should show the number of events in each of the two groups.
--------------------------------------------------------------------------------------
   data trial ;                                                                    [6]
      n = 100 ;
      px = .75 ;
      py = .60 ;
      seed = 20040515 ;

      do i = 1 to n ;
         group = 'X' ;
         rx = ranuni(seed) ;
         x = 0 ;
         if rx < px then x = 1 ;      /* x = 1: depression resolved */
         output ;
      end ;

      do i = 1 to n ;
         group = 'Y' ;
         ry = ranuni(seed) ;
         x = 0 ;
         if ry < py then x = 1 ;
         output ;
      end ;
   run ;

   /* show the number of events (x = 1) in each group */
   proc freq data = trial ;
      tables group * x ;
   run ;
--------------------------------------------------------------------------------------

   b) How would you test whether the results of your simulated clinical trial
      indicated a significant difference between the two groups?
--------------------------------------------------------------------------------------
   Use PROC FREQ as follows:                                                       [6]

   proc freq data = trial ;
      tables x * group / chisq ;
   run ;
--------------------------------------------------------------------------------------

   c) Suppose you carried out 1000 separate simulations of the clinical trial with the
      probabilities given as above.  How could you use the results to estimate the
      power of the clinical trial (i.e., the probability that the results would be
      significant)?
--------------------------------------------------------------------------------------
   Generate 1000 datasets as in a) above and test each using proc freq as in b).
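
   A minimal sketch of how the 1000 simulated trials could be generated and tested in
   one pass, assuming the probabilities from a) and the chi-square test from b).  The
   dataset names trials and tests and the variables sim and sig are illustrative, not
   part of the original key.

```sas
data trials ;
   n = 100 ; px = .75 ; py = .60 ; seed = 20040515 ;
   do sim = 1 to 1000 ;                     /* one simulated trial per value of sim */
      do i = 1 to n ;
         group = 'X' ; x = (ranuni(seed) < px) ; output ;
      end ;
      do i = 1 to n ;
         group = 'Y' ; x = (ranuni(seed) < py) ; output ;
      end ;
   end ;
   keep sim group x ;
run ;

proc freq data = trials noprint ;
   by sim ;                                 /* data are already in order of sim */
   tables x * group / chisq ;
   output out = tests pchi ;                /* p_pchi = chi-square p-value per trial */
run ;

data tests ;
   set tests ;
   sig = (. < p_pchi < .05) ;               /* 1 if significant; missing p counts as 0 */
run ;

proc means data = tests n sum mean ;        /* sum = count of significant trials */
   var sig ;
run ;
```

   In the PROC MEANS output, the sum of sig is the number of significant trials out of
   1000 and the mean of sig is that count divided by 1000.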
   Let M = the number of times in 1000 that the p-value is < .05.                  [8]
   Compute M / 1000.  This is an estimate of power.
--------------------------------------------------------------------------------------

5. The following dataset summarizes data from a followup study of heart attack, where
   the risk factor of interest was aspirin use.  Systolic blood pressure, age, and
   gender are also risk factors for heart attack.

   Participant                       Systolic    Aspirin     Followup     Heart
   Sequence No.   Gender    Age       B. P.        Use       Time(mos.)   Attack?
   ------------   ------    ---      --------   ----------   ----------   --------
        1            2       67        145          0            28           0
        2            2       48        132          1            36           0
        3            1       72        168          0            48           1
   [more observations]

   a) Write a SAS program to read the datafile and analyze this data using
      PROC LIFETEST.
--------------------------------------------------------------------------------------
   data heart ;                                                                    [6]
      infile 'heart.data' ;
      input seqno gender age sbp aspirin folltime heartatt ;
      aspgend = aspirin * gender ;        /* interaction term, used in 5 c) */
   run ;

   proc lifetest data = heart ;
      time folltime * heartatt(0) ;       /* heartatt = 0 marks censored observations */
      strata aspirin ;
   run ;
--------------------------------------------------------------------------------------

   b) How can you tell, from PROC LIFETEST output, whether the people who use aspirin
      are at higher or lower risk than the people who do not?  What test for the
      difference between aspirin users and non-users would you use ?
--------------------------------------------------------------------------------------
   Examine the survival curves.  The group with the higher survival rate is the group
   that has lower risk.                                                            [6]

   Test: PROC LIFETEST produces tests based on -2*Log(Likelihood), the Wilcoxon
   statistic, and the Logrank statistic.  Any of these can be used.  A conservative
   approach would be to take the largest p-value produced by these three tests.
--------------------------------------------------------------------------------------

   c) Write a PROC PHREG procedure to analyze the effects of aspirin use, controlling
      for age, gender, and SBP.  How would you test for an interaction between aspirin
      use and gender ?
--------------------------------------------------------------------------------------
   1.  proc phreg data = heart ;                                                   [8]
          model folltime * heartatt(0) = age gender sbp ;
       run ;

   2.  proc phreg data = heart ;
          model folltime * heartatt(0) = aspirin age gender sbp ;
       run ;

       Look at the difference in (-2 Log L) between these two models, and compare to
       a chi-square distribution with 1 d.f.

   3.  proc phreg data = heart ;
          model folltime * heartatt(0) = aspirin age gender sbp aspgend ;
       run ;

       Look at the difference in (-2 Log L) between 3. and 2., and compare to a
       chi-square distribution with 1 d.f.

   Note: you need to create a variable called aspgend in the dataset ... see above.
--------------------------------------------------------------------------------------
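
   The comparison of (-2 Log L) values described above can be carried out in a short
   data step.  This is a minimal sketch: the two -2 Log L values below are hypothetical
   placeholders, not results from this study, and must be replaced by the values
   printed by PROC PHREG for models 2. and 3.  The dataset and variable names are
   illustrative.

```sas
data lrtest ;
   /* HYPOTHETICAL placeholder values: substitute the "-2 LOG L" (with  */
   /* covariates) figures from the PROC PHREG printouts                 */
   m2logl_model2 = 1250.4 ;
   m2logl_model3 = 1246.1 ;

   chisq = m2logl_model2 - m2logl_model3 ;   /* likelihood ratio statistic */
   df    = 1 ;                               /* one added term: aspgend */
   p     = 1 - probchi(chisq, df) ;          /* upper-tail chi-square probability */
   put chisq= df= p= ;
run ;
```

   The same data step applies to the comparison of models 1. and 2., which tests the
   aspirin effect itself, again with 1 d.f.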