PH 5470 Spring 2005 Exam 1 Grade Distribution:

   97, 97, 96, 95, 94, 94, 94, 92, 91, 89, 85, 85, 85, 84, 83, 82, 79, 60, 40.

ANSWER KEY

PubH 5470-3  Statistical Analysis Using SAS Procedures                      page 1 of 4
Exam 1 - March 24, 2005                            Name: _____________________________
==========================================================================================

1. Given the following program, use the space below to show what the output will look like:

---------------------------------------------------------------------------------
data dset1 ;
  input id x y ;
  z = x + y ;
cards ;
1 2 5
3 . 7
5 9 13
;
run ;

data dset2 ;
  input id x y ;
  z = x - y ;
cards ;
1 3 5
2 11 8
4 9 .
5 13 9
;
run ;

data dset3 ;
  set dset1 dset2 ;
run ;

data dset4 ;
  merge dset1 dset2 ;
  by id ;

proc print data = dset3 ;
  title1 'PROC PRINT: data = dset3' ;

proc print data = dset4 ;
  title1 'PROC PRINT: data = dset4' ;

endsas ;
=================================================================================

[12]  PROC PRINT of dset3 :

      Obs     ID      x      y      z
      ----   ----    ---    ---    ---
        1      1      2      5      7
        2      3      .      7      .
        3      5      9     13     22
        4      1      3      5     -2
        5      2     11      8      3
        6      4      9      .      .
        7      5     13      9      4

[13]  PROC PRINT of dset4 :

      Obs     ID      x      y      z
      ----   ----    ---    ---    ---
        1      1      3      5     -2
        2      2     11      8      3
        3      3      .      7      .
        4      4      9      .      .
        5      5     13      9      4

==========================================================================================

2. A case-control study was conducted of the effects of levels of exposure to arsenic in
   well water on the risks of getting cancer.  The data were as follows:

                        MEN                            WOMEN
                No Cancer   Cancer             No Cancer   Cancer
               ---------------------          ---------------------
     Low       |         |         |          |         |         |
     Arsenic   |   160   |     2   |          |   150   |    10   |
               |         |         |          |         |         |
               ---------------------          ---------------------
     High      |         |         |          |         |         |
     Arsenic   |    40   |    20   |          |    15   |    15   |
               |         |         |          |         |         |
               ---------------------          ---------------------
                   200       100                  200       100

a) Write a program to input these data into SAS.
[5]
* Coding: gender 0 = men, 1 = women; arsenic 0 = low, 1 = high; cancer 0 = no, 1 = yes. ;
* The do-loop writes one observation per person, so no WEIGHT statement is needed later. ;
data byperson ;
  input gender arsenic cancer count ;
  gendars = gender * arsenic ;
  do i = 1 to count ;
    output ;
  end ;
cards ;
0 0 0 160
0 0 1 2
0 1 0 40
0 1 1 20
1 0 0 150
1 0 1 10
1 1 0 15
1 1 1 15
;
run ;

b) Write appropriate PROC FREQ coding to analyze these data, so that you get a separate
   analysis for each gender and a COMBINED analysis.  Describe how you would tell whether
   exposure to arsenic has a different effect in women than it does in men.

[5]
proc freq data = byperson ;
  tables gender * arsenic * cancer / chisq cmh measures ;
  title 'PROC FREQ ANALYSIS of arsenic - cancer data' ;
run ;

To see if the effect of arsenic is different in women than in men, look at the Breslow-Day
chi-square: if p is small, reject the hypothesis that the effect is the same (i.e., reject
the hypothesis of homogeneity of odds ratio).

c) Write a PROC LOGISTIC procedure for a no-interaction model which includes terms for both
   arsenic exposure and gender.  Explain how you can use the coefficient estimates from
   this model to estimate the odds ratio for arsenic exposure.  Your procedure should also
   ensure that confidence limits for odds ratios are printed.

[5]
proc logistic data = byperson descending ;
  model cancer = gender arsenic / clodds = wald ;
  title1 'Model 1: cancer vs gender, arsenic: no interaction.' ;
run ;

The estimated odds ratio for arsenic exposure, adjusted for gender, is exp(b), where b is
the estimated coefficient of arsenic.

d) Write a PROC LOGISTIC procedure for a model that allows for possible interaction of
   arsenic exposure and gender, as well as for 'main effects' of each of these factors.
   Explain how you can use the estimated coefficients from this model to produce estimated
   odds ratios for arsenic exposure for men and women separately.

[5]
proc logistic data = byperson descending ;
  model cancer = gender arsenic gendars / clodds = wald ;
  title1 'Model 2: cancer vs gender, arsenic, interaction term.' ;
run ;

For men (gender = 0) the estimated odds ratio for arsenic exposure is exp(b[arsenic]);
for women (gender = 1) it is exp(b[arsenic] + b[gendars]).

e) Explain how you can use printed statistics from c) and d) to test for whether there is
   a significant interaction between arsenic exposure and gender as determinants of cancer.
[5] Look at -2 log L for Model 1 and Model 2; compute the difference,

       diff = (-2logL)[Model 1] - (-2logL)[Model 2] ;

Compare this to a chi-square distribution with 1 DF:

       pvalue = 1 - probchi(diff, 1) ;

Reject H0: No interaction if pvalue is small.

==========================================================================================

3. Below is a program which examines how body weight (kg) is related to age and gender of
   Lung Health Study participants.  Partial printout from two analyses follows that on the
   next page.

a) Fill in the blanks in the ANOVA tables for both analyses.                          [10]

b) The main question of interest is: does age have a different effect on the body weight
   of women than on the body weight of men.  State the appropriate null hypothesis to be
   tested.  Then use the data from the ANOVA tables to compute an F-statistic which can be
   used to test that null hypothesis.
[15]                 [write answers on the blank page following the printout]

----------------------------------------------------------------------------
DATA lhs ;
  infile '/home/walleye/john-c/5421/lhs.data' ;
  INPUT CASENUM AGE GENDER BASECIGS GROUP RANDDATE DEADDATE DEADCODE BODYMASS
        F31MSTAT VPCQUIT1 VPCQUIT2 VPCQUIT3 VPCQUIT4 VPCQUIT5
        CIGSA0 CIGSA1 CIGSA2 CIGSA3 CIGSA4 CIGSA5
        S1MFEV S2FEVPRE A1FEVPRE A2FEVPRE A3FEVPRE A4FEVPRE A5FEVPRE
        S2FEVPOS A1FEVPOS A2FEVPOS A3FEVPOS A4FEVPOS A5FEVPOS
        WEIGHT0 WEIGHT1 WEIGHT2 WEIGHT3 WEIGHT4 WEIGHT5 ;
  one = 1 ;
  gendbmi = gender * bodymass ;
  agegend = age * gender ;
RUN ;
*======================================================================;
proc reg data = lhs ;
  model weight0 = age gender ;
  title1 'Model 1: weight (kg) versus age and gender' ;
run ;

proc reg data = lhs ;
  model weight0 = age gender agegend ;
  title1 'Model 2: weight versus age and gender,' ;
  title2 'and an interaction term for age and gender.';
run ;
==========================================================================================

Model 1: weight (kg) versus age and gender                                   1
                                                  12:18 Sunday, March 20, 2005

Model: MODEL1
Dependent Variable: WEIGHT0

                         Analysis of Variance

                         Sum of         Mean
Source          DF      Squares       Square      F Value       Prob>F

Model            2     34559.34     17279.67       126.38      <0.0001
                                                   ------
Error          497     67954.87       136.73
               ---     --------       ------
C Total        499    102514.21

R-square   .337
           ----
========================================================================
Model 2: weight versus age and gender,                                       2
and an interaction term for age and gender.
                                                  12:18 Sunday, March 20, 2005

Model: MODEL2
Dependent Variable: WEIGHT0

                         Analysis of Variance

                         Sum of         Mean
Source          DF      Squares       Square      F Value       Prob>F

Model            3     35256.44     11752.14        86.67       0.0001
                                                    -----
Error          496     67257.77       135.60
               ---     --------       ------
C Total        499    102514.21

R-square   .344
------------------------------------------------------------------------

                (Error SS (Model 1) - Error SS (Model 2)) / 1
Compute  F  =  -----------------------------------------------
                          Error SS (Model 2) / 496

            =  (67954.87 - 67257.77) / (67257.77 / 496)

            =  5.14.

Compare this to an F-distribution with (1, 496) degrees of freedom.  If the p-value is
small, reject H0: no age-gender interaction (the coefficient of agegend in Model 2 is zero).

==========================================================================================

4. You are given a dataset which has the following variables: ID, gender, age, weight,
   dose of a cholesterol-lowering drug, serum cholesterol level before starting to take
   the drug, and serum cholesterol after 8 weeks of taking the drug.  Your task is to
   relate the change in serum cholesterol to the drug dose.

a) What should you do before even starting the analysis, to check on the quality of the
   data?

[5] PROC PRINT at least some of the data to make sure the variables are what you thought
they were.  Check the observed values against the original sources or forms.
Perform PROC UNIVARIATE on all the variables to find the means, standard deviations,
ranges, and extreme values.  Check that the extreme values are correct if possible.
Perform PROC PLOT of each variable against each other variable.  See if there are any
obvious outliers or influential points.
Check that they are correct.
Perform PROC CORR to see which independent variables are correlated.

b) What should you do to describe the data before carrying out a formal analysis?  What
   SAS procedures might you use?

[4] PROC PLOT for each pair of variables.
PROC UNIVARIATE or PROC CHART to examine histograms.

c) What SAS procedures might you use to carry out the analysis?

[4] PROC REG and PROC GLM.  PROC REG has more regression diagnostics and can perform
stepwise analysis, so you might prefer it for this problem.  However, PROC GLM allows
class variables and provides multiple-comparisons tests (Bonferroni, etc.).  Both
PROC REG and PROC GLM have advantages.

d) How would you test for outliers?  Influential points?  Non-constancy of variance?
   What should you do about such things if you find them?

[3] Outliers: compute studentized residuals and put them on an output dataset.  If they
are larger than 2 in absolute value, consider omitting those observations and re-running
the analysis to see if the results are very different.

Influential points: compute 'dffits' statistics and put them on an output dataset.  If
they are large in absolute value (a common rule of thumb is larger than 2*sqrt(p/n),
where p is the number of model parameters and n is the number of observations), consider
omitting those observations and re-running the analysis to see if the results are very
different.

Nonconstancy of variance: examine the plots of residuals both against the serum
cholesterol and against the predicted values.  If there is an obvious pattern, consider
a transformation of the outcome variable.

e) How would you test for whether drug dose has a significant effect on the change in
   serum cholesterol?

[5] You can examine the effect of drug dose in the Type III sum of squares table from
PROC GLM, which shows the effect of drug dose controlling for the other variables in the
model.

f) How might you test for a nonlinear effect of drug dose?  What might you do if a
   nonlinear effect appears to be present?

[4] Look at a residual plot versus drug dose.
If it indicates a nonlinear pattern, consider adding dose-squared or sqrt(dose) or log(dose) to the model. Add nonlinear terms to the model and compare pairs of models using the F-statistic.
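
The comparison in f) can be sketched in SAS as follows. This is only an illustration under stated assumptions: the dataset name chol and the variable names dose, chol0 (serum cholesterol before the drug) and chol8 (serum cholesterol after 8 weeks) are hypothetical, since the problem does not name them. For a single added term, the TEST statement in PROC REG gives the same F-statistic as comparing the models with and without that term.

```sas
* Hypothetical names: dataset CHOL with variables dose, chol0, chol8. ;
data chol2 ;
  set chol ;
  change = chol8 - chol0 ;   * change in serum cholesterol ;
  dosesq = dose * dose ;     * quadratic term for dose ;
run ;

proc reg data = chol2 ;
  model change = dose dosesq ;
  quadratic: test dosesq = 0 ;   * F-test of the added nonlinear term ;
  title1 'Change in cholesterol vs dose and dose-squared' ;
run ;
quit ;
```

A small p-value for the test labeled quadratic suggests keeping the dose-squared term. The same pattern applies to sqrt(dose) or log(dose), with the caveat that log(dose) is undefined for a dose of zero.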