PH 5470 Spring 2005 Exam 1

Grade Distribution:

      97, 97, 96, 95, 94, 94, 94, 92, 91, 89,

      85, 85, 85, 84, 83, 82, 79, 60, 40.

ANSWER KEY


PubH 5470-3  Statistical Analysis Using SAS Procedures                         page 1 of 4

Exam 1 - March 24, 2005                                Name: _____________________________
==========================================================================================
1.  Given the following program, use the space below to show what the output will
    look like:
 ---------------------------------------------------------------------------------

    data dset1 ;
      input id x y ;

      z = x + y ;

      cards ;
      1  2  5
      3  .  7
      5  9  13
      ;
     run ;

      data dset2 ;
        input id x y ;

        z = x - y ;

        cards ;
        1   3  5
        2  11  8
        4   9  .
        5  13  9
        ;
     run ;

     data dset3 ;
       set dset1 dset2 ;
     run ;

     data dset4 ;
       merge dset1 dset2 ; by id ;

[12]   proc print data = dset3 ;
       title1 'PROC PRINT: data = dset3' ;

[13]   proc print data = dset4 ;
       title1 'PROC PRINT: data = dset4' ;

 endsas ;
 =================================================================================

PROC PRINT of dset3 :

     Obs    id    x    y    z
    ----   ----  ---  ---  ---
      1      1    2    5    7
      2      3    .    7    .
      3      5    9   13   22
      4      1    3    5   -2
      5      2   11    8    3
      6      4    9    .    .
      7      5   13    9    4

PROC PRINT of dset4 :

     Obs    id    x    y    z
    ----   ----  ---  ---  ---
      1      1    3    5   -2
      2      2   11    8    3
      3      3    .    7    .
      4      4    9    .    .
      5      5   13    9    4
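
Note on dset4: when MERGE with BY id finds a variable of the same name in
both datasets, the value from the dataset listed last (dset2) replaces the
value from dset1.  That is why id 1 and id 5 show the dset2 values of x, y
and z.  A minimal sketch, assuming one wanted to keep both sets of values,
is to rename the dset1 variables as they are read.  The names dset4b, x1,
y1 and z1 are hypothetical and are not part of the exam program.

```sas
data dset4b ;
  /* x1, y1, z1 hold the dset1 values; x, y, z hold the dset2 values */
  merge dset1 (rename = (x = x1  y = y1  z = z1))
        dset2 ;
  by id ;
run ;

proc print data = dset4b ;
  title1 'PROC PRINT: data = dset4b (values from both datasets kept)' ;
run ;
```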




==========================================================================================
2.  A case-control study was conducted of the effects of levels
    of exposure to arsenic in well water on the risks of getting
    cancer.  The data were as follows:


               MEN                     WOMEN

           No Cancer   Cancer        No Cancer   Cancer
          ---------------------     ---------------------
 Low      |         |         |     |         |         |
 Arsenic  |   160   |     2   |     |   150   |    10   |
          |         |         |     |         |         |
          ---------------------     ---------------------
 High     |         |         |     |         |         |
 Arsenic  |    40   |    20   |     |    15   |    15   |
          |         |         |     |         |         |
          ---------------------     ---------------------
              200        22             165        25


  a)  Write a program to input these data into SAS.

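      data byperson ;
           /* gender: 0 = men, 1 = women;  arsenic: 0 = low, 1 = high  */
           /* cancer: 0 = no, 1 = yes;  count = number of people in cell */
           input gender arsenic cancer count ;
[5]
           gendars = gender * arsenic ;   /* interaction term, used in d) */

           do i = 1 to count ;            /* one observation per person */
              output ;
           end ;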

      cards ;
      0  0  0  160
      0  0  1    2
      0  1  0   40
      0  1  1   20
      1  0  0  150
      1  0  1   10
      1  1  0   15
      1  1  1   15
      ;

run ;

  b)  Write appropriate PROC FREQ coding to analyze these data,
      so that you get a separate analysis for each gender and
      a COMBINED analysis.  Describe how you would tell whether
      exposure to arsenic has a different effect in women than
      it does in men.

[5]     proc freq data = byperson ;
             tables gender * arsenic * cancer / chisq cmh measures ;
        title 'PROC FREQ ANALYSIS of arsenic - cancer data' ;
        run ;

      To see if the effect of arsenic is different in women than
      in men, look at the Breslow-Day chi-square: if p is small,
      reject the hypothesis that the effect is the same (i.e.,
      reject the hypothesis of homogeneity of odds ratio).

  c)  Write a PROC LOGISTIC procedure for a no-interaction model
      which includes terms for both arsenic exposure and gender.
      Explain how you can use the coefficient estimates from this
      model to estimate the odds ratio for arsenic exposure.
      Your procedure should also ensure that confidence limits
      for odds ratios are printed.

[5]   proc logistic data = byperson descending ;
           model cancer = gender arsenic / clodds = wald ;
      title1 'Model 1: cancer vs gender, arsenic: no interaction.' ;
      run ;
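
      The question also asks how the coefficients give the odds ratio:
      in this model the estimated odds ratio for arsenic exposure,
      adjusted for gender, is exp(b), where b is the estimated
      coefficient of arsenic.  A minimal sketch, assuming the OUTEST=
      option is used to save the estimates; the names est1, or1 and
      or_arsenic are hypothetical.

```sas
proc logistic data = byperson descending outest = est1 ;
     model cancer = gender arsenic ;
run ;

data or1 ;
     set est1 ;
     /* in the OUTEST= dataset, the variable arsenic holds its coefficient */
     or_arsenic = exp(arsenic) ;
run ;

proc print data = or1 ;
     var arsenic or_arsenic ;
run ;
```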


  d)  Write a PROC LOGISTIC procedure for a model that allows for
      possible interaction of arsenic exposure and gender, as well
      as for 'main effects' of each of these factors.
      Explain how you can use the estimated coefficients from this
      model to produce estimated odds ratios for arsenic exposure
      for men and women separately.

[5]   proc logistic data = byperson descending ;
           model cancer = gender arsenic gendars / clodds = wald ;
      title1 'Model 2: cancer vs gender, arsenic, interaction term.' ;
      run ;
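
      With gender coded 0 = men and 1 = women (the coding in the cards
      of part a), the odds ratio for arsenic exposure is exp(b_arsenic)
      for men and exp(b_arsenic + b_gendars) for women.  A minimal
      sketch, assuming the estimates are saved with OUTEST=; the names
      est2, or2, or_men and or_women are hypothetical.

```sas
proc logistic data = byperson descending outest = est2 ;
     model cancer = gender arsenic gendars ;
run ;

data or2 ;
     set est2 ;
     or_men   = exp(arsenic) ;             /* gender = 0 */
     or_women = exp(arsenic + gendars) ;   /* gender = 1 */
run ;

proc print data = or2 ;
     var or_men or_women ;
run ;
```

      Because this model has one parameter for each gender-by-arsenic
      cell, the two estimates should agree with the odds ratios computed
      directly from the table: (160 x 20)/(2 x 40) = 40 for men and
      (150 x 15)/(10 x 15) = 15 for women.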

  e)  Explain how you can use printed statistics from c) and d)
      to test for whether there is a significant interaction
      between arsenic exposure and gender as determinants of
      cancer.

[5]   Look at -2 log L for Model 1 and Model 2; compute the
      difference,

           diff = (-2logL)[Model 1] - (-2logL)[Model 2] ;

      Compare this to a chi-square distribution with 1 DF:

           pvalue = 1 - probchi(diff, 1) ;

      Reject H0: No interaction if pvalue is small.
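
      The steps above can be sketched as follows, assuming both models
      are re-run with OUTEST=, whose output dataset carries the log
      likelihood in the variable _LNLIKE_.  The names est1, est2,
      lrtest, ll1 and ll2 are hypothetical.

```sas
proc logistic data = byperson descending outest = est1 ;
     model cancer = gender arsenic ;
run ;

proc logistic data = byperson descending outest = est2 ;
     model cancer = gender arsenic gendars ;
run ;

data lrtest ;
     /* each OUTEST= dataset has one row, so a merge without BY pairs them */
     merge est1 (keep = _lnlike_  rename = (_lnlike_ = ll1))
           est2 (keep = _lnlike_  rename = (_lnlike_ = ll2)) ;
     diff   = (-2 * ll1) - (-2 * ll2) ;
     pvalue = 1 - probchi(diff, 1) ;
run ;

proc print data = lrtest ;
run ;
```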


==========================================================================================
3.  Below is a program which examines how body weight (kg) is related
    to age and gender of Lung Health Study participants.

    Partial printout from the two analyses follows the program.

 a) Fill in the blanks in the ANOVA tables for both analyses.

[10]

 b) The main question of interest is: does age have a different
    effect on the body weight of women than on the body weight
    of men?  State the appropriate null hypothesis to be tested.
    Then use the data from the ANOVA tables to compute an F-statistic
    which can be used to test that null hypothesis.

[15]      [write answers on the blank page following the printout]

----------------------------------------------------------------------------

 DATA lhs ;
      infile '/home/walleye/john-c/5421/lhs.data' ;

      INPUT CASENUM  AGE GENDER BASECIGS GROUP RANDDATE DEADDATE DEADCODE
            BODYMASS F31MSTAT
            VPCQUIT1 VPCQUIT2 VPCQUIT3  VPCQUIT4 VPCQUIT5
            CIGSA0   CIGSA1   CIGSA2    CIGSA3   CIGSA4   CIGSA5
            S1MFEV   S2FEVPRE  A1FEVPRE  A2FEVPRE A3FEVPRE A4FEVPRE A5FEVPRE
                     S2FEVPOS  A1FEVPOS  A2FEVPOS A3FEVPOS A4FEVPOS A5FEVPOS
                     WEIGHT0   WEIGHT1   WEIGHT2  WEIGHT3  WEIGHT4  WEIGHT5 ;

 one = 1 ;
 gendbmi = gender * bodymass ;
 agegend = age * gender ;

 RUN ;

*======================================================================;

proc reg data = lhs ;
     model weight0 = age gender ;
title1 'Model 1: weight (kg) versus age and gender' ;
run ;

proc reg data = lhs ;
     model weight0 = age gender agegend ;
title1 'Model 2: weight versus age and gender,' ;
title2 'and an interaction term for age and gender.';
run ;


==========================================================================================
Problem 3, continued.


                   Model 1: weight (kg) versus age and gender                  1
                                                    12:18 Sunday, March 20, 2005

Model: MODEL1  
Dependent Variable: WEIGHT0                                            

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            2     34559.34     17279.67      126.38        <0.0001
                                                          ------
        Error          497     67954.87       136.73
                       ---     --------       ------
        C Total        499    102514.21

        R-square       .337
                       ----
========================================================================

                      Model 2: weight versus age and gender,                    2
                  and an interaction term for age and gender.
                                                    12:18 Sunday, March 20, 2005

Model: MODEL2
Dependent Variable: WEIGHT0                                            

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            3     35256.44     11752.14       86.67         0.0001
                                                           -----
        Error          496     67257.77       135.60
                       ---     --------       ------
        C Total        499    102514.21

        R-square       .344

------------------------------------------------------------------------

   H0: no age-gender interaction, i.e., the coefficient of agegend
   in Model 2 is zero.

                 (Error SS (Model 1) - Error SS (Model 2)) / 1
   Compute F =   ----------------------------------------------
                           Error SS (Model 2) / 496

             =   (67954.87 - 67257.77) / (67257.77 / 496) = 5.14.

   Compare this to an F-distribution with (1, 496) degrees of freedom.
   If the p-value is small, reject H0: no age-gender interaction.
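
   The same arithmetic can be done in a short data step.  This is a
   sketch only: the sums of squares are copied from the two ANOVA
   tables, and the names ftest, sse1, sse2, dfe2, f and pvalue are
   hypothetical.

```sas
data ftest ;
     sse1 = 67954.87 ;   /* Error SS, Model 1 */
     sse2 = 67257.77 ;   /* Error SS, Model 2 */
     dfe2 = 496 ;        /* Error DF, Model 2 */

     f      = ((sse1 - sse2) / 1) / (sse2 / dfe2) ;
     pvalue = 1 - probf(f, 1, dfe2) ;
run ;

proc print data = ftest ;
run ;
```

   An equivalent route, if the model can be re-run, is to add the
   statement   test agegend = 0 ;   after the MODEL statement in the
   Model 2 PROC REG step, which should report the same F-statistic.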




==========================================================================================
4.  You are given a dataset which has the following variables:
    ID, gender, age, weight, dose of a cholesterol-lowering drug,
    serum cholesterol level before starting to take the drug, and
    serum cholesterol after 8 weeks of taking the drug.

    Your task is to relate the change in serum cholesterol to
    the drug dose.

 a) What should you do before even starting the analysis, to
    check on the quality of the data?

[5] PROC PRINT at least some of the data to make sure the variables are
    what you thought they were.   Check the observed values against
    the original sources or forms.

    Perform PROC UNIVARIATE on all the variables to find the means,
    standard deviations, ranges, extreme values.  Check that the
    extreme values are correct if possible.

    Perform PROC PLOT of each variable against each other variable.
    See if there are any obvious outliers or influential points.
    Check that they are correct.

    Perform PROC CORR to see which independent variables are
    correlated.
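
    The checks above might look like the following sketch.  The dataset
    name chol and the variable names id, age, weight, dose, chol0
    (cholesterol before) and chol8 (cholesterol at 8 weeks) are
    assumptions; the problem does not name them.

```sas
proc print data = chol (obs = 20) ;
     title1 'First 20 observations' ;
run ;

proc univariate data = chol ;
     var age weight dose chol0 chol8 ;
     id id ;      /* label the extreme values with the subject ID */
run ;

proc plot data = chol ;
     plot chol8 * chol0  chol8 * dose ;
run ;

proc corr data = chol ;
     var age weight dose chol0 ;
run ;
```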


 b) What should you do to describe the data before carrying out
    a formal analysis?  What SAS procedures might you use?

[4] PROC PLOT for each pair of variables.

    PROC UNIVARIATE or PROC CHART to examine histograms.


 c) What SAS procedures might you use to carry out the
    analysis?

[4] PROC REG and PROC GLM.  PROC REG has more regression diagnostics
    and can perform stepwise analysis, so you might prefer it for
    this problem.  However PROC GLM allows class variables and
    provides multiple-comparisons tests (Bonferroni, etc.).  Both
    PROC REG and PROC GLM have advantages.


 d) How would you test for outliers?   Influential points?
    Non-constancy of variance?  What should you do about such things
    if you find them?

[3] Outliers: compute studentized residuals and write them to an
    output dataset.  If they are larger than 2 in absolute value,
    consider omitting those observations and re-running the analysis
    to see if the results are very different.

    Influential points: compute 'dffits' statistics and write them to an
    output dataset.  If they are larger than 2*sqrt(p/n) in absolute
    value (p = number of parameters, n = number of observations),
    consider omitting those observations and re-running the analysis
    to see if the results are very different.

    Nonconstancy of variance: examine the plots of residuals both against
    the serum cholesterol, and against the predicted values.  If
    there is an obvious pattern, consider a transformation of the
    outcome variable.
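
    A minimal sketch of these diagnostics, assuming a dataset chol with
    variables id, gender, age, weight, dose, chol0 and chol8.  All of
    these names, and the choice of covariates in the MODEL statement,
    are hypothetical.

```sas
data chol2 ;
     set chol ;
     change = chol8 - chol0 ;   /* change in serum cholesterol */
run ;

proc reg data = chol2 ;
     model change = dose age gender weight chol0 ;
     output out = diag  p = pred  r = resid
                        student = stud  dffits = dff ;
run ;

proc print data = diag ;
     where abs(stud) > 2 ;      /* candidate outliers */
     var id change pred stud dff ;
run ;

proc plot data = diag ;
     plot resid * pred  resid * dose ;
run ;
```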


 e) How would you test for whether drug dose has a significant effect
    on the change in serum cholesterol?

[5] You can examine the effect of drug dose in the Type III sum of
    squares table, which shows the effect of drug dose controlling
    for the other variables in the model.


 f) How might you test for a nonlinear effect of drug dose?  What
    might you do if a nonlinear effect appears to be present?

[4] Look at a residual plot versus drug dose.  If it indicates a
    nonlinear pattern, consider adding dose-squared or sqrt(dose)
    or log(dose) to the model.

    Add nonlinear terms to the model and compare pairs of models
    using the F-statistic.
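
    One way to carry out that comparison is sketched below, assuming a
    quadratic term is the nonlinear term of interest.  The dataset
    chol, its variable names, the covariates in the model and the
    label curve are hypothetical.

```sas
data chol3 ;
     set chol ;
     change = chol8 - chol0 ;
     dose2  = dose * dose ;     /* quadratic term for dose */
run ;

proc reg data = chol3 ;
     model change = dose dose2 age gender weight chol0 ;
     curve: test dose2 = 0 ;    /* F-test for the added nonlinear term */
run ;
```

    The TEST statement compares the models with and without dose2, so
    the two models do not have to be fitted separately.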