PubH 6470  SAS Procedures and Data Analysis                                  page 1 of 6

Exam 1 - March 9, 2006                                 Name: __Answer Key_________________
==========================================================================================
1.  Given the following program:
-----------------------------------
    data xyz ;

    input x y z ;

    datalines ;
      5 13  8
      4  1  2
      9 19 12

       [more lines]

     12 -5  2
      ;

    run ;
-----------------------------------

 Write another datastep which includes new variables  a b c
 where a = smallest of x, y, and z
       b = middle value of x, y, and z
 and   c = largest of x, y, and z.


 That is, the new data set should look like this:

      x    y    z    a    b    c
     ---  ---  ---  ---  ---  ---
      5   13    8    5    8   13
      4    1    2    1    2    4
      9   19   12    9   12   19

       [more lines]

     12   -5    2   -5    2   12
--------------------------------------------------------------------------

[15]

--------------------------------------------------------------------------
data xyzabc ;
     set xyz ;

     a = x ;
     if a ge y then a = y ;
     if a ge z then a = z ;

     c = y ;
     if c le x then c = x ;
     if c le z then c = z ;

* [Comment: the preceding lines set a = minimum(x, y, z) and c = maximum(x, y, z).] ;

* [Comment: x + y + z = a + b + c, so the middle value is whatever remains.] ;

     b = x + y + z - a - c ;
     output ;
  run ;

  proc print data = xyzabc ;
       var x y z a b c ;
  run ;
--------------------------------------------------------------------------
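
An equivalent data step can be sketched with built-in functions instead of IF
statements.  This assumes a SAS release that provides the MEDIAN function
(SAS 9 or later); the data set name xyzabc2 is made up for illustration.
Note that MIN, MAX, and MEDIAN ignore missing values.

--------------------------------------------------------------------------
data xyzabc2 ;
     set xyz ;

     a = min(x, y, z) ;       * smallest of the three ;
     b = median(x, y, z) ;    * middle value of the three ;
     c = max(x, y, z) ;       * largest of the three ;
run ;

proc print data = xyzabc2 ;
     var x y z a b c ;
run ;
--------------------------------------------------------------------------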

2.  Heights are measured on three groups of 5-year-old children with 15 children in each
    group.  The first group was raised in Beijing.  The second group was raised in Houston.
    The third group was raised in Mombasa.  The objective is to see whether the distribution
    of heights in the three groups is the same.  Here is the program:

--------------------------------------------------------------------------
    data heights ;
         length city $7 ;
         input city h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14 h15 ;
    datalines ;
      Beijing  40 43 29 32 32 41 34 29 50 42 41 39 30 28 33
      Houston  51 38 28 33 50 43 44 37 50 44 51 34 44 38 40
      Mombasa  28 31 39 48 29 40 31 33 29 30 45 31 35 46 38
      ;
--------------------------------------------------------------------------

    Write a program to restructure the data set as necessary and use PROC NPAR1WAY to test
whether the distribution of heights in the three cities is the same.

[20]

---------------------------------------------------------------------------
options linesize = 80 ;
footnote "~john-c/5421/exam1prob2.sas &sysdate &systime" ;

    data heights ;
         length city $7 ;
         input city @@ ;
           do i = 1 to 15 ;
             input height @@ ;
             output ;
           end ;
    datalines ;
      Beijing  40 43 29 32 32 41 34 29 50 42 41 39 30 28 33
      Houston  51 38 28 33 50 43 44 37 50 44 51 34 44 38 40
      Mombasa  28 31 39 48 29 40 31 33 29 30 45 31 35 46 38
      ;

    run ;

proc print data = heights ;
     var city height ;
run ;

proc npar1way data = heights ;
     class city ;
     var height ;
run ;

---------------------------------------------------------------------------
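
If the data have already been read in the wide layout given in the question
(one row per city with h1-h15), an array can do the restructuring instead of
re-reading the raw lines.  A minimal sketch, assuming the wide data set is
named heights as in the question; the output name heights_long is made up for
illustration.  The WILCOXON option limits the output to the rank-based
analysis, which for three groups is the Kruskal-Wallis test.

---------------------------------------------------------------------------
data heights_long ;
     set heights ;
     array h{15} h1-h15 ;
     do i = 1 to 15 ;
        height = h{i} ;      * one output row per child ;
        output ;
     end ;
     keep city height ;
run ;

proc npar1way data = heights_long wilcoxon ;
     class city ;
     var height ;
run ;
---------------------------------------------------------------------------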



3.  A case-control study was performed in which the cases were people over 60 who had liver
    cancer, and the controls were people over 60 who did not have liver cancer.  There were
    twice as many controls as cases.  The risk factor of interest was exposure to abdominal
    CT scans.  The data were as follows:

                Case       Control
            -------------------------
    Abdom   |           |           |
    CT  Yes |    104    |    150    |  n1 = 254
            |           |           |
            | exp=84.67 | exp=169.33|
            -------------------------
    Abdom   |           |           |
    CT  No  |    496    |   1050    |  n2 = 1546
            |           |           |
            | exp=515.33|exp=1030.67|
            -------------------------
              m1 = 600    m2 = 1200    N = 1800



    a)  What null hypothesis might you want to test?

        H0:  prob(Abdom CT = Yes | case) = prob(Abdom CT = Yes | control)
 [5]
        Or:  H0: Odds ratio = 1.



    b)  Compute an appropriate odds ratio estimate and explain what it means.  If your
        odds ratio estimate is bigger than 1 or less than 1, tell what that means in
        terms of the relationship between exposure to CT scans and liver cancer.

        OR estimate = 104*1050/(150*496) = 1.468.

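 [10]   The estimated odds of prior exposure to abdominal CT scans are about
        1.47 times as high among people with liver cancer as among controls.
        Because the estimate is greater than 1, exposure to abdominal CT scans
        is positively associated with liver cancer in these data.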



    c)  Compute expected counts for each cell in the table, assuming the null hypothesis.


        See above: (row margin) X (column margin) / Total
[5]
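
        As a check, using the margins in the table (each result rounded to
        two decimals):

          CT Yes, Case:      254 X  600 / 1800 =   84.67
          CT Yes, Control:   254 X 1200 / 1800 =  169.33
          CT No,  Case:     1546 X  600 / 1800 =  515.33
          CT No,  Control:  1546 X 1200 / 1800 = 1030.67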



    d)  Write a SAS program (PROC FREQ) to analyze this data, including computation of the 
        odds ratio estimate and a 95% confidence interval for the true odds ratio.  Tell how 
        you might explain to a nonstatistician what it means if the 95% confidence interval 
        does not include the number 1.0.

[15]

---------------------------------------------------------------------------
        data livercan ;
             input case abdomct count ;
             datalines ;
             1  1  104
             0  1  150
             1  0  496
             0  0 1050
             ;
       run ;

       proc freq data = livercan ;
            weight count ;
            tables abdomct * case / chisq measures ;
       run ;
---------------------------------------------------------------------------

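    If the null hypothesis (OR = 1) were true, you would expect that in
repetitions of this study, 95% of the confidence intervals for the OR would
contain the value 1.  So a 95% confidence interval that does not include 1.0
is interpreted as evidence that the null hypothesis is not true.  To a
nonstatistician: if CT exposure and liver cancer were truly unrelated, data
like these would be hard to explain by chance alone.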

---------------------------------------------------------------------------
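
    The interval that PROC FREQ reports can be approximated by hand on the
log scale.  A sketch, assuming the usual large-sample (Wald) interval for the
log odds ratio; the variable names are made up for illustration.  For these
counts it gives roughly 1.12 to 1.93, which does not include 1.0.

---------------------------------------------------------------------------
data _null_ ;
     a = 104 ;  b = 150 ;  c = 496 ;  d = 1050 ;
     oddsr = (a * d) / (b * c) ;            * odds ratio estimate ;
     se    = sqrt(1/a + 1/b + 1/c + 1/d) ;  * standard error of log odds ratio ;
     lower = exp(log(oddsr) - 1.96 * se) ;  * lower 95% limit ;
     upper = exp(log(oddsr) + 1.96 * se) ;  * upper 95% limit ;
     put oddsr= 6.3 lower= 6.3 upper= 6.3 ;
run ;
---------------------------------------------------------------------------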

    e)  If the chi-square test in PROC FREQ is highly significant (p < .0001), does that
        mean that abdominal CT scans cause liver cancer?  Why or why not?

        Not necessarily.  It could be that a history of liver disease
        leads to more abdominal CT scans (reverse causation).  Or it could
        be that liver cancer is correlated with some other variable that
        results in more CT scans (confounding).

        In general, case-control studies do not yield causal conclusions,
        only evidence of association.
[5]




4.  The graph below shows cholesterol levels for 4 groups of people:

    ----------------------------------------------------------------------
          250 +                                      A
              |                                                        B
              |                    A                 A                 B
              |                    B                                   E
              |  A                 D                 C                 C
              |                    B                 D                 A
Cholesterol   |                    F                 F                 B
              |  A                 C                 D                 E
          200 +  B                 C                 D                 B
              |  F                 C                 A                 B
              |  F                                   A
              |  B                                                     A
              |  C                 A
              |  B
              |  B
              |
          150 +
              ---+-----------------+-----------------+-----------------+--
                 Men             Women              Men              Women
               No bacon         No bacon           Bacon             Bacon
    ----------------------------------------------------------------------

    The vertical axis is cholesterol level.  The four groups are defined by
gender and whether or not the person habitually eats bacon.  There are 25 people
in each group.  Assume you have the data on a file with the following structure:

    Gender       Bacon
   (0 = male   (1 = yes,  Cholesterol
  1 = female)   0 = no)  (mg/deciliter)
   ----------   -------   -------------
      1            0         231
      0            0         177
      0            1         250

                etc.

    Write a SAS program to read in the data.  Write appropriate analysis
routines to see what the effects of gender and bacon are on cholesterol level.
There should be two analyses: one for a model that includes only main effects, and
one for a model that includes main effects and an interaction term.  What does
it mean if the interaction of gender and bacon is positive and significant?
[Continue on next page if necessary]

[25]

---------------------------------------------------------------------------
    data bacon ;
         infile 'diet.bacon' ;
         input gender bacon chol ;

    run ;

    proc glm data = bacon ;
         class gender bacon ;
         model chol = gender bacon / solution ss2 ;
    title1 'Model 1: cholesterol = b0 + b1*gender + b2*bacon + error' ;
    run ;

    proc glm data = bacon ;
         class gender bacon ;
         model chol = gender bacon gender * bacon / solution ss2 ;
    title1 'Model 2: cholesterol = b0 + b1*gender + b2*bacon + b3*gender*bacon + error' ;
    run ;

---------------------------------------------------------------------------

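    If the coefficient of the interaction term is positive and significant, it
means that bacon raises cholesterol more in women than it does in men, and that
this difference is larger than chance alone would readily explain.  The effects
of gender and bacon are then not simply additive.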
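
    With the 0/1 coding, Model 2 gives these expected cholesterol levels:

      Men,   no bacon:  b0
      Women, no bacon:  b0 + b1
      Men,   bacon:     b0 + b2
      Women, bacon:     b0 + b1 + b2 + b3

so the bacon effect is b2 in men and b2 + b3 in women, and b3 is the difference
between the two.  The coefficients can be estimated under exactly this coding
with an explicit product term.  A sketch; the data set name bacon2 and the
variable gb are made up for illustration.

---------------------------------------------------------------------------
    data bacon2 ;
         set bacon ;
         gb = gender * bacon ;     * equals 1 only for women who eat bacon ;
    run ;

    proc reg data = bacon2 ;
         model chol = gender bacon gb ;
    run ;
    quit ;
---------------------------------------------------------------------------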

