PubH 5470-3 home page

PubH 5470-3 STATISTICAL DATA ANALYSIS USING SAS PROCEDURES: SOME REVIEW QUESTIONS

Web address of this page: http://www.biostat.umn.edu/~john-c/ph5470-3.review.s2004.html

 April 28, 2004

 1.  A datafile for a program is on an external file with the following
     address:

       /home/gnome/bobw/data/datafil1

     Data on the file looks like the following:

     ID     Group    Date Randomized    Gender   Birthdate   Cholesterol
  -------  ------   -----------------  -------  ----------- -------------
  A-003-4  No-Drug    2004/07/04          M     1976/11/08      230
  A-005-8  Drug-A     2004/08/14          F     1970/02/28      -99
  A-011-3  Drug-B     2004/10/11          M           .         208
  A-013-1  Drug-A     2004/10/18          M     1975/04/30      244
  A-022-5  No-Drug    2005/03/12          F     1969/05/21      282
  A-105-3  Drug-A     2003/12/21        -99           .         180

                       [more observations]



  Write a program which reads this datafile and computes the person's
age in years.  Note that values of gender and cholesterol which are missing 
are denoted by '-99'.  Missing dates are denoted by '.' .

  Write a procedure to test the hypothesis that the serum cholesterol
level is not related to drug assignment (controlling for age and gender).

Answers:

----------------------------------------------------------------------------------
options linesize = 80 ;
footnote "/home/gnome/john-c/5421/review1.sas &sysdate &systime" ;

data chol ;
     infile 'prob1.data' ;
     length id $7  group $7  gender $3 ;
     input  id  
            group
            @23 daterand  yymmdd10.
            gender
            @49 birtdate  yymmdd10.
            chol ;

     if gender eq '-99' then gender = '' ;
     if chol le '-99' then chol = '.' ;
run ;

proc print data = chol ;
run ;

proc glm data = chol ;
     class group gender ;
     model chol = age gender group / solution ;
title 'Analysis of chol vs. drug group, controlling for age, gender' ;
run ;
----------------------------------------------------------------------------------
proc glm data = chol ;
     class group gender ;
     model chol = age gender group / solution ;
title 'Analysis of chol vs. drug group, controlling for age, gender' ;
run ;
==================================================================================

 2.  Dr. Smith performs a regression of diastolic blood pressure on dietary
     intake of sodium.  He plots the residuals and obtains the following
     graph:                                                      x

           x                                  x        x
  15 |  x       x
     |
  10 |                                           x
     |     x                             x                         x
   5 |                 x             x       x       x
     |          x              x
   0 ---------------x-------x-----------------------------------------------
     |   x    x          x             x          x
  -5 |                           x                        x
     |  x
 -10 |      x                                        x
     |  x                                                      x        x
 -15 |                                  x
     _______________________________________________________________________
       100     200     300     400     500     600     700     800     1000

                               Sodium intake (mg/day)


     a)  What is the mean residual ?

         Zero.


     b)  If you regress the residuals on sodium intake, what will be the
         result ?

--->     0 intercept, 0 slope (or coefficient for sodium) where-ever it occurs in the
         analysis section.

     c)  How can you use the residuals to test for outliers ?

--->     Output the residuals from proc reg onto a SAS dataset, then use proc
         univariate on them.


     d)  What does the above pattern of residuals suggest regarding the
         error structure of the relationship between DBP and sodium intake?

--->     Suggests that the variance of the measurements is not constant with
         respect to sodium intake, and presumably also not constant with
         respect to the response variable, chol.  However you would need to
         confirm the latter with a graph.  If that turns out to be true,
         then you should consider a transformation.  One possible transformation
         is:

                  transchol = sgn(chol - 80)*sqrt(abs(chol - 80)) ;

==================================================================================

 3.  The dataset below shows dose-levels of toxic chemical TCHEM, number of
     mice exposed to that dose-level (NMICE), and number of mice that died
     (NDIED):

         TCHEM       NMICE       NDIED
        -------     -------     -------
           0           16          1

           1           20          3

           2           21          4

           4           18          8

           8           22         14

          16           23         16

          32           19         19

     It is believe that the probability of death is related to the log of
the dose of TCHEM.

     a)  Write a program to analyze these data.

     b)  How would you use output from a program to estimate the dose which
         would be expected to result in the death of 50% of the mice ?
--------------------------------------------------------------------------------
a)   data toxic ;
          input dose nmice ndied ;
          logdose = log(dose + 1) ;
          nlived = nmice - ndied ;

          do i = 1 to nlived ;
             died = 0 ;
             output ;
          end ;

          do j = 1 to ndied ;
             died = 1 ;
             output ;
          end ;

      run ;

      proc logistic descending data = toxic ;
           model died = logdose ;
      run ;
--------------------------------------------------------------------------------
b)  Let b0 = estimate of intercept, b1 = estimate of logdose1 from the previous
    regression.  You want to find dose such that

           .5 = 1 / (1 + exp(-b0 - b1*logdose1))

    This is simply a matter of solving this equation for logdose1:

            2 = 1 + exp(-b0 - b1*logdose1), or

            1 = exp(-b0 - b1*logdose1).

    Take logs of both sides.  Note that log(1) = 0.  Therefore

            0 = -b0 - b1*logdose1 , or

            logdose1 = -b0/b1.  This means that

            log(dose + 1) = -b0/b1, or

            dose + 1 = exp(-b0/b1), or

            dose = exp(-b0/b1) - 1.

==================================================================================

 4.  The tables below show results of an experiment by a veterinary researcher
     on the calming effects of a drug on dogs and cats.  The objectives are to
     see whether dogs and cats have similar reactions to the drug, and whether
     the drug has a calming effect in either dogs or cats or both animals:

                          Dogs                     Cats
                 ---------------------     ---------------------
                     Drug    Placebo           Drug    Placebo
                 ---------------------     ---------------------
                 |         |         |     |         |         |
    Agitated     |    31   |    43   |     |    16   |    50   |
                 |         |         |     |         |         |
                 ---------------------     ---------------------
                 |         |         |     |         |         |
    Calm         |    69   |    57   |     |    84   |    50   |
                 |         |         |     |         |         |
                 ---------------------     ---------------------
                     100       100             100       100

                     OR =                      OR =


     a)  Write a complete SAS program to read in these data.

         data dogscats ;
              input animal drug response count ;
         cards ;
         1  1  1  31
         1  1  0  69
         1  0  1  43
         1  0  0  57
         0  1  1  16
         0  1  0  84
         0  0  1  50
         0  0  0  50
         ;
         run ;

     b)  Write a PROC FREQ procedure to carry out an appropriate analysis,
         and discuss how you would use the output.

--->     proc freq data = dogscats ;
              weight count ;
              tables animal * response * drug / cmh measures ;
         run ;

     c)  Write a PROC LOGISTIC procedures to carry out an appropriate analysis,
         and again discuss how you would use the output.

--->     1.  You need to re-write the datafile so that there is one observation
             per animal.

         2.  proc logistic descending data = dogscats ;
                  class animal response drug ;
                  model response = animal drug animal*drug / clodds = pl ;
             run ;

==================================================================================

 5.  a)  Why are CLASS variables useful in PROC GLM ?

--->     PROC GLM automatically creates indicator variables from the
         CLASS variables, and correctly codes interactions if interactions
         are specified.


     b)  Suppose CITY is a categorical variable with 5 possible values, and
         GENDER is a categorical variable with 2 values, and WEIGHT is a
         continuous (quantitative) variable.  You want to analyze a collection
         of 100 observations, studying the relationship of WEIGHT to CITY and
         GENDER.  You want to test for influential observations and outliers.
         Finally, you want to test for whether there is an interaction of
         CITY and GENDER.  Describe how you carry out the analysis using PROC REG.

--------------------------------------------------------------------------------
         1.  Create indicator variables for each of the cities: e.g., if
             one of the cities is St. Paul, create a variable called STPAUL
             which equals 0 if CITY is not equal to StPaul, and which equals 1
             if CITY equals St. Paul.   Similarly create an indicator
             variable for GENDER.

         2.  proc reg data = dataset ;
                  model WEIGHT = STPAUL CHICAGO NEWYORK BEIJING TOKYO GENDER ;
                  output out = regout rstudent = wstudent dffits = wdffits ;
             run ;

             proc print data = regout ;
                  var weight wstudent wdffits ;
             run ;
--------------------------------------------------------------------------------

         Write PROC GLM code to analyze the same data.

--------------------------------------------------------------------------------
             proc glm data = dataset ;
                  class city gender ;
                  model weight = city | gender / solution ;
             run ;

             [note: GLM does not provide statistics on influential points or
              outliers]
--------------------------------------------------------------------------------

     c)  Why are multiple comparisons tests (Bonferroni and others) sometimes
         used in one-way analysis of variance ?

--->     To decrease the probability that you will erroneously conclude that a factor
         is a significant predictor when in fact it is one of several factors
         that were considered, and the chance that one of them appears to be
         significant exceeds the probability that it alone is significant
         if the null hypothesis is true (i.e., actually none of the factors
         is related to the outcome).

==================================================================================

 6.  The following datafile is for a case-control study of skin-cancer incidence
     versus type of hat worn by farmers:

     Case-Control    Case or
        Set #        Control?    Type of Hat   Eye Color
     ------------    --------    -----------   ---------
          1            case         Strap        Brown
          1            control      Cowboy       Blue
          1            control      Strap        Brown
          2            case         Strap        Blue
          2            control      Cowboy       Blue
          2            control      Cowboy       Brown
          3            case         Cowboy       Blue
          3            control      Cowboy       Brown
          3            control      Strap        Blue

                      [more observations]

    Write a program which reads in the data and carries out an appropriate
    case-control analysis to determine whether the type of hat is related
    to the incidence of skin cancer, controlling for eye color.

--------------------------------------------------------------------------------

data casecon ;
     input ccset status hat eyecolor ;

     nstatus = . ;
     if status eq 'case' then nstatus = 1 ;
     if status eq 'control' then nstatus = 0 ;

     nhat = . ;
     if hat = 'Strap' then nhat = 1 ;
     if hat = 'Cowboy' then nhat = 2 ;

     ncolor = . ;
     if eyecolor = 'Brown' then ncolor = 1 ;
     if eyecolor = 'Blue' then ncolor = 2 ;

     time = 1 - nstatus ;

cards ;
          1            case         Strap        Brown
          1            control      Cowboy       Blue
          1            control      Strap        Brown
          2            case         Strap        Blue
          2            control      Cowboy       Blue
          2            control      Cowboy       Brown
          3            case         Cowboy       Blue
          3            control      Cowboy       Brown
          3            control      Strap        Blue

                      [more observations]

          ;

 run ;

 proc phreg data = casecon ;
      model time * nstatus(0) = hat eyecolor / rl ;
 run ;

==================================================================================

 7.  You conduct a random survey of people living in Minneapolis, MN,
     recording their weight, gender, age, and carbohydrate intake (in
     calories).  You want to construct a table like the following:

                    Carbohydrate Intake(g)                             Weight
     Carbohydrate   ---------------------               Percent      ------------
       Quintile     Mean      Min    Max      Mean Age    Women      Mean  StdDev
     ------------   ----     -----  -----     --------  --------     ----  ------
          1          201       50    300         35        67         56     20

          2          233       302   390         34        70         52     14

                                  [etc.]

     How do you do this ?
--------------------------------------------------------------------------------

     proc rank data = dataset group = 5  out = outrank ;
          var carbohyd ;
          ranks carbquin ;
     run ;

     proc means data = outrank n mean std min max ;
          class carbquin ;
          var   carbohyd age gender weight ;
     title 'Mean carbo intake, age, gender, and weight by carbo quintile.' ;


==================================================================================

 8.  The following is a summary of data from an eye study.  Each person
     in the study had one eye randomized to treatment and one eye randomized
     to control.  Each eye was evaluated as having a successful outcome or
     a failure outcome.

                                                      Number of
                Outcome Category                        People
     ----------------------------------------        ----------
     Treated eye success, control eye success            100
     Treated eye success, control eye failure             50
     Treated eye failure, control eye success             30
     Treated eye failure, control eye failure             80

     Write a SAS program which reads in this data, and which carries out an
     appropriate analysis using (1) PROC FREQ, and (2) PROC LOGISTIC.

--------------------------------------------------------------------------------
PROC FREQ approach:

     data eyes ;
          input trtsucc  contsucc  count ;
          cards ;
          1  1  100
          1  0   50
          0  1   30
          0  0   80
          ;
     run ;

     proc freq data = eyes ;
          weight count ;
          tables trtsucc * contsucc / agree ;
     run ;
--------------------------------------------------------------------------------
PROC LOGISTIC approach

     data eyes ;
          input trtsucc  contsucc  count ;
          one = 1 ;
          diff = trtsucc - contsucc ;
          do i = 1 to count ;
             output ;
          end ;

          cards ;
          1  1  100
          1  0   50
          0  1   30
          0  0   80
          ;
     run ;

     proc logistic data = eyes ;
          model one = diff / noint ;
     run ;
--------------------------------------------------------------------------------

     In PROC FREQ, what statistic tells you the relative effect of treatment
     versus control ?

---> Odds ratio

     In PROC LOGISTIC, what is the relationship of coefficient estimates
     to the relative effect of treatment versus control ?

---> exp(coeff) = Odds ratio

     Show how this same data can be analyzed using PROC PHREG.
--------------------------------------------------------------------------------
PROC PHREG approach

options linesize = 80 ;

     data eyes ;
          retain set 0 ;
          input trtsucc  contsucc  count ;
          one = 1 ;

          do i = 1 to count ;

            set = set + 1 ;

            group = 1 ;
            time = 1 - trtsucc ;
            case = trtsucc ;
            output ;

            group = 0 ;
            time = 1 - contsucc ;
            case = contsucc ;
            output ;

          end ;

          cards ;
          1  1  100
          1  0   50
          0  1   30
          0  0   80
          ;
     run ;

proc print data = eyes ;

     proc phreg data = eyes ;
***       model time * case(0) = group / ties = discrete ;
          strata set ;
     run ;


*** Note: the 'ties = discrete' option was omitted in the example in
    notes n54703.017.  It is essential in this procedure.

==================================================================================

 9.  The following dataset shows followup time and event or censoring status
     for a clinical trial of Drug 1 versus Drug 2:


                                               Event or
     Obs       Drug         Followup Time      Censored?    Age
    -----     ------       ---------------    -----------  -----
      1          1               73                C         24
      2          1               45                E         77
      3          2               12                E         40
      4          1               56                C         29
      5          2               65                E         64

                   [more observations]

     Write a program to analyze this data using PROC LIFETEST (Ignore effect
     of age).

--------------------------------------------------------------------------------
     data drugs ;
          infile 'drugfile' ;
          input drug follmons censored age ;
          ncensor = 1 ;
          if censored = 'E' then ncensor = 0 ;
     run ;

     proc lifetest data = drugs notable outsurv = surcurve ;
          time follmons * ncensor(0);
          strata drug ;
     run ;
--------------------------------------------------------------------------------

     What plots might you want to examine ?

---> Survival, log(survival), hazard.

     How do you decide which drug is better ?

---> Look at the two survival curves (drug = 1 vs drug = 2).  The curve
     with the higher survival rate corresponds to the better drug.

     Write a program to analyze this data, estimating the effect of drug
     while controlling for age.

--------------------------------------------------------------------------------
     proc phreg data = drugs ;
          model follmons * ncensor(0) = age drug ;
     run ;
==================================================================================

10.  Write SAS code to find p-values corresponding to the following statistics:

     T-statistic, two-sided, df = 25, t = -1.96

     Chi-square statistic, df = 10, X2 = 25.56

     F-statistic, df = (5, 100), F = 15.11

     Z-statistic, one-sided, z = 2.16.

--------------------------------------------------------------------------------
     tprob = 2 * probt(-1.96, 25) ;
     cprob = 1 - probchi(25.56, 10) ;
     fprob = 1 - probf(15.11, 5, 100) ;
     zprob = 1 - probnorm(2.16) ;

==================================================================================

11.  Random number generators:

     a)  Write a SAS program which randomly chooses a day in the year 2004.

--------------------------------------------------------------------------------
         data day ;
              seed = -1 ;
              rday = 1 + int(366*ranuni(seed)) ;
              dec3103 = mdy(12, 31, 2003) ;
              randday = rday + dec3103 ;
              output ;
         run ;
--------------------------------------------------------------------------------


     b)  Write a SAS program which randomly assigns 200 people to three drug
         groups, with the following probabilities:

         Group 1:  prob = .36

         Group 2:  prob = .24

         Group 3:  prob = .40

--------------------------------------------------------------------------------
 data drugs ;
      p1 = .36 ;
      p2 = .24 ;
      p3 = .40 ;
      p12 = p1 + p2 ;

      n = 200 ;

      do i = 1 to n ;
         r = ranuni(-1) ;
         group = 1 ;
         if r > p1 then group = 2 ;
         if r > p12 then group = 3 ;
         output ;
      end ;

 run ;
--------------------------------------------------------------------------------

     c)  Write a SAS program which randomly generates 100 numbers from a normal
         distribution with mean 5 and standard deviation 2.
--------------------------------------------------------------------------------

     data x100 ;

          do i = 1 to 100 ;
             x = 5 + 2 * rannor(-1) ;
             output ;
          end ;
     run ;
--------------------------------------------------------------------------------

     d)  Write a SAS program to generate a randomization schedule with 300
         people, with exactly 100 of them assigned to Drug A and 200 of them 
         assigned to Drug B.
--------------------------------------------------------------------------------
 data randschd ;
      p = .3333333333 ;
      seed = 12985634 ;

      do i = 1 to 300 ;

         r = ranuni(seed) ;
         assign = 'A' ;
         if i gt 100 then assign = 'B' ;
         output ;

      end ;
run ;

proc sort data = randschd ;  by r ;
run ;

--------------------------------------------------------------------------------

     e)  If you randomize in blocks to guarantee approximate balance between
         treatment groups at any point in the randomization schedule, what
         desirable characteristic of the schedule is impaired ?

--->  Unpredictability.

==================================================================================

12.  A datafile has 100 observations for husband-wife pairs as follows:

      Family      Husband's  Husband's  Husband's  Wife's  Wife's  Wife's
      Income         Age       DBP      Insomnia    Age     DBP    Insomnia
      -------     ---------  ---------  ---------  ------  ------  --------
      $35000         28         88          0        25      70        1
      $29300         53        100          1        54      94        1
      $67250         41         76          1        41      98        0

                        [more observations]


     a)  Write a SAS program to read the datafile and write out a new
         file which has one observation for the husband and one for the
         wife.
--------------------------------------------------------------------------------

data insomnia ;
     infile 'insomnia.data' ;
     input  income  husbage  husbdbp  husbinsm
                    wifeage  wifedbp  wifeinsm ;
     husb = 1 ;
     age = husbage ;
     dbp = husbdbp ;
     insom = husbinsm ;
     output ;

     husb = 0 ;
     age = wifeage ;
     dbp = wifedbp ;
     insom = wifeinsm ;
     output ;
run ;

--------------------------------------------------------------------------------

     b)  The problem is to determine whether DBP predicts insomnia.  You
         would want to control for income and age, and you want to analyze
         the data as paired.  Write the appropriate SAS code.


--->  It makes no sense to analyze this as paired data.  It would make sense
      if the predictor of interest were gender.


Most recent update: May 8, 2004.