PROC LOGISTIC, III: Continuous Covariates                         n54703.011

     In the preceding discussions of PROC LOGISTIC we have focussed on
examples in which categorical covariates (or 'predictors') are
employed.  However, PROC LOGISTIC, like PROC GLM, can be used with
continuous covariates also.

     If Y is a dichotomous outcome variable [e.g., success or failure
of an operation], and X1, X2, ..., Xp are p covariates, the usual logistic
model is:

                                                            1
     Prob(Y = 1 | X1, X2, ..., Xp) = ------------------------------------------ ,
                                     1 + exp(-b0 - b1*X1 - b2*X2 - ... - bp*Xp)


where b0, b1, b2, ..., bp are unknown coefficients.

     The covariates of interest are often continuous risk factors,
such as age, systolic blood pressure, or FEV1.  Age in particular is a
risk factor for a great many chronic conditions and for poor survival
after serious illness: in general you expect that the higher the age,
the greater the probability of an event.  In terms of the coefficients
in the above model, that translates into the coefficient of age being
*positive*.

     This also relates to the interpretation of exp(b1):  this is equal to an
odds ratio.  It is the odds ratio associated with a 1-unit increase in the
value of the covariate X1, with the other covariates held fixed.
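
     The following data step is a minimal sketch of why exp(b1) is an
odds ratio.  The values of b0, b1 and x are hypothetical, chosen only
for illustration; they do not come from any dataset.  The step computes
the odds at X1 = x and at X1 = x + 1 for a one-covariate model and
writes their ratio next to exp(b1) in the log.

```sas
data _null_ ;
     b0 = -2.0 ;   * hypothetical intercept ;
     b1 =  0.5 ;   * hypothetical coefficient of X1 ;
     x  =  3 ;     * hypothetical covariate value ;

     p_x  = 1 / (1 + exp(-b0 - b1*x)) ;         * Prob(Y = 1 | X1 = x) ;
     p_x1 = 1 / (1 + exp(-b0 - b1*(x + 1))) ;   * Prob(Y = 1 | X1 = x + 1) ;

     odds_x  = p_x  / (1 - p_x) ;
     odds_x1 = p_x1 / (1 - p_x1) ;

     ratio  = odds_x1 / odds_x ;   * odds ratio for a 1-unit increase ;
     exp_b1 = exp(b1) ;

     put ratio= exp_b1= ;
run ;
```

     The two values written to the log should be the same for any
choice of b0, b1 and x, because the odds equal exp(b0 + b1*x).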

     Appended is an example of logistic regression with several covariates.
The data are from the Lung Health Study.  The outcome variable is smoking
status at Year 1:  VPCQUIT1 = 0 means the person did not quit smoking, whereas
VPCQUIT1 = 1 means the person did quit smoking.  The covariates are AGE at
baseline; GENDER (0 = men, 1 = women); BMI (body mass index); F10CIGS (baseline
cigarettes per day); YEAREDUC (years education category, 1-9); and S2FEVPC2
(baseline FEV1 percent of normal).

     The first regression uses the selection = stepwise option.  The
second regression shows the effect of adding in interaction terms.

=================================================================================


 educage = age*educ ;
 educbmi = educ*bmi ;

* ==================================================================== ;        

options linesize = 100 ;

proc logistic data = smoke descending ;
     where siuc eq 1 ;
     model vpcquit1 = age gender bmi f10cigs yeareduc s2fevpc2
                      /selection = stepwise lackfit rsquare clodds = pl;
title1 'PROC LOGISTIC: quit-smoking at 1 year versus demographic vars' ;
title2 'and baseline FEV1 percent predicted.  Stepwise option employed.' ;
title3 'Also lack-of-fit statistics, pseudo-R-square' ;

proc logistic data = smoke descending ;
     where siuc eq 1 ;
     model vpcquit1 = yeareduc age bmi educage educbmi
                      /lackfit rsquare clodds = pl;
title1 'PROC LOGISTIC: quit-smoking at 1 year versus demographic vars' ;
title2 'and interactions. Also lack-of-fit statistics, pseudo-R-square' ;
title3 '' ;
run ;

endsas;

----------------------------------------------------------------------------------------------------
                   PROC LOGISTIC: quit-smoking at 1 year versus demographic vars                   1
                  and baseline FEV1 percent predicted.  Stepwise option employed.
                            Also lack-of-fit statistics, pseudo-R-square
                                                                      17:37 Saturday, March 13, 2004

                                       The LOGISTIC Procedure

               Data Set: WORK.SMOKE   
               Response Variable: VPCQUIT1  VALID QUIT AT AV1
               Response Levels: 2
               Number of Observations: 3916
               Link Function: Logit


                                          Response Profile
 
                                    Ordered
                                      Value  VPCQUIT1     Count

                                          1         1      1354
                                          2         0      2562

WARNING: 7 observation(s) were deleted due to missing values for the response or explanatory 
         variables.



                                    Stepwise Selection Procedure


Step  0. Intercept entered:


                         Residual Chi-Square = 56.3119 with 6 DF (p=0.0001)



Step  1. Variable YEAREDUC entered:


                Model Fitting Information and Testing Global Null Hypothesis BETA=0
 
                                         Intercept
                           Intercept        and   
             Criterion       Only       Covariates    Chi-Square for Covariates

             AIC            5051.940      5026.333         .                          
             SC             5058.213      5038.878         .                          
             -2 LOG L       5049.940      5022.333       27.607 with 1 DF (p=0.0001)  
             Score              .             .          27.458 with 1 DF (p=0.0001)  

                      RSquare = 0.0070          Max-rescaled RSquare = 0.0097



                         Residual Chi-Square = 28.8407 with 5 DF (p=0.0001)

 
 
                        LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 13MAR04 17:37

Step  2. Variable AGE entered:


                Model Fitting Information and Testing Global Null Hypothesis BETA=0
 
                                         Intercept
                           Intercept        and   
             Criterion       Only       Covariates    Chi-Square for Covariates

             AIC            5051.940      5019.408         .                          
             SC             5058.213      5038.226         .                          
             -2 LOG L       5049.940      5013.408       36.533 with 2 DF (p=0.0001)  
             Score              .             .          36.451 with 2 DF (p=0.0001)  

                      RSquare = 0.0093          Max-rescaled RSquare = 0.0128



                         Residual Chi-Square = 19.9902 with 4 DF (p=0.0005)



Step  3. Variable BMI entered:


                Model Fitting Information and Testing Global Null Hypothesis BETA=0
 
                                         Intercept
                           Intercept        and   
             Criterion       Only       Covariates    Chi-Square for Covariates

             AIC            5051.940      5012.708         .                          
             SC             5058.213      5037.799         .                          
             -2 LOG L       5049.940      5004.708       45.232 with 3 DF (p=0.0001)  
             Score              .             .          45.113 with 3 DF (p=0.0001)  

                      RSquare = 0.0115          Max-rescaled RSquare = 0.0158



                         Residual Chi-Square = 11.2975 with 3 DF (p=0.0102)



Step  4. Variable S2FEVPC2 entered:


                Model Fitting Information and Testing Global Null Hypothesis BETA=0
 
                                         Intercept
                           Intercept        and   
             Criterion       Only       Covariates    Chi-Square for Covariates

             AIC            5051.940      5006.806         .                          
             SC             5058.213      5038.170         .                          
             -2 LOG L       5049.940      4996.806       53.134 with 4 DF (p=0.0001)  
             Score              .             .          52.875 with 4 DF (p=0.0001)  

                      RSquare = 0.0135          Max-rescaled RSquare = 0.0186



                         Residual Chi-Square = 3.4348 with 2 DF (p=0.1795)



NOTE: No (additional) variables met the 0.05 significance level for entry into the model.


                                    Summary of Stepwise Procedure
 
            Variable       Number     Score       Wald        Pr >     Variable
 Step  Entered   Removed       In  Chi-Square  Chi-Square  Chi-Square    Label 

    1  YEAREDUC                 1     27.4576           .      0.0001  YEARS  EDUCATION            
    2  AGE                      2      8.8999           .      0.0029  AGE AT ENTRY INTO LHS       
    3  BMI                      3      8.7354           .      0.0031  BODY MASS INDEX (KG/M2)     
    4  S2FEVPC2                 4      7.8697           .      0.0050  FEV1 % PRED POST-BD SCREEN 2


                             Analysis of Maximum Likelihood Estimates
 
                  Parameter    Standard       Wald          Pr >       Standardized        Odds
Variable    DF     Estimate      Error     Chi-Square    Chi-Square      Estimate         Ratio

INTERCPT    1       -3.8862      0.5186       56.1531        0.0001               .        .       
AGE         1        0.0177     0.00509       12.0865        0.0005        0.066710       1.018    
BMI         1        0.0263     0.00860        9.3313        0.0023        0.056911       1.027    
YEAREDUC    1        0.0640      0.0121       28.1388        0.0001        0.100014       1.066    
S2FEVPC2    1        0.0107     0.00383        7.8554        0.0051        0.053863       1.011    


                   Association of Predicted Probabilities and Observed Responses

                             Concordant = 56.4%          Somers' D = 0.139
                             Discordant = 42.5%          Gamma     = 0.141
                             Tied       =  1.0%          Tau-a     = 0.063
                             (3468948 pairs)             c         = 0.570


                        Conditional Odds Ratios and 95% Confidence Intervals
 
                                                           Profile Likelihood
                                                            Confidence Limits
                                                  Odds
                      Variable        Unit       Ratio       Lower       Upper

                      AGE           1.0000       1.018       1.008       1.028
                      BMI           1.0000       1.027       1.009       1.044
                      YEAREDUC      1.0000       1.066       1.041       1.092
                      S2FEVPC2      1.0000       1.011       1.003       1.018


                              Hosmer and Lemeshow Goodness-of-Fit Test
 
                                           VPCQUIT1 = 1            VPCQUIT1 = 0
                                       --------------------    --------------------
                  Group       Total    Observed    Expected    Observed    Expected

                      1         394          95      100.76         299      293.24
                      2         392         127      113.73         265      278.27
                      3         392         120      120.87         272      271.13
                      4         394         123      127.16         271      266.84
                      5         393         118      131.82         275      261.18
                      6         391         142      136.21         249      254.79
                      7         390         141      141.57         249      248.43
                      8         393         152      149.16         241      243.84
                      9         392         160      158.32         232      233.68
                     10         385         176      174.29         209      210.71

                       Goodness-of-fit Statistic = 5.544 with 8 DF (p=0.6982)

----------------------------------------------------------------------------------------------------
                    PROC LOGISTIC: quit-smoking at 1 year versus demographic vars                   5
                   and interactions. Also lack-of-fit statistics, pseudo-R-square
                                                 '                    17:37 Saturday, March 13, 2004

                                       The LOGISTIC Procedure

               Data Set: WORK.SMOKE   
               Response Variable: VPCQUIT1  VALID QUIT AT AV1
               Response Levels: 2
               Number of Observations: 3916
               Link Function: Logit


                                          Response Profile
 
                                    Ordered
                                      Value  VPCQUIT1     Count

                                          1         1      1354
                                          2         0      2562

WARNING: 7 observation(s) were deleted due to missing values for the response or explanatory 
         variables.



                Model Fitting Information and Testing Global Null Hypothesis BETA=0
 
                                         Intercept
                           Intercept        and   
             Criterion       Only       Covariates    Chi-Square for Covariates

             AIC            5051.940      5014.621         .                          
             SC             5058.213      5052.258         .                          
             -2 LOG L       5049.940      5002.621       47.319 with 5 DF (p=0.0001)  
             Score              .             .          46.732 with 5 DF (p=0.0001)  

                      RSquare = 0.0120          Max-rescaled RSquare = 0.0166



                              Analysis of Maximum Likelihood Estimates
 
             Parameter Standard    Wald       Pr >    Standardized     Odds Variable
 Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio   Label 

 INTERCPT 1    -5.1471   1.6542     9.6815     0.0019            .     .    Intercept              
 YEAREDUC 1     0.2289   0.1177     3.7839     0.0517     0.357704    1.257 YEARS  EDUCATION       
 AGE      1     0.0459   0.0251     3.3474     0.0673     0.173002    1.047 AGE AT ENTRY INTO LHS  
 BMI      1     0.0525   0.0431     1.4794     0.2239     0.113603    1.054 BODY MASS INDEX (KG/M2)
 EDUCAGE  1   -0.00225  0.00177     1.6187     0.2033    -0.203210    0.998                        
 EDUCBMI  1   -0.00201  0.00310     0.4188     0.5175    -0.100583    0.998                        


                   Association of Predicted Probabilities and Observed Responses

                             Concordant = 56.1%          Somers' D = 0.134
                             Discordant = 42.8%          Gamma     = 0.135
                             Tied       =  1.1%          Tau-a     = 0.060
                             (3468948 pairs)             c         = 0.567


                        Conditional Odds Ratios and 95% Confidence Intervals
 
                                                           Profile Likelihood
                                                            Confidence Limits
                                                  Odds
                      Variable        Unit       Ratio       Lower       Upper

                      YEAREDUC      1.0000       1.257       0.998       1.583
                      AGE           1.0000       1.047       0.997       1.100
                      BMI           1.0000       1.054       0.968       1.147
                      EDUCAGE       1.0000       0.998       0.994       1.001
                      EDUCBMI       1.0000       0.998       0.992       1.004


                              Hosmer and Lemeshow Goodness-of-Fit Test
 
                                           VPCQUIT1 = 1            VPCQUIT1 = 0
                                       --------------------    --------------------
                  Group       Total    Observed    Expected    Observed    Expected

                      1         393          98      100.53         295      292.47
                      2         391         121      114.67         270      276.33
                      3         393         121      122.35         272      270.65
                      4         394         109      127.97         285      266.03
                      5         392         135      132.52         257      259.48
                      6         394         148      137.97         246      256.03
                      7         392         146      142.65         246      249.35
                      8         393         151      149.32         242      243.68
                      9         394         170      158.21         224      235.79
                     10         380         155      167.68         225      212.32

                       Goodness-of-fit Statistic = 9.298 with 8 DF (p=0.3178)

=================================================================================

     The stepwise regression selected 4 of the covariates as useful
predictors of the outcome (quitting smoking at Year 1): years education,
age, body mass index, and baseline FEV1 percent of normal.  All had
positive coefficients, indicating that higher levels of any of them
are associated with an increased chance of having quit smoking by
Year 1.
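
     As a rough illustration of how the fitted model turns covariate
values into a probability, the data step below plugs the estimates from
the printout into the logistic formula.  The participant's values (age
50, BMI 25, education category 5, FEV1 80 percent of normal) are
hypothetical and are not taken from the Lung Health Study data.

```sas
data _null_ ;
     /* hypothetical participant */
     age = 50 ; bmi = 25 ; yeareduc = 5 ; s2fevpc2 = 80 ;

     /* linear predictor, using the estimates from the final stepwise model */
     xbeta = -3.8862 + 0.0177*age + 0.0263*bmi
             + 0.0640*yeareduc + 0.0107*s2fevpc2 ;

     /* estimated probability of having quit smoking at Year 1 */
     p_quit = 1 / (1 + exp(-xbeta)) ;

     put xbeta= p_quit= ;
run ;
```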

     Note that the coefficient for AGE is 0.0177.  The odds ratio
corresponding to a 1-year increase in age is exp(0.0177) = 1.018.
This is obviously not very different from 1.  In fact you might
not be very interested in the effect of only a 1-year increase in
age.  You might be more interested in the effect of a 10-year
increment in age.  This can be calculated also.  The odds ratio
of quitting smoking for a 10-year increase in age is:

     exp(10*.0177) = exp(.177) = 1.1936.
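
     This arithmetic can be checked in a data step, and PROC LOGISTIC
can also report the 10-year odds ratio itself.  The sketch below
assumes the dataset smoke and the WHERE condition of the program
above.  The UNITS statement asks for the odds ratio for a 10-unit
change in AGE; it is reported in the odds ratio table requested by
CLODDS=, along with its confidence limits.

```sas
data _null_ ;
     b_age = 0.0177 ;            * AGE coefficient from the printout ;
     or_1  = exp(b_age) ;        * odds ratio for a 1-year increase ;
     or_10 = exp(10 * b_age) ;   * odds ratio for a 10-year increase ;
     put or_1= or_10= ;
run ;

proc logistic data = smoke descending ;
     where siuc eq 1 ;
     model vpcquit1 = age bmi yeareduc s2fevpc2 / clodds = pl ;
     units age = 10 ;
run ;
```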


     SAS will compute a 'pseudo-R-square' for logistic regressions
(the RSQUARE option on the MODEL statement).  This is analogous to the
R-square statistic for ordinary linear regression.  If R-square is
close to 1, it is an indication that much of the variability in outcome
is due to the regression, i.e., is explained by the covariates.

     In this case the final value of R-square is very small: 0.0135.  This
implies that almost all of the variability is due to factors that are
not entered into the model.  They may be factors that were not even
measured in the study.  Quitting smoking is a complex behavioral change:
it is believed to involve physical and psychological dependence, brain
chemistry and genetics, social pressures, economics, fears of weight
gain, co-existing chronic diseases, work restrictions, and many other
factors, and, probably, interactions of factors.
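
     The printed R-square values can be reproduced from the -2 Log L
figures.  The sketch below assumes the generalized R-square definition,
1 - exp(-(change in -2 Log L)/n), and rescales it by its largest
attainable value, 1 - exp(-(-2 Log L for the intercept-only model)/n).
That assumption is consistent with the printout: the results should
agree with RSquare = 0.0135 and Max-rescaled RSquare = 0.0186.

```sas
data _null_ ;
     n      = 3916 ;       * number of observations ;
     m2ll_0 = 5049.940 ;   * -2 Log L, intercept only ;
     m2ll_1 = 4996.806 ;   * -2 Log L, final stepwise model ;

     rsq     = 1 - exp(-(m2ll_0 - m2ll_1) / n) ;
     rsq_max = 1 - exp(-m2ll_0 / n) ;   * largest value rsq can attain ;
     rsq_adj = rsq / rsq_max ;          * max-rescaled R-square ;

     put rsq= 6.4 rsq_adj= 6.4 ;
run ;
```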


     Another option on the MODEL statement that was used here is
LACKFIT, which requests the Hosmer-Lemeshow goodness-of-fit test.
Values of the test statistic are compared to values from a chi-square
distribution.  In this case, the chi-square distribution has 8 degrees
of freedom and the value of the statistic is 5.544.  This yields a
p-value of 0.6982.  Thus this test does not provide any convincing
evidence that the fit of the model is not good.
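
     The p-value is the upper tail area of the chi-square distribution
beyond the observed statistic.  A minimal sketch using the PROBCHI
function, with the statistic and degrees of freedom copied from the
printout, is shown here; the result should agree with the printed
p = 0.6982.

```sas
data _null_ ;
     chisq = 5.544 ;   * Hosmer-Lemeshow statistic from the printout ;
     df    = 8 ;       * degrees of freedom from the printout ;
     p     = 1 - probchi(chisq, df) ;   * upper tail probability ;
     put p= 6.4 ;
run ;
```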

     The second analysis shows the effect of adding two interaction
terms:  years education with age (EDUCAGE) and years education with
body mass index (EDUCBMI).  As shown at the top of the program,
these are both defined simply as the product of the two
variables.
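
     When product terms are in the model, the odds ratio for a 1-unit
increase in YEAREDUC is no longer a single number: it depends on AGE
and BMI through exp(b_educ + b_educage*AGE + b_educbmi*BMI).  The
sketch below evaluates this with the estimates from the second
printout.  The ages and the BMI value are hypothetical, chosen only
for illustration.

```sas
data _null_ ;
     b_educ    =  0.2289 ;    * YEAREDUC coefficient ;
     b_educage = -0.00225 ;   * EDUCAGE coefficient ;
     b_educbmi = -0.00201 ;   * EDUCBMI coefficient ;

     bmi = 25 ;               * hypothetical BMI ;
     do age = 40, 50, 60 ;    * hypothetical ages ;
        or_educ = exp(b_educ + b_educage*age + b_educbmi*bmi) ;
        put age= bmi= or_educ= ;
     end ;
run ;
```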

     The best way to assess whether adding these two interaction
terms makes a significant difference in the model is to examine
the change in -2 Log L (a likelihood ratio test) from a model which
does not include the interaction terms, to another model which does.
The results were as follows:

Model vpcquit1 = yeareduc age bmi                 -2 Log L = 5004.708

Model vpcquit1 = yeareduc age bmi educage educbmi -2 Log L = 5002.621

---------------------------------------------------------------------
 Diff(-2 Log L)                                            =    2.087

This should be compared to a chi-squared distribution with 2
degrees of freedom.  The p-value for testing the null hypothesis that
the two added interaction terms do not make a difference is:  0.3522.
This is not sufficient evidence to reject the null hypothesis.

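[Note that the first of these two models was not run separately.  Its
 -2 Log L is the value printed at Step 3 of the stepwise procedure,
 where the model contains YEAREDUC, AGE, and BMI.]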
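
     The p-value for this comparison can be computed in a data step.
This sketch copies the two -2 Log L values from the printout and uses
the PROBCHI function; the result should agree with the p = 0.3522
quoted above.

```sas
data _null_ ;
     m2ll_reduced = 5004.708 ;   * yeareduc age bmi ;
     m2ll_full    = 5002.621 ;   * plus educage and educbmi ;
     df    = 2 ;                 * number of added terms ;

     chisq = m2ll_reduced - m2ll_full ;
     p     = 1 - probchi(chisq, df) ;   * upper tail probability ;

     put chisq= 6.3 p= 6.4 ;
run ;
```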


     One precaution is needed when comparing models using -2 Log L.
Say, for example, you were comparing:

     Model 1:     model y = x ;

and  Model 2:     model y = x z ;

     For the comparison of these two models based on -2 Log L, the
number of cases analyzed *must be the same* for these two models.
Note that it is possible that the covariate z may have missing values
in certain cases, while the covariate x is not missing in those same
cases.  In such cases -2 Log L should NOT be used uncritically
to compare the effect of adding the covariate z to the model.  You must
make sure that both analyses are carried out on the same dataset.  Here
is how to do it:

Model 1:    proc logistic descending data = dataset ;
                 where x ne . and z ne . ;
                 model y = x ;
                 title1 'Model 1: y vs x, restricted to cases where x and z are not missing.' ;
                 run ;

Model 2:    proc logistic descending data = dataset ;
                 where x ne . and z ne . ;
                 model y = x  z ;
                 title1 'Model 2: y vs x and z, restricted to cases where x and z are not missing.' ;
                 run ;
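
     An alternative, sketched below, is to build the complete-case
dataset once and run every model on it.  The dataset name complete is
hypothetical; dataset, y, x and z are the placeholders used above.  The
sketch assumes y, x and z are numeric, since the NMISS function counts
missing numeric values only.

```sas
data complete ;
     set dataset ;
     if nmiss(y, x, z) = 0 ;   * keep only cases with y, x and z all present ;
run ;

proc logistic descending data = complete ;
     model y = x ;
     title1 'Model 1: y vs x, complete cases only.' ;
run ;

proc logistic descending data = complete ;
     model y = x z ;
     title1 'Model 2: y vs x and z, complete cases only.' ;
run ;
```

     With this arrangement the 'Number of Observations' line in the
two printouts should match, which is a quick check that the -2 Log L
values are comparable.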


=================================================================================
Problem 1.

     Below is a program which reads a dataset based on a study of arthritis
treatment.  Participants in the study received either a treatment drug
or a placebo.  The outcome variable is called 'better'.  Other variables
in the study are ntreat, age and gender.

     You can use the 'Save as' option in the 'File' menu on a browser
to save this file, then edit it as you see fit.

     Use logistic regression to analyze these data.  Specifically:

     Model 1:  Age only as a predictor of 'better'.
     Model 2:  Gender only.
     Model 3:  ntreat only.
     Model 4:  Age and gender.
     Model 5:  Age and ntreat.
     Model 6:  Gender and ntreat.
     Model 7:  Age, gender and ntreat.

     For each of these models, tabulate the odds ratios and 95%
confidence intervals for each variable in the model.  Also for each
variable in these models, determine whether that variable makes a
statistically significant difference in the model:  specify also exactly
what null hypothesis is being tested.

     Test for interactions of:

       Age with ntreat
       Gender with ntreat
       Age with gender

     Looking over all the analyses you have carried out, specify the model 
which you think is best, and justify your choice.

------------------------------------------------------------------------
options linesize = 80 ;
footnote "yourprogram.sas &sysdate &systime" ;

data arthrit;
   length treat $7. sex $6. ;
   input id treat $ sex $ age improve @@ ;
   better  = 0 ;
   if improve > 0 then better = 1 ;
   gender = 0 ;
   if sex eq 'Female' then gender = 1 ;
   ntreat = 0 ;
   if treat eq 'Treated' then ntreat = 1 ;

  cards ;
57 Treated Male   27 1   9 Placebo Male   37 0
46 Treated Male   29 0  14 Placebo Male   44 0
77 Treated Male   30 0  73 Placebo Male   50 0
17 Treated Male   32 2  74 Placebo Male   51 0
36 Treated Male   46 2  25 Placebo Male   52 0
23 Treated Male   58 2  18 Placebo Male   53 0
75 Treated Male   59 0  21 Placebo Male   59 0
39 Treated Male   59 2  52 Placebo Male   59 0
33 Treated Male   63 0  45 Placebo Male   62 0
55 Treated Male   63 0  41 Placebo Male   62 0
30 Treated Male   64 0   8 Placebo Male   63 2
 5 Treated Male   64 1  80 Placebo Female 23 0
63 Treated Male   69 0  12 Placebo Female 30 0
83 Treated Male   70 2  29 Placebo Female 30 0
66 Treated Female 23 0  50 Placebo Female 31 1
40 Treated Female 32 0  38 Placebo Female 32 0
 6 Treated Female 37 1  35 Placebo Female 33 2
 7 Treated Female 41 0  51 Placebo Female 37 0
72 Treated Female 41 2  54 Placebo Female 44 0
37 Treated Female 48 0  76 Placebo Female 45 0
82 Treated Female 48 2  16 Placebo Female 46 0
53 Treated Female 55 2  69 Placebo Female 48 0
79 Treated Female 55 2  31 Placebo Female 49 0
26 Treated Female 56 2  20 Placebo Female 51 0
28 Treated Female 57 2  68 Placebo Female 53 0
60 Treated Female 57 2  81 Placebo Female 54 0
22 Treated Female 57 2   4 Placebo Female 54 0
27 Treated Female 58 0  78 Placebo Female 54 2
 2 Treated Female 59 2  70 Placebo Female 55 2
59 Treated Female 59 2  49 Placebo Female 57 0
62 Treated Female 60 2  10 Placebo Female 57 1
84 Treated Female 61 2  47 Placebo Female 58 1
64 Treated Female 62 1  44 Placebo Female 59 1
34 Treated Female 62 2  24 Placebo Female 59 2
58 Treated Female 66 2  48 Placebo Female 61 0
13 Treated Female 67 2  19 Placebo Female 63 1
61 Treated Female 68 1   3 Placebo Female 64 0
65 Treated Female 68 2  67 Placebo Female 65 2
11 Treated Female 69 0  32 Placebo Female 66 0
56 Treated Female 69 1  42 Placebo Female 66 0
43 Treated Female 70 1  15 Placebo Female 66 1
                        71 Placebo Female 68 1
                         1 Placebo Female 74 2
;
run ;

--------------------------------------------------------------------------------------
Problem 2.  

The dataset below is from a study of remission of cancer.
The outcome variable is 'remiss' (0 = no, 1 = yes).  The covariates
of interest which are thought to be related to remission are cell, 
smear, infil, li, blast, and temp.  The dataset is shown below.  

Carry out a stepwise logistic regression for these data.  Include
a test for goodness of fit (the 'lackfit' option).  Explain
and interpret the results of the final model.

data remission;
     input remiss cell smear infil li blast temp;
     datalines;
   1   .8   .83  .66  1.9  1.1     .996
   1   .9   .36  .32  1.4   .74    .992
   0   .8   .88  .7    .8   .176   .982
   0  1     .87  .87   .7  1.053   .986
   1   .9   .75  .68  1.3   .519   .98
   0  1     .65  .65   .6   .519   .982
   1   .95  .97  .92  1    1.23    .992
   0   .95  .87  .83  1.9  1.354  1.02
   0  1     .45  .45   .8   .322   .999
   0   .95  .36  .34   .5  0      1.038
   0   .85  .39  .33   .7   .279   .988
   0   .7   .76  .53  1.2   .146   .982
   0   .8   .46  .37   .4   .38   1.006
   0   .2   .39  .08   .8   .114   .99
   0  1     .9   .9   1.1  1.037   .99
   1  1     .84  .84  1.9  2.064  1.02
   0   .65  .42  .27   .5   .114  1.014
   0  1     .75  .75  1    1.322  1.004
   0   .5   .44  .22   .6   .114   .99
   1  1     .63  .63  1.1  1.072   .986
   0  1     .33  .33   .4   .176  1.01
   0   .9   .93  .84   .6  1.591  1.02
   1  1     .58  .58  1     .531  1.002
   0   .95  .32  .3   1.6   .886   .988
   1  1     .6   .6   1.7   .964   .99
   1  1     .69  .69   .9   .398   .986
   0  1     .73  .73   .7   .398   .986
   ;


=================================================================================
n54703.011  Last update: March 28, 2006.