PROC LOGISTIC, I:  Dichotomous Outcomes.                          n54703.009

     PROC LOGISTIC is a procedure for investigating the relationship between
predictors, or 'covariates', and a categorical outcome variable. 'Categorical'
implies here that the number of possible values of the outcome is small: usually,
2-10 or so.  In the case that the number of outcomes is 2, the outcome variable is
called 'dichotomous': e.g., mortality (0 = alive, 1 = dead).

     The covariates can be either continuous variables, like age and blood pressure,
or categorical, like gender.  If Y denotes the outcome variable, which takes on only
values 0 and 1, and X1, X2, X3 denote the covariates, then the logistic model is
specified by a probability statement:
                                                   1
     Prob(Y = 1 | X1, X2, X3) = -------------------------------------
                                (1 + exp(-b0 - b1*X1 - b2*X2 - b3*X3))

     This is interpreted as saying: the probability that Y = 1, given the values X1,
X2, and X3 for the covariates, is given by the expression on the right side.

     The probability that Y = 0 is 1 - prob(Y = 1).  This turns out to be:

                                  exp(-b0 - b1*X1 - b2*X2 - b3*X3)
     Prob(Y = 0 | X1, X2, X3) = -------------------------------------
                                (1 + exp(-b0 - b1*X1 - b2*X2 - b3*X3)).

     These expressions may seem somewhat more complicated than those for a linear
model.  Why does the expression for prob(Y = 1 | X1, X2, X3) have an exponential
expression in a denominator?

     The logistic model is simpler, for modelling categorical outcomes, than many
other possible models.  It has the significant advantage that, regardless of the
values of the covariates X1, etc., and the unknown coefficients, it always
represents a probability between 0 and 1.  Expressing the second part of the
denominator as exp(-b0 - b1*X1 - b2*X2 - b3*X3) leads to simpler calculations and
interpretations of coefficients, as you will now see.

     Another advantage is that the relationship between a given covariate and Y is
simple to describe.  For example, if the coefficient b1 is positive, that means that
an increase in the covariate X1 results in an increase in the probability that Y =
1.

     The simplest possible case of logistic regression is the following: one
covariate, X1, which can take on only values 0 and 1. The model in this case is:

                                  1
     Prob(Y = 1 | X1) =  --------------------
                         (1 + exp(-b0 -b1*X1)

     Data for this model can be represented in a 2 x 2 table:

                 X = 0    X = 1
             ---------------------
             |         |         |
     Y = 0   |    a    |    b    |
             |         |         |
             ---------------------
             |         |         |
     Y = 1   |    c    |    d    |
             |         |         |
             ---------------------
                a + c     b + d


     Now think about the interpretation of the coefficients, b0 and b1. Note that if
X = 0, the probability in the observed data that Y = 1 is c / (a + c).  In the
logistic model, the probability that Y = 1, given that X = 0, is
                                      1
     Prob(Y = 1 | X1 = 0) =  --------------------
                             (1 + exp(-b0 -b1*0)

[*]                       =  1 / (1 + exp(-b0)) = c / (a + c)

     The ODDS that Y = 1, given that X1 = 0, are


                             Prob(Y = 1 | X1 = 0)
     Odds(Y = 1 | X1 = 0) = ----------------------
                             Prob(Y = 0 | X1 = 0)

                               c / (a + c)     c
                          =   ------------ =  ---
                               a / (a + c)     a

     Now solve equation [*] for b0:


            a + c = c + c * exp(-b0), or

            a = c * exp(-b0), or

            a/c = exp(-b0), or

            b0 = log(c / a).


     That is, b0 can be interpreted as the log of Odds(Y = 1 | X1 = 0).

     Note that

     Prob(Y = 1 | X1 = 1) = d / (b + d).

     From this we can find the solution for b1:

             1                d
     ------------------  = -------, so
     (1 + exp(-b0 - b1)    (d + b)


           d + b = d + d* exp(-b0 - b1), or

          b
         ---  =  exp(-b0 - b1) = exp(-b0)*exp(-b1) = (a/c) * exp(-b1), or
          d

          a*d
         ----- = exp(b1), or
          b*c

          b1 = log((a*d)/(b*c)) = log(Odds ratio: Y = 1, X1 = 1 vs X1 = 0).


     That is, b1 can be interpreted as the log of the odds ratio from the 2 x 2
table above.

     Here is the more general rule:

     b1 is the log of the odds ratio that Y = 1 for a 1-unit
     increase in X1.

     Below is an example which shows both PROC FREQ and PROC LOGISTIC used to
analyze data from a 2 x 2 table.  The data are from: The ISAM Study Group (1986):A
prospective trial of Intravenous Streptokinase in Acute Myocardial Infarction
(I.S.A.M.): Mortality, morbidity, and infarct size at 21 days.  NEJM 314: 1465-1471.  
The data are:


                    strepto
             ---------------------
                  0         1
             ---------------------
             |         |         |
chddth = 0   |   820   |   810   |
             |         |         |
             ---------------------
             |         |         |
chddth = 1   |    62   |    49   |
             |         |         |
             ---------------------
                a + c     b + d

================================================================================

options linesize = 80 ;
*===================================================================== ;        
footnote "~john-c/5421/54703.009.sas &sysdate &systime" ;

data myocard byperson ;

     strepto = 0 ; chddth = 0 ; count = 820 ; output myocard ;
     do i = 1 to count ; output byperson ; end ;
     strepto = 0 ; chddth = 1 ; count = 62 ;  output myocard ;
     do i = 1 to count ; output byperson ; end ;
     strepto = 1 ; chddth = 0 ; count = 810 ; output myocard ;
     do i = 1 to count ; output byperson ; end ;
     strepto = 1 ; chddth = 1 ; count = 49 ;  output myocard ;
     do i = 1 to count ; output byperson ; end ;

run ;

proc freq data = myocard ;
     tables chddth * strepto / chisq measures ;
     weight count ;
title 'PROC FREQ analysis: CHD Death versus Streptokinase' ;
run ;

proc logistic data = byperson descending ;
     model chddth = strepto / clodds = pl ctable ;
title 'PROC LOGISTIC analysis: CHD Death versus Streptokinase' ;
run ;

endsas ;
---------------------------------------------------------------------------------
               PROC FREQ analysis: CHD Death versus Streptokinase              1
                                                     18:22 Sunday, March 7, 2004

                               The FREQ Procedure

                           Table of chddth by strepto

                      chddth     strepto

                      Frequency|
                      Percent  |
                      Row Pct  |
                      Col Pct  |       0|       1|  Total
                      ---------+--------+--------+
                             0 |    820 |    810 |   1630
                               |  47.10 |  46.52 |  93.62
                               |  50.31 |  49.69 |
                               |  92.97 |  94.30 |
                      ---------+--------+--------+
                             1 |     62 |     49 |    111
                               |   3.56 |   2.81 |   6.38
                               |  55.86 |  44.14 |
                               |   7.03 |   5.70 |
                      ---------+--------+--------+
                      Total         882      859     1741
                                  50.66    49.34   100.00


                   Statistics for Table of chddth by strepto

             Statistic                     DF       Value      Prob
             ------------------------------------------------------
             Chi-Square                     1      1.2802    0.2579
             Likelihood Ratio Chi-Square    1      1.2835    0.2572
             Continuity Adj. Chi-Square     1      1.0679    0.3014
             Mantel-Haenszel Chi-Square     1      1.2795    0.2580
             Phi Coefficient                      -0.0271          
             Contingency Coefficient               0.0271          
             Cramer's V                           -0.0271          


                              Fisher's Exact Test
                       ----------------------------------
                       Cell (1,1) Frequency (F)       820
                       Left-sided Pr <= F          0.1507
                       Right-sided Pr >= F         0.8907
                                                         
                       Table Probability (P)       0.0414
                       Two-sided Pr <= P           0.2810
 
 
 
 
 
 
 
 
 
 
                    ~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
                PROC FREQ analysis: CHD Death versus Streptokinase              2
                                                     18:22 Sunday, March 7, 2004

                  Estimates of the Relative Risk (Row1/Row2)
 
       Type of Study                   Value       95% Confidence Limits
       -----------------------------------------------------------------
       Case-Control (Odds Ratio)      0.8001        0.5433        1.1782
       Cohort (Col1 Risk)             0.9007        0.7581        1.0700
       Cohort (Col2 Risk)             1.1257        0.9080        1.3956

                               Sample Size = 1741
 
 
                    ~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
              PROC LOGISTIC analysis: CHD Death versus Streptokinase            3
                                                     18:22 Sunday, March 7, 2004

                             The LOGISTIC Procedure

                               Model Information

                 Data Set                      WORK.BYPERSON   
                 Response Variable             chddth          
                 Number of Response Levels     2               
                 Number of Observations        1741            
                 Link Function                 Logit           
                 Optimization Technique        Fisher's scoring


                                Response Profile
 
                       Ordered                      Total
                         Value       chddth     Frequency

                             1            1           111
                             2            0          1630


                            Model Convergence Status

                 Convergence criterion (GCONV=1E-8) satisfied.          


                             Model Fit Statistics
 
                                                  Intercept
                                   Intercept         and   
                    Criterion        Only        Covariates

                    AIC              827.864        828.580
                    SC               833.326        839.505
                    -2 Log L         825.864        824.580


                    Testing Global Null Hypothesis: BETA=0
 
            Test                 Chi-Square       DF     Pr > ChiSq

            Likelihood Ratio         1.2835        1         0.2572
            Score                    1.2802        1         0.2579
            Wald                     1.2758        1         0.2587


                   Analysis of Maximum Likelihood Estimates
 
                                     Standard
      Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq

      Intercept     1     -2.5822      0.1317      384.3324        <.0001
      strepto       1     -0.2230      0.1975        1.2758        0.2587
 
 
 
                    ~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
              PROC LOGISTIC analysis: CHD Death versus Streptokinase            4
                                                     18:22 Sunday, March 7, 2004

                             The LOGISTIC Procedure

                             Odds Ratio Estimates
                                       
                                Point          95% Wald
                  Effect     Estimate      Confidence Limits

                  strepto       0.800       0.543       1.178


          Association of Predicted Probabilities and Observed Responses

               Percent Concordant      27.8    Somers' D    0.055
               Percent Discordant      22.2    Gamma        0.111
               Percent Tied            50.0    Tau-a        0.007
               Pairs                 180930    c            0.528


        Profile Likelihood Confidence Interval for Adjusted Odds Ratios
 
          Effect          Unit     Estimate     95% Confidence Limits

          strepto       1.0000        0.800        0.541        1.176


                              Classification Table
 
              Correct      Incorrect                Percentages
     Prob          Non-          Non-           Sensi-  Speci-  False  False
    Level  Event  Event  Event  Event  Correct  tivity  ficity   POS    NEG 

    0.040    111      0   1630      0      6.4   100.0     0.0   93.6     . 
    0.060     62    810    820     49     50.1    55.9    49.7   93.0    5.7
    0.080      0   1630      0    111     93.6     0.0   100.0     .     6.4
 
                    ~john-c/5421/54703.009.sas 07MAR04 18:22

* ==================================================================== ;        

     Note that both PROC FREQ and PROC LOGISTIC print an estimate of
the odds ratio:

------------------------------------------------------------------------
PROC FREQ:
                  Estimates of the Relative Risk (Row1/Row2)
 
       Type of Study                   Value       95% Confidence Limits
       -----------------------------------------------------------------
       Case-Control (Odds Ratio)      0.8001        0.5433        1.1782

------------------------------------------------------------------------

 PROC LOGISTIC:
                             Odds Ratio Estimates
                                       
                                Point          95% Wald
                  Effect     Estimate      Confidence Limits

                  strepto       0.800       0.543       1.178

------------------------------------------------------------------------

     These estimates agree, and the confidence limits are the same also.  Recall
that the 'measures' option must be used in PROC FREQ to cause printing of the odds
ratio.  In PROC LOGISTIC, similarly one must specify 'clodds' as an option on the
MODEL statement.

     There is one characteristic of PROC LOGISTIC that you must be careful about.  
In the example above, the outcome, chddth, was coded as 0 (no CHD death) and 1 (CHD
death).  The default option is to code the lower value as an event and the higher
value as a non-event. To get what you expect in the analysis, you must do one of two
things:

     1) Recode the outcome so that events correspond to 1 and non-events
        correspond to 2.

     2) Or, put the option 'descending' on the PROC LOGISTIC line:

------------------------------------------------------------------------
proc logistic data = byperson descending ;
     model chddth = strepto / clodds = pl ctable ;
title 'PROC LOGISTIC analysis: CHD Death versus Streptokinase' ;
run ;
------------------------------------------------------------------------

     But don't do *both* 1) and 2)!

     Note also that the dataset analyzed by PROC LOGISTIC was 'byperson' rather than
the original dataset.  The WEIGHT statement has essentially no effect in PROC
LOGISTIC.  This means that the observationsn must be written out individually.

     PROC LOGISTIC prints parameter estimates for both b0 and b1.  It represents b0
as the intercept.  In this case, b0 = -2.5822.  As noted above, this should equal
log(c/a) = log(62/820) = -2.58216995 (according to my calculator).  Moreover, again
from the PROC LOGISTIC printout, b1 = -0.2230, which should equal log(a*d/(b*c) =
log(820*49/(810*62)) = -.22304399.

     In fact, in this case there is virtually no difference between the PROC FREQ
analysis and the PROC LOGISTIC analysis.

     The PROC LOGISTIC analysis does include some additional statistics:

     1.  The 'Model Fit' statistics

     2.  The 'Association of Predicted Probabilities and Observed Responses' Table

     3.  The 'Classification' Table

------------------------------------------------------------------------
1.  MODEL FIT STATISTICS:

                                                  Intercept
                                   Intercept         and   
                    Criterion        Only        Covariates

                    AIC              827.864        828.580
                    SC               833.326        839.505
                    -2 Log L         825.864        824.580

------------------------------------------------------------------------

     The most important of these is '-2 Log L', which means: -2 * log(likelihood).  
PROC LOGISTIC uses a numerical technique to produce parameter estimates.  This
technique yields what are called maximum likelihood estimates, or MLEs.  MLEs have
certain desirable properties: if the sample size is very large, there is a high
probability that MLEs will agree with the true values of the unknowns parameters.

      In this example, -2 Log L actually compares two models: the 'Intercept Only'
model, and the 'Intercept and Covariates' model.  The intercept only model could be
written as:

                          1
     Prob(Y = 1) =  ------------- .
                   (1 + exp(-b0))

     This is analogous to the 'intercept-only' (flat line) model is linear
regression.  Moreover, the value of -2 Log L for the intercept-only model is
analogous to the 'C Total' in the ANOVA table for linear regression.

     The 'Intercept and Covariates' model in this case is

                                  1
     Prob(Y = 1 | X1) =  --------------------
                         (1 + exp(-b0 -b1*X1)

where X1 = strepto and Y = chddth.

     The important thing to look at in comparing the intercept-only model with the
intercept-and-covariates model is the *difference* between the two in -2 Log L.  In
the example above, this is

     Diff(-2 Log L) = 825.864 - 824.580 = 1.284.

     Under the null hypothesis that the covariates have no effect on the outcome,
the difference between these two -2 Log L values has a a chi-square distribution
with degrees of freedom equal to the number of covariates in the model.  A large
value of Diff(-2 Log L) means that you reject the null hypothesis.  In this case
there is only one covariate, X1 = strepto.  The p-value for a chi-square statistic
with value 1.284 and 1 degree of freedom is p = .2572.  So in this case, although it
appears that streptokinase reduces CHD mortality, the difference between the
Streptokinase group and the Placebo group was not large enough to be statistically
significant.

     Note, incidentally, that the p-value from PROC FREQ for the Likelihood Ratio
Chi-Square also equals 0.2572.  This is not a coincidence; for this simple 2 x 2
table, the PROC LOGISTIC analysis based on Diff(-2 Log L) is exactly the same as the
Likelihood Ratio Chi-Square analysis.

     So why do we need PROC LOGISTIC?  Why not simply use PROC FREQ?

     In this case, with the dichotomous covariate X1 = strepto, the two analyses are
essentially the same.  However, PROC LOGISTIC can accommodate continuous covariates
(like age, cholesterol levels, etc) while PROC FREQ is restricted to only
categorical covariates. Furthermore, PROC LOGISTIC can accommodate much more
complicated models - models involving many covariates and interaction terms. This
will be explained in more detail later in these notes.  The AIC ('Akaike Information
Criterion') statistic will be explained later also.

------------------------------------------------------------------------
2.  ASSOCIATION OF PREDICTED AND OBSERVED

          Association of Predicted Probabilities and Observed Responses

               Percent Concordant      27.8    Somers' D    0.055
               Percent Discordant      22.2    Gamma        0.111
               Percent Tied            50.0    Tau-a        0.007
               Pairs                 180930    c            0.528
------------------------------------------------------------------------

     This interesting table is based on consideration of possiblew pairs of cases
and non-cases.  The number of cases (i.e., CHD deaths) is 62 + 49 = 111.  The number
of non-cases is 1630.  That means there is a total of 111 * 1630 possible pairs, and
111*1630 = 180930.

     Of these pairs, some are 'discordant' and some are 'concordant'. The discordant
pairs are people like the following:


     Prob(Person 1 has event) is less than Prob(Person 2 has event),
     but the opposite actually happens - that is, Person 1 has an
     event and Person 2 does not.


     Now: in this data set, those who are in the Strepto group have a lower
probability of event than those who are in the Placebo group.

     Therefore the number of discordant pairs is:

        49 * 820 = 40180, so

        Percent discordant:   40180 / 180930 = 22.21 %

     Similarly, the number of concordant pairs is:

        62 * 810 = 50220, and

        Percent concordant:   50220 / 180930 = 27.76%.

     There is a large group of pairs in this data where the responses are *tied*.
That is, both members of the pair have the same predicted outcome and they also have
the same observed outcome.  The number of such pairs in this dataset is:

     49 * 820  +  62 * 810 = 40180 + 50220 = 90400.

     The percent of such pairs is 90400 / 180930 = 49.96 %.


------------------------------------------------------------------------
3.  CLASSIFICATION TABLE:

 
              Correct      Incorrect                Percentages
     Prob          Non-          Non-           Sensi-  Speci-  False  False
    Level  Event  Event  Event  Event  Correct  tivity  ficity   POS    NEG 

    0.040    111      0   1630      0      6.4   100.0     0.0   93.6     . 
    0.060     62    810    820     49     50.1    55.9    49.7   93.0    5.7
    0.080      0   1630      0    111     93.6     0.0   100.0     .     6.4
 
------------------------------------------------------------------------

     The important line to consider in this table is the middle one, which
corresponds to the observed data.  The total number of people who were studied was
1740.  The number predicted by the model to have an event in the no-streptokinase
(X1 = 0) group is:

     882 * Prob(Y = 1 | X1 = 0) = 882 / (1 + exp(+2.5822)) = 61.9983.

     Thus the number predicted by the model to have an event in this group equals
(to within roundoff) the number (62) who actually did have an event. This is the
number of correctly predicted events in the no-streptokinase group.

     The number predicted to NOT have an event in the same group is 882 - 62 = 820.  
This is the number of INCORRECTLY predicted events in the no-streptokinase group.

     Similarly, the number correctly predicted to have an event in the streptokinase
group is

     859 * Prob(Y = 1|X1 = 1) = 859 / (1 + exp(2.5822 + .2230))

                              = 859 * .05704 = 49.

     [Note: .05704 approximately equals .06, the 'Prob Level' associated with the
middle line of the Classification table.]

     The number predicted NOT to have an event in the streptokinase group is 
849 - 49 = 810.

     These numbers give rise to rates of sensitivity and specificity for the
predictions produced by the logisitic model.  Sensitivity is the percent which are
correctly predicted among those who actually have the condition of interest.  The
number that actually had the condition of interest (CHD death) is 62 + 49 = 111.  
In this case, the sensitivity is therefore:

     62 / (62 + 49) = 62 / 111 = 55.86 %.

     Similarly, specificity is the number who are correctly predicted among those
who do NOT have the condition of interest.  The number who did not suffer CHD death
was 1630.  Therefore the specificity is estimated as:

     810 / (820 + 810) = 49.69 %.

------------------------------------------------------------------------

n54703.009  Last update: March 2, 2005.