PROC LOGISTIC, I: Dichotomous Outcomes. n54703.009
PROC LOGISTIC is a procedure for investigating the relationship between
predictors, or 'covariates', and a categorical outcome variable. 'Categorical'
implies here that the number of possible values of the outcome is small: usually,
2-10 or so. In the case that the number of outcomes is 2, the outcome variable is
called 'dichotomous': e.g., mortality (0 = alive, 1 = dead).
The covariates can be either continuous variables, like age and blood pressure,
or categorical, like gender. If Y denotes the outcome variable, which takes on only
values 0 and 1, and X1, X2, X3 denote the covariates, then the logistic model is
specified by a probability statement:
1
Prob(Y = 1 | X1, X2, X3) = -------------------------------------
(1 + exp(-b0 - b1*X1 - b2*X2 - b3*X3))
This is interpreted as saying: the probability that Y = 1, given the values X1,
X2, and X3 for the covariates, is given by the expression on the right side.
The probability that Y = 0 is 1 - prob(Y = 1). This turns out to be:
exp(-b0 - b1*X1 - b2*X2 - b3*X3)
Prob(Y = 0 | X1, X2, X3) = -------------------------------------
(1 + exp(-b0 - b1*X1 - b2*X2 - b3*X3)).
These expressions may seem somewhat more complicated than those for a linear
model. Why does the expression for prob(Y = 1 | X1, X2, X3) have an exponential
expression in a denominator?
The logistic model is simpler, for modelling categorical outcomes, than many
other possible models. It has the significant advantage that, regardless of the
values of the covariates X1, etc., and the unknown coefficients, it always
represents a probability between 0 and 1. Expressing the second part of the
denominator as exp(-b0 - b1*X1 - b2*X2 - b3*X3) leads to simpler calculations and
interpretations of coefficients, as you will now see.
Another advantage is that the relationship between a given covariate and Y is
simple to describe. For example, if the coefficient b1 is positive, that means that
an increase in the covariate X1 results in an increase in the probability that Y =
1.
The simplest possible case of logistic regression is the following: one
covariate, X1, which can take on only values 0 and 1. The model in this case is:
1
Prob(Y = 1 | X1) = --------------------
(1 + exp(-b0 -b1*X1)
Data for this model can be represented in a 2 x 2 table:
X = 0 X = 1
---------------------
| | |
Y = 0 | a | b |
| | |
---------------------
| | |
Y = 1 | c | d |
| | |
---------------------
a + c b + d
Now think about the interpretation of the coefficients, b0 and b1. Note that if
X = 0, the probability in the observed data that Y = 1 is c / (a + c). In the
logistic model, the probability that Y = 1, given that X = 0, is
1
Prob(Y = 1 | X1 = 0) = --------------------
(1 + exp(-b0 -b1*0)
[*] = 1 / (1 + exp(-b0)) = c / (a + c)
The ODDS that Y = 1, given that X1 = 0, are
Prob(Y = 1 | X1 = 0)
Odds(Y = 1 | X1 = 0) = ----------------------
Prob(Y = 0 | X1 = 0)
c / (a + c) c
= ------------ = ---
a / (a + c) a
Now solve equation [*] for b0:
a + c = c + c * exp(-b0), or
a = c * exp(-b0), or
a/c = exp(-b0), or
b0 = log(c / a).
That is, b0 can be interpreted as the log of Odds(Y = 1 | X1 = 0).
Note that
Prob(Y = 1 | X1 = 1) = d / (b + d).
From this we can find the solution for b1:
1 d
------------------ = -------, so
(1 + exp(-b0 - b1) (d + b)
d + b = d + d* exp(-b0 - b1), or
b
--- = exp(-b0 - b1) = exp(-b0)*exp(-b1) = (a/c) * exp(-b1), or
d
a*d
----- = exp(b1), or
b*c
b1 = log((a*d)/(b*c)) = log(Odds ratio: Y = 1, X1 = 1 vs X1 = 0).
That is, b1 can be interpreted as the log of the odds ratio from the 2 x 2
table above.
Here is the more general rule:
b1 is the log of the odds ratio that Y = 1 for a 1-unit
increase in X1.
Below is an example which shows both PROC FREQ and PROC LOGISTIC used to
analyze data from a 2 x 2 table. The data are from: The ISAM Study Group (1986):A
prospective trial of Intravenous Streptokinase in Acute Myocardial Infarction
(I.S.A.M.): Mortality, morbidity, and infarct size at 21 days. NEJM 314: 1465-1471.
The data are:
strepto
---------------------
0 1
---------------------
| | |
chddth = 0 | 820 | 810 |
| | |
---------------------
| | |
chddth = 1 | 62 | 49 |
| | |
---------------------
a + c b + d
================================================================================
options linesize = 80 ;
*===================================================================== ;
footnote "~john-c/5421/54703.009.sas &sysdate &systime" ;
data myocard byperson ;
strepto = 0 ; chddth = 0 ; count = 820 ; output myocard ;
do i = 1 to count ; output byperson ; end ;
strepto = 0 ; chddth = 1 ; count = 62 ; output myocard ;
do i = 1 to count ; output byperson ; end ;
strepto = 1 ; chddth = 0 ; count = 810 ; output myocard ;
do i = 1 to count ; output byperson ; end ;
strepto = 1 ; chddth = 1 ; count = 49 ; output myocard ;
do i = 1 to count ; output byperson ; end ;
run ;
proc freq data = myocard ;
tables chddth * strepto / chisq measures ;
weight count ;
title 'PROC FREQ analysis: CHD Death versus Streptokinase' ;
run ;
proc logistic data = byperson descending ;
model chddth = strepto / clodds = pl ctable ;
title 'PROC LOGISTIC analysis: CHD Death versus Streptokinase' ;
run ;
endsas ;
---------------------------------------------------------------------------------
PROC FREQ analysis: CHD Death versus Streptokinase 1
18:22 Sunday, March 7, 2004
The FREQ Procedure
Table of chddth by strepto
chddth strepto
Frequency|
Percent |
Row Pct |
Col Pct | 0| 1| Total
---------+--------+--------+
0 | 820 | 810 | 1630
| 47.10 | 46.52 | 93.62
| 50.31 | 49.69 |
| 92.97 | 94.30 |
---------+--------+--------+
1 | 62 | 49 | 111
| 3.56 | 2.81 | 6.38
| 55.86 | 44.14 |
| 7.03 | 5.70 |
---------+--------+--------+
Total 882 859 1741
50.66 49.34 100.00
Statistics for Table of chddth by strepto
Statistic DF Value Prob
------------------------------------------------------
Chi-Square 1 1.2802 0.2579
Likelihood Ratio Chi-Square 1 1.2835 0.2572
Continuity Adj. Chi-Square 1 1.0679 0.3014
Mantel-Haenszel Chi-Square 1 1.2795 0.2580
Phi Coefficient -0.0271
Contingency Coefficient 0.0271
Cramer's V -0.0271
Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F) 820
Left-sided Pr <= F 0.1507
Right-sided Pr >= F 0.8907
Table Probability (P) 0.0414
Two-sided Pr <= P 0.2810
~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
PROC FREQ analysis: CHD Death versus Streptokinase 2
18:22 Sunday, March 7, 2004
Estimates of the Relative Risk (Row1/Row2)
Type of Study Value 95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio) 0.8001 0.5433 1.1782
Cohort (Col1 Risk) 0.9007 0.7581 1.0700
Cohort (Col2 Risk) 1.1257 0.9080 1.3956
Sample Size = 1741
~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
PROC LOGISTIC analysis: CHD Death versus Streptokinase 3
18:22 Sunday, March 7, 2004
The LOGISTIC Procedure
Model Information
Data Set WORK.BYPERSON
Response Variable chddth
Number of Response Levels 2
Number of Observations 1741
Link Function Logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value chddth Frequency
1 1 111
2 0 1630
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 827.864 828.580
SC 833.326 839.505
-2 Log L 825.864 824.580
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 1.2835 1 0.2572
Score 1.2802 1 0.2579
Wald 1.2758 1 0.2587
Analysis of Maximum Likelihood Estimates
Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.5822 0.1317 384.3324 <.0001
strepto 1 -0.2230 0.1975 1.2758 0.2587
~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
PROC LOGISTIC analysis: CHD Death versus Streptokinase 4
18:22 Sunday, March 7, 2004
The LOGISTIC Procedure
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
strepto 0.800 0.543 1.178
Association of Predicted Probabilities and Observed Responses
Percent Concordant 27.8 Somers' D 0.055
Percent Discordant 22.2 Gamma 0.111
Percent Tied 50.0 Tau-a 0.007
Pairs 180930 c 0.528
Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect Unit Estimate 95% Confidence Limits
strepto 1.0000 0.800 0.541 1.176
Classification Table
Correct Incorrect Percentages
Prob Non- Non- Sensi- Speci- False False
Level Event Event Event Event Correct tivity ficity POS NEG
0.040 111 0 1630 0 6.4 100.0 0.0 93.6 .
0.060 62 810 820 49 50.1 55.9 49.7 93.0 5.7
0.080 0 1630 0 111 93.6 0.0 100.0 . 6.4
~john-c/5421/54703.009.sas 07MAR04 18:22
* ==================================================================== ;
Note that both PROC FREQ and PROC LOGISTIC print an estimate of
the odds ratio:
------------------------------------------------------------------------
PROC FREQ:
Estimates of the Relative Risk (Row1/Row2)
Type of Study Value 95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio) 0.8001 0.5433 1.1782
------------------------------------------------------------------------
PROC LOGISTIC:
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
strepto 0.800 0.543 1.178
------------------------------------------------------------------------
These estimates agree, and the confidence limits are the same also. Recall
that the 'measures' option must be used in PROC FREQ to cause printing of the odds
ratio. In PROC LOGISTIC, similarly one must specify 'clodds' as an option on the
MODEL statement.
There is one characteristic of PROC LOGISTIC that you must be careful about.
In the example above, the outcome, chddth, was coded as 0 (no CHD death) and 1 (CHD
death). The default option is to code the lower value as an event and the higher
value as a non-event. To get what you expect in the analysis, you must do one of two
things:
1) Recode the outcome so that events correspond to 1 and non-events
correspond to 2.
2) Or, put the option 'descending' on the PROC LOGISTIC line:
------------------------------------------------------------------------
proc logistic data = byperson descending ;
model chddth = strepto / clodds = pl ctable ;
title 'PROC LOGISTIC analysis: CHD Death versus Streptokinase' ;
run ;
------------------------------------------------------------------------
But don't do *both* 1) and 2)!
Note also that the dataset analyzed by PROC LOGISTIC was 'byperson' rather than
the original dataset. The WEIGHT statement has essentially no effect in PROC
LOGISTIC. This means that the observationsn must be written out individually.
PROC LOGISTIC prints parameter estimates for both b0 and b1. It represents b0
as the intercept. In this case, b0 = -2.5822. As noted above, this should equal
log(c/a) = log(62/820) = -2.58216995 (according to my calculator). Moreover, again
from the PROC LOGISTIC printout, b1 = -0.2230, which should equal log(a*d/(b*c) =
log(820*49/(810*62)) = -.22304399.
In fact, in this case there is virtually no difference between the PROC FREQ
analysis and the PROC LOGISTIC analysis.
The PROC LOGISTIC analysis does include some additional statistics:
1. The 'Model Fit' statistics
2. The 'Association of Predicted Probabilities and Observed Responses' Table
3. The 'Classification' Table
------------------------------------------------------------------------
1. MODEL FIT STATISTICS:
Intercept
Intercept and
Criterion Only Covariates
AIC 827.864 828.580
SC 833.326 839.505
-2 Log L 825.864 824.580
------------------------------------------------------------------------
The most important of these is '-2 Log L', which means: -2 * log(likelihood).
PROC LOGISTIC uses a numerical technique to produce parameter estimates. This
technique yields what are called maximum likelihood estimates, or MLEs. MLEs have
certain desirable properties: if the sample size is very large, there is a high
probability that MLEs will agree with the true values of the unknowns parameters.
In this example, -2 Log L actually compares two models: the 'Intercept Only'
model, and the 'Intercept and Covariates' model. The intercept only model could be
written as:
1
Prob(Y = 1) = ------------- .
(1 + exp(-b0))
This is analogous to the 'intercept-only' (flat line) model is linear
regression. Moreover, the value of -2 Log L for the intercept-only model is
analogous to the 'C Total' in the ANOVA table for linear regression.
The 'Intercept and Covariates' model in this case is
1
Prob(Y = 1 | X1) = --------------------
(1 + exp(-b0 -b1*X1)
where X1 = strepto and Y = chddth.
The important thing to look at in comparing the intercept-only model with the
intercept-and-covariates model is the *difference* between the two in -2 Log L. In
the example above, this is
Diff(-2 Log L) = 825.864 - 824.580 = 1.284.
Under the null hypothesis that the covariates have no effect on the outcome,
the difference between these two -2 Log L values has a a chi-square distribution
with degrees of freedom equal to the number of covariates in the model. A large
value of Diff(-2 Log L) means that you reject the null hypothesis. In this case
there is only one covariate, X1 = strepto. The p-value for a chi-square statistic
with value 1.284 and 1 degree of freedom is p = .2572. So in this case, although it
appears that streptokinase reduces CHD mortality, the difference between the
Streptokinase group and the Placebo group was not large enough to be statistically
significant.
Note, incidentally, that the p-value from PROC FREQ for the Likelihood Ratio
Chi-Square also equals 0.2572. This is not a coincidence; for this simple 2 x 2
table, the PROC LOGISTIC analysis based on Diff(-2 Log L) is exactly the same as the
Likelihood Ratio Chi-Square analysis.
So why do we need PROC LOGISTIC? Why not simply use PROC FREQ?
In this case, with the dichotomous covariate X1 = strepto, the two analyses are
essentially the same. However, PROC LOGISTIC can accommodate continuous covariates
(like age, cholesterol levels, etc) while PROC FREQ is restricted to only
categorical covariates. Furthermore, PROC LOGISTIC can accommodate much more
complicated models - models involving many covariates and interaction terms. This
will be explained in more detail later in these notes. The AIC ('Akaike Information
Criterion') statistic will be explained later also.
------------------------------------------------------------------------
2. ASSOCIATION OF PREDICTED AND OBSERVED
Association of Predicted Probabilities and Observed Responses
Percent Concordant 27.8 Somers' D 0.055
Percent Discordant 22.2 Gamma 0.111
Percent Tied 50.0 Tau-a 0.007
Pairs 180930 c 0.528
------------------------------------------------------------------------
This interesting table is based on consideration of possiblew pairs of cases
and non-cases. The number of cases (i.e., CHD deaths) is 62 + 49 = 111. The number
of non-cases is 1630. That means there is a total of 111 * 1630 possible pairs, and
111*1630 = 180930.
Of these pairs, some are 'discordant' and some are 'concordant'. The discordant
pairs are people like the following:
Prob(Person 1 has event) is less than Prob(Person 2 has event),
but the opposite actually happens - that is, Person 1 has an
event and Person 2 does not.
Now: in this data set, those who are in the Strepto group have a lower
probability of event than those who are in the Placebo group.
Therefore the number of discordant pairs is:
49 * 820 = 40180, so
Percent discordant: 40180 / 180930 = 22.21 %
Similarly, the number of concordant pairs is:
62 * 810 = 50220, and
Percent concordant: 50220 / 180930 = 27.76%.
There is a large group of pairs in this data where the responses are *tied*.
That is, both members of the pair have the same predicted outcome and they also have
the same observed outcome. The number of such pairs in this dataset is:
49 * 820 + 62 * 810 = 40180 + 50220 = 90400.
The percent of such pairs is 90400 / 180930 = 49.96 %.
------------------------------------------------------------------------
3. CLASSIFICATION TABLE:
Correct Incorrect Percentages
Prob Non- Non- Sensi- Speci- False False
Level Event Event Event Event Correct tivity ficity POS NEG
0.040 111 0 1630 0 6.4 100.0 0.0 93.6 .
0.060 62 810 820 49 50.1 55.9 49.7 93.0 5.7
0.080 0 1630 0 111 93.6 0.0 100.0 . 6.4
------------------------------------------------------------------------
The important line to consider in this table is the middle one, which
corresponds to the observed data. The total number of people who were studied was
1740. The number predicted by the model to have an event in the no-streptokinase
(X1 = 0) group is:
882 * Prob(Y = 1 | X1 = 0) = 882 / (1 + exp(+2.5822)) = 61.9983.
Thus the number predicted by the model to have an event in this group equals
(to within roundoff) the number (62) who actually did have an event. This is the
number of correctly predicted events in the no-streptokinase group.
The number predicted to NOT have an event in the same group is 882 - 62 = 820.
This is the number of INCORRECTLY predicted events in the no-streptokinase group.
Similarly, the number correctly predicted to have an event in the streptokinase
group is
859 * Prob(Y = 1|X1 = 1) = 859 / (1 + exp(2.5822 + .2230))
= 859 * .05704 = 49.
[Note: .05704 approximately equals .06, the 'Prob Level' associated with the
middle line of the Classification table.]
The number predicted NOT to have an event in the streptokinase group is
849 - 49 = 810.
These numbers give rise to rates of sensitivity and specificity for the
predictions produced by the logisitic model. Sensitivity is the percent which are
correctly predicted among those who actually have the condition of interest. The
number that actually had the condition of interest (CHD death) is 62 + 49 = 111.
In this case, the sensitivity is therefore:
62 / (62 + 49) = 62 / 111 = 55.86 %.
Similarly, specificity is the number who are correctly predicted among those
who do NOT have the condition of interest. The number who did not suffer CHD death
was 1630. Therefore the specificity is estimated as:
810 / (820 + 810) = 49.69 %.
------------------------------------------------------------------------
n54703.009 Last update: March 2, 2005.