PROC LOGISTIC, I:  Dichotomous Outcomes.                            n54703.009

PROC LOGISTIC is a procedure for investigating the relationship between
predictors, or 'covariates', and a categorical outcome variable.
'Categorical' implies here that the number of possible values of the
outcome is small: usually 2-10 or so.  In the case that the number of
outcomes is 2, the outcome variable is called 'dichotomous': e.g.,
mortality (0 = alive, 1 = dead).  The covariates can be either continuous
variables, like age and blood pressure, or categorical, like gender.

If Y denotes the outcome variable, which takes on only the values 0 and 1,
and X1, X2, X3 denote the covariates, then the logistic model is specified
by a probability statement:

                                                 1
   Prob(Y = 1 | X1, X2, X3)  =  -------------------------------------
                                (1 + exp(-b0 - b1*X1 - b2*X2 - b3*X3))

This is interpreted as saying: the probability that Y = 1, given the values
X1, X2, and X3 for the covariates, is given by the expression on the right
side.  The probability that Y = 0 is 1 - Prob(Y = 1).  This turns out to be:

                                 exp(-b0 - b1*X1 - b2*X2 - b3*X3)
   Prob(Y = 0 | X1, X2, X3)  =  -------------------------------------
                                (1 + exp(-b0 - b1*X1 - b2*X2 - b3*X3))

These expressions may seem somewhat more complicated than those for a
linear model.  Why does the expression for Prob(Y = 1 | X1, X2, X3) have an
exponential expression in the denominator?  The logistic model is simpler,
for modelling categorical outcomes, than many other possible models.  It
has the significant advantage that, regardless of the values of the
covariates X1, etc., and of the unknown coefficients, it always represents
a probability between 0 and 1.  Expressing the second term of the
denominator as exp(-b0 - b1*X1 - b2*X2 - b3*X3) leads to simpler
calculations and interpretations of the coefficients, as you will now see.

Another advantage is that the relationship between a given covariate and Y
is simple to describe.  For example, if the coefficient b1 is positive, an
increase in the covariate X1 results in an increase in the probability
that Y = 1.

The simplest possible case of logistic regression is the following: one
covariate, X1, which can take on only the values 0 and 1.  The model in
this case is:

                                       1
   Prob(Y = 1 | X1)  =  ----------------------
                        (1 + exp(-b0 - b1*X1))

Data for this model can be represented in a 2 x 2 table:

                    X1 = 0    X1 = 1
                 ---------------------
                 |         |         |
        Y = 0    |    a    |    b    |
                 |         |         |
                 ---------------------
                 |         |         |
        Y = 1    |    c    |    d    |
                 |         |         |
                 ---------------------
                    a + c     b + d

Now think about the interpretation of the coefficients b0 and b1.  Note
that if X1 = 0, the probability in the observed data that Y = 1 is
c / (a + c).  In the logistic model, the probability that Y = 1, given
that X1 = 0, is

                                        1
   Prob(Y = 1 | X1 = 0)  =  ----------------------                   [*]
                            (1 + exp(-b0 - b1*0))

                         =  1 / (1 + exp(-b0))  =  c / (a + c)

The ODDS that Y = 1, given that X1 = 0, are

                               Prob(Y = 1 | X1 = 0)
   Odds(Y = 1 | X1 = 0)   =   ----------------------
                               Prob(Y = 0 | X1 = 0)

                               c / (a + c)        c
                          =   -------------  =   ---
                               a / (a + c)        a

Now solve equation [*] for b0:

   a + c  =  c + c * exp(-b0), or
   a      =  c * exp(-b0), or
   a/c    =  exp(-b0), or
   b0     =  log(c / a).

That is, b0 can be interpreted as the log of Odds(Y = 1 | X1 = 0).

Note that Prob(Y = 1 | X1 = 1) = d / (b + d).  From this we can find the
solution for b1:

            1                 d
   --------------------  =  -------,  so
   (1 + exp(-b0 - b1))      (d + b)

   d + b  =  d + d * exp(-b0 - b1), or

    b
   ---  =  exp(-b0 - b1)  =  exp(-b0) * exp(-b1)  =  (a/c) * exp(-b1), or
    d

   a*d
   ---  =  exp(b1), or
   b*c

   b1  =  log((a*d)/(b*c))  =  log(Odds ratio: Y = 1, X1 = 1 vs X1 = 0).

That is, b1 can be interpreted as the log of the odds ratio from the
2 x 2 table above.  Here is the more general rule: b1 is the log of the
odds ratio that Y = 1 for a 1-unit increase in X1.
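The algebra above can be checked numerically.  The short DATA step below is
only an illustration: the cell counts a, b, c, and d are made-up numbers,
chosen simply to show the two formulas just derived in action.

------------------------------------------------------------------------
* Sketch only: hypothetical cell counts, chosen just to illustrate     ;
* the formulas b0 = log(c/a) and b1 = log(a*d/(b*c)).                  ;
data check2x2 ;
     a = 40 ; b = 30 ; c = 10 ; d = 20 ;
     b0 = log(c / a) ;               * log odds that Y = 1 when X1 = 0  ;
     b1 = log((a * d) / (b * c)) ;   * log odds ratio, X1 = 1 vs X1 = 0 ;
     oddsratio = exp(b1) ;           * the odds ratio itself            ;
run ;

proc print data = check2x2 ;
run ;
------------------------------------------------------------------------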
Below is an example which shows both PROC FREQ and PROC LOGISTIC used to
analyze data from a 2 x 2 table.  The data are from:

   The ISAM Study Group (1986).  A prospective trial of Intravenous
   Streptokinase in Acute Myocardial Infarction (I.S.A.M.): Mortality,
   morbidity, and infarct size at 21 days.  NEJM 314: 1465-1471.

The data are:

                           strepto
                   ---------------------
                        0         1
                   ---------------------
                   |         |         |
      chddth = 0   |   820   |   810   |
                   |         |         |
                   ---------------------
                   |         |         |
      chddth = 1   |    62   |    49   |
                   |         |         |
                   ---------------------
                       882       859

================================================================================

options linesize = 80 ;
*===================================================================== ;
footnote "~john-c/5421/54703.009.sas &sysdate &systime" ;

data myocard byperson ;

     strepto = 0 ;  chddth = 0 ;  count = 820 ;
     output myocard ;
     do i = 1 to count ;  output byperson ;  end ;

     strepto = 0 ;  chddth = 1 ;  count = 62 ;
     output myocard ;
     do i = 1 to count ;  output byperson ;  end ;

     strepto = 1 ;  chddth = 0 ;  count = 810 ;
     output myocard ;
     do i = 1 to count ;  output byperson ;  end ;

     strepto = 1 ;  chddth = 1 ;  count = 49 ;
     output myocard ;
     do i = 1 to count ;  output byperson ;  end ;

run ;

proc freq data = myocard ;
     tables chddth * strepto / chisq measures ;
     weight count ;
     title 'PROC FREQ analysis: CHD Death versus Streptokinase' ;
run ;

proc logistic data = byperson descending ;
     model chddth = strepto / clodds = pl ctable ;
     title 'PROC LOGISTIC analysis: CHD Death versus Streptokinase' ;
run ;

endsas ;
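An aside before looking at the output: writing out one record per person is
not the only way to handle cell-count data.  PROC LOGISTIC also accepts a
FREQ statement (in recent SAS releases, at least), which lets the
cell-count dataset myocard be analyzed directly.  A minimal sketch,
assuming the FREQ statement is available in your release; it should
reproduce the same estimates as the byperson analysis:

------------------------------------------------------------------------
* Sketch only: same model, fit from the cell-count dataset myocard     ;
* with a FREQ statement instead of one record per person.              ;
proc logistic data = myocard descending ;
     freq count ;
     model chddth = strepto / clodds = pl ctable ;
     title 'PROC LOGISTIC with a FREQ statement (sketch)' ;
run ;
------------------------------------------------------------------------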
---------------------------------------------------------------------------------
        PROC FREQ analysis: CHD Death versus Streptokinase                  1
                                                18:22 Sunday, March 7, 2004

                             The FREQ Procedure

                         Table of chddth by strepto

               chddth     strepto

               Frequency|
               Percent  |
               Row Pct  |
               Col Pct  |       0|       1|  Total
               ---------+--------+--------+
                      0 |    820 |    810 |   1630
                        |  47.10 |  46.52 |  93.62
                        |  50.31 |  49.69 |
                        |  92.97 |  94.30 |
               ---------+--------+--------+
                      1 |     62 |     49 |    111
                        |   3.56 |   2.81 |   6.38
                        |  55.86 |  44.14 |
                        |   7.03 |   5.70 |
               ---------+--------+--------+
               Total         882      859     1741
                           50.66    49.34   100.00

                  Statistics for Table of chddth by strepto

          Statistic                     DF       Value      Prob
          ------------------------------------------------------
          Chi-Square                     1      1.2802    0.2579
          Likelihood Ratio Chi-Square    1      1.2835    0.2572
          Continuity Adj. Chi-Square     1      1.0679    0.3014
          Mantel-Haenszel Chi-Square     1      1.2795    0.2580
          Phi Coefficient                      -0.0271
          Contingency Coefficient               0.0271
          Cramer's V                           -0.0271

                           Fisher's Exact Test
                    ----------------------------------
                    Cell (1,1) Frequency (F)       820
                    Left-sided Pr <= F          0.1507
                    Right-sided Pr >= F         0.8907

                    Table Probability (P)       0.0414
                    Two-sided Pr <= P           0.2810

                  ~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
        PROC FREQ analysis: CHD Death versus Streptokinase                  2
                                                18:22 Sunday, March 7, 2004

                 Estimates of the Relative Risk (Row1/Row2)

     Type of Study                 Value       95% Confidence Limits
     -----------------------------------------------------------------
     Case-Control (Odds Ratio)     0.8001       0.5433       1.1782
     Cohort (Col1 Risk)            0.9007       0.7581       1.0700
     Cohort (Col2 Risk)            1.1257       0.9080       1.3956

     Sample Size = 1741

                  ~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
        PROC LOGISTIC analysis: CHD Death versus Streptokinase              3
                                                18:22 Sunday, March 7, 2004

                           The LOGISTIC Procedure

                             Model Information

               Data Set                      WORK.BYPERSON
               Response Variable             chddth
               Number of Response Levels     2
               Number of Observations        1741
               Link Function                 Logit
               Optimization Technique        Fisher's scoring

                              Response Profile

                     Ordered                      Total
                       Value       chddth     Frequency

                           1           1           111
                           2           0          1630

                         Model Convergence Status

              Convergence criterion (GCONV=1E-8) satisfied.

                           Model Fit Statistics

                                               Intercept
                               Intercept          and
                Criterion        Only         Covariates

                AIC              827.864        828.580
                SC               833.326        839.505
                -2 Log L         825.864        824.580

                 Testing Global Null Hypothesis: BETA=0

        Test                 Chi-Square       DF     Pr > ChiSq

        Likelihood Ratio         1.2835        1         0.2572
        Score                    1.2802        1         0.2579
        Wald                     1.2758        1         0.2587

                Analysis of Maximum Likelihood Estimates

                                Standard
     Parameter    DF    Estimate    Error    Chi-Square    Pr > ChiSq

     Intercept     1     -2.5822   0.1317      384.3324        <.0001
     strepto       1     -0.2230   0.1975        1.2758        0.2587

                  ~john-c/5421/54703.009.sas 07MAR04 18:22
---------------------------------------------------------------------------------
        PROC LOGISTIC analysis: CHD Death versus Streptokinase              4
                                                18:22 Sunday, March 7, 2004

                           The LOGISTIC Procedure

                            Odds Ratio Estimates

                              Point          95% Wald
               Effect      Estimate      Confidence Limits

               strepto        0.800       0.543       1.178

        Association of Predicted Probabilities and Observed Responses

             Percent Concordant     27.8    Somers' D    0.055
             Percent Discordant     22.2    Gamma        0.111
             Percent Tied           50.0    Tau-a        0.007
             Pairs                180930    c            0.528

       Profile Likelihood Confidence Interval for Adjusted Odds Ratios

             Effect        Unit      Estimate      95% Confidence Limits

             strepto     1.0000         0.800       0.541       1.176

                             Classification Table

                  Correct       Incorrect             Percentages
         Prob          Non-          Non-         Sensi-  Speci-  False  False
        Level  Event  Event  Event  Event  Correct tivity ficity   POS    NEG

        0.040    111      0   1630      0      6.4  100.0    0.0   93.6     .
        0.060     62    810    820     49     50.1   55.9   49.7   93.0    5.7
        0.080      0   1630      0    111     93.6    0.0  100.0      .    6.4

                  ~john-c/5421/54703.009.sas 07MAR04 18:22

* ==================================================================== ;
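Before walking through this output in detail, it may help to see how the
printed coefficients map back onto the 2 x 2 table.  The short DATA step
below is only a sketch: it plugs the estimates b0 = -2.5822 and
b1 = -0.2230 from the 'Analysis of Maximum Likelihood Estimates' table
into the logistic formula and recovers the fitted event probability in
each group and the odds ratio.

------------------------------------------------------------------------
* Sketch only: recover fitted probabilities and the odds ratio from    ;
* the coefficient estimates printed above.                             ;
data fitted ;
     b0 = -2.5822 ;
     b1 = -0.2230 ;
     p_placebo = 1 / (1 + exp(-b0)) ;       * about 62/882 = 0.0703    ;
     p_strepto = 1 / (1 + exp(-b0 - b1)) ;  * about 49/859 = 0.0570    ;
     oddsratio = exp(b1) ;                  * about 0.800              ;
run ;

proc print data = fitted ;
run ;
------------------------------------------------------------------------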
Note that both PROC FREQ and PROC LOGISTIC print an estimate of the odds
ratio:

------------------------------------------------------------------------
PROC FREQ:

                 Estimates of the Relative Risk (Row1/Row2)

     Type of Study                 Value       95% Confidence Limits
     -----------------------------------------------------------------
     Case-Control (Odds Ratio)     0.8001       0.5433       1.1782

------------------------------------------------------------------------
PROC LOGISTIC:

                            Odds Ratio Estimates

                              Point          95% Wald
               Effect      Estimate      Confidence Limits

               strepto        0.800       0.543       1.178

------------------------------------------------------------------------

These estimates agree, and the confidence limits are the same also.
Recall that the 'measures' option must be used in PROC FREQ to cause
printing of the odds ratio.  Similarly, in PROC LOGISTIC one must specify
'clodds' as an option on the MODEL statement.

There is one characteristic of PROC LOGISTIC that you must be careful
about.  In the example above, the outcome, chddth, was coded as 0 (no CHD
death) and 1 (CHD death).  By default, PROC LOGISTIC treats the *lower*
value of the outcome as the event and the higher value as the non-event.
To get what you expect in the analysis, you must do one of two things:

   1) Recode the outcome so that events correspond to 1 and non-events
      correspond to 2.

   2) Or, put the option 'descending' on the PROC LOGISTIC line:

------------------------------------------------------------------------
proc logistic data = byperson descending ;
     model chddth = strepto / clodds = pl ctable ;
     title 'PROC LOGISTIC analysis: CHD Death versus Streptokinase' ;
run ;
------------------------------------------------------------------------

   But don't do *both* 1) and 2)!

Note also that the dataset analyzed by PROC LOGISTIC was 'byperson'
rather than the original cell-count dataset.  The WEIGHT statement has
essentially no effect in PROC LOGISTIC.  This means that the observations
must be written out individually.

PROC LOGISTIC prints parameter estimates for both b0 and b1.  It labels
b0 as the Intercept.  In this case, b0 = -2.5822.  As noted above, this
should equal log(c/a) = log(62/820) = -2.58216995 (according to my
calculator).  Moreover, again from the PROC LOGISTIC printout,
b1 = -0.2230, which should equal log(a*d/(b*c)) = log(820*49/(810*62))
= -0.22304399.

In fact, in this case there is virtually no difference between the PROC
FREQ analysis and the PROC LOGISTIC analysis.  The PROC LOGISTIC analysis
does, however, include some additional statistics:

   1. The 'Model Fit' statistics
   2. The 'Association of Predicted Probabilities and Observed Responses'
      table
   3. The 'Classification' table

------------------------------------------------------------------------
1.  MODEL FIT STATISTICS:

                                               Intercept
                               Intercept          and
                Criterion        Only         Covariates

                AIC              827.864        828.580
                SC               833.326        839.505
                -2 Log L         825.864        824.580
------------------------------------------------------------------------

The most important of these is '-2 Log L', which means -2 * log(likelihood).
PROC LOGISTIC uses a numerical technique to produce parameter estimates.
This technique yields what are called maximum likelihood estimates, or
MLEs.  MLEs have certain desirable properties: if the sample size is very
large, there is a high probability that the MLEs will be close to the true
values of the unknown parameters.

In this example, -2 Log L actually compares two models: the 'Intercept
Only' model and the 'Intercept and Covariates' model.  The intercept-only
model could be written as

                           1
   Prob(Y = 1)  =  ----------------
                   (1 + exp(-b0))

This is analogous to the 'intercept-only' (flat line) model in linear
regression.  Moreover, the value of -2 Log L for the intercept-only model
is analogous to the 'C Total' in the ANOVA table for linear regression.

The 'Intercept and Covariates' model in this case is

                                       1
   Prob(Y = 1 | X1)  =  ----------------------
                        (1 + exp(-b0 - b1*X1))

where X1 = strepto and Y = chddth.

The important thing to look at in comparing the intercept-only model with
the intercept-and-covariates model is the *difference* between the two
values of -2 Log L.  In the example above, this is

   Diff(-2 Log L)  =  825.864 - 824.580  =  1.284.

Under the null hypothesis that the covariates have no effect on the
outcome, this difference has a chi-square distribution with degrees of
freedom equal to the number of covariates in the model.  A large value of
Diff(-2 Log L) means that you reject the null hypothesis.  In this case
there is only one covariate, X1 = strepto.  The p-value for a chi-square
statistic with value 1.284 and 1 degree of freedom is p = 0.2572.  So in
this case, although it appears that streptokinase reduces CHD mortality,
the difference between the streptokinase group and the placebo group was
not large enough to be statistically significant.
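This p-value can be reproduced directly with the SAS chi-square
distribution function PROBCHI.  A minimal sketch of the arithmetic:

------------------------------------------------------------------------
* Sketch only: p-value for Diff(-2 Log L) = 1.284 on 1 degree of       ;
* freedom, using the chi-square distribution function PROBCHI.         ;
data lrtest ;
     diff = 825.864 - 824.580 ;       * = 1.284                        ;
     p    = 1 - probchi(diff, 1) ;    * upper-tail chi-square p-value  ;
     put 'Likelihood ratio p-value = ' p 6.4 ;
run ;
------------------------------------------------------------------------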
Note, incidentally, that the p-value from PROC FREQ for the Likelihood
Ratio Chi-Square also equals 0.2572.  This is not a coincidence: for this
simple 2 x 2 table, the PROC LOGISTIC analysis based on Diff(-2 Log L) is
exactly the same as the Likelihood Ratio Chi-Square analysis.

So why do we need PROC LOGISTIC?  Why not simply use PROC FREQ?  In this
case, with the single dichotomous covariate X1 = strepto, the two analyses
are essentially the same.  However, PROC LOGISTIC can accommodate
continuous covariates (like age, cholesterol levels, etc.), while PROC
FREQ is restricted to categorical covariates.  Furthermore, PROC LOGISTIC
can accommodate much more complicated models - models involving many
covariates and interaction terms.  This will be explained in more detail
later in these notes.  The AIC ('Akaike Information Criterion') statistic
will also be explained later.

------------------------------------------------------------------------
2.  ASSOCIATION OF PREDICTED AND OBSERVED:

        Association of Predicted Probabilities and Observed Responses

             Percent Concordant     27.8    Somers' D    0.055
             Percent Discordant     22.2    Gamma        0.111
             Percent Tied           50.0    Tau-a        0.007
             Pairs                180930    c            0.528
------------------------------------------------------------------------

This interesting table is based on consideration of all possible pairs
consisting of one case and one non-case.  The number of cases (i.e., CHD
deaths) is 62 + 49 = 111.  The number of non-cases is 1630.  That means
there is a total of 111 * 1630 = 180930 possible pairs.

Of these pairs, some are 'discordant' and some are 'concordant'.  A
discordant pair is one like the following: the predicted Prob(Person 1
has event) is less than the predicted Prob(Person 2 has event), but the
opposite actually happens - that is, Person 1 has an event and Person 2
does not.

Now: in this data set, those who are in the strepto group have a lower
predicted probability of an event than those who are in the placebo
group.  Therefore the number of discordant pairs (strepto cases paired
with placebo non-cases) is

   49 * 820 = 40180,  so   Percent Discordant:  40180 / 180930 = 22.21 %.

Similarly, the number of concordant pairs (placebo cases paired with
strepto non-cases) is

   62 * 810 = 50220,  and  Percent Concordant:  50220 / 180930 = 27.76 %.

There is also a large group of pairs in these data which are *tied*.
That is, both members of the pair have the same predicted probability of
an event, because they are in the same treatment group, even though one
of them had an event and the other did not.  The number of such pairs in
this dataset is

   62 * 820 + 49 * 810  =  50840 + 39690  =  90530.

The percent of such pairs is 90530 / 180930 = 50.04 %.
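The other entries in this table (c, Somers' D, Gamma, Tau-a) can be
reconstructed from the three pair counts using their standard
definitions, which appear to be what PROC LOGISTIC is using here, since
they reproduce the printed values.  A sketch of that arithmetic:

------------------------------------------------------------------------
* Sketch only: rebuild the association measures from the pair counts,  ;
* using the standard definitions in terms of concordant, discordant,   ;
* and tied pairs.                                                      ;
data assoc ;
     nconc = 62 * 810 ;                    * concordant pairs =  50220 ;
     ndisc = 49 * 820 ;                    * discordant pairs =  40180 ;
     ntied = 62 * 820 + 49 * 810 ;         * tied pairs       =  90530 ;
     npair = nconc + ndisc + ntied ;       * total pairs      = 180930 ;
     n     = 1741 ;                        * total sample size         ;
     c        = (nconc + 0.5 * ntied) / npair ;          * = 0.528     ;
     somers_d = (nconc - ndisc) / npair ;                * = 0.055     ;
     gamma    = (nconc - ndisc) / (nconc + ndisc) ;      * = 0.111     ;
     tau_a    = (nconc - ndisc) / (0.5 * n * (n - 1)) ;  * = 0.007     ;
run ;

proc print data = assoc ;
run ;
------------------------------------------------------------------------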
------------------------------------------------------------------------
3.  CLASSIFICATION TABLE:

                             Classification Table

                  Correct       Incorrect             Percentages
         Prob          Non-          Non-         Sensi-  Speci-  False  False
        Level  Event  Event  Event  Event  Correct tivity ficity   POS    NEG

        0.040    111      0   1630      0      6.4  100.0    0.0   93.6     .
        0.060     62    810    820     49     50.1   55.9   49.7   93.0    5.7
        0.080      0   1630      0    111     93.6    0.0  100.0      .    6.4
------------------------------------------------------------------------

The important line to consider in this table is the middle one (Prob
Level = 0.060), which corresponds to the observed data.  The total number
of people who were studied was 1741.

The number predicted by the model to have an event in the
no-streptokinase (X1 = 0) group is

   882 * Prob(Y = 1 | X1 = 0)  =  882 / (1 + exp(+2.5822))  =  61.9983.

Thus the number predicted by the model to have an event in this group
equals (to within roundoff) the number (62) who actually did have an
event.  At the 0.060 probability level everyone in this group is
classified as an event, so these 62 people are the correctly predicted
events ('Correct, Event' column), and the remaining 882 - 62 = 820 people
are the incorrectly predicted events ('Incorrect, Event' column).

Similarly, the number predicted by the model to have an event in the
streptokinase group is

   859 * Prob(Y = 1 | X1 = 1)  =  859 / (1 + exp(2.5822 + 0.2230))
                               =  859 * 0.05704  =  49.

[Note: 0.05704 approximately equals 0.06, the 'Prob Level' associated
with the middle line of the Classification Table.]  At this probability
level everyone in the streptokinase group is classified as a non-event,
so the 859 - 49 = 810 people who did not die are the correctly predicted
non-events, and the 49 who did die are the incorrectly predicted
non-events.

These numbers give rise to rates of sensitivity and specificity for the
predictions produced by the logistic model.  Sensitivity is the percent
correctly predicted among those who actually have the condition of
interest.  The number who actually had the condition of interest (CHD
death) is 62 + 49 = 111.  In this case the sensitivity is therefore

   62 / (62 + 49)  =  62 / 111  =  55.86 %.

Similarly, specificity is the percent correctly predicted among those who
do NOT have the condition of interest.  The number who did not suffer CHD
death was 820 + 810 = 1630.  Therefore the specificity is estimated as

   810 / (820 + 810)  =  810 / 1630  =  49.69 %.

------------------------------------------------------------------------

n54703.009                                     Last update: March 2, 2005.