PROC LOGISTIC, V: Ordered Logistic n54703.013 The usual logistic model applies for dichotomous outcomes; that is, the outcome event either occurs or it does not occur, and it can take on only two values, one designating occurrence and the other designating non-occurrence. The model has been generalized so that the outcome can take only several different ordered values. For example, the outcome variable may be coded as: Y = 1: dead Y = 2: sick but alive Y = 3: healthy As is usual with the logistic model, the probability that Y attains a certain value is a function of covariates, X1, X2, ..., Xp. The expression of this probability for ordered outcomes is more complicated than it is for the dichotomous case, and a bit harder to understand. Assume as above that the outcome variable Y has 3 possible levels, and that there are two covariates, X1 and X2. The logistic model is specified by the two expressions: prob(Y <= 1 | X1, X2) = 1 / (1 + exp(-b01 - b1*X1 - b2*X2)). prob(Y <= 2 | X1, X2) = 1 / (1 + exp(-b02 - b1*X1 - b2*X2)). [Note here, " <= " is to be read as "less than or equal to".] Some notes: 1. Note here that the model specifies the probability that Y is *less than or equal to* the specified value. Of course, in this example, prob(Y <= 1 | X1, X2) = prob(Y = 1 | X1, X2), and prob(Y <= 2 | X1, X2) = prob(Y = 1 or Y = 2 | X1, X2). 2. Note that the coefficients of X1 and X2 are the same regardless of the level of Y that is specified. However, the intercept term is *dependent* on the level. Thus, b01 and b02 are two different intercept terms. 3. You might wonder how to compute prob(Y = 2 | X1, X2). This is: prob(Y = 2 | X1, X2) = prob(Y <= 2 | X1, X2) - prob(Y <= 1 | X1, X2). This can be written out more fully as: prob(Y = 2 | X1, X2) = 1/(1 + exp(-b02 - b1*X1 - b2*X2)) - 1/(1 + exp(-b01 - b1*X1 - b2*X2)). 4. You might also wonder how to compute prob(Y = 3 | X1, X2): the correct expression is: prob(Y = 3 | X1, X2) = 1 - prob(Y <= 2 | X1, X2) = exp(-b02 - b1*X1 - b2*X2)/(1 + exp(-b02 - b1*X1 - b2*X2)). 5. Because the values of Y are ordered, and the probability of any Y level is nonnegative, the intercept coefficients are also necessarily ordered: in this case, b01 < b02. 6. The number of categories for Y should be relatively *small*, in general, 5 or fewer. There is no theoretical upper bound, but including too many categories increases the number of parameters in the model and, if they are "junk" variables, they cannot improve things. 6. Finally, you might wonder how to interpret the coefficients of X1 and X2 in terms of odds ratios. Recall that for dichotomous outcomes, exp(b1) is interpreted as the odds ratio of having an event (versus not having it) for an increase of one unit in the covariate X1. The interpretation for multi-valued ordered logistic is similar: exp(b1) is the odds ratio that Y is at level j (versus being at a level *greater than j*) for a one-unit increase in covariate X1. I will certainly grant that it is hard to explain this further! EXAMPLE OF THE USE OF ORDERED LOGISTIC: The SAS program below applies ordered logistic to a datafile of 485 records from the Minnesota Heart Survey. The file is called mwheart.data. There are links to this datafile on this course's website, on the page titled 'Computer Programs and Datafiles'. There are also links to a description of the datafile and to a SAS program which reads the datafile. The datafile includes a variable, SMOKE, which has three levels: SMOKE = 1: current smoker SMOKE = 2: ex-smoker SMOKE = 3: never smoked ================================================================================= * MWHEART.SAS; * Reads MWHEART.DATA file (485 cases, 17 vars) randomly selected from; * 4086 cases of the Mid-West Heart study; OPTIONS LINESIZE = 80 CENTER PAGESIZE = 58 NUMBER LABEL; TITLE 'Selected cases from the Mid-West Heart study conducted 1980-82 by'; TITLE2 'the Division of Epidemiology, SPH U. Minnesota'; TITLE3 'Mortality follow-up through 7/1/92 is based on National Death Index'; PROC FORMAT; VALUE sexfmt 1='M' 2='F'; VALUE ynfmt 1='Y' 2='N'; VALUE smkfmt 1='1: current' 2='2: exsmoker' 3='3: nonsmoker'; DATA HEART; INFILE 'mwheart.data'; INPUT @1 ID 12. @14 AGE 2. @17 SEX 1. @19 ENTRYDAT mmddyy8. @28 DTHDATE mmddyy8. @37 CAUSE 5.1 @43 CHOL 3. @47 HDL 2. @50 BMI 5.2 @56 SMOKE 1. @58 CIGSDAY 2. @61 THIOC 3. @65 EVERBPRX 1. @67 NOWBPRX 1. @69 DBPAV2 3. @72 SBPAV2 3. @76 EDUYRS 2. ; footnote "~john-c/5421/jecheart.sas &sysdate &systime" ; * For survival analysis need: dead=death indicator and surv=survival; * Compute days to death and convert to months; IF dthdate = . THEN censor=MDY(01,07,92); ELSE censor=dthdate; surv = (censor - entrydat)/30.4; *SURV units are in months; IF dthdate = . THEN dead = 0; ELSE dead = 1; *DEAD is indicator; survyrs = INT(surv/12); *integer years of survival; LABEL ID = 'Identifying sequential number' AGE = 'age at entry' SEX = 'Gender: 1=M, 2=F' ENTRYDAT = 'Date of entry interview' DTHDATE = 'Date of death' CAUSE = 'ICD-9 code cause of death XXX.X' CHOL = 'Serum total cholesterol (mg/dl)' HDL = 'HDL cholesterol (mg/dl)' BMI = 'Body Mass Index (function of heigt and weight)' SMOKE = 'smoking status' CIGSDAY = 'Cigarettes smoked per day' THIOC = 'Thiocyanate level (indicator of smoking)' EVERBPRX = 'Ever use BP med: 1=Y, 2=N' NOWBPRX = 'Now use BP med: 1=Y, 2=N' DBPAV2 = 'Diastolic Blood Pressure (mmHg) - ave. of two readings.' SBPAV2 = 'Systolic blood pressure (mmHg) - average of two readings' EDUYRS = 'Years of education'; FORMAT SEX sexfmt. EVERBPRX NOWBPRX ynfmt. SMOKE smkfmt.; ********************************** END DATA STEP ****************************; proc freq data = heart ; tables eduyrs * smoke / chisq ; title1 'Years education vs. smoking status' ; run ; proc logistic data = heart ; model smoke = age sex eduyrs / clodds = pl ; title1 'Smoking status [3 levels] modelled as function of age, sex, yrs educ.' ; endsas ; *--------------------------------------------------------------------------------- Years education vs. smoking status 1 16:59 Sunday, April 4, 2004 TABLE OF EDUYRS BY SMOKE EDUYRS(Years of education) SMOKE(smoking status) Frequency| Percent | Row Pct | Col Pct |1: curre|2: exsmo|3: nonsm| Total |nt |ker |oker | ---------+--------+--------+--------+ 1 | 5 | 2 | 4 | 11 | 1.04 | 0.42 | 0.83 | 2.29 | 45.45 | 18.18 | 36.36 | | 3.11 | 1.44 | 2.22 | ---------+--------+--------+--------+ 2 | 8 | 4 | 4 | 16 | 1.67 | 0.83 | 0.83 | 3.33 | 50.00 | 25.00 | 25.00 | | 4.97 | 2.88 | 2.22 | ---------+--------+--------+--------+ 10 | 15 | 9 | 4 | 28 | 3.13 | 1.88 | 0.83 | 5.83 | 53.57 | 32.14 | 14.29 | | 9.32 | 6.47 | 2.22 | ---------+--------+--------+--------+ 11 | 59 | 45 | 60 | 164 | 12.29 | 9.38 | 12.50 | 34.17 | 35.98 | 27.44 | 36.59 | | 36.65 | 32.37 | 33.33 | ---------+--------+--------+--------+ 13 | 18 | 19 | 15 | 52 | 3.75 | 3.96 | 3.13 | 10.83 | 34.62 | 36.54 | 28.85 | | 11.18 | 13.67 | 8.33 | ---------+--------+--------+--------+ 15 | 29 | 26 | 42 | 97 | 6.04 | 5.42 | 8.75 | 20.21 | 29.90 | 26.80 | 43.30 | | 18.01 | 18.71 | 23.33 | ---------+--------+--------+--------+ 16 | 20 | 22 | 37 | 79 | 4.17 | 4.58 | 7.71 | 16.46 | 25.32 | 27.85 | 46.84 | | 12.42 | 15.83 | 20.56 | ---------+--------+--------+--------+ 18 | 7 | 12 | 14 | 33 | 1.46 | 2.50 | 2.92 | 6.88 | 21.21 | 36.36 | 42.42 | | 4.35 | 8.63 | 7.78 | ---------+--------+--------+--------+ Total 161 139 180 480 33.54 28.96 37.50 100.00 Frequency Missing = 5 ~john-c/5421/jecheart.sas 04APR04 16:59 Years education vs. smoking status 3 16:59 Sunday, April 4, 2004 STATISTICS FOR TABLE OF EDUYRS BY SMOKE Statistic DF Value Prob ------------------------------------------------------ Chi-Square 14 20.178 0.125 Likelihood Ratio Chi-Square 14 21.023 0.101 Mantel-Haenszel Chi-Square 1 9.262 0.002 Phi Coefficient 0.205 Contingency Coefficient 0.201 Cramer's V 0.145 Effective Sample Size = 480 Frequency Missing = 5 ~john-c/5421/jecheart.sas 04APR04 16:59 ----------------------------------------------------------------------------------- Smoking status [3 levels] modelled as function of age, sex, yrs educ. 4 16:59 Sunday, April 4, 2004 The LOGISTIC Procedure Data Set: WORK.HEART Response Variable: SMOKE smoking status Response Levels: 3 Number of Observations: 479 Link Function: Logit Response Profile Ordered Value SMOKE Count 1 1: current 161 2 2: exsmoker 138 3 3: nonsmoker 180 WARNING: 6 observation(s) were deleted due to missing values for the response or explanatory variables. Score Test for the Proportional Odds Assumption Chi-Square = 36.6287 with 3 DF (p=0.0001) Model Fitting Information and Testing Global Null Hypothesis BETA=0 Intercept Intercept and Criterion Only Covariates Chi-Square for Covariates AIC 1050.890 1030.476 . SC 1059.234 1051.334 . -2 LOG L 1046.890 1020.476 26.415 with 3 DF (p=0.0001) Score . . 25.736 with 3 DF (p=0.0001) Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCP1 1 2.0594 0.6095 11.4183 0.0007 . . INTERCP2 1 3.3021 0.6212 28.2554 0.0001 . . AGE 1 -0.0121 0.00694 3.0234 0.0821 -0.087417 0.988 SEX 1 -0.6532 0.1736 14.1654 0.0002 -0.179459 0.520 EDUYRS 1 -0.0972 0.0259 14.1349 0.0002 -0.194934 0.907 ~john-c/5421/jecheart.sas 04APR04 16:59 ----------------------------------------------------------------------------------- Smoking status [3 levels] modelled as function of age, sex, yrs educ. 5 16:59 Sunday, April 4, 2004 The LOGISTIC Procedure Association of Predicted Probabilities and Observed Responses Concordant = 59.8% Somers' D = 0.203 Discordant = 39.5% Gamma = 0.204 Tied = 0.7% Tau-a = 0.135 (76038 pairs) c = 0.601 Conditional Odds Ratios and 95% Confidence Intervals Profile Likelihood Confidence Limits Odds Variable Unit Ratio Lower Upper AGE 1.0000 0.988 0.975 1.001 SEX 1.0000 0.520 0.370 0.729 EDUYRS 1.0000 0.907 0.862 0.954 ~john-c/5421/jecheart.sas 04APR04 16:59 ================================================================================= The program includes 'proc freq' in which smoking status is tabulated versus years education. It appears that people with more years of education are less likely to smoke. Because of the way SMOKE is coded (1 = current smoker, 2 = ex-smoker, 3 = nonsmoker), one would expect the coefficient of EDUYRS in proc logistic to be *negative*: the more education you have, the less likely you are to be in one of the 'lower' SMOKE categories, assuming the DESCENDING option is not used in proc logistic. The proc logistic output indicates two INTERCEPT terms. These correspond to b01 and b02 discussed above. Note that b01 <--> INTERCEPT 1, and b02 <--> INTERCEPT 2. Note that, as described above, INTERCEPT 1 < INTERCEPT 2. The proc logistic output indicates two significant predictors of smoking status: SEX and EDUYRS. SEX is coded as 1 ( = MALE) and 2 ( = FEMALE). The coefficient for SEX is -.6532. This translates into an odds ratio of exp(-.6532) = 0.52. This can be interpreted as meaning that women in the study are less likely to be current smokers than men. We will make additional use of this data file for other SAS analyses. ================================================================================= PROBLEM 1: Use the model above to estimate the probability that a male of age 50 with 12 years of education will be an ex-smoker. PROBLEM 2: Verify that women in the Minnesota Heart Survey are less likely to be current- or ex-smokers than men. PROBLEM 3: Perform a logistic analysis with, again, SMOKE as the outcome variable, versus the following covariates: SEX, CHOL, and DEAD. Also investigate whether there is an interaction of SEX with DEAD as "predictors" of SMOKE. ================================================================================= n54703.013 Last update: April 5, 2004.