PROC LOGISTIC, V: Ordered Logistic n54703.013
The usual logistic model applies for dichotomous outcomes; that is,
the outcome event either occurs or it does not occur, and it can take on
only two values, one designating occurrence and the other designating
non-occurrence.
The model has been generalized so that the outcome can take only
several different ordered values. For example, the outcome variable
may be coded as:
Y = 1: dead
Y = 2: sick but alive
Y = 3: healthy
As is usual with the logistic model, the probability that Y attains
a certain value is a function of covariates, X1, X2, ..., Xp. The expression
of this probability for ordered outcomes is more complicated than it is for
the dichotomous case, and a bit harder to understand.
Assume as above that the outcome variable Y has 3 possible levels, and that
there are two covariates, X1 and X2. The logistic model is specified by the
two expressions:
prob(Y <= 1 | X1, X2) = 1 / (1 + exp(-b01 - b1*X1 - b2*X2)).
prob(Y <= 2 | X1, X2) = 1 / (1 + exp(-b02 - b1*X1 - b2*X2)).
[Note here, " <= " is to be read as "less than or equal to".]
Some notes:
1. Note here that the model specifies the probability that Y is *less than or
equal to* the specified value. Of course, in this example,
prob(Y <= 1 | X1, X2) = prob(Y = 1 | X1, X2), and
prob(Y <= 2 | X1, X2) = prob(Y = 1 or Y = 2 | X1, X2).
2. Note that the coefficients of X1 and X2 are the same regardless of
the level of Y that is specified. However, the intercept term
is *dependent* on the level. Thus, b01 and b02 are two different
intercept terms.
3. You might wonder how to compute prob(Y = 2 | X1, X2). This is:
prob(Y = 2 | X1, X2) = prob(Y <= 2 | X1, X2) - prob(Y <= 1 | X1, X2).
This can be written out more fully as:
prob(Y = 2 | X1, X2) = 1/(1 + exp(-b02 - b1*X1 - b2*X2))
- 1/(1 + exp(-b01 - b1*X1 - b2*X2)).
4. You might also wonder how to compute prob(Y = 3 | X1, X2): the correct
expression is:
prob(Y = 3 | X1, X2) = 1 - prob(Y <= 2 | X1, X2)
= exp(-b02 - b1*X1 - b2*X2)/(1 + exp(-b02 - b1*X1 - b2*X2)).
5. Because the values of Y are ordered, and the probability of any Y
level is nonnegative, the intercept coefficients are also necessarily
ordered: in this case,
b01 < b02.
6. The number of categories for Y should be relatively *small*, in general,
5 or fewer. There is no theoretical upper bound, but including
too many categories increases the number of parameters in the model
and, if they are "junk" variables, they cannot improve things.
6. Finally, you might wonder how to interpret the coefficients of X1 and
X2 in terms of odds ratios. Recall that for dichotomous outcomes,
exp(b1) is interpreted as the odds ratio of having an event (versus
not having it) for an increase of one unit in the covariate X1.
The interpretation for multi-valued ordered logistic is similar:
exp(b1) is the odds ratio that Y is at level j (versus being at a
level *greater than j*) for a one-unit increase in covariate X1.
I will certainly grant that it is hard to explain this further!
EXAMPLE OF THE USE OF ORDERED LOGISTIC:
The SAS program below applies ordered logistic to a datafile of 485
records from the Minnesota Heart Survey. The file is called mwheart.data.
There are links to this datafile on this course's website, on the page titled
'Computer Programs and Datafiles'. There are also links to a description of
the datafile and to a SAS program which reads the datafile.
The datafile includes a variable, SMOKE, which has three levels:
SMOKE = 1: current smoker
SMOKE = 2: ex-smoker
SMOKE = 3: never smoked
=================================================================================
* MWHEART.SAS;
* Reads MWHEART.DATA file (485 cases, 17 vars) randomly selected from;
* 4086 cases of the Mid-West Heart study;
OPTIONS LINESIZE = 80 CENTER PAGESIZE = 58 NUMBER LABEL;
TITLE 'Selected cases from the Mid-West Heart study conducted 1980-82 by';
TITLE2 'the Division of Epidemiology, SPH U. Minnesota';
TITLE3 'Mortality follow-up through 7/1/92 is based on National Death Index';
PROC FORMAT;
VALUE sexfmt 1='M' 2='F';
VALUE ynfmt 1='Y' 2='N';
VALUE smkfmt 1='1: current' 2='2: exsmoker' 3='3: nonsmoker';
DATA HEART;
INFILE 'mwheart.data';
INPUT
@1 ID 12.
@14 AGE 2.
@17 SEX 1.
@19 ENTRYDAT mmddyy8.
@28 DTHDATE mmddyy8.
@37 CAUSE 5.1
@43 CHOL 3.
@47 HDL 2.
@50 BMI 5.2
@56 SMOKE 1.
@58 CIGSDAY 2.
@61 THIOC 3.
@65 EVERBPRX 1.
@67 NOWBPRX 1.
@69 DBPAV2 3.
@72 SBPAV2 3.
@76 EDUYRS 2.
;
footnote "~john-c/5421/jecheart.sas &sysdate &systime" ;
* For survival analysis need: dead=death indicator and surv=survival;
* Compute days to death and convert to months;
IF dthdate = . THEN censor=MDY(01,07,92); ELSE censor=dthdate;
surv = (censor - entrydat)/30.4; *SURV units are in months;
IF dthdate = . THEN dead = 0; ELSE dead = 1; *DEAD is indicator;
survyrs = INT(surv/12); *integer years of survival;
LABEL
ID = 'Identifying sequential number'
AGE = 'age at entry'
SEX = 'Gender: 1=M, 2=F'
ENTRYDAT = 'Date of entry interview'
DTHDATE = 'Date of death'
CAUSE = 'ICD-9 code cause of death XXX.X'
CHOL = 'Serum total cholesterol (mg/dl)'
HDL = 'HDL cholesterol (mg/dl)'
BMI = 'Body Mass Index (function of heigt and weight)'
SMOKE = 'smoking status'
CIGSDAY = 'Cigarettes smoked per day'
THIOC = 'Thiocyanate level (indicator of smoking)'
EVERBPRX = 'Ever use BP med: 1=Y, 2=N'
NOWBPRX = 'Now use BP med: 1=Y, 2=N'
DBPAV2 = 'Diastolic Blood Pressure (mmHg) - ave. of two readings.'
SBPAV2 = 'Systolic blood pressure (mmHg) - average of two readings'
EDUYRS = 'Years of education';
FORMAT SEX sexfmt. EVERBPRX NOWBPRX ynfmt. SMOKE smkfmt.;
********************************** END DATA STEP ****************************;
proc freq data = heart ;
tables eduyrs * smoke / chisq ;
title1 'Years education vs. smoking status' ;
run ;
proc logistic data = heart ;
model smoke = age sex eduyrs / clodds = pl ;
title1 'Smoking status [3 levels] modelled as function of age, sex, yrs educ.' ;
endsas ;
*---------------------------------------------------------------------------------
Years education vs. smoking status 1
16:59 Sunday, April 4, 2004
TABLE OF EDUYRS BY SMOKE
EDUYRS(Years of education) SMOKE(smoking status)
Frequency|
Percent |
Row Pct |
Col Pct |1: curre|2: exsmo|3: nonsm| Total
|nt |ker |oker |
---------+--------+--------+--------+
1 | 5 | 2 | 4 | 11
| 1.04 | 0.42 | 0.83 | 2.29
| 45.45 | 18.18 | 36.36 |
| 3.11 | 1.44 | 2.22 |
---------+--------+--------+--------+
2 | 8 | 4 | 4 | 16
| 1.67 | 0.83 | 0.83 | 3.33
| 50.00 | 25.00 | 25.00 |
| 4.97 | 2.88 | 2.22 |
---------+--------+--------+--------+
10 | 15 | 9 | 4 | 28
| 3.13 | 1.88 | 0.83 | 5.83
| 53.57 | 32.14 | 14.29 |
| 9.32 | 6.47 | 2.22 |
---------+--------+--------+--------+
11 | 59 | 45 | 60 | 164
| 12.29 | 9.38 | 12.50 | 34.17
| 35.98 | 27.44 | 36.59 |
| 36.65 | 32.37 | 33.33 |
---------+--------+--------+--------+
13 | 18 | 19 | 15 | 52
| 3.75 | 3.96 | 3.13 | 10.83
| 34.62 | 36.54 | 28.85 |
| 11.18 | 13.67 | 8.33 |
---------+--------+--------+--------+
15 | 29 | 26 | 42 | 97
| 6.04 | 5.42 | 8.75 | 20.21
| 29.90 | 26.80 | 43.30 |
| 18.01 | 18.71 | 23.33 |
---------+--------+--------+--------+
16 | 20 | 22 | 37 | 79
| 4.17 | 4.58 | 7.71 | 16.46
| 25.32 | 27.85 | 46.84 |
| 12.42 | 15.83 | 20.56 |
---------+--------+--------+--------+
18 | 7 | 12 | 14 | 33
| 1.46 | 2.50 | 2.92 | 6.88
| 21.21 | 36.36 | 42.42 |
| 4.35 | 8.63 | 7.78 |
---------+--------+--------+--------+
Total 161 139 180 480
33.54 28.96 37.50 100.00
Frequency Missing = 5
~john-c/5421/jecheart.sas 04APR04 16:59
Years education vs. smoking status 3
16:59 Sunday, April 4, 2004
STATISTICS FOR TABLE OF EDUYRS BY SMOKE
Statistic DF Value Prob
------------------------------------------------------
Chi-Square 14 20.178 0.125
Likelihood Ratio Chi-Square 14 21.023 0.101
Mantel-Haenszel Chi-Square 1 9.262 0.002
Phi Coefficient 0.205
Contingency Coefficient 0.201
Cramer's V 0.145
Effective Sample Size = 480
Frequency Missing = 5
~john-c/5421/jecheart.sas 04APR04 16:59
-----------------------------------------------------------------------------------
Smoking status [3 levels] modelled as function of age, sex, yrs educ. 4
16:59 Sunday, April 4, 2004
The LOGISTIC Procedure
Data Set: WORK.HEART
Response Variable: SMOKE smoking status
Response Levels: 3
Number of Observations: 479
Link Function: Logit
Response Profile
Ordered
Value SMOKE Count
1 1: current 161
2 2: exsmoker 138
3 3: nonsmoker 180
WARNING: 6 observation(s) were deleted due to missing values for the response
or explanatory variables.
Score Test for the Proportional Odds Assumption
Chi-Square = 36.6287 with 3 DF (p=0.0001)
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 1050.890 1030.476 .
SC 1059.234 1051.334 .
-2 LOG L 1046.890 1020.476 26.415 with 3 DF (p=0.0001)
Score . . 25.736 with 3 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCP1 1 2.0594 0.6095 11.4183 0.0007 . .
INTERCP2 1 3.3021 0.6212 28.2554 0.0001 . .
AGE 1 -0.0121 0.00694 3.0234 0.0821 -0.087417 0.988
SEX 1 -0.6532 0.1736 14.1654 0.0002 -0.179459 0.520
EDUYRS 1 -0.0972 0.0259 14.1349 0.0002 -0.194934 0.907
~john-c/5421/jecheart.sas 04APR04 16:59
-----------------------------------------------------------------------------------
Smoking status [3 levels] modelled as function of age, sex, yrs educ. 5
16:59 Sunday, April 4, 2004
The LOGISTIC Procedure
Association of Predicted Probabilities and Observed Responses
Concordant = 59.8% Somers' D = 0.203
Discordant = 39.5% Gamma = 0.204
Tied = 0.7% Tau-a = 0.135
(76038 pairs) c = 0.601
Conditional Odds Ratios and 95% Confidence Intervals
Profile Likelihood
Confidence Limits
Odds
Variable Unit Ratio Lower Upper
AGE 1.0000 0.988 0.975 1.001
SEX 1.0000 0.520 0.370 0.729
EDUYRS 1.0000 0.907 0.862 0.954
~john-c/5421/jecheart.sas 04APR04 16:59
=================================================================================
The program includes 'proc freq' in which smoking status is tabulated
versus years education. It appears that people with more years of education
are less likely to smoke. Because of the way SMOKE is coded (1 = current
smoker, 2 = ex-smoker, 3 = nonsmoker), one would expect the coefficient
of EDUYRS in proc logistic to be *negative*: the more education you have, the
less likely you are to be in one of the 'lower' SMOKE categories, assuming the
DESCENDING option is not used in proc logistic.
The proc logistic output indicates two INTERCEPT terms. These correspond
to b01 and b02 discussed above. Note that b01 <--> INTERCEPT 1, and
b02 <--> INTERCEPT 2. Note that, as described above, INTERCEPT 1 < INTERCEPT 2.
The proc logistic output indicates two significant predictors of smoking
status: SEX and EDUYRS. SEX is coded as 1 ( = MALE) and 2 ( = FEMALE). The
coefficient for SEX is -.6532. This translates into an odds ratio of
exp(-.6532) = 0.52. This can be interpreted as meaning that women in the study are
less likely to be current smokers than men.
We will make additional use of this data file for other SAS analyses.
=================================================================================
PROBLEM 1:
Use the model above to estimate the probability that a male of age
50 with 12 years of education will be an ex-smoker.
PROBLEM 2:
Verify that women in the Minnesota Heart Survey are less likely to be
current- or ex-smokers than men.
PROBLEM 3:
Perform a logistic analysis with, again, SMOKE as the outcome variable,
versus the following covariates: SEX, CHOL, and DEAD. Also investigate whether
there is an interaction of SEX with DEAD as "predictors" of SMOKE.
=================================================================================
n54703.013 Last update: April 5, 2004.