PROC LOGISTIC, V: Ordered Logistic                              n54703.013

     The usual logistic model applies for dichotomous outcomes; that is,
the outcome event either occurs or it does not occur, and it can take on
only two values, one designating occurrence and the other designating
non-occurrence.

     The model has been generalized so that the outcome can take only
several different ordered values.  For example, the outcome variable
may be coded as:

     Y = 1:  dead

     Y = 2:  sick but alive

     Y = 3:  healthy

     As is usual with the logistic model, the probability that Y attains
a certain value is a function of covariates, X1, X2, ..., Xp.  The expression
of this probability for ordered outcomes is more complicated than it is for
the dichotomous case, and a bit harder to understand.

     Assume as above that the outcome variable Y has 3 possible levels, and that
there are two covariates, X1 and X2.  The logistic model is specified by the
two expressions:

     prob(Y <= 1 | X1, X2) = 1 / (1 + exp(-b01 - b1*X1 - b2*X2)).

     prob(Y <= 2 | X1, X2) = 1 / (1 + exp(-b02 - b1*X1 - b2*X2)).


     [Note here, " <= " is to be read as "less than or equal to".]

     Some notes:

     1.  Note here that the model specifies the probability that Y is *less than or
         equal to* the specified value.  Of course, in this example,

         prob(Y <= 1 | X1, X2) = prob(Y = 1 | X1, X2), and

         prob(Y <= 2 | X1, X2) = prob(Y = 1 or Y = 2 | X1, X2).

     2.  Note that the coefficients of X1 and X2 are the same regardless of
         the level of Y that is specified.  However, the intercept term
         is *dependent* on the level.  Thus, b01 and b02 are two different
         intercept terms.

     3.  You might wonder how to compute prob(Y = 2 | X1, X2).  This is:

         prob(Y = 2 | X1, X2) = prob(Y <= 2 | X1, X2) - prob(Y <= 1 | X1, X2).

         This can be written out more fully as:

         prob(Y = 2 | X1, X2) = 1/(1 + exp(-b02 - b1*X1 - b2*X2))

                               - 1/(1 + exp(-b01 - b1*X1 - b2*X2)).

     4.  You might also wonder how to compute prob(Y = 3 | X1, X2): the correct
         expression is:

         prob(Y = 3 | X1, X2) = 1 - prob(Y <= 2 | X1, X2)

                              = exp(-b02 - b1*X1 - b2*X2)/(1 + exp(-b02 - b1*X1 - b2*X2)).


     5.  Because the values of Y are ordered, and the probability of any Y
         level is nonnegative, the intercept coefficients are also necessarily
         ordered: in this case,

                  b01 < b02.

     6.  The number of categories for Y should be relatively *small*, in general,
         5 or fewer.  There is no theoretical upper bound, but including
         too many categories increases the number of parameters in the model
         and, if they are "junk" variables, they cannot improve things.

     6.  Finally, you might wonder how to interpret the coefficients of X1 and
         X2 in terms of odds ratios.  Recall that for dichotomous outcomes,
         exp(b1) is interpreted as the odds ratio of having an event (versus
         not having it) for an increase of one unit in the covariate X1.

         The interpretation for multi-valued ordered logistic is similar:

         exp(b1) is the odds ratio that Y is at level j (versus being at a
         level *greater than j*) for a one-unit increase in covariate X1.

         I will certainly grant that it is hard to explain this further!


EXAMPLE OF THE USE OF ORDERED LOGISTIC:

     The SAS program below applies ordered logistic to a datafile of 485
records from the Minnesota Heart Survey.  The file is called mwheart.data.
There are links to this datafile on this course's website, on the page titled
'Computer Programs and Datafiles'.  There are also links to a description of
the datafile and to a SAS program which reads the datafile.

     The datafile includes a variable, SMOKE, which has three levels:

         SMOKE = 1:  current smoker
         SMOKE = 2:  ex-smoker
         SMOKE = 3:  never smoked

=================================================================================


*   MWHEART.SAS;
*   Reads MWHEART.DATA file (485 cases, 17 vars) randomly selected from;
*   4086 cases of the Mid-West Heart study;

OPTIONS LINESIZE = 80 CENTER PAGESIZE = 58  NUMBER LABEL;
TITLE 'Selected cases from the Mid-West Heart study conducted 1980-82 by';
TITLE2 'the Division of Epidemiology, SPH U. Minnesota';
TITLE3 'Mortality follow-up through 7/1/92 is based on National Death Index';

PROC FORMAT;
	VALUE sexfmt 	1='M'  2='F';
 VALUE ynfmt	  1='Y'  2='N';
	VALUE smkfmt	 1='1: current'  2='2: exsmoker'  3='3: nonsmoker';
 
DATA HEART;
INFILE 'mwheart.data';
INPUT
@1 ID         12.
@14 AGE        2.
@17 SEX        1.
@19 ENTRYDAT  mmddyy8.
@28 DTHDATE   mmddyy8.
@37 CAUSE      5.1
@43 CHOL       3.		
@47 HDL        2.
@50 BMI        5.2
@56 SMOKE	     1.
@58 CIGSDAY    2.
@61 THIOC      3.
@65 EVERBPRX   1.
@67 NOWBPRX    1.
@69 DBPAV2     3.
@72 SBPAV2     3.
@76 EDUYRS     2.
;
footnote "~john-c/5421/jecheart.sas &sysdate &systime" ;

* For survival analysis need: dead=death indicator and surv=survival;
* Compute days to death and convert to months;

	IF dthdate = . THEN censor=MDY(01,07,92); ELSE censor=dthdate;
	surv = (censor - entrydat)/30.4;             *SURV units are in months;
	IF dthdate = . THEN dead = 0; ELSE dead = 1; *DEAD is indicator;
	survyrs = INT(surv/12);                      *integer years of survival;

LABEL
 ID       = 'Identifying sequential number'
 AGE      = 'age at entry'
 SEX      = 'Gender: 1=M, 2=F'
 ENTRYDAT = 'Date of entry interview'
 DTHDATE  = 'Date of death'
 CAUSE    = 'ICD-9 code cause of death XXX.X'
 CHOL     = 'Serum total cholesterol (mg/dl)'
 HDL      = 'HDL cholesterol (mg/dl)'
 BMI      = 'Body Mass Index (function of heigt and weight)'
 SMOKE    = 'smoking status'
 CIGSDAY  = 'Cigarettes smoked per day'
 THIOC    = 'Thiocyanate level (indicator of smoking)'
 EVERBPRX = 'Ever use BP med: 1=Y, 2=N'
 NOWBPRX  = 'Now use BP med: 1=Y, 2=N'
 DBPAV2	  = 'Diastolic Blood Pressure (mmHg) - ave. of two readings.'
 SBPAV2   = 'Systolic blood pressure (mmHg) - average of two readings'
 EDUYRS   = 'Years of education';

FORMAT	SEX sexfmt. EVERBPRX NOWBPRX ynfmt. SMOKE smkfmt.;
		
********************************** END DATA STEP ****************************;

proc freq data = heart ;
     tables eduyrs * smoke / chisq ;
title1 'Years education vs. smoking status' ;
run ;

proc logistic data = heart ;
     model smoke = age sex eduyrs / clodds = pl ;
title1 'Smoking status [3 levels] modelled as function of age, sex, yrs educ.' ;

endsas ;

*---------------------------------------------------------------------------------

                       Years education vs. smoking status                      1
                                                     16:59 Sunday, April 4, 2004

                            TABLE OF EDUYRS BY SMOKE

                  EDUYRS(Years of education)     SMOKE(smoking status)

                  Frequency|
                  Percent  |
                  Row Pct  |
                  Col Pct  |1: curre|2: exsmo|3: nonsm|  Total
                           |nt      |ker     |oker    |
                  ---------+--------+--------+--------+
                         1 |      5 |      2 |      4 |     11
                           |   1.04 |   0.42 |   0.83 |   2.29
                           |  45.45 |  18.18 |  36.36 |
                           |   3.11 |   1.44 |   2.22 |
                  ---------+--------+--------+--------+
                         2 |      8 |      4 |      4 |     16
                           |   1.67 |   0.83 |   0.83 |   3.33
                           |  50.00 |  25.00 |  25.00 |
                           |   4.97 |   2.88 |   2.22 |
                  ---------+--------+--------+--------+
                        10 |     15 |      9 |      4 |     28
                           |   3.13 |   1.88 |   0.83 |   5.83
                           |  53.57 |  32.14 |  14.29 |
                           |   9.32 |   6.47 |   2.22 |
                  ---------+--------+--------+--------+
                        11 |     59 |     45 |     60 |    164
                           |  12.29 |   9.38 |  12.50 |  34.17
                           |  35.98 |  27.44 |  36.59 |
                           |  36.65 |  32.37 |  33.33 |
                  ---------+--------+--------+--------+
                        13 |     18 |     19 |     15 |     52
                           |   3.75 |   3.96 |   3.13 |  10.83
                           |  34.62 |  36.54 |  28.85 |
                           |  11.18 |  13.67 |   8.33 |
                  ---------+--------+--------+--------+
                        15 |     29 |     26 |     42 |     97
                           |   6.04 |   5.42 |   8.75 |  20.21
                           |  29.90 |  26.80 |  43.30 |
                           |  18.01 |  18.71 |  23.33 |
                  ---------+--------+--------+--------+
                        16 |     20 |     22 |     37 |     79
                           |   4.17 |   4.58 |   7.71 |  16.46
                           |  25.32 |  27.85 |  46.84 |
                           |  12.42 |  15.83 |  20.56 |
                  ---------+--------+--------+--------+
                        18 |      7 |     12 |     14 |     33
                           |   1.46 |   2.50 |   2.92 |   6.88
                           |  21.21 |  36.36 |  42.42 |
                           |   4.35 |   8.63 |   7.78 |
                  ---------+--------+--------+--------+
                  Total         161      139      180      480
                              33.54    28.96    37.50   100.00

                  Frequency Missing = 5
 
                    ~john-c/5421/jecheart.sas 04APR04 16:59
                        Years education vs. smoking status                      3
                                                     16:59 Sunday, April 4, 2004

                    STATISTICS FOR TABLE OF EDUYRS BY SMOKE

             Statistic                     DF     Value        Prob
             ------------------------------------------------------
             Chi-Square                    14    20.178       0.125
             Likelihood Ratio Chi-Square   14    21.023       0.101
             Mantel-Haenszel Chi-Square     1     9.262       0.002
             Phi Coefficient                      0.205            
             Contingency Coefficient              0.201            
             Cramer's V                           0.145            

             Effective Sample Size = 480
             Frequency Missing = 5
 
                    ~john-c/5421/jecheart.sas 04APR04 16:59
-----------------------------------------------------------------------------------
      Smoking status [3 levels] modelled as function of age, sex, yrs educ.     4
                                                     16:59 Sunday, April 4, 2004

                             The LOGISTIC Procedure

     Data Set: WORK.HEART   
     Response Variable: SMOKE     smoking status
     Response Levels: 3
     Number of Observations: 479
     Link Function: Logit


                                Response Profile
 
                        Ordered
                          Value  SMOKE            Count

                              1  1: current         161
                              2  2: exsmoker        138
                              3  3: nonsmoker       180

WARNING: 6 observation(s) were deleted due to missing values for the response 
         or explanatory variables.



                Score Test for the Proportional Odds Assumption

                   Chi-Square = 36.6287 with 3 DF (p=0.0001)


      Model Fitting Information and Testing Global Null Hypothesis BETA=0
 
                               Intercept
                 Intercept        and   
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC            1050.890      1030.476         .                          
   SC             1059.234      1051.334         .                          
   -2 LOG L       1046.890      1020.476       26.415 with 3 DF (p=0.0001)  

   Score              .             .          25.736 with 3 DF (p=0.0001)  


                   Analysis of Maximum Likelihood Estimates
 
              Parameter Standard    Wald       Pr >    Standardized     Odds
  Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

  INTERCP1 1     2.0594   0.6095    11.4183     0.0007            .     .    
  INTERCP2 1     3.3021   0.6212    28.2554     0.0001            .     .    
  AGE      1    -0.0121  0.00694     3.0234     0.0821    -0.087417    0.988 
  SEX      1    -0.6532   0.1736    14.1654     0.0002    -0.179459    0.520 
  EDUYRS   1    -0.0972   0.0259    14.1349     0.0002    -0.194934    0.907 
 
 
 
 
                    ~john-c/5421/jecheart.sas 04APR04 16:59
-----------------------------------------------------------------------------------
      Smoking status [3 levels] modelled as function of age, sex, yrs educ.     5
                                                     16:59 Sunday, April 4, 2004

                             The LOGISTIC Procedure

         Association of Predicted Probabilities and Observed Responses

                   Concordant = 59.8%          Somers' D = 0.203
                   Discordant = 39.5%          Gamma     = 0.204
                   Tied       =  0.7%          Tau-a     = 0.135
                   (76038 pairs)               c         = 0.601


              Conditional Odds Ratios and 95% Confidence Intervals
 
                                                 Profile Likelihood
                                                  Confidence Limits
                                        Odds
            Variable        Unit       Ratio       Lower       Upper

            AGE           1.0000       0.988       0.975       1.001
            SEX           1.0000       0.520       0.370       0.729
            EDUYRS        1.0000       0.907       0.862       0.954

 
                    ~john-c/5421/jecheart.sas 04APR04 16:59
=================================================================================

     The program includes 'proc freq' in which smoking status is tabulated
versus years education.  It appears that people with more years of education
are less likely to smoke.  Because of the way SMOKE is coded (1 = current
smoker, 2 = ex-smoker, 3 = nonsmoker), one would expect the coefficient
of EDUYRS in proc logistic to be *negative*: the more education you have, the
less likely you are to be in one of the 'lower' SMOKE categories, assuming the
DESCENDING option is not used in proc logistic.

     The proc logistic output indicates two INTERCEPT terms.  These correspond
to b01 and b02 discussed above.  Note that b01 <--> INTERCEPT 1, and
b02 <--> INTERCEPT 2.  Note that, as described above, INTERCEPT 1 < INTERCEPT 2.

     The proc logistic output indicates two significant predictors of smoking
status: SEX and EDUYRS.  SEX is coded as 1 ( = MALE) and 2 ( = FEMALE).  The
coefficient for SEX is -.6532.  This translates into an odds ratio of 
exp(-.6532) = 0.52.  This can be interpreted as meaning that women in the study are
less likely to be current smokers than men.

     We will make additional use of this data file for other SAS analyses.
=================================================================================

PROBLEM 1:  

     Use the model above to estimate the probability that a male of age
50 with 12 years of education will be an ex-smoker.


PROBLEM 2:

     Verify that women in the Minnesota Heart Survey are less likely to be
current- or ex-smokers than men.


PROBLEM 3:

     Perform a logistic analysis with, again, SMOKE as the outcome variable,
versus the following covariates: SEX, CHOL, and DEAD.  Also investigate whether
there is an interaction of SEX with DEAD as "predictors" of SMOKE.


=================================================================================

n54703.013  Last update: April 5, 2004.