PROC LOGISTIC, III: Continuous Covariates n54703.011
In the preceding discussions of PROC LOGISTIC we have focussed on
examples in which categorical covariates (or 'predictors') are
employed. However, PROC LOGISTIC, like PROC GLM, can be used with
continuous covariates also.
If Y is a dichotomous outcome variable [e.g., success or failure
of an operation], and X1, X2, ..., Xp are p covariates, the usual logistic
model is:
                                                    1
Prob(Y = 1 | X1, X2, ..., Xp) = ------------------------------------------ ,
                                1 + exp(-b0 - b1*X1 - b2*X2 - ... - bp*Xp)
where b0, b1, b2, ..., bp are unknown coefficients.
It is certainly possible that the covariates of interest are
in fact risk factors, like age or systolic blood pressure or FEV1.
Age in particular is a risk factor for a great many chronic conditions
and for survival of serious illnesses: in general you expect that
the higher the age, the greater the probability of an event. In terms
of the coefficients in the above model, that translates into the
coefficient of age being *positive*.
This also relates to the interpretation of exp(b1), which is an odds
ratio: the odds ratio associated with a 1-unit increase in the value of
the covariate X1, with the other covariates in the model held fixed.
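The following short DATA step is a sketch that illustrates this property
numerically. The coefficients b0 and b1 are hypothetical values chosen only
for illustration; they are not estimates from any dataset. Whatever the
starting value of X1, the ratio of the odds at X1 + 1 to the odds at X1
should equal exp(b1).

```sas
* Hypothetical coefficients, chosen only for illustration ;
data oddsdemo ;
   b0 = -2.0 ;
   b1 = 0.05 ;
   do x1 = 40 to 60 by 10 ;
      p0 = 1 / (1 + exp(-b0 - b1*x1)) ;            * Prob(Y = 1) at x1 ;
      p1 = 1 / (1 + exp(-b0 - b1*(x1 + 1))) ;      * Prob(Y = 1) at x1 + 1 ;
      oddsratio = (p1 / (1 - p1)) / (p0 / (1 - p0)) ;
      exp_b1 = exp(b1) ;                           * should match oddsratio ;
      output ;
   end ;
run ;

proc print data = oddsdemo ;
   var x1 p0 p1 oddsratio exp_b1 ;
run ;
```

The probabilities p0 and p1 change from row to row, but the odds ratio
column should be the same in every row.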
Appended is an example of logistic regression with several covariates.
The data are from the Lung Health Study. The outcome variable is smoking
status at Year 1: VPCQUIT1 = 0 means the person did not quit smoking, whereas
VPCQUIT1 = 1 means the person did quit smoking. The covariates are AGE at
baseline; GENDER (0 = men, 1 = women); BMI (body mass index); F10CIGS (baseline
cigarettes per day); YEAREDUC (years education category, 1-9); and S2FEVPC2
(baseline FEV1 percent of normal).
The first regression uses the selection = stepwise option. The
second regression shows the effect of adding in interaction terms.
=================================================================================
* Interaction terms (assignment statements from the DATA step that builds SMOKE) ;
educage = age*educ ;
educbmi = educ*bmi ;
* ==================================================================== ;
options linesize = 100 ;

proc logistic data = smoke descending ;
   where siuc eq 1 ;
   model vpcquit1 = age gender bmi f10cigs yeareduc s2fevpc2
         / selection = stepwise lackfit rsquare clodds = pl ;
   title1 'PROC LOGISTIC: quit-smoking at 1 year versus demographic vars' ;
   title2 'and baseline FEV1 percent predicted. Stepwise option employed.' ;
   title3 'Also lack-of-fit statistics, pseudo-R-square' ;
run ;

proc logistic data = smoke descending ;
   where siuc eq 1 ;
   model vpcquit1 = yeareduc age bmi educage educbmi
         / lackfit rsquare clodds = pl ;
   title1 'PROC LOGISTIC: quit-smoking at 1 year versus demographic vars' ;
   title2 'and interactions. Also lack-of-fit statistics, pseudo-R-square' ;
   title3 '' ;
run ;
endsas ;
----------------------------------------------------------------------------------------------------
PROC LOGISTIC: quit-smoking at 1 year versus demographic vars 1
and baseline FEV1 percent predicted. Stepwise option employed.
Also lack-of-fit statistics, pseudo-R-square
17:37 Saturday, March 13, 2004
The LOGISTIC Procedure
Data Set: WORK.SMOKE
Response Variable: VPCQUIT1     VALID QUIT AT AV1
Response Levels: 2
Number of Observations: 3916
Link Function: Logit

             Response Profile

   Ordered
     Value    VPCQUIT1     Count

         1           1      1354
         2           0      2562

WARNING: 7 observation(s) were deleted due to missing values for the response or explanatory
         variables.

Stepwise Selection Procedure

Step 0. Intercept entered:

   Residual Chi-Square = 56.3119 with 6 DF (p=0.0001)

Step 1. Variable YEAREDUC entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                                Intercept
                  Intercept        and
   Criterion        Only       Covariates    Chi-Square for Covariates

   AIC            5051.940      5026.333          .
   SC             5058.213      5038.878          .
   -2 LOG L       5049.940      5022.333       27.607 with 1 DF (p=0.0001)
   Score              .             .          27.458 with 1 DF (p=0.0001)

   RSquare = 0.0070     Max-rescaled RSquare = 0.0097

   Residual Chi-Square = 28.8407 with 5 DF (p=0.0001)

LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 13MAR04 17:37
Step 2. Variable AGE entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                                Intercept
                  Intercept        and
   Criterion        Only       Covariates    Chi-Square for Covariates

   AIC            5051.940      5019.408          .
   SC             5058.213      5038.226          .
   -2 LOG L       5049.940      5013.408       36.533 with 2 DF (p=0.0001)
   Score              .             .          36.451 with 2 DF (p=0.0001)

   RSquare = 0.0093     Max-rescaled RSquare = 0.0128

   Residual Chi-Square = 19.9902 with 4 DF (p=0.0005)

Step 3. Variable BMI entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                                Intercept
                  Intercept        and
   Criterion        Only       Covariates    Chi-Square for Covariates

   AIC            5051.940      5012.708          .
   SC             5058.213      5037.799          .
   -2 LOG L       5049.940      5004.708       45.232 with 3 DF (p=0.0001)
   Score              .             .          45.113 with 3 DF (p=0.0001)

   RSquare = 0.0115     Max-rescaled RSquare = 0.0158

   Residual Chi-Square = 11.2975 with 3 DF (p=0.0102)

Step 4. Variable S2FEVPC2 entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                                Intercept
                  Intercept        and
   Criterion        Only       Covariates    Chi-Square for Covariates

   AIC            5051.940      5006.806          .
   SC             5058.213      5038.170          .
   -2 LOG L       5049.940      4996.806       53.134 with 4 DF (p=0.0001)
   Score              .             .          52.875 with 4 DF (p=0.0001)

   RSquare = 0.0135     Max-rescaled RSquare = 0.0186

   Residual Chi-Square = 3.4348 with 2 DF (p=0.1795)

NOTE: No (additional) variables met the 0.05 significance level for entry into the model.

                         Summary of Stepwise Procedure

           Variable         Number     Score        Wald        Pr >      Variable
Step   Entered    Removed     In     Chi-Square  Chi-Square  Chi-Square   Label

   1   YEAREDUC                1       27.4576       .         0.0001     YEARS EDUCATION
   2   AGE                     2        8.8999       .         0.0029     AGE AT ENTRY INTO LHS
   3   BMI                     3        8.7354       .         0.0031     BODY MASS INDEX (KG/M2)
   4   S2FEVPC2                4        7.8697       .         0.0050     FEV1 % PRED POST-BD SCREEN 2

                    Analysis of Maximum Likelihood Estimates

              Parameter   Standard      Wald        Pr >      Standardized    Odds
Variable  DF   Estimate     Error    Chi-Square  Chi-Square     Estimate      Ratio

INTERCPT   1    -3.8862     0.5186     56.1531     0.0001          .            .
AGE        1     0.0177    0.00509     12.0865     0.0005       0.066710      1.018
BMI        1     0.0263    0.00860      9.3313     0.0023       0.056911      1.027
YEAREDUC   1     0.0640     0.0121     28.1388     0.0001       0.100014      1.066
S2FEVPC2   1     0.0107    0.00383      7.8554     0.0051       0.053863      1.011

     Association of Predicted Probabilities and Observed Responses

          Concordant = 56.4%          Somers' D = 0.139
          Discordant = 42.5%          Gamma     = 0.141
          Tied       =  1.0%          Tau-a     = 0.063
          (3468948 pairs)             c         = 0.570

     Conditional Odds Ratios and 95% Confidence Intervals

                                      Profile Likelihood
                                       Confidence Limits
                            Odds
     Variable      Unit    Ratio       Lower       Upper

     AGE         1.0000    1.018       1.008       1.028
     BMI         1.0000    1.027       1.009       1.044
     YEAREDUC    1.0000    1.066       1.041       1.092
     S2FEVPC2    1.0000    1.011       1.003       1.018

               Hosmer and Lemeshow Goodness-of-Fit Test

                          VPCQUIT1 = 1             VPCQUIT1 = 0
                      --------------------     --------------------
     Group   Total    Observed    Expected     Observed    Expected

         1     394          95      100.76          299      293.24
         2     392         127      113.73          265      278.27
         3     392         120      120.87          272      271.13
         4     394         123      127.16          271      266.84
         5     393         118      131.82          275      261.18
         6     391         142      136.21          249      254.79
         7     390         141      141.57          249      248.43
         8     393         152      149.16          241      243.84
         9     392         160      158.32          232      233.68
        10     385         176      174.29          209      210.71

     Goodness-of-fit Statistic = 5.544 with 8 DF (p=0.6982)

----------------------------------------------------------------------------------------------------
PROC LOGISTIC: quit-smoking at 1 year versus demographic vars 5
and interactions. Also lack-of-fit statistics, pseudo-R-square
' 17:37 Saturday, March 13, 2004
The LOGISTIC Procedure
Data Set: WORK.SMOKE
Response Variable: VPCQUIT1     VALID QUIT AT AV1
Response Levels: 2
Number of Observations: 3916
Link Function: Logit

             Response Profile

   Ordered
     Value    VPCQUIT1     Count

         1           1      1354
         2           0      2562

WARNING: 7 observation(s) were deleted due to missing values for the response or explanatory
         variables.

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                                Intercept
                  Intercept        and
   Criterion        Only       Covariates    Chi-Square for Covariates

   AIC            5051.940      5014.621          .
   SC             5058.213      5052.258          .
   -2 LOG L       5049.940      5002.621       47.319 with 5 DF (p=0.0001)
   Score              .             .          46.732 with 5 DF (p=0.0001)

   RSquare = 0.0120     Max-rescaled RSquare = 0.0166

                         Analysis of Maximum Likelihood Estimates

              Parameter   Standard      Wald        Pr >      Standardized    Odds    Variable
Variable  DF   Estimate     Error    Chi-Square  Chi-Square     Estimate      Ratio   Label

INTERCPT   1    -5.1471     1.6542      9.6815     0.0019          .            .     Intercept
YEAREDUC   1     0.2289     0.1177      3.7839     0.0517       0.357704      1.257   YEARS EDUCATION
AGE        1     0.0459     0.0251      3.3474     0.0673       0.173002      1.047   AGE AT ENTRY INTO LHS
BMI        1     0.0525     0.0431      1.4794     0.2239       0.113603      1.054   BODY MASS INDEX (KG/M2)
EDUCAGE    1   -0.00225    0.00177      1.6187     0.2033      -0.203210      0.998
EDUCBMI    1   -0.00201    0.00310      0.4188     0.5175      -0.100583      0.998

     Association of Predicted Probabilities and Observed Responses

          Concordant = 56.1%          Somers' D = 0.134
          Discordant = 42.8%          Gamma     = 0.135
          Tied       =  1.1%          Tau-a     = 0.060
          (3468948 pairs)             c         = 0.567

     Conditional Odds Ratios and 95% Confidence Intervals

                                      Profile Likelihood
                                       Confidence Limits
                            Odds
     Variable      Unit    Ratio       Lower       Upper

     YEAREDUC    1.0000    1.257       0.998       1.583
     AGE         1.0000    1.047       0.997       1.100
     BMI         1.0000    1.054       0.968       1.147
     EDUCAGE     1.0000    0.998       0.994       1.001
     EDUCBMI     1.0000    0.998       0.992       1.004

               Hosmer and Lemeshow Goodness-of-Fit Test

                          VPCQUIT1 = 1             VPCQUIT1 = 0
                      --------------------     --------------------
     Group   Total    Observed    Expected     Observed    Expected

         1     393          98      100.53          295      292.47
         2     391         121      114.67          270      276.33
         3     393         121      122.35          272      270.65
         4     394         109      127.97          285      266.03
         5     392         135      132.52          257      259.48
         6     394         148      137.97          246      256.03
         7     392         146      142.65          246      249.35
         8     393         151      149.32          242      243.68
         9     394         170      158.21          224      235.79
        10     380         155      167.68          225      212.32

     Goodness-of-fit Statistic = 9.298 with 8 DF (p=0.3178)

=================================================================================
The stepwise regression selected 4 of the covariates as useful
predictors of the outcome (quitting smoking at Year 1): years education,
age, body mass index, and baseline FEV1 percent of normal. All had
positive coefficients, indicating that higher levels of any of them
are associated with an increased chance of having quit smoking by
Year 1.
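As a sketch of how the fitted model turns covariate values into a
predicted probability, the DATA step below plugs the printed coefficients
into the logistic formula. The covariate values (age 50, BMI 25, and so on)
describe a hypothetical participant and are chosen only for illustration.

```sas
data _null_ ;
   * Hypothetical covariate values, for illustration only ;
   age      = 50 ;
   bmi      = 25 ;
   yeareduc = 9 ;
   s2fevpc2 = 80 ;

   * Coefficients copied from the Analysis of Maximum Likelihood Estimates ;
   xbeta = -3.8862 + 0.0177*age + 0.0263*bmi
                   + 0.0640*yeareduc + 0.0107*s2fevpc2 ;

   * Predicted probability of having quit smoking at Year 1 ;
   pquit = 1 / (1 + exp(-xbeta)) ;

   put xbeta= pquit= ;
run ;
```

The result is written to the SAS log. Changing one covariate at a time
shows how little the predicted probability moves for a 1-unit change.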
Note that the coefficient for AGE is 0.0177. The odds ratio
corresponding to a 1-year increase in age is exp(0.0177) = 1.018.
This is obviously not very different from 1. In fact you might
not be very interested in the effect of only a 1-year increase in
age. You might be more interested in the effect of a 10-year
increment in age. This can be calculated also. The odds ratio
of quitting smoking for a 10-year increase in age is:
exp(10*.0177) = exp(.177) = 1.1936.
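Rather than computing this by hand, you can ask SAS for it. The sketch
below assumes a release of SAS in which PROC LOGISTIC supports the UNITS
statement; it refits the final four-variable model and requests the odds
ratio for a 10-year change in age. The UNITS statement affects only the
odds ratios requested through CLODDS=. The DATA step afterwards simply
repeats the hand calculation.

```sas
proc logistic data = smoke descending ;
   where siuc eq 1 ;
   model vpcquit1 = age bmi yeareduc s2fevpc2 / clodds = pl ;
   units age = 10 ;      * odds ratio per 10 years of age ;
run ;

* The same quantity by hand, from the printed coefficient for AGE ;
data _null_ ;
   or10 = exp(10 * 0.0177) ;
   put or10= 8.4 ;
run ;
```

An advantage of the UNITS statement is that the confidence limits are
rescaled along with the odds ratio.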
SAS will compute a 'pseudo-R-square' for logistic regressions when
the RSQUARE option is specified on the MODEL statement. This is analogous
to the R-square statistic for ordinary linear regression. If R-square is
close to 1, it is an indication that much of the variability in outcome
is due to the regression, i.e., is explained by the covariates.
In this case the final value of R-square is very small: 0.0135. This
implies that almost all of the variability is due to factors that are
not entered into the model. They may be factors that were not even
measured in the study. Quitting smoking is a complex behavioral change:
it is believed to involve physical and psychological dependence, brain
chemistry and genetics, social pressures, economics, fears of weight
gain, co-existing chronic diseases, work restrictions, and many other
factors, and, probably, interactions of factors.
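The two R-square values on the printout can be recovered from the -2 Log L
values. The sketch below assumes that RSquare is the generalized R-square,
1 - exp(-(difference in -2 Log L)/n), and that the max-rescaled version
divides this by its largest attainable value, 1 - exp(-(-2 Log L for the
intercept-only model)/n). These formulas are stated here as an assumption;
they do reproduce the printed values for the final stepwise model.

```sas
data _null_ ;
   n      = 3916 ;        * number of observations analyzed ;
   neg2l0 = 5049.940 ;    * -2 Log L, intercept only ;
   neg2l1 = 4996.806 ;    * -2 Log L, intercept and 4 covariates ;

   rsq      = 1 - exp(-(neg2l0 - neg2l1) / n) ;
   maxrsq   = 1 - exp(-neg2l0 / n) ;
   rescaled = rsq / maxrsq ;

   put rsq= 8.4 rescaled= 8.4 ;
run ;
```

The log should show values that round to 0.0135 and 0.0186, matching the
RSquare and Max-rescaled RSquare lines of the final stepwise step.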
Another option on the MODEL statement that was used here is
LACKFIT, which requests the Hosmer-Lemeshow goodness-of-fit statistic.
Values of this statistic are compared to values from a chi-square
distribution. In this case, the chi-square distribution has 8 degrees of
freedom and the value of the statistic is 5.544. This yields a p-value of
0.6982. Thus this test does not provide any convincing evidence that the
fit of the model is not good.
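The p-value can be checked with the PROBCHI function, which returns the
cumulative chi-square probability. This is only a sketch of the
calculation, using the statistic and degrees of freedom from the printout.

```sas
data _null_ ;
   stat   = 5.544 ;                    * Hosmer-Lemeshow statistic ;
   df     = 8 ;                        * degrees of freedom ;
   pvalue = 1 - probchi(stat, df) ;    * upper-tail probability ;
   put pvalue= 8.4 ;
run ;
```

The value written to the log should agree with the p-value on the
printout.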
The second analysis shows the effect of adding two interaction
terms: years education with age (EDUCAGE) and years education with
body mass index (EDUCBMI). As shown at the top of the program,
each is defined simply as the product of the two variables.
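In releases of SAS where the MODEL statement of PROC LOGISTIC accepts
crossed effects, the product variables need not be created in a DATA step.
The sketch below is one way this might be written. It assumes such a
release, and it assumes that the variable EDUC used in the products at the
top of the program carries the same values as YEAREDUC.

```sas
proc logistic data = smoke descending ;
   where siuc eq 1 ;
   * yeareduc*age and yeareduc*bmi take the place of EDUCAGE and EDUCBMI ;
   model vpcquit1 = yeareduc age bmi yeareduc*age yeareduc*bmi
         / lackfit rsquare clodds = pl ;
run ;
```

Either way of specifying the interactions fits the same model; the
difference is only in where the product terms are constructed.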
The best way to assess whether adding these two interaction
terms makes a significant difference in the model is to examine
the change in -2 Log L from a model which does not include the
interaction terms, to another model which does. The results were
as follows:
Model vpcquit1 = yeareduc age bmi                   -2 Log L = 5004.708
Model vpcquit1 = yeareduc age bmi educage educbmi   -2 Log L = 5002.621
---------------------------------------------------------------------
                                              Diff(-2 Log L) =    2.087
This should be compared to a chi-squared distribution with 2
degrees of freedom. The p-value for testing the null hypothesis that
the two added interaction terms do not make a difference is: 0.3522.
This is not sufficient evidence to reject the null hypothesis.
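[Note that the first of these two models was not run as a separate
analysis. Its -2 Log L is the value printed at Step 3 of the stepwise
regression, at which point the model contained exactly YEAREDUC, AGE
and BMI.]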
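The likelihood-ratio calculation above can be sketched in a short DATA
step. The two -2 Log L values are copied from the printout, and the
degrees of freedom equal the number of added terms.

```sas
data _null_ ;
   neg2l_reduced = 5004.708 ;    * yeareduc age bmi ;
   neg2l_full    = 5002.621 ;    * yeareduc age bmi educage educbmi ;
   df            = 2 ;           * two interaction terms added ;

   diff   = neg2l_reduced - neg2l_full ;
   pvalue = 1 - probchi(diff, df) ;

   put diff= 8.3 pvalue= 8.4 ;
run ;
```

The log should show a difference of 2.087 and a p-value that rounds to
0.3522.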
One precaution applies when comparing models using -2 Log L. Say,
for example, you were comparing:
     Model 1:   model y = x ;
and  Model 2:   model y = x z ;
For the comparison of these two models based on -2 Log L, the
number of cases analyzed *must be the same* for these two models.
Note that it is possible that the covariate z may have missing values
in certain cases, while the covariate x is not missing in those same
cases. In such cases -2 Log L should NOT be used uncritically
to compare the effect of adding the covariate z to the model. You must
make sure that both analyses are carried out on the same dataset. Here
is how to do it:
Model 1:   proc logistic descending data = dataset ;
              where x ne . and z ne . ;
              model y = x ;
              title1 'Model 1: y vs x, restricted to cases where x and z are not missing.' ;
           run ;

Model 2:   proc logistic descending data = dataset ;
              where x ne . and z ne . ;
              model y = x z ;
              title1 'Model 2: y vs x and z, restricted to cases where x and z are not missing.' ;
           run ;
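An alternative, sketched below, is to build the restricted dataset once
and run every model on it, so the WHERE statement cannot be forgotten or
mistyped in one of the runs. The dataset name COMPLETE is hypothetical,
and the sketch assumes that y, x and z are all numeric.

```sas
* Keep only cases with no missing values among y, x and z ;
data complete ;
   set dataset ;
   if nmiss(y, x, z) = 0 ;
run ;

proc logistic descending data = complete ;
   model y = x ;
   title1 'Model 1: y vs x, complete cases only.' ;
run ;

proc logistic descending data = complete ;
   model y = x z ;
   title1 'Model 2: y vs x and z, complete cases only.' ;
run ;
```

Check that 'Number of Observations' is the same on both printouts before
comparing the -2 Log L values.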
=================================================================================
Problem 1.
Below is a program which reads a dataset based on a study of arthritis
treatment. Participants in the study received either a treatment drug
or a placebo. The outcome variable is called 'better'. Other variables
in the study are ntreat, age and gender.
You can use the 'Save as' option in the 'File' menu on a browser
to save this file, then edit it as you see fit.
Use logistic regression to analyze these data. Specifically:
Model 1: Age only as a predictor of 'better'.
Model 2: Gender only.
Model 3: ntreat only.
Model 4: Age and gender.
Model 5: Age and ntreat.
Model 6: Gender and ntreat.
Model 7: Age, gender and ntreat.
For each of these models, tabulate the odds ratios and 95%
confidence intervals for each variable in the model. Also for each
variable in these models, determine whether that variable makes a
statistically significant difference in the model: specify also exactly
what null hypothesis is being tested.
Test for interactions of:
Age with ntreat
Gender with ntreat
Age with gender
Looking over all the analyses you have carried out, specify the model
which you think is best, and justify your choice.
------------------------------------------------------------------------
options linesize = 80 ;
footnote "yourprogram.sas &sysdate &systime" ;
data arthrit;
length treat $7. sex $6. ;
input id treat $ sex $ age improve @@ ;
better = 0 ;
if improve > 0 then better = 1 ;
gender = 0 ;
if sex eq 'Female' then gender = 1 ;
ntreat = 0 ;
if treat eq 'Treated' then ntreat = 1 ;
cards ;
57 Treated Male 27 1 9 Placebo Male 37 0
46 Treated Male 29 0 14 Placebo Male 44 0
77 Treated Male 30 0 73 Placebo Male 50 0
17 Treated Male 32 2 74 Placebo Male 51 0
36 Treated Male 46 2 25 Placebo Male 52 0
23 Treated Male 58 2 18 Placebo Male 53 0
75 Treated Male 59 0 21 Placebo Male 59 0
39 Treated Male 59 2 52 Placebo Male 59 0
33 Treated Male 63 0 45 Placebo Male 62 0
55 Treated Male 63 0 41 Placebo Male 62 0
30 Treated Male 64 0 8 Placebo Male 63 2
5 Treated Male 64 1 80 Placebo Female 23 0
63 Treated Male 69 0 12 Placebo Female 30 0
83 Treated Male 70 2 29 Placebo Female 30 0
66 Treated Female 23 0 50 Placebo Female 31 1
40 Treated Female 32 0 38 Placebo Female 32 0
6 Treated Female 37 1 35 Placebo Female 33 2
7 Treated Female 41 0 51 Placebo Female 37 0
72 Treated Female 41 2 54 Placebo Female 44 0
37 Treated Female 48 0 76 Placebo Female 45 0
82 Treated Female 48 2 16 Placebo Female 46 0
53 Treated Female 55 2 69 Placebo Female 48 0
79 Treated Female 55 2 31 Placebo Female 49 0
26 Treated Female 56 2 20 Placebo Female 51 0
28 Treated Female 57 2 68 Placebo Female 53 0
60 Treated Female 57 2 81 Placebo Female 54 0
22 Treated Female 57 2 4 Placebo Female 54 0
27 Treated Female 58 0 78 Placebo Female 54 2
2 Treated Female 59 2 70 Placebo Female 55 2
59 Treated Female 59 2 49 Placebo Female 57 0
62 Treated Female 60 2 10 Placebo Female 57 1
84 Treated Female 61 2 47 Placebo Female 58 1
64 Treated Female 62 1 44 Placebo Female 59 1
34 Treated Female 62 2 24 Placebo Female 59 2
58 Treated Female 66 2 48 Placebo Female 61 0
13 Treated Female 67 2 19 Placebo Female 63 1
61 Treated Female 68 1 3 Placebo Female 64 0
65 Treated Female 68 2 67 Placebo Female 65 2
11 Treated Female 69 0 32 Placebo Female 66 0
56 Treated Female 69 1 42 Placebo Female 66 0
43 Treated Female 70 1 15 Placebo Female 66 1
71 Placebo Female 68 1
1 Placebo Female 74 2
;
run ;
--------------------------------------------------------------------------------------
Problem 2.
The dataset below is from a study of remission of cancer.
The outcome variable is 'remiss' (0 = no, 1 = yes). The covariates
of interest which are thought to be related to remission are cell,
smear, infil, li, blast, and temp. The dataset is shown below.
Carry out a stepwise logistic regression for these data. Include
a test for goodness of fit (the 'lackfit' option). Explain
and interpret the results of the final model.
data remission;
input remiss cell smear infil li blast temp;
datalines;
1 .8 .83 .66 1.9 1.1 .996
1 .9 .36 .32 1.4 .74 .992
0 .8 .88 .7 .8 .176 .982
0 1 .87 .87 .7 1.053 .986
1 .9 .75 .68 1.3 .519 .98
0 1 .65 .65 .6 .519 .982
1 .95 .97 .92 1 1.23 .992
0 .95 .87 .83 1.9 1.354 1.02
0 1 .45 .45 .8 .322 .999
0 .95 .36 .34 .5 0 1.038
0 .85 .39 .33 .7 .279 .988
0 .7 .76 .53 1.2 .146 .982
0 .8 .46 .37 .4 .38 1.006
0 .2 .39 .08 .8 .114 .99
0 1 .9 .9 1.1 1.037 .99
1 1 .84 .84 1.9 2.064 1.02
0 .65 .42 .27 .5 .114 1.014
0 1 .75 .75 1 1.322 1.004
0 .5 .44 .22 .6 .114 .99
1 1 .63 .63 1.1 1.072 .986
0 1 .33 .33 .4 .176 1.01
0 .9 .93 .84 .6 1.591 1.02
1 1 .58 .58 1 .531 1.002
0 .95 .32 .3 1.6 .886 .988
1 1 .6 .6 1.7 .964 .99
1 1 .69 .69 .9 .398 .986
0 1 .73 .73 .7 .398 .986
;
=================================================================================
n54703.011 Last update: March 28, 2006.