PROC LOGISTIC, III:  Continuous Covariates                        n54703.011

In the preceding discussions of PROC LOGISTIC we have focused on examples
in which categorical covariates (or 'predictors') are employed.  However,
PROC LOGISTIC, like PROC GLM, can be used with continuous covariates also.

If Y is a dichotomous outcome variable (e.g., success or failure of an
operation), and X1, X2, ..., Xp are p covariates, the usual logistic model is:

                                                   1
   Prob(Y = 1 | X1, X2, ..., Xp) = ------------------------------------------ ,
                                   1 + exp(-b0 - b1*X1 - b2*X2 - ... - bp*Xp)

where b0, b1, b2, ..., bp are unknown coefficients.

The covariates of interest are often continuous risk factors, like age or
systolic blood pressure or FEV1.  Age in particular is a risk factor for a
great many chronic conditions and for survival of serious illnesses: in
general you expect that the higher the age, the greater the probability of an
event.  In terms of the coefficients in the above model, that translates into
the coefficient of age being *positive*.

This also relates to the interpretation of exp(b1), which is an odds ratio:
the odds ratio associated with a 1-unit increase in the value of the
covariate X1, with the other covariates held fixed.

Appended is an example of logistic regression with several covariates.  The
data are from the Lung Health Study.  The outcome variable is smoking status
at Year 1:  VPCQUIT1 = 0 means the person did not quit smoking, whereas
VPCQUIT1 = 1 means the person did quit smoking.  The covariates are AGE at
baseline; GENDER (0 = men, 1 = women); BMI (body mass index); F10CIGS
(baseline cigarettes per day); YEAREDUC (years education category, 1-9); and
S2FEVPC2 (baseline FEV1 percent of normal).

The first regression uses the selection = stepwise option.  The second
regression shows the effect of adding in interaction terms.
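As a small numerical illustration of the model and of the odds-ratio
interpretation of exp(b1), here is a minimal sketch in Python, used only as a
calculator outside SAS.  The coefficient values b0 = -3.0 and b1 = 0.02 and the
ages 50, 51 and 60 are hypothetical, chosen purely for illustration; they are
not estimates from the Lung Health Study.

```python
import math

def logistic_prob(b0, coefs, xs):
    """Prob(Y = 1 | X1..Xp) = 1 / (1 + exp(-b0 - b1*X1 - ... - bp*Xp))."""
    linear = b0 + sum(b * x for b, x in zip(coefs, xs))
    return 1.0 / (1.0 + math.exp(-linear))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Hypothetical coefficients (assumptions for illustration only).
b0, b1 = -3.0, 0.02

p_50 = logistic_prob(b0, [b1], [50])
p_51 = logistic_prob(b0, [b1], [51])
p_60 = logistic_prob(b0, [b1], [60])

# Odds ratio for a 1-unit and a 10-unit increase in the covariate.
or_1 = odds(p_51) / odds(p_50)
or_10 = odds(p_60) / odds(p_50)

print("1-unit odds ratio :", or_1, "  exp(b1)    :", math.exp(b1))
print("10-unit odds ratio:", or_10, "  exp(10*b1) :", math.exp(10 * b1))
```

Each printed pair agrees up to rounding error.  Because the log odds are linear
in X1, the odds ratio for a k-unit increase is exp(k*b1) whatever the starting
value of X1.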
=================================================================================

     educage = age*educ ;
     educbmi = educ*bmi ;

* ==================================================================== ;

options linesize = 100 ;

proc logistic data = smoke descending ;
     where siuc eq 1 ;
     model vpcquit1 = age gender bmi f10cigs yeareduc s2fevpc2
           /selection = stepwise lackfit rsquare clodds = pl;
title1 'PROC LOGISTIC: quit-smoking at 1 year versus demographic vars' ;
title2 'and baseline FEV1 percent predicted. Stepwise option employed.' ;
title3 'Also lack-of-fit statistics, pseudo-R-square' ;

proc logistic data = smoke descending ;
     where siuc eq 1 ;
     model vpcquit1 = yeareduc age bmi educage educbmi
           /lackfit rsquare clodds = pl;
title1 'PROC LOGISTIC: quit-smoking at 1 year versus demographic vars' ;
title2 'and interactions. Also lack-of-fit statistics, pseudo-R-square' ;
title3 '' ;
run ;

endsas;

----------------------------------------------------------------------------------------------------
PROC LOGISTIC: quit-smoking at 1 year versus demographic vars                                      1
and baseline FEV1 percent predicted. Stepwise option employed.
Also lack-of-fit statistics, pseudo-R-square
                                                                      17:37 Saturday, March 13, 2004

The LOGISTIC Procedure

Data Set: WORK.SMOKE
Response Variable: VPCQUIT1     VALID QUIT AT AV1
Response Levels: 2
Number of Observations: 3916
Link Function: Logit

Response Profile

   Ordered
     Value  VPCQUIT1     Count

         1         1      1354
         2         0      2562

WARNING: 7 observation(s) were deleted due to missing values for the response or
         explanatory variables.

Stepwise Selection Procedure

Step  0. Intercept entered:

   Residual Chi-Square = 56.3119 with 6 DF (p=0.0001)

Step  1. Variable YEAREDUC entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC            5051.940     5026.333         .
   SC             5058.213     5038.878         .
   -2 LOG L       5049.940     5022.333       27.607 with 1 DF (p=0.0001)
   Score              .            .          27.458 with 1 DF (p=0.0001)

   RSquare = 0.0070     Max-rescaled RSquare = 0.0097

   Residual Chi-Square = 28.8407 with 5 DF (p=0.0001)

LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 13MAR04 17:37
----------------------------------------------------------------------------------------------------

Step  2. Variable AGE entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC            5051.940     5019.408         .
   SC             5058.213     5038.226         .
   -2 LOG L       5049.940     5013.408       36.533 with 2 DF (p=0.0001)
   Score              .            .          36.451 with 2 DF (p=0.0001)

   RSquare = 0.0093     Max-rescaled RSquare = 0.0128

   Residual Chi-Square = 19.9902 with 4 DF (p=0.0005)

Step  3. Variable BMI entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC            5051.940     5012.708         .
   SC             5058.213     5037.799         .
   -2 LOG L       5049.940     5004.708       45.232 with 3 DF (p=0.0001)
   Score              .            .          45.113 with 3 DF (p=0.0001)

   RSquare = 0.0115     Max-rescaled RSquare = 0.0158

   Residual Chi-Square = 11.2975 with 3 DF (p=0.0102)

Step  4. Variable S2FEVPC2 entered:

   Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC            5051.940     5006.806         .
   SC             5058.213     5038.170         .
   -2 LOG L       5049.940     4996.806       53.134 with 4 DF (p=0.0001)
   Score              .            .          52.875 with 4 DF (p=0.0001)

   RSquare = 0.0135     Max-rescaled RSquare = 0.0186

   Residual Chi-Square = 3.4348 with 2 DF (p=0.1795)

NOTE: No (additional) variables met the 0.05 significance level for entry into the model.

Summary of Stepwise Procedure

          Variable        Number     Score       Wald       Pr >      Variable
   Step   Entered Removed   In    Chi-Square  Chi-Square  Chi-Square  Label

      1   YEAREDUC           1      27.4576       .         0.0001    YEARS EDUCATION
      2   AGE                2       8.8999       .         0.0029    AGE AT ENTRY INTO LHS
      3   BMI                3       8.7354       .         0.0031    BODY MASS INDEX (KG/M2)
      4   S2FEVPC2           4       7.8697       .         0.0050    FEV1 % PRED POST-BD SCREEN 2

Analysis of Maximum Likelihood Estimates

                  Parameter  Standard     Wald        Pr >     Standardized    Odds
   Variable  DF    Estimate     Error  Chi-Square  Chi-Square    Estimate     Ratio

   INTERCPT   1     -3.8862    0.5186     56.1531      0.0001       .            .
   AGE        1      0.0177   0.00509     12.0865      0.0005    0.066710     1.018
   BMI        1      0.0263   0.00860      9.3313      0.0023    0.056911     1.027
   YEAREDUC   1      0.0640    0.0121     28.1388      0.0001    0.100014     1.066
   S2FEVPC2   1      0.0107   0.00383      7.8554      0.0051    0.053863     1.011

Association of Predicted Probabilities and Observed Responses

   Concordant = 56.4%          Somers' D = 0.139
   Discordant = 42.5%          Gamma     = 0.141
   Tied       =  1.0%          Tau-a     = 0.063
   (3468948 pairs)             c         = 0.570

Conditional Odds Ratios and 95% Confidence Intervals

                                  Profile Likelihood
                                   Confidence Limits
                         Odds
   Variable     Unit    Ratio       Lower      Upper

   AGE        1.0000    1.018       1.008      1.028
   BMI        1.0000    1.027       1.009      1.044
   YEAREDUC   1.0000    1.066       1.041      1.092
   S2FEVPC2   1.0000    1.011       1.003      1.018

Hosmer and Lemeshow Goodness-of-Fit Test

                        VPCQUIT1 = 1            VPCQUIT1 = 0
                    --------------------    --------------------
   Group   Total    Observed    Expected    Observed    Expected

       1     394          95      100.76         299      293.24
       2     392         127      113.73         265      278.27
       3     392         120      120.87         272      271.13
       4     394         123      127.16         271      266.84
       5     393         118      131.82         275      261.18
       6     391         142      136.21         249      254.79
       7     390         141      141.57         249      248.43
       8     393         152      149.16         241      243.84
       9     392         160      158.32         232      233.68
      10     385         176      174.29         209      210.71

   Goodness-of-fit Statistic = 5.544 with 8 DF (p=0.6982)

----------------------------------------------------------------------------------------------------
PROC LOGISTIC: quit-smoking at 1 year versus demographic vars                                      5
and interactions. Also lack-of-fit statistics, pseudo-R-square
'
                                                                      17:37 Saturday, March 13, 2004

The LOGISTIC Procedure

Data Set: WORK.SMOKE
Response Variable: VPCQUIT1     VALID QUIT AT AV1
Response Levels: 2
Number of Observations: 3916
Link Function: Logit

Response Profile

   Ordered
     Value  VPCQUIT1     Count

         1         1      1354
         2         0      2562

WARNING: 7 observation(s) were deleted due to missing values for the response or
         explanatory variables.

Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC            5051.940     5014.621         .
   SC             5058.213     5052.258         .
   -2 LOG L       5049.940     5002.621       47.319 with 5 DF (p=0.0001)
   Score              .            .          46.732 with 5 DF (p=0.0001)

   RSquare = 0.0120     Max-rescaled RSquare = 0.0166

Analysis of Maximum Likelihood Estimates

                  Parameter  Standard     Wald        Pr >     Standardized    Odds   Variable
   Variable  DF    Estimate     Error  Chi-Square  Chi-Square    Estimate     Ratio   Label

   INTERCPT   1     -5.1471    1.6542      9.6815      0.0019       .            .    Intercept
   YEAREDUC   1      0.2289    0.1177      3.7839      0.0517    0.357704     1.257   YEARS EDUCATION
   AGE        1      0.0459    0.0251      3.3474      0.0673    0.173002     1.047   AGE AT ENTRY INTO LHS
   BMI        1      0.0525    0.0431      1.4794      0.2239    0.113603     1.054   BODY MASS INDEX (KG/M2)
   EDUCAGE    1    -0.00225   0.00177      1.6187      0.2033   -0.203210     0.998
   EDUCBMI    1    -0.00201   0.00310      0.4188      0.5175   -0.100583     0.998

Association of Predicted Probabilities and Observed Responses

   Concordant = 56.1%          Somers' D = 0.134
   Discordant = 42.8%          Gamma     = 0.135
   Tied       =  1.1%          Tau-a     = 0.060
   (3468948 pairs)             c         = 0.567

Conditional Odds Ratios and 95% Confidence Intervals

                                  Profile Likelihood
                                   Confidence Limits
                         Odds
   Variable     Unit    Ratio       Lower      Upper

   YEAREDUC   1.0000    1.257       0.998      1.583
   AGE        1.0000    1.047       0.997      1.100
   BMI        1.0000    1.054       0.968      1.147
   EDUCAGE    1.0000    0.998       0.994      1.001
   EDUCBMI    1.0000    0.998       0.992      1.004

Hosmer and Lemeshow Goodness-of-Fit Test

                        VPCQUIT1 = 1            VPCQUIT1 = 0
                    --------------------    --------------------
   Group   Total    Observed    Expected    Observed    Expected

       1     393          98      100.53         295      292.47
       2     391         121      114.67         270      276.33
       3     393         121      122.35         272      270.65
       4     394         109      127.97         285      266.03
       5     392         135      132.52         257      259.48
       6     394         148      137.97         246      256.03
       7     392         146      142.65         246      249.35
       8     393         151      149.32         242      243.68
       9     394         170      158.21         224      235.79
      10     380         155      167.68         225      212.32

   Goodness-of-fit Statistic = 9.298 with 8 DF (p=0.3178)
=================================================================================

The stepwise regression selected 4 of the covariates as useful predictors of
the outcome (quitting smoking at Year 1): years of education, age, body mass
index, and baseline FEV1 percent of normal.  All had positive coefficients,
indicating that higher levels of any of them are associated with an increased
chance of having quit smoking by Year 1.

Note that the coefficient for AGE is 0.0177.  The odds ratio corresponding to
a 1-year increase in age is exp(0.0177) = 1.018.  This is obviously not very
different from 1.  In fact you might not be very interested in the effect of
only a 1-year increase in age.  You might be more interested in the effect of
a 10-year increment in age.  This can be calculated also.  The odds ratio of
quitting smoking for a 10-year increase in age is:

     exp(10*.0177) = exp(.177) = 1.1936.

SAS will compute a 'pseudo-R-square' for logistic regressions (the rsquare
option on the MODEL statement).  This is analogous to the R-square statistic
for ordinary linear regression.  If R-square is close to 1, it is an
indication that much of the variability in outcome is due to the regression,
i.e., is explained by the covariates.  In this case the final value of
R-square is very small: 0.0135.  This implies that almost all of the
variability is due to factors that are not entered into the model.  They may
be factors that were not even measured in the study.  Quitting smoking is a
complex behavioral change: it is believed to involve physical and
psychological dependence, brain chemistry and genetics, social pressures,
economics, fears of weight gain, co-existing chronic diseases, work
restrictions, and many other factors, and, probably, interactions of factors.

Another option on the MODEL statement that was used here is lackfit, which
requests the Hosmer-Lemeshow goodness-of-fit statistic.  Values of this
statistic are compared to values from a chi-square distribution.  In this
case, the chi-square distribution has 8 degrees of freedom and the value of
the statistic is 5.544.  This yields a p-value of 0.6982.  Thus this test
provides no convincing evidence of lack of fit.

The second analysis shows the effect of adding two interaction terms: years
education with age (EDUCAGE) and years education with body mass index
(EDUCBMI).  As shown at the top of the program, these are both defined simply
as the product of the two variables.

The best way to assess whether adding these two interaction terms makes a
significant difference in the model is to examine the change in -2 Log L from
a model which does not include the interaction terms to another model which
does.  The results were as follows:

   Model vpcquit1 = yeareduc age bmi                    -2 Log L = 5004.708
   Model vpcquit1 = yeareduc age bmi educage educbmi    -2 Log L = 5002.621
   ---------------------------------------------------------------------
                                                  Diff(-2 Log L) =    2.087

This should be compared to a chi-squared distribution with 2 degrees of
freedom, one for each added term.  The p-value for testing the null hypothesis
that the coefficients of both added interaction terms are zero is 0.3522.
This is not sufficient evidence to reject the null hypothesis.  [Note that the
first of these two models was not run separately, so its full results are not
shown in the printout; its -2 Log L is the value reported at Step 3 of the
stepwise selection, when YEAREDUC, AGE and BMI were in the model.]

One precaution about comparing models using -2 Log L.  Say, for example, you
were comparing:

   Model 1:     model y = x ;       and
   Model 2:     model y = x z ;

For the comparison of these two models based on -2 Log L, the number of cases
analyzed *must be the same* for these two models.  Note that it is possible
that the covariate z may have missing values in certain cases, while the
covariate x is not missing in those same cases.  In such cases -2 Log L should
NOT be used uncritically to compare the effect of adding the covariate z to
the model.  You must make sure that both analyses are carried out on the same
dataset.
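The likelihood-ratio comparison and the equal-case-count precaution can be
combined in a short check.  The following is a minimal sketch in Python, used
only as a calculator outside SAS; the function name lr_test_2df is
hypothetical, and the closed form exp(-x/2) for the upper-tail probability is
valid only for a chi-square distribution with exactly 2 degrees of freedom.
The -2 Log L values and observation counts are copied from the printout.

```python
import math

def lr_test_2df(neg2logl_reduced, neg2logl_full, n_reduced, n_full):
    """Likelihood-ratio test for two added terms (2 degrees of freedom).

    Hypothetical helper for illustration; not part of SAS.
    """
    # The comparison is only meaningful when both fits used the same cases.
    # Equal counts are a necessary check, not a guarantee of identical cases.
    if n_reduced != n_full:
        raise ValueError("models were fit to different numbers of cases")
    stat = neg2logl_reduced - neg2logl_full
    # For exactly 2 DF the chi-square upper-tail probability is exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Values from the printout; both runs report 3916 observations.
stat, p = lr_test_2df(5004.708, 5002.621, n_reduced=3916, n_full=3916)
print(f"Diff(-2 Log L) = {stat:.3f}, p = {p:.4f}")
```

The check on the counts catches the most common problem (a covariate with extra
missing values), but it cannot confirm that the two fits used the very same
cases; restricting both analyses explicitly is the safer route.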
Here is how to restrict both models to the same cases:

Model 1:

     proc logistic descending data = dataset ;
          where x ne . and z ne . ;
          model y = x ;
     title1 'Model 1: y vs x, restricted to cases where x and z are not missing.' ;
     run ;

Model 2:

     proc logistic descending data = dataset ;
          where x ne . and z ne . ;
          model y = x z ;
     title1 'Model 2: y vs x and z, restricted to cases where x and z are not missing.' ;
     run ;

=================================================================================

Problem 1.

Below is a program which reads a dataset based on a study of arthritis
treatment.  Participants in the study received either a treatment drug or a
placebo.  The outcome variable is called 'better'.  Other variables in the
study are ntreat, age and gender.  You can use the 'Save as' option in the
'File' menu on a browser to save this file, then edit it as you see fit.

Use logistic regression to analyze these data.  Specifically:

     Model 1:  Age only as a predictor of 'better'.
     Model 2:  Gender only.
     Model 3:  ntreat only.
     Model 4:  Age and gender.
     Model 5:  Age and ntreat.
     Model 6:  Gender and ntreat.
     Model 7:  Age, gender and ntreat.

For each of these models, tabulate the odds ratios and 95% confidence
intervals for each variable in the model.  Also for each variable in these
models, determine whether that variable makes a statistically significant
difference in the model: specify also exactly what null hypothesis is being
tested.

Test for interactions of:

     Age with ntreat
     Gender with ntreat
     Age with gender

Looking over all the analyses you have carried out, specify the model which
you think is best, and justify your choice.

------------------------------------------------------------------------

options linesize = 80 ;
footnote "yourprogram.sas &sysdate &systime" ;

data arthrit;

     length treat $7. sex $6. ;
     input id treat $ sex $ age improve @@ ;

     better = 0 ;
     if improve > 0 then better = 1 ;

     gender = 0 ;
     if sex eq 'Female' then gender = 1 ;

     ntreat = 0 ;
     if treat eq 'Treated' then ntreat = 1 ;

cards ;
57 Treated Male   27 1    9 Placebo Male   37 0
46 Treated Male   29 0   14 Placebo Male   44 0
77 Treated Male   30 0   73 Placebo Male   50 0
17 Treated Male   32 2   74 Placebo Male   51 0
36 Treated Male   46 2   25 Placebo Male   52 0
23 Treated Male   58 2   18 Placebo Male   53 0
75 Treated Male   59 0   21 Placebo Male   59 0
39 Treated Male   59 2   52 Placebo Male   59 0
33 Treated Male   63 0   45 Placebo Male   62 0
55 Treated Male   63 0   41 Placebo Male   62 0
30 Treated Male   64 0    8 Placebo Male   63 2
 5 Treated Male   64 1   80 Placebo Female 23 0
63 Treated Male   69 0   12 Placebo Female 30 0
83 Treated Male   70 2   29 Placebo Female 30 0
66 Treated Female 23 0   50 Placebo Female 31 1
40 Treated Female 32 0   38 Placebo Female 32 0
 6 Treated Female 37 1   35 Placebo Female 33 2
 7 Treated Female 41 0   51 Placebo Female 37 0
72 Treated Female 41 2   54 Placebo Female 44 0
37 Treated Female 48 0   76 Placebo Female 45 0
82 Treated Female 48 2   16 Placebo Female 46 0
53 Treated Female 55 2   69 Placebo Female 48 0
79 Treated Female 55 2   31 Placebo Female 49 0
26 Treated Female 56 2   20 Placebo Female 51 0
28 Treated Female 57 2   68 Placebo Female 53 0
60 Treated Female 57 2   81 Placebo Female 54 0
22 Treated Female 57 2    4 Placebo Female 54 0
27 Treated Female 58 0   78 Placebo Female 54 2
 2 Treated Female 59 2   70 Placebo Female 55 2
59 Treated Female 59 2   49 Placebo Female 57 0
62 Treated Female 60 2   10 Placebo Female 57 1
84 Treated Female 61 2   47 Placebo Female 58 1
64 Treated Female 62 1   44 Placebo Female 59 1
34 Treated Female 62 2   24 Placebo Female 59 2
58 Treated Female 66 2   48 Placebo Female 61 0
13 Treated Female 67 2   19 Placebo Female 63 1
61 Treated Female 68 1    3 Placebo Female 64 0
65 Treated Female 68 2   67 Placebo Female 65 2
11 Treated Female 69 0   32 Placebo Female 66 0
56 Treated Female 69 1   42 Placebo Female 66 0
43 Treated Female 70 1   15 Placebo Female 66 1
71 Placebo Female 68 1    1 Placebo Female 74 2
;
run ;

--------------------------------------------------------------------------------------

Problem 2.

The dataset below is from a study of remission of cancer.  The outcome
variable is 'remiss' (0 = no, 1 = yes).  The covariates of interest which are
thought to be related to remission are cell, smear, infil, li, blast, and
temp.

Carry out a stepwise logistic regression for these data.  Include a test for
goodness of fit (the 'lackfit' option).  Explain and interpret the results of
the final model.

data remission;
     input remiss cell smear infil li blast temp;
datalines;
1   .8   .83  .66  1.9  1.1     .996
1   .9   .36  .32  1.4   .74    .992
0   .8   .88  .7    .8   .176   .982
0  1     .87  .87   .7  1.053   .986
1   .9   .75  .68  1.3   .519   .98
0  1     .65  .65   .6   .519   .982
1   .95  .97  .92  1    1.23    .992
0   .95  .87  .83  1.9  1.354  1.02
0  1     .45  .45   .8   .322   .999
0   .95  .36  .34   .5  0      1.038
0   .85  .39  .33   .7   .279   .988
0   .7   .76  .53  1.2   .146   .982
0   .8   .46  .37   .4   .38   1.006
0   .2   .39  .08   .8   .114   .99
0  1     .9   .9   1.1  1.037   .99
1  1     .84  .84  1.9  2.064  1.02
0   .65  .42  .27   .5   .114  1.014
0  1     .75  .75  1    1.322  1.004
0   .5   .44  .22   .6   .114   .99
1  1     .63  .63  1.1  1.072   .986
0  1     .33  .33   .4   .176  1.01
0   .9   .93  .84   .6  1.591  1.02
1  1     .58  .58  1     .531  1.002
0   .95  .32  .3   1.6   .886   .988
1  1     .6   .6   1.7   .964   .99
1  1     .69  .69   .9   .398   .986
0  1     .73  .73   .7   .398   .986
;

=================================================================================

n54703.011  Last update: March 28, 2006.