PROC REG, II : Multivariate regression, collinearity, and other topics                n54703.005

Multivariate Regression:

Multivariate regression is a natural generalization of simple linear
regression.  The model may be stated as

     Y = a0 + a1*X1 + a2*X2 + ... + ap*Xp + e,

where as before the error term e has mean 0 and standard deviation sigma,
and the variables X1, X2, ..., Xp are possible predictors of the outcome Y.
The coefficients a0, a1, a2, ..., ap are unknown.  The object of the
regression is to estimate values of a0, a1, ..., ap such that the resulting
predicted values agree as closely as possible with the observed data.  If
there are two predicting variables, the model corresponds to the flat plane
in 3-dimensional space which provides the best fit to a cloud of observed
data points.

This is called a *linear model* because the expression on the right side of
the equation is a linear function of the predictors.  It is reasonable to
assume a linear model in many situations, but there are a great many others
where a linear model is not reasonable.

The following questions arise in dealing with multivariate linear models:

1.  How can the data be displayed?

2.  What if some of the predictors are highly correlated with each other?
    Should they all be entered into the model?

3.  What predictors should be entered into the model?  What subset of
    possible predictors provides an adequate fit to the data?

4.  What do you do with residuals in multivariate regression?  How do you
    look for nonlinearity?

5.  How can you compare regression models?

We consider these in succession in the following:

1.  How can the data be displayed?

It is a good idea to examine the relationship of the outcome variable Y to
each of the predictors.  An example of this is on page 84 of the
Der-Everitt text.  The figure there is generated by a SAS macro and shows
small scatterplots of all the variables versus each other.  The 'outcome'
variable for that dataset is R, the crime rate.  The relationship of R to
the other variables is shown in the leftmost column of that figure.

This is one useful way to show multivariate relationships, but essentially
it just shows a kind of 2-dimensional shadow of a cloud of points in
(p + 1)-dimensional space (if there are p predictors).  Since we humans are
not able to visualize spaces of dimension higher than 3, we cannot really
display all the data for this kind of model unless p = 1 or 2.

Another graphical approach is like that shown in Chapter 2 of the text:
3-D plots of smoothed bivariate densities.  This is accomplished using
proc kde (kernel density estimation) and proc g3d, and requires access to
graphics screens.  A program to do this for crime rate versus unemployment
in the Chapter 4 dataset on U.S. crime rates is given below.  The
smoothed-density plot actually does not contain any more information than a
simple scatterplot: it depicts the relationship of two variables.  The
resulting 3-D plot may be seen at:

     http://www.biostat.umn.edu/%7Ejohn-c/5421/crimevsunemp2.grf

However, it is possible to show 3-dimensional relationships: e.g., observed
crime rates versus two covariates.
The following figure shows this; the program below generates the plot in
the second "proc g3d" section:

     http://www.biostat.umn.edu/%7Ejohn-c/5421/crime.vs.educ.unemp2.grf

========================================================================

footnote "~john-c/5421/chap4.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;
OPTIONS LINESIZE = 150 MPRINT ;
GOPTIONS RESET = GLOBAL ROTATE = LANDSCAPE FTEXT = SWISSB DEVICE = GIF
         GACCESS = SASGASTD GSFNAME = GRAPH GSFMODE = REPLACE GUNIT = PCT
         BORDER CBACK = WHITE HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;

/* Chapter 4 */

data uscrime;
*  infile 'n:\handbook2\datasets\uscrime.dat' expandtabs;
   infile cards expandtabs ;
   input R Age S Ed Ex0 Ex1 LF M N NW U1 U2 W X;
datalines ;
 79.1 151 1  91  58  56 510  950  33 301 108 41 394 261
163.5 143 0 113 103  95 583 1012  13 102  96 36 557 194
 57.8 142 1  89  45  44 533  969  18 219  94 33 318 250
196.9 136 0 121 149 141 577  994 157  80 102 39 673 167
123.4 141 0 121 109 101 591  985  18  30  91 20 578 174
 68.2 121 0 110 118 115 547  964  25  44  84 29 689 126
 96.3 127 1 111  82  79 519  982   4 139  97 38 620 168
155.5 131 1 109 115 109 542  969  50 179  79 35 472 206
 85.6 157 1  90  65  62 553  955  39 286  81 28 421 239
 70.5 140 0 118  71  68 632 1029   7  15 100 24 526 174
167.4 124 0 105 121 116 580  966 101 106  77 35 657 170
 84.9 134 0 108  75  71 595  972  47  59  83 31 580 172
 51.1 128 0 113  67  60 624  972  28  10  77 25 507 206
 66.4 135 0 117  62  61 595  986  22  46  77 27 529 190
 79.8 152 1  87  57  53 530  986  30  72  92 43 405 264
 94.6 142 1  88  81  77 497  956  33 321 116 47 427 247
 53.9 143 0 110  66  63 537  977  10   6 114 35 487 166
 92.9 135 1 104 123 115 537  978  31 170  89 34 631 165
 75.0 130 0 116 128 128 536  934  51  24  78 34 627 135
122.5 125 0 108 113 105 567  985  78  94 130 58 626 166
 74.2 126 0 108  74  67 602  984  34  12 102 33 557 195
 43.9 157 1  89  47  44 512  962  22 423  97 34 288 276
121.6 132 0  96  87  83 564  953  43  92  83 32 513 227
 96.8 131 0 116  78  73 574 1038   7  36 142 42 540 176
 52.3 130 0 116  63  57 641  984  14  26  70 21 486 196
199.3 131 0 121 160 143 631 1071   3  77 102 41 674 152
 34.2 135 0 109  69  71 540  965   6   4  80 22 564 139
121.6 152 0 112  82  76 571 1018  10  79 103 28 537 215
104.3 119 0 107 166 157 521  938 168  89  92 36 637 154
 69.6 166 1  89  58  54 521  973  46 254  72 26 396 237
 37.3 140 0  93  55  54 535 1045   6  20 135 40 453 200
 75.4 125 0 109  90  81 586  964  97  82 105 43 617 163
107.2 147 1 104  63  64 560  972  23  95  76 24 462 233
 92.3 126 0 118  97  97 542  990  18  21 102 35 589 166
 65.3 123 0 102  97  87 526  948 113  76 124 50 572 158
127.2 150 0 100 109  98 531  964   9  24  87 38 559 153
 83.1 177 1  87  58  56 638  974  24 349  76 28 382 254
 56.6 133 0 104  51  47 599 1024   7  40  99 27 425 225
 82.6 149 1  88  61  54 515  953  36 165  86 35 395 251
115.1 145 1 104  82  74 560  981  96 126  88 31 488 228
 88.0 148 0 122  72  66 601  998   9  19  84 20 590 144
 54.2 141 0 109  56  54 523  968   4   2 107 37 489 170
 82.3 162 1  99  75  70 522  996  40 208  73 27 496 224
103.0 136 0 121  95  96 574 1012  29  36 111 37 622 162
 45.5 139 1  88  46  41 480  968  19  49 135 53 457 249
 50.8 126 0 104 106  97 599  989  40  24  78 25 593 171
 84.9 130 0 121  90  91 623 1049   3  22 113 40 588 160
;
run ;

proc print data = uscrime ;

proc kde data = uscrime out = bivest ;
     var r u2 ;
run ;

proc g3d data = bivest ;
     plot r * u2 = density ;
     title1 'Plot of Bivariate Density: Crime Rate vs Unemp 35-39' ;
run ;

proc g3d data = uscrime ;
     scatter ed * u2 = r / shape = 'prism' ;
     title1 'Plot of crime rates vs education level and unemployment' ;
run ;
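If graphics devices are not available, a line-printer approximation to the
leftmost column of the page-84 scatterplot matrix can be obtained with
plain proc plot.  A minimal sketch, assuming the uscrime dataset above (the
parenthesized list requests one plot of R against each predictor):

     proc plot data = uscrime ;
          plot r * (age s ed ex0 ex1 lf m n nw u1 u2 w x) ;
          title1 'Crime rate R versus each candidate predictor' ;
     run ;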
========================================================================

2.  What if some of the predictors are highly correlated with each other?
    Should they all be entered into the model?

The plot on page 84 of the text shows strong correlations between some of
the variables: particularly between EX0 and EX1 (both related to
expenditures for police), and between W and X (both related to income).
Their correlation (obtainable from proc corr) is greater than 0.99.
Entering highly correlated variables as predictors in a regression can
lead to erroneous conclusions and should be avoided.  'Highly correlated'
and 'collinear' mean essentially the same thing.

SAS includes tests for collinearity.  One such is based on something
called the 'Variance Inflation Factor' (VIF).  This can be invoked in
proc reg as follows (see also p. 85 of the Der-Everitt text):

     proc reg data = uscrime ;
          model R = Age--X / vif ;
     run ;

Printout for this is shown on page 86.  Note that variables EX0 and EX1
have very high VIFs: 94.6 and 98.6 respectively.  A rule of thumb is that
VIFs > 10 are not good.  A high VIF indicates that the variable in
question is highly correlated with the ensemble of other variables in the
analysis.

What do you do if the VIF is very large for a given variable?  In this
case, seeing the VIFs for EX0 and EX1 and also noting that they are very
highly correlated with each other, you should enter only one of them in
the model.  Given their close agreement, it does not matter much which one
you choose.  In general, one would omit variables from the model one at a
time until none of those left has a high VIF.

Another, more systematic approach to avoiding collinearity of predictors
is to use stepwise regression.  This will select a subset of the
predictors which does a good job of explaining most of the variability in
the outcome variable, without adding in variables that are highly
correlated with others.  This is discussed in more detail below.

3.  What predictors should be entered into the model?  What subset of
    possible predictors provides an adequate fit to the data?

An outcome variable Y may be strongly related to some of the other
variables in your dataset, and weakly related, or even unrelated, to
others.  You can enter many possible predictors in a multivariate
regression, and in general the procedure will give you back the usual
output: an ANOVA table, an F-test for the overall effect of all the
covariates on the outcome, R-square, coefficient estimates, standard
errors, etc.  However, if variables are entered into the model which are
really unrelated to the outcome, the effect is to decrease the apparent
effect of other variables which are genuinely related.  That is, you pay a
price for entering useless predictors: the price is to underestimate the
effects of the genuine predictors.

Proc reg incorporates ways to select "best" models: that is, models which
do not include variables that lack strong relationships to the outcome
variable.  Proc reg makes use of *stepwise* algorithms, which can enter or
delete variables from a model depending on certain variable selection
criteria.  One approach, called *backward* stepwise selection, starts by
putting all the variables in the model and then omits the least useful
variables one at a time until the variables left in the model all
contribute significantly to prediction of the outcome (see the sketch
below).
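For the crime data, backward elimination might be requested as in the
following minimal sketch.  SLSTAY is the significance level a variable
must meet to remain in the model; the value 0.05 is just an illustrative
choice, and the option can be omitted to accept the procedure's default:

     proc reg data = uscrime ;
          model R = Age--X / selection = backward slstay = 0.05 ;
          title1 'Backward elimination: U.S. crime data' ;
     run ;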
The other approach is called *forward* stepwise regression, which starts
with only the intercept term in the model and adds candidate variables one
at a time until all the potentially useful variables have been added.  It
may happen that a given predictor X by itself is fairly strongly related
to the outcome, but when other variables which are correlated with X are
already in the model, the variable X itself does not make any useful
addition.  There is a variant of stepwise regression which is similar to
'forward' except that variables which are entered in one step may be
removed in a subsequent step.  This gives quite a lot of flexibility for
variable selection.

Below is an example of the use of this stepwise variant with data from
the Lung Health Study: the outcome variable is baseline FEV1, and the
possible predictors are other baseline covariates:

==================================================================================

PROC CORR DATA = SMOKE ;
     VAR S2FEVPOS AGE PACKYEAR GENDER HGTM WGTKG AFROAMER F10CIGS
         DRINKS COUGH30 ;
     TITLE1 'Correlations of Baseline Variables in the Lung Health Study' ;

PROC REG DATA = SMOKE ;
     MODEL S2FEVPOS = AGE PACKYEAR GENDER HGTM WGTKG AFROAMER F10CIGS
                      DRINKS COUGH30 / VIF SELECTION = STEPWISE ;
     TITLE1 'Proc reg: use of stepwise and VIF in Lung Health Study data' ;

endsas ;

--------------------------------------------------------------------------------

Correlations of Baseline Variables in the Lung Health Study

Correlation Analysis

10 'VAR' Variables:  S2FEVPOS AGE PACKYEAR GENDER HGTM WGTKG AFROAMER
                     F10CIGS DRINKS COUGH30

                         Simple Statistics

Variable        N         Mean      Std Dev           Sum
S2FEVPOS     5885     2.745760     0.627641         16159
AGE          5887    48.468660     6.825130        285335
PACKYEAR     5885    40.450127    19.126189        238049
GENDER       5887     0.371157     0.483155   2185.000000
HGTM         5887     1.719552     0.089076         10123
WGTKG        5886    76.005590    15.072866        447369
AFROAMER     5887     0.038220     0.191743    225.000000
F10CIGS      5887    31.272804    12.851235        184103
DRINKS       5887     4.353321     5.537889         25628
COUGH30      5887     0.422626     0.494019   2488.000000

                         Simple Statistics

Variable       Minimum       Maximum   Label
S2FEVPOS      1.150000      4.940000
AGE          34.000000     67.000000   AGE AT ENTRY INTO LHS
PACKYEAR             0    190.000000   PACK YEARS OF CIG SMOKING
GENDER               0      1.000000   GENDER 0 = MEN, 1 = WOMEN
HGTM          1.420000      2.160000
WGTKG        38.400000    136.000000
AFROAMER             0      1.000000   ETHNIC GROUP: AFRO-AMER OR NOT
F10CIGS       2.000000    120.000000   CIGS PER DAY AT SCREEN 1
DRINKS               0     25.000000   DRINKS PER WEEK AMONG ALL
COUGH30              0      1.000000

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
/ Number of Observations

                           S2FEVPOS       AGE  PACKYEAR    GENDER      HGTM

S2FEVPOS                    1.00000  -0.35781  -0.07002  -0.70189   0.76342
                             0.0       0.0001    0.0001    0.0001    0.0001
                             5885      5885      5883      5885      5885

AGE                        -0.35781   1.00000   0.41517   0.00938  -0.07931
AGE AT ENTRY INTO LHS        0.0001    0.0       0.0001    0.4720    0.0001
                             5885      5887      5885      5887      5887

PACKYEAR                   -0.07002   0.41517   1.00000  -0.16390   0.08554
PACK YEARS OF CIG SMOKING    0.0001    0.0001    0.0       0.0001    0.0001
                             5883      5885      5885      5885      5885

GENDER                     -0.70189   0.00938  -0.16390   1.00000  -0.70252
GENDER 0 = MEN, 1 = WOMEN    0.0001    0.4720    0.0001    0.0       0.0001
                             5885      5887      5885      5887      5887

HGTM                        0.76342  -0.07931   0.08554  -0.70252   1.00000
                             0.0001    0.0001    0.0001    0.0001    0.0
                             5885      5887      5885      5887      5887

WGTKG                       0.52249  -0.03590   0.12352  -0.56202   0.65363
                             0.0001    0.0059    0.0001    0.0001    0.0001
                             5884      5886      5884      5886      5886

AFROAMER                   -0.13084   0.01020  -0.09143   0.02474  -0.01591
ETHNIC GROUP:                0.0001    0.4340    0.0001    0.0577    0.2224
AFRO-AMER OR NOT             5885      5887      5885      5887      5887
F10CIGS                     0.06159  -0.06894   0.47318  -0.14237   0.09523
CIGS PER DAY AT SCREEN 1     0.0001    0.0001    0.0001    0.0001    0.0001
                             5885      5887      5885      5887      5887

DRINKS                      0.10331   0.01647   0.02992  -0.12344   0.09372
DRINKS PER WEEK AMONG ALL    0.0001    0.2065    0.0217    0.0001    0.0001
                             5885      5887      5885      5887      5887

COUGH30                    -0.04240  -0.02897   0.09957   0.00752  -0.02272
                             0.0011    0.0262    0.0001    0.5641    0.0813
                             5885      5887      5885      5887      5887

                              WGTKG  AFROAMER   F10CIGS    DRINKS   COUGH30

S2FEVPOS                    0.52249  -0.13084   0.06159   0.10331  -0.04240
                             0.0001    0.0001    0.0001    0.0001    0.0011
                             5884      5885      5885      5885      5885

AGE                        -0.03590   0.01020  -0.06894   0.01647  -0.02897
AGE AT ENTRY INTO LHS        0.0059    0.4340    0.0001    0.2065    0.0262
                             5886      5887      5887      5887      5887

PACKYEAR                    0.12352  -0.09143   0.47318   0.02992   0.09957
PACK YEARS OF CIG SMOKING    0.0001    0.0001    0.0001    0.0217    0.0001
                             5884      5885      5885      5885      5885

GENDER                     -0.56202   0.02474  -0.14237  -0.12344   0.00752
GENDER 0 = MEN, 1 = WOMEN    0.0001    0.0577    0.0001    0.0001    0.5641
                             5886      5887      5887      5887      5887

HGTM                        0.65363  -0.01591   0.09523   0.09372  -0.02272
                             0.0001    0.2224    0.0001    0.0001    0.0813
                             5886      5887      5887      5887      5887

WGTKG                       1.00000   0.00712   0.12452   0.02545  -0.00144
                             0.0       0.5852    0.0001    0.0508    0.9120
                             5886      5886      5886      5886      5886

AFROAMER                    0.00712   1.00000  -0.11793  -0.02648  -0.07729
ETHNIC GROUP:                0.5852    0.0       0.0001    0.0422    0.0001
AFRO-AMER OR NOT             5886      5887      5887      5887      5887

F10CIGS                     0.12452  -0.11793   1.00000   0.05608   0.17580
CIGS PER DAY AT SCREEN 1     0.0001    0.0001    0.0       0.0001    0.0001
                             5886      5887      5887      5887      5887

DRINKS                      0.02545  -0.02648   0.05608   1.00000   0.03986
DRINKS PER WEEK AMONG ALL    0.0508    0.0422    0.0001    0.0       0.0022
                             5886      5887      5887      5887      5887

COUGH30                    -0.00144  -0.07729   0.17580   0.03986   1.00000
                             0.9120    0.0001    0.0001    0.0022    0.0
                             5886      5887      5887      5887      5887

Proc reg: use of stepwise and VIF in Lung Health Study data

Stepwise Procedure for Dependent Variable S2FEVPOS

Step 1   Variable HGTM Entered   R-square = 0.58277189   C(p) = 4107.9115432

                 DF      Sum of Squares       Mean Square          F    Prob>F
Regression        1       1350.30512662     1350.30512662    8213.01    0.0001
Error          5880        966.73373487        0.16441050
Total          5881       2317.03886149

                Parameter        Standard          Type II
Variable         Estimate           Error   Sum of Squares          F   Prob>F
INTERCEP      -6.50720345      0.10223574     666.05914651    4051.20   0.0001
HGTM           5.38101003      0.05937623    1350.30512662    8213.01   0.0001

Bounds on condition number:            1,            1
--------------------------------------------------------------------------------

Step 2   Variable AGE Entered    R-square = 0.67130902   C(p) = 1990.8694694

                 DF      Sum of Squares       Mean Square          F    Prob>F
Regression        2       1555.44908864      777.72454432    6003.55    0.0001
Error          5879        761.58977285        0.12954410
Total          5881       2317.03886149

                Parameter        Standard          Type II
Variable         Estimate           Error   Sum of Squares          F   Prob>F
INTERCEP      -4.88569345      0.09947821     312.47421230    2412.11   0.0001
AGE           -0.02745985      0.00069005     205.14396202    1583.58   0.0001
HGTM           5.21202416      0.05287638    1258.65595418    9716.04   0.0001

Bounds on condition number:     1.006492,     4.025966
--------------------------------------------------------------------------------

>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Steps 3-7 OMITTED. <<<<<<<<<<<<<<<<<<<<<<<<<<<<
--------------------------------------------------------------------------------

Step 8   Variable PACKYEAR Entered   R-square = 0.75465759   C(p) = 8.01000201

                 DF      Sum of Squares       Mean Square          F    Prob>F
Regression        8       1748.57096190      218.57137024    2258.12    0.0001
Error          5873        568.46789959        0.09679344
Total          5881       2317.03886149

                Parameter        Standard          Type II
Variable         Estimate           Error   Sum of Squares          F   Prob>F
INTERCEP      -1.44889268      0.12197908      13.65675946     141.09   0.0001
AGE           -0.02841031      0.00069827     160.23325283    1655.41   0.0001
PACKYEAR      -0.00081371      0.00028272       0.80181738       8.28   0.0040
GENDER        -0.47249446      0.01197224     150.76096651    1557.55   0.0001
HGTM           3.42461517      0.06437664     273.91311523    2829.87   0.0001
AFROAMER      -0.39917697      0.02136117      33.80078310     349.21   0.0001
F10CIGS       -0.00269115      0.00038467       4.73734587      48.94   0.0001
DRINKS         0.00226678      0.00073970       0.90898139       9.39   0.0022
COUGH30       -0.04576476      0.00837485       2.89037675      29.86   0.0001

Bounds on condition number:     2.033289,     94.01067
--------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the
model.

Summary of Stepwise Procedure for Dependent Variable S2FEVPOS

        Variable         Number  Partial    Model
Step    Entered  Removed     In     R**2     R**2       C(p)          F  Prob>F  Label
   1    HGTM                  1   0.5828   0.5828  4107.9115  8213.0103  0.0001
   2    AGE                   2   0.0885   0.6713  1990.8695  1583.5840  0.0001  AGE AT ENTRY INTO LHS
   3    GENDER                3   0.0639   0.7352   462.7931  1419.2966  0.0001  GENDER 0 = MEN, 1 = WOMEN
   4    AFROAMER              4   0.0123   0.7476   169.3526   287.4031  0.0001  ETHNIC GROUP: AFRO-AMER OR NOT
   5    F10CIGS               5   0.0051   0.7527    49.9965   120.4542  0.0001  CIGS PER DAY AT SCREEN 1
   6    COUGH30               6   0.0012   0.7539    22.1442    29.7755  0.0001
   7    DRINKS                7   0.0004   0.7543    14.2924     9.8413  0.0017  DRINKS PER WEEK AMONG ALL
   8    PACKYEAR              8   0.0003   0.7547     8.0100     8.2838  0.0040  PACK YEARS OF CIG SMOKING

Model: MODEL1
Dependent Variable: S2FEVPOS

Analysis of Variance

                           Sum of          Mean
Source          DF        Squares        Square      F Value    Prob>F
Model            8     1748.57096     218.57137     2258.122    0.0001
Error         5873      568.46790       0.09679
C Total       5881     2317.03886

Root MSE     0.31112     R-square    0.7547
Dep Mean     2.74558     Adj R-sq    0.7543
C.V.        11.33153
Parameter Estimates

                  Parameter       Standard    T for H0:
Variable   DF      Estimate          Error  Parameter=0    Prob > |T|
INTERCEP    1     -1.448893     0.12197908      -11.878        0.0001
AGE         1     -0.028410     0.00069827      -40.687        0.0001
PACKYEAR    1     -0.000814     0.00028272       -2.878        0.0040
GENDER      1     -0.472494     0.01197224      -39.466        0.0001
HGTM        1      3.424615     0.06437664       53.197        0.0001
AFROAMER    1     -0.399177     0.02136117      -18.687        0.0001
F10CIGS     1     -0.002691     0.00038467       -6.996        0.0001
DRINKS      1      0.002267     0.00073970        3.064        0.0022
COUGH30     1     -0.045765     0.00837485       -5.465        0.0001

                    Variance
Variable   DF      Inflation   Variable Label
INTERCEP    1     0.00000000   Intercept
AGE         1     1.37934114   AGE AT ENTRY INTO LHS
PACKYEAR    1     1.77696126   PACK YEARS OF CIG SMOKING
GENDER      1     2.03328873   GENDER 0 = MEN, 1 = WOMEN
HGTM        1     1.99671021
AFROAMER    1     1.02011203   ETHNIC GROUP: AFRO-AMER OR NOT
F10CIGS     1     1.48491823   CIGS PER DAY AT SCREEN 1
DRINKS      1     1.01995800   DRINKS PER WEEK AMONG ALL
COUGH30     1     1.04004387

==================================================================================

Notes on the preceding printout:

A 'proc corr' was run before the 'proc reg'.  This shows the correlations
of both the outcome variable (S2FEVPOS) and the other baseline variables
entered in the regression.  S2FEVPOS is strongly correlated with age
(r = -.358), gender (r = -.702), height (r = .763), and weight (r = .522).
It is significantly correlated with most of the other variables.  There
are a number of other strong correlations (gender and height, for
example).

Only one of the variables is not included in the final model: WGTKG
(weight in kilograms).  If the variable HGTM (height in meters) had not
been included in the list, the stepwise selection process *would* include
WGTKG.  It is *not* correct to conclude that, if a stepwise selection
process does not select a given variable, then that variable is not
strongly related to the outcome variable.

Note that the stepwise procedure enters variables into the model if their
significance level is < .1500.  This default can be changed in the
procedure.  Note also that the variance inflation factor is relatively
small for all variables entered into the model: the maximum VIF is 2.033,
for GENDER.

4.  What do you do with residuals in multivariate regression?

As with simple linear regression, proc reg can compute residuals for
multivariate regression.  It can be useful to plot the residuals against
either (1) single predictors that were entered into the regression, or
(2) the predicted value of the outcome variable.

Plotting residuals against individual variables can indicate whether the
relationship between that predictor and the outcome is actually linear.
The residuals must average to 0, but this does not exclude the possibility
that the pattern of residuals might resemble, for example, a rainbow:
a curved pattern of that kind indicates nonlinearity.

Example: the scatterplot of W vs EX1 on page 84 of the text.  W is a
measure of wealth for each state.  EX1 is the per capita expenditure for
police in the year 1959.  The scatterplot indicates a curvilinear
relationship between these two variables.  If you regress W on EX1, the
plot suggests that it might be useful to add a quadratic term to the
model:

     Model 1 (linear)      W = a + b*EX1 + e
     Model 2 (quadratic)   W = a + b*EX1 + c*EX1*EX1 + e.

The following program-segment uses proc reg to analyze these data using
Model 1 and Model 2.  The first analysis includes an output file of the
W-residuals from Model 1.  These residuals were then plotted against EX1,
and the result has a (very roughly) curvilinear shape.
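Note that the stepwise run below offers EX1SQ to the procedure, but the
uscrime data step earlier in these notes does not create that variable.
A short data step along the following lines (a sketch, using the variable
name the program below expects) has to come first:

     data uscrime ;
          set uscrime ;
          ex1sq = ex1 * ex1 ;   /* quadratic term offered to stepwise */
     run ;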
The second analysis is a stepwise one, with both EX1 and EX1SQ
( = EX1*EX1) offered to the procedure.  The stepwise algorithm first
selected EX1, and then selected EX1SQ.  This is a strong indication that
the relationship of W to EX1 is not linear (as suggested by the
scatterplot).

==================================================================================

proc reg data = uscrime ;
     model w = ex1 ;
     output out = regout r = wresid ;
     title1 'Regression of W (wealth) vs EX1 (police expenditures in 1959)' ;
     title2 'plus computation of residuals' ;
run ;

options pagesize = 40 ;

proc plot data = regout ;
     plot wresid * ex1 ;
     title1 'Plot of Wealth residuals versus EX1 to detect nonlinearity ...' ;

options pagesize = 60 ;

proc reg data = uscrime ;
     model w = ex1 ex1sq / selection = stepwise ;
     output out = regout r = wresid ;
     title1 'Stepwise regression ... crime dataset ... wealth vs police expenditures' ;
     title2 'Both EX1 and EX1-squared are entered ...' ;

endsas ;

--------------------------------------------------------------------------------

Regression of W (wealth) vs EX1 (police expenditures in 1959)
plus computation of residuals

Model: MODEL1
Dependent Variable: W

Analysis of Variance

                          Sum of           Mean
Source         DF        Squares         Square      F Value    Prob>F
Model           1   270183.34186   270183.34186       76.902    0.0001
Error          45   158099.76452     3513.32810
C Total        46   428283.10638

Root MSE     59.27333    R-square    0.6309
Dep Mean    525.38298    Adj R-sq    0.6226
C.V.         11.28193

Parameter Estimates

                  Parameter       Standard    T for H0:
Variable   DF      Estimate          Error  Parameter=0    Prob > |T|
INTERCEP    1    305.469730    26.52592383       11.516        0.0001
EX1         1      2.740897     0.31255237        8.769        0.0001

Plot of Wealth residuals versus EX1 to detect nonlinearity ...

[Line-printer plot of WRESID versus EX1 (Legend: A = 1 obs, B = 2 obs,
etc.): residuals ranging from about -150 to +150 are plotted against EX1
values from 40 to 160, and the scatter shows the (very roughly)
curvilinear pattern discussed above.]

Stepwise regression ... crime dataset ... wealth vs police expenditures
Both EX1 and EX1-squared are entered ...

Stepwise Procedure for Dependent Variable W

Step 1   Variable EX1 Entered    R-square = 0.63085220   C(p) = 12.31414903

                 DF      Sum of Squares        Mean Square         F    Prob>F
Regression        1     270183.34185858    270183.34185858     76.90    0.0001
Error            45     158099.76452440      3513.32810054
Total            46     428283.10638298

                Parameter        Standard          Type II
Variable         Estimate           Error   Sum of Squares          F   Prob>F
INTERCEP     305.46972954     26.52592383   465922.87567564     132.62  0.0001
EX1            2.74089703      0.31255237   270183.34185858      76.90  0.0001

Bounds on condition number:            1,            1
--------------------------------------------------------------------------------

Step 2   Variable EX1SQ Entered   R-square = 0.70635898   C(p) = 3.00000000

                 DF      Sum of Squares        Mean Square         F    Prob>F
Regression        2     302521.61930531    151260.80965266     52.92    0.0001
Error            44     125761.48707767      2858.21561540
Total            46     428283.10638298

                Parameter        Standard          Type II
Variable         Estimate           Error   Sum of Squares          F   Prob>F
INTERCEP      83.49815183     70.19451350     4044.29080773       1.41  0.2406
EX1            8.12669513      1.62580398    71414.52619009      24.99  0.0001
EX1SQ         -0.02917694      0.00867419    32338.27744673      11.31  0.0016

Bounds on condition number:     33.25941,     133.0376
--------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the
model.

Summary of Stepwise Procedure for Dependent Variable W

        Variable         Number  Partial    Model
Step    Entered  Removed     In     R**2     R**2      C(p)         F   Prob>F
   1    EX1                   1   0.6309   0.6309   12.3141   76.9024   0.0001
   2    EX1SQ                 2   0.0755   0.7064    3.0000   11.3141   0.0016

==================================================================================

Residuals may also be used to determine whether variability in the data is
related to the magnitude of the outcome variable (i.e.,
heteroscedasticity).  If this appears to be true, a transformation of the
outcome variable may be warranted (see n54703.004 for more discussion of
this).
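A plot of residuals against predicted values is the usual check.  Below is
a minimal sketch for Model 1 above; the p = keyword on the output
statement requests predicted values, and the names regout2, wpred are just
illustrative choices:

     proc reg data = uscrime ;
          model w = ex1 ;
          output out = regout2 r = wresid p = wpred ;
     run ;

     proc plot data = regout2 ;
          plot wresid * wpred ;
          title1 'Residuals vs predicted values: check for heteroscedasticity' ;
     run ;

A fan shape, with the spread of the residuals increasing as the predicted
value increases, would suggest heteroscedasticity.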
5.  How Do You Compare Regression Models?

Assume that an outcome variable Y is believed to be related to a predictor
X as follows:

     Y = a + b * X + e,

where as usual e has a normal distribution with standard deviation sigma.
It may happen that this model holds in each of two strata, but that the
intercepts "a" and slopes "b" are different.  Two questions arise:

     1.  How do you detect this?
     2.  What do you do about it?

An example might be the following.  The relationship of height and weight
may differ according to age group.  This can be tested using a large
dataset such as that from the Multiple Risk Factor Intervention Trial.

*=================================================================== ;

options linesize = 80 ;

PROC REG DATA = SEL ;
     MODEL WGTKG = HEIGHTCM ;
     TITLE1 'MODEL 1: WEIGHTKG VS HEIGHTCM: ONE LINE' ;
RUN ;

PROC REG DATA = SEL ;
     MODEL WGTKG = AGEGE50 HEIGHTCM ;
     TITLE1 'MODEL 2: WEIGHTKG VS HEIGHTCM: TWO PARALLEL LINES, SEPARATE INTERCEPTS';
RUN ;

PROC REG DATA = SEL ;
     MODEL WGTKG = AGEGE50 HEIGHTCM AGE50HGT ;
     TITLE1 'MODEL 3: WEIGHTKG VS HEIGHTCM: TWO LINES, TWO INTERCEPTS, TWO SLOPES' ;
RUN ;

ENDSAS ;

------------------------------------------------------------------------

MODEL 1: WEIGHTKG VS HEIGHTCM: ONE LINE

Model: MODEL1
Dependent Variable: WGTKG

Analysis of Variance

                           Sum of           Mean
Source          DF        Squares         Square      F Value    Prob>F
Model            1   42597433.044   42597433.044     3493.372    0.0001
Error        11649   142045417.63   12193.786387
C Total      11650   184642850.67

Root MSE    110.42548    R-square    0.2307
Dep Mean    850.15488    Adj R-sq    0.2306
C.V.         12.98887

Parameter Estimates

                  Parameter       Standard    T for H0:
Variable   DF      Estimate          Error  Parameter=0    Prob > |T|
INTERCEP    1   -751.200974    27.11282728      -27.706        0.0001
HEIGHTCM    1      9.103589     0.15402464       59.105        0.0001

------------------------------------------------------------------------

MODEL 2: WEIGHTKG VS HEIGHTCM: TWO PARALLEL LINES, SEPARATE INTERCEPTS

Model: MODEL1
Dependent Variable: WGTKG

Analysis of Variance

                           Sum of           Mean
Source          DF        Squares         Square      F Value    Prob>F
Model            2   43210854.606   21605427.303     1779.371    0.0001
Error        11648   141431996.07   12142.169992
C Total      11650   184642850.67

Root MSE    110.19152    R-square    0.2340
Dep Mean    850.15488    Adj R-sq    0.2339
C.V.         12.96135

Parameter Estimates

                  Parameter       Standard    T for H0:
Variable   DF      Estimate          Error  Parameter=0    Prob > |T|
INTERCEP    1   -728.268559    27.24707991      -26.728        0.0001
AGEGE50     1    -15.381681     2.16407565       -7.108        0.0001
HEIGHTCM    1      9.002956     0.15434904       58.329        0.0001

------------------------------------------------------------------------

MODEL 3: WEIGHTKG VS HEIGHTCM: TWO LINES, TWO INTERCEPTS, TWO SLOPES

Model: MODEL1
Dependent Variable: WGTKG

Analysis of Variance

                           Sum of           Mean
Source          DF        Squares         Square      F Value    Prob>F
Model            3   43240090.637   14413363.546     1187.194    0.0001
Error        11647   141402760.03    12140.70233
C Total      11650   184642850.67

Root MSE    110.18486    R-square    0.2342
Dep Mean    850.15488    Adj R-sq    0.2340
C.V.         12.96056

Parameter Estimates

                  Parameter       Standard    T for H0:
Variable   DF      Estimate          Error  Parameter=0    Prob > |T|
INTERCEP    1   -758.286174    33.41393737      -22.694        0.0001
AGEGE50     1     73.626432    57.39857942        1.283        0.1996
HEIGHTCM    1      9.173180     0.18935060       48.445        0.0001
AGE50HGT    1     -0.507207     0.32684974       -1.552        0.1207

==================================================================================

The stratifying factor in this example is age group, coded as:

     AGEGE50 = 0 : Age less than 50
     AGEGE50 = 1 : Age 50 or greater.

(The variable AGE50HGT is the product AGEGE50 * HEIGHTCM.)

Model 1 simply fits one straight line to weight (kg) as a function of
height (cm).  The regression coefficient for height indicates that each
increase in height of 1 cm is associated with a weight increase of about
9.1 kg.

Model 2 fits two separate but parallel lines with different intercepts.
The equation of the line for men < 50 years old is:

     WGTKG = -728.27 + 9.00 * HEIGHTCM

The equation of the line for men >= 50 years old is:

     WGTKG = -743.65 + 9.00 * HEIGHTCM

Note here that the coefficient of AGEGE50 is -15.38, and the corresponding
t-statistic is -7.108 (p = .0001).

Model 3 fits two separate lines with different slopes and intercepts.  The
equations of these lines are:

     Men < 50 :   WGTKG = -758.29 + 9.17 * HEIGHTCM
     Men >= 50:   WGTKG = -684.66 + 8.66 * HEIGHTCM

Here the coefficient of AGE50HGT is -.507, which indicates that weight is
not as strongly related to height in older men as it is in younger men.
The t-statistic for this coefficient is -1.552 (p = .1207).

How do you test which of these models is best?  Note that there is a
hierarchy here: you go from simpler to more complicated models:

     Model 1: 1 line; 2 parameters
     Model 2: 2 lines, parallel (same slope): 3 parameters
     Model 3: 2 lines, two different slopes: 4 parameters.

Or, to put it another way, the *model* statements in proc reg are:

     Model 1: model weight = height ;
     Model 2: model weight = agege50 height ;
     Model 3: model weight = agege50 height age50hgt ;

This makes the hierarchical relationship very clear: the independent
variables in Model 1 are a subset of those in Models 2 and 3, and those in
Model 2 are a subset of those in Model 3.

What you might want to know here is:  Is Model 2 better than Model 1?
Is Model 3 better than Model 2?  Is Model 3 better than Model 1?

Here is how you carry out a test for hierarchically related models of this
kind (illustrated for Model 1 versus Model 3):

1.  Compute the residual [or error] sum of squares for Model 3: ReSS3.

2.  Compute the residual sum of squares for Model 1: ReSS1.  Then compute

              (ReSS1 - ReSS3) / 2
         F = --------------------- .
               ReSS3 / (n - 4)

3.  Compare this to an F-distribution with degrees of freedom (2, n - 4).
    (Use tables.)
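Proc reg can also produce such hierarchical F-tests directly via the TEST
statement attached to the largest model.  A minimal sketch, assuming the
SEL dataset above (the labels M1VSM3 and M2VSM3 are arbitrary names, and
AGE50HGT is the age-by-height product described above):

     PROC REG DATA = SEL ;
          MODEL WGTKG = AGEGE50 HEIGHTCM AGE50HGT ;
          M1VSM3: TEST AGEGE50 = 0, AGE50HGT = 0 ;  * Model 1 vs Model 3, 2 df ;
          M2VSM3: TEST AGE50HGT = 0 ;               * Model 2 vs Model 3, 1 df ;
     RUN ;

Each TEST statement prints an F-statistic and p-value corresponding to the
hand computation described in steps 1-3 above.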
Questions about this:

Q1.  What is that "2" in the denominator of the numerator of F?

A1.  That is the difference in the number of parameters between Model 1
     and Model 3.

Q2.  What is the (n - 4) term?

A2.  That is the total sample size (n) minus the number of parameters (4)
     in the 'bigger' model.

Q3.  Is F always positive?

A3.  If it isn't, you have made a mistake.  The residual sum of squares
     for the smaller model should always be greater than the residual sum
     of squares for the larger one.

Q4.  Say I am testing Model 3 versus Model 2.  Can't I just look at the
     t-statistic for the added variable, AGE50HGT?

A4.  You can, but the F-test above is considered more appropriate.

Q5.  What does the F test actually say for this example?

A5.  ReSS1 = 142045417.63     ReSS3 = 141402760.03

     Here n = 11651, so n - 4 = 11647, and

          (142045417.63 - 141402760.03) / 2
     F =  --------------------------------- = 26.47,
                141402760.03 / 11647

     where F has df = (2, 11647).  The p-value is: p < .0001

Q6.  So what's the conclusion?

A6.  The two-line model fits better than the one-line model.  (Note,
     however, that there is not much difference in the R-squares.)

Q7.  Is it correct to say that, since the slope of weight as a function of
     height is not the same in the two age groups, there is an INTERACTION
     between age and height as predictors of weight?

A7.  To say this, you need to check that the equal-slopes model (Model 2)
     is significantly 'worse' than the unequal-slopes model (Model 3).
     The appropriate F-statistic is:

          (ReSS2 - ReSS3) / 1     (141431996.07 - 141402760.03) / 1
     F =  -------------------  =  ---------------------------------  =  2.41.
            ReSS3 / (n - 4)             141402760.03 / 11647

     The p-value in this case (see notes n54703.006) is .1207.  Thus one
     would not reject the hypothesis that the slopes are the same.  You
     cannot conclude that there is age-height interaction.

==================================================================================

Problem 1.

This is a continuation of problem 1 in n54703.004.  Again, consider the
data on mortality and water hardness from Chapter 2 of Der-Everitt.

(1)  Carry out an analysis for the model:

          mortality = a + b*hardness + c*norsouth + e,

     where norsouth = 1 if the city is northern, 0 if the city is
     southern, and e is a normally distributed error term with mean 0 and
     standard deviation sigma.  How can you tell whether norsouth is an
     important variable?  What effect does it have on R-square?

(2)  Comment on collinearity of hardness and norsouth.  What does it mean
     to say that a continuous variable like hardness is "collinear" with a
     dichotomous variable like norsouth?

(3)  Carry out a stepwise regression with the predictors being hardness
     and norsouth.

(4)  Back to the simpler model:  mortality = a + b*hardness + e.
     Compute the coefficients separately for northern and southern cities.

(5)  Test the following hypothesis: the slope of mortality versus hardness
     is different for northern cities than for southern cities.

Problem 2.

In the dataset on crime in the U.S. versus other variables in Chapter 4,
consider the regression of crime rate R versus the variable Age.  Do the
regression for 3 models:

     Model 1 :  R = a + b*Age + e
     Model 2 :  R = a + b*Age + c*S + e
     Model 3 :  R = a + b*Age + c*S + d*Age*S + e,

where S = southern state (0 = no, 1 = yes), and e represents the 'error'
term in each model.  Test whether Model 2 is better than Model 1, whether
Model 3 is better than Model 2, and whether Model 3 is better than
Model 1.  Also note the R-squares for the 3 models.

n54703.005    Last update: February 29, 2004.