PROC REG, II : Multivariate regression, collinearity, and other topics n54703.005
Multivariate Regression:
Multivariate regression is a natural generalization of simple linear
regression. The model may be stated as
Y = a0 + a1*X1 + a2*X2 + ... + ap*Xp + e,
where as before the error term e has mean 0 and standard deviation sigma, and
the variables X1, X2, ..., Xp are possible predictors of the outcome Y. The
coefficients a0, a1, a2, ..., ap are unknown.
The object of the regression is to estimate values of a0, a1, ..., ap
such that the resulting predicted values agree as closely as possible with
the observed data. If there are two predicting variables, the model corresponds
to a flat plane in 3-dimensional space which provides the best fit to a cloud of
observed data points. This is called a *linear model* because the expression on
the right side of the equation is a linear function of the predictors. It is
reasonable to assume a linear model in many situations, but there are a great
many others where a linear model is not reasonable.
The following questions arise in dealing with multivariate linear models:
1. How can the data be displayed ?
2. What if some of the predictors are highly correlated with each other?
Should they all be entered into the model?
3. What predictors should be entered into the model? What subset of
possible predictors provides an adequate fit to the data ?
4. What do you do with residuals in multivariate regression ?
How do you look for nonlinearity ?
5. How can you compare regression models ?
We consider these in succession in the following:
1. How can the data be displayed?
It is a good idea to examine the relationship of the outcome
variable Y to each of the predictors. An example of this is on page 84
of the Der-Everitt text. The figure there is generated by a SAS macro
and shows small scatterplots of all the variables versus each other.
The 'outcome' variable for that dataset is R, the crime rate. The
relationship of R to the other variables is shown in the leftmost
column in that figure. This is one useful way to show multivariate
relationships, but essentially it just shows a kind of 2-dimensional
shadow of a cloud of points in (p + 1)-dimensional space (if there
are p predictors). Since we humans are not able to visualize spaces of
dimension higher than 3, we cannot really display all the data for
this kind of model unless p = 1 or 2.
Another graphical approach is like that shown in Chapter 2 of the
text: 3-D plots of smoothed bivariate densities. This is accomplished
using proc kde (kernel density estimator) and proc g3d, and requires
access to graphics screens. A program to do this for crime rate versus
unemployment in the Chapter 4 dataset on U.S. crime rates is given
below. The smoothed-density plot actually does not contain any more
information that a simple scatterplot. It depicts the relationship
of two variables.
The resulting 3-D plot may be seen at:
http://www.biostat.umn.edu/%7Ejohn-c/5421/crimevsunemp2.grf
However, it is possible to show 3-dimensional relationships: e.g.,
observed crime rates versus two covariates. The following figure
shows this, and the program below generates the plot in the second
"proc g3d" section:
http://www.biostat.umn.edu/%7Ejohn-c/5421/crime.vs.educ.unemp2.grf
========================================================================
footnote "~john-c/5421/chap4.sas &sysdate &systime" ;
FILENAME GRAPH 'gsas.grf' ;
OPTIONS LINESIZE = 150 MPRINT ;
GOPTIONS
RESET = GLOBAL
ROTATE = LANDSCAPE
FTEXT = SWISSB
DEVICE = GIF
GACCESS = SASGASTD
GSFNAME = GRAPH
GSFMODE = REPLACE
GUNIT = PCT BORDER
CBACK = WHITE
HTITLE = 2 HTEXT = 1 ;
*===================================================================== ;
/* Chapter 4 */
data uscrime; /* Chapter 4 */
* infile 'n:\handbook2\datasets\uscrime.dat' expandtabs;
infile cards expandtabs ;
input R Age S Ed Ex0 Ex1 LF M N NW U1 U2 W X;
datalines ;
79.1 151 1 91 58 56 510 950 33 301 108 41 394 261
163.5 143 0 113 103 95 583 1012 13 102 96 36 557 194
57.8 142 1 89 45 44 533 969 18 219 94 33 318 250
196.9 136 0 121 149 141 577 994 157 80 102 39 673 167
123.4 141 0 121 109 101 591 985 18 30 91 20 578 174
68.2 121 0 110 118 115 547 964 25 44 84 29 689 126
96.3 127 1 111 82 79 519 982 4 139 97 38 620 168
155.5 131 1 109 115 109 542 969 50 179 79 35 472 206
85.6 157 1 90 65 62 553 955 39 286 81 28 421 239
70.5 140 0 118 71 68 632 1029 7 15 100 24 526 174
167.4 124 0 105 121 116 580 966 101 106 77 35 657 170
84.9 134 0 108 75 71 595 972 47 59 83 31 580 172
51.1 128 0 113 67 60 624 972 28 10 77 25 507 206
66.4 135 0 117 62 61 595 986 22 46 77 27 529 190
79.8 152 1 87 57 53 530 986 30 72 92 43 405 264
94.6 142 1 88 81 77 497 956 33 321 116 47 427 247
53.9 143 0 110 66 63 537 977 10 6 114 35 487 166
92.9 135 1 104 123 115 537 978 31 170 89 34 631 165
75.0 130 0 116 128 128 536 934 51 24 78 34 627 135
122.5 125 0 108 113 105 567 985 78 94 130 58 626 166
74.2 126 0 108 74 67 602 984 34 12 102 33 557 195
43.9 157 1 89 47 44 512 962 22 423 97 34 288 276
121.6 132 0 96 87 83 564 953 43 92 83 32 513 227
96.8 131 0 116 78 73 574 1038 7 36 142 42 540 176
52.3 130 0 116 63 57 641 984 14 26 70 21 486 196
199.3 131 0 121 160 143 631 1071 3 77 102 41 674 152
34.2 135 0 109 69 71 540 965 6 4 80 22 564 139
121.6 152 0 112 82 76 571 1018 10 79 103 28 537 215
104.3 119 0 107 166 157 521 938 168 89 92 36 637 154
69.6 166 1 89 58 54 521 973 46 254 72 26 396 237
37.3 140 0 93 55 54 535 1045 6 20 135 40 453 200
75.4 125 0 109 90 81 586 964 97 82 105 43 617 163
107.2 147 1 104 63 64 560 972 23 95 76 24 462 233
92.3 126 0 118 97 97 542 990 18 21 102 35 589 166
65.3 123 0 102 97 87 526 948 113 76 124 50 572 158
127.2 150 0 100 109 98 531 964 9 24 87 38 559 153
83.1 177 1 87 58 56 638 974 24 349 76 28 382 254
56.6 133 0 104 51 47 599 1024 7 40 99 27 425 225
82.6 149 1 88 61 54 515 953 36 165 86 35 395 251
115.1 145 1 104 82 74 560 981 96 126 88 31 488 228
88.0 148 0 122 72 66 601 998 9 19 84 20 590 144
54.2 141 0 109 56 54 523 968 4 2 107 37 489 170
82.3 162 1 99 75 70 522 996 40 208 73 27 496 224
103.0 136 0 121 95 96 574 1012 29 36 111 37 622 162
45.5 139 1 88 46 41 480 968 19 49 135 53 457 249
50.8 126 0 104 106 97 599 989 40 24 78 25 593 171
84.9 130 0 121 90 91 623 1049 3 22 113 40 588 160
;
run ;
proc print data = uscrime ;
proc kde data = uscrime out = bivest ;
var r u2 ;
run ;
proc g3d data = bivest ;
plot r * u2 = density ;
title1 'Plot of Bivariate Density: Crime Rate vs Unemp 35-39' ;
run ;
proc g3d data = uscrime ;
scatter ed * u2 = r / shape = 'prism' ;
title1 'Plot of crime rates vs education level and unemployment' ;
run ;
========================================================================
2. What if some of the predictors are highly correlated with each other?
Should they all be entered into the model?
The plot on page 84 of the text shows strong correlations between
some of the variables: particularly EX0 and EX1 (both related to
expenditures for police), and between W and X (both related to income).
Their correlation (obtainable from proc corr) is greater than 0.99.
Entering highly correlated variables as predictors in a regression can
lead to erroneous conclusions and should be avoided. 'Highly correlated'
and 'collinear' mean essentially the same thing. SAS includes tests for
collinearity. One such is based on something called the 'Variance
Inflation Factor' (VIF). This can be invoked in proc reg as follows (see
also p. 85 of the Der-Everitt text):
proc reg data = uscrime ;
model R = Age--X / vif ;
run ;
Printout for this is shown on page 86. Note that variables EX0
and EX1 have very high VIFs: 94.6 and 98.6 resp. A rule of thumb
is that VIFs > 10 are not good. The VIF indicates that the variable
in question is highly correlated with the ensemble of other variables
in the analysis.
What do you do if the VIF is very large for a given variable ?
In this case, seeing the VIF for EX0 and EX1 and also noting that
they are very highly correlated, you should enter only one of them
in the model. Given their close agreement, it does not matter
much which one you choose. In general, one would take the approach of
omitting variables from the model one at a time until there are none left
which have a high VIF.
Another more systematic approach to avoiding collinearity of
predictors is to use stepwise regression. This will select a subset
of the predictors which does a good job of explaining most of the
variability in the outcome variable, without adding in variables
that are highly correlated with others. This is discussed in more
detail below.
3. What predictors should be entered into the model? What subset of
possible predictors provides an adequate fit to the data ?
An outcome variable Y may be strongly related to some of the other
variables in your dataset, and weakly related, or even unrelated, to
others. You can enter many possible predictors in a multivariate regression,
and in general the procedure will give you back the usual output: an
ANOVA table, an F-test for overall effect of all the covariates on the
outcome, R-square, coefficient estimates, standard errors, etc.. However,
if variables are entered into the model which are really unrelated to the
outcome, the effect is to decrease the apparent effect of other variables
which are genuinely related. That is, you pay a price for entering useless
predictors: the price is to underestimate the effects of the genuine
predictors.
Proc reg incorporates ways to select "best" models: that is, models
which do not include variables which do not have strong relationships to
the outcome variable. Proc reg makes use of *stepwise* algorithms, which
can enter or delete variables from a model depending on certain variable
selection criteria. One approach, call *backward* stepwise selection, starts
by putting all the variables in the model and then omitting the least
useful variables one at a time until the variables left in the model all
contribute significantly to prediction of the outcome. The other approach
is called *forward* stepwise regression, which starts with only the intercept term in
model and adds the candidate variables one at a time until all the
potentially useful variables have been added. It may happen that a given
predictor X by itself is fairly strongly related to the outcome, but when
other variables are already in the model which are perhaps correlated with
X, the variable X itself does not make any useful addition.
There is a variant of stepwise regression which is similar to 'forward'
except that variables which are entered in one step may be removed in a subsequent
step. This gives quite a lot of flexibility for variable selection. Below is
an example of the use of forward stepwise using data in the Lung Health Study:
the outcome variable is baseline FEV1, and the possible predictors are
other baseline covariates:
==================================================================================
PROC CORR DATA = SMOKE ;
VAR S2FEVPOS AGE PACKYEAR GENDER HGTM WGTKG AFROAMER F10CIGS
DRINKS COUGH30 ;
TITLE1 'Correlations of Baseline Variables in the Lung Health Study' ;
PROC REG DATA = SMOKE ;
MODEL S2FEVPOS = AGE PACKYEAR GENDER HGTM WGTKG AFROAMER F10CIGS
DRINKS COUGH30 / VIF SELECTION = STEPWISE ;
TITLE1 'Proc reg: use of stepwise and VIF in Lung Health Study data' ;
endsas ;
--------------------------------------------------------------------------------
Correlations of Baseline Variables in the Lung Health Study 1
17:45 Tuesday, February 3, 2004
Correlation Analysis
10 'VAR' Variables: S2FEVPOS AGE PACKYEAR GENDER HGTM WGTKG
AFROAMER F10CIGS DRINKS COUGH30
Simple Statistics
Variable N Mean Std Dev Sum
S2FEVPOS 5885 2.745760 0.627641 16159
AGE 5887 48.468660 6.825130 285335
PACKYEAR 5885 40.450127 19.126189 238049
GENDER 5887 0.371157 0.483155 2185.000000
HGTM 5887 1.719552 0.089076 10123
WGTKG 5886 76.005590 15.072866 447369
AFROAMER 5887 0.038220 0.191743 225.000000
F10CIGS 5887 31.272804 12.851235 184103
DRINKS 5887 4.353321 5.537889 25628
COUGH30 5887 0.422626 0.494019 2488.000000
Simple Statistics
Variable Minimum Maximum Label
S2FEVPOS 1.150000 4.940000
AGE 34.000000 67.000000 AGE AT ENTRY INTO LHS
PACKYEAR 0 190.000000 PACK YEARS OF CIG SMOKING
GENDER 0 1.000000 GENDER 0 = MEN, 1 = WOMEN
HGTM 1.420000 2.160000
WGTKG 38.400000 136.000000
AFROAMER 0 1.000000 ETHNIC GROUP: AFRO-AMER OR NOT
F10CIGS 2.000000 120.000000 CIGS PER DAY AT SCREEN 1
DRINKS 0 25.000000 DRINKS PER WEEK AMONG ALL
COUGH30 0 1.000000
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 03FEB04 17:45
Correlations of Baseline Variables in the Lung Health Study 2
17:45 Tuesday, February 3, 2004
Correlation Analysis
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
/ Number of Observations
S2FEVPOS AGE PACKYEAR GENDER HGTM
S2FEVPOS 1.00000 -0.35781 -0.07002 -0.70189 0.76342
0.0 0.0001 0.0001 0.0001 0.0001
5885 5885 5883 5885 5885
AGE -0.35781 1.00000 0.41517 0.00938 -0.07931
AGE AT ENTRY INTO LHS 0.0001 0.0 0.0001 0.4720 0.0001
5885 5887 5885 5887 5887
PACKYEAR -0.07002 0.41517 1.00000 -0.16390 0.08554
PACK YEARS OF CIG SMOKING 0.0001 0.0001 0.0 0.0001 0.0001
5883 5885 5885 5885 5885
GENDER -0.70189 0.00938 -0.16390 1.00000 -0.70252
GENDER 0 = MEN, 1 = WOMEN 0.0001 0.4720 0.0001 0.0 0.0001
5885 5887 5885 5887 5887
HGTM 0.76342 -0.07931 0.08554 -0.70252 1.00000
0.0001 0.0001 0.0001 0.0001 0.0
5885 5887 5885 5887 5887
WGTKG 0.52249 -0.03590 0.12352 -0.56202 0.65363
0.0001 0.0059 0.0001 0.0001 0.0001
5884 5886 5884 5886 5886
AFROAMER -0.13084 0.01020 -0.09143 0.02474 -0.01591
ETHNIC GROUP: AFRO-AMER OR NOT 0.0001 0.4340 0.0001 0.0577 0.2224
5885 5887 5885 5887 5887
F10CIGS 0.06159 -0.06894 0.47318 -0.14237 0.09523
CIGS PER DAY AT SCREEN 1 0.0001 0.0001 0.0001 0.0001 0.0001
5885 5887 5885 5887 5887
DRINKS 0.10331 0.01647 0.02992 -0.12344 0.09372
DRINKS PER WEEK AMONG ALL 0.0001 0.2065 0.0217 0.0001 0.0001
5885 5887 5885 5887 5887
COUGH30 -0.04240 -0.02897 0.09957 0.00752 -0.02272
0.0011 0.0262 0.0001 0.5641 0.0813
5885 5887 5885 5887 5887
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 03FEB04 17:45
Correlations of Baseline Variables in the Lung Health Study 3
17:45 Tuesday, February 3, 2004
Correlation Analysis
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
/ Number of Observations
WGTKG AFROAMER F10CIGS DRINKS COUGH30
S2FEVPOS 0.52249 -0.13084 0.06159 0.10331 -0.04240
0.0001 0.0001 0.0001 0.0001 0.0011
5884 5885 5885 5885 5885
AGE -0.03590 0.01020 -0.06894 0.01647 -0.02897
AGE AT ENTRY INTO LHS 0.0059 0.4340 0.0001 0.2065 0.0262
5886 5887 5887 5887 5887
PACKYEAR 0.12352 -0.09143 0.47318 0.02992 0.09957
PACK YEARS OF CIG SMOKING 0.0001 0.0001 0.0001 0.0217 0.0001
5884 5885 5885 5885 5885
GENDER -0.56202 0.02474 -0.14237 -0.12344 0.00752
GENDER 0 = MEN, 1 = WOMEN 0.0001 0.0577 0.0001 0.0001 0.5641
5886 5887 5887 5887 5887
HGTM 0.65363 -0.01591 0.09523 0.09372 -0.02272
0.0001 0.2224 0.0001 0.0001 0.0813
5886 5887 5887 5887 5887
WGTKG 1.00000 0.00712 0.12452 0.02545 -0.00144
0.0 0.5852 0.0001 0.0508 0.9120
5886 5886 5886 5886 5886
AFROAMER 0.00712 1.00000 -0.11793 -0.02648 -0.07729
ETHNIC GROUP: AFRO-AMER OR NOT 0.5852 0.0 0.0001 0.0422 0.0001
5886 5887 5887 5887 5887
F10CIGS 0.12452 -0.11793 1.00000 0.05608 0.17580
CIGS PER DAY AT SCREEN 1 0.0001 0.0001 0.0 0.0001 0.0001
5886 5887 5887 5887 5887
DRINKS 0.02545 -0.02648 0.05608 1.00000 0.03986
DRINKS PER WEEK AMONG ALL 0.0508 0.0422 0.0001 0.0 0.0022
5886 5887 5887 5887 5887
COUGH30 -0.00144 -0.07729 0.17580 0.03986 1.00000
0.9120 0.0001 0.0001 0.0022 0.0
5886 5887 5887 5887 5887
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 03FEB04 17:45
Proc reg: use of stepwise and VIF in Lung Health Study data 4
17:45 Tuesday, February 3, 2004
Stepwise Procedure for Dependent Variable S2FEVPOS
Step 1 Variable HGTM Entered R-square = 0.58277189 C(p) =4107.9115432
DF Sum of Squares Mean Square F Prob>F
Regression 1 1350.30512662 1350.30512662 8213.01 0.0001
Error 5880 966.73373487 0.16441050
Total 5881 2317.03886149
Parameter Standard Type II
Variable Estimate Error Sum of Squares F Prob>F
INTERCEP -6.50720345 0.10223574 666.05914651 4051.20 0.0001
HGTM 5.38101003 0.05937623 1350.30512662 8213.01 0.0001
Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
Step 2 Variable AGE Entered R-square = 0.67130902 C(p) =1990.8694694
DF Sum of Squares Mean Square F Prob>F
Regression 2 1555.44908864 777.72454432 6003.55 0.0001
Error 5879 761.58977285 0.12954410
Total 5881 2317.03886149
Parameter Standard Type II
Variable Estimate Error Sum of Squares F Prob>F
INTERCEP -4.88569345 0.09947821 312.47421230 2412.11 0.0001
AGE -0.02745985 0.00069005 205.14396202 1583.58 0.0001
HGTM 5.21202416 0.05287638 1258.65595418 9716.04 0.0001
Bounds on condition number: 1.006492, 4.025966
--------------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Steps 3-7 OMITTED. <<<<<<<<<<<<<<<<<<<<<<<<<<<<
--------------------------------------------------------------------------------
Step 8 Variable PACKYEAR Entered R-square = 0.75465759 C(p) = 8.01000201
DF Sum of Squares Mean Square F Prob>F
Regression 8 1748.57096190 218.57137024 2258.12 0.0001
Error 5873 568.46789959 0.09679344
Total 5881 2317.03886149
Parameter Standard Type II
Variable Estimate Error Sum of Squares F Prob>F
INTERCEP -1.44889268 0.12197908 13.65675946 141.09 0.0001
AGE -0.02841031 0.00069827 160.23325283 1655.41 0.0001
PACKYEAR -0.00081371 0.00028272 0.80181738 8.28 0.0040
GENDER -0.47249446 0.01197224 150.76096651 1557.55 0.0001
HGTM 3.42461517 0.06437664 273.91311523 2829.87 0.0001
AFROAMER -0.39917697 0.02136117 33.80078310 349.21 0.0001
F10CIGS -0.00269115 0.00038467 4.73734587 48.94 0.0001
DRINKS 0.00226678 0.00073970 0.90898139 9.39 0.0022
COUGH30 -0.04576476 0.00837485 2.89037675 29.86 0.0001
Bounds on condition number: 2.033289, 94.01067
--------------------------------------------------------------------------------
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 03FEB04 17:45
Proc reg: use of stepwise and VIF in Lung Health Study data 7
17:45 Tuesday, February 3, 2004
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
Summary of Stepwise Procedure for Dependent Variable S2FEVPOS
Variable Number Partial Model
Step Entered Removed In R**2 R**2 C(p) F Prob>F
Label
1 HGTM 1 0.5828 0.5828 4107.9115 8213.0103 0.0001
2 AGE 2 0.0885 0.6713 1990.8695 1583.5840 0.0001
AGE AT ENTRY INTO LHS
3 GENDER 3 0.0639 0.7352 462.7931 1419.2966 0.0001
GENDER 0 = MEN, 1 = WOMEN
4 AFROAMER 4 0.0123 0.7476 169.3526 287.4031 0.0001
ETHNIC GROUP: AFRO-AMER OR NOT
5 F10CIGS 5 0.0051 0.7527 49.9965 120.4542 0.0001
CIGS PER DAY AT SCREEN 1
6 COUGH30 6 0.0012 0.7539 22.1442 29.7755 0.0001
7 DRINKS 7 0.0004 0.7543 14.2924 9.8413 0.0017
DRINKS PER WEEK AMONG ALL
8 PACKYEAR 8 0.0003 0.7547 8.0100 8.2838 0.0040
PACK YEARS OF CIG SMOKING
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 03FEB04 17:45
Proc reg: use of stepwise and VIF in Lung Health Study data 8
17:45 Tuesday, February 3, 2004
Model: MODEL1
Dependent Variable: S2FEVPOS
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 8 1748.57096 218.57137 2258.122 0.0001
Error 5873 568.46790 0.09679
C Total 5881 2317.03886
Root MSE 0.31112 R-square 0.7547
Dep Mean 2.74558 Adj R-sq 0.7543
C.V. 11.33153
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 -1.448893 0.12197908 -11.878 0.0001
AGE 1 -0.028410 0.00069827 -40.687 0.0001
PACKYEAR 1 -0.000814 0.00028272 -2.878 0.0040
GENDER 1 -0.472494 0.01197224 -39.466 0.0001
HGTM 1 3.424615 0.06437664 53.197 0.0001
AFROAMER 1 -0.399177 0.02136117 -18.687 0.0001
F10CIGS 1 -0.002691 0.00038467 -6.996 0.0001
DRINKS 1 0.002267 0.00073970 3.064 0.0022
COUGH30 1 -0.045765 0.00837485 -5.465 0.0001
Variance Variable
Variable DF Inflation Label
INTERCEP 1 0.00000000 Intercept
AGE 1 1.37934114 AGE AT ENTRY INTO LHS
PACKYEAR 1 1.77696126 PACK YEARS OF CIG SMOKING
GENDER 1 2.03328873 GENDER 0 = MEN, 1 = WOMEN
HGTM 1 1.99671021
AFROAMER 1 1.02011203 ETHNIC GROUP: AFRO-AMER OR NOT
F10CIGS 1 1.48491823 CIGS PER DAY AT SCREEN 1
DRINKS 1 1.01995800 DRINKS PER WEEK AMONG ALL
COUGH30 1 1.04004387
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 03FEB04 17:45
==================================================================================
Notes on preceding printout:
A 'proc corr' was run before the 'proc reg'. This shows the correlations
of both the outcome variable (S2FEVPOS) and the other baseline variables
entered in the regression. S2FEVPOS is strongly correlated with age
(r = -.358), gender (r = -.702), height (r = .763), and weight (r = .522).
It is significantly correlated with most of the other variables. There are
a number of other strong correlations (gender and height, for example).
Only one of the variables are not included in the final model: WGTKG (weight
in kilograms). If the variable HGTM (height in meters) is not been included
in the list, the stepwise selection process *will* include WGTKG. It is *not*
correct to conclude that, if a stepwise selection process does not select
a given variable, then that variable is not strongly related to the outcome
variable.
Note that the stepwise procedure enters variables into the model if their
significance level is < .1500. This default can be changed in the procedure.
Note also that the variance inflation factor is relatively small for
all variables entered into the model: the maximum VIF is 2.033, for GENDER.
4. What do you do with residuals in multivariate regression ?
As with simple linear regression, proc reg can compute residuals
for multivariate regression. It can be useful to plot the residuals
against either (1) single predictors that were entered into the regression,
or (2) the predicted value of the outcome variable.
Plotting residuals against individual variables can indicate whether
the relationship between that predictor and the outcome is actually linear.
The residuals must average to 0, but this does not exclude the possibility
that the pattern of residuals might resemble, for example, a rainbow.
Example: the scatterplot of W vs EX1 on page 84 of the text. W is a measure
of wealth for each state. EX1 is the per capita expenditure for police in the
year 1959. The scatterplot indicates a curvilinear relationship between these
two variables. If you regress W on EX1, the plot suggests that it might be
useful to add a quadratic term to the model.
Model 1 (linear) W = a + b*EX1 + e
Model 2 (quadratic) W = a + b*EX1 + c*EX1*EX1 + e.
The following program-segment uses proc reg to analyze these data using Model
1 and Model 2. The first analysis includes an output file of the W-residuals
from Model 1. These residuals were then plotted against EX1, and the result
has a (very roughly) curvilinear shape. The second analysis is a stepwise one,
with both EX1 and EX1SQ ( = EX1*EX1) offered to the procedure. The stepwise
algorithm first selected EX1, and then selected EX1SQ. This a strong indication
that the relationship of W to EX1 is not linear (as suggested by the scatterplot).
==================================================================================
proc reg data = uscrime ;
model w = ex1 ;
output out = regout r = wresid ;
title1 'Regression of W (wealth) vs EX1 (police expenditures in 1959)' ;
title2 'plus computation of residuals' ;
run ;
options pagesize = 40 ;
proc plot data = regout ;
plot wresid * ex1 ;
title1 'Plot of Wealth residuals versus EX1 to detect nonlinearity ...' ;
options pagesize = 60 ;
proc reg data = uscrime ;
model w = ex1 ex1sq / selection = stepwise ;
output out = regout r = wresid ;
title1 'Stepwise regression ... crime dataset ... wealth vs police expenditures' ;
title2 'Both EX1 and EX1-squared are entered ...' ;
endsas ;
--------------------------------------------------------------------------------
Regression of W (wealth) vs EX1 (police expenditures in 1959) 1
plus computation of residuals
19:22 Tuesday, February 3, 2004
Model: MODEL1
Dependent Variable: W
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 1 270183.34186 270183.34186 76.902 0.0001
Error 45 158099.76452 3513.32810
C Total 46 428283.10638
Root MSE 59.27333 R-square 0.6309
Dep Mean 525.38298 Adj R-sq 0.6226
C.V. 11.28193
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 305.469730 26.52592383 11.516 0.0001
EX1 1 2.740897 0.31255237 8.769 0.0001
Plot of Wealth residuals versus EX1 to detect nonlinearity ... 2
19:22 Tuesday, February 3, 2004
Plot of WRESID*EX1. Legend: A = 1 obs, B = 2 obs, etc.
|
|
150 +
|
|
|
|
|
| A
100 + A
| A
|
| A
| A A
| A
| A
50 + A
|
| A A A A A A A A
R | A
e | A A A
s | A
i | A A
d 0 + A A
u | A A A
a | A
l | A A A AA
| A
|
| A
-50 +
| B A
| A
|
| A
|
| A
-100 + A
| A
|
|
|
| A A
|
-150 +
|
---+----------+----------+----------+----------+----------+----------+--
40 60 80 100 120 140 160
EX1
Stepwise regression ... crime dataset ... wealth vs police expenditures 3
Both EX1 and EX1-squared are entered ...
19:22 Tuesday, February 3, 2004
Stepwise Procedure for Dependent Variable W
Step 1 Variable EX1 Entered R-square = 0.63085220 C(p) = 12.31414903
DF Sum of Squares Mean Square F Prob>F
Regression 1 270183.34185858 270183.34185858 76.90 0.0001
Error 45 158099.76452440 3513.32810054
Total 46 428283.10638298
Parameter Standard Type II
Variable Estimate Error Sum of Squares F Prob>F
INTERCEP 305.46972954 26.52592383 465922.87567564 132.62 0.0001
EX1 2.74089703 0.31255237 270183.34185858 76.90 0.0001
Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
Step 2 Variable EX1SQ Entered R-square = 0.70635898 C(p) = 3.00000000
DF Sum of Squares Mean Square F Prob>F
Regression 2 302521.61930531 151260.80965266 52.92 0.0001
Error 44 125761.48707767 2858.21561540
Total 46 428283.10638298
Parameter Standard Type II
Variable Estimate Error Sum of Squares F Prob>F
INTERCEP 83.49815183 70.19451350 4044.29080773 1.41 0.2406
EX1 8.12669513 1.62580398 71414.52619009 24.99 0.0001
EX1SQ -0.02917694 0.00867419 32338.27744673 11.31 0.0016
Bounds on condition number: 33.25941, 133.0376
--------------------------------------------------------------------------------
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
Summary of Stepwise Procedure for Dependent Variable W
Variable Number Partial Model
Step Entered Removed In R**2 R**2 C(p) F Prob>F
1 EX1 1 0.6309 0.6309 12.3141 76.9024 0.0001
2 EX1SQ 2 0.0755 0.7064 3.0000 11.3141 0.0016
==================================================================================
Residuals may also be used to determine whether variability in the data
is related to the magnitude of the outcome variable (i.e., heteroscedasticity).
If this appears to be true, a transformation of the outcome variable may be
warranted (see n54703.004 for more discussion on this.).
5. How Do You Compare Regression Models ?
Assume model with an outcome variable Y is believed to be related to predictor
X as follows:
Y = a + b * X + e,
where as usual e has a normal distribution with standard deviation sigma.
It may happen that this model holds in each of two strata, but that the
intercepts "a" and slopes "b" are different.
Two questions arise:
1. How do you detect this ?
2. What do you do about it ?
An example might be the following. The relationship of height and weight
may differ according to age group. This can be tested using a large
dataset such as that from the Multiple Risk Factor Intervention Trial.
*=================================================================== ;
options linesize = 80 ;
PROC REG DATA = SEL ;
MODEL WGTKG = HEIGHTCM ;
TITLE1 'MODEL 1: WEIGHTKG VS HEIGHTCM: ONE LINE' ;
RUN ;
PROC REG DATA = SEL ;
MODEL WGTKG = AGEGE50 HEIGHTCM ;
TITLE1 'MODEL 2: WEIGHTKG VS HEIGHTCM: TWO PARALLEL LINES, SEPARATE INTERCEPTS';
RUN ;
PROC REG DATA = SEL ;
MODEL WGTKG = AGEGE50 HEIGHTCM AGE50HGT ;
TITLE1 'MODEL 3: WEIGHTKG VS HEIGHTCM: TWO LINES, TWO INTERCEPTS, TWO SLOPES' ;
RUN ;
ENDSAS ;
------------------------------------------------------------------------
MODEL 1: WEIGHTKG VS HEIGHTCM: ONE LINE 1
19:09 Sunday, February 8, 2004
Model: MODEL1
Dependent Variable: WGTKG
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 1 42597433.044 42597433.044 3493.372 0.0001
Error 11649 142045417.63 12193.786387
C Total 11650 184642850.67
Root MSE 110.42548 R-square 0.2307
Dep Mean 850.15488 Adj R-sq 0.2306
C.V. 12.98887
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 -751.200974 27.11282728 -27.706 0.0001
HEIGHTCM 1 9.103589 0.15402464 59.105 0.0001
billings.sas (joanneb) 08FEB04 19:09
------------------------------------------------------------------------
MODEL 2: WEIGHTKG VS HEIGHTCM: TWO PARALLEL LINES, SEPARATE INTERCEPTS 2
19:09 Sunday, February 8, 2004
Model: MODEL1
Dependent Variable: WGTKG
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 2 43210854.606 21605427.303 1779.371 0.0001
Error 11648 141431996.07 12142.169992
C Total 11650 184642850.67
Root MSE 110.19152 R-square 0.2340
Dep Mean 850.15488 Adj R-sq 0.2339
C.V. 12.96135
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 -728.268559 27.24707991 -26.728 0.0001
AGEGE50 1 -15.381681 2.16407565 -7.108 0.0001
HEIGHTCM 1 9.002956 0.15434904 58.329 0.0001
billings.sas (joanneb) 08FEB04 19:09
------------------------------------------------------------------------
MODEL 3: WEIGHTKG VS HEIGHTCM: TWO LINES, TWO INTERCEPTS, TWO SLOPES 3
19:09 Sunday, February 8, 2004
Model: MODEL1
Dependent Variable: WGTKG
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 3 43240090.637 14413363.546 1187.194 0.0001
Error 11647 141402760.03 12140.70233
C Total 11650 184642850.67
Root MSE 110.18486 R-square 0.2342
Dep Mean 850.15488 Adj R-sq 0.2340
C.V. 12.96056
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 -758.286174 33.41393737 -22.694 0.0001
AGEGE50 1 73.626432 57.39857942 1.283 0.1996
HEIGHTCM 1 9.173180 0.18935060 48.445 0.0001
AGE50HGT 1 -0.507207 0.32684974 -1.552 0.1207
billings.sas (joanneb) 08FEB04 19:09
==================================================================================
The stratifying factor in this example is age group, coded as:
AGEGE50 = 0 : Age less than 50
AGEGE50 = 1 : Age 50 or greater.
Model 1 simply fits one straight line to weight (kg) as a function of
height (cm). The regression coefficient for height indicates that each
increase in height of 1 cm is associated with a weight increase of about
9.1 kg.
Model 2 fits two separate but parallel lines with different intercepts.
The equation of the line for men < 50 years old is:
WGTKG = -728.27 + 9.00 * HEIGHTCM
The equation of the line for men >= 50 years old is:
WGTKG = -743.65 + 9.00 * HEIGHTCM
Note here that the coefficient of AGEGE50 is -15.38, and the corresponding
t-statistic is -7.108 (p = .0001).
Model 3 fits two separate lines with different slopes and intercepts.
The equations of these lines are:
Men < 50 : WGTKG = -758.29 + 9.17 * HEIGHTCM
Men >= 50: WGTKG = -684.66 + 8.66 * HEIGHTCM
Here the coefficient of AGE50HGT is -.507, which indicates that
weight is not as strongly related to height in older men as it is in
younger. The t-statistic for this coefficient is -1.552 (p = .1207).
How do you test which of these models is best ?
Note that there is a hierarchy here: you go from simple to more
complicated models:
Model 1: 1 line; 2 parameters
Model 2: 2 lines, parallel (same slope): 3 parameters
Model 3: 2 lines, two different slopes: 4 parameters.
Or, to put it another way: the *model* statements in proc reg are:
Model 1: model weight = height ;
Model 2: model weight = agege50 height ;
Model 3: model weight = agege50 height age50hgt ;
This makes the hierarchical relationship very clear: the independent
variables in Model 1 are a subset of those in Models 2 and 3, and those
in Model 2 are a subset of those in Model 3.
What you might want to know here is: Is Model 2 better than Model 1?
Is Model 3 better than Model 2 ? Is Model 3 better than Model 1?
Here is how you carry out a test for hierarchically related models of
this kind:
1. Compute the residual [or error] sum of squares for Model 3: ReSS3.
2. Compute the residual sum of squares form Model 1 : ReSS1.
(ReSS1 - ReSS3) / 2
Compute F = ---------------------.
ReSS3 / (n - 4)
3. Compare this to an F-distribution with degrees of freedom (2, n - 4).
(Use tables).
Questions about this:
Q1. What is that "2" in the denominator of the numerator of F ?
A1. That is the difference in the number of parameters between
Model 1 and Model 3.
Q2. What is the (n - 4) term?
A2. That is total sample size (n) minus the number of parameters (4)
in the 'bigger' model.
Q3. Is F always positive ?
A3 If it isn't, you have made a mistake. The residual sum of
squares for the smaller model should always be greater than
the residual sum of squares for the larger one.
Q4. Say I am testing Model 3 versus Model 2. Can't I just look at the
t-statistic for the added variable, AGE50HT ?
A4. You can, but the F-test above is considered more appropriate.
Q5. What does the F test actually say for this example?
A5. ReSS1 = 142045417.63
ReSS3 = 141402760.03
(142045417.63 - 141402760.03)/2
F = -------------------------------- = 26.476,
141402760.03 / 11651
where F has df = (2, 11651). The p-value is: p < .0001
Q6. So what's the conclusion ?
A6. The two-line model fits better than the one-line model.
(Note, however, that there is not much difference in the
R-square).
Q7. Is it correct to say that since the slope of weight as
a function of height is not the same in the two age
groups, then there is an INTERACTION between age and height
as predictors of weight ?
A7. To say this, you need to check that the equal-slopes
model (Model 2) is significantly 'worse' than the
unequal-slopes model (Model 3). The appropriate F-statistic
is:
(ReSS2 - ReSS3)/1 (141431996.07 - 141402760.03)/1
F = -------------------- = -------------------------------
ReSS3/11651 141402760.03 / 11651
= 2.4089.
The p-value in this case (see notes n54703.006) is .1207.
Thus one would not reject the hypothesis that the slopes are
the same. You cannot conclude that the there is age-height
interaction.
==================================================================================
Problem 1.
This is a continuation of problem 1 in n54703.004. Again, consider the
data on mortality and water hardness from Chapter 2 of Der-Everitt.
(1) Carry out analysis for the model:
mortality = a + b*hardness + c*norsouth + e,
where norsouth = 1 if the city is northern, 0 if the city is southern,
and e is a normally distributed error term with mean 0 and standard
deviation sigma.
How can you tell whether norsouth an important variable? What effect does it have on
R-square ?
(2) Comment on collinearity of hardness and norsouth. What does it
mean to say that a continuous variable like hardness is
"collinear" with a dichotomous variable like norsouth?
(3) Carry out a stepwise regression with the predictors being hardness
and norsouth.
(4) Back to the simpler model:
mortality = a + b*hardness + e.
Compute the coefficients separately for northern and southern
cities.
(5) Test the following hypothesis: the slope of mortality versus hardness
is different for northern cities than for southern cities.
Problem 2.
In the dataset on crime in the U.S. versus other variables in Chapter 4,
consider the regression of crime rate R versus the variable Age. Do the
regression for 3 models:
Model 1 : R = a + b*Age + e
Model 2 : R = a + b*Age + c*S + e
Model 3 : R = a + b*Age + c*S + d*Age*S + e,
where S = southern state (0 = no, 1 = yes), and e represents the 'error' term
in each model.
Test for whether Model 2 is better than Model 1 and for whether Model 3
is better than Model 2, and for whether Model 3 is better than Model 1.
Also note the R-squares for the 3 models.
n54703.005 Last update: February 29, 2004.