PROC REG, II : Multivariate regression, collinearity, and other topics     n54703.005


Multivariate Regression:

     Multivariate regression is a natural generalization of simple linear
regression.  The model may be stated as

         Y = a0 + a1*X1 + a2*X2 + ... + ap*Xp + e,

where as before the error term e has mean 0 and standard deviation sigma, and
the variables X1, X2, ..., Xp are possible predictors of the outcome Y.  The
coefficients a0, a1, a2, ..., ap are unknown.

     The object of the regression is to estimate values of a0, a1, ..., ap
such that the resulting predicted values agree as closely as possible with
the observed data.  If there are two predicting variables, the model corresponds 
to a flat plane in 3-dimensional space which provides the best fit to a cloud of 
observed data points.  This is called a *linear model* because the expression on 
the right side of the equation is a linear function of the predictors.  It is 
reasonable to assume a linear model in many situations, but there are a great 
many others where a linear model is not reasonable.
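
     To make "as closely as possible" concrete: the least-squares estimates
are the solution of the normal equations (X'X)a = X'Y.  The short Python
sketch below is an illustration only (it is not part of the SAS material,
and the data are invented); it recovers the coefficients of an exact plane
with two predictors:

```python
# Illustration: least squares for Y = a0 + a1*X1 + a2*X2 by solving the
# normal equations (X'X) a = X'Y.  Data are invented for this sketch.

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_two_predictors(x1, x2, y):
    """Return (a0, a1, a2) minimizing the sum of squared residuals."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve3(XtX, Xty)

# Data generated from the exact plane y = 1 + 2*x1 - 3*x2, so the fit
# should recover the coefficients exactly.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 2]
y  = [1 + 2*u - 3*v for u, v in zip(x1, x2)]
a0, a1, a2 = fit_two_predictors(x1, x2, y)
print(round(a0, 6), round(a1, 6), round(a2, 6))   # prints 1.0 2.0 -3.0
```

With real, noisy data the fitted coefficients would of course not match any
generating plane exactly; the point is only that the normal equations pick
the plane closest to the data in the least-squares sense.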

     The following questions arise in dealing with multivariate linear models:

     1.  How can the data be displayed?

     2.  What if some of the predictors are highly correlated with each other?
         Should they all be entered into the model?

     3.  What predictors should be entered into the model?  What subset of
         possible predictors provides an adequate fit to the data?

     4.  What do you do with residuals in multivariate regression?
         How do you look for nonlinearity?

     5.  How can you compare regression models?


We consider these questions in turn below:

1.  How can the data be displayed?

    It is a good idea to examine the relationship of the outcome
variable Y to each of the predictors.  An example of this is on page 84
of the Der-Everitt text.  The figure there is generated by a SAS macro
and shows small scatterplots of all the variables versus each other.
The 'outcome' variable for that dataset is R, the crime rate.  The
relationship of R to the other variables is shown in the leftmost
column in that figure.  This is one useful way to show multivariate
relationships, but essentially it just shows a kind of 2-dimensional
shadow of a cloud of points in (p + 1)-dimensional space (if there
are p predictors).  Since we humans are not able to visualize spaces of
dimension higher than 3, we cannot really display all the data for
this kind of model unless p = 1 or 2.

    Another graphical approach is like that shown in Chapter 2 of the
text: 3-D plots of smoothed bivariate densities.  This is accomplished
using proc kde (kernel density estimator) and proc g3d, and requires
access to graphics screens.  A program to do this for crime rate versus
unemployment in the Chapter 4 dataset on U.S. crime rates is given 
below.  The smoothed-density plot actually does not contain any more
information than a simple scatterplot.  It depicts the relationship
of two variables.
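
     For readers curious what proc kde is doing, here is a hand-rolled
Python sketch of a bivariate Gaussian kernel density estimate.  This is
an illustration only: the bandwidths hx and hy are fixed by hand here,
whereas proc kde chooses its bandwidths from the data.

```python
# Illustration of the kernel idea behind proc kde: the estimated density
# at (x, y) is an average of Gaussian bumps centered at the data points.
import math

def kde2(points, x, y, hx=1.0, hy=1.0):
    """Bivariate Gaussian kernel density estimate at the point (x, y)."""
    total = 0.0
    for px, py in points:
        total += math.exp(-0.5 * (((x - px) / hx) ** 2 + ((y - py) / hy) ** 2))
    return total / (len(points) * 2 * math.pi * hx * hy)

# Invented data: a small cluster plus one distant point.
pts = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (5.0, 7.0)]

# The density surface is higher near the cluster than far from all points.
print(kde2(pts, 1.5, 2.2) > kde2(pts, 8.0, 8.0))   # prints True
```

Plotting this estimate on a grid of (x, y) values gives a smoothed surface
like the 3-D plots produced by proc g3d.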

    The resulting 3-D plot may be seen at:

    http://www.biostat.umn.edu/%7Ejohn-c/5421/crimevsunemp2.grf

    However, it is possible to show 3-dimensional relationships: e.g.,
observed crime rates versus two covariates.  The following figure
shows this, and the program below generates the plot in the second
"proc g3d" section:

    http://www.biostat.umn.edu/%7Ejohn-c/5421/crime.vs.educ.unemp2.grf

========================================================================

footnote "~john-c/5421/chap4.sas &sysdate &systime" ;

FILENAME GRAPH 'gsas.grf' ;

OPTIONS  LINESIZE = 150 MPRINT ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = LANDSCAPE
         FTEXT = SWISSB
         DEVICE = GIF
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;        



data uscrime;                          /* Chapter 4 */
*    infile 'n:\handbook2\datasets\uscrime.dat' expandtabs;
   infile cards expandtabs ;
  input R Age S Ed Ex0 Ex1 LF M N NW U1 U2 W X;
datalines ;
79.1	151	1	91	58	56	510	950	33	301	108	41	394	261
163.5	143	0	113	103	95	583	1012	13	102	96	36	557	194
57.8	142	1	89	45	44	533	969	18	219	94	33	318	250
196.9	136	0	121	149	141	577	994	157	80	102	39	673	167
123.4	141	0	121	109	101	591	985	18	30	91	20	578	174
68.2	121	0	110	118	115	547	964	25	44	84	29	689	126
96.3	127	1	111	82	79	519	982	4	139	97	38	620	168
155.5	131	1	109	115	109	542	969	50	179	79	35	472	206
85.6	157	1	90	65	62	553	955	39	286	81	28	421	239
70.5	140	0	118	71	68	632	1029	7	15	100	24	526	174
167.4	124	0	105	121	116	580	966	101	106	77	35	657	170
84.9	134	0	108	75	71	595	972	47	59	83	31	580	172
51.1	128	0	113	67	60	624	972	28	10	77	25	507	206
66.4	135	0	117	62	61	595	986	22	46	77	27	529	190
79.8	152	1	87	57	53	530	986	30	72	92	43	405	264
94.6	142	1	88	81	77	497	956	33	321	116	47	427	247
53.9	143	0	110	66	63	537	977	10	6	114	35	487	166
92.9	135	1	104	123	115	537	978	31	170	89	34	631	165
75.0	130	0	116	128	128	536	934	51	24	78	34	627	135
122.5	125	0	108	113	105	567	985	78	94	130	58	626	166
74.2	126	0	108	74	67	602	984	34	12	102	33	557	195
43.9	157	1	89	47	44	512	962	22	423	97	34	288	276
121.6	132	0	96	87	83	564	953	43	92	83	32	513	227
96.8	131	0	116	78	73	574	1038	7	36	142	42	540	176
52.3	130	0	116	63	57	641	984	14	26	70	21	486	196
199.3	131	0	121	160	143	631	1071	3	77	102	41	674	152
34.2	135	0	109	69	71	540	965	6	4	80	22	564	139
121.6	152	0	112	82	76	571	1018	10	79	103	28	537	215
104.3	119	0	107	166	157	521	938	168	89	92	36	637	154
69.6	166	1	89	58	54	521	973	46	254	72	26	396	237
37.3	140	0	93	55	54	535	1045	6	20	135	40	453	200
75.4	125	0	109	90	81	586	964	97	82	105	43	617	163
107.2	147	1	104	63	64	560	972	23	95	76	24	462	233
92.3	126	0	118	97	97	542	990	18	21	102	35	589	166
65.3	123	0	102	97	87	526	948	113	76	124	50	572	158
127.2	150	0	100	109	98	531	964	9	24	87	38	559	153
83.1	177	1	87	58	56	638	974	24	349	76	28	382	254
56.6	133	0	104	51	47	599	1024	7	40	99	27	425	225
82.6	149	1	88	61	54	515	953	36	165	86	35	395	251
115.1	145	1	104	82	74	560	981	96	126	88	31	488	228
88.0	148	0	122	72	66	601	998	9	19	84	20	590	144
54.2	141	0	109	56	54	523	968	4	2	107	37	489	170
82.3	162	1	99	75	70	522	996	40	208	73	27	496	224
103.0	136	0	121	95	96	574	1012	29	36	111	37	622	162
45.5	139	1	88	46	41	480	968	19	49	135	53	457	249
50.8	126	0	104	106	97	599	989	40	24	78	25	593	171
84.9	130	0	121	90	91	623	1049	3	22	113	40	588	160
;

run ;

proc print data = uscrime ;

proc kde data = uscrime out = bivest ;
     var r u2 ;
run ;

proc g3d data = bivest ;
     plot r * u2 = density ;
title1 'Plot of Bivariate Density: Crime Rate vs Unemp 35-39' ;
run ;

proc g3d data = uscrime ;
     scatter ed * u2 = r / shape = 'prism' ;
title1 'Plot of crime rates vs education level and unemployment' ;
run ;

========================================================================

 2.  What if some of the predictors are highly correlated with each other?
     Should they all be entered into the model?

     The plot on page 84 of the text shows strong correlations between
some of the variables: particularly EX0 and EX1 (both related to
expenditures for police), and between W and X (both related to income).  
Their correlation (obtainable from proc corr) is greater than 0.99.  
Entering highly correlated variables as predictors in a regression can
lead to erroneous conclusions and should be avoided.  'Highly correlated'
and 'collinear' mean essentially the same thing.  SAS includes tests for
collinearity.  One such is based on something called the 'Variance
Inflation Factor' (VIF).  This can be invoked in proc reg as follows (see
also p. 85 of the Der-Everitt text):

     proc reg data = uscrime ;
          model R = Age--X / vif ;
     run ;

     Printout for this is shown on page 86.  Note that variables EX0
and EX1 have very high VIFs: 94.6 and 98.6, respectively.  A rule of thumb
is that VIFs > 10 are not good.  The VIF indicates that the variable
in question is highly correlated with the ensemble of other variables
in the analysis.
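
     The computation behind the VIF is straightforward: regress each
predictor on all the others and set VIF = 1/(1 - R-square).  With only two
predictors, that R-square is just the squared Pearson correlation between
them.  The Python sketch below (an illustration, not SAS code) applies
this to the EX0/EX1 values from the first eight states in the uscrime
data listed above:

```python
# Illustration of the VIF formula: VIF_j = 1 / (1 - R_j^2), where R_j^2
# comes from regressing predictor j on the other predictors.  With two
# predictors, R^2 is the squared Pearson correlation between them.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# EX0 and EX1 for the first eight states in the uscrime data above.
ex0 = [58, 103, 45, 149, 109, 118, 82, 115]
ex1 = [56, 95, 44, 141, 101, 115, 79, 109]

r = pearson(ex0, ex1)
vif = 1.0 / (1.0 - r * r)
print(round(r, 3), round(vif, 1))
```

Even on this small subset the correlation is near 1, so the VIF is far
above the rule-of-thumb cutoff of 10.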

     What do you do if the VIF is very large for a given variable?

     In this case, seeing the VIF for EX0 and EX1 and also noting that
they are very highly correlated, you should enter only one of them
in the model.  Given their close agreement, it does not matter
much which one you choose.  In general, one would take the approach of 
omitting variables from the model one at a time until there are none left 
which have a high VIF.

     Another more systematic approach to avoiding collinearity of
predictors is to use stepwise regression.  This will select a subset
of the predictors which does a good job of explaining most of the
variability in the outcome variable, without adding in variables
that are highly correlated with others.  This is discussed in more
detail below.


 3.  What predictors should be entered into the model?  What subset of
     possible predictors provides an adequate fit to the data?

     An outcome variable Y may be strongly related to some of the other
variables in your dataset, and weakly related, or even unrelated, to
others.  You can enter many possible predictors in a multivariate regression,
and in general the procedure will give you back the usual output: an
ANOVA table, an F-test for overall effect of all the covariates on the
outcome, R-square, coefficient estimates, standard errors, etc.  However,
if variables are entered into the model which are really unrelated to the
outcome, the effect is to decrease the apparent effect of other variables
which are genuinely related.  That is, you pay a price for entering useless
predictors: the price is to underestimate the effects of the genuine
predictors.
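
     One numeric symptom of this price shows up in the adjusted R-square
that proc reg reports: plain R-square can only increase as predictors are
added, but adjusted R-square, 1 - (1 - R^2)(n - 1)/(n - p - 1), falls when
a new variable adds almost nothing.  A quick illustration with invented
R-square values (n = 47 as in the uscrime data):

```python
# Illustration with invented R-square numbers; n = 47 matches the uscrime
# data, but the two fits themselves are hypothetical.

def adj_rsq(r2, n, p):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 47
before = adj_rsq(0.700, n, 3)   # hypothetical model with 3 predictors
after  = adj_rsq(0.701, n, 8)   # tiny R^2 gain after 5 useless predictors
print(round(before, 3), round(after, 3))   # prints 0.679 0.638
```

The raw R-square went up by 0.001, yet the adjusted value dropped: the
five useless predictors cost more in degrees of freedom than they bought
in fit.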

     Proc reg incorporates ways to select "best" models: that is, models
which exclude variables that lack a strong relationship to
the outcome variable.  Proc reg makes use of *stepwise* algorithms, which
can enter or delete variables from a model depending on certain variable
selection criteria.  One approach, called *backward* stepwise selection, starts
by putting all the variables in the model and then omitting the least
useful variables one at a time until the variables left in the model all
contribute significantly to prediction of the outcome.  The other approach,
called *forward* stepwise regression, starts with only the intercept term
in the model and adds the candidate variables one at a time until all the
potentially useful variables have been added.  It may happen that a given
predictor X by itself is fairly strongly related to the outcome, but when
other variables are already in the model which are perhaps correlated with
X, the variable X itself does not make any useful addition.
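
     The forward idea can be sketched in a few lines of Python.  This is
only a caricature: it adds whichever candidate most reduces the residual
sum of squares of a simple fit, whereas SAS's stepwise options use F-test
entry and stay criteria.  The variable names and data are invented:

```python
# Caricature of forward selection: at each step, regress the current
# residuals on each remaining candidate and add the one that reduces the
# residual sum of squares the most; stop when no candidate helps.

def fit_simple(x, y):
    """Residuals from the simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def forward_stepwise(candidates, y, tol=1e-6):
    resid = list(y)
    chosen = []
    while True:
        sse = sum(r * r for r in resid)
        best, best_resid, best_sse = None, None, sse
        for name, x in candidates.items():
            if name in chosen:
                continue
            new_resid = fit_simple(x, resid)
            new_sse = sum(r * r for r in new_resid)
            if new_sse < best_sse - tol:
                best, best_resid, best_sse = name, new_resid, new_sse
        if best is None:
            return chosen
        chosen.append(best)
        resid = best_resid

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 2, 1, 2, 1]          # unrelated to the outcome
y = [2 * v + 1 for v in x1]      # outcome depends on x1 only
print(forward_stepwise({"x1": x1, "x2": x2}, y))   # prints ['x1']
```

The useless candidate x2 is never entered, because once x1 is in the
model the residuals have nothing left for x2 to explain.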

     There is a variant of stepwise regression which is similar to 'forward'
except that variables which are entered in one step may be removed in a subsequent
step.  This gives quite a lot of flexibility for variable selection.  Below is
an example of this stepwise selection using data from the Lung Health Study:
the outcome variable is baseline FEV1, and the possible predictors are
other baseline covariates:

==================================================================================

PROC CORR DATA = SMOKE ;
     VAR S2FEVPOS AGE PACKYEAR GENDER HGTM WGTKG AFROAMER F10CIGS
         DRINKS COUGH30 ;
TITLE1 'Correlations of Baseline Variables in the Lung Health Study' ;

PROC REG DATA = SMOKE ;
     MODEL S2FEVPOS = AGE PACKYEAR GENDER HGTM WGTKG AFROAMER F10CIGS
                      DRINKS COUGH30 / VIF SELECTION = STEPWISE ;
TITLE1 'Proc reg: use of stepwise and VIF in Lung Health Study data' ;

endsas ;

--------------------------------------------------------------------------------

          Correlations of Baseline Variables in the Lung Health Study          1
                                                 17:45 Tuesday, February 3, 2004

                              Correlation Analysis

  10 'VAR' Variables:  S2FEVPOS AGE      PACKYEAR GENDER   HGTM     WGTKG   
                       AFROAMER F10CIGS  DRINKS   COUGH30 


                              Simple Statistics
 
 Variable                N             Mean          Std Dev              Sum

 S2FEVPOS             5885         2.745760         0.627641            16159
 AGE                  5887        48.468660         6.825130           285335
 PACKYEAR             5885        40.450127        19.126189           238049
 GENDER               5887         0.371157         0.483155      2185.000000
 HGTM                 5887         1.719552         0.089076            10123
 WGTKG                5886        76.005590        15.072866           447369
 AFROAMER             5887         0.038220         0.191743       225.000000
 F10CIGS              5887        31.272804        12.851235           184103
 DRINKS               5887         4.353321         5.537889            25628
 COUGH30              5887         0.422626         0.494019      2488.000000

                               Simple Statistics
 
 Variable          Minimum          Maximum     Label

 S2FEVPOS         1.150000         4.940000                                   
 AGE             34.000000        67.000000     AGE AT ENTRY INTO LHS         
 PACKYEAR                0       190.000000     PACK YEARS OF CIG SMOKING     
 GENDER                  0         1.000000     GENDER 0 = MEN, 1 = WOMEN     
 HGTM             1.420000         2.160000                                   
 WGTKG           38.400000       136.000000                                   
 AFROAMER                0         1.000000     ETHNIC GROUP: AFRO-AMER OR NOT
 F10CIGS          2.000000       120.000000     CIGS PER DAY AT SCREEN 1      
 DRINKS                  0        25.000000     DRINKS PER WEEK AMONG ALL     
 COUGH30                 0         1.000000                                   
 
 
              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 03FEB04 17:45


                              Correlation Analysis

         Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
         / Number of Observations  

                                S2FEVPOS       AGE  PACKYEAR    GENDER      HGTM

S2FEVPOS                         1.00000  -0.35781  -0.07002  -0.70189   0.76342
                                  0.0       0.0001    0.0001    0.0001    0.0001
                                    5885      5885      5883      5885      5885

AGE                             -0.35781   1.00000   0.41517   0.00938  -0.07931
AGE AT ENTRY INTO LHS             0.0001    0.0       0.0001    0.4720    0.0001
                                    5885      5887      5885      5887      5887

PACKYEAR                        -0.07002   0.41517   1.00000  -0.16390   0.08554
PACK YEARS OF CIG SMOKING         0.0001    0.0001    0.0       0.0001    0.0001
                                    5883      5885      5885      5885      5885

GENDER                          -0.70189   0.00938  -0.16390   1.00000  -0.70252
GENDER 0 = MEN, 1 = WOMEN         0.0001    0.4720    0.0001    0.0       0.0001
                                    5885      5887      5885      5887      5887

HGTM                             0.76342  -0.07931   0.08554  -0.70252   1.00000
                                  0.0001    0.0001    0.0001    0.0001    0.0   
                                    5885      5887      5885      5887      5887

WGTKG                            0.52249  -0.03590   0.12352  -0.56202   0.65363
                                  0.0001    0.0059    0.0001    0.0001    0.0001
                                    5884      5886      5884      5886      5886

AFROAMER                        -0.13084   0.01020  -0.09143   0.02474  -0.01591
ETHNIC GROUP: AFRO-AMER OR NOT    0.0001    0.4340    0.0001    0.0577    0.2224
                                    5885      5887      5885      5887      5887

F10CIGS                          0.06159  -0.06894   0.47318  -0.14237   0.09523
CIGS PER DAY AT SCREEN 1          0.0001    0.0001    0.0001    0.0001    0.0001
                                    5885      5887      5885      5887      5887

DRINKS                           0.10331   0.01647   0.02992  -0.12344   0.09372
DRINKS PER WEEK AMONG ALL         0.0001    0.2065    0.0217    0.0001    0.0001
                                    5885      5887      5885      5887      5887

COUGH30                         -0.04240  -0.02897   0.09957   0.00752  -0.02272
                                  0.0011    0.0262    0.0001    0.5641    0.0813
                                    5885      5887      5885      5887      5887
 
 
 
 

                              Correlation Analysis

         Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
         / Number of Observations  

                                   WGTKG  AFROAMER   F10CIGS    DRINKS   COUGH30

S2FEVPOS                         0.52249  -0.13084   0.06159   0.10331  -0.04240
                                  0.0001    0.0001    0.0001    0.0001    0.0011
                                    5884      5885      5885      5885      5885

AGE                             -0.03590   0.01020  -0.06894   0.01647  -0.02897
AGE AT ENTRY INTO LHS             0.0059    0.4340    0.0001    0.2065    0.0262
                                    5886      5887      5887      5887      5887

PACKYEAR                         0.12352  -0.09143   0.47318   0.02992   0.09957
PACK YEARS OF CIG SMOKING         0.0001    0.0001    0.0001    0.0217    0.0001
                                    5884      5885      5885      5885      5885

GENDER                          -0.56202   0.02474  -0.14237  -0.12344   0.00752
GENDER 0 = MEN, 1 = WOMEN         0.0001    0.0577    0.0001    0.0001    0.5641
                                    5886      5887      5887      5887      5887

HGTM                             0.65363  -0.01591   0.09523   0.09372  -0.02272
                                  0.0001    0.2224    0.0001    0.0001    0.0813
                                    5886      5887      5887      5887      5887

WGTKG                            1.00000   0.00712   0.12452   0.02545  -0.00144
                                  0.0       0.5852    0.0001    0.0508    0.9120
                                    5886      5886      5886      5886      5886

AFROAMER                         0.00712   1.00000  -0.11793  -0.02648  -0.07729
ETHNIC GROUP: AFRO-AMER OR NOT    0.5852    0.0       0.0001    0.0422    0.0001
                                    5886      5887      5887      5887      5887

F10CIGS                          0.12452  -0.11793   1.00000   0.05608   0.17580
CIGS PER DAY AT SCREEN 1          0.0001    0.0001    0.0       0.0001    0.0001
                                    5886      5887      5887      5887      5887

DRINKS                           0.02545  -0.02648   0.05608   1.00000   0.03986
DRINKS PER WEEK AMONG ALL         0.0508    0.0422    0.0001    0.0       0.0022
                                    5886      5887      5887      5887      5887

COUGH30                         -0.00144  -0.07729   0.17580   0.03986   1.00000
                                  0.9120    0.0001    0.0001    0.0022    0.0   
                                    5886      5887      5887      5887      5887
 
 
 

           Proc reg: use of stepwise and VIF in Lung Health Study data          4
                                                 17:45 Tuesday, February 3, 2004

               Stepwise Procedure for Dependent Variable S2FEVPOS

Step 1   Variable HGTM Entered      R-square = 0.58277189   C(p) =4107.9115432

                 DF         Sum of Squares      Mean Square          F   Prob>F

 Regression       1          1350.30512662    1350.30512662    8213.01   0.0001
 Error         5880           966.73373487       0.16441050
 Total         5881          2317.03886149

                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F

 INTERCEP      -6.50720345      0.10223574     666.05914651    4051.20   0.0001
 HGTM           5.38101003      0.05937623    1350.30512662    8213.01   0.0001

Bounds on condition number:            1,            1
--------------------------------------------------------------------------------

Step 2   Variable AGE Entered       R-square = 0.67130902   C(p) =1990.8694694

                 DF         Sum of Squares      Mean Square          F   Prob>F

 Regression       2          1555.44908864     777.72454432    6003.55   0.0001
 Error         5879           761.58977285       0.12954410
 Total         5881          2317.03886149

                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F

 INTERCEP      -4.88569345      0.09947821     312.47421230    2412.11   0.0001
 AGE           -0.02745985      0.00069005     205.14396202    1583.58   0.0001
 HGTM           5.21202416      0.05287638    1258.65595418    9716.04   0.0001

Bounds on condition number:     1.006492,     4.025966
--------------------------------------------------------------------------------

 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>  Steps 3-7 OMITTED.  <<<<<<<<<<<<<<<<<<<<<<<<<<<<

--------------------------------------------------------------------------------

Step 8   Variable PACKYEAR Entered  R-square = 0.75465759   C(p) =  8.01000201

                 DF         Sum of Squares      Mean Square          F   Prob>F

 Regression       8          1748.57096190     218.57137024    2258.12   0.0001
 Error         5873           568.46789959       0.09679344
 Total         5881          2317.03886149

                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F

 INTERCEP      -1.44889268      0.12197908      13.65675946     141.09   0.0001
 AGE           -0.02841031      0.00069827     160.23325283    1655.41   0.0001
 PACKYEAR      -0.00081371      0.00028272       0.80181738       8.28   0.0040
 GENDER        -0.47249446      0.01197224     150.76096651    1557.55   0.0001
 HGTM           3.42461517      0.06437664     273.91311523    2829.87   0.0001
 AFROAMER      -0.39917697      0.02136117      33.80078310     349.21   0.0001
 F10CIGS       -0.00269115      0.00038467       4.73734587      48.94   0.0001
 DRINKS         0.00226678      0.00073970       0.90898139       9.39   0.0022
 COUGH30       -0.04576476      0.00837485       2.89037675      29.86   0.0001

Bounds on condition number:     2.033289,     94.01067
--------------------------------------------------------------------------------

 

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.

         Summary of Stepwise Procedure for Dependent Variable S2FEVPOS

        Variable        Number   Partial    Model
 Step   Entered Removed     In      R**2     R**2      C(p)          F   Prob>F
        Label

    1   HGTM                 1    0.5828   0.5828 4107.9115  8213.0103   0.0001
                                                
    2   AGE                  2    0.0885   0.6713 1990.8695  1583.5840   0.0001
        AGE AT ENTRY INTO LHS                   
    3   GENDER               3    0.0639   0.7352  462.7931  1419.2966   0.0001
        GENDER 0 = MEN, 1 = WOMEN               
    4   AFROAMER             4    0.0123   0.7476  169.3526   287.4031   0.0001
        ETHNIC GROUP: AFRO-AMER OR NOT          
    5   F10CIGS              5    0.0051   0.7527   49.9965   120.4542   0.0001
        CIGS PER DAY AT SCREEN 1                
    6   COUGH30              6    0.0012   0.7539   22.1442    29.7755   0.0001
                                                
    7   DRINKS               7    0.0004   0.7543   14.2924     9.8413   0.0017
        DRINKS PER WEEK AMONG ALL               
    8   PACKYEAR             8    0.0003   0.7547    8.0100     8.2838   0.0040
        PACK YEARS OF CIG SMOKING               
 
 

Model: MODEL1  
Dependent Variable: S2FEVPOS                                           

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            8   1748.57096    218.57137     2258.122       0.0001
        Error         5873    568.46790      0.09679
        C Total       5881   2317.03886

            Root MSE       0.31112     R-square       0.7547
            Dep Mean       2.74558     Adj R-sq       0.7543
            C.V.          11.33153

                              Parameter Estimates

                       Parameter      Standard    T for H0:               
      Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

      INTERCEP   1     -1.448893    0.12197908       -11.878        0.0001
      AGE        1     -0.028410    0.00069827       -40.687        0.0001
      PACKYEAR   1     -0.000814    0.00028272        -2.878        0.0040
      GENDER     1     -0.472494    0.01197224       -39.466        0.0001
      HGTM       1      3.424615    0.06437664        53.197        0.0001
      AFROAMER   1     -0.399177    0.02136117       -18.687        0.0001
      F10CIGS    1     -0.002691    0.00038467        -6.996        0.0001
      DRINKS     1      0.002267    0.00073970         3.064        0.0022
      COUGH30    1     -0.045765    0.00837485        -5.465        0.0001

                        Variance  Variable
      Variable  DF     Inflation     Label

      INTERCEP   1    0.00000000  Intercept                               
      AGE        1    1.37934114  AGE AT ENTRY INTO LHS                   
      PACKYEAR   1    1.77696126  PACK YEARS OF CIG SMOKING               
      GENDER     1    2.03328873  GENDER 0 = MEN, 1 = WOMEN               
      HGTM       1    1.99671021                                          
      AFROAMER   1    1.02011203  ETHNIC GROUP: AFRO-AMER OR NOT          
      F10CIGS    1    1.48491823  CIGS PER DAY AT SCREEN 1                
      DRINKS     1    1.01995800  DRINKS PER WEEK AMONG ALL               
      COUGH30    1    1.04004387                                          

 
==================================================================================

 Notes on preceding printout:

     A 'proc corr' was run before the 'proc reg'.  This shows the correlations
of the outcome variable (S2FEVPOS) with the other baseline variables
entered in the regression, and of those variables with one another.
S2FEVPOS is strongly correlated with age (r = -.358), gender (r = -.702),
height (r = .763), and weight (r = .522).
It is significantly correlated with most of the other variables.  There are
a number of other strong correlations (gender and height, for example).
Only one of the variables is not included in the final model: WGTKG (weight
in kilograms).  If the variable HGTM (height in meters) had not been included
in the list, the stepwise selection process *would* have included WGTKG.  It is
*not* correct to conclude that, if a stepwise selection process does not select
a given variable, then that variable is not strongly related to the outcome
variable.

     Note that the stepwise procedure enters variables into the model if their
significance level is < .1500.  This default can be changed in the procedure.

     Note also that the variance inflation factor is relatively small for
all variables entered into the model: the maximum VIF is 2.033, for GENDER.


 4.  What do you do with residuals in multivariate regression?

     As with simple linear regression, proc reg can compute residuals
for multivariate regression.  It can be useful to plot the residuals
against either (1) single predictors that were entered into the regression,
or (2) the predicted value of the outcome variable.

     Plotting residuals against individual variables can indicate whether
the relationship between that predictor and the outcome is actually linear.
The residuals must average to 0, but this does not exclude the possibility
that the pattern of residuals might resemble, for example, a rainbow.

     Example: the scatterplot of W vs EX1 on page 84 of the text.  W is a measure
of wealth for each state.  EX1 is the per capita expenditure for police in the
year 1959.  The scatterplot indicates a curvilinear relationship between these
two variables.  If you regress W on EX1, the plot suggests that it might be
useful to add a quadratic term to the model.

      Model 1 (linear)     W = a + b*EX1 + e

      Model 2 (quadratic)  W = a + b*EX1 + c*EX1*EX1 + e.
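
     The logic of this check can be sketched numerically: fit the
straight-line model to data with a genuine quadratic trend, then correlate
the residuals with the squared (centered) predictor; a correlation near
+1 or -1 signals curvature.  The data in the sketch below are invented,
not the uscrime values:

```python
# Illustration of the residual check for nonlinearity: residuals from a
# straight-line fit to quadratic data show a strong 'rainbow' pattern,
# detectable as a large correlation with the squared centered predictor.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def linear_residuals(x, y):
    """Residuals from the simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

x = list(range(1, 11))
y = [0.5 * v * v + v for v in x]          # truly quadratic in x
resid = linear_residuals(x, y)
mx = sum(x) / len(x)
curv = pearson(resid, [(v - mx) ** 2 for v in x])
print(round(curv, 2))                     # close to +1: strong curvature
```

A value of curv near 0 would instead suggest that the straight-line model
is adequate, at least with respect to quadratic departures.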

     The following program-segment uses proc reg to analyze these data using Model
1 and Model 2.  The first analysis includes an output file of the W-residuals
from Model 1.  These residuals were then plotted against EX1, and the result
has a (very roughly) curvilinear shape.  The second analysis is a stepwise one, 
with both EX1 and EX1SQ ( = EX1*EX1) offered to the procedure.  The stepwise 
algorithm first selected EX1, and then selected EX1SQ.  This is a strong indication 
that the relationship of W to EX1 is not linear (as suggested by the scatterplot).

==================================================================================

proc reg data = uscrime ;
     model w = ex1 ;
     output out = regout r = wresid ;
title1 'Regression of W (wealth) vs EX1 (police expenditures in 1959)' ;
title2 'plus computation of residuals' ;
run ;

options pagesize = 40 ;

proc plot data = regout ;
     plot wresid * ex1 ;
title1 'Plot of Wealth residuals versus EX1 to detect nonlinearity ...' ;

options pagesize = 60 ;

data uscrime ;
     set uscrime ;
     ex1sq = ex1 * ex1 ;      /* quadratic term used in the model below */
run ;

proc reg data = uscrime ;
     model w = ex1 ex1sq / selection = stepwise ;
     output out = regout r = wresid ;
title1 'Stepwise regression ... crime dataset ... wealth vs police expenditures' ;
title2 'Both EX1 and EX1-squared are entered ...' ;

endsas ;
--------------------------------------------------------------------------------
         Regression of W (wealth) vs EX1 (police expenditures in 1959)         1
                         plus computation of residuals
                                                 19:22 Tuesday, February 3, 2004

Model: MODEL1  
Dependent Variable: W                                                  

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            1 270183.34186 270183.34186       76.902       0.0001
        Error           45 158099.76452   3513.32810
        C Total         46 428283.10638

            Root MSE      59.27333     R-square       0.6309
            Dep Mean     525.38298     Adj R-sq       0.6226
            C.V.          11.28193

                              Parameter Estimates

                       Parameter      Standard    T for H0:               
      Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

      INTERCEP   1    305.469730   26.52592383        11.516        0.0001
      EX1        1      2.740897    0.31255237         8.769        0.0001

          Plot of Wealth residuals versus EX1 to detect nonlinearity ...        2

            Plot of WRESID*EX1.  Legend: A = 1 obs, B = 2 obs, etc.

       |
       |
   150 +
       |
       |
       |
       |
       |
       |                A
   100 +                       A
       |                         A
       |
       |                   A
       |                 A                         A
       |                   A
       |              A
    50 +                                 A
       |
       |   A      A  A   A  A         A       A     A
R      |                            A
e      |           A          A          A
s      |                                 A
i      |               A                           A
d    0 +          A        A
u      |      A                         A   A
a      |                                  A
l      |               A     A    A                               AA
       |                                                  A
       |
       |         A
   -50 +
       |          B   A
       |           A
       |
       |           A
       |
       |                      A
  -100 +                                                                  A
       |    A
       |
       |
       |
       |    A                                   A
       |
  -150 +
       |
       ---+----------+----------+----------+----------+----------+----------+--
         40         60         80         100        120        140        160

                                          EX1

                                                                                
    Stepwise regression ... crime dataset ... wealth vs police expenditures    3
                    Both EX1 and EX1-squared are entered ...

               Stepwise Procedure for Dependent Variable W       

Step 1   Variable EX1 Entered       R-square = 0.63085220   C(p) = 12.31414903

                 DF         Sum of Squares      Mean Square          F   Prob>F

 Regression       1        270183.34185858  270183.34185858      76.90   0.0001
 Error           45        158099.76452440    3513.32810054
 Total           46        428283.10638298

                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F

 INTERCEP     305.46972954     26.52592383  465922.87567564     132.62   0.0001
 EX1            2.74089703      0.31255237  270183.34185858      76.90   0.0001

Bounds on condition number:            1,            1
--------------------------------------------------------------------------------

Step 2   Variable EX1SQ Entered     R-square = 0.70635898   C(p) =  3.00000000

                 DF         Sum of Squares      Mean Square          F   Prob>F

 Regression       2        302521.61930531  151260.80965266      52.92   0.0001
 Error           44        125761.48707767    2858.21561540
 Total           46        428283.10638298

                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F

 INTERCEP      83.49815183     70.19451350    4044.29080773       1.41   0.2406
 EX1            8.12669513      1.62580398   71414.52619009      24.99   0.0001
 EX1SQ         -0.02917694      0.00867419   32338.27744673      11.31   0.0016

Bounds on condition number:     33.25941,     133.0376
--------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.

         Summary of Stepwise Procedure for Dependent Variable W       

        Variable        Number   Partial    Model
 Step   Entered Removed     In      R**2     R**2      C(p)          F   Prob>F

    1   EX1                  1    0.6309   0.6309   12.3141    76.9024   0.0001
    2   EX1SQ                2    0.0755   0.7064    3.0000    11.3141   0.0016

==================================================================================

     Residuals may also be used to determine whether variability in the data
is related to the magnitude of the outcome variable (i.e., heteroscedasticity).
If this appears to be the case, a transformation of the outcome variable may be
warranted (see n54703.004 for more discussion of this).
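     One crude check along these lines is to ask whether the absolute
residuals grow with the fitted values.  The sketch below is purely
illustrative (Python rather than SAS, with invented data): it fits a line by
least squares and correlates |residual| with the fitted value.  A clearly
positive correlation is a warning sign of nonconstant variance.

```python
# Illustration only: invented data whose error spread grows with x,
# a least-squares line fit, and the |residual|-vs-fitted correlation.
import random

random.seed(1)
x = [i / 10 for i in range(1, 201)]
y = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]   # sd grows with x

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = (sum((ui - mu) ** 2 for ui in u)
           * sum((vi - mv) ** 2 for vi in v)) ** 0.5
    return num / den

r = corr([abs(e) for e in resid], fitted)
print(r > 0.1)   # a clearly positive r suggests nonconstant variance
```

This is the same question the residual plot answers graphically: does the
vertical spread of the points fan out as the fitted value increases?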


5.  How Do You Compare Regression Models ?

    Assume that an outcome variable Y is believed to be related to a predictor
    X as follows:

               Y = a + b * X  +  e,

    where as usual e has a normal distribution with standard deviation sigma.

    It may happen that this model holds in each of two strata, but that the
    intercepts "a" and slopes "b" are different.

    Two questions arise:

    1.  How do you detect this ?

    2.  What do you do about it ?

    An example might be the following.  The relationship of height and weight
    may differ according to age group.  This can be tested using a large
    dataset such as that from the Multiple Risk Factor Intervention Trial.
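     Before looking at the SAS runs below, note what the extra variables do.
AGEGE50 is a 0/1 age-group indicator, and AGE50HGT is the interaction term
AGEGE50*HEIGHTCM, so a single equation encodes one line per age group.  A
tiny Python sketch with invented coefficients (the real estimates appear in
the output that follows):

```python
# Sketch of why one equation gives two lines.  The coefficients here
# are invented for illustration; they are not the fitted estimates.

a, b, c, d = -758.0, 9.17, 73.6, -0.51   # intercept, HEIGHTCM, AGEGE50, AGE50HGT

def predict(h, g):
    # Model 3:  WGTKG = a + c*AGEGE50 + b*HEIGHTCM + d*(AGEGE50*HEIGHTCM)
    return a + c * g + b * h + d * (g * h)

h = 175.0
assert abs(predict(h, 0) - (a + b * h)) < 1e-9             # G = 0: a + b*H
assert abs(predict(h, 1) - ((a + c) + (b + d) * h)) < 1e-9 # G = 1: (a+c) + (b+d)*H
print("one equation, two lines")
```

Setting g = 0 recovers the line for the younger group; setting g = 1 shifts
the intercept by c and the slope by d, which is exactly what Models 2 and 3
below exploit.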

*=================================================================== ;

 options linesize = 80 ;

 PROC REG DATA = SEL ;
      MODEL WGTKG = HEIGHTCM ;
 TITLE1 'MODEL 1: WEIGHTKG VS HEIGHTCM: ONE LINE' ;
 RUN ;

 PROC REG DATA = SEL ;
      MODEL WGTKG = AGEGE50 HEIGHTCM ;
 TITLE1 'MODEL 2: WEIGHTKG VS HEIGHTCM: TWO PARALLEL LINES, SEPARATE INTERCEPTS';
 RUN ;

 PROC REG DATA = SEL ;
      MODEL WGTKG = AGEGE50 HEIGHTCM AGE50HGT ;
 TITLE1 'MODEL 3: WEIGHTKG VS HEIGHTCM: TWO LINES, TWO INTERCEPTS, TWO SLOPES' ;
 RUN ;

 ENDSAS ;
------------------------------------------------------------------------
                    MODEL 1: WEIGHTKG VS HEIGHTCM: ONE LINE                    1
                                                  19:09 Sunday, February 8, 2004

Model: MODEL1  
Dependent Variable: WGTKG                                              

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            1 42597433.044 42597433.044     3493.372       0.0001
        Error        11649 142045417.63 12193.786387
        C Total      11650 184642850.67

            Root MSE     110.42548     R-square       0.2307
            Dep Mean     850.15488     Adj R-sq       0.2306
            C.V.          12.98887

                              Parameter Estimates

                       Parameter      Standard    T for H0:               
      Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

      INTERCEP   1   -751.200974   27.11282728       -27.706        0.0001
      HEIGHTCM   1      9.103589    0.15402464        59.105        0.0001

 
 
                      billings.sas (joanneb) 08FEB04 19:09
------------------------------------------------------------------------
      MODEL 2: WEIGHTKG VS HEIGHTCM: TWO PARALLEL LINES, SEPARATE INTERCEPTS    2

Model: MODEL1  
Dependent Variable: WGTKG                                              

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            2 43210854.606 21605427.303     1779.371       0.0001
        Error        11648 141431996.07 12142.169992
        C Total      11650 184642850.67

            Root MSE     110.19152     R-square       0.2340
            Dep Mean     850.15488     Adj R-sq       0.2339
            C.V.          12.96135

                              Parameter Estimates

                       Parameter      Standard    T for H0:               
      Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

      INTERCEP   1   -728.268559   27.24707991       -26.728        0.0001
      AGEGE50    1    -15.381681    2.16407565        -7.108        0.0001
      HEIGHTCM   1      9.002956    0.15434904        58.329        0.0001


 
------------------------------------------------------------------------
       MODEL 3: WEIGHTKG VS HEIGHTCM: TWO LINES, TWO INTERCEPTS, TWO SLOPES     3

Model: MODEL1  
Dependent Variable: WGTKG                                              

                              Analysis of Variance

                                 Sum of         Mean
        Source          DF      Squares       Square      F Value       Prob>F

        Model            3 43240090.637 14413363.546     1187.194       0.0001
        Error        11647 141402760.03  12140.70233
        C Total      11650 184642850.67

            Root MSE     110.18486     R-square       0.2342
            Dep Mean     850.15488     Adj R-sq       0.2340
            C.V.          12.96056

                              Parameter Estimates

                       Parameter      Standard    T for H0:               
      Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

      INTERCEP   1   -758.286174   33.41393737       -22.694        0.0001
      AGEGE50    1     73.626432   57.39857942         1.283        0.1996
      HEIGHTCM   1      9.173180    0.18935060        48.445        0.0001
      AGE50HGT   1     -0.507207    0.32684974        -1.552        0.1207


 

==================================================================================

     The stratifying factor in this example is age group, coded as:

         AGEGE50 = 0 :   Age less than 50
         AGEGE50 = 1 :   Age 50 or greater.

     Model 1 simply fits one straight line to weight (kg) as a function of
     height (cm).  The regression coefficient for height indicates that each
     increase in height of 1 cm is associated with a weight increase of about
     9.1 kg.

     Model 2 fits two separate but parallel lines with different intercepts.
     The equation of the line for men < 50 years old is:

           WGTKG = -728.27 + 9.00 * HEIGHTCM

     The equation of the line for men >= 50 years old is:

           WGTKG = -743.65 + 9.00 * HEIGHTCM

     Note here that the coefficient of AGEGE50 is -15.38, and the corresponding
     t-statistic is -7.108 (p = .0001).

     Model 3 fits two separate lines with different slopes and intercepts.
     The equations of these lines are:

         Men < 50 :  WGTKG = -758.29 + 9.17 * HEIGHTCM

         Men >= 50:  WGTKG = -684.66 + 8.67 * HEIGHTCM

     Here the coefficient of AGE50HGT is -.507, which indicates that
weight is not as strongly related to height in older men as it is in
younger.  The t-statistic for this coefficient is -1.552 (p = .1207).
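     The stratum-specific equations above are just arithmetic on the printed
parameter estimates; a quick check (numbers copied from the Model 2 and
Model 3 output):

```python
# Check the two-line equations against the printed estimates.

# Model 2 (parallel lines): AGEGE50 shifts only the intercept.
int2, age2 = -728.268559, -15.381681
assert round(int2 + age2, 2) == -743.65      # intercept, men >= 50

# Model 3 (interaction): AGEGE50 shifts the intercept, AGE50HGT the slope.
int3, age3 = -758.286174, 73.626432
slope3, inter3 = 9.173180, -0.507207
assert round(int3 + age3, 2) == -684.66      # intercept, men >= 50
assert round(slope3 + inter3, 2) == 8.67     # slope, men >= 50 (~8.666)
print("line equations agree with the estimates")
```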

     How do you test which of these models is best ?

     Note that there is a hierarchy here: you go from simple to more
complicated models:

     Model 1:  1 line;                          2 parameters
     Model 2:  2 lines, parallel (same slope):  3 parameters
     Model 3:  2 lines, two different slopes:   4 parameters.

     Or, to put it another way: the *model* statements in proc reg are:

     Model 1:    model  weight = height ;
     Model 2:    model  weight = agege50 height ;
     Model 3:    model  weight = agege50 height age50hgt ;

     This makes the hierarchical relationship very clear: the independent
variables in Model 1 are a subset of those in Models 2 and 3, and those
in Model 2 are a subset of those in Model 3.

     What you might want to know here is: Is Model 2 better than Model 1?
Is Model 3 better than Model 2 ?  Is Model 3 better than Model 1?

     Here is how you carry out a test for hierarchically related models of
this kind:

     1.  Compute the residual [or error] sum of squares for Model 3:  ReSS3.
     2.  Compute the residual sum of squares from Model 1          :  ReSS1.

                        (ReSS1 - ReSS3) / 2
         Compute F =   ---------------------.
                           ReSS3 / (n - 4)

     3.  Compare this to an F-distribution with degrees of freedom (2, n - 4).
         (Use tables).
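     The three steps above are easy to script.  A small Python sketch of the
general recipe (the function is generic; the numbers plugged in below are the
residual sums of squares from the height/weight output in this section, with
n = 11651):

```python
# Nested-model F test as described in steps 1-3 above.
# p_small, p_big = number of parameters (including the intercept)
# in the smaller and larger models.

def nested_f(ress_small, ress_big, p_small, p_big, n):
    """F statistic on (p_big - p_small, n - p_big) degrees of freedom."""
    num = (ress_small - ress_big) / (p_big - p_small)
    den = ress_big / (n - p_big)
    return num / den

# Model 1 (2 parameters) versus Model 3 (4 parameters):
f = nested_f(142045417.63, 141402760.03, 2, 4, 11651)
print(round(f, 2))   # -> 26.47, to be compared with F(2, 11647)
```

The same function handles any hierarchically related pair, e.g. Model 2
versus Model 3 with p_small = 3 and p_big = 4.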

     Questions about this:

       Q1.  What is that "2" in the denominator of the numerator of F ?

       A1.  That is the difference in the number of parameters between
            Model 1 and Model 3.


       Q2.  What is the (n - 4) term?

       A2.  That is total sample size (n) minus the number of parameters (4)
            in the 'bigger' model.


       Q3.  Is F always positive ?

       A3.  If it isn't, you have made a mistake.  The residual sum of
            squares for the smaller model should always be greater than
            the residual sum of squares for the larger one.


       Q4.  Say I am testing Model 3 versus Model 2.  Can't I just look at the
            t-statistic for the added variable, AGE50HGT ?

       A4.  You can.  When the larger model adds exactly one variable, the
            two tests are equivalent (F equals the square of the t-statistic);
            the F-test also covers comparisons that add several variables at
            once.


       Q5.  What does the F test actually say for this example?

       A5.  ReSS1 = 142045417.63
            ReSS3 = 141402760.03.   Here n = 11651, so n - 4 = 11647.

                 (142045417.63 - 141402760.03)/2
            F =  -------------------------------- =  26.47,
                       141402760.03 / 11647

            where F has df = (2, 11647).  The p-value is:  p < .0001

       Q6.  So what's the conclusion ?

       A6.  The two-line model fits better than the one-line model.
            (Note, however, that there is not much difference in the
             R-square).


       Q7.  Is it correct to say that since the slope of weight as
            a function of height is not the same in the two age
            groups, then there is an INTERACTION between age and height
            as predictors of weight ?

       A7.  To say this, you need to check that the equal-slopes
            model (Model 2) is significantly 'worse' than the
            unequal-slopes model (Model 3).  The appropriate F-statistic
            is:

                    (ReSS2 - ReSS3)/1    (141431996.07 - 141402760.03)/1
             F =  -------------------- = -------------------------------
                       ReSS3/11647             141402760.03 / 11647

               =  2.4081.

        The p-value in this case (see notes n54703.006) is .1207.
         Thus one would not reject the hypothesis that the slopes are
         the same.  You cannot conclude that there is an age-height
         interaction.
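     As a numerical check on A4 and A7: since Model 3 adds exactly one
variable to Model 2, this F statistic should equal the square of the
t-statistic printed for AGE50HGT, up to rounding of the sums of squares:

```python
# Model 2 vs Model 3: one added parameter, so F = t**2 (up to rounding).

ress2, ress3, n = 141431996.07, 141402760.03, 11651
f = (ress2 - ress3) / 1 / (ress3 / (n - 4))   # df = (1, n - 4)
t = -1.552                                    # t for AGE50HGT, Model 3 output
print(round(f, 2), round(t * t, 2))           # -> 2.41 2.41
```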

==================================================================================
Problem 1.

     This is a continuation of problem 1 in n54703.004.  Again, consider the
data on mortality and water hardness from Chapter 2 of Der-Everitt.

     (1) Carry out analysis for the model:

               mortality = a + b*hardness + c*norsouth + e,

         where norsouth = 1 if the city is northern, 0 if the city is southern,
         and e is a normally distributed error term with mean 0 and standard
         deviation sigma.

          How can you tell whether norsouth is an important variable ?  What
          effect does it have on R-square ?

     (2) Comment on collinearity of hardness and norsouth.  What does it
         mean to say that a continuous variable like hardness is
         "collinear" with a dichotomous variable like norsouth?

     (3) Carry out a stepwise regression with the predictors being hardness
         and norsouth.

     (4) Back to the simpler model:

            mortality = a + b*hardness + e.

         Compute the coefficients separately for northern and southern
         cities.

     (5) Test the following hypothesis: the slope of mortality versus hardness
         is different for northern cities than for southern cities.

Problem 2.

     In the dataset on crime in the U.S. versus other variables in Chapter 4,
consider the regression of crime rate R versus the variable Age.  Do the
regression for 3 models:

           Model 1 :    R = a + b*Age + e
           Model 2 :    R = a + b*Age + c*S + e
           Model 3 :    R = a + b*Age + c*S + d*Age*S + e,

where S = southern state (0 = no, 1 = yes), and e represents the 'error' term
in each model.

     Test whether Model 2 is better than Model 1, whether Model 3 is better
than Model 2, and whether Model 3 is better than Model 1.  Also note the
R-squares for the 3 models.

n54703.005  Last update: February 29, 2004.