In the following I present pieces of two papers on how multicollinearity affects regression and how multicollinearity may be used to advantage in creating latent variables underlying the collinear indicators. Both papers are available as .ps or .pdf files from the biostat research report series.
ABSTRACT
This paper revisits the topic of multicollinearity but not from the point of view of detecting it or trying to fix it. Instead, we explore the different changes that occur to the estimated regression coefficients, the R2, and the estimated standard errors as the correlation varies from small to large. We demonstrate at what exact points of collinearity certain results are found in the simple bivariate regression of a continuous outcome variable on 2 continuous predictor variables. We present several figures to provide intuition about how the changes in the estimated regression coefficients, the R2, and the estimated standard errors occur smoothly as the collinearity changes. A new surprising result is also demonstrated that shows that after a certain level of collinearity between the predictors, the estimated standard errors associated with the estimated regression coefficient actually become smaller rather than larger.
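The last claim in the abstract can be explored numerically. The sketch below is my own illustration, not code from the paper. It assumes the standard two-predictor formulas for standardized variables, and it assumes that the correlations of each predictor with the outcome are held fixed while r12 varies. The values r_y1 = 0.3, r_y2 = 0.6 and n = 100 are hypothetical.

```python
import math

def r_squared(ry1, ry2, r12):
    """R2 for regressing y on two predictors, from the three pairwise correlations."""
    return (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)

def std_error_beta(ry1, ry2, r12, n):
    """Standard error of a standardized coefficient in a two-predictor regression.

    With two predictors the same value applies to both coefficients.
    """
    r2 = r_squared(ry1, ry2, r12)
    if not (abs(r12) < 1 and 0 <= r2 <= 1):
        # The three correlations must form a valid correlation matrix.
        raise ValueError("inconsistent set of correlations")
    return math.sqrt((1 - r2) / ((n - 3) * (1 - r12**2)))

# Hypothetical values: r_y1 = 0.3, r_y2 = 0.6, n = 100.
for r12 in (0.0, 0.5, 0.8, 0.9, 0.93):
    print(r12, round(std_error_beta(0.3, 0.6, r12, 100), 4))
```

Under these assumptions the standard error first grows with r12 and then shrinks again close to the largest r12 that is still consistent with the other two correlations, because R2 approaches 1 there.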
EXCERPT FROM INTRODUCTION
Multiple regression rather than separate simple linear regressions
is often used because researchers have some notion that they should
``adjust'' the effect of a predictor on an outcome by other
confounding predictors. A common phenomenon related to the idea
of ``adjusting'' that is seen when performing a multiple regression
is the following:
Scenario 1: A simple bivariate relationship between a single
predictor X1 and an outcome variable Y is found to be statistically
significant, but, after adjusting for some other predictor variable
X2, i.e., performing the multiple regression of Y on X1 and X2, the
effect of X1 on Y is close to zero and not significant anymore.
In general, researchers appear to be comfortable with this phenomenon
and often then conclude that X1 does not have a significant effect
on the outcome Y once X2 has been taken into account. On the other
hand, the following scenario does not tend to sit well with most
researchers' intuition about the way that one variable's effect on
the outcome is ``adjusted'' by another variable:
Scenario 2: A simple bivariate relationship between a single
predictor X1 and an outcome variable Y is found to be statistically
significant and positive, but, after adjusting for some other
predictor variable X2, the effect of X1 on Y remains significant but is now
negative.
As will be discussed in detail in this paper, the phenomena in both Scenario 1 and Scenario 2 occur as a result of a certain amount of correlation or multicollinearity existing between the predictors X1 and X2.
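Both scenarios can be reproduced with the textbook formula for standardized coefficients in a two-predictor regression, beta1 = (r_y1 - r12 * r_y2) / (1 - r12^2). The sketch below is an illustration under that assumption and is not taken from the paper; the correlations 0.3 and 0.6 are hypothetical.

```python
def standardized_betas(ry1, ry2, r12):
    """Standardized coefficients for regressing y on X1 and X2.

    ry1, ry2: correlations of X1 and X2 with the outcome Y.
    r12: correlation between the two predictors.
    """
    denom = 1 - r12**2
    beta1 = (ry1 - r12 * ry2) / denom
    beta2 = (ry2 - r12 * ry1) / denom
    return beta1, beta2

# Hypothetical setting: X1 is positively related to Y on its own (r_y1 = 0.3).
ry1, ry2 = 0.3, 0.6
for r12 in (0.0, 0.5, 0.7):
    beta1, beta2 = standardized_betas(ry1, ry2, r12)
    print(f"r12={r12:.1f}  beta1={beta1:+.3f}  beta2={beta2:+.3f}")
```

With these values, beta1 equals the simple correlation when r12 = 0, drops to zero at r12 = 0.5 (Scenario 1), and turns negative at r12 = 0.7 (Scenario 2), even though the simple correlation between X1 and Y is unchanged.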
| Region | Name | range for r12 | effect on | effect on | effect on R2 |
| I | Cooperative suppression + Enhancement | (r12lower, 0) | | | R2 > r_y1^2 + r_y2^2 |
| II | Redundancy | (0, | | | R2 < r_y1^2 + r_y2^2 |
| III | Net suppression | ( | | | R2 < r_y1^2 + r_y2^2 |
| IV | Net suppression + Enhancement | | | | R2 > r_y1^2 + r_y2^2 |
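The last column of the table compares R2 with the sum of the squared simple correlations. A minimal sketch of that comparison follows. It assumes the standard two-predictor expression R2 = (r_y1^2 + r_y2^2 - 2 r_y1 r_y2 r12) / (1 - r12^2), and the function name `shows_enhancement` and the numerical values are my own hypothetical choices, not the paper's.

```python
def r_squared(ry1, ry2, r12):
    """R2 for regressing y on two predictors, from the three pairwise correlations."""
    return (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)

def shows_enhancement(ry1, ry2, r12):
    """True when R2 exceeds the sum of the squared simple correlations."""
    return r_squared(ry1, ry2, r12) > ry1**2 + ry2**2

# Hypothetical correlations with the outcome: r_y1 = 0.3, r_y2 = 0.6.
ry1, ry2 = 0.3, 0.6
for r12 in (-0.2, 0.5, 0.7, 0.9):
    r2 = r_squared(ry1, ry2, r12)
    print(f"r12={r12:+.1f}  R2={r2:.3f}  enhancement={shows_enhancement(ry1, ry2, r12)}")
```

For these values, enhancement appears for the negative r12 and again for the largest r12, while the two intermediate values give an R2 below r_y1^2 + r_y2^2, which matches the pattern in the R2 column.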
A Comparison of Multiple Regression to Two Latent Variable
Techniques for Estimation and Prediction
by Ruifeng Li and Melanie M. Wall
ABSTRACT
In the areas of epidemiology, psychology, sociology, and other social and behavioral sciences, researchers often encounter situations where there are not only many variables contributing to a particular phenomenon, but also there are strong relationships among many of the predictor variables of interest. By using the traditional multiple regression on all the predictor variables, it is possible to have problems with interpretation and multicollinearity. As an alternative to multiple regression, we explore the use of a latent variable model which can address the relationship among the predictor variables. We consider two different methods for estimation and prediction for this model: one that uses multiple regression on factor score estimates and the other that uses structural equation modeling. The first method uses multiple regression but on a set of predicted underlying factors (i.e. factor scores) and the second method is a full multivariate maximum likelihood technique that incorporates the complete covariance structure of the data. In this paper, we will explain the model and each estimation method including how to do prediction. A data example will be used for demonstration, where respiratory disease death rates by county in Minnesota are predicted by five county level census variables. A simulation study is performed to evaluate the efficiency of prediction using the two latent variable modeling techniques compared to multiple regression.
EXCERPT FROM INTRODUCTION
Multiple regression is one of the most widely used of all statistical methods. The two main uses of multiple regression are: estimating the effect that certain predictors have on the outcome when ``adjusting'' for other variables and predicting the outcome given a set of predictors. Despite its popularity, there are some disadvantages to this method. First, in order to have better estimation and prediction, it is very common to have more than 3 predictor variables, which makes identification and interpretation of the inter-relationships less straightforward, since our ability to visualize relationships is limited to 2 or 3 dimensions. One way to address this problem is to choose a smaller set of variables by using model selection methods. But, since most model selection criteria are highly data dependent, this does not allow the model to reflect the subject-matter knowledge. Another problem that can occur when using multiple regression stems from multicollinearity. When the predictor variables are highly correlated with one another, standard errors are inflated and typically most of the predictors are not statistically significant. Multicollinearity can commonly arise in fields where latent variables or factors are prominent, such as in epidemiology, psychology, sociology, and other social and behavioral sciences. In practice, when a variable is latent and thus not directly measurable, one chooses a variety of indicators of the latent variables which can be measured. Because they are all basically measuring the same latent variable, these observed variables are expected to be highly correlated. If all of these observed variables are put together into one large multiple regression, there are often too many predictor variables to interpret and also problems from multicollinearity.
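The inflation of standard errors mentioned above is commonly summarized by the variance inflation factor (VIF), which for predictor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below is an illustration, not part of the paper; it uses the fact that the VIFs are the diagonal of the inverse of the predictor correlation matrix. The correlation matrix is hypothetical.

```python
import numpy as np

def variance_inflation_factors(corr):
    """Return the VIF of each predictor from the predictor correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    # The j-th diagonal entry of the inverse equals 1 / (1 - R_j^2).
    return np.diag(np.linalg.inv(corr))

# Hypothetical predictors: the first two are highly correlated, the third is not.
corr = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
vif = variance_inflation_factors(corr)
print(np.round(vif, 2))           # variance inflation for each predictor
print(np.round(np.sqrt(vif), 2))  # corresponding inflation of the standard errors
```

The square root of the VIF is the factor by which a coefficient's standard error grows relative to the case of uncorrelated predictors, all else being equal.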
A way of overcoming these shortcomings is by exploiting the highly correlated data structure and explicitly modeling it. Thus, the researcher can incorporate knowledge about the way in which predictor variables may be related to each other. Also, from an empirical point of view, it is possible to examine the correlation structure and hypothesize that variables which are highly correlated amongst themselves (but have relatively small correlations with other variables) are indicators of the same latent variable.
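The empirical step described here, looking for blocks of variables that correlate highly with each other but weakly with the rest, can be sketched as a simple grouping rule. This is only a rough heuristic of my own for illustration (connected groups under an arbitrary threshold), not the procedure used in the paper. The function name `correlated_groups`, the threshold of 0.6, and the correlation matrix are all hypothetical.

```python
import numpy as np

def correlated_groups(corr, threshold=0.6):
    """Group variable indices linked by absolute correlations above the threshold."""
    corr = np.asarray(corr, dtype=float)
    unvisited = set(range(corr.shape[0]))
    groups = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, group = [start], [start]
        while stack:
            i = stack.pop()
            # Collect every not-yet-grouped variable strongly correlated with i.
            linked = [j for j in sorted(unvisited) if abs(corr[i, j]) > threshold]
            for j in linked:
                unvisited.remove(j)
                stack.append(j)
                group.append(j)
        groups.append(sorted(group))
    return groups

# Hypothetical correlation matrix for five observed variables.
corr = np.array([
    [1.00, 0.80, 0.70, 0.10, 0.15],
    [0.80, 1.00, 0.75, 0.20, 0.10],
    [0.70, 0.75, 1.00, 0.15, 0.20],
    [0.10, 0.20, 0.15, 1.00, 0.70],
    [0.15, 0.10, 0.20, 0.70, 1.00],
])
print(correlated_groups(corr))
```

Each resulting group is a candidate set of indicators for one latent variable. Subject-matter knowledge should still decide whether such a grouping makes sense.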
RESULTS FROM MULTIPLE REGRESSION
ON PREDICTORS FOR EXAMPLE DATA
| Parameter | Estimate | S.E. | T value | P-value |
| INTERCEPT | -8.03 | 0.40 | -20.22 | 0.00 |
| eduhs | 0.005 | 0.007 | 0.71 | 0.47 |
| medhhin | 0.000007 | 0.00001 | 0.72 | 0.47 |
| percapit | -0.00000002 | 0.00003 | -0.05 | 0.95 |
| pubwater | 0.001 | 0.002 | 0.46 | 0.64 |
| wood | 0.013 | 0.005 | 2.77 | 0.01 |
| R2: 0.23 |
[Model equations for eduhs, medhhin, percapit, pubwater, wood, and RESP; the right-hand sides are missing from the source.]
| | Regression on factor scores | | SEM | |
| Parameter | Estimate (s.e.) | P-value | Estimate (s.e.) | P-value |
| | -7.68 (0.14) | <0.0001 | -7.28 (0.11) | <0.0001 |
| | 0.00003 (0.00001) | 0.0100 | 0.00004 (0.00001) | 0.0010 |
| | 0.010 (0.003) | 0.0006 | 0.013 (0.003) | <0.0001 |
| R2 | 0.14 | | 0.21 | |