Updated October 29, 2002

In the following, I present excerpts from two papers on how multicollinearity affects regression and how multicollinearity can be used to advantage in creating latent variables underlying the collinear indicators. Both papers are available as .ps or .pdf files from the biostat research report series.

Wall, M.M. (2001) "A close look at the effects of multicollinearity in the most simple case", Research Report No. 2001-023, Division of Biostatistics, University of Minnesota.

Li, R., and Wall, M.M. (2002) "A Comparison of Multiple Regression to Two Latent Variable Techniques for Estimation and Prediction", Research Report No. 2002-004, Division of Biostatistics, University of Minnesota.


A close look at the effects of multicollinearity in the most simple case
by Melanie M. Wall

ABSTRACT

This paper revisits the topic of multicollinearity but not from the point of view of detecting it or trying to fix it. Instead, we explore the different changes that occur to the estimated regression coefficients, the $R^2$, and the estimated standard errors as the correlation varies from small to large. We demonstrate at what exact points of collinearity certain results are found in the simple bivariate regression of a continuous outcome variable on 2 continuous predictor variables. We present several figures to provide intuition about how the changes in the estimated regression coefficients, the $R^2$, and the estimated standard errors occur smoothly as the collinearity changes. A new surprising result is also demonstrated that shows that after a certain level of collinearity between the predictors, the estimated standard errors associated with the estimated regression coefficient actually become smaller rather than larger.

EXCERPT FROM INTRODUCTION

Multiple regression, rather than separate simple linear regressions, is often used because researchers have some notion that they should ``adjust'' the effect of a predictor on an outcome by other confounding predictors. A common phenomenon related to the idea of ``adjusting'' that is seen when performing a multiple regression is the following:

Scenario 1: A simple bivariate relationship between a single predictor X1 and an outcome variable Y is found to be statistically significant, but, after adjusting for some other predictor variable X2, i.e., performing the multiple regression of Y on X1 and X2, the effect of X1 on Y is close to zero and not significant anymore.

In general, researchers appear to be comfortable with this phenomenon and often then conclude that X1 does not have a significant effect on the outcome Y once X2 has been taken into account. On the other hand, the following scenario does not tend to sit well with most researchers' intuition about the way that one variable's effect on the outcome is ``adjusted'' by another variable:

Scenario 2: A simple bivariate relationship between a single predictor X1 and an outcome variable Y is found to be statistically significant and positive, but, after adjusting for some other predictor variable X2, the effect of X1 on Y remains significant but is now negative.

As will be discussed in detail in this paper, the phenomena in both Scenario 1 and Scenario 2 occur as a result of a certain amount of correlation, or multicollinearity, existing between the predictors X1 and X2.
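Both scenarios can be reproduced from the three pairwise correlations alone. The following is a minimal Python sketch using the standard formula for standardized coefficients in a two-predictor regression, $\widehat{\beta_1^\prime} = (r_{y1} - r_{12} r_{y2})/(1 - r_{12}^2)$. The function name and the correlation values are chosen here for illustration and are not taken from the paper, and the sketch looks only at the size and sign of the adjusted coefficient, not at its significance.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients for the regression of Y on X1 and X2.

    Inputs are the pairwise correlations; requires |r_12| < 1.
    """
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_12 * r_y2) / denom
    beta2 = (r_y2 - r_12 * r_y1) / denom
    return beta1, beta2


# Illustrative (hypothetical) correlations: X1 is the weaker predictor of Y.
r_y1, r_y2 = 0.2, 0.6

# Scenario 1: after adjusting for X2, the effect of X1 is essentially zero.
scenario1_beta1, _ = standardized_betas(r_y1, r_y2, r_12=1 / 3)

# Scenario 2: after adjusting for X2, the effect of X1 changes sign.
scenario2_beta1, _ = standardized_betas(r_y1, r_y2, r_12=0.6)

print(f"unadjusted effect of X1 (r_y1): {r_y1:+.3f}")
print(f"Scenario 1, r_12 = 1/3:         {scenario1_beta1:+.3f}")
print(f"Scenario 2, r_12 = 0.6:         {scenario2_beta1:+.3f}")
```

In this sketch the only thing that differs between the two scenarios is the correlation between the predictors.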


 
Figure: The symbols 1, 2, and $\bullet$ represent $\widehat{\beta_1^\prime}$, $\widehat{\beta_2^\prime}$, and $R^2$, respectively, for varying levels of multicollinearity ($r_{12}$) when we fix $r_{y1}=.6$ and $r_{y2}=.2$. [Figure not shown.]


 
Table 1: Regions of change for Multicollinearity

Region | Name | Range for $r_{12}$ | Effect on $\widehat{\beta_1^\prime}$ | Effect on $\widehat{\beta_2^\prime}$ | Effect on $R^2$
I | Cooperative suppression + Enhancement | $(r_{12}^{lower}, 0)$ | $\widehat{\beta_1^\prime} > r_{y1}$ | $\widehat{\beta_2^\prime} > r_{y2}$ | $R^2 > r_{y1}^2 + r_{y2}^2$
II | Redundancy | $(0, \frac{r_{y2}}{r_{y1}})$ | $0 < \widehat{\beta_1^\prime} < r_{y1}$ | $0 < \widehat{\beta_2^\prime} < r_{y2}$ | $R^2 < r_{y1}^2 + r_{y2}^2$
III | Net suppression | $(\frac{r_{y2}}{r_{y1}}, \frac{2 r_{y1} r_{y2}}{r_{y1}^2 + r_{y2}^2})$ | $\widehat{\beta_1^\prime} > r_{y1}$ | $\widehat{\beta_2^\prime} < 0$ | $R^2 < r_{y1}^2 + r_{y2}^2$
IV | Net suppression + Enhancement | $(\frac{2 r_{y1} r_{y2}}{r_{y1}^2 + r_{y2}^2}, r_{12}^{upper})$ | $\widehat{\beta_1^\prime} > r_{y1}$ | $\widehat{\beta_2^\prime} < 0$ | $R^2 > r_{y1}^2 + r_{y2}^2$



A Comparison of Multiple Regression to Two Latent Variable Techniques for Estimation and Prediction
by Ruifeng Li and Melanie M. Wall

ABSTRACT

In the areas of epidemiology, psychology, sociology, and other social and behavioral sciences, researchers often encounter situations where there are not only many variables contributing to a particular phenomenon, but also there are strong relationships among many of the predictor variables of interest. By using the traditional multiple regression on all the predictor variables, it is possible to have problems with interpretation and multicollinearity. As an alternative to multiple regression, we explore the use of a latent variable model which can address the relationship among the predictor variables. We consider two different methods for estimation and prediction for this model: one that uses multiple regression on factor score estimates and the other that uses structural equation modeling. The first method uses multiple regression but on a set of predicted underlying factors (i.e. factor scores) and the second method is a full multivariate maximum likelihood technique that incorporates the complete covariance structure of the data. In this paper, we will explain the model and each estimation method including how to do prediction. A data example will be used for demonstration, where respiratory disease death rates by county in Minnesota are predicted by five county level census variables. A simulation study is performed to evaluate the efficiency of prediction using the two latent variable modeling techniques compared to multiple regression.

EXCERPT FROM INTRODUCTION

Multiple regression is one of the most widely used of all statistical methods. The two main uses of multiple regression are: estimating the effect that certain predictors have on the outcome when ``adjusting'' for other variables and predicting the outcome given a set of predictors. Despite its popularity, there are some disadvantages to this method. First, in order to have better estimation and prediction, it is very common to have more than 3 predictor variables, which makes identification and interpretation of the inter-relationships less straightforward, since our ability to visualize relationships is limited to 2 or 3 dimensions. One way to address this problem is to choose a smaller set of variables by using model selection methods. But, since most model selection criteria are highly data dependent, this does not allow the model to reflect the subject-matter knowledge. Another problem that can occur when using multiple regression stems from multicollinearity. When the predictor variables are highly correlated with one another, standard errors are inflated and typically most of the predictors are not statistically significant. Multicollinearity can commonly arise in fields where latent variables or factors are prominent, such as in epidemiology, psychology, sociology, and other social and behavioral sciences. In practice, when a variable is latent and thus not directly measurable, one chooses a variety of indicators of the latent variables which can be measured. Because they are all basically measuring the same latent variable, these observed variables are expected to be highly correlated. If all of these observed variables are put together into one large multiple regression, there are often too many predictor variables to interpret and also problems from multicollinearity.

A way of overcoming these shortcomings is by exploiting the highly correlated data structure and explicitly modeling it. Thus, the researcher can incorporate knowledge about the way in which predictor variables may be related to each other. Also, from an empirical point of view, it is possible to examine the correlation structure and hypothesize that variables which are highly correlated amongst themselves (but have relatively small correlations with other variables) are indicators of the same latent variable.
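The empirical point above can be illustrated with simulated data. The following Python sketch is not from the paper: it generates five indicators from two independent hypothetical latent variables (three indicators of the first, two of the second), with noise levels and names chosen arbitrarily. It then compares the correlations within each group of indicators to the correlations between groups.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 2000

# Two independent latent variables (hypothetical; never observed directly).
factor_a = [rng.gauss(0, 1) for _ in range(n)]
factor_b = [rng.gauss(0, 1) for _ in range(n)]


def indicator(factor, noise_sd=0.5):
    """Observed indicator = latent variable + independent measurement noise."""
    return [f + rng.gauss(0, noise_sd) for f in factor]


indicators = {
    "a1": indicator(factor_a),
    "a2": indicator(factor_a),
    "a3": indicator(factor_a),
    "b1": indicator(factor_b),
    "b2": indicator(factor_b),
}

names = list(indicators)
corr = {
    (p, q): pearson(indicators[p], indicators[q])
    for p in names for q in names if p < q
}

# Indicators share a latent variable when their names start with the same letter.
within = [r for (p, q), r in corr.items() if p[0] == q[0]]
between = [r for (p, q), r in corr.items() if p[0] != q[0]]

print(f"smallest within-group correlation:      {min(within):.2f}")
print(f"largest |between-group correlation|:    {max(abs(r) for r in between):.2f}")
```

The block pattern (high correlations within a group, small ones across groups) is the kind of structure that would suggest grouping indicators under a common latent variable. In real data the latent variables themselves may be correlated, so the between-group correlations need not be near zero.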


DATA EXAMPLE


FIVE PREDICTORS
(all on the county-level)


Ordinary Multiple Regression





RESULTS FROM MULTIPLE REGRESSION
ON PREDICTORS FOR EXAMPLE DATA



Parameter    Estimate       S.E.       T value    P-value
INTERCEPT    -8.03          0.40       -20.22     0.00
eduhs        0.005          0.007      0.71       0.47
medhhin      0.000007       0.00001    0.72       0.47
percapit     -0.00000002    0.00003    -0.05      0.95
pubwater     0.001          0.002      0.46       0.64
wood         0.013          0.005      2.77       0.01
$R^2$: 0.23


MODEL FOR EXAMPLE DATA


eduhs = $\beta_{10} + \beta_{11}~SES + u_1$
medhhin = $\beta_{20} + \beta_{21}~SES + u_2$
percapit = $SES + u_3$
pubwater = $\beta_{30} + \beta_{32}~ruralness + u_4$
wood = $ruralness + u_5$
RESP = $\alpha_0 + \alpha_1~SES + \alpha_2~ruralness + \epsilon$
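To give a feel for how this model ties the observed variables together, here is a small simulation sketch in Python. Everything numeric in it is hypothetical: the parameter values, the unit-variance scale of the latent variables, and the assumption that the errors $u_1, \ldots, u_5$ and $\epsilon$ are independent of each other and of the latent variables are mine, not taken from the paper. Under those assumptions, because percapit has its loading on SES fixed at 1, the loading $\beta_{11}$ equals Cov(eduhs, medhhin) / Cov(percapit, medhhin), which the sketch checks on simulated data.

```python
import random


def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def simulate(n, seed=1):
    """Draw n observations from the model; all parameter values are hypothetical."""
    rng = random.Random(seed)
    beta10, beta11 = 2.0, 0.5
    beta20, beta21 = -1.0, 1.5
    beta30, beta32 = 0.3, 0.8
    alpha0, alpha1, alpha2 = -7.0, 0.4, 0.6
    data = {name: [] for name in
            ("eduhs", "medhhin", "percapit", "pubwater", "wood", "RESP")}
    for _ in range(n):
        ses = rng.gauss(0, 1)
        # Assumption: the two latent variables are allowed to correlate.
        ruralness = -0.3 * ses + rng.gauss(0, 1)
        data["eduhs"].append(beta10 + beta11 * ses + rng.gauss(0, 0.5))
        data["medhhin"].append(beta20 + beta21 * ses + rng.gauss(0, 0.5))
        data["percapit"].append(ses + rng.gauss(0, 0.5))
        data["pubwater"].append(beta30 + beta32 * ruralness + rng.gauss(0, 0.5))
        data["wood"].append(ruralness + rng.gauss(0, 0.5))
        data["RESP"].append(alpha0 + alpha1 * ses + alpha2 * ruralness
                            + rng.gauss(0, 0.5))
    return data


data = simulate(5000)

# With the percapit loading fixed at 1, beta11 is a ratio of covariances.
beta11_hat = (covariance(data["eduhs"], data["medhhin"])
              / covariance(data["percapit"], data["medhhin"]))
print(f"true beta11 = 0.5, recovered from covariances = {beta11_hat:.3f}")
```

This is only meant to show why fixing one loading per latent variable sets its scale; it is not the factor score or SEM estimation procedure described in the paper.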


RESULTS FOR EXAMPLE DATA



                         Regression on factor scores      SEM
Parameter                Estimate (s.e.)      P-value     Estimate (s.e.)      P-value
$\alpha_0$: intercept    -7.68 (0.14)         <0.0001     -7.28 (0.11)         <0.0001
$\alpha_1$: SES          0.00003 (0.00001)    0.0100      0.00004 (0.00001)    0.0010
$\alpha_2$: ruralness    0.010 (0.003)        0.0006      0.013 (0.003)        <0.0001
$R^2$                    0.14                             0.21