Updated 9-24-02

Intro to Factor Analysis.

 

"Reducing the dimension of data" – What does this really mean?

Think of a line in 2-dimensions, 3-dimensions, >3 dimensions.

In 2 dimensions, the points on the line can be represented by ordered pairs (x1,x2).

In 3 dimensions, the points on the line can be represented by ordered triplets (x1,x2,x3).

In 4 dimensions, the points on the line can be represented by ordered quadruples (x1,x2,x3,x4),

etc.

But actually, no matter how many dimensions there are, a line can be represented as a function of just one "underlying" variable.


For example, take the line in 2 dimensions that can be represented as X2 = 2 X1 – 1.

 

But it can also be represented by thinking of X1 and X2 both as functions of the same single variable f, that is,

X1(f) = 2 + f

X2(f) = 3 + 2f

where f goes from negative infinity to positive infinity

When f = 0, we have the point (2,3)

When f = -1, we have the point (1,1)
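
To see that the two descriptions agree, one can eliminate f from the parameterized equations. A short worked step (nothing is assumed here beyond the two equations above):

\begin{displaymath}
f = X_1 - 2 \quad \Longrightarrow \quad X_2 = 3 + 2\,(X_1 - 2) = 2 X_1 - 1
\end{displaymath}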


Intuition:
1. X1 and X2 change because f changes

2. X1 and X2 are related because they are both related to f


The equations above as a function of f represent just one way to parameterize the linear relation between X1 and X2. NOTE, THIS IS NOT UNIQUE. In other words, there is more than one way to parameterize the exact same line.

For example,

X1 = 6 – 13 f

X2 = 11 – 26 f

gives the same relation between X1 and X2.

When f = 4/13, we have the point (2,3)

When f = 5/13, we have the point (1,1)

ALSO, another way to parameterize this same line is

X1 = f

X2 = 2f – 1

This parameterization might be considered the simplest: we fix one of the observed variables to be equal to the underlying variable.

The fact that the parameterization is not unique is related to the idea of "factor rotation", which will be discussed later.
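
As a quick numerical check of the non-uniqueness claim, the sketch below evaluates the three parameterizations given above on an arbitrary grid of f values and confirms that every point lands on X2 = 2 X1 – 1. The grid and the variable names are illustrative choices, not part of the notes.

```python
import numpy as np

# Three parameterizations of the same line X2 = 2*X1 - 1 (taken from the notes).
parameterizations = {
    "2 + f, 3 + 2f": lambda f: (2 + f, 3 + 2 * f),
    "6 - 13f, 11 - 26f": lambda f: (6 - 13 * f, 11 - 26 * f),
    "f, 2f - 1": lambda f: (f, 2 * f - 1),
}

# Arbitrary grid of values for the underlying variable (an assumption).
f_values = np.linspace(-5.0, 5.0, 11)

all_on_line = {}
for name, param in parameterizations.items():
    x1, x2 = param(f_values)
    # Every generated point should satisfy X2 = 2*X1 - 1.
    all_on_line[name] = bool(np.allclose(x2, 2 * x1 - 1))

print(all_on_line)
```

The same value of f gives different points under different parameterizations, but the set of points traced out is the same line in each case.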


What if we have 3 dimensions, i.e. 3 observed X variables? A line is then defined by two restrictions on the 3-dimensional space.

For example, X1 = 6 – 3 X2 and X3 = 3 + 2 X2 define a line in three dimensions: each equation describes a plane, and the line is where the two planes intersect. Another way to write this is

X1 = f

X2 = 2 – 1/3 f

X3 = 7 – 2/3 f

When f = 0, we get the point (0,2,7)

When f = 3, we get the point (3,1,5)
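
The 3-dimensional parameterization can be checked the same way. This is a minimal sketch that plugs an arbitrary grid of f values into the three equations and tests both restrictions; the grid is an assumption chosen so that f = 0 and f = 3 are included.

```python
import numpy as np

# Arbitrary integer grid for the underlying variable: -6, -5, ..., 6.
f = np.linspace(-6.0, 6.0, 13)

# Parameterization from the notes.
x1 = f
x2 = 2 - f / 3
x3 = 7 - 2 * f / 3

# Both restrictions that define the line should hold for every f.
on_plane_1 = bool(np.allclose(x1, 6 - 3 * x2))
on_plane_2 = bool(np.allclose(x3, 3 + 2 * x2))

print(on_plane_1, on_plane_2)
```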


 

What if we have data? Observed points do not fall exactly on the line, so we add error terms e1 and e2 to the equations:

X1 = 2 + f + e1 , Var(e1) = psi1

X2 = 3 + 2 f + e2 , Var(e2) = psi2

There is one dimension (or one factor) underlying this data. Most of the variability can be explained by a projection of the data onto a line.
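
To make "most of the variability" concrete, here is a small simulation sketch of the two equations above. The notes do not specify a distribution for f or values for psi1 and psi2, so the standard normal factor, the error variances of 0.1, the sample size, and the seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                    # sample size (assumption)
psi1, psi2 = 0.1, 0.1      # error variances (hypothetical values)

f = rng.normal(0.0, 1.0, n)              # underlying factor, assumed N(0, 1)
e1 = rng.normal(0.0, np.sqrt(psi1), n)   # error in X1
e2 = rng.normal(0.0, np.sqrt(psi2), n)   # error in X2

x1 = 2 + f + e1
x2 = 3 + 2 * f + e2

# 2 x 2 sample covariance matrix (rows of the stacked array are variables).
S = np.cov(np.vstack([x1, x2]))

# Eigenvalues sorted largest first; the first one is the variance along
# the best-fitting line through the data.
eigenvalues = np.linalg.eigvalsh(S)[::-1]
share_first = eigenvalues[0] / eigenvalues.sum()

print(eigenvalues, share_first)
```

With small error variances the first eigenvalue should account for nearly all of the total variance, which is what "one factor underlying the data" looks like numerically.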


What if we have data in 3 dimensions?

 

X1 = f + e1

X2 = 2 – 1/3 f + e2

X3 = 7 – 2/3 f + e3

So, there is one dimension underlying the data. In other words, the data follow a line, up to the scatter added by the errors e1, e2 and e3.


What if there are two dimensions underlying the data? Then the data will follow a plane.

X1 = f1

X2 = f2

X3 = 3 + 2f1 - f2

And here's what it would look like if we had data with two underlying dimensions (ignore the bends in the plane at the edges; they are an artifact of the graphing software I was using).

X1 = f1 + e1

X2 = f2 + e2

X3 = 3 + 2f1 - f2 + e3
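
A simulation sketch of the two-factor equations above shows how the number of underlying dimensions shows up in the covariance matrix. The independent standard normal factors, the common error variance of 0.05, the sample size, and the seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000      # sample size (assumption)
psi = 0.05    # common error variance (hypothetical value)

f1 = rng.normal(size=n)   # first factor, assumed N(0, 1)
f2 = rng.normal(size=n)   # second factor, assumed N(0, 1), independent of f1
e = rng.normal(0.0, np.sqrt(psi), size=(3, n))   # one row of errors per variable

# Rows are the three observed variables from the notes.
X = np.vstack([
    f1 + e[0],
    f2 + e[1],
    3 + 2 * f1 - f2 + e[2],
])

S = np.cov(X)                                # 3 x 3 sample covariance matrix
eigenvalues = np.linalg.eigvalsh(S)[::-1]    # sorted largest first

print(eigenvalues)
```

Two eigenvalues should be clearly larger than the third, whose size is roughly that of the error variance. That pattern is the numerical counterpart of the data following a plane.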


What if there are more than 2 underlying dimensions?

The geometric interpretation is no longer clear, but the form of the equations is the same.

In general, we need something other than data visualization and geometry to determine the number of factors underlying the data. Thus, we will next investigate the covariance structure via eigenvalues and eigenvectors.


Symmetric matrices can be factored in a way that gives more insight about their structure

Take A to be a symmetric $p \times p$ matrix. Then

\begin{displaymath}A = U \; D\; U^\prime
\end{displaymath}

where the columns of U contain the (orthonormal) eigenvectors of A and the diagonal matrix D contains the associated eigenvalues of A.
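
The factorization can be verified numerically. The sketch below uses NumPy's `numpy.linalg.eigh`, which is intended for symmetric matrices; the 3 x 3 matrix A is a made-up example, not one from the notes.

```python
import numpy as np

# Hypothetical symmetric matrix, chosen only for illustration.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

# eigh returns eigenvalues in ascending order and the matching
# eigenvectors as the columns of U.
eigenvalues, U = np.linalg.eigh(A)
D = np.diag(eigenvalues)

reconstructed = U @ D @ U.T

print(np.allclose(reconstructed, A))       # checks A = U D U'
print(np.allclose(U.T @ U, np.eye(3)))     # checks that the columns of U are orthonormal
```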





EXAMPLE: Consider the following 50 data points represented by (X1, X2).



\begin{figure}\begin{center}
\epsfig{figure=ps.out.0001.ps,height=2in,width=2.5in,angle=0,silent=}\end{center}\end{figure}

$\widehat{Var} \left( \begin{array}{c} X_1 \\ X_2 \end{array}\right) = S = \left( \begin{array}{cc} \ldots336 & 1.3672 \\ 1.3672 & 2.8274 \end{array}\right)$

$= \left( \begin{array}{cc} 0.4602 & 0.8878 \\ 0.8878 & -0.4602 \end{array}\right) \left( \begin{array}{cc} 3.571 & 0 \\ 0 & 0.2250 \end{array}\right) \left( \begin{array}{cc} 0.4602 & 0.8878 \\ 0.8878 & -0.4602 \end{array}\right)$


So the 1st eigenvector is the direction (.4602, .8878) (i.e. the line with positive slope going through Figure 2), and the eigenvalue associated with this eigenvector is 3.571. This number is the variance the data would have if they were projected onto the first eigenvector. Compare this variance to the variance of X2 (2.8274).

The second eigenvector is the direction (.8878, -.4602) (i.e. the line with negative slope going through Figure 2). Its associated eigenvalue is .2250; that is, if we projected the data onto this eigenvector, the variance would be .2250.
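
The "variance of the projection" reading of an eigenvalue can be checked from the reported numbers alone. The sketch below rebuilds a covariance matrix as U D U' from the eigenvectors and eigenvalues quoted above, then computes the variance along each eigenvector as u' S u. The raw 50 data points are not available here, and the quoted values are rounded, so the rebuilt matrix need not match the printed S exactly.

```python
import numpy as np

# Eigenvectors (as columns) and eigenvalues as reported in the example.
U = np.array([[0.4602, 0.8878],
              [0.8878, -0.4602]])
D = np.diag([3.571, 0.2250])

# Covariance matrix implied by the reported decomposition.
S_rebuilt = U @ D @ U.T

u1, u2 = U[:, 0], U[:, 1]

# The variance of data projected onto a unit direction u is u' S u.
var_along_u1 = u1 @ S_rebuilt @ u1
var_along_u2 = u2 @ S_rebuilt @ u2

print(S_rebuilt[0, 1], var_along_u1, var_along_u2)
```

The off-diagonal entry should come out close to the reported covariance of 1.3672, and the two projected variances close to the two eigenvalues.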


\begin{figure}\begin{center}
\epsfig{figure=ps.out.0003.ps,height=2in,width=2.5in,angle=0,silent=}\end{center}\end{figure}

Since the data is only two dimensional (i.e. only (X1, X2)) there are only two eigenvector/eigenvalue pairs.

If there are more than 3 dimensions, graphical interpretation becomes very difficult, but the eigenvector/eigenvalue idea still extends. PRINCIPAL COMPONENT ANALYSIS is just the eigenvector/eigenvalue decomposition of the covariance (or correlation) matrix.
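
A minimal sketch of principal component analysis along these lines, for data with more than 3 dimensions. The function name `pca` and the simulated 4-dimensional data are illustrative assumptions; the only technique used is the eigendecomposition of the sample covariance matrix.

```python
import numpy as np

def pca(X):
    """PCA of an n x p data matrix via the covariance eigendecomposition.

    Returns eigenvalues (largest first), eigenvectors (as columns, in the
    same order) and the scores, i.e. the centered data projected onto the
    eigenvectors.
    """
    centered = X - X.mean(axis=0)
    S = np.cov(centered, rowvar=False)           # p x p sample covariance
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    order = np.argsort(eigenvalues)[::-1]        # reorder to largest first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = centered @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Hypothetical correlated 4-dimensional data, for illustration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

eigenvalues, eigenvectors, scores = pca(X)

# The sample variance of each column of scores equals the matching eigenvalue.
score_variances = scores.var(axis=0, ddof=1)
print(eigenvalues)
print(score_variances)
```

No plot is needed: the eigenvalues summarize how much variance lies along each direction, whatever the number of observed variables.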