Factor Analysis

Introduction

Factor Analysis is another powerful tool for understanding what your data mean, particularly when you have many variables. What Factor Analysis does is try to find hidden variables which explain the behavior of your observed variables. Our interest here also lies in reducing the number of variables: we hope to find a smaller number of new variables which explain the data well. In this sense, it sounds very similar to PCA. Although the outcome is very similar in terms of reducing the number of variables, the approach to reducing them is different. I will explain this in the next section.

If you are a little more knowledgeable, you may have heard of terms like Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). EFA means that you don't really know what hidden variables (or factors) exist or how many there are, so you are trying to find them. CFA means that you already have some guesses or models for your hidden variables (or factors) and you want to check whether your models are correct. In many cases, your Factor Analysis is EFA, and that is what I explain on this page.

We are going to use an example similar to the one in PCA. Let's say you have data like this from a survey about what people consider important when they decide which computer to buy.

Participant  Price  Software  Aesthetics  Brand  Friend  Family
P1           6      5         3           4      7       6
P2           7      3         2           2      2       3
P3           6      4         4           5      5       4
P4           5      7         1           3      6       7
P5           7      7         5           5      2       1
P6           6      4         2           3      4       5
P7           5      7         2           1      1       4
P8           6      5         4           4      7       5
P9           3      5         6           7      3       4
P10          1      3         7           5      2       4
P11          2      6         6           7      6       5
P12          5      7         7           6      7       7
P13          2      4         5           6      6       2
P14          3      5         6           5      2       3
P15          1      6         5           5      4       5
P16          2      3         7           7      5       6

To do Factor Analysis successfully, we need more data than this example provides. If you want to find n factors, you want roughly 3n to 4n variables (dimensions) and 5n to 10n samples. Factor Analysis also assumes normality of the data, so it is not a great tool for ordinal data. In practice, however, we can use Factor Analysis on ordinal data if the scale has 5 or more points and the data can be treated as interval data.

Through Factor Analysis, you want to find hidden variables (common factors) which may explain the responses you gained. Before looking at how to do Factor Analysis in R, I would like to briefly explain the difference between PCA and FA.



Difference between Factor Analysis and PCA

The intuition of Principal Component Analysis is to find new combinations of variables which have larger variances. Why are larger variances important? This is similar to the concept of entropy in information theory. Let's say you have two variables. One of them (Var 1) follows N(1, 0.01) and the other (Var 2) follows N(1, 1). Which variable do you think has more information? Var 1 is always pretty much 1, whereas Var 2 can take a wider range of values, like 0 or 2. Thus, Var 2 has more chances to take various values than Var 1, which means Var 2's entropy is larger than Var 1's. In this sense, we can say Var 2 contains more information than Var 1.
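To see this concretely, here is a minimal R sketch (the sample size of 10,000 and the variable names are my own, arbitrary choices) that simulates the two variables and compares how widely their values spread:

set.seed(1)                                # for reproducibility
var1 <- rnorm(10000, mean = 1, sd = 0.1)   # N(1, 0.01): variance 0.01
var2 <- rnorm(10000, mean = 1, sd = 1)     # N(1, 1): variance 1
var(var1)    # close to 0.01
var(var2)    # close to 1
range(var1)  # values stay near 1
range(var2)  # values spread over a much wider range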

Although the example above looks at one variable at a time, PCA tries to find linear combinations of the variables which contain a lot of information by looking at the variance. This is why the standard deviation is one of the important metrics for determining the number of new variables in PCA. Another interesting aspect of the new variables derived by PCA is that they are all orthogonal. You can think of PCA as rotating and translating the data such that the first axis contains the most information, the second axis the second most, and so forth.

The intuition of Factor Analysis is to find hidden variables which affect your observed variables, by looking at the correlations. If one variable is correlated with another variable, we can say that these two variables may be generated from one hidden variable, so we can explain the phenomenon with that one hidden variable instead of the two observed variables. Let's take a look at the correlation matrix of our data (see the code example below for how to create the data frame) before doing Factor Analysis.

cor(data)

And you get the correlation matrix.

                 Price   Software  Aesthetics       Brand      Friend      Family
Price       1.00000000  0.1856123 -0.63202219 -0.58026680  0.03082006 -0.06183118
Software    0.18561230  1.0000000 -0.14621516 -0.11858645  0.10096774  0.17657236
Aesthetics -0.63202219 -0.1462152  1.00000000  0.85285436  0.03989799 -0.06977360
Brand      -0.58026680 -0.1185864  0.85285436  1.00000000  0.33316719  0.02662389
Friend      0.03082006  0.1009677  0.03989799  0.33316719  1.00000000  0.60727189
Family     -0.06183118  0.1765724 -0.06977360  0.02662389  0.60727189  1.00000000

So, it looks like Price has strong negative correlations with Aesthetics and Brand, and Friend has a strong correlation with Family. This means we can expect two common factors: one related to Price, Aesthetics, and Brand, and the other related to Friend and Family. Let's move on to Factor Analysis and see what happens.



R code example

In the following code example, I skip some details, such as the choice between varimax rotation and promax rotation (R uses varimax rotation by default). If you want to know more, I recommend reading other books or references for now. I may add these details later, but I'm not sure…
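For reference, here is a minimal sketch of how you would specify the rotation explicitly in factanal() (it uses the data frame we create just below; promax is an oblique rotation which allows the factors to be correlated):

fa.promax <- factanal(data, factors = 2, rotation = "promax")  # default is rotation = "varimax"
fa.promax$loadings                                             # loadings under promax rotation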

First, we prepare the data.

Price <- c(6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2)
Software <- c(5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3)
Aesthetics <- c(3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7)
Brand <- c(4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7)
Friend <- c(7,2,5,6,2,4,1,7,3,2,6,7,6,2,4,5)
Family <- c(6,3,4,7,1,5,4,5,4,4,5,7,2,3,5,6)
data <- data.frame(Price, Software, Aesthetics, Brand, Friend, Family)

Factor Analysis is easy to do in R. Let's run Factor Analysis assuming that the number of hidden variables is 1.

fa <- factanal(data, factors=1)

And you get the result.

Call:
factanal(x = data, factors = 1)

Uniquenesses:
     Price   Software Aesthetics      Brand     Friend     Family 
     0.567      0.977      0.126      0.167      0.974      1.000 

Loadings:
           Factor1
Price      -0.658 
Software   -0.152 
Aesthetics  0.935 
Brand       0.912 
Friend      0.161 
Family            

               Factor1
SS loadings      2.190
Proportion Var   0.365

Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 12.79 on 9 degrees of freedom.
The p-value is 0.172
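Incidentally, if you prefer to pull these test statistics out of the returned object rather than reading them off the printed output, they are stored in the fa object (a minimal sketch):

fa$STATISTIC  # chi-square statistic (about 12.79)
fa$dof        # degrees of freedom (9)
fa$PVALUE     # p-value (about 0.17)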

Here, Factor Analysis performs a null hypothesis test in which the null hypothesis is that the model with the factors we have found describes the data well. The chi-square goodness-of-fit statistic is 12.79, and the p value is 0.17. This means we cannot reject the null hypothesis, so from a statistical perspective one factor is enough to describe the data. This is why the result says “Test of the hypothesis that 1 factor is sufficient.” Let's take a look at Factor Analysis with two factors.

fa <- factanal(data, factors=2)

Call:
factanal(x = data, factors = 2)

Uniquenesses:
     Price   Software Aesthetics      Brand     Friend     Family 
     0.559      0.960      0.126      0.080      0.005      0.609 

Loadings:
           Factor1 Factor2
Price      -0.657         
Software   -0.161   0.119 
Aesthetics  0.933          
Brand       0.928   0.242 
Friend      0.100   0.992 
Family              0.620 

               Factor1 Factor2
SS loadings      2.207   1.453
Proportion Var   0.368   0.242
Cumulative Var   0.368   0.610

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 2.16 on 4 degrees of freedom.
The p-value is 0.706

The p value gets larger, and the cumulative proportion of variance reaches 0.61 (with one factor, it is 0.37). So the model seems to be improved. The loadings are the weights used to calculate the hidden variables from the observed variables.
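If you also want the values of these hidden variables for each participant (the factor scores), you can ask factanal() to compute them with the scores argument; a minimal sketch re-running the same two-factor analysis:

fa <- factanal(data, factors = 2, scores = "regression")  # "Bartlett" scores are also available
head(fa$scores)   # estimated Factor1 and Factor2 values for the first participants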

But obviously, the fit of the model improves as you add more factors, which shows the trade-off between the number of factors and the accuracy of the model. So, how should we decide how many factors to keep? This is the topic of the next section.



How many factors should we use?

We found two factors in the example:

            Factor 1  Factor 2
Price       -0.657
Software    -0.161    0.119
Aesthetics   0.933
Brand        0.928    0.242
Friend       0.100    0.992
Family                0.620

In the results of FA, some coefficients are missing, but this means these coefficients are just too small to be displayed, not necessarily equal to zero. You can see all the coefficients, with more precision, by accessing the loadings directly, for example fa$loadings[,1].
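For example, printing the loadings with a cutoff of zero shows every coefficient instead of hiding the small ones (a minimal sketch):

print(fa$loadings, cutoff = 0, digits = 3)  # show all loadings, even the small ones
fa$loadings[,1]                             # the Factor1 loadings as a plain numeric vector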

Although the goodness-of-fit test tells you whether the current number of factors is sufficient or not, it does not tell you how many factors you really need to describe the information in the original data. For instance, why don't we try three factors instead of one or two? There are a few ways to answer this question.



Comprehensibility

This means checking whether you can explain your new variables in a sensible way. For example, Factor 1 has large weights on Price, Aesthetics, and Brand, which may indicate whether people want practical aspects or fashionable aspects in their computers. Factor 2 has large weights on Friend and Family, which seems to mean that the people around users have some effect on their computer purchases. Thus, both factors seem to have some meaning, and that's why we should keep them.

This is not really a mathematical way to determine the number of factors, but it is a common practice. Because we want to find factors which explain something, we can simply ignore factors which don't really make sense. This is probably intuitive, but you may argue that it is too subjective. So, we also have more mathematical ways to determine the number of factors.



Cumulative variance

Similar to PCA, you can look at the cumulative proportion of variance, and once it reaches some threshold, you can stop adding more factors. Deciding the threshold is somewhat heuristic. It can be 80%, similar to PCA; if your focus is on reducing the number of variables, it can be 50-60%.
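The cumulative proportion appears directly in the factanal() output above, but you can also recompute it from the loadings: the proportion of variance for each factor is its sum of squared loadings divided by the number of observed variables. A minimal sketch:

ss <- colSums(fa$loadings^2)    # SS loadings per factor (about 2.21 and 1.45)
prop <- ss / nrow(fa$loadings)  # divide by the 6 observed variables
cumsum(prop)                    # cumulative proportion: about 0.37, then 0.61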



Kaiser criterion

The Kaiser rule is to discard components whose eigenvalues are below 1.0. This is also used in SPSS. You can easily calculate the eigenvalues from the correlation matrix.

ev <- eigen(cor(data))
ev$values

[1] 2.45701130 1.68900056 0.89157047 0.60583326 0.27285334 0.08373107

Two eigenvalues are greater than 1.0, so we can determine that the number of factors should be 2. One problem with the Kaiser rule is that it is often too strict.
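A one-line check of the Kaiser rule in R (a minimal sketch, using the ev object from above):

sum(ev$values > 1.0)  # number of eigenvalues above 1.0, i.e., 2 factors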



Scree plot

Another way to determine the number of factors is to use a scree plot. You plot the factors on the X axis and the eigenvalues on the Y axis, and connect the points with lines. You then try to find the spot where the slope of the line becomes less steep. So, how exactly should we find that spot? Again, it is somewhat heuristic. In some cases (particularly when the number of original variables is small, as in the example above), you may not find a clear spot (try making the plot with the following code). Nonetheless, it is good to know how to make a scree plot.

The following procedure for making a scree plot is based on this webpage. You also need the nFactors package.

# eigenvalues of the correlation matrix
ev <- eigen(cor(data))
library(nFactors)
# parallel analysis: eigenvalues expected from random data, for comparison
ap <- parallel(subject=nrow(data), var=ncol(data), rep=100, cent=0.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)
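If you just want a quick look without installing an extra package, you can also draw a plain scree plot of the eigenvalues with base R (a minimal sketch; the labels are my own):

plot(ev$values, type = "b", xlab = "Component number", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)  # reference line at eigenvalue = 1 (the Kaiser criterion)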



Summary

Unfortunately, there is no definitive way to determine the number of factors. In practice, you try several of the methods explained here and see how many factors each of them suggests. They should suggest more or less the same number, which gives you a pretty good idea of how many factors you should choose.