Principal Component Analysis (PCA) is a powerful tool when you have many variables and want to understand what they can explain together. As its name suggests, PCA finds the combinations of your variables that best explain the phenomena you observed. In this sense, PCA is useful when you want to reduce the number of variables. A common scenario is that you have n variables and want to combine them into 3 or 4 variables without losing much of the information the original data contain. More mathematically, PCA finds linear projections of your data that preserve as much of the information (variance) in the data as possible.
PCA is one of the methods to try if you have lots of Likert-scale data and want to understand what they tell you. Let's say we asked participants four 7-point Likert questions about what they care about when choosing a new computer, and got the following results.
Participant | Price | Software | Aesthetics | Brand |
---|---|---|---|---|
P1 | 6 | 5 | 3 | 4 |
P2 | 7 | 3 | 2 | 2 |
P3 | 6 | 4 | 4 | 5 |
P4 | 5 | 7 | 1 | 3 |
P5 | 7 | 7 | 5 | 5 |
P6 | 6 | 4 | 2 | 3 |
P7 | 5 | 7 | 2 | 1 |
P8 | 6 | 5 | 4 | 4 |
P9 | 3 | 5 | 6 | 7 |
P10 | 1 | 3 | 7 | 5 |
P11 | 2 | 6 | 6 | 7 |
P12 | 5 | 7 | 7 | 6 |
P13 | 2 | 4 | 5 | 6 |
P14 | 3 | 5 | 6 | 5 |
P15 | 1 | 6 | 5 | 5 |
P16 | 2 | 3 | 7 | 7 |
Now what you want to know is what combination of these four variables can explain the phenomena you observed. I will walk through this with example R code.
Let's prepare the same data shown in the table above.
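Here is one way to enter the table as a data frame (the name `data` is just my choice):

```r
# Survey responses from the table above, one column per question (P1 to P16).
data <- data.frame(
  Price      = c(6, 7, 6, 5, 7, 6, 5, 6, 3, 1, 2, 5, 2, 3, 1, 2),
  Software   = c(5, 3, 4, 7, 7, 4, 7, 5, 5, 3, 6, 7, 4, 5, 6, 3),
  Aesthetics = c(3, 2, 4, 1, 5, 2, 2, 4, 6, 7, 6, 7, 5, 6, 5, 7),
  Brand      = c(4, 2, 5, 3, 5, 3, 1, 4, 7, 5, 7, 6, 6, 5, 5, 7)
)
data
```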
At this point, the data look pretty much the same as the table above. Now we do PCA. In R, there are two standard functions for PCA: prcomp() and princomp(). prcomp() computes the principal components through a singular value decomposition of the (centered) data matrix, whereas princomp() uses an eigendecomposition of the covariance or correlation matrix. In practice the two give essentially the same results (up to the signs of the loadings), and the output of princomp() is convenient to work with, so I use princomp() here. Note that the results below come from PCA on the correlation matrix (i.e., with each variable standardized), which you request with the cor=TRUE argument.
And here is the result of the PCA.
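A minimal way to run it and print everything we will look at below (the object name `pca` is my choice):

```r
# PCA on the correlation matrix (cor = TRUE standardizes each variable).
pca <- princomp(data, cor = TRUE)

# Print the standard deviations, proportions of variance, and loadings.
summary(pca, loadings = TRUE)
```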
I will explain how to interpret this result in the next section.
Let's take a look at the loadings table, which contains the coefficients for the “new” variables. (Blank cells are coefficients whose absolute values are below 0.1; R omits them by default to make the table easier to read.)
| | Comp.1 | Comp.2 | Comp.3 | Comp.4 |
|---|---|---|---|---|
| Price | -0.523 | | 0.848 | |
| Software | -0.177 | 0.977 | -0.120 | |
| Aesthetics | 0.597 | 0.134 | 0.295 | -0.734 |
| Brand | 0.583 | 0.167 | 0.423 | 0.674 |
From the loadings table, PCA found four new variables, Comp.1 to Comp.4, which together explain the same information as the original four variables (Price, Software, Aesthetics, and Brand). Comp.1 is calculated as follows (strictly speaking, from the standardized variables, because we ran PCA on the correlation matrix):
```
Comp.1 = -0.523 * Price - 0.177 * Software + 0.597 * Aesthetics + 0.583 * Brand
```
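These coefficients are simply the first column of the loadings matrix, so you can also pull them out directly:

```r
# Coefficients of Comp.1 (the first column of the loadings matrix).
pca$loadings[, 1]
```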
Thus, PCA successfully found a new combination of the variables, which is good. The next thing we want to know is how much power each new variable has to explain the information that the original data contain. For this, you need to look at the Standard deviation and Cumulative Proportion (of Variance) rows in the result.
| | Comp.1 | Comp.2 | Comp.3 | Comp.4 |
|---|---|---|---|---|
| Standard deviation | 1.56 | 0.98 | 0.68 | 0.38 |
| Cumulative Proportion | 0.61 | 0.85 | 0.96 | 1.00 |
Standard deviation is the standard deviation of each new variable. PCA picks the combinations of the variables so that the new variables have the largest possible standard deviations, so generally a larger standard deviation means a more useful variable. A common heuristic is to keep the new variables whose standard deviations are roughly over 1.0 (so we take Comp.1 and Comp.2). The 1.0 threshold makes sense here because we ran PCA on the correlation matrix: every standardized original variable has a standard deviation of exactly 1, so a component below 1.0 carries less information than a single original variable.
Another way to decide how many new variables to keep is to look at the cumulative proportion of variance, which tells you how much of the information in the original data is described by the new variables taken together. For instance, Comp.1 alone describes 61% of the information in the original data; Comp.1 and Comp.2 together describe 85%. A common rule of thumb is that about 80% indicates the data are described well. So, in this example, we keep Comp.1 and Comp.2, and ignore Comp.3 and Comp.4.
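If you prefer a visual check, a scree plot shows the variance of each component; a common practice is to keep the components before the point where the curve flattens out:

```r
# Scree plot of the component variances.
screeplot(pca, type = "lines")
```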
In this manner, we can reduce the number of variables (in this example, from four to two). Your next task is to understand what each new variable means in the context of your data. As we have seen, the first new variable is calculated as follows:
```
Comp.1 = -0.523 * Price - 0.177 * Software + 0.597 * Aesthetics + 0.583 * Brand
```
It is a very good idea to plot the data to see what this new variable means. You can use the scores, which are the values that each data point takes on the variables modeled by PCA.
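With princomp(), the scores are stored in the fitted object. Here is a quick way to look at and plot the Comp.1 scores (the plotting choices are mine):

```r
# Comp.1 score for each participant.
pca$scores[, 1]

# Plot the Comp.1 score of each participant.
plot(pca$scores[, 1], type = "h",
     xlab = "Participant", ylab = "Comp.1 score")
```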
From the plot (sorry, I was too lazy to upload the graph, but you can quickly generate it yourself), you can see that Participants 1-8 get negative values and the other participants get positive values. It seems that this new variable indicates whether a user cares about Price and Software or about Aesthetics and Brand when choosing a computer. So we could name this variable something like a “Feature/Fashion index.” There is no definitive answer in this part of PCA; you need to go through your data and make sense of what the new variables mean yourself.
Once you have done the analysis with PCA, you may want to look into whether the new variables can predict some phenomenon well. This is a bit like machine learning: checking whether the features can classify the data well. Let's say you asked the participants one more question in your survey, which OS they are using (Windows or Mac, coded as 0 or 1), and the results look like this.
Participant | Price | Software | Aesthetics | Brand | OS |
---|---|---|---|---|---|
P1 | 6 | 5 | 3 | 4 | 0 |
P2 | 7 | 3 | 2 | 2 | 0 |
P3 | 6 | 4 | 4 | 5 | 0 |
P4 | 5 | 7 | 1 | 3 | 0 |
P5 | 7 | 7 | 5 | 5 | 1 |
P6 | 6 | 4 | 2 | 3 | 0 |
P7 | 5 | 7 | 2 | 1 | 0 |
P8 | 6 | 5 | 4 | 4 | 0 |
P9 | 3 | 5 | 6 | 7 | 1 |
P10 | 1 | 3 | 7 | 5 | 1 |
P11 | 2 | 6 | 6 | 7 | 0 |
P12 | 5 | 7 | 7 | 6 | 1 |
P13 | 2 | 4 | 5 | 6 | 1 |
P14 | 3 | 5 | 6 | 5 | 1 |
P15 | 1 | 6 | 5 | 5 | 1 |
P16 | 2 | 3 | 7 | 7 | 1 |
What we are going to do here is see whether the new variables given by PCA can predict which OS people are using. OS is 0 or 1 in our case, which means the dependent variable is binomial, so we use logistic regression. I will skip the details of logistic regression here; if you are interested, they are available on a separate page.
First, we prepare the data about OS.
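For example, as a vector in the same participant order as the table (the name `OS` is my choice):

```r
# OS responses (0 or 1) for P1 through P16.
OS <- c(0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1)
```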
Then, fit a logistic regression model that predicts OS from the first variable we found through PCA (i.e., Comp.1).
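In R, logistic regression is a binomial GLM; here is a minimal sketch (the object name `model` is mine):

```r
# Logistic regression of OS on the Comp.1 scores.
model <- glm(OS ~ pca$scores[, 1], family = binomial)
summary(model)
```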
Now you have the logistic regression model.
Let's see how well this model predicts which OS the participants use. You can use the fitted() function to see the predictions.
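Continuing with the model object from above:

```r
# Predicted probability that each participant's OS is 1.
fitted(model)
```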
These values represent the predicted probabilities of being 1. For example, the model gives about a 15% chance that Participant 1 is using OS 1, based on the variable derived by PCA. Thus, Participant 1 is more likely to be using OS 0, which agrees with the survey response. In this way, PCA can be combined with regression models to estimate the probability of a phenomenon or to make predictions.
A method with a similar concept and name that you may have heard of is factor analysis. I explain the difference between PCA and factor analysis on the factor analysis page.