Correlation

Introduction

Correlation measures the strength of the relationship between two variables, and is represented as a value between -1 and 1. For example, you can use correlation to see whether users who can type fast on a physical keyboard can also type fast on a software keyboard. Correlation is not among the most common statistical methods used in HCI research, but it is the basis of other methods. So, understanding correlation is important, and fortunately it is fairly easy to understand.

There are three major metrics for correlation: Pearson's product-moment coefficient, Spearman's rank correlation coefficient, and the Kendall tau rank correlation coefficient.

Fortunately, all of these coefficients can be calculated quite easily in R (you just need to specify a different option for each coefficient). But before taking a look at the R code, let's talk about the effect size.



Effect size

For Pearson's coefficient, the value of r itself represents the effect size, so you don't need to do anything additional.

               small size   medium size   large size
Pearson's r       0.1           0.3          0.5
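
If it is convenient, you can turn the table above into a small helper function. This is just a sketch (the function name and the "negligible" label for values below 0.1 are my own):

effect_size_label <- function(r) {
  if (abs(r) >= 0.5) "large"        # thresholds from the table above
  else if (abs(r) >= 0.3) "medium"
  else if (abs(r) >= 0.1) "small"
  else "negligible"
}

effect_size_label(0.87)   # "large"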

I haven't figured out the effect size conventions for the other coefficients. I will put them here once I figure them out.



R code example

First, we need data.

x <- c(10,14,12,20,15,13,18,11,10)
y <- c(22,21,25,35,28,29,31,19,17)

Then, do a correlation analysis. Here, we calculate Pearson's coefficient.

cor.test(x,y,method="pearson")

Now you get the result.

        Pearson's product-moment correlation

data:  x and y
t = 4.6855, df = 7, p-value = 0.002246
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4900229 0.9724978
sample estimates:
      cor
0.8707668

The correlation is 0.87, which is considered a strong correlation. The p value is less than 0.05. The null hypothesis of the correlation test is that there is no correlation between the two variables (r = 0), so this result means that we reject the null hypothesis. If the p value is over 0.05, you have to say that you could not find a significant correlation in your data.
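
By the way, cor.test() returns an object whose fields you can access directly if you want to use these numbers programmatically:

result <- cor.test(x, y, method="pearson")
result$estimate    # the correlation coefficient (0.8707668)
result$p.value     # the p value (0.002246)
result$parameter   # the degrees of freedom (7)
result$conf.int    # the 95 percent confidence interval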

If you want to calculate Spearman's rank correlation coefficient or the Kendall tau rank correlation coefficient, you can do cor.test(x,y,method="spearman") or cor.test(x,y,method="kendall"). If your data contain ties, you will see some warning messages for these two coefficients. This is because R cannot calculate the exact p value with ties, and instead calculates an approximate p value by adjusting the test statistic (the estimate is scaled to zero mean and unit variance, and is approximately normally distributed).
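
If you want to avoid the warning, you can ask R for the approximate p value explicitly by setting the exact option to FALSE (shown here with the same x and y as above, where x contains a tie):

cor.test(x, y, method="spearman", exact=FALSE)
cor.test(x, y, method="kendall", exact=FALSE)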



How to report

You can report the results of your correlation analysis as follows (the number in parentheses is the degrees of freedom): We found that the two variables were strongly correlated (Pearson's r(7) = 0.87, p < 0.01).
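
If you run the test in R anyway, you can build this sentence directly from the result object. This is just a sketch; the format string is one possible phrasing:

result <- cor.test(x, y, method="pearson")
sprintf("Pearson's r(%d) = %.2f, p = %.3f",
        result$parameter, result$estimate, result$p.value)

With the example data above, this prints "Pearson's r(7) = 0.87, p = 0.002".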



Correlation and Causality

You may think that if you find a correlation, you have also found causality (e.g., that x causes y). However, this is not necessarily true. Correlation does not directly imply causality. Such a causal relationship may exist, but the correlation only tells you how strong the relationship between the two variables is.
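
To see this concretely, here is a small simulation with made-up data. x and y end up strongly correlated only because both depend on a hidden third variable z; neither causes the other:

set.seed(1)
z <- rnorm(100)               # a hidden confounding factor
x <- z + rnorm(100, sd=0.5)   # x depends on z
y <- z + rnorm(100, sd=0.5)   # y depends on z, not on x
cor(x, y)                     # strong correlation, but no causality between x and y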



Correlation vs. Regression

Another confusing thing is the difference between correlation and regression (particularly, linear regression). The two are quite similar in the sense that both look at the relationship between two variables. But there are still some important differences.

You should do correlation if you want to know how well two variables (x and y) are associated, and x and y are symmetrical. Symmetrical means that there is no notion that one of them precedes or causes the other. From the experimental design perspective, you controlled neither of the two variables in your experiment. For instance, suppose you measured two things: the age of the users and their average time of computer usage in a day. Because you controlled neither of them, you can do correlation and see the relationship between age and time of computer usage.
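
For instance, with hypothetical numbers for this scenario (the data are made up):

age   <- c(21, 25, 34, 42, 50, 58, 63)          # ages of the users
usage <- c(6.5, 7.0, 5.5, 4.0, 3.5, 2.5, 2.0)   # average computer usage per day (hours)
cor.test(age, usage, method="pearson")          # neither variable was controlled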

If your x and y are not symmetrical, in other words, if you controlled one of them (which we usually define as x), you must use regression. This is because the relationship between x and y should be described by how x explains y, and not how y explains x. For example, suppose you have data from target acquisition tasks (target sizes and performance times). In this kind of study, you likely controlled the target sizes. So, you should do regression, and see how well the target sizes can predict the performance time.
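
Again with made-up numbers, the target acquisition case would look like this:

size <- c(16, 24, 32, 48, 64)               # controlled target sizes (pixels)
time <- c(1.42, 1.18, 0.97, 0.81, 0.69)     # measured performance times (seconds)
model <- lm(time ~ size)                    # regression: how well size explains time
summary(model)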