Linear Regression

Introduction

Linear regression is a standard way to build a model of your variables. You want to do this when you expect one variable to change roughly linearly with another and you want a simple model that describes (and predicts) that relationship.

One common usage of linear regression in HCI research is Fitts' law (and its variants). If your experiment includes a pointing task in which the user makes a ballistic movement towards the target, the performance time you measured is likely to follow Fitts' law. In this case, you plot the index of difficulty (its logarithmic value, to be precise) along the x axis and the performance time along the y axis, and do linear regression. If your results follow Fitts' law, the goodness of fit is typically over 0.9.
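Just to give you a concrete picture before the walkthrough below, here is a minimal sketch of what a Fitts' law analysis looks like in R. All the distances, widths, and movement times here are made up for illustration, and I use the Shannon formulation of the index of difficulty, ID = log2(D/W + 1).

D <- c(100, 200, 400, 100, 200, 400)          # target distances (hypothetical)
W <- c(20, 20, 20, 40, 40, 40)                # target widths (hypothetical)
ID <- log2(D / W + 1)                         # index of difficulty (Shannon formulation)
MT <- c(0.61, 0.82, 1.05, 0.48, 0.66, 0.88)   # measured movement times (hypothetical)
fitts_model <- lm(MT ~ ID)                    # linear regression of time on ID
plot(ID, MT)                                  # scatter plot of the data
abline(fitts_model)                           # add the fitted regression line
summary(fitts_model)$r.squared                # goodness of fit (R squared)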

Linear regression is also a core concept behind the t test and ANOVA, so understanding linear regression will help you understand other statistical methods. You can easily find the mathematical details of linear regression in many places (books, blogs, Wikipedia, etc.), so I don't go into those details here. Rather, I want to show you how to do linear regression in R and how to interpret the results.

One thing you need to be careful about before doing linear regression is your dependent variable. If your dependent variable is binomial (i.e., a “yes” or “no” response), you probably should use logistic regression instead of linear regression. I have a separate page for it.
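If your dependent variable is indeed binary, the change in R is small: you switch from lm() to glm() with a binomial family. Here is a minimal sketch with made-up data (see the logistic regression page for the details).

Success <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0)                          # yes/no responses coded as 1/0 (hypothetical)
Difficulty <- c(1.2, 3.4, 1.8, 1.1, 3.9, 2.0, 3.5, 1.5, 2.2, 4.1)   # a predictor (hypothetical)
logit_model <- glm(Success ~ Difficulty, family = binomial)         # logistic regression
summary(logit_model)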



R code example

First of all, we need data. Let's think about a Fitts law task. You have the logarithmic ID (index of difficulty) and performance time.

Log_ID <- c(1.23,1.55,1.42,1.98,2.31,3.22,2.58,1.83,2.49,2.97,1.74,1.38,3.56,3.87,3.41)
Time <- c(2.93,3.74,4.74,5.20,6.11,6.93,5.07,5.21,5.60,7.23,4.42,3.53,8.03,9.00,7.98)

I just randomly made the values for Log_ID. To generate the values for Time, I used the code Time <- 1 + 2*Log_ID + rnorm(15, 0, 0.5). So, we can expect the result of the linear regression to be something like y = 1 + 2*x. But there is an error term (rnorm(15, 0, 0.5)), so the coefficients will not be exactly 1 and 2.
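If you want to regenerate similar data yourself, something like the following works. The seed is arbitrary (my choice here, nothing special), so your Time values will not match the ones listed above exactly.

set.seed(1)                                  # arbitrary seed for reproducibility
Time <- 1 + 2 * Log_ID + rnorm(15, 0, 0.5)   # true model y = 1 + 2x plus Gaussian noise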

Let's do linear regression now. In R, you can use the lm() function to do linear regression. But you need to tell R which variables are x and y. You do this with ~. In our case, we want to build a model like Time = a + b * Log_ID, so we write Time ~ Log_ID.

model <- lm(Time ~ Log_ID)
model

Then, you get the result.

Call:
lm(formula = Time ~ Log_ID)

Coefficients:
(Intercept)       Log_ID
      1.020        1.982

Look at the Coefficients section. The values below (Intercept) and Log_ID are a and b, respectively. As we expected, a and b are pretty close to 1 and 2. But this alone does not tell us how well the model we have just built represents our data. And unlike this example, we usually don't know the underlying model of the data. For this, we need the goodness of fit, which is also called R squared (R^2). To calculate it, we need to run one more command in R.

summary(model)

Now, you get more detailed results.

Call:
lm(formula = Time ~ Log_ID)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06212 -0.35244 -0.04407  0.31836  0.90651

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0196     0.4090   2.493   0.0269 *
Log_ID        1.9816     0.1627  12.180 1.75e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5294 on 13 degrees of freedom
Multiple R-squared:  0.9194,    Adjusted R-squared:  0.9132
F-statistic: 148.4 on 1 and 13 DF,  p-value: 1.747e-08

Look at the Coefficients section. You can see the same coefficients under Estimate that we saw before. But you can also see the t values and p values. What do they mean? They test, with a t test, the null hypothesis that each coefficient is equal to zero. The degrees of freedom for this t test are N - k - 1, where N is the number of data points (15 in this example) and k is the number of independent variables (1 in this example). Because both p values are less than 0.05, the null hypothesis is rejected for both coefficients. This means that both the intercept a and the term b * Log_ID contribute significantly to predicting the value of Time. Thus, your model needs to be Time = a + b * Log_ID, not Time = a or Time = b * Log_ID.
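If you want to see where these numbers come from, you can recompute them by hand from the coefficient table (a quick sketch using the values above):

t_value <- 1.9816 / 0.1627                  # t value = estimate / standard error, about 12.18
p_value <- 2 * pt(-abs(t_value), df = 13)   # two-sided p value with N - k - 1 = 13 degrees of freedom
t_value
p_value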

Another thing you need to look at is the goodness of fit, R squared. As you can see, there are two kinds of R squared: Multiple R-squared and Adjusted R-squared. I know what you want to ask: What is the difference, and which one should we use? Multiple R-squared is the original goodness of fit, which ranges from 0 to 1. If this value is equal to 1, the data perfectly fit your model (i.e., all the data points lie on the line calculated by the linear regression).

One problem with the non-adjusted R squared is that it is strongly affected by the number of data points (in other words, the degrees of freedom). Think about a really extreme case: you have only two data points. You can always draw a line that connects the two points, and R squared is 1. But obviously, this may not be a good model once you get more data points. Another problem with the non-adjusted R squared is that it tends to grow as you add more independent variables to your model (i.e., the data look like they fit your model better). This is similar to the problem of overfitting, if you are familiar with machine learning. It is more problematic in multiple regression, and I explain more details on the multiple regression page.

Adjusted R-squared is calculated to avoid the problems above. So, what you should look at is Adjusted R-squared, not Multiple R-squared. Please note that it can be less than 0 because of the adjustment. In our example, the adjusted R squared is 0.91, which is high and indicates that the model we built fits the data well.
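For reference, the adjustment uses the usual formula, adjusted R^2 = 1 - (1 - R^2)(N - 1)/(N - k - 1), where N is the number of data points and k is the number of independent variables. You can verify the value in the output above yourself:

R2 <- 0.9194                           # Multiple R-squared from summary(model)
N <- 15                                # number of data points
k <- 1                                 # number of independent variables
1 - (1 - R2) * (N - 1) / (N - k - 1)   # about 0.9132, matching Adjusted R-squared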



Effect size

Fortunately, the effect size for linear regression is R (the square root of R squared), so you don't really need to calculate a separate effect size. Here are the values that are considered small, medium, and large effect sizes.

      small size   medium size   large size
R     0.1          0.3           0.5
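Note that this table is in terms of R, not R squared. With a single predictor, R is simply the square root of the R squared reported by summary(model):

sqrt(0.9194)   # about 0.96, a large effect by the guidelines above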

With the MBESS package, we can calculate the confidence interval for this effect size.

library(MBESS)
ci.R2(R2=0.9132, N=15, K=1)

N is the sample size, and K is the number of predictors, which is one in this case. You get a result like this.

$Lower.Conf.Limit.R2
[1] 0.7410273

$Prob.Less.Lower
[1] 0.025

$Upper.Conf.Limit.R2
[1] 0.96807

$Prob.Greater.Upper
[1] 0.025

Thus, the effect size is 0.91 with 95% CI = [0.74, 0.97].



How to report

There are several things you need to report in linear regression, but basically there are two things: statistics for your independent variable, and statistics for your model. The following example will give you an idea of how you are supposed to report the results.

The logarithmic value of the index of difficulty significantly predicted the performance time (b = 1.98, t(13) = 12.18, p < 0.01). The overall model with the logarithmic value of the index of difficulty also predicted the performance time very well (adjusted R^2 = 0.91, F(1, 13) = 148.4, p < 0.01).
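If you prefer to pull these numbers directly out of the fitted model instead of reading them off the printed summary, a sketch like this works:

s <- summary(model)
s$coefficients["Log_ID", ]   # estimate, standard error, t value, and p value for b
s$adj.r.squared              # adjusted R squared for the overall model
s$fstatistic                 # F value and its degrees of freedom (1 and 13)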



Non-linear regression

Yes, you can do regression with non-linear models. One of the most common kinds of non-linear regression is logistic regression. But generally, it is rare to see regression other than linear regression in HCI research, and I don't have a good example of a case in which you really need non-linear regression. If I have time, I will add an explanation and an example of non-linear regression here.



Adding more variables to the regression

Another extension of linear regression is to have multiple independent variables in the model, which is often called multiple regression. Although the idea of multiple regression is similar to that of linear regression, there are more things you need to be careful about in multiple regression. Thus, I have a separate page for multiple regression.
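Syntactically, the only change in R is to add more terms to the formula. For example, with a hypothetical second independent variable (reusing the Time and Log_ID vectors from above):

Device <- factor(rep(c("mouse", "trackpad", "pen"), 5))   # a made-up factor with 15 values
model2 <- lm(Time ~ Log_ID + Device)                      # multiple regression with two predictors
summary(model2)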



Regression vs. Correlation

If you are familiar with correlation, you may wonder what the difference is between linear regression and correlation. I answer that on the correlation page.