
Logistic Regression


Introduction

Logistic Regression is a way to fit your data to the logistic function. The logistic function is as follows:

$$f(x) = \frac{1}{1 + e^{-x}}$$

One unique characteristic of the logistic function is that it only takes values between 0 and 1. Logistic regression is generally used for a binary outcome, such as “yes” or “no”. It can also be used to model the number of trials of interest out of the total number of trials (like a success rate or error rate). Let's take a look at an example first, and we will see how we can interpret the results.
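If you want to see this shape for yourself, a one-line plot of the function is enough (this is just an illustration and not part of the analysis below):

# plot the logistic function f(x) = 1 / (1 + exp(-x)); the curve stays strictly between 0 and 1
curve(1 / (1 + exp(-x)), from=-6, to=6, xlab="x", ylab="f(x)")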



R code example: Regression for binary data

We are going to use an example similar to the one we use in Principal Component Analysis. Let's say we asked the participants two 7-point Likert-scale questions about what they care about when choosing a new computer, as well as which OS they are using (either Windows=0 or Mac=1), and got results like this.

Participant   Price   Aesthetics   OS
P1            6       3            0
P2            4       5            0
P3            6       4            0
P4            5       1            0
P5            7       5            0
P6            6       2            0
P7            5       2            0
P8            6       4            0
P9            3       3            0
P10           2       6            0
P11           3       6            1
P12           1       7            1
P13           2       6            1
P14           5       7            1
P15           2       5            1
P16           3       6            1
P17           1       5            1
P18           4       4            1
P19           3       7            1
P20           5       4            1

Your hypothesis is that “Price” and “Aesthetics” are somehow connected to the choice of OS. Let's first create the data frame. We also center the data: in this example, we already know that the theoretical center of the scale is 4, so we just subtract 4 from all the values (thus, the values range from -3 to 3 with 0 at the center).

Price <- c(6,4,6,5,7,6,5,6,3,2,3,1,2,5,2,3,1,4,3,5)
Price <- Price - 4
Aesthetics <- c(3,5,4,1,5,2,2,4,3,6,6,7,6,7,5,6,5,4,7,4)
Aesthetics <- Aesthetics - 4
OS <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
data <- data.frame(Price, Aesthetics, OS)

Now, we are going to do a logistic regression. The format is very similar to linear regression, and you basically just need to use the glm() function with family=binomial. First, we use Price as an independent variable.

model <- glm(OS ~ Price, data=data, family=binomial)
summary(model)

Call:
glm(formula = OS ~ Price, family = binomial, data = data)

Deviance Residuals:
      Min         1Q     Median         3Q        Max
-1.947987  -0.623901   0.004467   0.841877   1.577686

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.0249     0.5573  -0.045   0.9644
Price        -0.8799     0.3821  -2.303   0.0213 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.726  on 19  degrees of freedom
Residual deviance: 19.758  on 18  degrees of freedom
AIC: 23.758

Number of Fisher Scoring iterations: 4

Thus, our model is as follows.

$$P(OS = 1) = \frac{\exp(-0.0249 - 0.8799 \times Price)}{1 + \exp(-0.0249 - 0.8799 \times Price)}$$

So, what do the results tell us? Let's take a closer look at the model. Consider the case where Price equals 0, which corresponds to a response of “4” on the 7-point Likert scale. This means that the participant showed a neutral attitude toward the price of a computer as a factor in the purchase decision. In that case, the probability of OS = 1 (which means that the person chooses Mac) is exp(-0.0249) / (1 + exp(-0.0249)) = 0.494. So by default, people choose Windows or Mac with almost equal probability.

However, this probability becomes smaller or larger depending on how important the person considers the price to be. The probability of OS = 1 for a person who responded “7” is exp(-0.0249 - 0.8799 * 3) / (1 + exp(-0.0249 - 0.8799 * 3)) = 0.065, and for a person who responded “1” it is exp(-0.0249 - 0.8799 * (-3)) / (1 + exp(-0.0249 - 0.8799 * (-3))) = 0.931. This makes sense because there are lots of cheap PCs on the market, and Windows is probably more price-competitive than Mac. As the results show, “Price” is statistically significant (p < 0.05), which means that the coefficient of Price is likely to be negative even when we account for its estimation error.
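Instead of plugging the rounded coefficients into the formula by hand, you can also let R compute these probabilities with predict(). A minimal sketch, assuming model is the Price-only fit from above (the values will differ slightly from the hand calculations because of rounding):

# centered Price values: 0 (response "4"), 3 (response "7"), -3 (response "1")
predict(model, newdata=data.frame(Price=c(0, 3, -3)), type="response")
# roughly 0.494, 0.065, and 0.931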

Another thing you should look at is the deviance. It is more or less like the residual standard deviation in linear regression: a metric of the model error, so a lower deviance means a better fit of the model. The result says that the null deviance is 27.726 and the residual deviance is 19.758. The null deviance is the deviance of the model without any independent variable (i.e., the model for the null hypothesis), and the residual deviance is the deviance of the model we have created. Thus, the result indicates that our model improves the prediction of the probability of the OS choice by about 8 points of deviance. We usually expect to see a decrease of more than 1 in the deviance when a meaningful factor is introduced to the model.
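If you want to pull these deviance values out of the fitted model directly (rather than reading them off summary()), something like the following should work; a minimal sketch, again assuming model is the Price-only fit from above:

model$null.deviance                     # 27.726, deviance of the intercept-only model
deviance(model)                         # 19.758, deviance of the model with Price
model$null.deviance - deviance(model)   # about 8, the improvement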

So how about adding Aesthetics to the model? Let's give it a try.

model <- glm(OS ~ Price + Aesthetics, data=data, family=binomial)
summary(model)

Call:
glm(formula = OS ~ Price + Aesthetics, family = binomial, data = data)

Deviance Residuals:
      Min         1Q     Median         3Q        Max
-2.153058  -0.584998   0.009312   0.583771   1.687231

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6168     0.7530  -0.819   0.4127
Price        -0.5309     0.4123  -1.288   0.1979
Aesthetics    0.8846     0.5239   1.688   0.0913 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.726  on 19  degrees of freedom
Residual deviance: 15.659  on 17  degrees of freedom
AIC: 21.659

Number of Fisher Scoring iterations: 5

The deviance has improved decently (by about 4 points), so it seems that including Aesthetics adds something. However, the coefficients are not statistically significant, so we cannot rule out that they are zero (which would mean they have no effect on predicting the probability of OS = 1). Nonetheless, the sign of the coefficient for Price is still negative, which matches our observation from the model with only the Price factor. The sign of the coefficient for Aesthetics is positive, which implies that people who care more about the appearance of the device tend to buy Macs, which also seems sensible (well, I'm not saying that the appearance of PCs is not good. This is just hypothetical data, and I have been a Windows user for more than 10 years :). Considering the improvement in deviance, we can say that this model is what we wanted to achieve.
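If you want a formal check on whether this drop in deviance is larger than what chance alone would produce, one common option is a likelihood-ratio (chi-squared) test between the two nested models. This is just an additional sketch; model_price and model_both are names I chose here:

model_price <- glm(OS ~ Price, data=data, family=binomial)
model_both  <- glm(OS ~ Price + Aesthetics, data=data, family=binomial)
anova(model_price, model_both, test="Chisq")   # tests the deviance difference between the nested models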

In general, when you are trying to decide on the model, you need to look at the statistical results (e.g., whether each factor is statistically significant and whether the deviance decreases), and you need to make sense of the model (e.g., what each factor means in the context of the model and whether its direction (positive or negative) is sensible). This is not necessarily easy when you have many factors, and it may take a lot of time.

If you want to know more about logistic regression, I recommend reading the following book: Data Analysis Using Regression and Multilevel/Hierarchical Models.



R code example: Regression for the success/error rate

You can also use logistic regression to model a success or failure rate. Although the procedure is quite similar, there are several precautions you need to take. Please see more details in the Generalized Linear Model page.



Goodness of fit (Effect size)

To describe how well the model predicts, we need to know the goodness of fit. In linear regression, the adjusted R-squared is the metric for the goodness of fit, but we cannot directly use the same metric for logistic regression. This is why we have the deviance, and I think reporting the deviance is probably enough. The results also show the AIC (Akaike Information Criterion), which also indicates the goodness of the model (a smaller AIC means a better model), so you may want to report it as well.
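For example, the AIC() function reports the same values as the summary() outputs above; a minimal sketch, assuming the two models are stored as model_price and model_both (names I chose here):

model_price <- glm(OS ~ Price, data=data, family=binomial)
model_both  <- glm(OS ~ Price + Aesthetics, data=data, family=binomial)
AIC(model_price)   # 23.758
AIC(model_both)    # 21.659, the smaller (i.e., better) of the two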

But there are a few direct alternatives to R-squared, which are called pseudo R-squared. You can find more details about them here, but I describe two of them which are commonly used for logistic regression and fairly easy to calculate: Cox & Snell and Nagelkerke.

The intuition behind Cox & Snell and Nagelkerke is to calculate the improvement in prediction of the model you create over the base model (called the null model). We use the log-likelihood for this calculation. Let's look at how to calculate them in more detail.



Cox & Snell R-squared

Cox and Snell R-squared can be calculated as follows:

$$R^2_{CS} = 1 - \exp\left(\frac{2\,(L_{intercept} - L_{model})}{N}\right),$$

where "" and "" are the log-likelihood of the model and intercept (the model without any explaining variable), respectively, and N is the sample size (which is 20 in this example). The log-likelihood basically represents how likely the prediction by your model can happen in your data. If your model perfectly predict your data, the likelihood is 1, which means the -2 log-likelihood is 0. There is a convenient function to calculate the log-likelihood. For example, is:

model <- glm(OS ~ Price + Aesthetics, data=data, family=binomial)
logLik(model)
'log Lik.' -7.829328 (df=3)

You can calculate $L_{intercept}$ in a similar way, but you have to create a model without any variable. (You can also confirm the value of the null deviance by executing summary(model_null).)

model_null <- glm(OS ~ 1, data=data, family=binomial)
logLik(model_null)
'log Lik.' -13.86294 (df=1)

Now, we can calculate Cox and Snell R-squared.

1 - exp(2 * (-13.86294 - (-7.829328)) / 20)
0.4530299
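If you prefer not to type the log-likelihood values by hand, you can pull them out of the logLik() objects directly; ll_model and ll_null are just variable names I use here:

ll_model <- as.numeric(logLik(model))        # -7.829328
ll_null  <- as.numeric(logLik(model_null))   # -13.86294
1 - exp(2 * (ll_null - ll_model) / 20)       # Cox & Snell R-squared, about 0.45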



Nagelkerke R-squared

One of the problems with Cox & Snell R-squared is that its theoretical maximum is not always 1, which is a little weird for an R-squared. Nagelkerke R-squared solves this problem. It can be calculated as follows:

$$R^2_{N} = \frac{1 - \exp\left(\frac{2\,(L_{intercept} - L_{model})}{N}\right)}{1 - \exp\left(\frac{2\,L_{intercept}}{N}\right)},$$

where $L_{model}$ and $L_{intercept}$ are the log-likelihoods of the model and of the intercept-only model (the model without any explanatory variable), respectively, and N is the sample size (which is 20 in this example). Nagelkerke R-squared for the example is calculated as follows (see the previous section for calculating the log-likelihoods):

(1 - exp(2 * (-13.86294 - (-7.829328)) / 20)) / (1 - exp(2 * (-13.86294) / 20))
0.60404
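As before, you can also compute this directly from the logLik() values instead of retyping the numbers; ll_model and ll_null are just variable names I use here:

ll_model <- as.numeric(logLik(model))        # log-likelihood of the full model
ll_null  <- as.numeric(logLik(model_null))   # log-likelihood of the intercept-only model
(1 - exp(2 * (ll_null - ll_model) / 20)) / (1 - exp(2 * ll_null / 20))   # about 0.60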

Generally, Nagelkerke R-squared is larger than Cox & Snell R-squared. In this example, both pseudo R-squared values tell us that the model predicts the choice of OS pretty well.