Generalized Linear Model (GLM) is a more generic framework that supports statistical analysis with some sort of linear formulation of a model. You can think of GLM as a superclass of linear regression and logistic regression: these regressions are special cases of GLM. In GLM, your model looks like this:

$$f(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n,$$

where f is a function (called the link function). In other words,

$$y = f^{-1}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n),$$

where $f^{-1}$ is the inverse function of f. What this formula means is: we first do a linear calculation with the coefficients ($\beta$) and independent variables (x). Then, we apply a transformation to it by using the function $f^{-1}$, and the result is the predicted dependent variable. So, if $f^{-1}(x) = x$, it is linear regression. If $f^{-1}(x) = \exp(x) / (1 + \exp(x))$, it is logistic regression. By the way, $\exp(x) / (1 + \exp(x))$ is called the inverse-logit function, and the logit function is $\log(x / (1 - x))$. So far so good? :)
The nice thing about GLM is that we can handle data distributions other than the normal distribution by choosing an appropriate function for f, as we do in logistic regression. With logistic regression, the dependent variable follows the binomial distribution, and the model predicts the probability for binomial data. Some common pairs of data distributions and link functions used in GLM are: the normal distribution with the identity function (linear regression), the binomial distribution with the logit function (logistic regression), and the Poisson distribution with the log function (Poisson regression).
There are more models used in GLM, but the Poisson model and the logistic-binomial model are probably the most useful ones for analyses in the context of HCI research, so I am going to explain these two models in this page.
One thing you should note is that the methods explained in this page do not cover multilevel models (often referred to as mixed-effects models). To put it in terms more familiar to HCI folks, you need slightly different procedures to accommodate repeated-measures factors. This will be explained in a separate page, but understanding GLM is definitely useful for understanding multilevel models.
This model uses the Poisson distribution. It is well known that the Poisson distribution represents the probability of a given number of events occurring within a fixed period of time. For instance, the number of car accidents in a city may follow the Poisson distribution.
Let's take a look at the following hypothetical data. You have studied how many emails 20 smartphone users (10 using Device A and 10 using Device B) sent from their mobile devices during the period of the study. You have the average number of emails per day for each participant, and the data look like this.
Participant | Device | Unlimited Data Plan | Email counts |
---|---|---|---|
P1 | Device A | yes | 10 |
P2 | Device A | no | 5 |
P3 | Device A | yes | 9 |
P4 | Device A | yes | 4 |
P5 | Device A | yes | 6 |
P6 | Device A | yes | 14 |
P7 | Device A | no | 2 |
P8 | Device A | yes | 3 |
P9 | Device A | no | 8 |
P10 | Device A | yes | 10 |
P11 | Device B | no | 11 |
P12 | Device B | yes | 15 |
P13 | Device B | no | 4 |
P14 | Device B | yes | 5 |
P15 | Device B | yes | 7 |
P16 | Device B | yes | 9 |
P17 | Device B | yes | 12 |
P18 | Device B | no | 6 |
P19 | Device B | no | 5 |
P20 | Device B | no | 8 |
In addition to the different devices, you also recorded whether each participant had an unlimited data plan, which you expected to affect the number of emails sent from the mobile devices. So your hypothesis is that the number of emails sent from mobile devices is influenced by the device used and by whether the participant had an unlimited data plan. OK, it's time to do regression! First, prepare the data.
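Here is one way to enter the data from the table above in R (a sketch; the variable names and the coding of the levels are my choice, picked so that the coefficient names that appear later, DeviceB and DataPlanY, match):

```r
# Enter the data from the table above.
# Device and DataPlan are categorical, so we wrap them with factor().
Device <- factor(c(rep("A", 10), rep("B", 10)))
DataPlan <- factor(c("Y","N","Y","Y","Y","Y","N","Y","N","Y",
                     "N","Y","N","Y","Y","Y","Y","N","N","N"))
Email <- c(10,5,9,4,6,14,2,3,8,10,11,15,4,5,7,9,12,6,5,8)
```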
Please note that Device and DataPlan are treated as categorical data here. If your independent variable is continuous, you don't have to use factor(), of course. We then run Poisson regression.
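With the variables above, the call would look something like this (the model name is my choice):

```r
# Poisson regression; family=poisson uses the log link by default
model <- glm(Email ~ Device + DataPlan, family = poisson)
summary(model)
```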
So the deviance has been improved by about 6 points, which is not too bad. And the model we have is as follows (writing the fitted intercept as $\beta_0$):

$$u = \exp(\beta_0 + 0.2187 \, DeviceB + 0.3923 \, DataPlanY),$$

where u is the predicted number of emails, DeviceB = 1 if the participant used Device B and 0 otherwise, and DataPlanY = 1 if the participant had an unlimited data plan and 0 otherwise. Now we look at the exponential part, which is

$$\exp(0.2187 \, DeviceB) \times \exp(0.3923 \, DataPlanY).$$
Let's take a deeper look at this model. First, we look at the difference between Device A and Device B. If DeviceB = 1, u is multiplied by exp(0.2187) = 1.24. So the difference of the devices may have an effect on encouraging people to send more emails; however, DeviceB is not statistically significant (p = 0.19), so we cannot strongly argue that there is such an effect, and we probably need more data.
How about the effect of the data plan? If DataPlanY = 1, u is multiplied by exp(0.3923) = 1.48, which sounds like something. Moreover, DataPlanY is statistically significant (p < 0.05). Thus, we can think that an unlimited data plan has a positive effect on the number of emails sent from the mobile devices.
So, it seems that the data plan matters for email usage on mobile devices. But there is a concern about this analysis, not in terms of the method itself, but in terms of the observation: the users who had unlimited data plans may simply have been heavier email users in general than the users who did not, regardless of the device. If this is the case, our analysis is not correct. If you noticed this before reading it here, you are a very sharp reader :). Yes, it is a concern, and we have a solution for it. Let's say you also measured how many emails each participant sent from all types of devices they used (e.g., desktop machines, laptops, internet tablets, etc.), which are shown as “Total email counts” in the table below.
Participant | Device | Unlimited Data Plan | Email counts | Total email counts |
---|---|---|---|---|
P1 | Device A | yes | 10 | 12 |
P2 | Device A | no | 5 | 15 |
P3 | Device A | yes | 9 | 13 |
P4 | Device A | yes | 4 | 5 |
P5 | Device A | yes | 6 | 8 |
P6 | Device A | yes | 14 | 17 |
P7 | Device A | no | 2 | 10 |
P8 | Device A | yes | 3 | 6 |
P9 | Device A | no | 8 | 22 |
P10 | Device A | yes | 10 | 14 |
P11 | Device B | no | 11 | 20 |
P12 | Device B | yes | 15 | 18 |
P13 | Device B | no | 4 | 11 |
P14 | Device B | yes | 5 | 11 |
P15 | Device B | yes | 7 | 12 |
P16 | Device B | yes | 9 | 13 |
P17 | Device B | yes | 12 | 28 |
P18 | Device B | no | 6 | 10 |
P19 | Device B | no | 5 | 14 |
P20 | Device B | no | 8 | 10 |
We use this “Total email count” as a baseline for each participant. Such a baseline is called an offset in GLM. With the offset, the model becomes:

$$\log(u_j) = \log(o_j) + \beta_0 + \beta_1 x_{j1} + \dots + \beta_n x_{jn},$$

or equivalently

$$\frac{u_j}{o_j} = \exp(\beta_0 + \beta_1 x_{j1} + \dots + \beta_n x_{jn}),$$

where j is the index of the data point (u_j is the predicted value of the dependent variable y_j for the j-th data point) and o_j is the offset for that data point. This model allows us to have an adjustment (offset) for each data point. More intuitively, we are modeling the rate of emails sent from the mobile device over the total number of sent emails: u_j / o_j is that rate, and o_j is the total number of sent emails. We can do this in the glm() function by using the offset option, but you have to take its logarithm.
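Continuing the sketch above, and assuming the total counts are stored in a variable named TotalEmail, the call might look like this:

```r
# Total email counts from all devices (the baseline for each participant)
TotalEmail <- c(12,15,13,5,8,17,10,6,22,14,20,18,11,11,12,13,28,10,14,10)

# Pass the log of the baseline through the offset option
model.offset <- glm(Email ~ Device + DataPlan, offset = log(TotalEmail), family = poisson)
summary(model.offset)
```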
So, we still have a significant effect of the data plan.
We are not completely done even after considering the offset. Another problem you may have is overdispersion. We have built a model with some (often very) strong assumptions: we only have two factors, and the distribution of the dependent variable follows the Poisson distribution. Thus, the variance predicted by our model might not be large enough to describe the variance we actually observe in the data. If this is the case, the model is not appropriate to use. This is what we call overdispersion, and it often happens in Poisson regression.
To see whether the model suffers from overdispersion, we can calculate the sum of squares of the standardized residuals (i.e., the differences between the observed and predicted values, divided by the standard deviation of the predicted values), and compare it to the chi-squared distribution.
Here is the procedure. You first need to calculate the predicted value.
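A sketch of this step, assuming we continue from the offset model fitted above:

```r
# Predicted (fitted) values of the model
predicted <- predict(model.offset, type = "response")

# Standardized residuals: for a Poisson model the variance equals the mean,
# so the standard deviation of each prediction is sqrt(predicted)
z <- (Email - predicted) / sqrt(predicted)
```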
Then, calculate the sum of squares of z, and divide it by the degrees of freedom to get the overdispersion ratio. If this ratio is much larger than 1, it implies that you have overdispersion. For the degrees of freedom, we need two values: the sample size and the number of coefficients. In our case, they are 20 and 3 (the intercept, DeviceB, and DataPlanY), respectively.
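For example:

```r
n <- 20  # sample size
k <- 3   # number of coefficients (intercept, DeviceB, DataPlanY)
sum(z^2) / (n - k)  # a ratio much larger than 1 suggests overdispersion
```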
We also can do a test with the chi-squared distribution.
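One way to do this test in R is to look at the upper-tail probability of the sum of squares of z under the chi-squared distribution with n - k degrees of freedom:

```r
# p value of the overdispersion test
pchisq(sum(z^2), df = n - k, lower.tail = FALSE)
```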
If this value is below 0.05, the test indicates that you have overdispersion. In this example, we don't seem to have overdispersion, which is good news.
What if we do have overdispersion? We then need to adjust the standard error of each coefficient by multiplying it by the square root of the overdispersion ratio calculated above. For example, suppose that the ratio turned out to be 10 in our example. The coefficient for DataPlanY is 0.41366 with the standard error 0.17353, which means the coefficient is likely within [0.24013, 0.58719]. With the adjustment for the overdispersion, the standard error becomes 0.17353 * sqrt(10) = 0.54875. Thus, the adjusted coefficient is likely within [-0.13509, 0.96241], which means it could be zero (i.e., the effect of DataPlanY could be zero). Fortunately, you don't have to do this adjustment manually; you can re-fit the model by using quasipoisson as follows.
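A sketch of the re-fitted model (keeping the same offset as above):

```r
# Quasi-Poisson model: same coefficients, but standard errors adjusted for overdispersion
model.quasi <- glm(Email ~ Device + DataPlan, offset = log(TotalEmail), family = quasipoisson)
summary(model.quasi)
```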
It also calculates the p value based on the adjustment for the overdispersion.
This is basically logistic regression, but we are going to use the binomial distribution for the dependent variable. In this way, we can model the number of trials of interest out of all the trials, like a success rate or an error rate.
Let's say you have a game application in which the participants can hit monsters by clicking on them. In your experiment, you set different time limits (10, 20, 30, 40, 50, and 60 sec), and measured the success rate (i.e., the number of monsters the participants could click during the trial out of the pre-defined number of monsters). Now, what you want to do is to create a model of the success rate with the game time limit. Let's say your data look like this.
Participant | Time | Clicked | Not-clicked | Success rate |
---|---|---|---|---|
P1 | 10 sec | 10 | 90 | 0.10 |
P2 | 20 sec | 23 | 77 | 0.23 |
P3 | 30 sec | 40 | 60 | 0.40 |
P4 | 40 sec | 70 | 30 | 0.70 |
P5 | 50 sec | 82 | 18 | 0.82 |
P6 | 60 sec | 96 | 4 | 0.96 |
P7 | 10 sec | 12 | 88 | 0.12 |
P8 | 20 sec | 20 | 80 | 0.20 |
P9 | 30 sec | 35 | 65 | 0.35 |
P10 | 40 sec | 58 | 42 | 0.58 |
P11 | 50 sec | 76 | 24 | 0.76 |
P12 | 60 sec | 90 | 10 | 0.90 |
(Well, you would have much more data in reality, but I put 12 samples here to keep the example simple.) Let's try logistic regression on these data. First, we prepare the data in R.
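Here is one way to do it (a sketch; the variable names are my choice, matching the description below):

```r
# Data from the table above
Time <- c(10,20,30,40,50,60,10,20,30,40,50,60)
Success <- c(10,23,40,70,82,96,12,20,35,58,76,90)   # monsters clicked
Failure <- c(90,77,60,30,18,4,88,80,65,42,24,10)    # monsters not clicked

# Combine the successes and failures into a two-column matrix
GameResult <- cbind(Success, Failure)
```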
Note that we prepare both the numbers of “Success” and “Failure”, and make it into one variable, called “GameResult”. In this way, the logistic regression is going to use the binomial distribution. Now, we do the same thing with the glm() function.
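For example:

```r
# Logistic-binomial regression: with a two-column (success, failure) response,
# glm() with family=binomial models the success rate
model.game <- glm(GameResult ~ Time, family = binomial)
summary(model.game)
```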
You get the result.
You can do a similar analysis to the one described in the previous section. For instance, the model greatly improves the deviance, which indicates that this model is very powerful. Time is a statistically significant predictor of the success rate.
This model also often suffers from overdispersion, so you have to be careful about it. You can find more details on how to check for overdispersion in the previous section. You can use quasibinomial to re-fit the model with an adjustment for overdispersion, as follows.
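A sketch of the re-fitted model:

```r
# Quasi-binomial model: standard errors adjusted for overdispersion
model.game.quasi <- glm(GameResult ~ Time, family = quasibinomial)
summary(model.game.quasi)
```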
The Poisson model is similar to the logistic-binomial model, particularly in the sense that both treat count data, and you may wonder how to choose which model to use. Here is the rule of thumb.
(coming soon).
The content in this page is largely what I learned from the following book. Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. It is the best book I have read about regression analysis (well, I haven't read that many books, but I'm sure this is an excellent book).