Generalized Linear Model (GLM)


Introduction

The Generalized Linear Model (GLM) is a general framework for statistical analysis with models that have a linear form. You can think of GLM as a superclass of linear regression and logistic regression: both of these regressions are special cases of GLM. In GLM, your model looks like this.

f(y) = b0 + b1 * x1 + b2 * x2 + ... + bn * xn,

where f is a function (called the link function). In other words,

y = f^-1(b0 + b1 * x1 + b2 * x2 + ... + bn * xn),

where f^-1 is the inverse function of f. What this formula means: we first do a linear calculation with the coefficients (b) and independent variables (x). Then, we apply a transformation to it with the function f^-1, and the result is the predicted dependent variable. So, if f is the identity function (f(y) = y), it is linear regression. If f^-1(x) = 1 / (1 + exp(-x)), it is logistic regression. By the way, this f^-1 is called the inverse-logit function, and the logit function is log(x / (1 - x)). So far so good? :)

The nice thing about GLM is that we can handle data distributions other than the normal distribution by choosing an appropriate function for f, as we do in logistic regression. With logistic regression, the dependent variable is modeled as the probability for binomial data. There are several common combinations of data distributions and link functions used in GLM.
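For reference, these common combinations map onto the family argument of R's glm() function. Here is an illustrative sketch (d, y, x1, and x2 are hypothetical names, not from the examples below):

# Normal distribution with the identity link: ordinary linear regression
model_gaussian <- glm(y ~ x1 + x2, data=d, family=gaussian)

# Binomial distribution with the logit link: logistic regression
model_binomial <- glm(y ~ x1 + x2, data=d, family=binomial)

# Poisson distribution with the log link: Poisson regression (explained below)
model_poisson <- glm(y ~ x1 + x2, data=d, family=poisson)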

There are more models used in GLM, but the models above are probably the most useful for analysis in the context of HCI research, so I am going to explain these models in this page.

One thing you should note is that the methods explained in this page do not cover multilevel models (often referred to as mixed-effects models). To put it in terms more familiar to HCI folks, you need slightly different procedures to accommodate repeated-measures factors. This will be explained in a separate page, but understanding GLM is definitely useful for understanding multilevel models.



Poisson model

This model uses the Poisson distribution. The Poisson distribution is well known for representing the probability of a given number of events occurring within a fixed period of time. For instance, the number of car accidents in a city may follow the Poisson distribution.
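If you want to get a quick feel for this distribution in R, here is a small illustrative sketch (lambda = 7 is just an arbitrary mean rate):

lambda <- 7            # an arbitrary mean number of events per time period
dpois(0:3, lambda)     # probability of observing exactly 0, 1, 2, or 3 events
rpois(10, lambda)      # 10 simulated counts drawn from this distribution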



R code example

Let's take a look at the following hypothetical data. You have studied how many emails 20 smartphone users (10 using Device A and 10 using Device B) sent from their mobile devices during the period of the study. You have the average number of emails per day for each participant, and the data look like this.

Participant   Device     Unlimited Data Plan   Email counts
P1            Device A   yes                   10
P2            Device A   no                    5
P3            Device A   yes                   9
P4            Device A   yes                   4
P5            Device A   yes                   6
P6            Device A   yes                   14
P7            Device A   no                    2
P8            Device A   yes                   3
P9            Device A   no                    8
P10           Device A   yes                   10
P11           Device B   no                    11
P12           Device B   yes                   15
P13           Device B   no                    4
P14           Device B   yes                   5
P15           Device B   yes                   7
P16           Device B   yes                   9
P17           Device B   yes                   12
P18           Device B   no                    6
P19           Device B   no                    5
P20           Device B   no                    8

In addition to the different devices, you also recorded whether each participant had an unlimited data plan, which you expected was likely to affect the number of emails sent from the mobile devices. So, now your hypothesis is that the number of emails sent from mobile devices can be influenced by the difference between the devices and by whether the participants had an unlimited data plan. OK, it's time to do regression! First, prepare the data.

Device <- factor(c('A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B'))
DataPlan <- factor(c('Y','N','Y','Y','Y','Y','N','Y','N','Y','N','Y','N','Y','Y','Y','Y','N','N','N'))
Count <- c(10,5,9,4,6,14,2,3,8,10,11,15,4,5,7,9,12,6,5,8)
data <- data.frame(Device, DataPlan, Count)

Please note that Device and DataPlan are treated as categorical data here. If your independent variable is continuous, you don't have to use factor(), of course. We then run Poisson regression.

model <- glm(Count ~ Device + DataPlan, data=data, family=poisson)
summary(model)

Call:
glm(formula = Count ~ Device + DataPlan, family = poisson, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9870  -0.9788  -0.1896   0.7300   1.9684  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.6702     0.1812   9.219   <2e-16 ***
DeviceB       0.2187     0.1651   1.324   0.1855    
DataPlanY     0.3923     0.1765   2.222   0.0263 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 32.725  on 19  degrees of freedom
Residual deviance: 26.802  on 17  degrees of freedom
AIC: 108.43

Number of Fisher Scoring iterations: 4

So the deviance has been improved by about 6 points (from the null deviance of 32.7 to the residual deviance of 26.8), which is not too bad. And the model we have is as follows.

Count ~ Poisson(u), where log(u) = 1.6702 + 0.2187 * DeviceB + 0.3923 * DataPlanY,

where DeviceB = 1 if the participant used Device B, and 0 otherwise; and DataPlanY = 1 if the participant had an unlimited data plan, and 0 otherwise. Now we look at the exponential part, which is

u = exp(1.6702 + 0.2187 * DeviceB + 0.3923 * DataPlanY) = exp(1.6702) * exp(0.2187 * DeviceB) * exp(0.3923 * DataPlanY).

Let's take a deeper look at this model. First, we see the difference between Device A and Device B. If DeviceB = 1, u is multiplied by exp(0.2187) = 1.24. So the difference between the devices may have an effect on encouraging people to send more emails; however, DeviceB is not statistically significant (p = 0.19), so we cannot strongly argue that there is such an effect, and we probably need more data.

How about the effect of the data plan? If DataPlanY = 1, u is multiplied by exp(0.3923) = 1.48, which sounds like a substantial effect. Moreover, DataPlanY is statistically significant (p < 0.05). Thus, we can think that an unlimited data plan has a positive effect on the number of emails sent from the mobile devices.
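By the way, you don't have to exponentiate the coefficients by hand. A small sketch that extracts the multiplicative effects and their confidence intervals from the fitted model object above:

exp(coef(model))       # multiplicative effects on u (e.g., exp(0.3923) = 1.48 for DataPlanY)
exp(confint(model))    # 95% confidence intervals on the same multiplicative scale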



Offset

So, it seems that the data plan matters for the usage of emails on mobile devices, but there is a concern about this analysis, not in terms of the method itself, but in terms of the observation. It might simply be that users who had unlimited data plans happened to use email more often in general than users who didn't, regardless of their mobile devices. If this is the case, our analysis is not correct. If you noticed this before, you are a very sharp reader :). Yes, it is a concern, and we have a solution for it. Let's say you also measured how many emails each participant sent from all types of devices they used (e.g., desktop machines, laptops, internet tablets, etc.), which is shown as “Total email counts” in the table below.

Participant   Device     Unlimited Data Plan   Email counts   Total email counts
P1            Device A   yes                   10             12
P2            Device A   no                    5              15
P3            Device A   yes                   9              13
P4            Device A   yes                   4              5
P5            Device A   yes                   6              8
P6            Device A   yes                   14             17
P7            Device A   no                    2              10
P8            Device A   yes                   3              6
P9            Device A   no                    8              22
P10           Device A   yes                   10             14
P11           Device B   no                    11             20
P12           Device B   yes                   15             18
P13           Device B   no                    4              11
P14           Device B   yes                   5              11
P15           Device B   yes                   7              12
P16           Device B   yes                   9              13
P17           Device B   yes                   12             28
P18           Device B   no                    6              10
P19           Device B   no                    5              14
P20           Device B   no                    8              10

We use this “Total email counts” as a baseline for each participant. This baseline is called an offset in GLM. With the offset, the model becomes like this.

y_j ~ Poisson(u_j), where log(u_j) = log(o_j) + b0 + b1 * x_j1 + b2 * x_j2 + ... + bn * x_jn,

where j is the index of the data point (y_j means the dependent variable of the j-th data point). This model allows us to have an adjustment (offset) for each data point. More intuitively, we are modeling the rate of emails sent from the mobile device over the total number of sent emails: the rate is u_j / o_j, and o_j is the total number of sent emails. We can do this in the glm() function by using the offset option, but you have to pass it the logarithm of the offset, because the model works on the log scale.

TotalCount <- c(12,15,13,5,8,17,10,6,22,14,20,18,11,11,12,13,28,10,14,10)
data <- data.frame(Device, DataPlan, Count, TotalCount)
model <- glm(Count ~ Device + DataPlan, data=data, family=poisson, offset=log(TotalCount))
summary(model)

Call:
glm(formula = Count ~ Device + DataPlan, family = poisson, data = data, 
    offset = log(TotalCount))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5878  -0.5418   0.1302   0.6890   1.5710  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.81515    0.17070  -4.775 1.79e-06 ***
DeviceB     -0.01994    0.16235  -0.123   0.9022    
DataPlanY    0.41366    0.17353   2.384   0.0171 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18.135  on 19  degrees of freedom
Residual deviance: 12.132  on 17  degrees of freedom
AIC: 93.765

Number of Fisher Scoring iterations: 4

So, we still have a significant effect of the data plan.
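By the way, the same model can be specified by putting the offset directly into the formula with offset() instead of using the offset argument; this sketch (with a hypothetical name model2) should give an identical fit:

model2 <- glm(Count ~ Device + DataPlan + offset(log(TotalCount)), data=data, family=poisson)
summary(model2)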



Overdispersion

We are not completely done even after considering the offset. Another problem you may have is overdispersion. We have built a model with some (often very) strong assumptions: we only have two factors, and the distribution of the dependent variable follows the Poisson distribution (which forces the variance to equal the mean). Thus, the values predicted by our model might not be able to describe the variance we observe in the data. If this is the case, the model is not appropriate to use. This is what we call overdispersion, and it often happens in Poisson regression.

To see whether the model suffers from overdispersion, we can calculate the sum of squares of the standardized residuals (i.e., the differences between the observed and predicted values, scaled by the standard deviation expected under the model), and compare it to the chi-squared distribution.

Here is the procedure. You first need to calculate the predicted values.

predict_count <- predict(model, type="response")
z <- (Count - predict_count) / sqrt(predict_count)
z
         1          2          3          4          5          6          7 
 0.6944586 -0.6359551  0.1012997  0.3571653  0.2789212  0.7771620 -1.1530480 
         8          9         10         11         12         13         14 
-0.5069579 -0.5565284  0.2056396  0.7887425  0.9282716 -0.3534704 -0.8253118 
        15         16         17         18         19         20 
-0.3112302  0.1611358 -1.4864222  0.7977789 -0.4356561  1.7579941 

Then, calculate the sum of squares of z divided by the degrees of freedom, that is, the sample size minus the number of coefficients. In our case, these are 20 and 3, respectively. If this ratio is much larger than 1, it implies that you have overdispersion.

sum(z^2)/(20 - 3)
0.7166947

We can also do a test with the chi-squared distribution.

1 - pchisq(sum(z^2), 20 - 3)
0.7888812

If this value is below 0.05, the test indicates that you have overdispersion. In this example, we don't seem to have overdispersion, which is good news.

What if we do have overdispersion? We then need to adjust the standard error of each coefficient by multiplying it by the square root of the overdispersion ratio we calculated above. For example, suppose that the ratio had turned out to be 10 in our example. The coefficient for DataPlanY is 0.41366 with a standard error of 0.17353. This means that the coefficient likely lies in [0.24013, 0.58719]. With the adjustment for the overdispersion, the standard error becomes 0.17353 * sqrt(10) = 0.54875. Thus, the adjusted coefficient likely lies in [-0.13509, 0.96241], which means that it can be zero (i.e., the effect of DataPlanY can be zero). Fortunately, you don't have to do this manually, and you can re-fit the model by using quasipoisson as follows.

model <- glm(Count ~ Device + DataPlan, data=data, family=quasipoisson, offset=log(TotalCount))
summary(model)

It also calculates the p values based on the adjustment for the overdispersion.
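If you are curious about what this adjustment does under the hood, here is a rough manual sketch. It re-fits the plain Poisson model under a hypothetical name (poisson_model) and uses the z values computed above; the variable names are only for illustration.

poisson_model <- glm(Count ~ Device + DataPlan, data=data, family=poisson, offset=log(TotalCount))
ratio <- sum(z^2) / (20 - 3)                                   # estimated overdispersion ratio
est <- coef(summary(poisson_model))["DataPlanY", "Estimate"]   # coefficient for DataPlanY
se <- coef(summary(poisson_model))["DataPlanY", "Std. Error"]  # its unadjusted standard error
se_adj <- se * sqrt(ratio)                                     # adjusted standard error
c(est - se_adj, est + se_adj)                                  # rough range for the coefficient

You can also look at the dispersion parameter estimated by the quasipoisson fit with summary(model)$dispersion, which should be close to the ratio computed above.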



Logistic-binomial model

This is basically logistic regression, but we are going to use the binomial distribution for the dependent variable. In this way, we can model the number of trials of interest out of all the trials, like a success rate or an error rate.



R code example

Let's say you have a game application in which the participants can hit monsters by clicking on them with the mouse. In your experiment, you set different time limits (e.g., 10, 20, 30, 40, 50, and 60 sec), and measured the success rate (i.e., the number of monsters the participants could click during the trial out of the pre-defined number of monsters). Now, what you want to do is to create a model of the success rate with the game time limit. Let's say your data look like this.

Participant   Time     Clicked   Not-clicked   Success rate
P1            10 sec   10        90            0.10
P2            20 sec   23        77            0.23
P3            30 sec   40        60            0.40
P4            40 sec   70        30            0.70
P5            50 sec   82        18            0.82
P6            60 sec   96        4             0.96
P7            10 sec   12        88            0.12
P8            20 sec   20        80            0.20
P9            30 sec   35        65            0.35
P10           40 sec   58        42            0.58
P11           50 sec   76        24            0.76
P12           60 sec   90        10            0.90

(Well, you will have much more data in reality, but I put 12 samples here to make the example simple.) Let's try to do logistic regression for this data. First, we prepare the data in R.

Time <- c(10,20,30,40,50,60,10,20,30,40,50,60)
Success <- c(10,23,40,70,82,96,12,20,35,58,76,90)
Failure <- 100 - Success
GameResult <- cbind(Success, Failure)

Note that we prepare both the number of “Success” and the number of “Failure” trials, and combine them into one variable called “GameResult”. In this way, the logistic regression is going to use the binomial distribution. Now, we do the same thing with the glm() function.

model <- glm(GameResult ~ Time, family = binomial)
summary(model)

You get the result.

Call:
glm(formula = GameResult ~ Time, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.17893  -0.58216   0.02050   0.50723   1.84797  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.142800   0.191896  -16.38   <2e-16 ***
Time         0.091572   0.005147   17.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 491.313  on 11  degrees of freedom
Residual deviance:  10.378  on 10  degrees of freedom
AIC: 68.458

Number of Fisher Scoring iterations: 4

You can do a similar analysis to the one described in the previous section. For instance, the model greatly improves the deviance (from 491.3 to 10.4), which indicates that this model is very powerful. Time is statistically significant in predicting the success rate.
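If you want to see what success rates this model predicts at each time limit, a small sketch is as follows (new_times is just a hypothetical data frame used for the prediction):

new_times <- data.frame(Time = c(10, 20, 30, 40, 50, 60))
predict(model, newdata = new_times, type = "response")   # predicted success rate for each time limit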

This model also often suffers from overdispersion, so you have to be careful about it. You can find more details on how to calculate the overdispersion in the previous section. You can use quasibinomial to adjust the model for overdispersion.
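The re-fit is analogous to the quasipoisson case above; a minimal sketch is:

model <- glm(GameResult ~ Time, family = quasibinomial)
summary(model)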



Poisson model or logistic-binomial model

The Poisson model is similar to the logistic-binomial model, particularly in the sense that both treat count data, and you may wonder how we should choose which model to use. Here is the rule of thumb: if your count has a natural upper bound (i.e., you know both the successes and the failures, like the number of monsters clicked out of 100), use the logistic-binomial model; if there is no natural upper bound (like the number of emails sent), use the Poisson model, possibly with an offset to account for the exposure. A short sketch contrasting the two choices follows.
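To make the distinction concrete, here is an illustrative sketch with purely hypothetical variable names (successes, n, counts, exposure, x):

# Counts with a known upper bound (successes out of n trials): logistic-binomial
model_binom <- glm(cbind(successes, n - successes) ~ x, family = binomial)

# Counts with no natural upper bound (optionally with an exposure baseline as the offset): Poisson
model_pois <- glm(counts ~ x, family = poisson, offset = log(exposure))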



Multinomial logit model

(coming soon).



References

The content in this page is largely what I learned from the following book: Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. It is the best book I have read about regression analysis (well, I haven't read that many books, but I'm sure this is an excellent book).