Generalized Linear Model (GLM) is a more generic framework that supports statistical analysis with some sort of linear formulation of a model. You can think of GLM as a superclass of linear regression and logistic regression: these regressions are special cases of GLM. In GLM, your model looks like this:

$$f(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n,$$

where f is a function (called the link function). In other words,

$$y = f^{-1}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n),$$

where $f^{-1}$ is the inverse function of f. What this formula means is: we first do a linear calculation with the coefficients ($\beta$) and independent variables (x). Then, we apply a transformation to it by using the function $f^{-1}$, and the result is the predicted dependent variable. So, if $f^{-1}(x) = x$, it is linear regression. If $f^{-1}(x) = \exp(x) / (1 + \exp(x))$, it is logistic regression. By the way, $\exp(x) / (1 + \exp(x))$ is called the inverse-logit function, and the logit function is $\log(x / (1 - x))$. So far so good? :)
The nice thing about GLM is that we can handle data distributions other than the normal distribution by choosing an appropriate function for f, as we do in logistic regression. With logistic regression, the dependent variable follows the binomial distribution, and the model predicts the probability for binomial data. Some common pairs of data distributions and link functions used in GLM are: the normal distribution with the identity function (linear regression), the binomial distribution with the logit function (logistic regression), and the Poisson distribution with the log function (Poisson regression).
There are more models used in GLM, but the Poisson model and the logistic-binomial model are probably the most useful ones for analyses in the context of HCI research, so I am going to explain these two models in this page.
One thing you should note is that the methods explained in this page do not cover multilevel models (often referred to as mixed-effects models). To put it in terms more familiar to HCI folks, you need slightly different procedures to accommodate repeated-measures factors. This will be explained in a separate page, but understanding GLM is definitely useful for understanding multilevel models.
This model uses the Poisson distribution. It is well known that the Poisson distribution represents the probability of a given number of events occurring within a fixed period of time. For instance, the number of car accidents in a city may follow the Poisson distribution.
Let's take a look at the following hypothetical data. You have studied how many emails 20 smartphone users (10 using Device A and 10 using Device B) sent from their mobile devices during the period of the study. You have the average number of emails per day for each participant, and the data look like this.
Participant | Device | Unlimited Data Plan | Email counts |
---|---|---|---|
P1 | Device A | yes | 10 |
P2 | Device A | no | 5 |
P3 | Device A | yes | 9 |
P4 | Device A | yes | 4 |
P5 | Device A | yes | 6 |
P6 | Device A | yes | 14 |
P7 | Device A | no | 2 |
P8 | Device A | yes | 3 |
P9 | Device A | no | 8 |
P10 | Device A | yes | 10 |
P11 | Device B | no | 11 |
P12 | Device B | yes | 15 |
P13 | Device B | no | 4 |
P14 | Device B | yes | 5 |
P15 | Device B | yes | 7 |
P16 | Device B | yes | 9 |
P17 | Device B | yes | 12 |
P18 | Device B | no | 6 |
P19 | Device B | no | 5 |
P20 | Device B | no | 8 |
In addition to the different devices, you also recorded whether each participant had an unlimited data plan, which you expected to affect the number of emails sent from the mobile devices. So your hypothesis is that the number of emails sent from mobile devices is influenced by the device used and by whether the participant had an unlimited data plan. OK, it's time to do regression! First, prepare the data.
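Here is one way to enter the data from the table above in R (a sketch; the variable names and the coding of the levels are my choice, picked so that the coefficient names that appear later, DeviceB and DataPlanY, match):

```r
# Enter the data from the table above.
# Device and DataPlan are categorical, so we wrap them with factor().
Device <- factor(c(rep("A", 10), rep("B", 10)))
DataPlan <- factor(c("Y","N","Y","Y","Y","Y","N","Y","N","Y",
                     "N","Y","N","Y","Y","Y","Y","N","N","N"))
Email <- c(10,5,9,4,6,14,2,3,8,10,11,15,4,5,7,9,12,6,5,8)
```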
Please note that Device and DataPlan are treated as categorical data here. If your independent variable is continuous, you don't have to use factor(), of course. We then run Poisson regression.
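With the variables above, the call would look something like this (the model name is my choice):

```r
# Poisson regression; family=poisson uses the log link by default
model <- glm(Email ~ Device + DataPlan, family = poisson)
summary(model)
```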
So the deviance has been improved by about 6 points, which is not too bad. And the model we have is as follows (writing the fitted intercept as $\beta_0$):

$$u = \exp(\beta_0 + 0.2187 \, DeviceB + 0.3923 \, DataPlanY),$$

where u is the predicted number of emails, DeviceB = 1 if the participant used Device B and 0 otherwise, and DataPlanY = 1 if the participant had an unlimited data plan and 0 otherwise. Now we look at the exponential part, which is

$$\exp(0.2187 \, DeviceB) \times \exp(0.3923 \, DataPlanY).$$
Let's take a deeper look at this model. First, we look at the difference between Device A and Device B. If DeviceB = 1, u is multiplied by exp(0.2187) = 1.24. So the difference of the devices may have an effect on encouraging people to send more emails; however, DeviceB is not statistically significant (p = 0.19), so we cannot strongly argue that there is such an effect, and we probably need more data.
How about the effect of the data plan? If DataPlanY = 1, u is multiplied by exp(0.3923) = 1.48, which sounds like something. Moreover, DataPlanY is statistically significant (p < 0.05). Thus, we can think that an unlimited data plan has a positive effect on the number of emails sent from the mobile devices.
So, it seems that the data plan matters for email usage on mobile devices. But there is a concern about this analysis, not in terms of the method itself, but in terms of the observation: the users who had unlimited data plans may simply have been heavier email users in general than the users who did not, regardless of the device. If this is the case, our analysis is not correct. If you noticed this before reading it here, you are a very sharp reader :). Yes, it is a concern, and we have a solution for it. Let's say you also measured how many emails each participant sent from all types of devices they used (e.g., desktop machines, laptops, internet tablets, etc.), which are shown as “Total email counts” in the table below.
Participant | Device | Unlimited Data Plan | Email counts | Total email counts |
---|---|---|---|---|
P1 | Device A | yes | 10 | 12 |
P2 | Device A | no | 5 | 15 |
P3 | Device A | yes | 9 | 13 |
P4 | Device A | yes | 4 | 5 |
P5 | Device A | yes | 6 | 8 |
P6 | Device A | yes | 14 | 17 |
P7 | Device A | no | 2 | 10 |
P8 | Device A | yes | 3 | 6 |
P9 | Device A | no | 8 | 22 |
P10 | Device A | yes | 10 | 14 |
P11 | Device B | no | 11 | 20 |
P12 | Device B | yes | 15 | 18 |
P13 | Device B | no | 4 | 11 |
P14 | Device B | yes | 5 | 11 |
P15 | Device B | yes | 7 | 12 |
P16 | Device B | yes | 9 | 13 |
P17 | Device B | yes | 12 | 28 |
P18 | Device B | no | 6 | 10 |
P19 | Device B | no | 5 | 14 |
P20 | Device B | no | 8 | 10 |
We use this “Total email count” as a baseline for each participant. Such a baseline is called an offset in GLM. With the offset, the model becomes:

$$\log(u_j) = \log(o_j) + \beta_0 + \beta_1 x_{j1} + \dots + \beta_n x_{jn},$$

or equivalently

$$\frac{u_j}{o_j} = \exp(\beta_0 + \beta_1 x_{j1} + \dots + \beta_n x_{jn}),$$

where j is the index of the data point (u_j is the predicted value of the dependent variable y_j for the j-th data point) and o_j is the offset for that data point. This model allows us to have an adjustment (offset) for each data point. More intuitively, we are modeling the rate of emails sent from the mobile device over the total number of sent emails: u_j / o_j is that rate, and o_j is the total number of sent emails. We can do this in the glm() function by using the offset option, but you have to take its logarithm.
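Continuing the sketch above, and assuming the total counts are stored in a variable named TotalEmail, the call might look like this:

```r
# Total email counts from all devices (the baseline for each participant)
TotalEmail <- c(12,15,13,5,8,17,10,6,22,14,20,18,11,11,12,13,28,10,14,10)

# Pass the log of the baseline through the offset option
model.offset <- glm(Email ~ Device + DataPlan, offset = log(TotalEmail), family = poisson)
summary(model.offset)
```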
So, we still have a significant effect of the data plan.
We are not completely done even after considering the offset. Another problem you may have is overdispersion. We have built a model with some (often very) strong assumptions: we only have two factors, and the distribution of the dependent variable follows the Poisson distribution. Thus, the variance predicted by our model might not be large enough to describe the variance we actually observe in the data. If this is the case, the model is not appropriate to use. This is what we call overdispersion, and it often happens in Poisson regression.
To see whether the model suffers from overdispersion, we can calculate the sum of squares of the standardized residuals (i.e., the differences between the observed and predicted values, divided by the standard deviation of the predicted values), and compare it to the chi-squared distribution.
Here is the procedure. You first need to calculate the predicted value.
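A sketch of this step, assuming we continue from the offset model fitted above:

```r
# Predicted (fitted) values of the model
predicted <- predict(model.offset, type = "response")

# Standardized residuals: for a Poisson model the variance equals the mean,
# so the standard deviation of each prediction is sqrt(predicted)
z <- (Email - predicted) / sqrt(predicted)
```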
Then, calculate the sum of squares of z, and divide it by the degrees of freedom to get the overdispersion ratio. If this ratio is much larger than 1, it implies that you have overdispersion. For the degrees of freedom, we need two values: the sample size and the number of coefficients. In our case, they are 20 and 3 (the intercept, DeviceB, and DataPlanY), respectively.
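For example:

```r
n <- 20  # sample size
k <- 3   # number of coefficients (intercept, DeviceB, DataPlanY)
sum(z^2) / (n - k)  # a ratio much larger than 1 suggests overdispersion
```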
We also can do a test with the chi-squared distribution.
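One way to do this test in R is to look at the upper-tail probability of the sum of squares of z under the chi-squared distribution with n - k degrees of freedom:

```r
# p value of the overdispersion test
pchisq(sum(z^2), df = n - k, lower.tail = FALSE)
```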
If this value is below 0.05, the test indicates that you have overdispersion. In this example, we don't seem to have overdispersion, which is good news.
What if we do have overdispersion? We then need to adjust the standard error of each coefficient by multiplying it by the square root of the overdispersion ratio calculated above. For example, suppose that the ratio turned out to be 10 in our example. The coefficient for DataPlanY is 0.41366 with the standard error 0.17353, which means the coefficient is likely within [0.24013, 0.58719]. With the adjustment for the overdispersion, the standard error becomes 0.17353 * sqrt(10) = 0.54875. Thus, the adjusted coefficient is likely within [-0.13509, 0.96241], which means it could be zero (i.e., the effect of DataPlanY could be zero). Fortunately, you don't have to do this adjustment manually; you can re-fit the model by using quasipoisson as follows.
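A sketch of the re-fitted model (keeping the same offset as above):

```r
# Quasi-Poisson model: same coefficients, but standard errors adjusted for overdispersion
model.quasi <- glm(Email ~ Device + DataPlan, offset = log(TotalEmail), family = quasipoisson)
summary(model.quasi)
```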
It also calculates the p value based on the adjustment for the overdispersion.
This is basically logistic regression, but we are going to use the binomial distribution for the dependent variable. In this way, we can model the number of trials of interest out of all the trials, like a success rate or an error rate.
Let's say you have a game application in which the participants can hit monsters by clicking on them. In your experiment, you set different time limits (10, 20, 30, 40, 50, and 60 sec), and measured the success rate (i.e., the number of monsters the participants could click during the trial out of the pre-defined number of monsters). Now, what you want to do is to create a model of the success rate with the game time limit. Let's say your data look like this.
Participant | Time | Clicked | Not-clicked | Success rate |
---|---|---|---|---|
P1 | 10 sec | 10 | 90 | 0.10 |
P2 | 20 sec | 23 | 77 | 0.23 |
P3 | 30 sec | 40 | 60 | 0.40 |
P4 | 40 sec | 70 | 30 | 0.70 |
P5 | 50 sec | 82 | 18 | 0.82 |
P6 | 60 sec | 96 | 4 | 0.96 |
P7 | 10 sec | 12 | 88 | 0.12 |
P8 | 20 sec | 20 | 80 | 0.20 |
P9 | 30 sec | 35 | 65 | 0.35 |
P10 | 40 sec | 58 | 42 | 0.58 |
P11 | 50 sec | 76 | 24 | 0.76 |
P12 | 60 sec | 90 | 10 | 0.90 |
(Well, you would have much more data in reality, but I put 12 samples here to keep the example simple.) Let's try logistic regression on these data. First, we prepare the data in R.
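Here is one way to do it (a sketch; the variable names are my choice, matching the description below):

```r
# Data from the table above
Time <- c(10,20,30,40,50,60,10,20,30,40,50,60)
Success <- c(10,23,40,70,82,96,12,20,35,58,76,90)   # monsters clicked
Failure <- c(90,77,60,30,18,4,88,80,65,42,24,10)    # monsters not clicked

# Combine the successes and failures into a two-column matrix
GameResult <- cbind(Success, Failure)
```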
Note that we prepare both the numbers of “Success” and “Failure”, and make it into one variable, called “GameResult”. In this way, the logistic regression is going to use the binomial distribution. Now, we do the same thing with the glm() function.
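For example:

```r
# Logistic-binomial regression: with a two-column (success, failure) response,
# glm() with family=binomial models the success rate
model.game <- glm(GameResult ~ Time, family = binomial)
summary(model.game)
```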
You get the result.
You can do a similar analysis to the one described in the previous section. For instance, the model greatly improves the deviance, which indicates that this model is very powerful. Time is a statistically significant predictor of the success rate.
This model also often suffers from overdispersion, so you have to be careful about it. You can find more details on how to check for overdispersion in the previous section. You can use quasibinomial to re-fit the model with an adjustment for overdispersion, as follows.
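A sketch of the re-fitted model:

```r
# Quasi-binomial model: standard errors adjusted for overdispersion
model.game.quasi <- glm(GameResult ~ Time, family = quasibinomial)
summary(model.game.quasi)
```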
The Poisson model is similar to the logistic-binomial model, particularly in the sense that both treat count data, and you may wonder how to choose which model to use. Here is the rule of thumb.
(coming soon).
The content in this page is largely what I learned from the following book. Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. It is the best book I have read about regression analysis (well, I haven't read that many books, but I'm sure this is an excellent book).