
Null Hypothesis Significance Testing (NHST) is a statistical method for testing whether the factor we are interested in has an effect on our observations. For example, a t test or an ANOVA comparing means is an NHST. It is probably the most common kind of statistical testing used in HCI.

The general procedure of NHST is as follows:

- Based on the research question, develop the first statistical hypothesis, called the null hypothesis.
- Develop another hypothesis, called the alternative hypothesis.
- Decide the level of statistical significance (usually 0.05), which is also the probability of a Type I error (rejecting a null hypothesis that is actually true).
- Run NHST and determine the p value under the null hypothesis. Reject the null hypothesis if the p value is smaller than the significance level you chose.

The null hypothesis is usually something like “the difference between the group means is equal to 0” or “the means of the two groups are the same”. The alternative hypothesis is the counterpart of the null hypothesis. So, the procedure is fairly easy to follow, but there are several things you need to be careful about in NHST.
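The procedure above can be sketched in code. This is a minimal Python sketch under stated assumptions, not a recipe taken from the references: it uses an exact permutation test on the difference of means because that needs only the standard library, and the task times, the variable names, and the helper `permutation_p_value` are all hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of means.

    Null hypothesis: the group labels do not matter, so every way of
    splitting the pooled data into two groups is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        first = [pooled[i] for i in idx]
        second = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count splits at least as extreme as the observed one
        # (small tolerance guards against floating-point ties).
        if abs(mean(first) - mean(second)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

# Steps 1 and 2: H0 "the mean task times are the same", H1 "they differ".
# Step 3: choose the significance level before looking at the data.
ALPHA = 0.05

# Hypothetical task completion times in seconds (made up for illustration).
new_technique = [10.2, 10.9, 9.8, 11.0, 10.5]
baseline = [12.1, 11.4, 13.0, 12.7, 11.9]

# Step 4: compute the p value under H0 and compare it with ALPHA.
p_value = permutation_p_value(new_technique, baseline)
reject_null = p_value < ALPHA
print(f"p = {p_value:.4f}, reject H0: {reject_null}")
```

With groups this small the test can enumerate every split, so the p value is exact rather than approximated; a t test would follow the same four steps with a different test statistic.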

There are several reasons why NHST has recently drawn criticism from researchers in other fields. The main criticism is that NHST is overrated. There are also some “myths” around NHST, and people often fail to understand what the results of NHST mean.

First, I explain these myths, and then explain what we can do instead of or in addition to NHST, particularly reporting the effect size. The following explanations are largely based on the references I read. I didn't copy and paste, but I didn't change a lot either. I also picked the points that are probably most closely related to HCI research. I think they will help you understand the problems of NHST, but I encourage you to read the books and papers in the references section.

This book explains the problems of NHST well and presents alternative statistical methods we can try. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research by Rex B. Kline. (Chapter 1 is available as a PDF.)

There is a great paper talking about some dangerous aspects of NHST. The Insignificance of Statistical Significance Testing by Johnson, Douglas H.

This is another paper talking about some myths around NHST. The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant by Gelman, A., and Stern, H.

Let's say you have done some kind of NHST, like a t test or an ANOVA, and the results show you a p value. But what does that p value mean? You may think that p is the probability that the null hypothesis holds given your data. This sounds reasonable, and you may think that is why you reject the null hypothesis.

The truth is that this is not correct. Don't get upset; most people actually think it is correct. What the p value means is **if we assume that the null hypothesis holds, we have a chance of p that the outcome can be as extreme as or even more extreme than we observed**. Let's say your p value is 0.01. This means that, if the null hypothesis holds, there is only a 1% chance that your experiment would produce results like yours or an even clearer difference. Such data would be surprising under the null hypothesis, so we reject the null hypothesis and say that we have a difference.
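To make the definition concrete, here is a small Python sketch that computes a p value directly from it. The scenario is hypothetical (20 participants each state which of two techniques they prefer, and 16 choose technique A), and `sign_test_p_value` is an illustrative helper name. The null hypothesis is that each participant is equally likely to prefer either technique.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """Exact two-sided p value under H0: P(success) = 0.5.

    Adds up the probability of every outcome that is at least as far
    from the expected count (trials / 2) as the observed outcome.
    """
    observed_distance = abs(successes - trials / 2)
    extreme_outcomes = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= observed_distance
    )
    # Under H0 every sequence of preferences has probability 1 / 2**trials.
    return extreme_outcomes / 2 ** trials

# Hypothetical data: 16 of 20 participants prefer technique A.
p_value = sign_test_p_value(16, 20)
print(f"p = {p_value:.4f}")
```

The result (roughly 0.012) is the probability of seeing a 16-to-4 split or a more lopsided one in either direction when the null hypothesis holds. It is not the probability that the null hypothesis is true.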

The point is that **the p value does not directly mean how likely what the null hypothesis describes happens in your experiment**. It tells us how unlikely your observations would be if you assume that the null hypothesis holds. So, how did we decide “how unlikely” is significant? This is the second myth of NHST.

You probably already know that if p < 0.05, you say that there is a significant effect. Some people also say there is a marginally significant difference if p < 0.1. But where did these thresholds (0.05 and 0.1) come from? These thresholds are basically what Fisher proposed, and we really haven't changed them. They are probably a reasonable choice, but there is no theoretical reasoning behind them. For example, why is p = 0.04 significant while p = 0.06 is not? Why is p = 0.0499 significant while p = 0.0501 is not? If the sample sizes are equal, the cases of p = 0.0499 and p = 0.0501 have almost the same effects. Unfortunately, 0.05 or 0.1 is a more or less arbitrary threshold, and just saying significant or not significant may not give us the whole picture of your results.
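A quick way to see how close p = 0.0499 and p = 0.0501 really are is to convert both back into test statistics. The Python sketch below assumes a two-sided z test (a large-sample approximation, chosen only because the standard library provides the normal distribution); `z_for_two_sided_p` is an illustrative helper name.

```python
from statistics import NormalDist

def z_for_two_sided_p(p):
    """Absolute z statistic that yields the given two-sided p value."""
    return NormalDist().inv_cdf(1 - p / 2)

z_significant = z_for_two_sided_p(0.0499)      # just below the threshold
z_not_significant = z_for_two_sided_p(0.0501)  # just above the threshold

print(f"z for p = 0.0499: {z_significant:.4f}")
print(f"z for p = 0.0501: {z_not_significant:.4f}")
print(f"difference:       {z_significant - z_not_significant:.4f}")
```

The two statistics differ only in the third decimal place, yet one result would be reported as significant and the other would not.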

Another criticism of NHST is that the test depends heavily on the sample size. We can quickly test this in R.

Here, I draw 10 samples from each of two normal distributions, one with mean=0 and SD=2 and one with mean=1 and SD=2, and run a t test on them.

With 10 samples per group, the result is not significant. But what if I have 100 samples per group? Let's try.

Now, p < 0.05, and we have a significant difference. As you can see here, if you increase the sample size, you can find a significant difference even when the underlying difference is small. Of course, increasing the sample size is not always easy in HCI research (*e.g.*, recruiting more participants). But this means you could continue your experiment until you find a significant difference without bending any rule of NHST. This is related to the next myth.
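The same comparison can be sketched without random draws. The Python code below is an illustration written for this page, not the original R session: it assumes that each sample happens to reproduce the population values exactly (mean difference 1, SD 2 in both groups) and asks what an equal-variance two-sample t test would report for 10 and for 100 samples per group. Under that assumption the statistic is t = d * sqrt(n / 2) with d = 1 / 2, so it grows with n even though the difference itself never changes. The helper names are hypothetical, and the t distribution is integrated numerically because the standard library has no t CDF.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p value, integrating the density with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # probability that |T| <= |t|
    return max(0.0, 1.0 - central_area)

def t_test_from_summary(mean_diff, sd, n_per_group):
    """Equal-variance two-sample t test computed from summary statistics."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    t = mean_diff / standard_error
    df = 2 * n_per_group - 2
    return t, df, two_sided_p(t, df)

# Same mean difference and SD, two different sample sizes.
results = {n: t_test_from_summary(1.0, 2.0, n) for n in (10, 100)}
for n, (t, df, p) in results.items():
    print(f"n = {n:3d} per group: t({df}) = {t:.3f}, p = {p:.4f}")
```

Real random samples will scatter around these values, so an individual run with 10 samples per group can occasionally come out significant and a run with 100 can occasionally miss.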

Another common misunderstanding is that the p value indicates the magnitude of an effect. For example, someone might say that an effect with p = 0.001 is stronger than an effect with p = 0.01. This is not true. The p value by itself does not tell you the magnitude of an effect, because it also depends on the sample size. The p value is the conditional probability of observing data at least as extreme as yours given the null hypothesis. And with NHST, we can only ask a dichotomous question, such as whether one interaction technique is faster than the other. Thus, the answer you will get through NHST is yes or no, and you don't learn anything about “how much (the effect is)”.

Even if you have a significant result, it is “significant” only in the statistical sense, and that does not necessarily mean it is meaningful or important in practice.

Let's say you are reviewing a paper that compares two techniques: one that the authors developed, and a conventional one. Let's say the performance time of some tasks improved by 1% (SD=0.1%) with the new interaction technique. In this case, the results would show a significant difference, but a 1% improvement in performance time may not be that important in practical settings. In another case, let's say the performance time improved by 15% (SD=15%) with the new technique. In this case, the results might not show a significant difference (because the standard deviation is so large). However, a 15% improvement on average may be something. And, based on this work, other researchers might come up with a better technique in which the standard deviation of the performance time is smaller. Thus, the developed technique may be important enough to be shared even though the authors couldn't find a significant difference. In such cases, I think that the paper should be accepted (although the authors may have to explain why the standard deviation was so large or how their technique could be improved).

Thus, we should not overrate statistically-significant results and underrate non-statistically-significant results, and should carefully interpret the results.

Psychology researchers seem to have noticed the problems of NHST, and the field now requires papers to report the effect size and confidence intervals along with the results of NHST. This trend seems to be shared with biology and medical research. You can see editorial policies about statistical tests in various fields here.

Another recommendation is to include all the information other researchers need to double-check the results. Confidence intervals as well as descriptive statistics (*e.g.*, means and standard deviations) are considered important to report.
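As a rough sketch of what such a report could be built from, the Python code below computes the sample size, mean, standard deviation, and a 95% confidence interval for the mean. The data and the helper name `summarize` are hypothetical. The interval uses the normal critical value (about 1.96), which is a large-sample approximation; with samples as small as this example, a t critical value gives a wider and more appropriate interval.

```python
import math
from statistics import NormalDist, mean, stdev

def summarize(sample, confidence=0.95):
    """Descriptive statistics plus an approximate CI for the mean."""
    n = len(sample)
    sample_mean = mean(sample)
    sample_sd = stdev(sample)  # sample SD (n - 1 in the denominator)
    # Normal approximation; swap in a t critical value for small samples.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sample_sd / math.sqrt(n)
    return {
        "n": n,
        "mean": sample_mean,
        "sd": sample_sd,
        "ci": (sample_mean - half_width, sample_mean + half_width),
    }

# Hypothetical task completion times in seconds.
summary = summarize([2.0, 4.0, 6.0, 8.0, 10.0])
low, high = summary["ci"]
print(f"n = {summary['n']}, M = {summary['mean']:.2f}, "
      f"SD = {summary['sd']:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```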

Recently, some fields have also started to require authors to report the confidence intervals of the effect size in addition to the values above. You should look into what you are expected to report in your field before writing a paper.

So, you may want to ask “is NHST no good, and shouldn't we even bother doing it?” or “should we stop believing the results of NHST?” My answer is that we just need to be a little more careful in using it. Used properly, NHST is a great tool for analyzing your data if:

- **you understand the meaning of the p value correctly** (Myth 1, Myth 4),
- **you use the term “significant” properly** (Myth 5),
- **your null hypothesis is appropriate**, and
- **your research question can be answered by “yes” or “no”**.

Although the problems of NHST have been discussed fairly actively in other fields, I haven't really seen a big discussion about how we should use NHST or any kind of statistical methods in HCI. There are probably many different reasons. First, we assume that the authors did their analysis with the best knowledge they had at that moment, so we can still trust the results even if the procedure of the test was not perfect or some pieces of information are missing in a paper. Another point is that the evaluation of user interfaces cannot be performed by statistical tests alone. For example, we often have qualitative data from interviews, surveys, or videos taken during the experiment. The design of a new interaction technique or a new interactive system itself may be worth sharing with other researchers and developers. Thus, the results gained from statistical tests themselves may not be as important as in other fields.

However, I also feel that **the current HCI papers (and maybe reviewers, who are also us) overrate NHST a little too much**. I have heard of and seen papers that got rejected simply because they couldn't find significant differences although they showed some interesting things. Again, **being statistically significant is not necessarily being important in a practical setting** (see Myth 5 for more details).

Thus, we need to be a little more careful when we are interpreting the results of NHST. Even if the authors cannot find statistical differences, there may be some other things that could benefit or inspire other researchers in that paper.

As I discussed above, one drawback of the p value is that it does not give us any information about how large the effect is. For example, with a t test, we can know whether one technique is faster than the other, but we cannot tell how much the technique contributes to the improvement of performance time (*i.e.*, the size of the effect). And the p value depends on the sample size. So, if you are just looking for a significant difference, you can simply run your study until you find it (although it might require hundreds or thousands of participants). Thus, we want to have some metrics which do not depend on the sample size and which show the size of the effect.
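One caveat about running a study until a significant difference appears: checking the p value after every added participant and stopping at the first p < 0.05 raises the chance of a false positive above 5%, even when there is no effect at all. The Python simulation below is a sketch of that under simplifying assumptions: the null hypothesis is true, each participant contributes one difference score drawn from a normal distribution with mean 0 and a known SD of 1 (so a z test applies), and the analysis peeks after every participant from the 10th to the 100th. All names, the seed, and the parameter values are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

def false_positive_rates(n_sims=1000, n_start=10, n_max=100,
                         alpha=0.05, seed=1):
    """Compare peeking after every participant with one fixed-size test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        z = 0.0
        crossed_at_any_look = False
        for n in range(1, n_max + 1):
            running_sum += rng.gauss(0.0, 1.0)  # H0 is true: mean is 0
            if n >= n_start:
                z = running_sum / math.sqrt(n)  # z test with known SD = 1
                if abs(z) > z_crit:
                    crossed_at_any_look = True
        if crossed_at_any_look:
            peeking_hits += 1      # would have stopped and claimed an effect
        if abs(z) > z_crit:
            fixed_hits += 1        # single test at the planned n_max
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking:     {peeking_rate:.3f}")
print(f"false positives with fixed n=100: {fixed_rate:.3f}")
```

The fixed-size test stays near the nominal 5%, while the peeking strategy rejects the true null hypothesis considerably more often. This is one more reason to report a measure that does not depend on how long the study ran.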

**Effect size** is the metric that indicates the magnitude of the effect caused by a factor. Unlike the p value, the effect size does not depend on the sample size. Thus, the effect size can complement the information that the p value shows. Adding the effect size does not take up much space in your paper, and the effect size for most tests is easy to calculate. So, reporting the effect size is a good practice to start with. In this wiki, I include how to calculate the effect size for each statistical test for this purpose. A detailed discussion is available at the effect size page.
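As one example of how little work this takes, the Python sketch below computes Cohen's d, a common effect size for comparing two means: the difference of the means divided by the pooled standard deviation. The data and the function name `cohens_d` are hypothetical, and other tests use other effect size measures.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference in units of the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)

# Hypothetical scores for two conditions.
condition_a = [2.0, 4.0, 6.0, 8.0, 10.0]
condition_b = [1.0, 3.0, 5.0, 7.0, 9.0]

d = cohens_d(condition_a, condition_b)
print(f"Cohen's d = {d:.3f}")
```

By Cohen's widely used rule of thumb, d around 0.2 is considered small, 0.5 medium, and 0.8 large, although what counts as a meaningful effect depends on the task and the field.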

Another good practice is to include a graph or table. A graph is a wonderful visualization of your data, and helps the reader understand what your data look like. In most cases, bar graphs with error bars (either standard deviation or 95% confidence interval) or line plots work well. Although you can produce different graphs in R (see more details in this page), I prefer using Excel for producing graphs because it works better with Word. Another practice of mine is to show 95% confidence intervals in graphs instead of standard deviations. If you find it hard to make a graph, try to make a table. A table is often a very efficient way to show the values or statistics you have. Try to find your own ways to show your data and test results effectively.

hcistats/nhst.txt · Last modified: 2014/03/29 00:20 by Koji Yatani

## Comments

Generally, NHST formulas can be thought of as one simple formula (i.e., systematic variation divided by unsystematic variation). What you end up with is a ratio of good change (experimental effects) divided by bad change (noise), and so the higher the ratio (or test statistic), the more likely our results are going to be significant.

However, in myth five you claim that a difference in the denominator (a change in random chance / noise) can lead us to finding differences in terms of significant results. The example does not work well because, aside from appropriately executing experimental methods, there is not much one can (ethically) do to alter how one's sample data disperse. It would have made more sense to discuss how the systematic variance can change by selecting good measures, or by using instruments which afford more powerful findings.

I think your overall point was that we just need to be careful about over-relying on statistical significance. Perhaps you could discuss how statistical significance is really just one part of navigating the quantitative world. We use it to answer a very specific question, namely, "does a statistical difference exist?". Although we tend to focus on this question, as you have suggested, the questions answered by other types of tests are just as important to take into consideration (i.e., effect sizes answer the question, "does this difference matter?", which seems important to consider when seeing whether a statistical difference exists). In other words, the question itself has merit and is not really going anywhere. We just need to be careful with when we ask it and how we find and use the answer in conjunction with other statistical tests.

If you would rather not frame the discussion around how each test provides you with important statistical information, I would suggest re-framing this myth to reflect how NHST has been abused (i.e., researchers can over-collect data, or introduce ways to restrict their unsystematic variation by seeking out a more homogeneous sample).