# Outlier Detection

When you collect samples, your data are likely to contain some outliers. An outlier is a datapoint that looks very irregular compared to the other datapoints. For example, if you have a series of datapoints like [1, 2, 1, 3, 2, 100], 100 is likely to be an outlier. Outliers can happen for many different reasons: the system glitched for that particular trial, a participant wasn't concentrating well on the task, or she just did something random. We want to remove such outliers because the statistics we usually use can be greatly affected by them. For example, the mean of the datapoints above is 18.2 with the outlier but 1.8 without it (a tenfold difference!).

However, **you have to justify your definition of outliers in a convincing manner.** You cannot just regard every datapoint you don't like as an outlier, of course. On this page, I explain a couple of standard methods for detecting outliers.

The most common method for outlier removal I have seen in HCI papers is to use standard deviations. It is quite easy to do, so I recommend trying it first and then seeing whether you need additional measures.

First, we prepare a hypothetical dataset. This should be replaced with your actual data.
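A minimal sketch of such a dataset in Python (the specific values here are an assumption for illustration, chosen so that the last two values look suspicious):

```python
# Hypothetical measurements from one condition of an experiment.
# Most values cluster around 1-2, but the last two (5 and 10) stand out.
data = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
        1, 2, 2, 2, 1, 2, 2, 2, 5, 10]
```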

We have two possible outliers at the end (5 and 10). Now, let's calculate the tau for each value: its absolute distance from the mean, measured in standard deviations.
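A Python sketch of this computation, using a small hypothetical dataset (an assumed example whose two possible outliers, 5 and 10, sit at the end):

```python
import statistics

data = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
        1, 2, 2, 2, 1, 2, 2, 2, 5, 10]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# tau: how many standard deviations each value lies from the mean
tau = [abs(x - mean) / sd for x in data]
```

With this dataset, the tau of the last value (10) comes out to about 3.86, while the tau of 5 is only about 1.34.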

Now, let's set the threshold for the outliers. The usual threshold for the tau is either 2 or 3: if a datapoint lies more than 2 or 3 standard deviations away from the mean, we regard it as an outlier. Here, we set the threshold to 3 SD.
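In Python, flagging datapoints against that threshold might look like this (same hypothetical dataset as an assumed example):

```python
import statistics

data = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
        1, 2, 2, 2, 1, 2, 2, 2, 5, 10]
mean = statistics.mean(data)
sd = statistics.stdev(data)
tau = [abs(x - mean) / sd for x in data]

threshold = 3  # regard anything beyond +/- 3 SD from the mean as an outlier
is_outlier = [t > threshold for t in tau]
```

With a 3 SD threshold, only the last value (10) gets flagged; 5 survives.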

Now we see that the last datapoint is regarded as an outlier. In practice, you don't need to look at this intermediate state; I show it here for the sake of learning. You can now remove all the outliers as follows and move on to the appropriate analysis.
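A sketch of the removal step in Python (the dataset is the same assumed example):

```python
import statistics

data = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
        1, 2, 2, 2, 1, 2, 2, 2, 5, 10]
mean = statistics.mean(data)
sd = statistics.stdev(data)
threshold = 3

# Keep only the datapoints within +/- 3 SD of the mean.
cleaned = [x for x in data if abs(x - mean) / sd <= threshold]
```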

One might think of doing outlier removal more thoroughly. The example above runs only once, and the standard deviation is calculated on the dataset including the outliers. The dataset without those outliers has a different (smaller) standard deviation, so we might find new outliers against this new SD. In the example above, we might also want to remove the datapoint of 5, as it also looks quite dissimilar to the other values.

One common way to do such “recursive” outlier removal is the Smirnov-Grubbs test (or just the Grubbs test). This is a null hypothesis significance test whose null hypothesis is that there is no outlier; the alternative hypothesis is that there is **at least one** outlier in the dataset. The test does not tell us how many outliers exist, but it does identify the most distant datapoint. Let's see how the test works. We need to install the outliers R package.
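In R, grubbs.test() from the outliers package runs this test directly. To illustrate what it computes, here is a Python sketch of the Grubbs statistic and its two-sided critical value (using scipy for the t quantile; the dataset and function names are my own assumptions for illustration):

```python
from math import sqrt
from scipy import stats

def grubbs_statistic(values):
    """Return (G, index) for the value farthest from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    idx = max(range(n), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / sd, idx

def grubbs_critical(n, alpha=0.05):
    """Two-sided critical value of the Grubbs test."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / sqrt(n) * sqrt(t ** 2 / (n - 2 + t ** 2))

data = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
        1, 2, 2, 2, 1, 2, 2, 2, 5, 10]

g, idx = grubbs_statistic(data)
if g > grubbs_critical(len(data)):
    print(f"{data[idx]} looks like an outlier (G = {g:.2f})")
```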

As the result shows, 10 is considered an outlier.

We then iterate this test, removing the detected outlier each time, and stop when the test no longer shows a significant result. To do this, we prepare a small function. The following function, created by Sam Dickson, is based on the discussion in this thread: http://stackoverflow.com/questions/22837099/how-to-repeat-the-grubbs-test-and-flag-the-outliers
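The original function is written in R; the following Python sketch implements the same iterate-until-nonsignificant idea (the function names are my own, not Sam Dickson's code, and scipy supplies the t quantile):

```python
from math import sqrt
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    """Two-sided critical value of the Grubbs test."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / sqrt(n) * sqrt(t ** 2 / (n - 2 + t ** 2))

def iterative_grubbs(values, alpha=0.05):
    """Repeatedly run the Grubbs test, removing the most distant
    datapoint each time, until the result is no longer significant.
    Returns (cleaned_values, outliers_in_removal_order)."""
    values = list(values)
    outliers = []
    while len(values) > 2:
        n = len(values)
        mean = sum(values) / n
        sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
        idx = max(range(n), key=lambda i: abs(values[i] - mean))
        g = abs(values[idx] - mean) / sd
        if g <= grubbs_critical(n, alpha):
            break  # no significant outlier remains
        outliers.append(values.pop(idx))
    return values, outliers
```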

Now, let's run this function.
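Running an iterative Grubbs procedure of this kind on the hypothetical dataset might look like this in Python (a self-contained sketch, not the original R code):

```python
from math import sqrt
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    """Two-sided critical value of the Grubbs test."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / sqrt(n) * sqrt(t ** 2 / (n - 2 + t ** 2))

def iterative_grubbs(values, alpha=0.05):
    """Remove the most distant datapoint while the Grubbs test is significant."""
    values, outliers = list(values), []
    while len(values) > 2:
        n = len(values)
        mean = sum(values) / n
        sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
        idx = max(range(n), key=lambda i: abs(values[i] - mean))
        if abs(values[idx] - mean) / sd <= grubbs_critical(n, alpha):
            break
        outliers.append(values.pop(idx))
    return values, outliers

data = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
        1, 2, 2, 2, 1, 2, 2, 2, 5, 10]
cleaned, outliers = iterative_grubbs(data)
print(outliers)  # -> [10, 5]: first 10 is removed, then 5
```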

So, we are going to remove the last two values (5 and 10).

hcistats/outlierdetection.txt · Last modified: 2014/08/14 05:21 by Koji Yatani
