Outlier Detection and Removal


Introduction

When you collect samples, your data are likely to contain some outliers. Outliers are datapoints that look very irregular compared to the other datapoints. For example, if you have a series of datapoints like [1, 2, 1, 3, 2, 100], 100 is likely to be an outlier. Such outliers can happen for many different reasons: the system glitched during that particular trial, a participant wasn't concentrating on the task, or she just did something random. We want to remove such outliers because the statistics we usually use can be greatly affected by them. For example, the mean of the datapoints above with the outlier is 18.2, whereas the mean without the outlier is 1.8 (a tenfold difference!).

However, you have to justify your definition of outliers in a convincing manner. Of course, you cannot just regard all datapoints you don't like as outliers. On this page, I explain a couple of standard methods for detecting outliers.
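The arithmetic in this example can be checked with a couple of lines of R. This is only a small sketch using the toy series from the paragraph above; the variable name x is arbitrary.

```r
# Toy series from the text; 100 is the suspicious value.
x <- c(1, 2, 1, 3, 2, 100)

mean_with    <- mean(x)            # mean including the outlier (about 18.2)
mean_without <- mean(x[x != 100])  # mean after dropping the outlier (1.8)

mean_with
mean_without
```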



Method using the standard deviation

The most common method for outlier removal I have seen in HCI papers is to use standard deviations. This is quite easy to do, so I would recommend trying it first and then checking whether you need additional measures.

First, we prepare a hypothetical dataset. This should be replaced with your actual data.

    set.seed(123)
    x <- c(rnorm(18, 0, 1), 5, 10)
    data <- data.frame(x)
    data

                 x
    1  -0.56047565
    2  -0.23017749
    3   1.55870831
    4   0.07050839
    5   0.12928774
    6   1.71506499
    7   0.46091621
    8  -1.26506123
    9  -0.68685285
    10 -0.44566197
    11  1.22408180
    12  0.35981383
    13  0.40077145
    14  0.11068272
    15 -0.55584113
    16  1.78691314
    17  0.49785048
    18 -1.96661716
    19  5.00000000
    20 10.00000000

We have two possible outliers at the end (5 and 10). Now, let's calculate tau for each value: the absolute difference between the value and the mean, divided by the standard deviation.

    data <- data.frame(data, abs((data[,1] - mean(data[,1])))/sd(data[,1]))
    colnames(data) <- c("x", "tau")
    data

                 x       tau
    1  -0.56047565 0.5568723
    2  -0.23017749 0.4292000
    3   1.55870831 0.2622701
    4   0.07050839 0.3129738
    5   0.12928774 0.2902535
    6   1.71506499 0.3227077
    7   0.46091621 0.1620669
    8  -1.26506123 0.8292206
    9  -0.68685285 0.6057218
    10 -0.44566197 0.5124926
    11  1.22408180 0.1329247
    12  0.35981383 0.2011467
    13  0.40077145 0.1853150
    14  0.11068272 0.2974450
    15 -0.55584113 0.5550809
    16  1.78691314 0.3504796
    17  0.49785048 0.1477904
    18 -1.96661716 1.1003977
    19  5.00000000 1.5924557
    20 10.00000000 3.5251394

Now, let's set the threshold for outliers. Usually, the threshold for tau is either 2 or 3. This means that if a datapoint is more than 2 or 3 standard deviations away from the mean, we regard it as an outlier. Here, we set the threshold to 3 SD.

    data <- data.frame(data, abs((data[,1] - mean(data[,1])))/sd(data[,1]) > 3.0)
    colnames(data) <- c("x", "tau", "outlier")
    data

                 x       tau outlier
    1  -0.56047565 0.5568723   FALSE
    2  -0.23017749 0.4292000   FALSE
    3   1.55870831 0.2622701   FALSE
    4   0.07050839 0.3129738   FALSE
    5   0.12928774 0.2902535   FALSE
    6   1.71506499 0.3227077   FALSE
    7   0.46091621 0.1620669   FALSE
    8  -1.26506123 0.8292206   FALSE
    9  -0.68685285 0.6057218   FALSE
    10 -0.44566197 0.5124926   FALSE
    11  1.22408180 0.1329247   FALSE
    12  0.35981383 0.2011467   FALSE
    13  0.40077145 0.1853150   FALSE
    14  0.11068272 0.2974450   FALSE
    15 -0.55584113 0.5550809   FALSE
    16  1.78691314 0.3504796   FALSE
    17  0.49785048 0.1477904   FALSE
    18 -1.96661716 1.1003977   FALSE
    19  5.00000000 1.5924557   FALSE
    20 10.00000000 3.5251394    TRUE

Now, we see that only the last datapoint (10) is regarded as an outlier; 5 is not flagged because its tau is about 1.59. In practice, you don't need to inspect the intermediate states; I show them here only for the sake of learning. You can now remove all outliers as follows and move on to the appropriate analysis.

    data[data$outlier==FALSE,]

                 x       tau outlier
    1  -0.56047565 0.5568723   FALSE
    2  -0.23017749 0.4292000   FALSE
    3   1.55870831 0.2622701   FALSE
    4   0.07050839 0.3129738   FALSE
    5   0.12928774 0.2902535   FALSE
    6   1.71506499 0.3227077   FALSE
    7   0.46091621 0.1620669   FALSE
    8  -1.26506123 0.8292206   FALSE
    9  -0.68685285 0.6057218   FALSE
    10 -0.44566197 0.5124926   FALSE
    11  1.22408180 0.1329247   FALSE
    12  0.35981383 0.2011467   FALSE
    13  0.40077145 0.1853150   FALSE
    14  0.11068272 0.2974450   FALSE
    15 -0.55584113 0.5550809   FALSE
    16  1.78691314 0.3504796   FALSE
    17  0.49785048 0.1477904   FALSE
    18 -1.96661716 1.1003977   FALSE
    19  5.00000000 1.5924557   FALSE
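If you do not need the intermediate columns, the same steps can be folded into one small helper. The following is only a sketch: the function name remove_sd_outliers and its argument k (the threshold in SD units) are hypothetical names chosen for illustration, and it assumes a numeric vector without missing values.

```r
# Hypothetical helper: keep only values whose tau (absolute distance from
# the mean in SD units) does not exceed the threshold k.
remove_sd_outliers <- function(x, k = 3) {
  tau <- abs(x - mean(x)) / sd(x)
  x[tau <= k]
}

# Same hypothetical dataset as above.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

cleaned <- remove_sd_outliers(x, k = 3)
length(cleaned)  # 19 values remain; only 10 is dropped
```

Like the step-by-step version, this computes the mean and SD once from the full dataset, so 5 stays in the cleaned vector.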



Smirnov‐Grubbs Test

One might want to remove outliers more thoroughly. The example above runs only once, and the standard deviation is calculated from the dataset including the outliers. This means that the dataset without the outliers has a different standard deviation, and thus we might find new outliers with this new SD. In the example above, we might also want to remove the datapoint of 5, as it also looks quite dissimilar to the other values.

One common way to do more “recursive” outlier removal is the Smirnov-Grubbs test (or just Grubbs' test). This is a null hypothesis significance test with the null hypothesis that there is no outlier. The alternative hypothesis is that there is at least one outlier in the dataset. This test does not tell us how many outliers exist, but it can tell us whether the most distant datapoint is an outlier. Let's see how the test works. We first need to install the outliers package (for example, with install.packages("outliers")).
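To see this effect with the hypothetical dataset used above, we can recompute the standard deviation after dropping 10 and check where 5 ends up. This is just an illustrative sketch; the variable names are arbitrary.

```r
# Same hypothetical dataset as in the previous section.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

# Drop the value that the 3SD rule flagged in the first pass.
x_without_10 <- x[x != 10]

sd_before <- sd(x)             # SD with both suspicious values included
sd_after  <- sd(x_without_10)  # SD after removing 10 (noticeably smaller)

# tau for the value 5, recomputed against the reduced dataset
tau_5 <- abs(5 - mean(x_without_10)) / sd(x_without_10)
tau_5 > 3  # TRUE: 5 now falls beyond the 3SD threshold
```

With this dataset, the recomputed tau for 5 comes out a little above 3, although it was only about 1.59 in the first pass.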

    library(outliers)
    set.seed(123)
    x <- c(rnorm(18, 0, 1), 5, 10)
    grubbs.test(x)

            Grubbs test for one outlier

    data:  x
    G = 3.5251, U = 0.3115, p-value = 6.049e-05
    alternative hypothesis: highest value 10 is an outlier

As the result shows, 10 is considered an outlier.

We then iterate this test, removing one outlier from the dataset each time, and stop when the test no longer shows a significant result. To do this, we prepare a small function. The following function, created by Sam Dickson, is based on the discussion thread on the following webpage: http://stackoverflow.com/questions/22837099/how-to-repeat-the-grubbs-test-and-flag-the-outliers
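The G statistic reported by the test is the largest absolute deviation from the mean in SD units, which is the same quantity as the largest tau in the previous section. A short base-R check, offered as a sketch (it reproduces only G, not the U value or the p-value):

```r
# Same hypothetical dataset as above.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

# Largest absolute deviation from the mean, in units of the sample SD.
deviation <- abs(x - mean(x))
G <- max(deviation) / sd(x)
most_distant <- x[which.max(deviation)]

round(G, 4)    # 3.5251, matching the G in the test output
most_distant   # 10
```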

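    grubbs.flag <- function(x) {
      outliers <- NULL
      test <- x
      grubbs.result <- grubbs.test(test)
      pv <- grubbs.result$p.value
      # Repeat the test until it is no longer significant.
      while(pv < 0.05) {
        # The outlier value is the third word of the alternative hypothesis
        # text, e.g. "highest value 10 is an outlier". This relies on the
        # printed value matching the data exactly.
        outliers <- c(outliers,as.numeric(strsplit(grubbs.result$alternative," ")[[1]][3]))
        # Drop every value flagged so far and test the remaining data.
        test <- x[!x %in% outliers]
        grubbs.result <- grubbs.test(test)
        pv <- grubbs.result$p.value
      }
      return(data.frame(X=x,Outlier=(x %in% outliers)))
    }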

Now, let's run this function.

    grubbs.flag(x)

                 X Outlier
    1  -0.56047565   FALSE
    2  -0.23017749   FALSE
    3   1.55870831   FALSE
    4   0.07050839   FALSE
    5   0.12928774   FALSE
    6   1.71506499   FALSE
    7   0.46091621   FALSE
    8  -1.26506123   FALSE
    9  -0.68685285   FALSE
    10 -0.44566197   FALSE
    11  1.22408180   FALSE
    12  0.35981383   FALSE
    13  0.40077145   FALSE
    14  0.11068272   FALSE
    15 -0.55584113   FALSE
    16  1.78691314   FALSE
    17  0.49785048   FALSE
    18 -1.96661716   FALSE
    19  5.00000000    TRUE
    20 10.00000000    TRUE

So, we are going to remove the last two values.
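For readers who want to see the iteration without relying on the printed test output, here is a rough base-R sketch of the same idea that tracks positions rather than values and then removes the flagged points. It is not the function above and does not call grubbs.test; instead it assumes the commonly cited critical-value form of the one-sided Grubbs test (based on the t distribution), so borderline cases may not agree exactly with the p-value from the outliers package. The names grubbs_crit and grubbs_flag_index are hypothetical.

```r
# Critical value for a one-sided Grubbs test with n observations
# (assumption: the usual t-distribution-based formula).
grubbs_crit <- function(n, alpha = 0.05) {
  t2 <- qt(1 - alpha / n, df = n - 2)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}

# Flag the most distant point repeatedly until it no longer exceeds
# the critical value. Returns a logical vector (TRUE = flagged).
grubbs_flag_index <- function(x, alpha = 0.05) {
  keep <- rep(TRUE, length(x))
  while (sum(keep) > 2) {
    cur <- x[keep]
    s <- sd(cur)
    if (s == 0) break                      # all remaining values identical
    dev <- abs(cur - mean(cur)) / s
    if (max(dev) <= grubbs_crit(length(cur), alpha)) break
    keep[which(keep)[which.max(dev)]] <- FALSE  # map back to original index
  }
  !keep
}

# Same hypothetical dataset as above.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

flagged <- grubbs_flag_index(x)
which(flagged)            # positions 19 and 20 (the values 5 and 10)
cleaned <- x[!flagged]    # data with the flagged values removed
```

Working with positions avoids comparing numbers that were parsed from text, at the cost of reimplementing the test decision by hand.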