Outlier Detection and Removal
Introduction
When you collect samples, your data will likely contain some outliers. Outliers are datapoints that look very irregular compared to the other datapoints. For example, in a series of datapoints like [1, 2, 1, 3, 2, 100], 100 is likely to be an outlier. Such outliers can occur for many different reasons: the system glitched during that particular trial, a participant wasn't concentrating on the task, or she just did something random. We want to remove such outliers because the statistics we usually use can be greatly affected by them. For example, the mean of the datapoints above is 18.2 with the outlier but 1.8 without it (a tenfold difference!).
However, you have to justify your definition of outliers in a convincing manner. Of course, you cannot just treat every datapoint you don't like as an outlier. On this page, I explain a couple of standard methods for detecting outliers.
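To see this effect for yourself, the following short R snippet compares the mean with and without the suspected outlier. It is only a minimal sketch using the example series above; the variable names are made up for illustration.

```r
# Example series from the text; 100 is the suspected outlier.
scores <- c(1, 2, 1, 3, 2, 100)

mean_with <- mean(scores)                    # mean including the outlier
mean_without <- mean(scores[scores != 100])  # mean after dropping it

print(round(mean_with, 1))     # 18.2
print(round(mean_without, 1))  # 1.8
```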
Method using the standard deviation
The most common method for outlier removal I have seen in HCI papers uses standard deviations. It is quite easy to do, so I recommend trying it first and then checking whether you need additional measures.
First, we prepare a hypothetical dataset. This should be replaced with your actual data.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)
data <- data.frame(x)
data
x
1 -0.56047565
2 -0.23017749
3 1.55870831
4 0.07050839
5 0.12928774
6 1.71506499
7 0.46091621
8 -1.26506123
9 -0.68685285
10 -0.44566197
11 1.22408180
12 0.35981383
13 0.40077145
14 0.11068272
15 -0.55584113
16 1.78691314
17 0.49785048
18 -1.96661716
19 5.00000000
20 10.00000000
We have two possible outliers at the end (5 and 10). Now, let's calculate tau for each value: the absolute distance of the value from the mean, divided by the standard deviation.
data <- data.frame(data, abs((data[,1] - mean(data[,1])))/sd(data[,1]))
colnames(data) <- c("x", "tau")
data
x tau
1 -0.56047565 0.5568723
2 -0.23017749 0.4292000
3 1.55870831 0.2622701
4 0.07050839 0.3129738
5 0.12928774 0.2902535
6 1.71506499 0.3227077
7 0.46091621 0.1620669
8 -1.26506123 0.8292206
9 -0.68685285 0.6057218
10 -0.44566197 0.5124926
11 1.22408180 0.1329247
12 0.35981383 0.2011467
13 0.40077145 0.1853150
14 0.11068272 0.2974450
15 -0.55584113 0.5550809
16 1.78691314 0.3504796
17 0.49785048 0.1477904
18 -1.96661716 1.1003977
19 5.00000000 1.5924557
20 10.00000000 3.5251394
Now, let's set the threshold for the outliers. Usually, the threshold for tau is set to either 2 or 3. This means that if a datapoint lies more than 2 or 3 standard deviations from the mean, we regard it as an outlier. Here, we set the threshold to 3 SD.
data <- data.frame(data, data$tau > 3.0)
colnames(data) <- c("x", "tau", "outlier")
data
x tau outlier
1 -0.56047565 0.5568723 FALSE
2 -0.23017749 0.4292000 FALSE
3 1.55870831 0.2622701 FALSE
4 0.07050839 0.3129738 FALSE
5 0.12928774 0.2902535 FALSE
6 1.71506499 0.3227077 FALSE
7 0.46091621 0.1620669 FALSE
8 -1.26506123 0.8292206 FALSE
9 -0.68685285 0.6057218 FALSE
10 -0.44566197 0.5124926 FALSE
11 1.22408180 0.1329247 FALSE
12 0.35981383 0.2011467 FALSE
13 0.40077145 0.1853150 FALSE
14 0.11068272 0.2974450 FALSE
15 -0.55584113 0.5550809 FALSE
16 1.78691314 0.3504796 FALSE
17 0.49785048 0.1477904 FALSE
18 -1.96661716 1.1003977 FALSE
19 5.00000000 1.5924557 FALSE
20 10.00000000 3.5251394 TRUE
Now we see that the last datapoint is regarded as an outlier. In practice, you don't need to inspect the intermediate states; I showed them here only for the sake of learning. You can now remove all outliers as follows and move on to the appropriate analysis.
data[!data$outlier, ]
x tau outlier
1 -0.56047565 0.5568723 FALSE
2 -0.23017749 0.4292000 FALSE
3 1.55870831 0.2622701 FALSE
4 0.07050839 0.3129738 FALSE
5 0.12928774 0.2902535 FALSE
6 1.71506499 0.3227077 FALSE
7 0.46091621 0.1620669 FALSE
8 -1.26506123 0.8292206 FALSE
9 -0.68685285 0.6057218 FALSE
10 -0.44566197 0.5124926 FALSE
11 1.22408180 0.1329247 FALSE
12 0.35981383 0.2011467 FALSE
13 0.40077145 0.1853150 FALSE
14 0.11068272 0.2974450 FALSE
15 -0.55584113 0.5550809 FALSE
16 1.78691314 0.3504796 FALSE
17 0.49785048 0.1477904 FALSE
18 -1.96661716 1.1003977 FALSE
19 5.00000000 1.5924557 FALSE
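As noted above, the intermediate columns are not needed in practice. A minimal sketch of the same 3 SD rule in one step might look like the following. The helper name flag_sd_outliers and its threshold argument are hypothetical names introduced here, not part of any package.

```r
# Hypothetical helper: TRUE for values more than `threshold` SDs from the mean.
flag_sd_outliers <- function(x, threshold = 3) {
  tau <- abs(x - mean(x)) / sd(x)
  tau > threshold
}

# Same hypothetical dataset as above.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

is_outlier <- flag_sd_outliers(x)
x_clean <- x[!is_outlier]   # keep only the values that were not flagged

which(is_outlier)  # index of the flagged value(s)
length(x_clean)    # number of remaining datapoints
```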
Smirnov‐Grubbs Test
One might want to remove outliers more thoroughly. The example above runs only once, and the standard deviation is calculated from the dataset including the outliers. This means that the dataset without the outliers has a different standard deviation, so new outliers may appear under this new SD. In the example above, we might also want to remove the datapoint 5, as it also looks quite dissimilar to the other values.
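The following sketch illustrates this point with the same hypothetical dataset: it recomputes tau for the datapoint 5 after 10 has been removed. The variable names are only for illustration.

```r
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

# tau of the value 5 while 10 is still in the dataset
tau_before <- abs(5 - mean(x)) / sd(x)

# tau of the value 5 once 10 has been removed
x_without_10 <- x[x != 10]
tau_after <- abs(5 - mean(x_without_10)) / sd(x_without_10)

tau_before  # about 1.59, as in the tau table above
tau_after   # larger: without 10 the mean moves away from 5 and the SD shrinks
```

With this seed, the recomputed tau should come out slightly above 3, so 5 would be flagged on a second pass of the 3 SD rule even though it was not flagged on the first.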
One common way to do more “recursive” outlier removal is the Smirnov-Grubbs test (or just Grubbs' test). This is a null hypothesis significance test whose null hypothesis is that there is no outlier; the alternative hypothesis is that there is at least one outlier in the dataset. The test does not tell us how many outliers exist, but it does tell us which datapoint is the most distant. Let's see how the test works. We need the outliers package, which you have to install first if you don't have it yet.
# install.packages("outliers")  # run once if the package is not installed
library(outliers)
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)
grubbs.test(x)
Grubbs test for one outlier
data: x
G = 3.5251, U = 0.3115, p-value = 6.049e-05
alternative hypothesis: highest value 10 is an outlier
As the result shows, 10 is considered an outlier.
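Note that the G statistic reported here is the same quantity as the largest tau in the previous section: the largest absolute deviation from the mean, divided by the standard deviation. A quick check in base R (just a sketch; G is a local variable name chosen here):

```r
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

# Largest absolute deviation from the mean, in units of SD
G <- max(abs(x - mean(x))) / sd(x)
round(G, 4)  # 3.5251, matching G in the test output

# The datapoint that produces this maximum
x[which.max(abs(x - mean(x)))]  # 10
```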
We then repeat this test, removing one outlier from the dataset each time, and stop when the test no longer shows a significant result. To do this, we prepare a small function. The following function, created by Sam Dickson, is based on the discussion thread on the following webpage.
http://stackoverflow.com/questions/22837099/how-to-repeat-the-grubbs-test-and-flag-the-outliers
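grubbs.flag <- function(x) {
  outliers <- NULL
  test <- x
  grubbs.result <- grubbs.test(test)
  pv <- grubbs.result$p.value
  # Keep testing until the most extreme remaining value is no longer significant.
  while(pv < 0.05) {
    # The flagged value is the third word of the alternative hypothesis,
    # e.g. "highest value 10 is an outlier".
    outliers <- c(outliers,as.numeric(strsplit(grubbs.result$alternative," ")[[1]][3]))
    # Drop every flagged value and test the remaining data again.
    test <- x[!x %in% outliers]
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
  }
  # Return the original data with a logical flag for each value.
  return(data.frame(X=x,Outlier=(x %in% outliers)))
}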
Now, let's run this function.
grubbs.flag(x)
X Outlier
1 -0.56047565 FALSE
2 -0.23017749 FALSE
3 1.55870831 FALSE
4 0.07050839 FALSE
5 0.12928774 FALSE
6 1.71506499 FALSE
7 0.46091621 FALSE
8 -1.26506123 FALSE
9 -0.68685285 FALSE
10 -0.44566197 FALSE
11 1.22408180 FALSE
12 0.35981383 FALSE
13 0.40077145 FALSE
14 0.11068272 FALSE
15 -0.55584113 FALSE
16 1.78691314 FALSE
17 0.49785048 FALSE
18 -1.96661716 FALSE
19 5.00000000 TRUE
20 10.00000000 TRUE
So, we are going to remove the last two values.
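The function above recovers each flagged value by parsing the printed alternative hypothesis, which may be fragile if the value is rounded in that string. As an alternative, here is a minimal base-R sketch that tracks outliers by index instead and then removes them. It assumes the usual one-sided p-value for Grubbs' test computed from the t distribution, which I believe corresponds to what grubbs.test reports by default; the function names grubbs_p and grubbs_flag_by_index are hypothetical.

```r
# One-sided Grubbs p-value for the most extreme value in x (assumed formula).
grubbs_p <- function(x) {
  n <- length(x)
  g <- max(abs(x - mean(x))) / sd(x)
  if (!is.finite(g)) return(1)           # e.g. all values identical
  denom <- (n - 1)^2 - g^2 * n
  if (denom <= 0) return(0)              # g is at its theoretical maximum
  t_stat <- sqrt(g^2 * n * (n - 2) / denom)
  min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))
}

# Repeatedly flag the most extreme remaining value while the test is significant.
grubbs_flag_by_index <- function(x, alpha = 0.05) {
  flagged <- rep(FALSE, length(x))
  while (sum(!flagged) >= 3) {           # the test needs at least 3 datapoints
    kept <- which(!flagged)
    if (grubbs_p(x[kept]) >= alpha) break
    worst <- kept[which.max(abs(x[kept] - mean(x[kept])))]
    flagged[worst] <- TRUE
  }
  data.frame(X = x, Outlier = flagged)
}

set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)
result <- grubbs_flag_by_index(x)
x_clean <- result$X[!result$Outlier]     # data with the flagged values removed
which(result$Outlier)                    # indices of the flagged datapoints
```

For the example data, this should flag the same two datapoints (5 and 10) as grubbs.flag, leaving 18 values in x_clean.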