Outlier Detection and Removal


Introduction

When you collect samples, your data are likely to contain some outliers. Outliers are datapoints that look very irregular compared to the other datapoints. For example, if you have a series of datapoints like [1, 2, 1, 3, 2, 100], 100 is likely to be an outlier. Such outliers can happen for many different reasons: the system glitched during that particular trial, a participant was not concentrating on the task, or the participant just did something random. We want to remove such outliers because the statistics we usually use can be greatly affected by them. For example, the mean of the datapoints above is about 18.2 with the outlier but 1.8 without it (a tenfold difference!).

However, you have to justify your definition of outliers in a convincing manner. You cannot just treat every datapoint you don't like as an outlier, of course. On this page, I explain a couple of standard methods for detecting outliers.
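The effect on the mean can be checked directly in R. This is only a minimal sketch using the example datapoints above; the variable names are chosen here for illustration.

```r
# Example datapoints from the text; 100 is the suspected outlier.
samples <- c(1, 2, 1, 3, 2, 100)

mean_with    <- mean(samples)                  # mean including the outlier
mean_without <- mean(samples[samples != 100])  # mean after dropping it

print(c(with = mean_with, without = mean_without))
```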



Method using the standard deviation

The most common method for outlier removal I have seen in HCI papers uses standard deviations. It is quite easy to do, so I recommend trying it first and then checking whether you need additional measures.

First, we prepare a hypothetical dataset. This should be replaced with your actual data.

    set.seed(123)
    x <- c(rnorm(18, 0, 1), 5, 10)
    data <- data.frame(x)
    data

                 x
    1  -0.56047565
    2  -0.23017749
    3   1.55870831
    4   0.07050839
    5   0.12928774
    6   1.71506499
    7   0.46091621
    8  -1.26506123
    9  -0.68685285
    10 -0.44566197
    11  1.22408180
    12  0.35981383
    13  0.40077145
    14  0.11068272
    15 -0.55584113
    16  1.78691314
    17  0.49785048
    18 -1.96661716
    19  5.00000000
    20 10.00000000

We have two possible outliers at the end (5 and 10). Now, let's calculate tau for each value, that is, the absolute distance of the value from the mean divided by the standard deviation.

    # tau: absolute distance from the mean, in units of the standard deviation
    data$tau <- abs(data$x - mean(data$x)) / sd(data$x)
    data

                 x       tau
    1  -0.56047565 0.5568723
    2  -0.23017749 0.4292000
    3   1.55870831 0.2622701
    4   0.07050839 0.3129738
    5   0.12928774 0.2902535
    6   1.71506499 0.3227077
    7   0.46091621 0.1620669
    8  -1.26506123 0.8292206
    9  -0.68685285 0.6057218
    10 -0.44566197 0.5124926
    11  1.22408180 0.1329247
    12  0.35981383 0.2011467
    13  0.40077145 0.1853150
    14  0.11068272 0.2974450
    15 -0.55584113 0.5550809
    16  1.78691314 0.3504796
    17  0.49785048 0.1477904
    18 -1.96661716 1.1003977
    19  5.00000000 1.5924557
    20 10.00000000 3.5251394

Now, let's set the threshold for the outliers. The threshold for tau is usually either 2 or 3. This means that if a datapoint lies more than 2 or 3 standard deviations away from the mean (in either direction), we regard it as an outlier. Here, we set the threshold to 3 SD.

    # Flag every datapoint whose tau exceeds the 3 SD threshold
    data$outlier <- data$tau > 3.0
    data

                 x       tau outlier
    1  -0.56047565 0.5568723   FALSE
    2  -0.23017749 0.4292000   FALSE
    3   1.55870831 0.2622701   FALSE
    4   0.07050839 0.3129738   FALSE
    5   0.12928774 0.2902535   FALSE
    6   1.71506499 0.3227077   FALSE
    7   0.46091621 0.1620669   FALSE
    8  -1.26506123 0.8292206   FALSE
    9  -0.68685285 0.6057218   FALSE
    10 -0.44566197 0.5124926   FALSE
    11  1.22408180 0.1329247   FALSE
    12  0.35981383 0.2011467   FALSE
    13  0.40077145 0.1853150   FALSE
    14  0.11068272 0.2974450   FALSE
    15 -0.55584113 0.5550809   FALSE
    16  1.78691314 0.3504796   FALSE
    17  0.49785048 0.1477904   FALSE
    18 -1.96661716 1.1003977   FALSE
    19  5.00000000 1.5924557   FALSE
    20 10.00000000 3.5251394    TRUE

Now, we see that the last datapoint is regarded as an outlier. In practice, you don't need to inspect the intermediate states; I show them here only for the sake of learning. You can now remove all outliers as follows and move on to the appropriate analysis.

    data[!data$outlier, ]

                 x       tau outlier
    1  -0.56047565 0.5568723   FALSE
    2  -0.23017749 0.4292000   FALSE
    3   1.55870831 0.2622701   FALSE
    4   0.07050839 0.3129738   FALSE
    5   0.12928774 0.2902535   FALSE
    6   1.71506499 0.3227077   FALSE
    7   0.46091621 0.1620669   FALSE
    8  -1.26506123 0.8292206   FALSE
    9  -0.68685285 0.6057218   FALSE
    10 -0.44566197 0.5124926   FALSE
    11  1.22408180 0.1329247   FALSE
    12  0.35981383 0.2011467   FALSE
    13  0.40077145 0.1853150   FALSE
    14  0.11068272 0.2974450   FALSE
    15 -0.55584113 0.5550809   FALSE
    16  1.78691314 0.3504796   FALSE
    17  0.49785048 0.1477904   FALSE
    18 -1.96661716 1.1003977   FALSE
    19  5.00000000 1.5924557   FALSE
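If you do not need the intermediate columns, the whole procedure can be wrapped in a small helper. The following is only a sketch: the function name flag.sd.outliers and its argument k (the threshold in SD units) are names chosen here for illustration and do not come from any package.

```r
# Flag values lying more than k standard deviations from the mean.
# flag.sd.outliers and k are hypothetical names used for this sketch.
flag.sd.outliers <- function(x, k = 3) {
  tau <- abs(x - mean(x)) / sd(x)
  data.frame(x = x, tau = tau, outlier = tau > k)
}

# Same hypothetical dataset as above.
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)

flagged <- flag.sd.outliers(x, k = 3)
cleaned <- flagged$x[!flagged$outlier]   # values kept for the analysis
length(cleaned)
```

With the example data this keeps 19 of the 20 values, as in the output above. Keeping k as an argument makes it easy to report which threshold (2 SD or 3 SD) you used.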



Smirnov‐Grubbs Test

One might want to do outlier removal more thoroughly. The example above runs only once, and the standard deviation is calculated from the dataset including the outliers. This means that the dataset without the outliers has a different standard deviation, and thus we might find new outliers with this new SD. In the example above, we might also want to remove the datapoint of 5 as it also looks quite dissimilar to the other values.

One common way to do more “recursive” outlier removal is the Smirnov-Grubbs test (or just Grubbs' test). This is a null hypothesis significance test with the null hypothesis that there is no outlier. The alternative hypothesis is that there is at least one outlier in the dataset. This test does not tell us how many outliers exist, but it can tell us which datapoint is the most distant. Let's see how the test works. We need to install the outliers package.
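To see the effect described above, one can simply repeat the 3 SD rule until no new datapoint is flagged. This is only a rough sketch of that idea; the function name iterate.sd.rule is hypothetical, and the loop is a naive repetition of the rule from the previous section, not a statistical test.

```r
# Repeatedly apply the k-SD rule, recomputing mean and SD after each pass.
# iterate.sd.rule is a hypothetical name used for this sketch.
iterate.sd.rule <- function(x, k = 3) {
  keep <- rep(TRUE, length(x))
  repeat {
    kept <- x[keep]
    tau <- abs(x - mean(kept)) / sd(kept)   # tau relative to the kept values
    new.flags <- keep & !is.na(tau) & (tau > k)
    if (!any(new.flags)) break              # stop when nothing new is flagged
    keep[new.flags] <- FALSE
  }
  data.frame(x = x, outlier = !keep)
}

set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)
result <- iterate.sd.rule(x)
result$x[result$outlier]
```

With the example data this flags both 10 and 5: once 10 is removed, the SD shrinks enough that 5 lies more than 3 SD from the new mean.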

    # install.packages("outliers")   # run once if the package is not installed
    library(outliers)
    set.seed(123)
    x <- c(rnorm(18, 0, 1), 5, 10)
    grubbs.test(x)

            Grubbs test for one outlier

    data:  x
    G = 3.5251, U = 0.3115, p-value = 6.049e-05
    alternative hypothesis: highest value 10 is an outlier

As the result shows, 10 is considered an outlier.

We then basically iterate this test, removing one outlier from the dataset each time, and stop when the test no longer shows a significant result. To do this, we prepare a small function. The following function, created by Sam Dickson, is based on the discussion thread on the following webpage. http://stackoverflow.com/questions/22837099/how-to-repeat-the-grubbs-test-and-flag-the-outliers
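The statistic G reported above is the largest tau from the previous section: the distance of the most extreme datapoint from the mean, in SD units. The following sketch recomputes it in base R, together with a p-value from the textbook t-distribution formula for the one-sided Grubbs test. I assume this formula is close to what the package does, but I have not checked the package source, so treat the p-value as an approximation.

```r
set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)
n <- length(x)

# G is the largest absolute deviation from the mean, in SD units.
G <- max(abs(x - mean(x))) / sd(x)

# Bonferroni-style one-sided p-value via the t distribution with n - 2 df.
# (Assumed textbook formula; not taken from the outliers package.)
t.stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
p.approx <- min(1, n * pt(t.stat, df = n - 2, lower.tail = FALSE))

print(c(G = G, p = p.approx))
```

G should match the 3.5251 printed by grubbs.test, which is also the tau of the datapoint 10 in the earlier table.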

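    grubbs.flag <- function(x) {
      outliers <- NULL
      test <- x
      grubbs.result <- grubbs.test(test)
      pv <- grubbs.result$p.value
      # Keep testing as long as the most extreme remaining value is significant
      while (pv < 0.05) {
        # The outlier's value is the third word of the message,
        # e.g. "highest value 10 is an outlier"
        outliers <- c(outliers, as.numeric(strsplit(grubbs.result$alternative, " ")[[1]][3]))
        # Drop all outliers found so far and test again
        test <- x[!x %in% outliers]
        grubbs.result <- grubbs.test(test)
        pv <- grubbs.result$p.value
      }
      # One row per original datapoint, with a flag for the detected outliers
      return(data.frame(X = x, Outlier = (x %in% outliers)))
    }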

Now, let's run this function.

    grubbs.flag(x)

                 X Outlier
    1  -0.56047565   FALSE
    2  -0.23017749   FALSE
    3   1.55870831   FALSE
    4   0.07050839   FALSE
    5   0.12928774   FALSE
    6   1.71506499   FALSE
    7   0.46091621   FALSE
    8  -1.26506123   FALSE
    9  -0.68685285   FALSE
    10 -0.44566197   FALSE
    11  1.22408180   FALSE
    12  0.35981383   FALSE
    13  0.40077145   FALSE
    14  0.11068272   FALSE
    15 -0.55584113   FALSE
    16  1.78691314   FALSE
    17  0.49785048   FALSE
    18 -1.96661716   FALSE
    19  5.00000000    TRUE
    20 10.00000000    TRUE
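Note that the function above recovers each outlier by reading its value out of the test's message and matching it against x with %in%. That works for this example, but it could in principle miss a value whose printed form is rounded, and it flags every datapoint that shares an outlier's value. A possible variant, sketched below, tracks positions instead. It assumes that grubbs.test with its default settings tests the datapoint farthest from the mean, which is consistent with the output above; the names grubbs.flag.index, alpha and min.n are hypothetical and chosen here for illustration.

```r
library(outliers)

# Iterated Grubbs test that tracks outlier positions rather than values.
# grubbs.flag.index, alpha and min.n are hypothetical names for this sketch.
grubbs.flag.index <- function(x, alpha = 0.05, min.n = 3) {
  keep <- rep(TRUE, length(x))
  while (sum(keep) >= min.n) {
    kept <- x[keep]
    p <- grubbs.test(kept)$p.value
    if (is.na(p) || p >= alpha) break   # stop at the first non-significant test
    # Position (within x) of the kept value farthest from the kept mean.
    worst <- which(keep)[which.max(abs(kept - mean(kept)))]
    keep[worst] <- FALSE
  }
  data.frame(X = x, Outlier = !keep)
}

set.seed(123)
x <- c(rnorm(18, 0, 1), 5, 10)
flags <- grubbs.flag.index(x)
which(flags$Outlier)
```

For the example data this should flag the same two datapoints (rows 19 and 20) as grubbs.flag. The min.n guard simply stops the loop before too few datapoints remain to run the test.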

So, we are going to remove the last two values.


Comments

Weizi Li, 2014/08/24 02:40
Seems like these two methods are all assuming the normality. How about outliers removal of non-normal distributions?
Koji Yatani, 2014/08/28 06:59
Weizi, I think it would be tricky to remove outliers without assuming some kinds of distributions. If you are very sure that you can assume non-normal distributions, you can calculate the likelihood for each sample given the distribution you assume, and remove ones which show a very low likelihood. But in many cases, you are not 100% sure which distributions you can assume, and just want to stick with the normal distribution (with the assumption of central limit theorem, for example). So start with methods above unless you have strong confidence in the true distribution of samples.
Weizi Li, 2015/04/23 15:32
Thanks a lot! Sorry I just saw your reply, :)
hcistats/outlierdetection.txt · Last modified: 2014/08/14 05:21 by Koji Yatani
