Tuesday 28 February 2017

The MAD Method of Outlier Detection (Leys et al., 2013): Play Along in R


What is the median absolute deviation (MAD)? It's the median of the absolute deviations from the median for a vector of Xs, multiplied by a constant of (usually) 1.4826. The constant makes the MAD comparable to the standard deviation when the data are normally distributed. What does it look like in research? Well, imagine that you're interested in extraversion in schools. The MAD is the median absolute difference between the extraversion score of each member of the class and the median extraversion across the class, multiplied by the magic number 1.4826.

Working it out is straightforward. You can do it in four steps. First, find the median of a vector of Xs (median classroom extraversion score). Second, subtract this median from each X (each child's extraversion score) and take the absolute value of each difference, so that the sign is dropped. Third, find the median of these absolute differences. Fourth, multiply by 1.4826.
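Here are the four steps as a minimal sketch in R. The scores are made-up numbers for illustration only, and the variable names are my own. The last line checks the hand calculation against base R's mad() function, which uses 1.4826 as its default constant.

```r
# Made-up extraversion scores for a small class (illustration only)
x <- c(2, 4, 4, 5, 7, 9, 30)

# Step 1: median of the scores
med <- median(x)

# Step 2: absolute difference between each score and the median
abs_dev <- abs(x - med)

# Step 3: median of those absolute differences
raw_mad <- median(abs_dev)

# Step 4: multiply by the constant
mad_by_hand <- 1.4826 * raw_mad

# mad() in base R should give the same answer
all.equal(mad_by_hand, mad(x))
```

For these scores the median is 5, the median absolute difference is 2, and the MAD is therefore 2.9652.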

It's good to know what the MAD is because it is the foundation of a method of outlier detection that was recently touted as a successor to, and improvement over, the standard deviation method common in psychological research (Leys, Klein, Bernard, & Licata, 2013). All you need to do after you have the median absolute deviation is to multiply it by 2, 2.5, or 3, depending on how conservative you're feeling (the larger the multiplier, the fewer values get flagged). Then subtract the resulting value from the median to get the lower limit, and add it to the median to get the upper limit. Any Xs falling outside of this window are outliers.

In a flash report published in JESP, Leys and colleagues argue that the MAD method should be preferred to the standard deviation method. This is because the mean and standard deviation used to set the cutoffs in the standard deviation method are themselves heavily influenced by the presence of outliers. So using it is a bit like installing a fire alarm in your kitchen whose ability to detect a fire decreases substantially as the number of separate fires in the room increases. Which doesn't sound like a good idea.

My analogy, not theirs. And I'm not even sure it's a good one, but I'll go with it for lack of any better ideas.
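To put some numbers on the fire alarm, here's a rough sketch in R. The scores are made up, the two helper functions (sd_outliers and mad_outliers) are my own names rather than anything from the paper, and the cutoff of 3 is just one of the possible choices. It compares how many values each method flags when there is one extreme score and when there are three.

```r
# Made-up scores: 20 unremarkable values plus one or three extreme ones
base <- rep(c(4, 5, 6, 7, 8), times = 4)
one_fire <- c(base, 100)
three_fires <- c(base, 100, 100, 100)

# Standard deviation method: flag values more than k SDs from the mean
sd_outliers <- function(x, k = 3) {
  abs(x - mean(x)) > k * sd(x)
}

# MAD method: flag values more than k MADs from the median
mad_outliers <- function(x, k = 3) {
  abs(x - median(x)) > k * mad(x)
}

# Number of values flagged by each method
sum(sd_outliers(one_fire))      # 1
sum(sd_outliers(three_fires))   # 0
sum(mad_outliers(one_fire))     # 1
sum(mad_outliers(three_fires))  # 3
```

With one extreme score both methods catch it. With three, the extreme scores inflate the mean and standard deviation so much that the standard deviation method flags nothing, while the MAD method still catches all three.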



Anyway. Enough of the dodgy analogies. Below is an R script for the MAD method of outlier detection for the computational example given by Leys et al. (2013). It can be easily modified for use with your own data: just change the numbers in the vector at the top, or direct it to use a column from a data frame imported into R.
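A minimal sketch of such a script follows. The vector is assumed to match the series used in the paper's worked example, so it's worth checking against the paper. The file name and column name in the commented-out lines are hypothetical placeholders, and the cutoff of 2.5 is simply the middle of the three options.

```r
# Data: swap in your own numbers here.
# Assumed to be the series from the worked example in Leys et al. (2013).
x <- c(1, 3, 3, 6, 8, 10, 10, 1000)

# Or use a column from a data frame instead (hypothetical names):
# dat <- read.csv("my_data.csv")
# x <- dat$extraversion

# Cutoff: 2, 2.5, or 3, depending on how conservative you're feeling
k <- 2.5

# Median and MAD (mad() multiplies by the constant for you)
med <- median(x, na.rm = TRUE)
mad_x <- mad(x, constant = 1.4826, na.rm = TRUE)

# Window around the median
lower <- med - k * mad_x
upper <- med + k * mad_x

# Values falling outside the window are treated as outliers
outliers <- x[!is.na(x) & (x < lower | x > upper)]

cat("Median:", med, "\n")
cat("MAD:", mad_x, "\n")
cat("Window:", lower, "to", upper, "\n")
cat("Outliers:", outliers, "\n")
```

For this vector the median is 7 and the MAD is 5.1891, so with a cutoff of 2.5 the window runs from about -5.97 to 19.97 and the only value flagged is 1000.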

Here's the paper.


