| Tukey's Outlier Filter |
|
Data samples often contain outliers, and the larger the sample, the higher the probability of getting at least one outlier (a situation called one-wild). When outliers are present, estimates of the average and the standard deviation are distorted. These estimates should therefore not be used to make inferences about central tendency and data spread, or to produce confidence intervals for the central value of the data distribution. Two strategies are available when outliers are detected. First, we can use quartiles (or quantile-based confidence intervals, such as the interval for the median derived from the sign test), because they are less sensitive to outliers. Second, we can remove the detected outliers and then estimate the average, the standard deviation, and confidence intervals from the remaining data. Tukey proposed the following rule for setting aside outliers: exclude observation Y from the computation of the average and the standard deviation if Y < Q1 - 1.5 × IQR or Y > Q3 + 1.5 × IQR, where Q1 is the lower quartile, Q3 is the upper quartile, and IQR = Q3 - Q1 is the interquartile range. Boxplots are generally programmed to place their outlier cutoffs, or fences, according to Tukey's rule. See David Hoaglin, Frederick Mosteller, and John Tukey (editors), Understanding Robust and Exploratory Data Analysis, New York, John Wiley & Sons, 1983, pp. 39, 54, 62, 223. See also http://en.wikipedia.org/wiki/John_W._Tukey and http://en.wikipedia.org/wiki/Outliers |
|
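The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names `tukey_fences` and `tukey_filter` and the sample data are hypothetical, and the quartiles are computed with the standard library's `statistics.quantiles` using its "inclusive" method. Other quartile conventions exist and can give slightly different fences for the same data.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    Assumption: quartiles use the "inclusive" interpolation method of
    statistics.quantiles; other conventions may shift the fences slightly.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_filter(values, k=1.5):
    """Split values into (kept, set_aside) according to Tukey's rule."""
    lower, upper = tukey_fences(values, k)
    kept = [y for y in values if lower <= y <= upper]
    set_aside = [y for y in values if y < lower or y > upper]
    return kept, set_aside


# Hypothetical sample with one wild observation.
data = [2, 3, 4, 5, 6, 7, 8, 9, 100]
kept, set_aside = tukey_filter(data)

print("fences:", tukey_fences(data))
print("set aside:", set_aside)
print("mean with outlier:", statistics.mean(data))
print("mean without outlier:", statistics.mean(kept))
```

With this sample, the single wild value pulls the average well away from the bulk of the data, while the average of the retained observations sits in the middle of them. This is the distortion the filter is meant to avoid.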