There are several ways to find outliers in a dataset, one of which is the interquartile range (IQR) method.
The interquartile range, often abbreviated IQR, is the difference between the 25th percentile (Q1) and the 75th percentile (Q3) in a dataset. It measures the spread of the middle 50% of values and shows how the data is spread about the median. It is also less susceptible than the range to outliers and can, therefore, be more helpful.
What are outliers?
Outliers are values at the extreme ends of a dataset. Some may represent true values from natural variation in the population. Other outliers may result from incorrect data entry, equipment malfunctions, or other measurement errors.
An outlier isn’t always a form of erroneous or incorrect data, so you have to be careful with them in data cleansing. What you should do with an outlier depends on its most likely cause.
Types of outliers
True outliers
True outliers should always be retained in your dataset because these just represent natural variations in your sample.
An example of a true outlier is when you measure 100-meter running times for a representative sample of 560 college students. Your data are normally distributed with a couple of outliers on either end. Most values are centered around the middle, as expected. But these extreme values also represent natural variations because a variable like running time is influenced by many other factors.
True outliers are also present in variables with skewed distributions where many data points are spread far from the mean in one direction. It’s important to select appropriate statistical tests or measures when you have a skewed distribution or many outliers.
Other outliers
Outliers that don’t represent true values can come from many possible sources:
- Data entry or processing errors
- Measurement errors
- Unrepresentative sampling
An example of other outliers is when you repeat your running time measurements for a new sample. For one of the participants, you accidentally start the timer midway through their sprint. You record this timing as their running time.
This data point is a big outlier in your dataset because it’s much lower than all of the other times.
This type of outlier is problematic because it’s inaccurate and can distort your research results if you calculate the average running time for all participants using this data. The average is much lower when you include the outlier compared to when you exclude it. Your standard deviation also increases when you include the outlier, so your statistical power is lower as well.
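To make this effect concrete, here is a minimal sketch using made-up running times. The values, including the mistimed 6.0-second run, are hypothetical and chosen only to illustrate how one erroneous point shifts the mean and the standard deviation.

```python
from statistics import mean, stdev

# Hypothetical 100-meter times in seconds (illustrative values only).
valid_times = [13.2, 13.8, 14.1, 14.5, 14.9, 15.3]
# Hypothetical mistimed run: the timer was started midway through the sprint.
mistimed = 6.0

with_error = valid_times + [mistimed]

mean_without = mean(valid_times)
mean_with = mean(with_error)
sd_without = stdev(valid_times)  # sample standard deviation
sd_with = stdev(with_error)

print(f"mean without error: {mean_without:.2f}, with error: {mean_with:.2f}")
print(f"SD without error:   {sd_without:.2f}, with error: {sd_with:.2f}")
```

With these made-up numbers, including the mistimed run lowers the mean and raises the standard deviation, which is the pattern described above.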
In practice, it can be difficult to tell different types of outliers apart. While you can use calculations and statistical methods to detect outliers, classifying them as true or false is usually a subjective process.
Understanding IQR
Any set of data can be described by its five-number summary. These five numbers, which give you the information you need to find patterns and outliers, consist of (in ascending order):
- The minimum or lowest value of the dataset
- The first quartile Q1, which represents a quarter of the way through the list of all data
- The median of the data set, which represents the midpoint of the whole list of data
- The third quartile Q3, which represents three-quarters of the way through the list of all data
- The maximum or highest value of the data set.
These five numbers summarize a dataset more clearly than scanning every value at once. For example, the range, which is the maximum minus the minimum, is one indicator of how spread out the data is.
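The five-number summary can be sketched in a few lines of Python. The function name `five_number_summary` and the sample data are hypothetical, and the quartiles use the "inclusive" convention of the standard library's `statistics.quantiles`; other quartile conventions can give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" is one of several quartile conventions; it
    # interpolates linearly between the sorted data points.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical dataset, chosen only for illustration.
data = [7, 1, 40, 9, 3, 15, 5, 13, 11]
summary = five_number_summary(data)
minimum, q1, median, q3, maximum = summary

print(summary)
print("range:", maximum - minimum)
```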
However, note that the range is highly sensitive to outliers. If an outlier is also a minimum or maximum, the range will not be an accurate representation of the breadth of a data set.
Similar to the range but less sensitive to outliers is the interquartile range. It is calculated in much the same way as the range: subtract the first quartile from the third quartile:
IQR = Q3 – Q1.
The interquartile range shows how the data is spread about the median. It is less susceptible than the range to outliers and can, therefore, be more helpful.
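As a rough illustration of that difference in sensitivity, the sketch below compares the range and the IQR for two hypothetical datasets that differ only in their largest value. The helper names `value_range` and `iqr` are assumptions made for this example, and the quartiles again use the "inclusive" convention.

```python
from statistics import quantiles

def value_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

def iqr(values):
    """Third quartile minus first quartile (inclusive convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Two hypothetical datasets that differ only in the largest value.
typical = [1, 3, 5, 7, 9, 11, 13, 15, 17]
with_extreme = [1, 3, 5, 7, 9, 11, 13, 15, 170]

print("range:", value_range(typical), "vs", value_range(with_extreme))
print("IQR:  ", iqr(typical), "vs", iqr(with_extreme))
```

In this example the range jumps from 16 to 169, while the IQR stays at 8 because the quartiles do not depend on the most extreme value.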
How to find outliers with IQR
Although the interquartile range itself is not strongly affected by outliers, it can be used to detect them. This is done using these steps:
- Calculate the interquartile range for the data.
- Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).
- Add 1.5 × IQR to the third quartile. Any number greater than this is a suspected outlier.
- Subtract 1.5 × IQR from the first quartile. Any number less than this is a suspected outlier.
Remember that the interquartile rule is only a rule of thumb that generally holds but does not apply to every case. In general, you should always follow up your outlier analysis by studying the resulting outliers to see if they make sense. Any potential outlier obtained by the interquartile method should be examined in the context of the entire set of data.
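The steps above can be sketched as a small Python helper. The function names `iqr_fences` and `suspected_outliers`, the parameter `k`, and the sample data are all hypothetical. The quartile convention ("inclusive") is also an assumption, so the limits may differ slightly from those produced by other tools.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower limit, upper limit) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # the interquartile range
    return q1 - k * spread, q3 + k * spread

def suspected_outliers(values, k=1.5):
    """Return the values that fall outside the IQR limits."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical dataset, chosen only for illustration.
data = [7, 1, 40, 9, 3, 15, 5, 13, 11]
lower, upper = iqr_fences(data)
flagged = suspected_outliers(data)

print("limits:", lower, upper)
print("suspected outliers:", flagged)
```

The function only flags candidates. Whether a flagged value is an error or a true extreme still has to be judged in the context of the whole dataset.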
A practical example of how to find outliers with IQR
This tutorial provides a step-by-step example of how to find outliers in a dataset using the IQR method.
Step 1: Create the Data
Suppose we have the following dataset:
Step 2: Identify the First and Third Quartile
The first quartile turns out to be 5 and the third quartile turns out to be 20.75.
Thus, the interquartile range turns out to be 20.75 – 5 = 15.75.
Step 3: Find the Lower and Upper Limits
The lower limit is calculated as:
Lower limit = Q1 – 1.5*IQR = 5 – 1.5*15.75 = -18.625
And the upper limit is calculated as:
Upper limit = Q3 + 1.5*IQR = 20.75 + 1.5*15.75 = 44.375
Step 4: Identify the Outliers
The only observation in the dataset with a value less than the lower limit or greater than the upper limit is 46. Thus, this is the only outlier in this dataset.
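The arithmetic in Steps 2 to 4 can be checked with a short script. It uses only the quartiles and the flagged value stated in this example. The variable names are chosen for illustration.

```python
# Quartiles as stated in Step 2 of the example.
q1 = 5
q3 = 20.75

iqr = q3 - q1                  # interquartile range
lower_limit = q1 - 1.5 * iqr   # Step 3: lower limit
upper_limit = q3 + 1.5 * iqr   # Step 3: upper limit

# Step 4: check the observation that the example identifies as an outlier.
observation = 46
is_outlier = observation < lower_limit or observation > upper_limit

print(iqr, lower_limit, upper_limit, is_outlier)
```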
Note: You can also use an outlier boundary calculator to automatically find the upper and lower boundaries for outliers in a given dataset.
Other ways of calculating outliers
Apart from IQR, you can use several other methods to find outliers depending on your time and resources.
Sorting method
You can sort quantitative variables from low to high and scan for extremely low or extremely high values. Flag any extreme values that you find.
This is a simple way to check whether you need to investigate certain data points before using more sophisticated methods.
Example: Sorting method
Your dataset for a pilot experiment consists of 8 values.
180 | 156 | 9 | 176 | 163 | 1827 | 166 | 171 |
You sort the values from low to high and scan for extreme values.
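9 | 156 | 163 | 166 | 171 | 176 | 180 | 1827 |
The values 9 and 1827 stand out as extreme compared with the rest, which all fall between 156 and 180, so you flag them for further investigation.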
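As a quick sketch, the same sorting step can be done in Python using the eight values from the pilot experiment above. The variable names are chosen for illustration.

```python
# The eight values from the pilot experiment, in their original order.
values = [180, 156, 9, 176, 163, 1827, 166, 171]

# Sort from low to high so extreme values end up at either end.
ordered = sorted(values)

print(ordered)
print("lowest:", ordered[0], "highest:", ordered[-1])
```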
Statistical outlier detection
Statistical outlier detection involves applying statistical tests or procedures to identify extreme values.
You can convert extreme data points into z scores that tell you how many standard deviations away they are from the mean.
If a value has a high enough or low enough z score, it can be considered an outlier. As a rule of thumb, values with a z score greater than 3 or less than –3 are often considered outliers.
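A minimal sketch of the z score rule of thumb is shown below. The function names, the `threshold` parameter, and the dataset are hypothetical, and the sketch assumes the sample standard deviation is used.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return the z score of each value, using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return values whose z score is above threshold or below -threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical dataset: 20 similar values plus one extreme value.
data = [9, 10, 11, 10] * 5 + [100]
flagged = z_score_outliers(data)

print("suspected outliers:", flagged)
```

Keep in mind that an extreme value also inflates the standard deviation, so in very small samples no z score can reach 3. With only 8 values, for instance, the largest possible z score based on the sample standard deviation is about 2.47.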
Using visualizations
You can use software to visualize your data with a box plot, or a box-and-whisker plot, so you can see the data distribution at a glance. This type of chart highlights minimum and maximum values (the range), the median, and the interquartile range for your data.
Many computer programs mark outliers on a box plot with an asterisk or a dot, and these points lie beyond the whiskers of the plot.
How to deal with outliers
Once you’ve identified outliers, you’ll decide what to do with them. Your main options are removing them from your dataset or retaining them. This is similar to the choice you’re faced with when dealing with missing data.
For each outlier, think about whether it’s a true value or an error before deciding.
- Does the outlier line up with other measurements taken from the same participant?
- Is this data point completely impossible or can it reasonably come from your population?
- What’s the most likely source of the outlier? Is it a natural variation or an error?
In general, you should try to accept outliers as much as possible unless it’s clear that they represent errors or bad data.
Remove outliers
Outlier removal means deleting extreme values from your dataset before you perform statistical analyses. You aim to delete any dirty data while retaining true extreme values.
It’s a tricky procedure because it’s often impossible to tell the two types apart for sure. Deleting true outliers may lead to a biased dataset and an inaccurate conclusion.
For this reason, you should only remove outliers if you have legitimate reasons for doing so. It’s important to document each outlier you remove and your reasons so that other researchers can follow your procedures.
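One way to keep removals traceable is to record each removed value together with its reason. The sketch below is only an illustration: the function `remove_documented`, the log format, and the running times are hypothetical, and the decision about which values to remove is left to you rather than automated.

```python
def remove_documented(values, removals):
    """Split values into kept data and a log of documented removals.

    `removals` maps an index in `values` to the reason for removing it.
    """
    kept = [v for i, v in enumerate(values) if i not in removals]
    log = [
        {"index": i, "value": values[i], "reason": reason}
        for i, reason in sorted(removals.items())
    ]
    return kept, log

# Hypothetical running times in seconds; index 2 is a mistimed run.
times = [13.2, 13.8, 6.0, 14.1, 14.5, 14.9, 15.3]
kept, log = remove_documented(
    times, {2: "timer started midway through the sprint"}
)

print(kept)
for entry in log:
    print(entry)
```

Storing the log alongside the cleaned data lets other researchers see exactly which values were dropped and why.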
Retain outliers
Just like with missing values, the most conservative option is to keep outliers in your dataset. Keeping outliers is usually the better option when you’re not sure if they are errors.
With a large sample, outliers are expected and more likely to occur. However, each outlier has less of an effect on your results when your sample is large enough. The central tendency and variability of your data won’t be as affected by a couple of extreme values when you have a large number of values.
If you have a small dataset, you may also want to retain as much data as possible to make sure you have enough statistical power. If your dataset ends up containing many outliers, you may need to use a statistical test that’s more robust to them. Non-parametric statistical tests perform better for these data.