{"id":15290,"date":"2023-11-23T13:56:35","date_gmt":"2023-11-23T13:56:35","guid":{"rendered":"https:\/\/businessyield.com\/tech\/?p=15290"},"modified":"2023-11-23T13:56:36","modified_gmt":"2023-11-23T13:56:36","slug":"how-to-find-outliers-with-iqr-easy-guide","status":"publish","type":"post","link":"https:\/\/businessyield.com\/tech\/how-to\/how-to-find-outliers-with-iqr-easy-guide\/","title":{"rendered":"How To Find Outliers With IQR: Easy Guide","gt_translate_keys":[{"key":"rendered","format":"text"}]},"content":{"rendered":"\n
There are several ways to find outliers in a data set, one of which is the interquartile range (IQR) method.<\/p>\n\n\n\n
The interquartile range, often abbreviated IQR, is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. It measures the spread of the middle 50% of values, which shows how the data is spread about the median. It is also less sensitive to outliers than the range and can, therefore, be more helpful.<\/p>\n\n\n\n
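As a rough illustration of the definition above, the sketch below computes the IQR and the range for two hypothetical lists that differ only in their largest value. The data, the helper names `iqr` and `value_range`, and the choice of Python's `statistics.quantiles` with the `inclusive` method are all assumptions made for this example. Quartile conventions differ between tools, so other software may report slightly different quartiles.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: Q3 minus Q1 (inclusive quartile convention)."""
    q1, _median, q3 = quantiles(values, n=4, method='inclusive')
    return q3 - q1


def value_range(values):
    """Range: the minimum subtracted from the maximum."""
    return max(values) - min(values)


# Hypothetical data: the second list swaps the largest value for an extreme one.
clean = [2, 4, 5, 7, 8, 9, 11, 12, 14]
with_outlier = [2, 4, 5, 7, 8, 9, 11, 12, 140]

for label, data in (('clean', clean), ('with outlier', with_outlier)):
    print(label, 'IQR =', iqr(data), 'range =', value_range(data))
```

In this sketch the range changes sharply once the extreme value is introduced, while the IQR stays the same, because only the middle 50% of values enters the IQR.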
Outliers are values at the extreme ends of a dataset. Some may represent true values from natural variation in the population. Other outliers may result from incorrect data entry, equipment malfunctions, or other\u00a0measurement errors.<\/p>\n\n\n\n
An outlier isn\u2019t always a form of erroneous or incorrect data, so you have to be careful with them in\u00a0data cleansing. What you should do with an outlier depends on its most likely cause.<\/p>\n\n\n\n
True outliers should always be retained in your dataset because they represent natural variations in your\u00a0sample<\/a>.<\/p>\n\n\n\n
An example of a true outlier: you measure 100-meter running times for a representative sample of 560 college students. Your data are\u00a0normally distributed\u00a0with a couple of outliers on either end. Most values are centered around the middle, as expected. But the extreme values also represent natural variation, because a variable like running time is influenced by many other factors.<\/p>\n\n\n\n
True outliers are also present in variables with skewed distributions, where many data points are spread far from the\u00a0mean\u00a0in one direction. It\u2019s important to select\u00a0appropriate statistical tests\u00a0or measures when you have a\u00a0skewed\u00a0distribution or many outliers.<\/p>\n\n\n\n
Other outliers<\/strong><\/h4>\n\n\n\n
Outliers that don\u2019t represent true values can come from many possible sources, such as incorrect data entry, equipment malfunctions, or other measurement errors.<\/p>\n\n\n\n
An example of this kind of outlier: you repeat your running time measurements for a new sample. For one of the participants, you accidentally start the timer midway through their sprint, and you record this timing as their running time.<\/p>\n\n\n\n
This data point is a big outlier in your dataset because it\u2019s much lower\u00a0than\u00a0all of the other times.<\/p>\n\n\n\n
This type of outlier is problematic because it\u2019s inaccurate and can distort your\u00a0research results. If you calculate the average running time for all participants using this data, the average is much lower when you include the outlier than when you exclude it. Your\u00a0standard deviation\u00a0also increases when you include the outlier, so your\u00a0statistical power\u00a0is lower as well.<\/p>\n\n\n\n
In practice, it can be difficult to tell different types of outliers apart. While you can use calculations and statistical methods to detect outliers, classifying them as true or false is usually a subjective process.<\/p>\n\n\n\n
Understanding IQR<\/strong><\/span><\/h2>\n\n\n\n
Any set of data can be described by its\u00a0five-number summary. These five numbers, which give you the information you need to find patterns and outliers, consist of (in ascending order) the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.<\/p>\n\n\n\n
These five numbers tell you more about your data than looking at all the values at once could, or at least make this much easier. For example, the\u00a0range, which is the minimum subtracted from the maximum, is one indicator of how spread out the data in a set is.<\/p>\n\n\n\n
However, note that the range is highly sensitive to outliers. If an outlier is also the minimum or maximum, the range will not be an accurate representation of the breadth of a data set.<\/p>\n\n\n\n
Without the five-number summary, the range would be harder to read off the data. Similar to the range but less sensitive to outliers is the interquartile range. The\u00a0interquartile range\u00a0is calculated in much the same way as the range: subtract the first quartile from the third quartile.<\/p>\n\n\n\n
IQR = Q<\/em>3<\/sub> \u2013 Q<\/em>1<\/sub>.<\/p>\n<\/blockquote>\n\n\n\n
The interquartile range shows how the data is spread about the median. It is less susceptible than the range to outliers and can, therefore, be more helpful.<\/p>\n\n\n\n
How to find outliers with IQR<\/strong><\/span><\/h2>\n\n\n\n
Though it\u2019s not often affected much by them, the interquartile range can be used to detect outliers. This is done using these steps: calculate the interquartile range, multiply it by 1.5, add the result to the third quartile to get the upper limit, and subtract it from the first quartile to get the lower limit. Any value above the upper limit or below the lower limit is a suspected outlier.<\/p>\n\n\n\n
Remember that the interquartile rule is only a rule of thumb that generally holds but does not apply to every case. In general, you should always follow up your outlier analysis by studying the resulting outliers to see if they make sense. Any potential outlier obtained by the interquartile method should be examined in the context of the entire set of data.<\/p>\n\n\n\n
A practical example of how to find outliers with IQR<\/strong><\/span><\/h2>\n\n\n\n
This section provides a step-by-step example of how to find outliers in a dataset using the IQR method.<\/p>\n\n\n\n
Step 1: Create the Data<\/strong><\/h3>\n\n\n\n
Suppose we have the following dataset:<\/p>\n\n\n
<\/figure><\/div>\n\n\n
Step 2: Identify the First and Third Quartiles<\/strong><\/h3>\n\n\n\n
The first quartile turns out to be 5<\/strong> and the third quartile turns out to be 20.75<\/strong>.<\/p>\n\n\n
<\/figure><\/div>\n\n\n
Thus, the interquartile range turns out to be 20.75 \u2013 5 = 15.75<\/strong>.<\/p>\n\n\n\n
Step 3: Find the Lower and Upper Limits<\/strong><\/h3>\n\n\n\n
The lower limit is calculated as:<\/p>\n\n\n\n
Lower limit = Q1 \u2013 1.5*IQR = 5 \u2013 1.5*15.75 = -18.625<\/strong><\/p>\n\n\n\n
And the upper limit is calculated as:<\/p>\n\n\n\n
Upper limit = Q3 + 1.5*IQR = 20.75 + 1.5*15.75 = 44.375<\/strong><\/p>\n\n\n
<\/figure><\/div>\n\n\n
Step 4: Identify the Outliers<\/strong><\/h3>\n\n\n\n
The only observation in the dataset with a value less than the lower limit or greater than the upper limit is 46<\/strong>. Thus, this is the only outlier in this dataset.<\/p>\n\n\n
<\/figure><\/div>\n\n\n
Note:<\/strong>\u00a0You can use this\u00a0Outlier Boundary Calculator<\/a>\u00a0to automatically find the upper and lower boundaries for outliers in a given dataset.<\/p>\n\n\n\n
Other ways of calculating outliers<\/strong><\/span><\/h2>\n\n\n\n
Apart from IQR, you can use several other methods to find outliers depending on your time and resources.<\/p>\n\n\n\n
Sorting method<\/strong><\/span><\/h3>\n\n\n\n
You can\u00a0sort\u00a0quantitative variables\u00a0from low to high and scan for extremely low or extremely high values. Flag any extreme values that you find.<\/p>\n\n\n\n
This is a simple way to check whether you need to investigate certain data points before using more sophisticated methods.<\/p>\n\n\n\n
Example: Sorting method<\/p>\n\n\n\n
Your dataset for a pilot experiment consists of 8 values.<\/p>\n\n\n\n
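The sorting method is mostly a visual check, but it can be sketched in code. The example below sorts eight hypothetical pilot values from low to high. To make the flagging step concrete, it marks any value outside the Q1 - 1.5*IQR and Q3 + 1.5*IQR limits of the IQR method. The values are invented for illustration and are not the article's own data. The helper names `iqr_limits` and `sort_and_flag`, and the use of `statistics.quantiles` with the `inclusive` convention, are likewise assumptions.

```python
from statistics import quantiles


def iqr_limits(values, k=1.5):
    """Return (lower, upper) limits from the k * IQR rule of thumb."""
    q1, _median, q3 = quantiles(values, n=4, method='inclusive')
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def sort_and_flag(values, k=1.5):
    """Sort low to high and pair each value with True if it lies outside the limits."""
    lower, upper = iqr_limits(values, k)
    return [(v, v < lower or v > upper) for v in sorted(values)]


# Hypothetical pilot measurements (invented for this sketch).
pilot = [12, 15, 11, 14, 13, 95, 16, 10]

for value, flagged in sort_and_flag(pilot):
    print(value, 'possible outlier' if flagged else '')
```

A flagged value is only a candidate. As with the interquartile rule in general, it still needs to be examined in the context of the whole dataset before you decide what to do with it.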