An outlier is a point (or set of points) in a dataset that does not follow the dominant pattern of the data. Because outliers have many sources, from variability in the measurement to high-power noise in data collection, it is almost impossible to find a real-world dataset without them. It is therefore vital to make sure that data modeling is robust to outliers. A robust machine learning algorithm is one that maintains high performance even in the presence of outliers. Without robust learning, the fitted model is biased and the insight drawn from the data is misleading.
In this post, I would like to scratch the surface of outlier detection and removal by introducing Random Sample Consensus (RANSAC). RANSAC is a non-deterministic iterative algorithm that estimates the parameters of a (supervised) machine learning model from a dataset that contains outliers. To do so, RANSAC divides the points in the dataset into two subsets: 1- outliers 2- inliers. It then uses only the inliers to fit the ML model.
Generally speaking, RANSAC starts by randomly selecting a subset of points as hypothetical inliers. The subset is chosen just large enough to fit the ML model. For example, linear regression needs at least n+1 points, where n is the dimension of the features. After fitting the model to the hypothetical inliers, RANSAC checks which points in the original dataset are consistent with the fitted model. These points form the consensus set, and if it is larger than the best one found so far, it replaces the current inlier subset. RANSAC repeats these steps until the inlier subset is large enough (the required size is an input to the algorithm) or the maximum number of iterations is reached.
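The loop described above can be sketched in a few lines of NumPy. This is only a toy illustration for a 1D line fit, not the scikit-learn implementation; the function name `ransac_line` and the parameters `n_iter`, `threshold` and `min_inliers` are names I made up for this sketch.

```python
import numpy as np

def ransac_line(x, y, n_iter=100, threshold=1.0, min_inliers=None, seed=0):
    """Toy RANSAC for y = a*x + b (illustrative names and defaults)."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(x), dtype=bool)
    best_params = None
    for _ in range(n_iter):
        # 1. Draw the minimal sample: n + 1 = 2 points for a 1D feature.
        idx = rng.choice(len(x), size=2, replace=False)
        if x[idx[0]] == x[idx[1]]:
            continue  # degenerate sample, it does not define a line
        # 2. Fit the model to the hypothetical inliers.
        a, b = np.polyfit(x[idx], y[idx], deg=1)
        # 3. Points with a small residual are consistent with this model.
        mask = np.abs(y - (a * x + b)) < threshold
        # 4. Keep the largest consensus set and refit on all of its points.
        if mask.sum() > best_mask.sum():
            best_mask = mask
            best_params = np.polyfit(x[mask], y[mask], deg=1)
        # 5. Stop early once the consensus set is large enough.
        if min_inliers is not None and best_mask.sum() >= min_inliers:
            break
    return best_params, best_mask

# Usage on noise-free points from y = 2x + 1 with 5 of 50 targets corrupted
x = np.linspace(0, 10, 50)
y = 2 * x + 1
y[::10] += 30
params, mask = ransac_line(x, y)
print(params, mask.sum())
```

Refitting on the whole consensus set in step 4 is one common design choice; the minimal two-point fit is only used to propose candidates.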
The RANSAC algorithm is built on two fundamental assumptions:
1- There are enough inlier points to agree on a good model.
2- The outliers will not vote consistently for any single model. This is vital because otherwise the outliers will consistently support their own model and the iteration will fall into a local optimum.
Example:
(Code in Python using the scikit-learn module.) Let us study the performance of RANSAC on a linear regression problem. Let us create a simple 1D dataset that has 10% outliers, as depicted in the following plot:
A dataset with 10% outliers for the linear regression y=2x+1.
Clearly, in the above plot x has a linear relation with y (y=2x+1), with a small number of points that do not fall into this pattern. Let us first fit a linear model as follows:
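A dataset of this kind could be generated as in the following sketch. The sample size, the x range, the noise level, the seed and the way the outliers are placed are all my assumptions; only the line y=2x+1 and the 10% outlier fraction come from the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n_samples, n_outliers = 100, 10  # 10% of the points become outliers

# Inliers: y = 2x + 1 plus small Gaussian noise (noise level is an assumption)
X = rng.uniform(0, 10, size=(n_samples, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.5, size=n_samples)

# Outliers: overwrite 10 randomly chosen targets with values far above the line
outlier_idx = rng.choice(n_samples, size=n_outliers, replace=False)
y[outlier_idx] = rng.uniform(30, 50, size=n_outliers)
```

X is kept two-dimensional, with shape (100, 1), because scikit-learn estimators expect a feature matrix rather than a flat vector.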
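A minimal sketch of the two fits with scikit-learn, under assumptions: the data is synthetic (generated inside the block with an arbitrary seed and noise level), and `residual_threshold=2.0` is a cutoff I picked to be about four noise standard deviations. It first fits ordinary least squares to all points, then fits `RANSACRegressor`, whose default estimator is `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Assumed synthetic data: y = 2x + 1 with noise, 10 of 100 targets corrupted
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.5, size=100)
outlier_idx = rng.choice(100, size=10, replace=False)
y[outlier_idx] = rng.uniform(30, 50, size=10)

# Ordinary least squares on all points, outliers included
ols = LinearRegression().fit(X, y)

# RANSAC with its default LinearRegression estimator; the threshold is an assumption
ransac = RANSACRegressor(residual_threshold=2.0, random_state=0).fit(X, y)

inlier_mask = ransac.inlier_mask_  # True where the point is in the consensus set
outlier_mask = np.logical_not(inlier_mask)  # True where RANSAC rejected the point
```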
The last line creates a boolean array that describes whether the corresponding point in the feature vector is an outlier. If we look at the regression coefficients estimated with RANSAC, we can see that they are closer to the actual values (slope 2, intercept 1):
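To see the effect on the coefficients in isolation, here is a small sketch on a noise-free toy set that I made up for illustration (20 points on y=2x+1, two of them corrupted). The threshold and seed are again assumptions. On data like this the RANSAC estimate should land on slope 2 and intercept 1, while the plain least-squares fit is pulled away by the two bad points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Noise-free toy data on y = 2x + 1 with 2 of 20 targets (10%) corrupted
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1
y[[3, 12]] = 60.0

ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0).fit(X, y)

# The final RANSAC model is stored in estimator_, refit on the inliers only
print("OLS:    slope=%.3f intercept=%.3f" % (ols.coef_[0], ols.intercept_))
print("RANSAC: slope=%.3f intercept=%.3f"
      % (ransac.estimator_.coef_[0], ransac.estimator_.intercept_))
```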