An outlier is a point (or set of points) in a dataset that does not follow the dominant pattern of the data. Because outliers have many sources, from measurement variability to high-power noise during data collection, it is almost impossible to find a real-world dataset without them. It is therefore vital to make sure that our data modeling is robust to outliers. A robust machine learning algorithm is one that maintains high performance even in the presence of outliers. Without robust learning, the insight we draw from the data is biased and the resulting model is wrong.
In this post, I would like to scratch the surface of outlier detection and removal by introducing Random Sample Consensus (RANSAC). RANSAC is a non-deterministic iterative algorithm that estimates the parameters of a (supervised) machine learning model from a dataset that contains outliers. To do so, RANSAC divides the points in the dataset into two subsets, outliers and inliers, and then uses only the inliers to build the ML model.
Generally speaking, RANSAC starts by selecting a random subset of points as hypothetical inliers. This subset is chosen just large enough to fit the ML model; for example, linear regression needs at least n+1 points, where n is the dimension of the feature space. After fitting the model to the hypothetical inliers, RANSAC checks which points in the original dataset are consistent with the model instantiated with the estimated parameters, and updates the inlier subset accordingly. The algorithm repeats these steps until the inlier subset is large enough ("large enough" is an input to the algorithm) or the maximum number of iterations is reached.
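The iteration described above can be written as a short, generic loop. This is only a minimal sketch; the function names and parameters (fit_model, residual, min_inliers, and so on) are illustrative, not from any particular library:

```python
import random

def ransac(points, fit_model, residual, min_samples, threshold, max_iters, min_inliers):
    """Minimal generic RANSAC loop: returns the best model found and its inliers."""
    best_model, best_inliers = None, []
    for _ in range(max_iters):
        sample = random.sample(points, min_samples)            # hypothetical inliers
        model = fit_model(sample)                              # fit the small sample
        inliers = [p for p in points if residual(model, p) < threshold]
        if len(inliers) > len(best_inliers):                   # better consensus found
            best_model, best_inliers = fit_model(inliers), inliers
        if len(best_inliers) >= min_inliers:                   # "large enough" stopping rule
            break
    return best_model, best_inliers

# --- tiny demo: exact points from y = 2x + 1 plus two outliers ---
random.seed(0)

def fit_line(ps):
    # ordinary least-squares line through 2D points, returned as (slope, intercept)
    n = len(ps)
    sx = sum(x for x, _ in ps); sy = sum(y for _, y in ps)
    sxx = sum(x * x for x, _ in ps); sxy = sum(x * y for x, y in ps)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

def line_residual(model, p):
    slope, intercept = model
    return abs(slope * p[0] + intercept - p[1])

pts = [(float(x), 2.0 * x + 1.0) for x in range(20)] + [(0.5, 50.0), (7.3, -40.0)]
model, inliers = ransac(pts, fit_line, line_residual,
                        min_samples=2, threshold=1.0, max_iters=100, min_inliers=15)
```

In the demo, any sample of two true inliers recovers the line y = 2x + 1 exactly, so the consensus set quickly grows to all 20 inliers and the two outliers are left out.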
The RANSAC algorithm is designed around two fundamental assumptions:
1. There are enough inlier points to agree on a good model.
2. The outliers do not vote consistently for any single model. This is vital because otherwise the outliers would consistently form their own model and the iterations would converge to a local optimum.
(code in Python using the scikit-learn module) Let's study the performance of RANSAC on a linear regression problem. First, let's create a simple 1D dataset in which 10% of the points are outliers, as depicted in the following plot:
A dataset with 10% outliers for the linear regression y = 2x + 1.
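A dataset like the one plotted above can be generated with a few lines of NumPy. The exact noise level, outlier offsets, and random seed below are my own choices, not from the original plot:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

X = rng.uniform(0, 10, size=n)
y = 2 * X + 1 + rng.normal(0, 0.5, size=n)     # underlying pattern y = 2x + 1

# corrupt 10% of the points with large positive offsets to act as outliers
outlier_idx = rng.choice(n, size=n // 10, replace=False)
y[outlier_idx] += rng.uniform(20, 40, size=n // 10)

# to visualize: import matplotlib.pyplot as plt; plt.scatter(X, y); plt.show()
```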
Clearly, in the plot above, x has a linear relation with y (y = 2x + 1), with a small number of points that do not fall into this pattern. Let's first fit a linear model as follows:
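The original snippet did not survive here, so below is a sketch of what the fit might look like using scikit-learn's RANSACRegressor, whose base estimator defaults to an ordinary LinearRegression. The data generation mirrors the plot above, with numbers of my own choosing:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# regenerate a dataset like the one plotted above (y = 2x + 1, 10% outliers)
rng = np.random.default_rng(42)
n = 100
X = rng.uniform(0, 10, size=(n, 1))          # scikit-learn expects a 2D feature array
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, n)
y[rng.choice(n, size=n // 10, replace=False)] += rng.uniform(20, 40, size=n // 10)

ransac = RANSACRegressor(random_state=0)     # base estimator defaults to LinearRegression
ransac.fit(X, y)
outlier_mask = ~ransac.inlier_mask_          # True where RANSAC flagged the point as an outlier
```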
The last line creates a boolean array describing whether the corresponding point in the feature vector is an outlier or not. If we look at the fitted linear regression coefficients, we can see that they are close to the actual values:
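As an illustration (again with my own dataset parameters), one way to inspect the coefficients is to compare the RANSAC estimate against a plain least-squares fit on the same contaminated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# same synthetic dataset as above: y = 2x + 1 with 10% outliers
rng = np.random.default_rng(42)
n = 100
X = rng.uniform(0, 10, size=(n, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, n)
y[rng.choice(n, size=n // 10, replace=False)] += rng.uniform(20, 40, size=n // 10)

plain = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

print("plain OLS:", plain.coef_[0], plain.intercept_)    # pulled away from (2, 1) by the outliers
print("RANSAC   :", ransac.estimator_.coef_[0], ransac.estimator_.intercept_)  # near 2 and 1
```

The plain least-squares line is dragged toward the outliers, while the RANSAC estimate, fitted only on the consensus inliers, stays close to the true slope of 2 and intercept of 1.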