# Using RANSAC for Robust Regression

This article is originally published at https://hameddaily.blogspot.com/

*robust machine learning algorithm*is defined as the one that provides high performance even in the existence of outliers. Without robust learning the insight to the data is biased and does not have the correct model.

**.**

*Random Sample Consensus***is a a non-deterministic iterative algorithm that estimates the parameter of a (supervised) machine learning algorithm from a dataset that contains outliers. For that, RANSAC divides the points in the dataset into two subsets: 1- outlier 2- inlier. Then it uses the inliers to create the ML model.**

*RANSAC**hypothetical inliers*. The size of this subset is selected big enough to fit the ML model. For example for linear regression we need at least n+1 points where n is the dimension of the features. After fitting the model to the hypothetical inliers, RANSAC checks which elements in the original dataset are consistent with the model instantiated with the estimated parameters and, if it is the case, it updates the current subset. The RANSAC algorithm iteratively repeats until the inlier subset is large enough (large enough is an input to the algorithm) or reaching to the end of the iteration.

- There are enough inliers points to agree on a good model
- The outliers will not vote consistently for any single model. This is vital because otherwise the outliers will consistency create their own model and the iteration will fall into local optimum.

Let study the performance of RANSAC in a linear regression problem. Let create a simple 1D dataset that has 10% outliers as depicted in the following plot:

A dataset with 10% outlier for linear regression y=2x+1. |

Clearly, in above plot

*x*has linear relation with

*y*(

*y=2x+1*) with small points that do not fall into this pattern. Let first fit a linear model as follows:

from sklearn import linear_model

clt = linear_model.LinearRegression()

clt.fit(feature, y)

yhat = clt.predict(feature)

print clt.coef_

print clt.intercept_

--

[ 2.17529381]

3.34794649807

Ordinary linear regression. The model is biased toward the outliers. |

**robust linear regression**.

clt_ransac = linear_model.RANSACRegressor(linear_model.LinearRegression())

clt_ransac.fit(feature, y)

yhat_ransac = clt_ransac.predict(feature)

inlier_mask = clt_ransac.inlier_mask_

*feature*vector is an outlier or not. If we look at the linear regression coefficients we can see that it is closer to the actual:

print clt_ransac.estimator_.coef_

print clt_ransac.estimator_.intercept_

--

[[ 1.98733376]]

[ 1.57728374]

RANSAC divides the dataset into inliers and outliers. The inliers is used to fit the linear regression. |

A dataset with 50% outlier. |

RANSAC is strongly robust to the size of the outliers population as long as the inliers consistently vote for single model. |

[[ 1.90050698]]

[ 1.47968426]

Thanks for visiting r-craft.org

This article is originally published at https://hameddaily.blogspot.com/

Please visit source website for post related comments.