Multivariate Linear Regression – Part 3 – Refactoring – Python ML – OOP Basics
This article is originally published at https://www.stoltzmaniac.com
Goal of this post:
- Move beyond single linear regression into multiple linear regression by utilizing gradient descent
- Refactor using inheritance
- Reconfigure our
pytest
to include the general case
What we are leaving for the next post:
- Refactoring and utilizing inheritance
- Creating useful exceptions
- Adding p-values to model results
In the previous post, we were able to easily solve for coefficients / betas / weights of our model. Due to the fact this could be solved via a straightforward analytical solution, it did not make sense to utilize more complicated techniques to solve for weights. However, as we enter the multidimensional world, things are slightly more complicated. We will use numpy
and gradient descent to solve for the weights.
This process requires a lot of code. I will only go over the more complicated sections and leave the easier parts for you to dig through on GitHub.
Here we go!
First, we know that regression does not just apply to “linear regression” and certainly goes beyond “single linear regression.” There are a number of other algorithms that complete regression tasks: logistic regression, random forest, neural networks, the list goes on… So we want to create a base class
that will contain all of the important data points and methods that will apply across all of these machine learning techniques. What do we mean by this?
All regression techniques must have the following input:
- Dependent / target / class / “y” variable
- Independent / predictor / “x” variable(s)
- Split for test / train data
- Random seed
- Max number of iterations
- Learning rate
That means, we can utilize this as a base class. Every method beyond this can inherit those characteristics in order to allow us to keep DRY code and take advantage of OOP.
Here is our base class Regression
:
import numpy as np class Regression: def __init__( self, independent_vars: np.ndarray, dependent_var: np.ndarray, iterations: int, learning_rate: float, train_split: float, seed: int, ): """ :param independent_vars: np.ndarray :param dependent_var: np.array (one dimensional) :param iterations: int :param learning_rate: float :param train_split: float (0 < value < 1) :param seed: int """ # Check data types if type(seed) != int: raise ValueError(f"seed value not an int") if type(iterations) != int or iterations <= 0: raise ValueError(f"Invalid iterations value") type_check_arrays = [ type(independent_vars) == np.ndarray, type(dependent_var) == np.ndarray, ] if not all(type_check_arrays): raise ValueError(f"Type(s) of data for Regression class are not accurate") if ( not type(learning_rate) == float and type(train_split) == float and train_split <= 1 ): raise ValueError(f"learning_rate or train_split not acceptable input(s)") if dependent_var.shape[0] != independent_vars.shape[0]: raise ValueError( f"Dimension(s) of data for Regression class are not accurate" ) all_data = np.column_stack((independent_vars, dependent_var)) all_data = all_data.astype("float") np.random.seed(seed) np.random.shuffle(all_data) split_row = round(all_data.shape[0] * train_split) train_data = all_data[:split_row] test_data = all_data[split_row:] # Train self.independent_vars_train = train_data[:, :-1] self.dependent_var_train = train_data[:, -1:] # Test self.independent_vars_test = test_data[:, :-1] self.dependent_var_test = test_data[:, -1:] self.learning_rate = learning_rate self.iterations = iterations self.cost = []
You’ll notice that the first chunk of code defines what the class
will need for input. The next bit simply checks that the input is acceptable. Once we get into the numpy
data manipulations is where things get interesting. Due to the fact we are splitting the data, we combine the columns so the dependent and independent variables are lined up for test/train split and randomization. There are many ways to do this, but this is very clear and can be easily understood. Next, we initialize some self
variables and add self.cost
because it is nice to be able to have an idea of the cost reduction.
Our `Regression` class is complete and has everything we need to move forward with any type of regression. Since we are looking at linear regression, we will start by making a class for the general case. Previously, we created a SingleLinearRegression
class. However, that methodology does not work for multiple variables. So rather than having a MultipleLinearRegression
class as well (which could be done). A simple LinearRegression
class handles both cases. In theory, you could add a switch statement to utilize the analytical solution when there’s only one variable, but we are going to use gradient descent to solve both problems. Feel free to modify the formula if that is important to you.
class LinearRegression(Regression): def __init__( self, independent_vars, dependent_var, iterations, learning_rate, train_split, seed, ): """ All inherited from Regression class """ super().__init__( independent_vars, dependent_var, iterations, learning_rate, train_split, seed, ) # Add ones column to allow for beta 0 self.independent_vars_train = np.c_[ np.ones(len(self.independent_vars_train), dtype="int64"), self.independent_vars_train, ] self.independent_vars_test = np.c_[ np.ones(len(self.independent_vars_test), dtype="int64"), self.independent_vars_test, ] # Initialize betas if len(self.independent_vars_train.shape) == 1: self.B = np.zeros(1) else: self.B = np.zeros(self.independent_vars_train.shape[1]) # Automatically fit self.fit_gradient_descent() def predict(self, values_to_predict: np.ndarray): predicted_values = values_to_predict.dot(self.B).flatten() return predicted_values def find_gradient(self): estimate = self.predict(self.independent_vars_train) error = self.dependent_var_train.flatten() - estimate gradient = -(1.0 / len(self.independent_vars_train)) * error.dot( self.independent_vars_train ) self.cost.append(np.power(error, 2)) return gradient def fit_gradient_descent(self): for i in range(self.iterations): gradient = self.find_gradient() self.B = self.B - (self.learning_rate * gradient) def calculate_r_squared(self, independent_vars, dependent_var): sum_sq_r = np.sum((dependent_var - self.predict(independent_vars)) ** 2) sum_sq_t = np.sum((dependent_var - dependent_var.mean()) ** 2) return 1 - (sum_sq_r / sum_sq_t) def __str__(self): return f""" Model Results ------------- Betas: {[i for i in zip(range(len(self.B)), self.B)]} R^2 Train: {self.calculate_r_squared(self.independent_vars_train, self.dependent_var_train[:, 0])} R^2 Test: {self.calculate_r_squared(self.independent_vars_test, self.dependent_var_test[:, 0])} """
You will notice now that we have inherited the base class (Regression
) by simply defining our class as LinearRegression(Regression)
. In doing so, we do not need to redefine the input parameters. Another thing to note is that our LinearRegression
class does not require any additional parameters to be passed. This is not always the case. If you needed to increase the number of inputs, the __init__
method allows for more variables to be passed.
After instantiation, we see that our data needs a bit of fine tuning which would not be the case for other models (which is why it was not included in the base class). Our independent variables need an extra np.ones
column to be tacked on in order to calculate the intercept / beta 0 term. We apply this to both the test and train variables.
We also initialize the weights / betas / coefficients right after that in order to ensure we have the additional weight (vectors and matrices are sensitive to dimensions). If you are not familiar with numpy
and linear algebra, you should make that a priority because almost all machine learning depends on it. Our predict
method is self-explanatory. Moving on to the find_gradient
method, we see that after a gradient is calculated, we are reducing its impact by the learning_rate
(also referred to as alpha
). We can then change the weights in our fit_gradient_descent
method. We continue to do so for the amount of iterations given. Each iteration reduces the cost and brings our model closer to the reaching the global minimum. In many other algorithms we are only concerned with local minima, but here we will always tend to converge at the global minimum.
Rather than iterating a set number of times, a better solution tends to be setting a cost threshold at which you are willing to accept the error. This is not hard to implement but requires some exit conditions in case your model is not converging. We may follow up with this approach in the future.
The next step we care about, is R^2
. In a general sense, this is how well your model “fits” the data. The calculation is laid out by the calculate_r_squared
method and is self-explanatory.
Our __str__
method is understood by Python and is what comes out when you print your instance
. It is intended to be for the end user to consume rather than for developer’s use. In our case, we print the betas / weights / coefficients in the order they line up with the independent variable input (starting with our beta 0 term for an intercept). We also see test and train R^2 values in order to help us see how the value compares across data sets and increase our understanding of expectations for future performance.
Let’s take a look at how our tests are structured! Within our tests directory, we now see another set of test data and another regression test (for multiple). Our conftest.py
sets up our data just like before.
Taking a look at our test_multiple_linear_regression.py
file:
import numpy as np import pytest from regression import LinearRegression @pytest.fixture(scope="module") def multiple_linear_regression_model(multiple_linear_regression_data): linear_regression_model = LinearRegression( independent_vars=multiple_linear_regression_data["independent_vars"], dependent_var=multiple_linear_regression_data["dependent_var"], iterations=10000, learning_rate=0.001, train_split=0.7, seed=123, ) return linear_regression_model def test_multiple_linear_regression_data_passing_correctly( multiple_linear_regression_model, multiple_linear_regression_data ): """ Setup linear regression model :return: """ assert ( multiple_linear_regression_model.independent_vars_train.all() == multiple_linear_regression_data["independent_vars"].all() ) assert ( multiple_linear_regression_model.dependent_var_train.all() == multiple_linear_regression_data["dependent_var"].all() ) assert type(multiple_linear_regression_model.independent_vars_train) == np.ndarray assert type(multiple_linear_regression_model.dependent_var_train) == np.ndarray def test_multiple_linear_regression_coefficients(multiple_linear_regression_model): """ Test regression model coefficients :return: """ print(multiple_linear_regression_model) expected_coefficients = [ (0, 0.31701131823834194), (1, 0.019291066504884043), (2, 0.03215158052302674), (3, 0.7237572134362669), ] no_of_betas = len(multiple_linear_regression_model.B) for n in range(no_of_betas): assert ( pytest.approx(expected_coefficients[n][1], 0.001) == multiple_linear_regression_model.B[n] ) def test_multiple_linear_regression_r_squared(multiple_linear_regression_model): """ Test regression model r_squared :return: """ # Train Data train_r_squared = multiple_linear_regression_model.calculate_r_squared( multiple_linear_regression_model.independent_vars_train, multiple_linear_regression_model.dependent_var_train[:, 0], ) test_r_squared = multiple_linear_regression_model.calculate_r_squared( multiple_linear_regression_model.independent_vars_test, multiple_linear_regression_model.dependent_var_test[:, 0], ) assert pytest.approx(train_r_squared, 0.001) == 0.5179372162803452 assert pytest.approx(test_r_squared, 0.001) == 0.2796260458401886
We have solved our regression problems and know what the results should be. However, due to the fact that float
data types are very precise, we utilize pytest.approx
to round to a specific set of significant digits. The same setup is written for single linear regression as well.
When running our tests, we can see that everything passes!
Whew! That was a long post, hope you found it useful if you made it this far. You should be able to think of a lot more test cases to run and will likely have to adjust your class
to handle edge cases.
Thanks for visiting r-craft.org
This article is originally published at https://www.stoltzmaniac.com
Please visit source website for post related comments.