Single Linear Regression – Part 2 – Testing – Python ML – OOP Basics
This article is originally published at https://www.stoltzmaniac.com
We have now entered part 2 of our series on object oriented programming in Python for machine learning. If you have not already done so, you may want to check out the previous post –> Part 1.
Goal of this post:
- Fit a model to find coefficients
- Find the RMSE, R^2, slope and intercept of the model
- Test our model using `pytest`
What we are leaving for the next post:
- Refactoring and utilizing inheritance
- Creating useful exceptions
- Adding p-values to model results
Here we go!
This post will go over some of the basics of testing your OOP model. We will be using the `pytest` package to do our testing. The package is documented with great examples here. Testing is a key part of feeling comfortable with writing code and modifying your model.
Moving on, it’s worth noting that we have changed the directory structure of this project in minor ways:
We have also added to the `requirements.txt` file because we installed both `pytest` and `pandas`.
Let’s take a look at our new `regression.py` file:
- Added `numpy` – a popular library to help deal with manipulating vectors and matrices
- Removed `predict` from the current class (for now)
- Added `b1`, `b0`, `predicted_values`, and `fit()`, which allow us to fit a model once we create an instance
- Defined properties for the mean of the input data
- Created the `fit()` method to solve for the coefficients
- Created the `predict()` method to utilize the model
- Created the `root_mean_squared_error()` and `r_squared()` methods to assess the fit
- Changed the `__str__()` method to reflect the values in a straightforward way
```python
import numpy as np


class SingleLinearRegression:
    def __init__(self, independent_var: np.ndarray, dependent_var: np.ndarray):
        """
        Complete a single linear regression.
        :param independent_var: np.ndarray
        :param dependent_var: np.ndarray
        """
        self.independent_var = independent_var
        self.dependent_var = dependent_var
        self.b1 = None
        self.b0 = None
        self.predicted_values = None
        self.fit()

    @property
    def independent_var_mean(self):
        return np.mean(self.independent_var)

    @property
    def dependent_var_mean(self):
        return np.mean(self.dependent_var)

    def fit(self):
        # Format: dependent_var_hat = b1 * independent_var + b0
        x_minus_mean = [x - self.independent_var_mean for x in self.independent_var]
        y_minus_mean = [y - self.dependent_var_mean for y in self.dependent_var]
        b1_numerator = sum([x * y for x, y in zip(x_minus_mean, y_minus_mean)])
        b1_denominator = sum([(x - self.independent_var_mean) ** 2 for x in self.independent_var])
        self.b1 = b1_numerator / b1_denominator
        self.b0 = self.dependent_var_mean - (self.b1 * self.independent_var_mean)

    def predict(self, values_to_predict: np.ndarray):
        predicted_values = values_to_predict * self.b1 + self.b0
        return predicted_values

    def root_mean_squared_error(self):
        dependent_var_hat = self.predict(self.independent_var)
        sum_of_res = np.sum((dependent_var_hat - self.dependent_var) ** 2)
        rmse = np.sqrt(sum_of_res / len(dependent_var_hat))
        return rmse

    def r_squared(self):
        dependent_var_hat = self.predict(self.independent_var)
        sum_of_sq = np.sum((self.dependent_var - self.dependent_var_mean) ** 2)
        sum_of_res = np.sum((self.dependent_var - dependent_var_hat) ** 2)
        return 1 - (sum_of_res / sum_of_sq)

    def __str__(self):
        return f"""
        Model Results
        -------------
        b1: {round(self.b1, 2)}
        b0: {round(self.b0, 2)}
        RMSE: {round(self.root_mean_squared_error(), 2)}
        R^2: {round(self.r_squared(), 2)}
        """
```
You will notice that our math is relatively straightforward. The use of list comprehensions allows us to easily complete our calculations while respecting the order of operations. You may also notice that the class immediately fits a model to the data when it’s instantiated. This is not a common practice; however, it works well for our use case.
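As a quick sanity check on the math in `fit()`, we can compare the closed-form coefficients against `np.polyfit`, which solves the same least-squares problem. The small dataset below is made up purely for illustration (it is not the test `csv` used later in this post):

```python
import numpy as np

# Made-up sample data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form coefficients, mirroring the fit() method above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# np.polyfit with deg=1 fits the same least-squares line,
# returning [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print(round(b1, 4), round(slope, 4))      # both 1.97
print(round(b0, 4), round(intercept, 4))  # both 0.09
```

If the two approaches disagree beyond floating-point noise, something is wrong with the hand-rolled math.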
Now that we have our model in place, there is plenty of testing to do. This is not the test-driven development (TDD) way of doing things; I think TDD is a great way to work, but it is very rigid and not conducive to blog posts. Let’s expand our directory structure to see how we will test our model. Here’s a quick breakdown:
- `tests` -> folder is easily discovered by `pytest` and houses everything related to our testing
- `my_test_data` -> holds a `csv` with sample data for which we know what the results should be
- `regression` -> holds tests related to the `regression.py` file
- `conftest.py` -> a setup for `pytest` that allows us to easily pass “global” style variables and setup parameters for use throughout the tests
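Putting that breakdown together, the layout presumably looks something like this (a sketch inferred from the file paths used in this post, not a verbatim listing):

```text
.
├── regression.py
├── requirements.txt
└── tests/
    ├── conftest.py
    ├── my_test_data/
    │   └── my_test_data.csv
    └── regression/
        └── test_single_linear_regression.py
```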
In our case, when we run `pytest`, the first thing it will run is `conftest.py`. This is a “configuration” that allows us to pull our data one time, rather than having to read the `csv` for each test function. While it is not a big deal in our case, you could imagine having to do this for a very large test suite that may require a lot of database queries. Here is what’s going on:
- `pytest.fixture(scope='session')` -> sets the variable `single_linear_regression_data` as a “global” variable. This can be called for the session, and the variable can now be used and passed to all functions in the test suite. It will contain the `csv` data, returned as a dictionary in our case
```python
import pytest
import pandas as pd
import numpy as np


@pytest.fixture(scope='session')
def single_linear_regression_data() -> dict:
    """
    Setup test data for the regression tests.
    :return: dict of numpy arrays
    """
    df = pd.read_csv('my_test_data/my_test_data.csv')
    yield {
        'dependent_var': np.array(df['dependent_var']),
        'independent_var': np.array(df['independent_var'])
    }
    # Code after the yield runs at teardown, once the session ends
    print('single_linear_regression_data fixture finished.')
```
Next, we will look at the only file we have written that contains tests – `test_single_linear_regression.py`. Because we know what each method should return, we will ensure those results are accurate. There are many more tests that should be run to check these modules (e.g. checking what happens when data of different types are passed in, expecting errors in the data, utilizing null values, etc.). For demonstration purposes, we will simply test cases that we know to be accurate, but feel free to add on to this. Here’s what’s going on:
- `pytest.fixture(scope='module')` -> creates an instance of our `SingleLinearRegression` model utilizing the `single_linear_regression_data` fixture defined in `conftest.py`
- `test_single_linear_regression_data_passing_correctly` -> checks to see that data within the model is of the right type and that all input data matches what is stored in the model
- `test_single_linear_regression_fit` -> checks that the model that was fit has the same coefficients as expected (to a certain degree of accuracy)
- `test_single_linear_regression_rmse` and `test_single_linear_regression_r_squared` -> check that the calculated values match (to a certain degree of accuracy)
```python
import numpy as np
import pytest

from regression import SingleLinearRegression


@pytest.fixture(scope='module')
def reg_model(single_linear_regression_data):
    linear_regression_model = SingleLinearRegression(
        independent_var=single_linear_regression_data['independent_var'],
        dependent_var=single_linear_regression_data['dependent_var'])
    return linear_regression_model


def test_single_linear_regression_data_passing_correctly(reg_model, single_linear_regression_data):
    """
    Test that the input data was passed into the model correctly.
    :return:
    """
    # np.array_equal compares element by element; comparing the results of
    # .all() on each array would only check that both arrays are all-truthy
    assert np.array_equal(reg_model.independent_var, single_linear_regression_data['independent_var'])
    assert np.array_equal(reg_model.dependent_var, single_linear_regression_data['dependent_var'])
    assert isinstance(reg_model.independent_var, np.ndarray)
    assert isinstance(reg_model.dependent_var, np.ndarray)


def test_single_linear_regression_fit(reg_model):
    """
    Test regression model coefficients.
    :return:
    """
    assert pytest.approx(reg_model.b1, 0.01) == 1.14
    assert pytest.approx(reg_model.b0, 0.01) == 0.43


def test_single_linear_regression_rmse(reg_model):
    """
    Test regression model root mean squared error.
    :return:
    """
    assert pytest.approx(reg_model.root_mean_squared_error(), 0.02) == 0.31


def test_single_linear_regression_r_squared(reg_model):
    """
    Test regression model r_squared.
    :return:
    """
    assert pytest.approx(reg_model.r_squared(), 0.01) == 0.52
```
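A quick note on `pytest.approx`, since every numeric assertion above relies on it: the second positional argument is a relative tolerance, so wrapping a value with `pytest.approx(value, 0.01)` accepts anything within 1% of it. A minimal standalone illustration:

```python
import pytest

# pytest.approx(value, rel) compares within a relative tolerance:
# here, anything within 1% of 1.14 counts as equal
assert pytest.approx(1.14, 0.01) == 1.141
assert pytest.approx(1.14, 0.01) != 1.30
```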
These tests can be run in a multitude of ways. My favorite is to use an IDE’s built-in functionality that lets you run tests independently, run only a handful of tests, or run all of them; in general, PyCharm is my favorite tool for the job. After running the suite, you should see all tests passing, and the output will look something like this (results will vary depending on the options being passed to `pytest`):
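For readers without an IDE, the same suite can be run from the command line. These invocations assume the directory layout sketched earlier; adjust the paths to match your project:

```shell
# Run everything under tests/ with verbose output
pytest tests/ -v

# Run a single test file
pytest tests/regression/test_single_linear_regression.py

# Run only tests whose names match an expression
pytest -k "fit or rmse"
```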
There we have it, all tests passed in 0.03 seconds! We will continue to move forward and write tests as we move along to ensure functionality. This will be especially important when we refactor in the next post!