Advent of 2022, Day 16 – MLflow in action with xgboost
This article is originally published at https://tomaztsql.wordpress.com
In the series of Azure Machine Learning posts:
- Dec 01: What is Azure Machine Learning?
- Dec 02: Creating Azure Machine Learning Workspace
- Dec 03: Understanding Azure Machine Learning Studio
- Dec 04: Getting data to Azure Machine Learning workspace
- Dec 05: Creating compute and cluster instances in Azure Machine Learning
- Dec 06: Environments in Azure Machine Learning
- Dec 07: Introduction to Azure CLI and Python SDK
- Dec 08: Python SDK namespaces for workspace, experiments and models
- Dec 09: Python SDK namespaces for environment, and pipelines
- Dec 10: Connecting to client using Python SDK namespaces
- Dec 11: Creating Pipelines with Python SDK
- Dec 12: Creating jobs
- Dec 13: Automated ML
- Dec 14: Registering the models
- Dec 15: Getting to know MLflow
Yesterday we looked at how to set up the MLflow configuration; today, let's put it to the test.
We will create a new notebook and use the Heart dataset (link to dataset) to toy around. We will also import the xgboost classifier to assess how accurately we can predict the presence of heart disease in a patient. In the original data the diagnosis is a categorical (integer) variable with values from 0 (no presence) to 4 (strong presence); we will attempt to classify it based on a subset of attributes (out of more than 70 in the original study). Note that the CSV file used below appears to ship a binary target column (0 or 1), which is what the default recall_score call in the code expects.
The ipynb notebook will be available on GitHub.
# importing mlflow functions
import mlflow

mlflow.set_experiment(experiment_name="heart-condition-classifier")

# getting the data
import pandas as pd

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(file_url)

# some data engineering: encode the "thal" column as integer category codes
df["thal"] = df["thal"].astype("category").cat.codes

# split train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.4
)

# logging the steps
mlflow.xgboost.autolog()

# training the model
from xgboost import XGBClassifier

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")

# start the mlflow run
run = mlflow.start_run()

# start fitting the model
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

# logging some extra metrics
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, recall_score, fbeta_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

# record the extra metrics on the active run (autolog does not compute these)
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("recall", recall)

# closing mlflow
mlflow.end_run()

# retrieve the finished run and create a client for further inspection
run = mlflow.get_run(run.info.run_id)
client = mlflow.tracking.MlflowClient()
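To make the extra metrics above less of a black box, here is a minimal pure-Python sketch of how accuracy and recall follow from the confusion matrix counts for a binary target. The helper names and the demo labels are made up for illustration; in the notebook itself the scikit-learn functions do this work.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            # actual is 1 but the model predicted 0
            fn += 1
    return tp, fp, tn, fn


def accuracy_and_recall(y_true, y_pred):
    """Accuracy = share of correct predictions; recall = share of positives found."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Made-up labels, for illustration only (not taken from the Heart dataset)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

acc_demo, rec_demo = accuracy_and_recall(y_true_demo, y_pred_demo)
print(f"accuracy={acc_demo}, recall={rec_demo}")
```

Recall is often the more interesting number for a medical classifier, because it tells you how many of the actual positive cases the model managed to catch.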
We can also set up the logging for a model together with its preprocessing step.
# Reload the dataset
df = pd.read_csv(file_url)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)

# using ordinal encoder
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# creating the transformation and using logloss with the xgboost classifier
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier

encoder = ColumnTransformer(
    [
        (
            "cat_encoding",
            OrdinalEncoder(
                categories="auto",
                handle_unknown="use_encoded_value",
                unknown_value=np.nan,
            ),
            ["thal"],
        )
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
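The encoder settings above (categories="auto", handle_unknown="use_encoded_value", unknown_value=np.nan) can be easier to follow with a small stand-in. The following is only a pure-Python sketch of the idea, not scikit-learn's implementation: categories seen at fit time are sorted and mapped to their position, and anything unseen later becomes NaN. The function names and the sample values are hypothetical.

```python
import math


def fit_categories(values):
    """Return the sorted distinct categories seen during fitting."""
    return sorted(set(values))


def ordinal_encode(values, categories, unknown_value=math.nan):
    """Map each value to its index in `categories`; unseen values get `unknown_value`."""
    index = {category: position for position, category in enumerate(categories)}
    return [float(index[v]) if v in index else unknown_value for v in values]


# Made-up values standing in for the "thal" column
train_thal = ["normal", "fixed", "reversible", "normal"]
categories = fit_categories(train_thal)

encoded_train = ordinal_encode(train_thal, categories)
# "unexpected" was not seen during fitting, so it is encoded as NaN
encoded_new = ordinal_encode(["reversible", "unexpected"], categories)

print(categories)
print(encoded_train)
print(encoded_new)
```

Encoding unknown values as NaN fits well with tree-based models such as xgboost, which treat NaN as a missing value rather than failing on it.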
With multiple runs, you can also compare the performance of the models on the metrics you choose. This is an example of logloss validation and comparison between two runs.
The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Azure-Machine-Learning
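As a reminder of what the logloss metric measures when two runs are compared, here is a minimal pure-Python sketch of binary log loss. The run names and predicted probabilities are invented for illustration and do not come from the experiment in this post.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for actual, prob in zip(y_true, y_prob):
        p = min(max(prob, eps), 1 - eps)
        total += -(actual * math.log(p) + (1 - actual) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical validation labels and predicted probabilities from two runs
y_valid = [1, 0, 1, 0]
run_a_probs = [0.9, 0.2, 0.7, 0.1]  # confident and mostly right
run_b_probs = [0.6, 0.4, 0.5, 0.5]  # hesitant predictions

losses = {
    "run_a": log_loss(y_valid, run_a_probs),
    "run_b": log_loss(y_valid, run_b_probs),
}

# Lower log loss is better, so pick the run with the smallest value
best_run = min(losses, key=losses.get)
print(losses, best_run)
```

Both hypothetical runs would classify every sample correctly at a 0.5 threshold except for the ties, yet the first one scores a clearly lower log loss, because the metric rewards well-calibrated confidence and not just the final label.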
Happy Advent of 2022!