
Advent of 2022, Day 16 – MLflow in action with xgboost

by tomaztsql · December 17, 2022

This article was originally published at https://tomaztsql.wordpress.com

In the series of Azure Machine Learning posts:

Dec 01: What is Azure Machine Learning?
Dec 02: Creating Azure Machine Learning Workspace
Dec 03: Understanding Azure Machine Learning Studio
Dec 04: Getting data to Azure Machine Learning workspace
Dec 05: Creating compute and cluster instances in Azure Machine Learning
Dec 06: Environments in Azure Machine Learning
Dec 07: Introduction to Azure CLI and Python SDK
Dec 08: Python SDK namespaces for workspace, experiments and models
Dec 09: Python SDK namespaces for environment, and pipelines
Dec 10: Connecting to client using Python SDK namespaces
Dec 11: Creating Pipelines with Python SDK
Dec 12: Creating jobs
Dec 13: Automated ML
Dec 14: Registering the models
Dec 15: Getting to know MLflow

Yesterday we looked at how to set up the MLflow configuration; today, let’s put it to the test.

We will create a new notebook and use the Heart dataset (link to dataset) to toy around. We will also import the xgboost classifier to predict the presence of heart disease in a patient and assess the accuracy of those predictions. We will be using a categorical (integer) variable with values from 0 (no presence) to 4 (strong presence) and attempt to classify based on 15+ attributes (out of more than 70 attributes).

The ipynb notebook will be available on GitHub.

#importing mlflow functions
import mlflow
mlflow.set_experiment(experiment_name="heart-condition-classifier")

#getting the data
import pandas as pd
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(file_url)

#some data engineering
df["thal"] = df["thal"].astype("category").cat.codes

#split train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.4
)

#logging the steps
mlflow.xgboost.autolog()

#training the model
from xgboost import XGBClassifier

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")

#  start the  mlflow run
run = mlflow.start_run()

#start fitting the model
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

#logging some extra metrics
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

#record the computed scores in the active run
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("recall", recall)

#closing the mlflow run, then fetching its details and a client for later inspection
mlflow.end_run()
run = mlflow.get_run(run.info.run_id)
client = mlflow.tracking.MlflowClient()

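We can also log a model together with its preprocessing step, here by encoding the thal column with an ordinal encoder instead of pandas category codes.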

# Reload the dataset
df = pd.read_csv(file_url)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)

#using ordinal encoder

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# creating the transformation and using logloss with the xgboost classifier
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier

encoder = ColumnTransformer(
    [
        (
            "cat_encoding",
            OrdinalEncoder(
                categories="auto",
                handle_unknown="use_encoded_value",
                unknown_value=np.nan,
            ),
            ["thal"],
        )
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")

With multiple runs, you can also compare the performance of models on the metrics you choose. This is an example of logloss validation and comparison between two runs.

The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Azure-Machine-Learning

Happy Advent of 2022!
