
Update Your Machine Learning Pipeline With vetiver and Quarto

by RStudio | Open source & professional software for data science teams on RStudio · September 13, 2022

This article was originally published at https://www.rstudio.com/blog/

Machine learning operations (MLOps) is a set of best practices for running machine learning models successfully in production environments. Data scientists and system administrators have a growing range of options for setting up their pipelines. However, while many tools exist for preparing data and training models, there is a lack of streamlined tooling for tasks like putting a model in production, maintaining the model, or monitoring performance.

Enter vetiver, an open-source framework for the entire model lifecycle. Vetiver provides R and Python programmers with a fluid, unified way of working with machine learning models.

Our Solutions Engineering team developed a Shiny app for Washington D.C.’s Capital Bikeshare program a few years ago. This app provides real-time predictions of the number of bikes available at stations across the city. The end-to-end machine learning pipeline feeding the app uses R to import and modify data, save it in a pin, develop a model, then move the model to a deployable location. Alex Gold delivered a presentation on this workflow in 2020.

Sam Edwardes updated the project to apply Quarto and the new vetiver framework. Previously, we used R Markdown and a combination of one-off functions and scripts for each MLOps task. Using the latest from RStudio:

- Quarto provides a refreshed look and language-agnostic tool for computational documents. Like R Markdown documents, the Quarto documents are available on RStudio Connect.
- The pipeline now uses vetiver to train, pin, monitor, and deploy the model.
  - This streamlines the code and makes the MLOps pipeline easier to maintain.
  - By using vetiver across the organization, we have a consistent way to perform MLOps tasks.
  - Deploying the model as an API endpoint using vetiver allows us to reuse the machine learning model for other apps or use cases.

We will walk through the updated pipeline below. To see the entire project, check out the Bike Predict page on solutions.rstudio.com.

Building A Predictive Web App With Shiny

The Shiny app predicts the number of bikes at a station in the near future based on real-time streaming data from an API. The steps involved are:

1. Write the latest station status data from the Capital Bikeshare API to a database
2. Join the station status data with the station information dataset and tidy the data
3. Train the model using this tidied dataset
4. Save and version the model to Connect as a pin using vetiver
5. Use the vetiver model card template to document essential facts and considerations of the deployed model
6. Use functions provided by vetiver to document and monitor model performance
7. Use the API endpoint to serve predictions to a Shiny app interactively
8. Make the Shiny app available to anybody interested in the predictions

The project shows an exciting set of capabilities, combining open source with RStudio’s professional products.

- RStudio Workbench is a centralized, server-based environment for working with code.
- RStudio Connect publishes and schedules data science assets like pins, APIs, and Quarto reports.
- RStudio Package Manager (RSPM) controls and distributes packages throughout an organization.

Creating An End-to-End Machine Learning Pipeline

1. Create a custom package for pulling data

Capital Bikeshare has an API that publishes real-time system data. We created a set of helper functions for pulling the data. To increase efficiency, we wanted to reuse and share these functions.

For that, we created the bikehelpR package to house, document, and test the functions we used. To distribute the package, we used RSPM, which makes an internal package available via install.packages() for everybody on our team.

2. Extract, transform, load process in R

The first step of the pipeline pulls the latest data from the Capital Bikeshare API using the bikehelpR package. We write the raw data to the Content Database’s bike_raw_data and bike_station_info tables.

The station info is also written to a pin. This pin will be accessed by the Shiny app so that it can extract the bike station info without connecting to the database. Read more about “production-izing” Shiny with pins.
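
A minimal sketch of this step, under stated assumptions: the station data frame and its column names are made up for illustration, an in-memory SQLite database stands in for the Content Database, and a temporary pins board stands in for RStudio Connect (in the real pipeline the board would be something like `board_rsconnect()`). It is not the project's actual code.

```r
library(DBI)
library(pins)

# Toy stand-in for the station info pulled from the Capital Bikeshare API
# (column names are assumptions, not the real schema)
station_info <- data.frame(
  station_id = c("1", "2"),
  name       = c("Station A", "Station B"),
  lat        = c(38.90, 38.91),
  lon        = c(-77.03, -77.04)
)

# Assumption: in-memory SQLite replaces the Content Database connection
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "bike_station_info", station_info, overwrite = TRUE)

# Assumption: a temporary board replaces the RStudio Connect board
board <- board_temp()
pin_write(board, station_info, name = "bike_station_info", type = "rds")

# The Shiny app can later read the pin without touching the database
stations <- pin_read(board, "bike_station_info")

dbDisconnect(con)
```

Reading from the pin keeps database credentials out of the Shiny app, which is the point of writing the station info twice.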

ETL Step 1 - Raw Data Refresh Quarto Document

3. Tidy and join datasets

We tidy the bike_raw_data table using tidyverse packages. Then, we join it with the bike_station_info table and write the output into the Content Database’s bike_model_data table.
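
The tidy-and-join step might look roughly like the following dplyr sketch. The toy tables, the column names, and the derived features (`hour`, `weekday`) are assumptions chosen for illustration; the real tables live in the Content Database.

```r
library(dplyr)

# Toy stand-ins for the two database tables (schemas are assumptions)
bike_raw_data <- tibble(
  station_id = c("1", "1", "2"),
  time = as.POSIXct(
    c("2022-09-13 08:00:00", "2022-09-13 09:00:00", "2022-09-13 08:00:00"),
    tz = "UTC"
  ),
  num_bikes_available = c(5L, 3L, 10L)
)

bike_station_info <- tibble(
  station_id = c("1", "2"),
  lat = c(38.90, 38.91),
  lon = c(-77.03, -77.04)
)

bike_model_data <- bike_raw_data %>%
  # Derive simple time features from the timestamp
  mutate(
    hour    = as.integer(format(time, "%H")),
    weekday = as.integer(format(time, "%u"))  # 1 = Monday
  ) %>%
  # Attach station coordinates to every status record
  inner_join(bike_station_info, by = "station_id") %>%
  select(station_id, hour, weekday, lat, lon, n_bikes = num_bikes_available)
```

The result would then be written back to the database, for example with `DBI::dbWriteTable()`.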

ETL Step 2 - Tidy Data Quarto Document

4. Train and deploy the model

We use the bike_model_data table to train and evaluate a random forest model. The model is saved to RStudio Connect as a pin (using vetiver) and then it is converted into an API endpoint (also using vetiver). By using vetiver to pin and deploy our model, we ensure a consistent approach across the organization for how we pin, version, and deploy machine learning models. Then, we deploy the API to RStudio Connect.
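
As a rough sketch of the train, pin, and deploy sequence: the synthetic data, the formula, and the pin name `bike_predict_model` below are assumptions, and a temporary board again stands in for RStudio Connect. The example fits a random forest through tidymodels (which needs the ranger package installed), wraps it with `vetiver_model()`, and versions it with `vetiver_pin_write()`.

```r
library(tidymodels)
library(vetiver)
library(pins)
library(plumber)

# Synthetic training data (an assumption; the real data is bike_model_data)
set.seed(123)
bike_model_data <- tibble(
  hour    = rep(0:23, times = 10),
  lat     = runif(240, 38.85, 38.95),
  lon     = runif(240, -77.10, -76.95),
  n_bikes = rpois(240, lambda = 5 + 3 * (hour >= 7 & hour <= 19))
)

# Fit a random forest regression model inside a workflow
rf_spec <- rand_forest(mode = "regression") %>% set_engine("ranger")
rf_fit  <- workflow(n_bikes ~ hour + lat + lon, rf_spec) %>%
  fit(data = bike_model_data)

# Wrap the fitted workflow as a deployable vetiver model and pin it
v <- vetiver_model(rf_fit, "bike_predict_model")
board <- board_temp(versioned = TRUE)  # stand-in for a Connect board
vetiver_pin_write(board, v)

# Build a plumber API around the model; pr_run(api) would serve it locally
api <- pr() %>% vetiver_api(v)

# With the model pinned to a Connect board, deployment would look like:
# vetiver_deploy_rsconnect(board = board_rsconnect(),
#                          name = "user.name/bike_predict_model")
```

Because the pin is versioned, retraining and calling `vetiver_pin_write()` again adds a new version rather than overwriting the old one.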

Model Step 1 - Train and Deploy Model

5. Create a model card

Next, we evaluate the training and evaluation data using various methods. Vetiver’s model card template helps document essential facts and considerations of the deployed model.
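
For reference, vetiver ships the model card as an R Markdown template, so a starting document can be generated with `rmarkdown::draft()`. The file name below is an assumption, and the sketch writes to a temporary directory.

```r
library(vetiver)

# Create a new model card from vetiver's bundled template
card_path <- rmarkdown::draft(
  file.path(tempdir(), "bike_model_card.Rmd"),  # hypothetical file name
  template = "vetiver_model_card",
  package  = "vetiver",
  edit     = FALSE  # do not open the file in an editor
)
```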

Model Step 2 - Model Card

6. Monitor model metrics

We can document model performance using vetiver and write the metrics to a pin on RStudio Connect. With these functions, we can monitor for model performance degradation. Using vetiver to monitor model performance again ensures a consistent approach to model governance across teams.
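
A hedged sketch of the monitoring functions: the scored data below is simulated, and the column names and the pin name `bike_model_metrics` are assumptions. `vetiver_compute_metrics()` aggregates metrics by time period, `vetiver_pin_metrics()` updates a metrics pin, and `vetiver_plot_metrics()` plots them over time.

```r
library(vetiver)
library(pins)
library(dplyr)

# Simulated observations with predictions attached (an assumption)
set.seed(1)
scored <- tibble(
  date    = rep(seq(as.Date("2022-09-01"), by = "day", length.out = 14), each = 5),
  n_bikes = rpois(70, 8),
  .pred   = n_bikes + rnorm(70)
)

# Weekly metrics; the default metric set gives rmse, rsq, and mae
new_metrics <- vetiver_compute_metrics(
  scored, date, "week", n_bikes, .pred
)

board <- board_temp()  # stand-in for a Connect board

# Create the metrics pin on the first run...
pin_write(board, new_metrics, "bike_model_metrics")

# ...and update it on later runs; overwrite = TRUE replaces overlapping dates
vetiver_pin_metrics(board, new_metrics, "bike_model_metrics", overwrite = TRUE)

# Plot each metric over time to look for degradation
vetiver_plot_metrics(new_metrics)
```

Scheduling a document like this on RStudio Connect would extend the metrics pin each time new labeled data arrives.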

Model Step 3 - Model Metrics

7. Deploy a Shiny app that displays real-time predictions

We use the API endpoint to serve predictions to a Shiny app interactively. Clicking on a station shows us a line graph of the time and predicted number of bikes.
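
On the app side, calling the deployed model could be sketched as below. The endpoint URL, the `predict_bikes()` helper, the `CONNECT_API_KEY` environment variable, and the input columns are all hypothetical; only `vetiver_endpoint()` and `predict()` come from vetiver.

```r
library(vetiver)

# Hypothetical URL of the deployed API's /predict route
endpoint <- vetiver_endpoint("https://connect.example.com/bike-predict/predict")

# Hypothetical helper: send new data to the API with a Connect API key
predict_bikes <- function(endpoint, new_data,
                          api_key = Sys.getenv("CONNECT_API_KEY")) {
  predict(
    endpoint,
    new_data,
    httr::add_headers(Authorization = paste("Key", api_key))
  )
}

# Input must match the columns the model was trained on (assumed here)
new_data <- data.frame(hour = 9L, lat = 38.90, lon = -77.03)

# Inside the Shiny server function, this would run when a station is clicked:
# preds <- predict_bikes(endpoint, new_data)
```

Because the app only talks to the endpoint, the model can be retrained and redeployed without changing the Shiny code.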

Link to Shiny App

8. Create project dashboard

This project is composed of many different tasks, and we wanted a single place to share the full context and content with others. We built a dashboard with connectwidgets that links to every piece of the project. This makes it easy for anybody new to the Bike Share app to understand its purpose and the steps involved.

Link to Dashboard


Learn More

We hope that you enjoyed this example of using vetiver, pins, and RStudio Connect to create an end-to-end machine learning pipeline. Folks in machine-learning-heavy contexts can use vetiver to streamline their work and easily “production-ize” content.

- Review the Bike Share pipeline code on GitHub.
- Check out this project and other RStudio product workflows on solutions.rstudio.com.

Join Julia Silge and Isabel Zimmerman to learn more about MLOps with vetiver in Python and R at the RStudio Enterprise Meetup on September 20th!

