Introduction to dplyr

by Quantargo Blog · June 16, 2020

This article was originally published at https://www.quantargo.com/blog

Learn what dplyr does
Get an overview of Select, Filter and Sort
Learn what Joins, Aggregations and Pipelines are

What is dplyr

There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data.
Anthony Goldbloom, Founder and CEO of Kaggle

Clean data is essential in any data science project, because the results can only be as good as the data they are based on. Cleaning data is also usually the most time-consuming part of a project and causes the biggest pains for data scientists. R already offers a broad set of tools and functions to manipulate data frames. However, due to its long history, the available base R toolset is fragmented and hard to use for new users.

The dplyr package facilitates the data transformation process through a consistent collection of functions. These functions support different transformations on data frames, including:

filter rows
select columns
sort data
aggregate data
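
As a rough illustration, the four transformations listed above map onto the dplyr functions filter(), select(), arrange() and summarise(). The orders data frame below is hypothetical and not taken from the original article; it exists only to keep the sketch self-contained.

```r
library(dplyr)

# Hypothetical example data (an assumption, not from the original article)
orders <- data.frame(
  customer = c("ann", "bob", "ann", "cid", "bob"),
  amount   = c(20, 35, 15, 50, 5),
  paid     = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# filter rows: keep only the orders that have been paid
paid_orders <- filter(orders, paid)

# select columns: keep only customer and amount
slim <- select(orders, customer, amount)

# sort data: largest amount first
sorted <- arrange(orders, desc(amount))

# aggregate data: total amount per customer
totals <- summarise(group_by(orders, customer), total = sum(amount))
```

Each call takes a data frame as its first argument and returns a new data frame, leaving orders itself unchanged.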

Multiple data frames can also be joined together by common attribute values.
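
A minimal sketch of such a join, assuming two hypothetical data frames that share a customer column (neither appears in the original article):

```r
library(dplyr)

# Hypothetical lookup table and fact table sharing the "customer" column
customers <- data.frame(
  customer = c("ann", "bob", "cid"),
  country  = c("AT", "DE", "CH")
)
orders <- data.frame(
  customer = c("ann", "bob", "ann", "dan"),
  amount   = c(20, 35, 15, 10)
)

# inner_join keeps only orders whose customer exists in both data frames
matched <- inner_join(orders, customers, by = "customer")

# left_join keeps every order and fills unmatched columns with NA
all_orders <- left_join(orders, customers, by = "customer")
```

Here the order placed by "dan" is dropped by inner_join() but retained by left_join(), with a missing country.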

The consistency of dplyr functions improves usability and enables users to connect transformations together to form data pipelines. These pipelines can also be seen as a high-level query language, much like SQL for database queries. Additionally, data pipelines can even be translated to other backends, including databases.
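
To give a feel for the translation to database backends, the sketch below uses the companion package dbplyr, which is assumed to be installed and is not mentioned in the original article. lazy_frame() and simulate_sqlite() are dbplyr helpers that simulate a remote table without connecting to a real database; the column names are hypothetical.

```r
library(dplyr)
library(dbplyr)

# A simulated remote table; no actual database connection is opened
remote_orders <- lazy_frame(
  customer = "ann",
  amount   = 20,
  con      = simulate_sqlite()
)

# The same verbs used on data frames build up a query instead of computing a result
query <- remote_orders %>%
  filter(amount > 10) %>%
  select(customer, amount)

# Print the SQL that dplyr would send to the database
show_query(query)
```

The exact SQL text depends on the dbplyr version and the chosen backend, so it is not reproduced here.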

Quiz: dplyr Facts

Which of the below statements are correct?

dplyr provides a consistent set of functions for data visualization
dplyr functions can be connected to form data pipelines
dplyr queries can be translated to database queries
dplyr supports data transformations like aggregations and joins
dplyr is built for vector transformations

Function Framework

Every data transformation function in dplyr accepts a data frame as its first input parameter and returns the transformed data frame back as an output. A blueprint for a typical dplyr function looks like this:

transformed <- dplyr_function(my_data_frame, 
                              param_one, 
                              param_two, 
                              ...)

The dplyr_function can be customized further through additional arguments (param_one, param_two) placed after the first data frame parameter (my_data_frame).
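
As a concrete instance of this blueprint, the sketch below calls filter() on the mtcars data set that ships with R. The choice of data set and conditions is an assumption made for illustration only.

```r
library(dplyr)

# First argument: the data frame; remaining arguments: the filter conditions
efficient <- filter(mtcars, cyl == 4, mpg > 30)

# The result is again a data frame, containing only the matching rows
efficient
```

Here mtcars plays the role of my_data_frame, while cyl == 4 and mpg > 30 correspond to param_one and param_two.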

The real power of dplyr comes with the pipe operator %>%, which allows users to chain dplyr functions into data pipelines. The pipe injects the data frame resulting from the previous calculation as the first argument of the next one. A data transformation consisting of three functions looks like

dplyr_function_three(
  dplyr_function_two(
    dplyr_function_one(my_data_frame)))

but can be written with the pipe as

my_data_frame %>%
  dplyr_function_one() %>%
  dplyr_function_two() %>%
  dplyr_function_three()

Because the functions in a pipeline appear in the order in which the transformations are actually applied, pipelines are easier to read than nested function calls, which have to be read from the inside out.
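
The equivalence of the two styles can be checked with real dplyr functions. The following sketch again uses the built-in mtcars data set as an assumed example and compares a nested call with its piped counterpart.

```r
library(dplyr)

# Nested form: read from the innermost call outwards
nested <- arrange(select(filter(mtcars, cyl == 4), mpg, hp), desc(mpg))

# Piped form: read top to bottom in the order the steps are applied
piped <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

# Both forms should produce the same data frame
identical(nested, piped)
```

Both versions filter, select and sort in the same order; only the way the code reads differs.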

Quiz: Valid Functions

Here, dplyr_function is the transformation function, param_one is a parameter for that function, and input_data_frame is the data frame to be transformed. Which of the code lines below are valid according to the dplyr function framework?

dplyr_function(param_one, input_data_frame)
dplyr_function(input_data_frame, param_one)
input_data_frame(dplyr_function, param_one)
param_one(dplyr_function, input_data_frame)

Introduction to dplyr is an excerpt from the course Introduction to R, which is available for free at quantargo.com
