The 3 Doors of Data Transformation
This article was originally published at https://www.quantargo.com/blog
This course covers the three most popular package ecosystems for data transformation in R: base R, tidyverse and data.table. You will see which options are better suited for specific use cases in terms of stability, features, speed and consistency.
- Get familiar with the main approaches for data handling in R
- Understand the advantages and disadvantages of each option
Data can come in many shapes and formats from various sources. The first step before any statistical analysis is to transform the data into the most suitable format. Depending on the use case, this step might require different packages.
In R, there are three different package ecosystems for transforming data: base R, tidyverse and data.table. Functions can often be combined across these ecosystems, but subtle differences mean this does not always work.
The most important difference is that each ecosystem defines its own data frame object: data frames, tibbles and data tables. Although tibbles and data tables inherit behavior from their common ancestor, the data frame, some small differences make them hard to reuse across ecosystems. Choose your door wisely.
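The relationship between the three objects can be illustrated with a short sketch. It assumes the tibble and data.table packages are installed; the toy data and column names are made up for illustration.

```r
library(tibble)
library(data.table)

# Hypothetical toy data used only for illustration
df  <- data.frame(id = 1:3, value = c(10, 20, 30))
tbl <- as_tibble(df)
dt  <- as.data.table(df)

# Tibbles and data tables both inherit from data.frame
inherits(tbl, "data.frame")
inherits(dt, "data.frame")

# But selecting a single column behaves differently
class(df[, 1])   # a plain vector: the data frame structure is dropped
class(tbl[, 1])  # still a tibble
class(dt[, 1])   # still a data.table
```

Code that expects a vector from `df[, 1]` can therefore break when it is handed a tibble or a data table, which is one example of the subtle differences mentioned above.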
The base R Package Ecosystem
The base R package is already integrated into the basic R installation. Thus, it can be used easily even within very restrictive IT landscapes. It is also an appropriate choice for environments where frequent package installations and updates might be infeasible.
The base R package has stood the test of time and is considered very stable, with very few changes even across major version updates. Chances are high that dated R code will still work after years, even on different machines or operating systems.
However, base R does not offer the fastest performance for large data sets compared to other packages and tools. In addition, due to its long history, some base R functions lack consistency, which makes common workflows harder to integrate. The feature set of base R for data manipulation tasks like joins or reshaping/pivoting also lags behind other packages.
Since base R is installed on every machine running R, it is important for every data scientist to know its features. Its power might surprise you, and you never know which machine you end up working with.
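As a rough illustration of what base R covers without any extra packages, the following sketch filters, aggregates and joins two small data frames. The data and column names are hypothetical.

```r
# Hypothetical toy data used only for illustration
orders <- data.frame(
  customer = c("a", "b", "a", "c"),
  amount   = c(10, 25, 5, 40),
  stringsAsFactors = FALSE
)
customers <- data.frame(
  customer = c("a", "b", "c"),
  country  = c("AT", "DE", "AT"),
  stringsAsFactors = FALSE
)

# Filter rows with logical indexing
large <- orders[orders$amount > 8, ]

# Aggregate: total amount per customer
totals <- aggregate(amount ~ customer, data = orders, FUN = sum)

# Join: merge() matches rows on the shared column "customer"
joined <- merge(totals, customers, by = "customer")
joined
```

Note the three different calling styles (`[`, a formula interface, and `merge()`), which reflects the consistency issue described above.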
The tidyverse Package Ecosystem
The tidyverse package ecosystem provides many packages for data manipulation, most importantly dplyr and tidyr. These packages are well maintained and already widely adopted in the R community. Their clear and consistent syntax makes learning a breeze. Moreover, all common functions (or verbs) can be combined using the pipe operator.
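A minimal sketch of the verb-and-pipe style, assuming the dplyr package is installed and using its `%>%` pipe. The data and column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data used only for illustration
orders <- data.frame(
  customer = c("a", "b", "a", "c"),
  amount   = c(10, 25, 5, 40),
  stringsAsFactors = FALSE
)

# Each verb takes a data frame and returns a data frame,
# so the steps chain naturally from top to bottom
totals <- orders %>%
  filter(amount > 8) %>%
  group_by(customer) %>%
  summarise(total = sum(amount)) %>%
  arrange(desc(total))

totals
```

Each line reads as one step of the transformation, which is a large part of why the syntax is considered easy to learn.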
The feature set of tidyverse for data reshaping and joins is unparalleled in the R ecosystem. Through extension packages like dbplyr and sparklyr, you can even write queries for database or Spark cluster back ends. The respective queries get translated for the specific back end.
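The translation step can be sketched without a real database, assuming the dplyr and dbplyr packages are installed. Here `lazy_frame()` and `simulate_postgres()` stand in for an actual connection; the table contents and column names are hypothetical, and the exact SQL text may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated PostgreSQL connection;
# nothing is ever sent to a database in this sketch
remote_orders <- lazy_frame(
  customer = c("a", "b"),
  amount   = c(10, 25),
  con      = simulate_postgres()
)

# The same dplyr verbs as for a local data frame
query <- remote_orders %>%
  filter(amount > 8) %>%
  group_by(customer) %>%
  summarise(total = sum(amount, na.rm = TRUE))

# Print the SQL that dbplyr generates for this back end
show_query(query)
```

With a real connection, the same pipeline would run inside the database and only the result would be pulled into R.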
On the other hand, tidyverse has many package dependencies and it might be hard to install and maintain these dependencies in specific IT environments and production systems. The tidyverse packages are still subject to change but should become more stable in future versions.
The data.table Package Ecosystem
data.table is a highly optimized, in-memory transformation and query interface for tabular data. It is very well suited for operations like joins, value updates and filters on large tables (e.g. 10M+ rows). The main reason for the large speed gains is that data.table is very memory-efficient and avoids copying large tables wherever possible.
Data tables have some additional features compared to conventional data frames. One can apply data transformation functions directly inside the subset operator [ for example. However, these additional features might lead to constructs which are hard to understand for beginners or non-data.table users.
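A minimal sketch of transformations inside the subset operator, assuming the data.table package is installed. The data and column names are hypothetical.

```r
library(data.table)

# Hypothetical toy data used only for illustration
orders <- data.table(
  customer = c("a", "b", "a", "c"),
  amount   = c(10, 25, 5, 40)
)

# General form dt[i, j, by]:
# filter rows (i), compute an expression (j), per group (by)
totals <- orders[amount > 8, .(total = sum(amount)), by = customer]

# := adds or updates a column by reference, without copying the table
orders[, amount_with_tax := amount * 1.2]

totals
orders
```

The compact `[i, j, by]` form is powerful, but readers who only know data frames may not expect a computation or an in-place update inside square brackets.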
data.table remains one of the fastest in-memory tabular formats available. The data.table function fread() is among the fastest functions for reading large comma-separated files in R, and it also compares well with readers in other languages. The biggest reason for using data.table is simple: speed.
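A small round trip with fread() can be sketched as follows, assuming the data.table package is installed. The file is a temporary one with made-up contents; real speed differences only show up on much larger files.

```r
library(data.table)

# Write a small hypothetical CSV to a temporary file
path <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, value = c(1.5, 2.5, 3.5)), path)

# fread() detects the separator and column types automatically
# and returns a data.table
dt <- fread(path)
class(dt)

# Clean up the temporary file
unlink(path)
```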
Pros and Cons
Depending on the requirements of the use case, specific package ecosystems stand out against their peers:
- In terms of stability of the code (over years), the base R package should be considered.
- The feature set for data manipulation seems to be broadest in the tidyverse ecosystem.
- The data.table package is (still) the speed champion.
- Interoperability and consistency across different data transformation problems seem to be best handled by the tidyverse ecosystem.
Quiz: Which Package Ecosystem to Choose with Storage Backends?
Which R package ecosystem should be chosen if data transformation code needs to be clean, fast and extensible through many storage backends?
- base R
Quiz: Which Package Ecosystem to Choose for Stability?
Which R package ecosystem should be chosen if data transformation code needs to be very stable and not many features are required?
- base R
Quiz: Which Package Ecosystem to Choose for Large Data Sets?
Which R package ecosystem should be chosen if huge data sets need to be processed and maximum performance is therefore required?
- base R