Shiny and Arrow
Shiny apps are incredible tools for bringing data science to life. They can help communicate your analysis to non-technical stakeholders, or enable self-service data exploration for any audience. At Ketchbrook Analytics, we care a lot about building production-grade quality Shiny apps; in other words, we strive to ensure that the apps we develop for you will run successfully inside your organization with minimal maintenance.
Building a working Shiny app that runs on your own laptop can be a tricky process itself! However, there are some additional things you need to consider when taking the next step of deploying your app to production so that others can reap the benefits of your hard work. One of these considerations is where to store the data.
“Where should the data live? Should we use a database or flat file(s)? Is our data small enough to fit there?”
Storing Your Data
There are so many options to choose from when it comes to how you want to store the data behind your Shiny app. Sometimes a traditional database doesn’t make sense for your project. Databases can take a while to configure, and if your data isn’t relational then a data lake approach might be a better option. A data lake is just a fancy term for a collection of flat files that are organized in a thoughtful way.
When you think about storing data in flat files, formats like .csv or .txt probably come to mind. However, as your data becomes “big”, transitioning your data to more modern, column-oriented file types (e.g., .parquet) can drastically reduce the size of the file containing your data and increase the speed at which other applications can read that data.
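As a quick illustration (the data frame and file names here are made up for this example, and the exact savings will depend on your data), you can write the same data to both formats with the {arrow} package and compare the results on disk:

# Hypothetical example: write the same data frame to .csv and to .parquet,
# then compare the resulting file sizes
df <- data.frame(
  id = 1:1e6,
  value = rnorm(1e6),
  group = sample(letters, 1e6, replace = TRUE)
)

readr::write_csv(df, "example.csv")
arrow::write_parquet(df, "example.parquet")

# The .parquet file should come in noticeably smaller than the .csv
fs::file_size(c("example.csv", "example.parquet"))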
The Benefits of .parquet
First, let’s dig a little deeper into the advantages of .parquet over .csv. The main benefits are:
- smaller file sizes
- improved read speed
The compression and columnar storage format lead to file sizes that are significantly smaller than if that same data were stored in a typical delimited file. In our experience, and as backed up by this great blog post from Tomaž Kaštrun, .parquet typically comes in at a little less than half the size of a data-equivalent .csv, and this margin widens even further as the data volume increases. Tomaž's article also includes a fantastic chart illustrating the read, write, and file-size metrics he gathered while experimenting across many different file types and sizes.
Interestingly, .parquet files not only store your data, but they also store data about your data (i.e., metadata). Information such as minimum & maximum values are stored for each column, which helps make aggregation blazing fast.
Still not sold? Maybe you are wondering,
Is “.parquet” a sustainable file format for storing my data, or is it just a fad?
That’s a fair question! The last thing we want to do as data scientists is to create more technical debt for our organization. Rest assured, .parquet format is not going anywhere – many production workflows at major organizations are driven by .parquet files in a data lake.
Both the .parquet format and the greater Arrow project are maintained under the Apache Software Foundation, and Voltron Data, a company founded by key contributors to Apache Arrow, recently raised $110 million in funding to continue developing this technology. Needless to say, we won't be seeing the .parquet format go away any time soon.
Lastly, unlike .RDS files, .parquet is a cross-platform file storage format, which means you can work with .parquet files from just about any programming language including R. This is where the {arrow} package can help.
The Benefits of {arrow}
The {arrow} package provides major benefits:
- It has the ability to read & write .parquet files (among other file types)
- You can query the data in that file before bringing it into an R data frame, using {dplyr} verbs, which provides for dramatic speed improvements
The combination of {arrow} and {dplyr} also results in lazy evaluation of your data manipulation statements. This means that your {dplyr} functions build a "recipe" of transformation steps that is only evaluated once you are finally ready to bring the transformed data into memory (through the use of dplyr::collect()). Don't take our word for it, though; hear it straight from the Apache Arrow team:
“…[A]ll work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, you can select a subset of data from a much larger dataset by collecting the smaller slices from each file – you don’t have to load the whole dataset in memory to slice from it.”
The concept of lazy evaluation with {dplyr} is also paramount when performing data manipulations and summaries on data stored in relational databases. The fact that a data science team can leverage those same principles to analyze data stored in .parquet files, without having to learn a completely new approach, is another massive benefit!
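To make that lazy-evaluation "recipe" idea concrete, here is a minimal sketch; the folder name and column names below are hypothetical, and nothing is pulled into R's memory until dplyr::collect() is called:

# Open a folder of .parquet files as an Arrow Dataset; no data is read yet
ds <- arrow::open_dataset("sales_data/", format = "parquet")

# Build up the "recipe" of transformations; still nothing has been read
query <- ds |>
  dplyr::filter(region == "Northeast") |>
  dplyr::group_by(product) |>
  dplyr::summarise(total_sales = sum(sales))

# Only now does Arrow scan the files and return a tibble to R
result <- dplyr::collect(query)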
How It All Fits Together in Shiny: A Use Case at Ketchbrook Analytics
We have learned that the combination of {arrow} + {dplyr} + .parquet gives us all of the memory-saving benefits we would get from querying a database, but with the simplicity of flat files.
Ketchbrook was developing a Shiny app for a client, for which the relevant data was stored in a large, single .csv that was causing two problems:
- There wasn’t enough room for the file on their shinyapps.io server
- Even when run locally, applying filters and aggregations to the data from within the app was slow
After converting the large .csv file to .parquet format, the data shrank to one-sixth the size of the original .csv, leaving plenty of room on the server for the .parquet data.
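If you find yourself in a similar spot, one way to perform that kind of conversion without ever loading the full .csv into memory is to scan it lazily with {arrow} and write it back out as .parquet; the file and folder names below are hypothetical:

# Treat the large .csv as an Arrow Dataset, then rewrite it in .parquet format
arrow::open_dataset("big_file.csv", format = "csv") |>
  arrow::write_dataset(path = "parquet_data/", format = "parquet")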
Further, executing dplyr::filter() on the already-in-memory .csv data was taking quite a few seconds for the app to respond. The conversion of the data to .parquet format, coupled with executing the {dplyr} functions against an {arrow} table (instead of an R data frame), drastically reduced the processing time to less than one second.
To demonstrate this powerful combination of {shiny} + {arrow}, Ketchbrook Analytics developed an example Shiny app and accompanying GitHub repository.
Play around with the app, dive into the code, and try incorporating {arrow} into your next Shiny project!
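If you just want the general shape of such an app first, here is a minimal sketch (not the Ketchbrook app itself); the "data" folder and the Variable_H, Item_Code, and Variable_A columns mirror the mock dataset used later in this post, and the slider range is an arbitrary assumption:

library(shiny)

# Lazy pointer to the .parquet file(s); nothing is loaded into memory yet
ds <- arrow::open_dataset("data", format = "parquet")

ui <- fluidPage(
  sliderInput(
    inputId = "threshold",
    label = "Minimum value of Variable_H",
    min = 0, max = 100, value = 50
  ),
  tableOutput("summary")
)

server <- function(input, output, session) {
  output$summary <- renderTable({
    ds |>
      dplyr::filter(Variable_H > input$threshold) |>
      dplyr::group_by(Item_Code) |>
      dplyr::summarise(Variable_A_Total = sum(Variable_A)) |>
      dplyr::collect()  # only the summarised rows are pulled into memory
  })
}

shinyApp(ui, server)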
The Proof is in the Pudding (and the File Size)
For our example Shiny app, we created a mock dataset and stored it in both .txt and .parquet format. You can create this data yourself by running the two data-generation scripts in the accompanying GitHub repository.
For comparison, let’s view the size of the data that’s stored in tab-delimited .txt file format:
dir <- "data"  # assumed location of the generated data files; adjust to your own path

files <- fs::file_info(
  path = list.files(
    dir,
    full.names = TRUE
  )
) |>
  dplyr::select(path, size) |>
  dplyr::mutate(
    path = fs::path_file(path),
    file_type = stringr::str_extract(
      string = path,
      pattern = "[^.]+$" # extract the text after the period (i.e., the file extension)
    )
  )
files |>
  dplyr::filter(file_type == "txt") |>
  knitr::kable()
| path                           | size | file_type |
|--------------------------------|------|-----------|
| half_of_the_data.txt           | 170M | txt       |
| the_other_half_of_the_data.txt | 170M | txt       |
We can see that the .txt files total 339M in size.
Now let’s look at the data when stored as .parquet file format:
files |>
  dplyr::filter(file_type == "parquet") |>
  knitr::kable()
| path                    | size | file_type |
|-------------------------|------|-----------|
| all_of_the_data.parquet | 158M | parquet   |
Wow! The same dataset is less than half the size when stored as .parquet as compared to .txt.
The Need for Speed
We saw the storage savings in action – now let’s take a look at the speed improvements.
As a practical example, let's run a sequence of dplyr::filter(), dplyr::group_by(), and dplyr::summarise() statements against the .txt files:
tic <- Sys.time()
vroom::vroom(
  list.files(
    path = dir,
    full.names = TRUE,
    pattern = ".txt$"
  ),
  delim = "\t"
) |>
  dplyr::filter(Variable_H > 50) |>
  dplyr::group_by(Item_Code) |>
  dplyr::summarise(
    Variable_A_Total = sum(Variable_A)
  )
# A tibble: 1,000 × 2
Item_Code Variable_A_Total
<chr> <dbl>
1 A1G740 49453.
2 A1J731 49481.
3 A1N838 51610.
4 A1O339 52633.
5 A1R990 47588.
6 A2E381 50823.
7 A2J681 51575.
8 A2N118 49840.
9 A2U328 51106.
10 A2W136 48013.
# … with 990 more rows
toc <- Sys.time()
time_txt <- difftime(toc, tic)
time_txt
Time difference of 6.38838 secs
When run against the .txt files, the process takes 6.39 seconds.
Now let’s try the same {dplyr} query against the .parquet file:
tic <- Sys.time()
arrow::open_dataset(
  sources = list.files(
    path = dir,
    full.names = TRUE,
    pattern = ".parquet$"
  ),
  format = "parquet"
) |>
  dplyr::filter(Variable_H > 50) |>
  dplyr::group_by(Item_Code) |>
  dplyr::summarise(
    Variable_A_Total = sum(Variable_A)
  ) |>
  dplyr::collect()
# A tibble: 1,000 × 2
Item_Code Variable_A_Total
<chr> <dbl>
1 Z8B631 49545.
2 J8O195 52941.
3 I5Y383 46572.
4 O8N416 51525.
5 I2E912 49862.
6 D4M22 50317.
7 L1G322 46862.
8 C3C179 51791.
9 N4Q977 49013.
10 L6T273 48561.
# … with 990 more rows
toc <- Sys.time()
time_parquet <- difftime(toc, tic)
time_parquet
Time difference of 1.975953 secs
A difference of a few seconds might not sound like much, but the gap between a user waiting 6.39 seconds for your Shiny app to finish a calculation and waiting just 1.98 seconds is enormous from a user-experience standpoint.
But don’t just take our word for it. Make your next Shiny app an {arrow}-driven, high-performance experience for your own users!